Tram Ho

# Introduction

Hello everyone, today I'm going to ramble a little about Pose Classification. As you know, detecting body movement, or detecting keypoints on the human body, is an important problem in ML because it has many applications: detecting movements in supermarkets, simulating physical therapy exercises in healthcare, supporting PTs during gym exercises, and so on.

I wrote this article to summarize what I have learned. If there are shortcomings, please bear with me.

This article revolves around Google's MediaPipe library. If there is time, I will cover other libraries later.

# Pose Detection and Pose Tracking

Before going into Pose Classification, let's take a look at how Google detects keypoints on the human body.

Their solution is based on a paper they published; keyword: BlazePose ( https://export.arxiv.org/pdf/2006.10204 ). With this solution, they extract 33 points corresponding to 33 parts of the human body, or 25 points corresponding to the upper body, in 3-dimensional space (x, y, z) from an RGB video.

To put it briefly, BlazePose is basically an improvement on the Stacked Hourglass network ( https://export.arxiv.org/pdf/1603.06937 ).

## Stacked Hourglass Network

Structure of the Stacked Hourglass network:

The idea of this network is that, instead of one very large encoder-decoder, several hourglasses are stacked, and each Hourglass is responsible for returning a heat-map that predicts the body parts. Since the modules are stacked, each subsequent Hourglass can learn from the results of the previous one.

How do you detect human keypoints via a heat-map? Unlike data on human faces (72 landmark keypoints, …), data on human movement is far more diverse, so it is difficult to find the points on the body directly as coordinates. Researchers therefore use a heat-map to represent a region of the image. The heat-map retains the information about that region, and our job is to find the peak (the brightest point) in it. For example, with a 256×256 image the heat-map could have a size of 64×64. See the picture for an illustration.
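To make the "find the peak" step concrete, here is a minimal sketch in plain Python. The function names are my own, and mapping the cell center back to the image is just one reasonable choice, not necessarily what the paper does.

```python
def heatmap_peak(heatmap):
    """Return (row, col, score) of the brightest cell in a 2D heat-map."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


def to_image_coords(row, col, heatmap_size=64, image_size=256):
    """Map a heat-map cell back to pixel coordinates in the input image."""
    scale = image_size / heatmap_size  # 256 / 64 = 4
    # Use the cell center so the estimate is not biased to the top-left corner.
    return ((col + 0.5) * scale, (row + 0.5) * scale)  # (x, y)


# Toy 64x64 heat-map with a single bright spot at row 10, column 20.
heatmap = [[0.0] * 64 for _ in range(64)]
heatmap[10][20] = 0.9

row, col, score = heatmap_peak(heatmap)
x, y = to_image_coords(row, col)
print(row, col, score)  # 10 20 0.9
print(x, y)             # 82.0 42.0
```

A real network outputs one heat-map like this per keypoint, so the same lookup is repeated for every body part.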

As the authors of the paper mention, the loss is calculated on each prediction, which lets them supervise not only the final output but also the output of each Hourglass. For example, while the body is moving, some parts are hidden from the camera, and it is difficult to tell whether an arm is pointing left or right. By taking the position predictions of the previous Hourglass as input, the model can both refine those positions and predict new ones at the same time.

### Hourglass Module

Now let's take a look at the structure of a single Hourglass.

As you can see from the picture, this is an encoder-decoder architecture that first downsamples the features and then upsamples them to recover the information and convert it into a heat-map. Each encoder layer is connected to the corresponding decoder layer. Each layer is built from the residual block and bottleneck architecture of ResNet; if you are not familiar with residual blocks, please read this link ( https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec ).

Left: Residual Layer. Right: Bottleneck Block

The bottleneck design reduces the amount of computation, which in turn saves memory.
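A quick back-of-the-envelope count shows where the savings come from. The numbers below are assumptions for illustration: 256 feature channels, with the bottleneck squeezing to half of that for its 3×3 convolution, and biases ignored.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a 2D convolution (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


channels = 256  # assumed feature width, for illustration only

# Plain residual block: two 3x3 convolutions at full width.
plain = 2 * conv_params(channels, channels, 3)

# Bottleneck block: 1x1 reduce -> 3x3 at half width -> 1x1 restore.
mid = channels // 2
bottleneck = (
    conv_params(channels, mid, 1)
    + conv_params(mid, mid, 3)
    + conv_params(mid, channels, 1)
)

print(plain)       # 1179648
print(bottleneck)  # 212992
```

Under these assumptions the bottleneck needs roughly 5.5 times fewer weights, because the expensive 3×3 convolution only ever sees the reduced number of channels.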

Now let's zoom in on one box in the image above.

Each box in the picture is the bottleneck layer I mentioned. After each bottleneck there is a pooling layer that discards unnecessary features.

However, the first layer is a bit different: it uses a 7×7 convolution instead of 3×3.

As shown above, in the first layer the input goes through a 7×7 convolution, BatchNorm, and a ReLU activation, and then continues through a bottleneck layer. From there, the output of the bottleneck layer passes through 2 parallel branches: one goes to the MaxPooling layer for further feature extraction, the other connects to the corresponding layer in the decoder.

Boxes 2, 3, and 4 share the same structure, which differs from that of the first box.

The ultimate goal of feature extraction is to create feature maps that contain as much image-specific information as possible but minimal spatial information. This part is the 3 small boxes located between the encoder and the decoder.

After passing through the 4 encoder layers and the bottom layer, the input has become feature maps that are ready to go through the decoder.

As I mentioned earlier, the other branch goes through a bottleneck layer and is added element-wise to the output of the upsampling layer on the main branch. This is repeated about 4 times until the end of the decoder.

In the last layer, we can observe the prediction of the Hourglass module. This is known as intermediate supervision: the loss is calculated at the end of each Hourglass stage instead of only at the end of the entire model.
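As a rough sketch of what intermediate supervision means for the loss, the snippet below sums a mean squared error over the heat-map predicted by every stage. The tiny 2×2 heat-maps and the function names are made up for illustration; a real implementation would use tensors in a deep learning framework.

```python
def mse(pred, target):
    """Mean squared error between two heat-maps given as nested lists."""
    diffs = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(diffs) / len(diffs)


def intermediate_supervision_loss(stage_predictions, target):
    """Sum the per-stage losses so every hourglass gets its own training signal."""
    return sum(mse(pred, target) for pred in stage_predictions)


target = [[0.0, 1.0], [0.0, 0.0]]
stage_1 = [[0.0, 0.5], [0.0, 0.0]]  # rough first guess
stage_2 = [[0.0, 1.0], [0.0, 0.0]]  # refined by the second hourglass

print(mse(stage_1, target))                                       # 0.0625
print(intermediate_supervision_loss([stage_1, stage_2], target))  # 0.0625
```

Because every stage contributes its own term, an early Hourglass that predicts poorly is penalized directly rather than only through the final output.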

The output of an Hourglass module goes through a 1×1 convolution and is then split into 2 parallel branches. One is used for prediction, and the other is passed on as input to the next Hourglass module. Finally, we perform element-wise addition between the module's input and both outputs of the Hourglass module. P.S.: the predicted heat-map goes through a 1×1 convolution to get the correct shape before being added element-wise.

Finally, to build a Stacked Hourglass Network, we simply repeat these Hourglass modules one after another.

## BlazePose

OK, back to Google's BlazePose algorithm. You can learn about it by reading the article “Learn about BlazePose” by Pham Van Toan.

It's not that I'm lazy, guys; it's just that someone already wrote about it first.

# Pose classification

OK, so you already understand BlazePose. When the Pose Landmark Model (BlazePose GHUM 3D) is used to detect a pose in an image, it returns 33 points on the body, as shown below:

## Python Solution API

Thankfully, Google provides a Python Solution API: you only need to import the mediapipe library and write a few lines of code. MediaPipe supports pose detection on both still images and videos. Below is the code for still images.

Import the opencv library for image processing and the mediapipe library. Then set up 2 variables to access the mediapipe functions.

With still images, you have to set the parameters of the Pose class as follows.

The mediapipe library ships with a pre-trained model, so you don't need to train anything yourself.

If you want to print the detection output, remember that MediaPipe returns coordinates normalized to the image size, so you need to multiply x by the image width and y by the image height to plot the correct coordinates on the image.

If you want to draw points or lines on an image, use the `draw_landmarks()` function.
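The still-image steps described above can be sketched roughly as follows. This is not the original code of this article: it assumes the legacy `mp.solutions.pose` API, and the helper names `to_pixel` and `detect_pose`, as well as the confidence value of 0.5, are my own choices.

```python
def to_pixel(norm_x, norm_y, image_width, image_height):
    """Convert MediaPipe's normalized [0, 1] coordinates to integer pixels."""
    return int(norm_x * image_width), int(norm_y * image_height)


def detect_pose(image_path):
    """Run pose detection on one image; returns (annotated_image, pixel_points)."""
    # Imported here so the helper above can be used without these packages.
    import cv2
    import mediapipe as mp

    mp_pose = mp.solutions.pose
    mp_drawing = mp.solutions.drawing_utils

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]

    # static_image_mode=True runs detection on every image independently.
    with mp_pose.Pose(static_image_mode=True,
                      min_detection_confidence=0.5) as pose:
        # MediaPipe expects RGB, while OpenCV loads images as BGR.
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks is None:
        return image, []  # no person found

    points = [to_pixel(lm.x, lm.y, width, height)
              for lm in results.pose_landmarks.landmark]
    mp_drawing.draw_landmarks(image, results.pose_landmarks,
                              mp_pose.POSE_CONNECTIONS)
    return image, points


print(to_pixel(0.5, 0.25, 640, 480))  # (320, 120)
```

Calling `detect_pose("some_image.jpg")` (a hypothetical file name) would return the image with the skeleton drawn on it, plus the landmark positions in pixels.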

OK, for video, use cv2 (opencv) to capture the video and create a `while True` loop to process the frames one by one. Anyone who works a lot with opencv will find this familiar.

Basically, it is not much different from still image processing, except that the mediapipe Pose instance does not need `static_image_mode`. Instead, there is an additional `min_tracking_confidence` parameter used to track the person from frame to frame. Its value is between 0 and 1; the higher it is, the more robust the tracking, but at the cost of higher latency (more processing time).
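A minimal sketch of such a video loop is shown below, again assuming the legacy `mp.solutions.pose` API. The function name `track_video`, the window title, the `q` key to quit, and the 0.5 thresholds are my own illustrative choices.

```python
# Options for video: static_image_mode is left at its default (False) so the
# landmarks are tracked from frame to frame instead of re-detected each time.
VIDEO_POSE_OPTIONS = {
    "min_detection_confidence": 0.5,
    "min_tracking_confidence": 0.5,
}


def track_video(source=0):
    """Show pose landmarks on a webcam (0) or video file until 'q' is pressed."""
    # Imported here so the options above can be inspected without these packages.
    import cv2
    import mediapipe as mp

    mp_pose = mp.solutions.pose
    mp_drawing = mp.solutions.drawing_utils
    capture = cv2.VideoCapture(source)

    with mp_pose.Pose(**VIDEO_POSE_OPTIONS) as pose:
        while True:
            ok, frame = capture.read()
            if not ok:
                break  # end of file or camera error
            # MediaPipe expects RGB, while OpenCV delivers BGR frames.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                          mp_pose.POSE_CONNECTIONS)
            cv2.imshow("pose", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break

    capture.release()
    cv2.destroyAllWindows()
```

Raising `min_tracking_confidence` makes the model fall back to full detection more often, which is where the extra latency comes from.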

Based on BlazePose, mediapipe returns each point with x, y, z coordinates. As I said above, x and y are returned as ratios of the image width and height, and z is the landmark depth relative to the midpoint of the hips: the smaller the value, the closer the landmark is to the camera. z uses roughly the same scale as x. z is only predicted in full-body mode (33 points); upper-body mode (25 points) does not support it.

## Prepare dataset

OK, let's return to Pose Classification. MediaPipe simply uses KNN (k-nearest neighbors) to group similar poses. For an explanation of KNN, readers are invited to see ( https://machinelearningcoban.com/2017/01/08/knn/ ).

First you need to prepare training data. For example, push-ups have two states, up and down, as shown below, so I need to prepare images of both states.

Google provides the code to generate the data; all you have to do is understand it.

You need to create a folder with the following structure:

The photos here are images of the push-up "up" and "down" states.
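As an illustration of the idea "one subfolder per pose class", the sketch below builds a throwaway layout and counts the images per class. The folder and file names (`fitness_poses_images_in`, `pushups_up`, `pushups_down`, `image_000.jpg`) are assumptions for the example, and `list_pose_classes` is a helper of my own, not part of Google's code.

```python
import tempfile
from pathlib import Path


def list_pose_classes(images_root):
    """Return {class_name: number_of_images}, one class per subfolder."""
    counts = {}
    for class_dir in sorted(Path(images_root).iterdir()):
        if class_dir.is_dir():
            images = [p for p in class_dir.iterdir()
                      if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
            counts[class_dir.name] = len(images)
    return counts


# Build a throwaway example layout; folder and file names are hypothetical.
root = Path(tempfile.mkdtemp()) / "fitness_poses_images_in"
for class_name, n_images in [("pushups_up", 2), ("pushups_down", 3)]:
    (root / class_name).mkdir(parents=True)
    for i in range(n_images):
        (root / class_name / f"image_{i:03d}.jpg").touch()

print(list_pose_classes(root))  # {'pushups_down': 3, 'pushups_up': 2}
```

A quick count like this is a cheap sanity check that both states have enough images before running the landmark extraction.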

Create a `bootstrap_helper` instance from the `BootstrapHelper()` class. This class is responsible for detecting the 33 points on the body, drawing the points and the lines connecting them on the image, and writing the result to a csv file.

Check the status of these folders.

Remove the samples where detection failed by comparing the images with the csv files.

## Pose Embedding

Embed the 33 body points by calculating the distances between pairs of points, such as left shoulder to left hip, right shoulder to right hip, left knee to left heel, right knee to right heel, …
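Here is a simplified sketch of that idea using only the four pairs listed above. The landmark names, the dictionary format, and the coordinates are made up for illustration; the real embedder uses more pairs, and this sketch skips any normalization of position or scale.

```python
import math

# Pairs of landmarks whose distance becomes one feature of the embedding.
LANDMARK_PAIRS = [
    ("left_shoulder", "left_hip"),
    ("right_shoulder", "right_hip"),
    ("left_knee", "left_heel"),
    ("right_knee", "right_heel"),
]


def pose_embedding(landmarks):
    """Turn {name: (x, y, z)} into a list of pairwise distances."""
    return [math.dist(landmarks[a], landmarks[b]) for a, b in LANDMARK_PAIRS]


# Made-up coordinates for illustration only.
landmarks = {
    "left_shoulder": (0.0, 0.0, 0.0), "left_hip": (0.0, 3.0, 4.0),
    "right_shoulder": (1.0, 0.0, 0.0), "right_hip": (1.0, 3.0, 4.0),
    "left_knee": (0.0, 5.0, 4.0), "left_heel": (0.0, 8.0, 0.0),
    "right_knee": (1.0, 5.0, 4.0), "right_heel": (1.0, 8.0, 0.0),
}

print(pose_embedding(landmarks))  # [5.0, 5.0, 5.0, 5.0]
```

Distances do not change when the whole body shifts inside the frame, which is what makes them a more convenient input for classification than raw coordinates.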

## Pose Classifier

Create an instance of the `PoseClassifier()` class, which classifies a pose by comparing the distances between body points in the image against those stored in the database.
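The core of a KNN classifier fits in a few lines. The sketch below is a generic version with plain Euclidean distance and majority voting, not the actual `PoseClassifier` implementation; the 2-D embeddings and labels are made up.

```python
import math
from collections import Counter


def knn_classify(query, samples, k=3):
    """Return the majority label among the k samples closest to `query`.

    `samples` is a list of (embedding, label) pairs.
    """
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up database of 2-D embeddings.
database = [
    ([1.0, 1.0], "pushups_up"),
    ([1.1, 0.9], "pushups_up"),
    ([0.9, 1.2], "pushups_up"),
    ([3.0, 3.1], "pushups_down"),
    ([3.2, 2.9], "pushups_down"),
    ([2.9, 3.0], "pushups_down"),
]

print(knn_classify([1.0, 1.1], database))  # pushups_up
print(knn_classify([3.1, 3.0], database))  # pushups_down
```

There is no training step: the "model" is simply the stored database of embeddings, which is why preparing clean sample images matters so much.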

## Pose Smoothing

When detecting the 33 points on the body, the points jitter from frame to frame even though no difference is visible to the naked eye, so I need to smooth the predictions across frames using the EMA (exponential moving average) algorithm. The theory can be found here: ( https://www.tohaitrieu.net/exponential-moving-average-ema/ ).
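A minimal EMA sketch looks like this. The smoothing factor `alpha` and the sample values are arbitrary; here the smoothed quantity is a per-frame confidence for one class, which is one possible choice of what to smooth.

```python
def ema_smooth(values, alpha=0.2):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # first frame has nothing to average with
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Per-frame confidence for "pushups_down"; frame 3 is a one-frame glitch.
raw = [1.0, 1.0, 0.0, 1.0, 1.0]
print([round(v, 3) for v in ema_smooth(raw, alpha=0.5)])
# [1.0, 1.0, 0.5, 0.75, 0.875]
```

A smaller `alpha` gives a smoother but slower-reacting signal, so the glitch is damped more strongly at the price of lagging behind real pose changes.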

## Testing

OK, let's test with a video that is not in the current dataset by combining all the code above. In the code below I removed the repetition counter and the visualizer to make it easier to read. You can add them back to get the most intuitive results.

# Epilogue

I'll stop here; I'm too lazy to write any more. Thank you for reading this far.

https://export.arxiv.org/pdf/2006.10204

https://export.arxiv.org/pdf/1603.06937

https://towardsdatascience.com/using-hourglass-networks-to-understand-human-poses-1e40e349fa15