CenterNet: Keypoint Triplets for Object Detection – A new direction in the Object Detection problem

Tram Ho

I. Introduction

You have probably heard a lot about the Object Detection problem in Computer Vision. Widely used models such as YOLO, Single Shot Detector (SSD) and Faster R-CNN all rely on the same anchor box technique to determine the location and size of objects in the input image. To use this technique, we have to generate a large number of bounding boxes from a predefined, hand-tuned set of anchor boxes together with offsets (an offset here is simply a value that adjusts the position and size of an anchor box), and then remove the redundant bounding boxes based on their IoU (Intersection over Union) overlap. So the question arises: why not design a model that generates fewer bounding boxes to reduce processing time while still improving accuracy?
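To make the IoU-based filtering step concrete, here is a minimal sketch of IoU and greedy non-maximum suppression. The box format (x1, y1, x2, y2), the function names and the 0.5 threshold are my own assumptions for illustration; they are not taken from any of the detectors above.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = [int(i) for i in np.argsort(scores)[::-1]]  # indices, best score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Assumed rule: a box survives only if its IoU with the kept box is small enough.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first heavily and is dropped
```

Every anchor-based detector has to run a filter of this kind over a large number of candidate boxes, which is the cost that keypoint-based methods try to reduce.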

CenterNet was created to answer this question. A point worth noting: there are two papers known as CenterNet that address the Object Detection problem, both released in 2019. They are Objects as Points and CenterNet: Keypoint Triplets for Object Detection. Both methods rely on keypoints to generate the bounding boxes that identify objects. Within the scope of today's article, however, I only cover CenterNet: Keypoint Triplets for Object Detection because of the higher accuracy reported in the paper.

II. CenterNet: Keypoint Triplets for Object Detection.

Actually, the idea of identifying objects by keypoints is not new. There have been a number of earlier studies, most notably CornerNet, that solve the object detection problem this way. CornerNet defines each object by two keypoints: its top-left and bottom-right corners. However, identifying an object from two corners alone is quite sensitive: corners are easily confused with edge features that happen to be present in the input image. Furthermore, deciding whether two keypoints are corners of the same object is problematic. Because the model has a weak ability to learn global information, and because corner keypoints often lie outside the object itself, it is difficult to group two corners that belong to the same object. To solve this problem, CenterNet learns more global information, and one way it does so is by adding a keypoint for the center of the object. Thus, CenterNet defines an object by three keypoints: top-left, bottom-right and center. This is also the reason for the name Keypoint Triplets (3 keypoints).

1. Overview of CenterNet architecture


Figure 1: CenterNet architecture

CenterNet uses the CornerNet architecture as its baseline. The image goes through a CNN backbone that extracts features, then two special pooling layers generate the heatmaps for the corner and center keypoints. These two layers are Cascade Corner Pooling and Center Pooling. They help the model improve both accuracy and the FD (false discovery) rate compared to CornerNet by overcoming the CornerNet shortcomings I mentioned above.

The paper relies on two concepts: FD (false discovery) and Heatmap (heat map).

  • FD (false discovery). Average precision (AP) evaluates the accuracy of an object detector such as SSD or Faster R-CNN on a given dataset such as MS-COCO; here it is measured at IoU = [0.05 : 0.05 : 0.5]. In contrast, the FD rate measures the proportion of incorrect bounding boxes, that is, boxes whose IoU with the ground truth is below a given threshold. Example: CornerNet reaches an FD rate of 32.7% at IoU = 0.05. This means that out of every 100 predicted bounding boxes, on average 32.7 have an IoU below 0.05 with the ground truth.

Figure 2: Illustration of the heat map (Heatmap)

  • Heatmap (heat map). Each point on the heatmap holds a score: the probability that a keypoint (for example the center of an object) is located there. Example: take an input image of size (W, H, 3). After a backbone network with stride R, we get a heatmap of size (W / R, H / R, C), where C is the number of object classes. In the ground-truth heatmap, a point has the value 1 if it is a keypoint and 0 if it is background.
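As a small illustration of the heatmap shape, the sketch below builds a binary ground-truth heatmap. The function name, the (x, y, class_id) keypoint format and the example sizes are assumptions of mine. Real implementations usually soften the targets around each keypoint instead of using a single hard 1; that is left out here.

```python
import numpy as np

def make_heatmap(width, height, stride, num_classes, keypoints):
    """Build a binary ground-truth heatmap of shape (W / R, H / R, C).

    keypoints: list of (x, y, class_id) given in input-image pixels.
    """
    heatmap = np.zeros((width // stride, height // stride, num_classes),
                       dtype=np.float32)
    for x, y, class_id in keypoints:
        # Map image coordinates to heatmap coordinates by dividing by the stride.
        heatmap[int(x) // stride, int(y) // stride, class_id] = 1.0
    return heatmap

heatmap = make_heatmap(width=512, height=384, stride=4, num_classes=80,
                       keypoints=[(100, 200, 0), (300, 50, 17)])
print(heatmap.shape)  # (128, 96, 80)
```

Every other entry of the array stays 0, which is the background value described above.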

With these basics in place, let's explore the architecture of CenterNet in more depth. Let's go 🕺

1.1. Center Pooling

Figure 3: Center Pooling Layer (photo a)

The geometric center of an object does not necessarily carry the features that identify that object. Example: the center of a person usually lies on the body, while the head, which is the most important part for recognizing the person, is elsewhere. To solve this problem, the authors propose the Center Pooling layer, which helps the model learn information about the entire object. This pooling layer takes as input a feature map extracted by the CNN backbone. For each pixel that may be a center keypoint, it finds the maximum values along the horizontal and the vertical direction through that pixel and adds them together. Thanks to these maximum values, the response at the center also reflects the characteristics of the whole object.
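A minimal sketch of this idea on a single-channel feature map is shown below. It is my own simplification: the function name is hypothetical, the maxima are taken over the full row and column, and the convolution layers that surround the pooling in the real network are omitted.

```python
import numpy as np

def center_pooling(feature_map):
    """feature_map: 2D array of shape (H, W).

    For every pixel, return (max of its row) + (max of its column).
    """
    row_max = feature_map.max(axis=1, keepdims=True)  # shape (H, 1)
    col_max = feature_map.max(axis=0, keepdims=True)  # shape (1, W)
    return row_max + col_max                          # broadcasts to (H, W)

feature_map = np.array([[1., 0., 2.],
                        [0., 5., 0.],
                        [3., 0., 1.]])
pooled = center_pooling(feature_map)
# pooled[1, 1] is 10.0: the row maximum 5 plus the column maximum 5
```

Note how the corner pixel at (0, 0), whose own value is only 1, still receives the response 2 + 3 = 5 from stronger features elsewhere in its row and column.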

The detected center keypoints are then used in the following steps:

  • Generate the top k bounding boxes using the algorithm from the CornerNet paper
  • Select the k center keypoints with the highest probability scores
  • Use the offsets corresponding to those keypoints to map the center keypoints back onto the input image
  • Define a central region inside each bounding box
  • Check whether a center keypoint lies in this central region. If so, keep the bounding box; otherwise remove it. The score of a kept bounding box is the average of the scores of the three keypoints that define it.

Note: the size of the central region affects the detection results. A central region that is too small lowers recall for small bounding boxes, because the center keypoint easily falls outside it. A central region that is too large lowers precision for large bounding boxes, because keypoints that are not true centers also fall inside it and wrongly keep the box.
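The filtering steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the tuple layouts and function names are mine, the central-region formula with its parameter n is written as I recall it from the paper, and requiring the center keypoint to have the same class as the box is also an assumption.

```python
def central_region(box, n):
    """Central region of a box (tlx, tly, brx, bry); n = 3 gives the middle third."""
    tlx, tly, brx, bry = box
    ctlx = ((n + 1) * tlx + (n - 1) * brx) / (2 * n)
    ctly = ((n + 1) * tly + (n - 1) * bry) / (2 * n)
    cbrx = ((n - 1) * tlx + (n + 1) * brx) / (2 * n)
    cbry = ((n - 1) * tly + (n + 1) * bry) / (2 * n)
    return ctlx, ctly, cbrx, cbry

def filter_boxes(boxes, centers, n=3):
    """boxes:   list of (tlx, tly, brx, bry, class_id, tl_score, br_score)
    centers: list of (x, y, class_id, score)
    Keeps a box only if a same-class center keypoint lies in its central region.
    """
    kept = []
    for tlx, tly, brx, bry, cls, s_tl, s_br in boxes:
        x1, y1, x2, y2 = central_region((tlx, tly, brx, bry), n)
        matches = [s for (cx, cy, ccls, s) in centers
                   if ccls == cls and x1 <= cx <= x2 and y1 <= cy <= y2]
        if matches:
            # Box score = average of the two corner scores and the best center score.
            kept.append((tlx, tly, brx, bry, cls, (s_tl + s_br + max(matches)) / 3.0))
    return kept

boxes = [(0, 0, 90, 90, 0, 0.9, 0.6), (100, 100, 130, 130, 0, 0.8, 0.8)]
centers = [(45, 45, 0, 0.6)]
kept = filter_boxes(boxes, centers)  # only the first box has a center inside it
```

A larger n shrinks the central region, which is one way to trade recall against precision as described in the note above; picking n depending on the box size is a natural extension of this sketch.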

After this step we obtain the center keypoints along with the bounding boxes that survived the check. However, these bounding boxes are only as accurate as their corner keypoints, so the corner detection itself is improved with Cascade Corner Pooling.

1.2. Cascade Corner Pooling

Figure 4: Cascade Corner Pooling (photo c)

Cascade Corner Pooling was created to overcome the weak ability of CornerNet's Corner Pooling to learn global information. This pooling layer finds a corner keypoint by first looking along a boundary to find the maximum value on that boundary. Then, from the location of that boundary maximum, it looks inside the object to find an internal maximum value, and finally adds the two maximum values together. For example, for the topmost boundary it looks vertically toward the bottom, for the leftmost boundary it looks horizontally toward the right, and so on. This way, the corner keypoint carries both the boundary information and the information from inside the object.
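To illustrate the two-stage lookup, here is a literal, loop-based reading of that description for a top-left corner on a single-channel feature map. It is only a sketch under my own assumptions (hypothetical function name, no convolution layers, explicit argmax instead of learned pooling modules) and should not be taken as the module used in the paper.

```python
import numpy as np

def cascade_top_left_pooling(feat):
    """Literal reading of the cascade lookup for a top-left corner candidate."""
    h, w = feat.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            # Top boundary: scan row i to the right of (i, j) for the boundary maximum.
            k = j + int(np.argmax(feat[i, j:]))
            # From that location, look vertically toward the bottom (inside the object).
            top = feat[i, k] + feat[i:, k].max()
            # Left boundary: scan column j below (i, j) for the boundary maximum.
            r = i + int(np.argmax(feat[i:, j]))
            # From that location, look horizontally toward the right (inside the object).
            left = feat[r, j] + feat[r, j:].max()
            out[i, j] = top + left
    return out

feat = np.array([[0., 1., 0.],
                 [2., 0., 0.],
                 [0., 0., 3.]])
pooled = cascade_top_left_pooling(feat)
```

At position (0, 0) the feature value itself is 0, yet the pooled response is high because strong features exist along both boundaries and inside the region the corner would enclose.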

2. Loss function

2.1. Focal Loss

Focal loss is an improvement of the cross entropy loss that limits the effect of the imbalance between positives (bounding boxes containing an object) and negatives (bounding boxes containing background). Normally the number of negatives is much larger than the number of positives.

We can see the problem in the formula of the cross entropy loss function:

$$CE = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$


  • M: the number of classes
  • p: the predicted probability that object o belongs to class c
  • y: 1 if class c is truly the class of object o, 0 otherwise

The cross entropy loss treats positive and negative samples the same way. Because negatives are much more numerous than positives, they dominate the loss function and the model mostly learns the negatives. To overcome this imbalance, a new loss function was proposed, the Balanced Cross Entropy:

$$CE_{balanced} = -\sum_{c=1}^{M} \alpha_{c} \, y_{o,c} \log(p_{o,c})$$

Here $\alpha_{c} = \frac{1}{f_{c} + a}$, where $f_{c}$ is the frequency of class c. We add a very small positive constant a to avoid division by zero when a class has no samples.

With this function, classes that appear less often have a greater impact on the loss than with the traditional cross entropy. However, this approach does not fully solve the problem, and this is why the Focal loss was introduced. For convenience, let $p_{t}$ denote the probability that the model assigns to the true class of object o, that is $p_{t} = \sum_{c=1}^{M} y_{o,c} \, p_{o,c}$, so that the cross entropy becomes $CE = -\log(p_{t})$.

The Focal Loss formula: $FL(p_{t}) = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t})$

For the majority class, the predicted probability $p_{t}$ is usually high, because gradient descent learns these numerous samples quickly. Thanks to the additional factor $(1 - p_{t})^{\gamma}$, which is close to 0 when $p_{t}$ is close to 1, such samples no longer have much impact on the loss function.
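The effect of the modulating factor is easy to check numerically. In the sketch below, alpha_t = 0.25 and gamma = 2 are commonly used defaults that I assume for illustration, and the two example probabilities are made up.

```python
import numpy as np

def cross_entropy(p_t):
    """Cross entropy written in terms of p_t, the probability of the true class."""
    return -np.log(p_t)

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p_easy, p_hard = 0.95, 0.10  # a well-classified sample and a badly classified one

# How much more a hard sample weighs than an easy one under each loss.
ratio_ce = cross_entropy(p_hard) / cross_entropy(p_easy)
ratio_fl = focal_loss(p_hard) / focal_loss(p_easy)
print(ratio_ce, ratio_fl)
```

With these numbers the hard sample outweighs the easy one far more under focal loss than under cross entropy, so a large number of easy negatives can no longer dominate the total loss.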

2.2. CenterNet Loss

In this paper, the authors define the following loss function to train the model:

$$L = L_{det}^{co} + L_{det}^{ce} + \alpha L_{pull}^{co} + \beta L_{push}^{co} + \gamma \left( L_{off}^{co} + L_{off}^{ce} \right)$$


  • $L_{det}^{co}$ and $L_{det}^{ce}$ are the focal losses used to train the detection of the corner and center keypoints.
  • $L_{push}^{co}$ is the "push" loss, used to maximize the distance between the embedding vectors of corners that belong to different objects.
  • $L_{pull}^{co}$ is the "pull" loss, used to minimize the distance between the embedding vectors of corners that belong to the same object.
  • $L_{off}^{co}$ and $L_{off}^{ce}$ are the L1 losses used to train the offsets of the corner and center keypoints.
  • $\alpha$, $\beta$ and $\gamma$ are the weights of the corresponding loss terms.
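The sketch below shows how the pull and push terms and the weighted sum could be computed with NumPy. The pull and push formulas follow the CornerNet-style definitions as I remember them (one scalar embedding per corner, margin 1), and the weights 0.1, 0.1 and 1 are values I recall rather than values given in this article, so treat all of them as assumptions. The numeric inputs are made up.

```python
import numpy as np

def pull_loss(e_tl, e_br):
    """Pull the top-left and bottom-right embeddings of each object together."""
    e_mean = (e_tl + e_br) / 2.0
    return float(np.mean((e_tl - e_mean) ** 2 + (e_br - e_mean) ** 2))

def push_loss(e_tl, e_br, margin=1.0):
    """Push the mean embeddings of different objects at least `margin` apart."""
    e_mean = (e_tl + e_br) / 2.0
    n = len(e_mean)
    if n < 2:
        return 0.0
    dist = np.abs(e_mean[:, None] - e_mean[None, :])  # pairwise distances, (n, n)
    hinge = np.maximum(0.0, margin - dist)
    np.fill_diagonal(hinge, 0.0)                      # ignore each object vs. itself
    return float(hinge.sum() / (n * (n - 1)))

def total_loss(l_det_co, l_det_ce, l_pull, l_push, l_off_co, l_off_ce,
               alpha=0.1, beta=0.1, gamma=1.0):
    """Weighted sum matching the formula for L above (weights are assumed)."""
    return (l_det_co + l_det_ce + alpha * l_pull + beta * l_push
            + gamma * (l_off_co + l_off_ce))

# Two objects: embeddings of their top-left and bottom-right corners.
e_tl = np.array([0.0, 2.0])
e_br = np.array([0.2, 2.2])
l_pull = pull_loss(e_tl, e_br)
l_push = push_loss(e_tl, e_br)
loss = total_loss(1.0, 0.5, l_pull, l_push, 0.2, 0.1)
```

In this example the two corners of each object already have similar embeddings and the two objects are more than the margin apart, so both grouping terms are small and the detection and offset terms dominate.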

III. Conclusion

CenterNet is a relatively new idea alongside the anchor-based models that have been around for a long time. In this paper, the authors propose defining an object by three keypoints, two corners and one center, and report higher accuracy than other methods. Thank you for reading my post. If you have any questions, please comment below.



Source : Viblo