True to the impressive title of the paper, “YOLO9000: Better, Faster, Stronger”, YOLOv2 builds on YOLOv1 with a series of changes and improvements to produce an upgraded version that is better, faster, and stronger. These changes include reusing previous work and creating new methods. The improved YOLOv2 model achieves SOTA results on the PASCAL VOC and COCO datasets, outperforming other methods such as Faster R-CNN + ResNet and SSD while still being much faster:
- At 67 FPS, YOLOv2 has an accuracy of 76.8 mAP on the VOC 2007 test dataset.
- At 40 FPS, YOLOv2 has an accuracy of 78.6 mAP.
Next, the authors propose a method to train YOLOv2 jointly on detection and classification datasets. With this method, the model is trained simultaneously on the COCO (detection) and ImageNet (classification) datasets, resulting in YOLO9000, a version able to detect more than 9000 different object categories, all in real time.
2. Algorithm details
Again, YOLOv1 had some disadvantages compared to the leading detection systems at the time:
- YOLO has quite high Localization Errors – it has difficulty locating objects accurately.
- YOLO also has a rather low Recall compared to the region proposal methods.
Therefore, YOLOv2 mainly focuses on improving recall and localization while maintaining classification accuracy, thereby improving the accuracy of the model. These changes include reusing previous work, while also generating new ideas, and are listed below:
- Batch Normalization : Using batch normalization greatly improves convergence while removing the need for other forms of regularization. Adding batch normalization to all convolutional layers improves performance by 2% mAP. It also helps regularize the model, so dropout can be removed without overfitting.
- Using High Resolution Classifier : YOLOv1 trains the classifier network at a resolution of $224 \times 224$ and then increases the resolution to $448 \times 448$ for detection, so the network has to switch to learning detection and adapt to the new input resolution at the same time.
YOLOv2 addresses this drawback. First, we fine-tune the classification network at a resolution of $448 \times 448$ for 10 epochs on the ImageNet dataset. This gives the network time to adjust its filters to work better on high-resolution input images. After that, we fine-tune this network for detection. This high resolution classification network increases mAP by almost 4%.
- Using Anchor Box to predict Bounding Box : YOLOv1 directly predicts the coordinates of the bounding box using the Fully Connected Layers immediately following the Convolutional Feature Extractor.
YOLOv2 improves on this by reusing the Anchor Box idea from Faster R-CNN. This makes it easier for the network to predict bounding boxes. YOLOv2 discards the last 2 Fully Connected Layers of YOLOv1, because predicting bounding boxes and confidence scores from anchor boxes requires only Convolutional Layers. YOLOv2 also removes one Pooling Layer so that the output of the Convolutional Layers has a higher resolution.
Another advantage of using anchor boxes is that we eliminate the constraint that each cell can only predict one object (class) like in YOLOv1. Instead, we will predict the class and objectness for every anchor box. This will increase the number of detected objects, since each cell will predict more objects.
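In addition, YOLOv2 adjusts the network to predict on input images of size $416 \times 416$ instead of $448 \times 448$. With a downsampling factor of 32, this gives a $13 \times 13$ feature map, which has a single center cell.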
As in YOLOv1, the objectness prediction still estimates the IOU between the ground truth and the proposed box, and the class prediction still estimates the conditional probability of each class given that an object is present:
$$Pr(\text{Class}_i) = Pr(\text{Class}_i \mid \text{Object}) \times Pr(\text{Object})$$
Using anchor boxes in YOLOv2 will increase the number of predicted bounding boxes to more than 1000 boxes/image (much more than YOLOv1 with only 98 boxes per image). This reduces accuracy by a small amount. Specifically:
- YOLOv1 reached 69.5 mAP, recall = 81%.
- With the anchor box, YOLOv2 reached 69.2 mAP, recall = 88%.
Thus, although the mAP of YOLOv2 decreased compared to YOLOv1, recall increased significantly.
- Estimating Anchor Boxes : When using anchor boxes for YOLOv2, two problems arise. The first problem is that the initial sizes of the anchor boxes are picked by hand. The network can learn to adjust the boxes appropriately, but if the anchor boxes start out with well-chosen sizes, learning becomes easier and the network predicts better detections.
It has been found that in most data sets, bounding boxes are usually sized according to certain proportions and sizes. For example, the bounding box of a normal person will have an aspect ratio (width / height ratio) of 1:3, or the bounding box of a car viewed from the front often has an aspect ratio of 1:1.
So, instead of picking the initial anchor box sizes by hand, we run the K-Means Clustering algorithm on the bounding boxes of the training set to find the anchor box sizes automatically. The resulting anchor boxes represent the most common bounding box sizes in the training set.
The k-means algorithm for estimating the anchor boxes works as follows:
- Initially, we randomly initialize $k$ anchor boxes as the $k$ centroids (cluster centers).
- For each anchor box, we calculate the $IOU$ of each bounding box with that anchor box.
- Because we want the anchor boxes to have a good $IOU$ with the bounding boxes, we define the distance metric as follows:
$$d(\text{box}, \text{centroid}) = 1 - IOU(\text{box}, \text{centroid})$$
Explanation of the formula: we have $0 \leq IOU \leq 1$, so a bounding box that has a higher $IOU$ with a centroid is at a smaller distance from it.
- After calculating $d(\text{box}, \text{centroid})$ for each anchor box, we assign each bounding box to its nearest centroid and then update the $k$ centroids.
- Repeat the above steps until the algorithm converges.
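The clustering steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes each bounding box is described only by its (width, height), computes the IOU as if box and centroid shared the same corner, and uses $1 - IOU$ as the distance. The function names `iou_wh` and `kmeans_anchors` are hypothetical, and the sample boxes are made up.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between boxes (N, 2) and centroids (k, 2), using only (width, height).

    Boxes and centroids are treated as if they shared the same top-left
    corner, so only their shapes matter. Returns an array of shape (N, k).
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k, n_iter=100, seed=0):
    """Cluster (width, height) pairs with the distance d = 1 - IOU."""
    rng = np.random.default_rng(seed)
    # Initialize the k centroids with k randomly chosen boxes.
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = 1.0 - iou_wh(boxes, centroids)   # shape (N, k)
        assign = dist.argmin(axis=1)            # nearest centroid per box
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:                # keep old centroid if cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                               # converged
        centroids = new_centroids
    return centroids

# Made-up (width, height) pairs: three tall boxes and three square boxes.
boxes = np.array([[10, 30], [12, 33], [11, 31],
                  [50, 50], [52, 48], [49, 51]], dtype=float)
anchors = kmeans_anchors(boxes, k=2)
avg_iou = iou_wh(boxes, anchors).max(axis=1).mean()
```

Here `avg_iou` is the average IOU between each box and its closest anchor, which is the kind of quantity one would track when comparing different values of $k$.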
The figure above illustrates the K-Means Clustering algorithm for estimating the anchor boxes, with $k = 5$. Each cluster has a different color, matching the color of its center (which gives the size of an anchor box).
The figure above shows anchor box clustering on the VOC and COCO datasets. The plot on the left shows the average $IOU$ for different values of $k$. The value $k = 5$ was chosen because it gives a good tradeoff between high recall and model complexity. The plot on the right shows the cluster centers for the two datasets, VOC and COCO. We see that COCO has a larger size variation than VOC. After running K-Means, the cluster centers (which are the anchor box sizes) are significantly different from the manually selected anchor boxes.
The table above compares the average IOU with the closest anchor box on the VOC 2007 dataset, for anchor boxes obtained from the K-Means Clustering algorithm above and for anchor boxes selected manually. With only $k = 5$ anchor boxes, Cluster IOU (2nd row) gives results comparable to Anchor Boxes (3rd row) with 9 anchor boxes.
- Directly predicting the center coordinates of the bounding box : This addresses the second problem we encounter when using anchor boxes: the model is unstable, especially during the first iterations. That instability mainly comes from predicting the center coordinates $(x, y)$ of the bounding boxes. Recall that Faster R-CNN’s Region Proposal Network (RPN) predicts two values $t_x$ and $t_y$, from which the center $(x, y)$ is computed as an offset from the anchor box center $(x_a, y_a)$, scaled by the anchor box size $(w_a, h_a)$:
$$x = (t_x \times w_a) + x_a, \qquad y = (t_y \times h_a) + y_a$$
The problem is that this formula is completely unconstrained. For example, $t_x = 1$ shifts the bounding box to the right by a distance equal to the width of the anchor box, so a predicted box can end up anywhere in the image regardless of which location predicted it.
So, instead of predicting the center coordinates relative to the anchor box, YOLOv2 uses the same approach as YOLOv1: it predicts the center coordinates directly, relative to the position of each grid cell in the feature map. This constrains the predicted coordinates to the interval $[0, 1]$. To do that, we use the logistic (sigmoid) activation function.
The picture above illustrates this. We predict the size (width and height) of the bounding box (light blue box) relative to the size of the anchor box (dotted rectangle) obtained from the clustering algorithm above. The center coordinates of the bounding box are predicted relative to the position of the cell on the feature map, using the sigmoid activation function.
YOLOv2 predicts 5 bounding boxes per cell in the feature map. The network predicts 5 values for each bounding box: $t_x, t_y, t_w, t_h, t_o$. If the cell is offset from the top-left corner of the image by $(c_x, c_y)$ and the anchor box has width and height $p_w, p_h$, then the predictions correspond to:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
$$Pr(\text{Object}) \times IOU(b, \text{Object}) = \sigma(t_o)$$
The final formula predicts the confidence score of the bounding box $b$.
Since we constrain the position of the predicted bounding box, learning becomes simpler, which makes the network more stable. Using anchor box clustering together with directly predicting the bounding box coordinates relative to the feature map cell increases mAP by almost 5% compared to the YOLO version that predicts the bounding box position relative to the anchor box.
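To make the decoding step concrete, here is a small sketch that turns the five raw outputs of one box into a box in feature-map units. It assumes the parameterization from the YOLOv2 paper (sigmoid on the center offsets and on the confidence, exponential scaling of the anchor size). The function name `decode_box` and the sample values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, prior_wh):
    """Decode raw outputs (t_x, t_y, t_w, t_h, t_o) for one anchor box.

    cell_xy:  (c_x, c_y), offset of the cell from the top-left corner,
              measured in feature-map cells.
    prior_wh: (p_w, p_h), width and height of the anchor box.
    Returns (b_x, b_y, b_w, b_h, confidence) in feature-map units.
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell_xy
    p_w, p_h = prior_wh
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell: offset in (0, 1)
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)       # anchor size scaled by a positive factor
    b_h = p_h * np.exp(t_h)
    confidence = sigmoid(t_o)     # estimate of Pr(Object) * IOU(b, Object)
    return b_x, b_y, b_w, b_h, confidence

# All-zero outputs place the center in the middle of cell (3, 4)
# and leave the anchor size (2.0, 5.0) unchanged.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell_xy=(3, 4), prior_wh=(2.0, 5.0))
```

Because the sigmoid output lies strictly between 0 and 1, the predicted center can never leave the cell that predicted it, which is the constraint described above.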
- Use more detailed features : YOLOv2 predicts detections on a $13 \times 13$ feature map, which is enough to predict large objects. To overcome the drawback of YOLOv1 (the difficulty in predicting small objects), it also uses finer-grained features. Faster R-CNN and SSD address this by predicting from several feature maps of different sizes; YOLOv2 instead adds a passthrough layer that brings in features from an earlier layer at $26 \times 26$ resolution.
The passthrough layer concatenates the $13 \times 13$ feature map with the $26 \times 26$ feature map.
Normally, concatenating two feature maps is only possible when they have the same width and height. In the paper, the authors only describe this briefly: “The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the $26 \times 26 \times 512$ feature map into a $13 \times 13 \times 2048$ feature map, which can be concatenated with the original features.”
This feature map resizing technique is known as Reorg . It is essentially just a way of reorganizing memory to turn an $n \times n \times c_1$ feature map into one with a smaller width and height and correspondingly more channels.
The main idea is as follows. Suppose we want to reduce the width and height by a factor of $2$; then the number of channels must be increased by a factor of $4$. This transformation is not at all like the resize operation in image processing. For easy visualization, see the figure below:
The image above shows a feature map of size $4 \times 4$. To bring the feature map to size $2 \times 2$, its values are redistributed into $4$ channels.
Thus, the Reorg technique turns the $26 \times 26 \times 512$ feature map into a $13 \times 13 \times 2048$ feature map.
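The rearrangement can be sketched with a plain NumPy reshape. This is a generic space-to-depth operation written for illustration; it assumes a height × width × channels layout, and the exact channel ordering of Darknet's own reorg layer may differ. The function name `space_to_depth` is hypothetical.

```python
import numpy as np

def space_to_depth(x, stride=2):
    """Move each stride x stride spatial block into the channel dimension.

    x has shape (H, W, C); the result has shape
    (H / stride, W / stride, stride * stride * C).
    """
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    # Split each spatial axis into (block index, position inside block).
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    # Bring the two "position inside block" axes next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)

# A 4 x 4 single-channel map becomes a 2 x 2 map with 4 channels.
small = np.arange(16).reshape(4, 4, 1)
small_out = space_to_depth(small)

# The shapes used in YOLOv2: 26 x 26 x 512 -> 13 x 13 x 2048.
features = np.zeros((26, 26, 512), dtype=np.float32)
features_out = space_to_depth(features)
```

No value is averaged or discarded: the output holds exactly the same numbers as the input, which is why this differs from resizing an image.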
- Training on different sized images : Since YOLOv2 uses only Convolutional Layers and Pooling Layers, the input image size can be changed on the fly during training. The authors trained the network on many different image sizes so that it adapts well to them. This means that the same YOLOv2 network can make predictions at different resolutions.
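As a rough sketch of how such a schedule could look: the text above does not give the details, but the YOLOv2 paper describes drawing a new input size every 10 batches from the multiples of 32 between 320 and 608. The defaults below follow that description and should be read as assumptions here; the function name `size_schedule` is hypothetical.

```python
import random

def size_schedule(num_batches, every=10, min_size=320, max_size=608,
                  stride=32, seed=0):
    """Return the square input size used for each batch.

    A new size is drawn every `every` batches from the multiples of
    `stride` between `min_size` and `max_size` (inclusive).
    """
    rng = random.Random(seed)
    candidates = list(range(min_size, max_size + 1, stride))
    sizes = []
    current = candidates[0]
    for batch_idx in range(num_batches):
        if batch_idx % every == 0:
            current = rng.choice(candidates)   # switch resolution
        sizes.append(current)
    return sizes

schedule = size_schedule(30)
# With a downsampling factor of 32, each input size maps to a grid size.
grid_sizes = [s // 32 for s in schedule]
```

Restricting the sizes to multiples of 32 keeps the output grid an integer number of cells, for example 13 × 13 for a 416 × 416 input.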
The above table compares detection systems on the PASCAL VOC 2007 test dataset. We can see that YOLOv2 is both faster and more accurate than previous detection systems. YOLOv2 will run faster with small sized images. At the highest resolution, YOLOv2 achieves the greatest accuracy with 78.6 mAP while still achieving speeds greater than real time (40 FPS). Moreover, it can run at different resolutions without too much trade-off between speed and accuracy.
YOLOv2 gives more accurate prediction results, and its speed is also faster. To do that, its network architecture has changed significantly compared to YOLOv1.
Instead of using a custom network based on the GoogLeNet architecture, YOLOv2 uses a new classification model as the base network, named Darknet-19 . It includes 19 Convolutional Layers and 5 Maxpooling Layers. The following figure depicts the specific architecture of Darknet-19.
Darknet-19 needs only 5.58 billion operations to process an image, compared with 8.52 billion operations for YOLOv1’s architecture, while still achieving 72.9% top-1 accuracy and 91.2% top-5 accuracy on the ImageNet dataset. Thus, the computational cost of the backbone drops by about 34% compared to YOLOv1, which makes YOLOv2 significantly faster.
We summarize the ideas in the table below:
Most of the ideas listed in the table above increase mAP significantly, except for switching to a Fully Convolutional Network with Anchor Box and using a new backbone network. Switching to Anchor Box increased recall while keeping mAP almost the same (from 69.5 to 69.2), while using the new backbone network reduced computational cost by 34% (mAP from 69.2 to 69.6).
Although detection systems are getting faster and more accurate, they are still constrained to a small set of objects. The datasets for object detection are very limited compared to those for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images, with the number of labels ranging from tens to hundreds. Classification datasets are much more extensive, consisting of millions of images with tens or hundreds of thousands of classes.
Increasing the size and number of classes of a detection dataset is not simple at all, because labeling images for detection is much more expensive than labeling for classification or tagging (in addition to labeling the class, we also have to assign exact bounding box coordinates, which is extremely time consuming). Therefore, it is unlikely that detection datasets will be as large as classification datasets in the near future.
Therefore, the author proposes the following two solutions:
- Propose a method to exploit a large number of existing classification data, use it to expand the scope of object recognition for the detection system. This method makes it possible to combine different data sets together.
- Propose a concurrent training algorithm that makes it possible to train object detectors on both classification and detection datasets. This method uses:
- The labeled detection images to learn detection-specific information: predicting bounding box coordinates to localize objects accurately, objectness (whether an object exists or not), and how to classify common objects.
- Classification pictures to increase vocabulary – expand the number of predictable classes – thereby making YOLOv2 even more powerful.
To accomplish the above two things, during training, the algorithm will mix images from two classification and detection datasets together:
- When the network sees an image labeled for detection, it backpropagates the error across the entire loss function.
- When the network sees an image labeled for classification, it just backpropagates the error from the classification error components of the loss function.
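The two cases above amount to choosing which parts of the loss contribute to the gradient. The sketch below only illustrates that selection; it assumes the loss has already been split into three scalar terms, and the names `joint_loss`, `coord`, `objectness`, and `class` are hypothetical, not taken from the original implementation.

```python
def joint_loss(loss_terms, has_boxes):
    """Select the loss terms to backpropagate for one training image.

    loss_terms: dict with scalar entries 'coord', 'objectness', 'class'.
    has_boxes:  True for a detection image (bounding boxes available),
                False for a classification image (class label only).
    """
    if has_boxes:
        # Detection image: use the full loss.
        return loss_terms["coord"] + loss_terms["objectness"] + loss_terms["class"]
    # Classification image: only the classification part contributes.
    return loss_terms["class"]

# Made-up loss values for one image.
terms = {"coord": 1.0, "objectness": 2.0, "class": 3.0}
detection_loss = joint_loss(terms, has_boxes=True)
classification_loss = joint_loss(terms, has_boxes=False)
```

In a real training loop the same selection would be applied per image in a mixed batch, so classification images never push the box coordinates in any direction.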
At this point, a new problem emerges: detection datasets contain only common objects with general labels, such as “dog”, “person”, “boat”. For example, the COCO dataset below has 80 classes, all with general meanings:
Meanwhile, classification datasets have both more labels and more specific ones. For example, the ImageNet dataset (picture below, 22k classes) has more than one hundred dog breeds, such as “Norfolk terrier”, “Yorkshire terrier”, “Bedlington terrier”, etc. So we need a way to match related labels across the datasets in order to train on them jointly.
Most classification methods use a Softmax Layer to classify objects, which assumes the classes are mutually exclusive. However, we cannot apply it across the combined datasets, because labels such as “Norfolk terrier” (from ImageNet) and “dog” (from COCO) are not mutually exclusive. We could use a multi-label model instead, but that ignores what we know about the data, for example that the COCO classes are mutually exclusive. Thus, there is a conflict between the detection and classification datasets.
2.3.1. Hierarchical classification
The labels of the ImageNet dataset are taken from WordNet, a language database that structures concepts and their relationships to each other. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms (words with a more specific meaning that falls within the meaning of another word) of “terrier”, which is a type of “hunting dog”, which in turn is a type of “dog”.
Most classification methods assume a flat structure – words have equal, independent and separate meanings from each other, with no word depth (no word is within the meaning of another word). However, to be able to combine data sets, we need to build a structure for classes.
WordNet is structured as a directed graph, not a tree, because the language is very complex. For example, “dog” belongs to the “canine family” and “domestic animal”, meaning “dog” belongs to two different branches.
We will rely on the structure of WordNet to build a hierarchical tree from the concepts in the ImageNet dataset, using only visual nouns, with the root node being “physical object”. The way to do this is as follows:
- First, we will add branches that have a single path from the root node.
- With the concepts left behind, we add paths to make the tree grow as little as possible. For example, if a concept has two paths from the root, one path adds three edges to the tree, the other only adds one edge to the tree, then we will choose the shorter path.
We call the above implementation the WordTree – a hierarchical model for visual concepts:
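To perform classification on WordTree, we compute the probability of a node as the product of the conditional probabilities along the path from that node to the root node (we assume the image contains an object, so $Pr(\text{physical object}) = 1$). For example:
$$Pr(\text{Norfolk terrier}) = Pr(\text{Norfolk terrier} \mid \text{terrier}) \times Pr(\text{terrier} \mid \text{hunting dog}) \times \ldots \times Pr(\text{mammal} \mid \text{animal}) \times Pr(\text{animal} \mid \text{physical object})$$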
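A toy computation of this product can be sketched in Python. The tree below is heavily simplified (the real WordTree has more intermediate nodes between “dog” and “physical object”), and every conditional probability is a made-up number. The function name `absolute_probability` is hypothetical.

```python
def absolute_probability(node, parent, cond_prob, root="physical object"):
    """Multiply conditional probabilities along the path from node to root.

    parent:    maps each node to its parent in the tree.
    cond_prob: maps each node to Pr(node | parent of node).
    The root is assumed to have probability 1.
    """
    prob = 1.0
    while node != root:
        prob *= cond_prob[node]
        node = parent[node]
    return prob

# Simplified path: Norfolk terrier -> terrier -> hunting dog -> dog -> root.
parent = {
    "Norfolk terrier": "terrier",
    "terrier": "hunting dog",
    "hunting dog": "dog",
    "dog": "physical object",
}
# Made-up conditional probabilities Pr(node | parent).
cond_prob = {
    "Norfolk terrier": 0.5,
    "terrier": 0.8,
    "hunting dog": 0.5,
    "dog": 0.5,
}
p_terrier = absolute_probability("Norfolk terrier", parent, cond_prob)
p_dog = absolute_probability("dog", parent, cond_prob)
```

A more specific label can never be more probable than its ancestors, since each step down the tree multiplies by a value of at most 1.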
We have now introduced YOLOv2 and YOLO9000 in detail, along with their improvements over the first YOLO version. YOLOv2 is more accurate and faster than other detection systems on different detection datasets. Moreover, it can process images of different sizes without trading off much between speed and accuracy.
YOLOv2 is further improved by joint training on detection and classification datasets, producing YOLO9000, a version able to predict more than 9000 object categories. We use WordTree to combine data from many different sources, together with joint optimization techniques, to train simultaneously on the ImageNet and COCO datasets. YOLO9000 marks a big step forward in bridging the gap between detection and classification datasets.