YOLOv1 – Detecting objects with only one look – The Origin


1. Introduction

At the time YOLO was introduced by Joseph Redmon and his colleagues, previous work on object detection repurposed classifiers to perform detection. YOLO takes a completely new approach, treating object detection as a single regression problem over bounding boxes and their class probabilities.

YOLO uses a single neural network to predict bounding boxes and class probabilities directly from the full image in a single evaluation. Therefore, we only need to look at the image once to predict which objects are present and where they are. That explains the name YOLO – You Only Look Once. It is for this reason that YOLO is extremely fast. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Detection systems at that time typically repurposed a classifier to evaluate objects at various positions and scales in a test image. Some prominent examples were Deformable Parts Models (DPM), which use a sliding-window approach where a classifier is run at evenly spaced positions across the entire image, and the well-known R-CNN family, which uses region proposals to first generate candidate bounding boxes likely to contain objects and then runs a classifier on each proposed box. After classification, post-processing is used to refine the bounding boxes, remove duplicate detections, and so on. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

YOLO, in contrast to the methods above, is extremely simple. It uses a single CNN to simultaneously predict bounding boxes and class probabilities for those boxes. Furthermore, YOLO is trained on full images – as opposed to the methods above, which train on small regions of the image – and thus directly optimizes detection performance.

The figure above is an example of YOLO’s detection system. Processing an image with YOLO is simple and direct. The model consists of 3 steps (a small code sketch follows the list):

  1. Resize the input image to 448 × 448.
  2. Run a single convolutional network on the image.
  3. Refine the resulting detections (e.g., with non-max suppression).
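As a quick illustration of step 1, here is a minimal preprocessing sketch in Python using Pillow and NumPy (the function name is ours, not from the paper); steps 2 and 3 are sketched in code in sections 2.2 and 2.4 below.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image) -> np.ndarray:
    """Step 1: resize the input image to 448 x 448 and scale pixels to [0, 1]."""
    resized = image.resize((448, 448))
    x = np.asarray(resized, dtype=np.float32) / 255.0
    return x[None]  # add a batch dimension: (1, 448, 448, 3) for an RGB image
```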

YOLO’s unified model has several advantages over traditional object detection methods as follows:

  1. Firstly, YOLO is extremely fast. Since YOLO treats object detection as a regression problem, it does not need a complicated pipeline. We simply run the neural network on a new image at test time to predict detections. YOLO’s base network runs at 45 FPS, and the fast version (Fast YOLO) runs at more than 150 FPS. This makes it possible to process streaming video in real time with less than 25 milliseconds of latency. YOLO also achieves more than twice the mAP of other real-time systems.
  2. Second, YOLO reasons globally about the image when making predictions. Unlike the sliding-window and region proposal methods mentioned above, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance. As a result, YOLO makes significantly fewer background errors than other methods, such as Fast R-CNN.
  3. Third, and most novel: YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms leading detection methods like DPM and R-CNN. Because YOLO is highly generalizable, it is less likely to break down when applied to new domains or given unexpected inputs.

The image above shows YOLO’s predictions on artwork and natural photos from the internet. Its predictions are mostly accurate, although it does mistake one person for an airplane (the 7th picture, counting left to right, top to bottom).

However, YOLO is still behind other leading detection systems in terms of accuracy. While it was able to quickly identify objects in the photo, it had difficulty locating some objects, especially small ones.

2. Details

2.1. How it works

YOLO splits an input image into a grid of size S × S. In YOLO, the value of S is chosen as 7.

The image above is an example of an input image divided into a grid of size S × S.

Each grid cell predicts B bounding boxes and a confidence score for each of those boxes. We will take a closer look at these two concepts below.

The confidence score reflects two things:

  • The model’s confidence that the predicted box contains an object.
  • How accurate the predicted box is (i.e., how well it matches the ground-truth box).

From the above two ideas, we define confidence score more rigorously as follows:

Pr(Object) × IOU_pred^truth

From the above formula, we can draw a few observations as follows:

  • If no object exists in that cell, then Pr(Object) = 0, so the confidence score should be 0.
  • Conversely, if that cell contains an object, then Pr(Object) = 1, so we expect the confidence score to equal the IOU between the predicted box and the ground-truth box.
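To make this definition concrete, here is a small sketch in plain Python. Boxes are assumed to be in (x1, y1, x2, y2) corner format, and the helper names are ours:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(cell_has_object, pred_box, truth_box):
    """Target confidence = Pr(Object) x IOU(pred, truth)."""
    if not cell_has_object:          # Pr(Object) = 0, so the score is 0
        return 0.0
    return iou(pred_box, truth_box)  # Pr(Object) = 1, so the score is the IOU
```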

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

  • (x, y) are the coordinates of the center (offset) of the bounding box relative to the bounds of its grid cell, so both values fall in the range [0, 1].
  • w, h are the width and height of the bounding box, normalized by the width and height of the original image, so their values also fall in the range [0, 1].
  • confidence represents the IOU between the predicted box and the ground-truth box.
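A hedged sketch of this parameterization, mapping a ground-truth box given in pixels to YOLO’s (x, y, w, h) targets for its grid cell (the function name and input format are our assumptions):

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Encode a ground-truth box (center cx, cy and size bw, bh, in pixels)
    into YOLO's (x, y, w, h) targets for the grid cell holding its center."""
    col = min(S - 1, int(cx / img_w * S))  # grid column of the box center
    row = min(S - 1, int(cy / img_h * S))  # grid row of the box center
    x = cx / img_w * S - col               # center offset within the cell, in [0, 1]
    y = cy / img_h * S - row
    w = bw / img_w                         # size normalized by the image, in [0, 1]
    h = bh / img_h
    return row, col, (x, y, w, h)
```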

Each grid cell also predicts C conditional class probabilities:

Pr(Class_i | Object)

These probabilities are conditioned on the grid cell containing an object.

NOTE: YOLO predicts only one set of class probabilities per grid cell, regardless of the number of bounding boxes B.

At test time, we multiply the conditional probability of each class by the confidence prediction of each box as follows:

Pr(Class_i | Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth

The above formula gives us a class-specific confidence score for each box. It encodes two things:

  • the probability that class_i appears in that box, and
  • how well the predicted box fits the object.
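In code, this test-time product is a single broadcast multiplication. A NumPy sketch (the array names and shapes are our assumptions):

```python
import numpy as np

def class_scores(class_probs, box_confidences):
    """Pr(Class_i | Object) x Pr(Object) x IOU = Pr(Class_i) x IOU.

    class_probs:     (S, S, C) conditional class probabilities per cell
    box_confidences: (S, S, B) confidence per predicted box
    returns:         (S, S, B, C) class-specific confidence per box
    """
    return box_confidences[..., :, None] * class_probs[..., None, :]
```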

We summarize the operation process of YOLOv1 as shown below:

Thus, in summary, each grid cell predicts B bounding boxes and a confidence score for each of those boxes, plus C class probabilities. The prediction is encoded as a tensor of size

S × S × (B × 5 + C)

When running YOLO on the PASCAL VOC dataset, S = 7, B = 2, and C = 20 (PASCAL VOC has 20 labelled classes), so the final prediction is a 7 × 7 × 30 tensor.
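A small NumPy sketch of slicing such a 7 × 7 × 30 tensor, assuming the B boxes come first and the C class probabilities last along the channel axis (the exact memory layout is our assumption):

```python
import numpy as np

S, B, C = 7, 2, 20                      # PASCAL VOC settings from the paper
pred = np.random.rand(S, S, B * 5 + C)  # dummy network output: 7 x 7 x 30

boxes = pred[..., :B * 5].reshape(S, S, B, 5)      # (x, y, w, h, conf) per box
class_probs = pred[..., B * 5:]                    # C class probabilities per cell
print(pred.shape, boxes.shape, class_probs.shape)  # (7, 7, 30) (7, 7, 2, 5) (7, 7, 20)
```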

2.2. Network Design

The YOLOv1 network architecture is inspired by the GoogLeNet model for image classification. It consists of 24 convolutional layers that extract features from the image, followed by 2 fully connected layers that predict the output probabilities and coordinates. Instead of GoogLeNet’s inception modules, YOLO simply uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers.

The convolutional layers are pretrained on the ImageNet classification dataset at a resolution of 224 × 224; the input resolution is then doubled to 448 × 448 for detection.
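A PyTorch-style sketch of that reduction pattern (the channel counts here are illustrative, not taken from the paper; YOLOv1 uses LeakyReLU with slope 0.1 for its hidden layers):

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """A 1x1 'reduction' convolution shrinks the channel count
    before the more expensive 3x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 reduction layer
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1),
    )

block = reduction_block(512, 256, 512)  # e.g., one middle block of the backbone
```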

2.3. Training

YOLOv1 uses the SSE (Sum of Squared Errors) loss function. This function has the advantage of being easy to optimize, but it has some disadvantages:

  1. SSE weights the localization error equally with the classification error. This is not ideal because the two error types have different scales.
  2. In an image, many grid cells contain no object. Training pushes the confidence scores of those cells toward zero, and the large number of empty cells can overwhelm the gradient from the cells that do contain objects (the same imbalance problem that Focal Loss was later designed to address). This makes the model unstable and can cause training to diverge early.
  3. SSE weights the error of a large box equally with that of a small box. Clearly, a small deviation in a large box matters less than the same deviation in a small box.

Therefore, YOLO proposes the following solutions in turn:

  1. Increase the loss from bounding box coordinate predictions (while keeping the classification error unchanged) using the parameter λ_coord = 5 (coord stands for coordinate).
  2. Decrease the loss from confidence predictions for boxes that do not contain objects using the parameter λ_noobj = 0.5 (noobj stands for no object).
  3. Instead of predicting the width and height of the bounding box directly, predict the square root of the width and height.

There is one more thing to note. YOLO predicts multiple bounding boxes per grid cell. At training time, only one bounding box predictor is made responsible for the object in that cell: the one whose prediction has the highest IOU with the ground-truth box. This leads to specialization among the bounding box predictors. Each predictor gets better at certain sizes, aspect ratios, or classes of objects, improving overall recall.

We will explain each term of the loss in turn. First, note the two indicator functions 1_i^obj and 1_ij^obj:

  • 1_i^obj equals 1 if an object appears in cell i, and 0 otherwise.
  • 1_ij^obj equals 1 if the j-th bounding box predictor in cell i is responsible for that prediction (i.e., it has the highest IOU with the ground-truth box), and 0 otherwise.

Then, each term in the loss function can be interpreted as follows (a code sketch follows the list):

  1. Localization error: error of the center coordinates (offsets) (x, y) of the bounding box relative to its grid cell. SSE counts this term only when the j-th box of cell i is responsible for an object.
  2. Localization error: error of the width and height. Note that we take the square root of the width and height, which partially compensates for the imbalance between small and large boxes (a small deviation matters more for a small box than for a large one).
  3. Confidence error: error in the confidence of the j-th bounding box in the i-th grid cell when it is responsible for an object.
  4. Confidence error: error in the confidence of the j-th bounding box in the i-th grid cell when it does not contain an object.
  5. Classification error: error of the class probabilities in the i-th grid cell, counted only if that grid cell contains an object.
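Putting the five terms together, here is a hedged PyTorch sketch of the loss. All tensor names, shapes, and the target layout are our assumptions for illustration, not the paper’s code:

```python
import torch

def yolo_v1_loss(pred_boxes, true_boxes, pred_conf, true_conf,
                 pred_cls, true_cls, resp, has_obj,
                 lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of the five SSE terms.

    pred_boxes, true_boxes: (S, S, B, 4) boxes as (x, y, w, h), normalized
    pred_conf,  true_conf:  (S, S, B)    confidence per box
    pred_cls,   true_cls:   (S, S, C)    class probabilities per cell
    resp:    (S, S, B) float 0/1 indicator 1_ij^obj (responsible box)
    has_obj: (S, S)    float 0/1 indicator 1_i^obj  (cell holds an object)
    """
    # Terms 1-2: localization; sqrt of width/height so the same absolute
    # deviation costs more for small boxes than for large ones.
    xy_err = ((pred_boxes[..., :2] - true_boxes[..., :2]) ** 2).sum(-1)
    wh_err = ((pred_boxes[..., 2:].clamp(min=0).sqrt()
               - true_boxes[..., 2:].sqrt()) ** 2).sum(-1)
    loc = lambda_coord * (resp * (xy_err + wh_err)).sum()

    # Terms 3-4: confidence; full weight for responsible boxes,
    # down-weighted by lambda_noobj for boxes without objects.
    conf_err = (pred_conf - true_conf) ** 2
    conf = (resp * conf_err).sum() + lambda_noobj * ((1.0 - resp) * conf_err).sum()

    # Term 5: classification, only for cells that contain an object.
    cls = (has_obj * ((pred_cls - true_cls) ** 2).sum(-1)).sum()

    return loc + conf + cls
```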

We summarize the above ideas with the following figure:

2.4. Inference

Like training, predicting detections for a test image only needs a single network evaluation. This makes the speed of YOLO at the time of testing extremely fast compared to other classifier-based methods.

On the PASCAL VOC dataset, YOLO predicts 98 bounding boxes per image along with class probabilities for each box. This follows directly from the grid: S × S × B = 7 × 7 × 2 = 98.

The predicted bounding boxes come in various sizes. Usually it is clear which grid cell an object falls into, and the network predicts only one bounding box for each object. However, a large object may span many cells, or an object may fall on the border between cells, so it may be localized by several cells at once. This causes duplicate predictions: different cells predicting the same object. The Non-Max Suppression (NMS) technique can be used to resolve this, and applying it increases mAP by 2–3%.
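A plain-Python sketch of that greedy NMS step, reusing the iou helper sketched in section 2.1 (the 0.5 threshold is a common default, not a value from the paper):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping duplicates, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```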

3. Limits of YOLO

  • YOLO imposes a strong spatial constraint on bounding box predictions: each grid cell predicts only 2 boxes and a single class. This constraint becomes a disadvantage when a grid cell contains more than one object, especially small objects. YOLO struggles to predict small objects that appear close together or in groups, such as a flock of birds: more than two birds may fall into the same grid cell, and YOLOv1 cannot predict all of them because of the limit on the number of predicted boxes per cell.
  • Since YOLO learns to predict bounding boxes from data, it has difficulty generalizing to objects with new or unusual aspect ratios. The model also uses relatively coarse features for predicting boxes, since its architecture applies several downsampling layers to the input image.
  • The loss function treats errors in small and large bounding boxes the same. A small error in a large box usually has little effect, while the same error in a small box has a much larger effect on the IOU. Most of YOLO’s errors come from inaccurate localization.

4. Result

4.1. VOC PASCAL 2007

We compare YOLO with other real-time detection systems of the time on the PASCAL VOC 2007 dataset. YOLO runs in real time at 45 FPS with a mAP of 63.4 – the highest among real-time systems. Fast YOLO achieves the highest real-time speed at 155 FPS with an accuracy of 52.7 mAP – about 10 points lower than YOLO. YOLO still trails Faster R-CNN, which reaches 70 mAP. See the details in the table below:

We will take a closer look at the differences between YOLO and other leading detection systems by analyzing the results on the 2007 VOC dataset. We will compare YOLO with Faster R-CNN. The figure below is the error analysis between Faster R-CNN and YOLO:

We can see that YOLO has difficulty in locating the object accurately (19.0% error), which is larger than all other types of YOLO errors combined. Faster R-CNN has a smaller localization error (8.6%) but it has more background error (13.6%), almost 3 times more than YOLO (4.75%).

Thanks to the above error analysis, we can combine YOLO with Faster R-CNN to achieve higher accuracy. We will use YOLO to remove the background from Faster R-CNN for better effect. Specifically, for each bounding box predicted from Faster R-CNN, we check if YOLO predicts a similar box. The results are in the table below:

The best Faster R-CNN model achieves 71.8 mAP on the VOC 2007 test set. When combined with YOLO, accuracy increases to 75.0 mAP, a gain of 3.2 points.

4.2. VOC PASCAL 2012

On the VOC 2012 test set, YOLO achieves an accuracy of 57.9% mAP, lower than other leading methods (see the details in the table below), because it has difficulty predicting small objects. Meanwhile, the Faster R-CNN + YOLO combination achieves top accuracy with 70.7% mAP.

4.3. Artwork

We compare YOLO with other detection systems on the Picasso and People-Art datasets – two datasets for testing person detection in artwork.

In this domain, YOLO also outperforms the other methods – its AP drops less than theirs when moving to these two datasets. The figure below shows the Precision-Recall curves of the different methods on the Picasso dataset.

5. Conclusion

This article introduced the first version of YOLO, also known as YOLOv1, in detail. It is a simple model to build and can be trained directly on full images. Unlike classifier-based methods, it is trained with a loss function that directly corresponds to detection performance, and the entire model is trained jointly. Furthermore, it generalizes well to new domains, making it well suited to real applications.



Source: Viblo