DeepLearning.ai Study Group II
Report of Week 12
Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning series for AI enthusiasts and computer engineers. We follow the materials on https://www.deeplearning.ai each week and get together on Saturdays to discuss them.
On February 23, we came to inzva for the 12th week of study group to discuss how to use Convolutional Neural Networks for the purposes of object detection, under the guidance of Yavuz Kömeçoğlu, who led an effective, productive session about the topic.
As is widely known, object detection is one of the biggest challenges in computer vision: put simply, it means finding the objects in an image and figuring out what each of them is and where it is located. Object detection can be summarized as localizing multiple objects within a single image.
Object localization, on the other hand, mostly involves a single object to be recognized and localized within an image, much like image classification. Classification with localization puts a bounding box around the object being classified. This is done by having our neural network output four additional units that specify the bounding box of the object we intend to localize. The units describing the localization are as follows.
• Bx : the x-coordinate of the object's center, with the upper-left corner of the image taken as (0, 0) and the bottom-right corner as (1, 1)
• By : the y-coordinate of the object's center, in the same coordinate system
• Bw : the width of the bounding box as a fraction of the whole image's width
• Bh : the height of the bounding box as a fraction of the whole image's height
In addition to the variables mentioned, we have the Pc variable, which indicates whether an object is present in the image at all. We also have c1, c2, ..., cn, which tell us whether class k (ck) appears in the image. In the end, we obtain an output vector Y that stacks Pc, the four bounding-box units, and the class indicators.
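As a concrete sketch, the target vector can be assembled as below. The layout `[Pc, Bx, By, Bw, Bh, c1, ..., cn]` and the choice of n = 3 classes are assumptions for illustration only.

```python
import numpy as np

# Hypothetical target vector for classification with localization,
# assuming n = 3 classes and the layout [Pc, Bx, By, Bw, Bh, c1, c2, c3].
def make_target(pc, bx=0.0, by=0.0, bw=0.0, bh=0.0, classes=(0, 0, 0)):
    """Build the output vector Y. When Pc = 0 (no object), the
    remaining entries are "don't care" and excluded from the loss."""
    return np.array([pc, bx, by, bw, bh, *classes], dtype=np.float32)

# An object of class 2 centered at (0.5, 0.7), spanning 40% of the
# image width and 30% of its height.
y_obj = make_target(1, bx=0.5, by=0.7, bw=0.4, bh=0.3, classes=(0, 1, 0))
# Background: no object present.
y_bg = make_target(0)
print(y_obj.shape)  # (8,)
```

With n classes the vector has 5 + n entries, which is where the length 8 in the YOLO example later in this report comes from.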
After classification with localization, Kömeçoğlu proceeded to explain another important technique called landmark detection, in which the neural network outputs the coordinates of important points (landmarks) in the image. Kömeçoğlu also underlined that the most important aspect of landmark detection is keeping the identity of each landmark consistent across images.
For example, in a face recognition system, if we pick a landmark to be the corner of the mouth, it should refer to the same point in every image so that we get consistent results. As an example, consider the following output Y: it has 129 variables in total, made up of one parameter that tells us whether the desired object was found and 64 unique (x, y) points, i.e. 128 coordinates.
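The 129-unit output above can be sketched as follows; the function name and argument convention are illustrative assumptions, only the 1 + 64×2 = 129 arithmetic comes from the example.

```python
import numpy as np

# Landmark-detection output sketch: one presence flag plus 64 (x, y)
# landmark pairs gives 1 + 64 * 2 = 129 output units.
N_LANDMARKS = 64

def make_landmark_target(present, points):
    """points: array of shape (64, 2) with coordinates in [0, 1]."""
    points = np.asarray(points, dtype=np.float32)
    assert points.shape == (N_LANDMARKS, 2)
    return np.concatenate([[float(present)], points.ravel()])

y = make_landmark_target(1, np.random.rand(N_LANDMARKS, 2))
print(y.shape)  # (129,)
```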
Kömeçoğlu went on to talk about sliding windows detection, an object detection algorithm that combines the concepts above. With this technique, we slide a window across the image and classify each crop to see whether it contains the object of interest. This approach is too slow if not implemented convolutionally, because we iteratively run the classifier on many overlapping crops, and we additionally need to repeat the scan with windows of different sizes. When we instead run a convolutional network over the whole image, the final convolutional layer tells us the detection result for every sliding-window position at once: each spatial position of the output volume holds, along its depth, the Y vector for one sliding-window iteration.
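The efficiency gain can be seen from output-shape arithmetic alone. Below is a shape-only sketch of a hypothetical small network (a 5x5 conv, a 2x2 max pool, and fully connected layers re-expressed as 5x5 and 1x1 convolutions); the layer sizes are assumptions chosen so that a 14x14 crop yields a single prediction.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer on one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

def network_output_size(n):
    """Output grid width of a hypothetical network for an n x n input."""
    n = conv_out(n, 5)       # 5x5 convolution
    n = conv_out(n, 2, 2)    # 2x2 max pool, stride 2
    n = conv_out(n, 5)       # FC layer re-expressed as a 5x5 convolution
    n = conv_out(n, 1)       # 1x1 convolution (classifier)
    return n

print(network_output_size(14))  # 1 -> one prediction per 14x14 crop
print(network_output_size(16))  # 2 -> a 2x2 grid of predictions in one pass
```

Running the same network on a 16x16 image produces a 2x2 grid of predictions in a single forward pass, instead of four separate passes over overlapping 14x14 crops.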
Even though the convolutional implementation of the sliding windows algorithm is more time-efficient and computationally cheaper, it does not always predict the most accurate bounding boxes for the object we are detecting. To predict bounding boxes more precisely, we place a grid over the image (typically a 19x19 grid) and produce an output vector for each grid cell by applying the classification-with-localization scheme. To do this properly, Kömeçoğlu introduced an algorithm called YOLO (You Only Look Once), which gives us the definite borders of the objects even when no window size perfectly fits the object itself.
For example, at the end of a YOLO run we get a 19x19x8 output volume: 19x19 because we used a 19x19 grid, and 8 because that is the length of the Y vector we discussed in object localization. Each grid cell thus tells us not only whether it contains an object, but also the object's location, width, and height.
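The output volume's shape follows directly from these numbers, as the small helper below shows. Allowing K boxes per cell and n_classes classes is an assumption that generalizes the report's single-box, 3-class example.

```python
# Shape of the YOLO output tensor: a B x B grid, K boxes per cell,
# and 5 + n_classes numbers per box (Pc, Bx, By, Bw, Bh, c1..cn).
# K and n_classes defaults are illustrative assumptions.
def yolo_output_shape(B=19, K=1, n_classes=3):
    return (B, B, K * (5 + n_classes))

print(yolo_output_shape())     # (19, 19, 8) -> the example above
print(yolo_output_shape(K=2))  # (19, 19, 16) -> two boxes per cell
```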
In addition to this, it is important to note that YOLO can sometimes detect the same object in multiple grid cells. To avoid this, we apply non-max suppression to clean up the detection results and find the true bounding box. This method starts from the intersection over union (IoU) formula, which measures how much two boxes overlap: the area of their intersection divided by the area of their union. We usually label a prediction as correct if its IoU with the ground-truth box is greater than 0.5.
Combining this formula with non-max suppression gives us a much higher chance of keeping only the bounding box that truly contains our object.
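A minimal sketch of IoU and non-max suppression as described above, assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping region (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it by more
    than iou_threshold, and repeat with the remaining boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 ≈ 0.68, so it is suppressed; the distant third box survives.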
What happens when we decide to detect multiple objects in the same cell instead of just one? If we need to detect overlapping objects, we predefine two separate anchor boxes of different shapes and associate them with two different predictions. This way, we avoid getting unwanted results when we run the YOLO algorithm.
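One common way to associate objects with anchor boxes, sketched here under assumed anchor shapes: each object is matched to the anchor whose (width, height) shape it overlaps most, measured by IoU of the shapes aligned at a common corner.

```python
# Hypothetical anchors: one wide box (e.g. car-like) and one tall box
# (e.g. person-like), as (width, height) fractions of the image.
ANCHORS = [(0.9, 0.3), (0.3, 0.9)]

def shape_iou(wh_a, wh_b):
    """IoU of two box shapes aligned at the same corner."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(obj_wh):
    """Index of the anchor whose shape best matches the object."""
    return max(range(len(ANCHORS)), key=lambda k: shape_iou(obj_wh, ANCHORS[k]))

print(best_anchor((0.8, 0.25)))  # 0 -> the wide anchor
print(best_anchor((0.25, 0.8)))  # 1 -> the tall anchor
```

Because each overlapping object lands in a different anchor slot, two objects in the same cell no longer compete for one prediction.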
Collectively, these methods can be used to implement the YOLO algorithm efficiently and obtain the best possible predictions for object detection.
To summarize, we can implement the YOLO algorithm with the following steps:
1. Divide the image into a BxB grid.
2. For each grid cell, determine K boxes.
3. As a result, we have B*B*K boxes in total.
4. Then, use the predicted confidence scores (together with non-max suppression) to eliminate the unwanted boxes.
Next week, we will learn how to apply our CNNs to face recognition and neural style transfer, and explore ways to use them for fun topics such as art generation.
Kömeçoğlu also shared additional sources to help us understand the topic thoroughly.
Guide of the Week: YAVUZ KÖMEÇOĞLU
Yavuz graduated from the Department of Mathematics at Kocaeli University in 2012 and completed his master’s degree in Computer Engineering at Okan University.
Yavuz continues to develop products for a media monitoring center, applying audio and image processing on the artificial intelligence team of a private company.