Study Group II

Report of Week 10

Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning series for AI enthusiasts and computer engineers. We follow the materials each week and get together on Saturdays to discuss them.

On February 9, we came to inzva for the tenth week of the study group to start a new course, diving into the basics of Convolutional Neural Networks under the guidance of Ahmet Melek, who led an effective, productive session on the topic.

As previously stated, one of the greatest challenges in detecting or classifying images is the computational and memory cost of dealing with large datasets. Convolutional neural networks can transform an image into an easily processed form while keeping the features important for prediction quality through the application of effective filters, which makes them quite useful for computer vision problems.

Melek explained that convolutional neural networks differ from traditional ones thanks to the added convolutional layers, which bring two key properties:

  • Sparsity of Connections

As opposed to traditional neural networks, which connect every neuron of one layer to every neuron of the subsequent layer, CNNs connect a group of neurons from one layer to a single neuron in the next one. This lets us do more work with less computational power and memory cost without compromising baseline accuracy.
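The memory savings can be seen with a quick back-of-the-envelope comparison. The layer sizes below (a 32x32x3 input, a 28x28x6 output, six 5x5x3 kernels) are illustrative assumptions, not numbers from the session:

```python
# Hypothetical example: parameter count of a fully connected layer vs. a
# convolutional layer producing the same 28x28x6 output from a 32x32x3 input.
fc_params = (32 * 32 * 3) * (28 * 28 * 6)  # every input connected to every output
conv_params = (5 * 5 * 3 + 1) * 6          # six 5x5x3 kernels, one bias each

print(fc_params)    # parameters of the dense layer
print(conv_params)  # parameters of the convolutional layer
```

The dense layer needs over fourteen million weights, while the convolutional layer needs only 456 parameters for the same output shape.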

  • Parameter Sharing

Thanks to parameter (weight) sharing, a convolutional neural network grows deeper rather than wider, which saves a lot of memory compared to an ordinary neural network: a feature detector that succeeds in detecting a certain part of an image can be reused to detect another part of the same image.

When it comes to convolutional neural networks, the most important skill to learn is feature extraction, the process of extracting a group of more meaningful attributes from a set of samples. CNNs achieve this by means of convolution kernels used as filters: small, simple matrices that can, for example, detect edges. As the filters are applied, the image's width and height shrink while its depth grows in each layer.

Let’s assume that we have a 64x64 image; the number of features of this image is 64*64 = 4096. Let’s also suppose that we want to start the detection process by first detecting the vertical lines of the image (e.g., the leg of a table). In such a case, we can apply the following filter, which will help us find the vertical lines.
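A filter of this kind can be written down directly. The matrix below is the standard vertical-edge kernel; the exact filter shown in the session may have differed:

```python
# A common 3x3 vertical-edge detection filter: positive weights on the left
# column, negative on the right, so responses are strongest where brightness
# changes horizontally (i.e., along a vertical edge).
vertical_filter = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

print(vertical_filter)
```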


If we convolve a 6x6 input with the above filter, we will have a 4x4 output. We start convolving from the upper-left corner and shift the filter to the right step by step. This process makes vertical lines show up as high values in the output. For example, let's look at the following example.


Convolution is a process in which we slide the kernel over the image and, at each position, multiply the pixel values by the corresponding values inside the kernel and sum the results to obtain one output value. For example, we can trace how particular output values (such as the bold and italic ones in the example) were obtained: each is the sum of the element-wise products at one filter position.


After having gone through a series of convolutional layers, we will increase the complexity of our features by combining the features extracted from the previous layers. Melek pointed out that the real perk of implementing a CNN is its ability to learn the weights of the kernel matrix, which means that we do not have to hand-engineer each filter.

While applying a convolution operation, we need to make sure to use padding to avoid having our image shrink too much. It is also important to notice that without padding, the filter visits the center of the image much more frequently than the edges, which means we throw away information coming from the edges. We can avoid this by adding pixels to the edges of the image. For example, we can pad a 5x5x1 image into a 7x7x1 one, which gives us a 5x5x1 output when convolved with a 3x3x1 kernel.

Another parameter that can be used to build an effective convolutional neural network is the stride, which defines the step size the filter takes while moving across the image.


After the convolution operation, if we apply Valid Padding (no padding), the convolved feature is dimensionally reduced compared to the input. On the other hand, if we apply Same Padding, the convolved feature keeps the same dimensions as the input.
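Padding and stride both feed into one well-known output-size formula, floor((n + 2p - f) / s) + 1, which we can check against the examples above:

```python
# Output size of a convolution along one dimension:
#   n: input size, f: filter size, p: padding, s: stride
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4 -- valid padding, as in the 6x6 example
print(conv_output_size(5, 3, p=1))   # 5 -- same padding keeps the 5x5 size
print(conv_output_size(7, 3, s=2))   # 3 -- a stride of 2 shrinks the output faster
```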

Melek also mentioned that the convolution operation we apply in machine learning is called cross-correlation in the mathematics literature. In deep learning, however, the two terms refer to the same process, and we always use the word convolution instead of cross-correlation.

The next topic was convolution over volumes. Images usually have RGB values, so we need a 3x3x3 kernel: the first 3x3 is the kernel's width and height, and the last 3 is its depth, which must match the depth of the image. For example, if we have a 6x6x3 image and a 3x3x3 kernel, we get a 4x4 output matrix. If we use K different 3x3x3 kernels, we get a 4x4xK output volume. In this sense, a CNN is almost like an ordinary neural network; however, instead of directly multiplying by the weight matrix W, we convolve the image with it. We use the same bias term for the red, green, and blue channels.
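The shape bookkeeping for convolution over volumes can be written down as a small helper; the numbers below are the 6x6x3 and 3x3x3 example from the text:

```python
# Output shape of a "valid" convolution over a volume: the kernel depth must
# match the image depth, and the output depth equals the number of kernels K.
def conv_volume_output_shape(image_shape, kernel_shape, num_kernels):
    h, w, c = image_shape
    kh, kw, kc = kernel_shape
    assert c == kc, "kernel depth must match image depth"
    return (h - kh + 1, w - kw + 1, num_kernels)

print(conv_volume_output_shape((6, 6, 3), (3, 3, 3), 1))   # (4, 4, 1)
print(conv_volume_output_shape((6, 6, 3), (3, 3, 3), 10))  # (4, 4, 10)
```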

Melek went on to talk about an efficient technique used to prevent overfitting while building our CNN, called pooling. This process enables images to be recognized by our CNN regardless of how they are presented: horizontally, vertically, shrunk, or enlarged. Pooling layers are inserted between convolutional layers to reduce computational complexity and the number of parameters in the CNN, which is what helps with overfitting. The most commonly used pooling types are Max Pooling and Average Pooling.

In Max Pooling, we choose a sample region of the feature map and take the maximum of that group of pixels. Average Pooling is applied in the same manner with a slight difference: instead of taking the maximum of each sample region, we take its average.
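Both pooling types differ only in the operation applied to each window, so they can share one sketch. The 4x4 input values below are a made-up example:

```python
# Minimal 2x2 pooling with stride 2 in pure Python; `op` picks the pooling type.
def pool2d(image, size=2, stride=2, op=max):
    output = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(op(window))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

image = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(pool2d(image, op=max))   # [[6, 8], [3, 4]] -- max pooling
print(pool2d(image, op=mean))  # [[3.75, 5.25], [2.0, 2.0]] -- average pooling
```

Note that pooling halves the width and height here while introducing no parameters at all, which is exactly why it reduces computation without adding anything to learn.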

Melek also mentioned that some sources do not consider pooling layers as real layers and count them as part of the preceding layers. Since pooling layers do not carry any weights, we skip them during backpropagation.

Next week, Melek will keep leading our discussions about the Convolutional Neural Networks as we try to fully grasp their nature.

Guide of the Week: AHMET MELEK

Ahmet Melek is studying Business Management at Bogazici University. He previously worked on topics such as Blockchain, Biometrics and the Semantic Web.

His main interest is Brain-Computer Interfaces, more specifically machine learning approaches to signal classification.

Subscribe to our newsletter here and stay tuned for more hacker-driven activities.

inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.