DeepLearning.ai Study Group II
Report of Week 6
Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning study series for AI enthusiasts and computer engineers. We follow the materials on https://www.deeplearning.ai each week and get together on Saturdays to discuss them.
On January 5, we gathered for the sixth week of DeepLearning.ai Study Group II and discussed the course titled “Optimization Algorithms”.
This week’s guide was one of our study group’s participants, Batuhan Kahya, who led a very interactive discussion about various optimization methods such as Mini-batch Gradient Descent, Momentum, RMSProp and Adam that will ultimately help train our neural networks much faster.
Deep learning models are trained on large data sets, which makes training a neural network a slow, iterative process unless it is optimized efficiently. We therefore use optimization algorithms to speed up the process and make training more efficient. With a huge data set, it makes little sense to process the entire training set at once for every update, since this takes too much time. As we discussed in previous weeks, standard (batch) gradient descent has to run through every training example before it can take a single step.
To solve this slowness problem, we can use a method called Mini-batch Gradient Descent, which divides the large data set into small batches of equal size. After splitting, instead of waiting for sums over the entire training set as in standard gradient descent, we compute the gradient on a different batch in each iteration. This still lets us use the whole training set while training the model, and since each batch is relatively small, each gradient is computed much faster. In mini-batch gradient descent, a parameter update that reduces the cost on one batch may increase it on another, so the cost curve is much noisier. Since mini-batch gradient descent can still converge toward a local minimum, just like standard gradient descent, we do not lose accuracy. A disadvantage worth noting is that mini-batch gradient descent adds another hyperparameter to configure: the batch size.
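The splitting step described above can be sketched as follows. This is only a minimal illustration, assuming NumPy and a layout in which each column of X is one training example; the function name make_mini_batches and its parameters are hypothetical and were not part of the session. Note that when the data set size is not a multiple of the batch size, the last batch is smaller than the others.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples
    perm = rng.permutation(m)           # one shuffle shared by inputs and labels
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size        # slicing clips the final, smaller batch
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

# Toy data (an assumption for illustration): 150 examples with 2 features each.
X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = make_mini_batches(X, Y, batch_size=64)
batch_sizes = [xb.shape[1] for xb, _ in batches]
```

Each (inputs, labels) pair in batches would then drive one gradient descent step, so a single pass over the data yields several parameter updates instead of one.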
Kahya also briefly explained how to choose the batch size. It is important to note that if we pick a very small batch size, the number of iterations needed to go through the whole training set increases, which slows down training. Kahya also mentioned that it is more efficient to choose a batch size that is a power of 2, because processors can then train the network relatively faster.
Kahya went on to talk about exponentially weighted averages, a key component of several optimization algorithms, which are used for curve smoothing and ultimately help with the problem of noisiness. In this method, we average the noisy data with weights that decrease exponentially the further back a data point lies. The resulting values vary less from one sample to the next, which smooths the curve. He also talked about which values of the hyperparameter β are used for this method. Then he introduced bias correction, which gives better estimates in the initial phase of training. Because the average starts from zero and each new value is a weighted average of the previous ones, the first values of the smoothed curve turn out much lower than expected. Applying bias correction therefore yields accurate estimates from the very beginning.
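As a rough illustration of the averaging and of bias correction, the sketch below uses the usual recurrence v_t = β·v_(t-1) + (1 − β)·θ_t and divides by (1 − β^t) to correct the early values. The function name ewa and the constant toy data are assumptions made for this example.

```python
import numpy as np

def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, optionally bias-corrected."""
    v = 0.0                                   # the running average starts at zero
    smoothed = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta   # older points get exponentially less weight
        # Dividing by (1 - beta**t) compensates for the zero initialization.
        smoothed.append(v / (1.0 - beta ** t) if bias_correction else v)
    return np.array(smoothed)

data = np.full(5, 10.0)                       # a constant signal makes the bias easy to see
raw = ewa(data, bias_correction=False)
corrected = ewa(data)
```

With a constant signal of 10, the uncorrected average starts near 1 and only slowly climbs toward 10, while the corrected version sits at 10 from the first step.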
Kahya then explained Gradient Descent with Momentum, an algorithm that builds on exponentially weighted averages. Here we compute the exponentially weighted average of the gradients from previous iterations and use that average, rather than the raw gradient, to update the weights. This way, steps taken in the wrong directions cancel each other out and we reach the local minimum with fewer oscillations. Moreover, we can reach the target faster thanks to the momentum gained. However, the momentum also brings a risk of overshooting the target. Kahya pointed out that decreasing the learning rate during the last phase of training might be the answer to this problem.
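A single momentum update might look like the sketch below. The quadratic toy loss, the starting point, and the hyperparameter values (learning rate 0.1, β = 0.9) are assumptions chosen only to make the example runnable; momentum_step is a hypothetical helper name.

```python
import numpy as np

def momentum_step(w, grad, v, learning_rate=0.1, beta=0.9):
    """One gradient descent step using the exponentially weighted average of gradients."""
    v = beta * v + (1.0 - beta) * grad   # running average of past gradients
    w = w - learning_rate * v            # step along the averaged direction
    return w, v

# Toy loss (assumed): a narrow valley, steep along w[1] and flat along w[0].
curvature = np.array([0.1, 10.0])

def loss_fn(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad_fn(w):
    return curvature * w

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
initial_loss = loss_fn(w)
for _ in range(200):
    w, v = momentum_step(w, grad_fn(w), v)
final_loss = loss_fn(w)
```

Along the steep direction the successive gradients alternate in sign and largely cancel in the average, while along the flat direction they accumulate, which is the effect described above.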
Then we went on to discuss RMSprop (Root Mean Square Propagation), another optimization algorithm that can be used instead of Momentum to speed up gradient descent. With this algorithm, instead of averaging the gradients directly, we keep an exponentially weighted average of the element-wise squares of the gradients. We divide the gradient computed in the current iteration by the square root of that average and then take a step according to the resulting value. Just like Momentum, this method prevents us from taking large steps in a wrong direction and helps us reach the target much more quickly. Kahya also mentioned that we need to add 10^-8 to the denominator to avoid dividing by a value too close to 0. According to Kahya, both RMSprop and Adam (Adaptive Moment Estimation), which combines the Momentum and RMSprop algorithms, are among the optimization algorithms that stand out the most. They also work well across various deep learning architectures.
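The two update rules can be sketched side by side as below. This is an illustrative sketch rather than a reference implementation: the function names, the default hyperparameters (β = 0.9 for RMSprop, β1 = 0.9 and β2 = 0.999 for Adam, learning rate 0.01), and the toy weights are assumptions. The small constant eps plays the role of the 10^-8 term mentioned above.

```python
import numpy as np

def rmsprop_step(w, grad, s, learning_rate=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step: scale the gradient by the root of its averaged square."""
    s = beta * s + (1.0 - beta) * grad ** 2            # average of element-wise squares
    w = w - learning_rate * grad / (np.sqrt(s) + eps)  # eps keeps the denominator away from 0
    return w, s

def adam_step(w, grad, v, s, t, learning_rate=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: Momentum-style and RMSprop-style averages, both bias-corrected."""
    v = beta1 * v + (1.0 - beta1) * grad               # Momentum part
    s = beta2 * s + (1.0 - beta2) * grad ** 2          # RMSprop part
    v_hat = v / (1.0 - beta1 ** t)                     # bias correction, t counts from 1
    s_hat = s / (1.0 - beta2 ** t)
    w = w - learning_rate * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w0 = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
w_rms, s_rms = rmsprop_step(w0, grad, np.zeros(2))
w_adam, v_adam, s_adam = adam_step(w0, grad, np.zeros(2), np.zeros(2), t=1)
```

On the very first Adam step the corrected averages equal the gradient and its square, so every weight moves by roughly one learning rate in the direction opposite to its gradient, regardless of how large that gradient is.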
Lastly, we talked about ways to tune the learning rate according to the number of iterations to avoid overshooting during gradient descent. A method called learning rate decay slowly reduces the learning rate alpha over the course of training, so that we take large steps early on and smaller steps as we approach the minimum. Kahya reminded us that, in addition to using formulas, the learning rate can be tuned manually by checking the size of the gradient.
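One common decay formula, given here as an assumption since the report does not record which formulas were shown, divides the initial learning rate by (1 + decay_rate × epoch number). The function name and the values below are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: the learning rate shrinks as the epoch number grows."""
    return alpha0 / (1.0 + decay_rate * epoch)

# Learning rates for the first four epochs, starting from alpha0 = 0.2.
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
```

With these values the learning rate starts at 0.2 and is halved by the second epoch, then keeps shrinking more gently.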
Next week, we will continue our discussions and talk about hyperparameter tuning and batch normalization.
Guide of the Week: BATUHAN KAHYA
Batuhan earned a master's degree in Software Engineering from Bogazici University after studying Industrial Engineering.
He currently works as a Data Scientist / Analytical Consultant at Invent Analytics.