Study Group II

Report of Week 7

Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning series for AI enthusiasts and computer engineers. We work through the materials each week and get together on Saturdays to discuss them.

On January 12, we gathered for the seventh week of Study Group II and discussed “Hyperparameter Tuning, Batch Normalization and Programming Frameworks”, the final session of the second course.

This week’s guide, Uras Mutlu, started the discussion by defining what hyperparameters are and how changing their settings can affect the complexity of our neural networks. As we stated previously, hyperparameters relate to two aspects of a neural network: its structure and the way it is trained.

Let us begin by remembering the types of hyperparameters we have in our models:

  • The Number of Hidden Units and Layers

  • Learning Rate

  • Momentum

  • Batch Size

  • The Number of Epochs

Since setting all these hyperparameters by trial and error is inefficient, a systematic search helps us determine their values. While tuning, it is important to remember that some hyperparameters have more impact on how the model trains than others, and to prioritize the tuning process accordingly. When it comes to neural networks, the learning rate, momentum and batch size are the most important hyperparameters to determine. To tune them systematically, we can use a method called Grid Search, where we evaluate one model for every combination of candidate hyperparameter values and compare their accuracy to see which combination fits our network best. Grid Search can be very time-consuming because the number of possible combinations grows quickly. In these cases, it is more efficient to use Random Search, in which we evaluate only a limited number of randomly sampled points from the search space instead of every combination.
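The difference between the two search strategies can be sketched in a few lines of Python. This is only an illustration: `validation_score` is a hypothetical stand-in for training a model and measuring its accuracy, and the candidate values and the budget of five samples are assumptions chosen for the example.

```python
import itertools
import random


def validation_score(learning_rate, batch_size):
    # Hypothetical stand-in for "train a model and measure its accuracy".
    # A toy surface that peaks at learning_rate=0.01 and batch_size=64.
    return 1.0 - 10 * abs(learning_rate - 0.01) - abs(batch_size - 64) / 1000


learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

# Grid Search: evaluate every combination (3 x 3 = 9 models).
grid = list(itertools.product(learning_rates, batch_sizes))
best_grid = max(grid, key=lambda combo: validation_score(*combo))

# Random Search: evaluate only a fixed budget of randomly sampled points.
rng = random.Random(0)  # seeded so the run is repeatable
budget = 5
samples = [
    (rng.uniform(0.001, 0.1), rng.choice(batch_sizes)) for _ in range(budget)
]
best_random = max(samples, key=lambda combo: validation_score(*combo))

print("grid search evaluated", len(grid), "models, best:", best_grid)
print("random search evaluated", len(samples), "models, best:", best_random)
```

With more hyperparameters the grid grows multiplicatively, while the Random Search budget stays whatever we choose it to be.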

As Mutlu emphasized, even though we sample the hyperparameters at random, the most important thing is to sample them from a useful range of values. Since the ranges of different hyperparameters vary greatly, we also need to pick an appropriate scale for each, which makes the search productive in a shorter amount of time. For example, the learning rate might lie between 0.0001 and 0.001, and the number of layers may vary between 1 and 10. The number of layers can be sampled uniformly, but for the learning rate we can take advantage of exponents: instead of selecting random points directly between 0.0001 and 0.001, we draw a random exponent r between -4 and -3 and use 10 to the power of r as the learning rate. Rather than relying on a single draw, we repeat the process until we find the “good hyperparameters”.
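A minimal sketch of sampling on these two scales, using only the Python standard library. The function names are hypothetical, and the wider range in the last line is an assumption added to show where the logarithmic scale matters most.

```python
import math
import random


def sample_learning_rate(rng, low=1e-4, high=1e-3):
    """Draw a learning rate uniformly on a log10 scale between low and high."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


def sample_num_layers(rng, low=1, high=10):
    """Draw the number of layers uniformly on a linear scale (inclusive)."""
    return rng.randint(low, high)


rng = random.Random(42)  # seeded so the run is repeatable
for _ in range(3):
    print(sample_learning_rate(rng), sample_num_layers(rng))

# Over a wider range, e.g. 0.0001 to 1, log-scale sampling spends roughly
# equal effort on each decade instead of mostly on values above 0.1.
wide = [sample_learning_rate(rng, 1e-4, 1.0) for _ in range(1000)]
print(sum(1 for lr in wide if lr < 1e-3), "of 1000 samples fall below 0.001")
```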

The tuning process itself is generally examined under two categories and the strategy to follow is mostly determined by the amount of computational power that we have.

  • The Panda approach is named after the way pandas raise their cubs, which is basically babysitting one cub until it is a fully functioning adult; it can be used if we lack computational power. In this approach, we stick with one model, monitor its performance over time, and keep adjusting the hyperparameters until we reach the desired outcome.

  • The Caviar approach is named after the way fish reproduce, laying a great number of eggs at once: we introduce a great number of models, each with a different set of hyperparameters, and train them at the same time to see which one performs best.

Mutlu then introduced another method called Batch Normalization, which makes the search for good hyperparameters easier and speeds up the training of the model overall. The method can be applied either before or after the activation function. In each layer, we normalize the values over the mini-batch by subtracting their mean and dividing by their standard deviation, which gives them zero mean and unit variance. We then multiply the normalized values by gamma and add beta. It is important to note that gamma and beta are learnable parameters, trained together with the rest of the neural network's parameters. Since the standard deviation is the square root of the variance, a variance of zero would mean dividing by zero. Therefore, a small constant is added to the variance before taking the square root to prevent a zero-division error.
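The normalization step can be sketched as follows for a single feature across one mini-batch. This is a simplified illustration in plain Python, not a framework implementation: the function name, the default values of `gamma`, `beta` and `eps`, and the sample values are assumptions, and in a real network gamma and beta would be updated by the optimizer.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the denominator non-zero even when the variance is zero.
    normalized = [(v - mean) / math.sqrt(variance + eps) for v in values]
    return [gamma * n + beta for n in normalized]


z = [1.0, 2.0, 3.0, 4.0]  # pre-activation values of one unit for 4 examples
print(batch_norm(z))                        # roughly zero mean, unit variance
print(batch_norm(z, gamma=2.0, beta=0.5))   # rescaled and shifted
print(batch_norm([2.0, 2.0, 2.0]))          # zero variance, no division error
```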

When finding the weights with backpropagation, the extra parameters that Batch Normalization adds do not change the complexity drastically. However, the method has many benefits that let our model train much more quickly and efficiently.

Batch Normalization reduces the covariate shift problem in our model and stabilizes the flow of values through the network. It also lets each layer learn somewhat more independently of the others and consequently speeds up the learning process. Even though it is not its main purpose, Batch Normalization has a regularization effect, in that it helps reduce overfitting by adding noise to the activations of every hidden layer. However, when it comes to regularization, Batch Normalization is not as effective as dedicated regularization methods such as Dropout or L2. Thus, depending solely on Batch Normalization as a regularization method would be unwise.

Mutlu went on to talk about Softmax Regression, which is used as a classifier when there are more than two possible labels.

The Softmax function is the activation function we use in the output layer instead of the sigmoid function; it enables us to work with more than two classes by assigning a probability value to each of them.

We take the exponential of each output (Z) and divide it by the sum of all the exponentials, so that the results are positive and add up to 1. Exponentiation makes every value positive, including the negative ones, and it amplifies the higher values while drastically shrinking the lower ones. Moreover, because the exponential function is based on ‘e’, its derivative is easy to take, which lets us find the values needed for backpropagation. Finally, the loss function that we select is categorical cross-entropy.
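As a small illustration, Softmax and the categorical cross-entropy loss for a single example can be written in plain Python as below. The function names and the sample scores are assumptions made for this sketch; subtracting the maximum before exponentiating is a common numerical-stability trick that was not part of the discussion and does not change the result.

```python
import math


def softmax(z):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(z)
    # Shifting by the maximum avoids overflow and leaves the result unchanged.
    exps = [math.exp(v - largest) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Loss for one example: negative log of the true class probability."""
    return -math.log(probs[true_index])


z = [2.0, -1.0, 0.5]  # raw scores for three classes, one of them negative
probs = softmax(z)
print(probs)                                # all positive, largest for class 0
print(sum(probs))                           # approximately 1
print(categorical_cross_entropy(probs, 0))  # small loss: class 0 is likely
print(categorical_cross_entropy(probs, 1))  # large loss: class 1 is unlikely
```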

Mutlu also mentioned a few programming frameworks we can use while working with deep neural networks, such as Keras, TensorFlow and PyTorch, and reminded us that these frameworks spare us from writing the backward pass ourselves.

Next week, we will move on to a new course to learn and discuss how to build a machine learning project!

Guide of the Week: Uras Mutlu

Uras received his B.Sc. degree in Computer Engineering from Istanbul Technical University in 2016. He is currently studying as a Master’s student and working as a teaching assistant in the Computer Engineering department at Boğaziçi University. His research interests include generative adversarial networks, computer vision, natural language processing, and deep learning in general.

Subscribe to our newsletter here and stay tuned for more hacker-driven activities.

inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.