Study Group II

Report of Week 14

Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning series for AI enthusiasts and computer engineers. Each week we study the assigned materials and get together on Saturdays to discuss them.

On March 9, we came to inzva for the 14th week of the study group to start a new course called Sequence Models. In the first session of this new course, where Sahin Olut led the discussion, we focused on Recurrent Neural Networks, which are known to work well with data that varies over time and are commonly used for speech recognition and music generation.

Olut started the session by briefly explaining why recurrent neural networks function better than regular ones in cases such as speech recognition, machine translation, and video activity recognition. The reason lies in the fact that recurrent neural networks can store information previously fed to the network and use it for upcoming predictions: each word we feed in becomes part of the context for the words that follow.

Olut then proceeded to introduce the notation. We first split our data (in this case, our text) into smaller units, a process called tokenizing. While doing so, we may find several words coming from the same word base, such as "computing" and "computer". A related process is lemmatization, in which we resolve words to their core form, as in resolving "is" and "are" to "be". An important point about the notation is that input and output lengths may differ from example to example.
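The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: real tokenizers and lemmatizers (e.g. in NLTK or spaCy) handle punctuation, casing, and part-of-speech context; the tiny lemma table here is purely hypothetical.

```python
# Minimal sketch of tokenizing and lemmatizing a sentence.
# A real pipeline would use a library such as NLTK or spaCy;
# here we just lowercase and split on whitespace for illustration.

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return sentence.lower().split()

# A toy lemma table; real lemmatization uses a dictionary and POS tags.
LEMMAS = {"is": "be", "are": "be", "computing": "compute"}

def lemmatize(tokens):
    """Map each token to its base form when one is known."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("Computing is fun")
print(tokens)             # ['computing', 'is', 'fun']
print(lemmatize(tokens))  # ['compute', 'be', 'fun']
```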

• X: the input sentence

• X<t>: the t-th word in sentence X

• Tx: the length of the input X

• Ty: the length of the output Y

• Vocabulary: the list of all words that may appear in a sentence

As with any machine learning project, we first need to pre-process our data so that it can be "understood" by the program. Here, the words must be encoded, specifically one-hot encoded, so that each word is represented as a binary vector. Simply put, we choose the position corresponding to a word and set it to 1, and all remaining positions are set to 0. Assume there are 10,000 words in our dictionary: for each word, one entry is set to 1 and the other 9,999 entries are all 0s.

Another problem we can encounter is that a word may not be in the dictionary at all. In such cases, we can use an UNKNOWN token to represent words we have not seen before.
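A minimal sketch of one-hot encoding with an unknown-word token, assuming a tiny illustrative vocabulary (a real one would have ~10,000 entries, as discussed above):

```python
import numpy as np

# Toy vocabulary with a reserved token for unseen words.
vocab = ["a", "cat", "sat", "on", "the", "mat", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's index;
    words outside the vocabulary map to the <UNK> index."""
    vec = np.zeros(len(vocab))
    idx = word_to_index.get(word, word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

print(one_hot("cat"))    # 1 at index 1, 0 everywhere else
print(one_hot("zebra"))  # unseen word: 1 at the <UNK> index
```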

How can we write down a recurrent neural network? The network has several weight matrices: Waa, Wax, and Wya. Waa maps the previous activation to the new activation, Wax maps the input to the activation, and Wya maps the activation to the output:

a<t> = g(Waa · a<t-1> + Wax · x<t> + ba)

y<t> = g(Wya · a<t> + by)

We can write the following instead of the aforementioned, stacking Waa and Wax into a single matrix Wa = [Waa | Wax]:

a<t> = g(Wa · [a<t-1>, x<t>] + ba)
We can use sigmoid, ReLU, or tanh as the activation function g.
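The forward step above can be sketched in a few lines of numpy. The dimensions and random weights here are purely illustrative (hidden size 4, input size 3, output size 2), with tanh for the hidden activation and sigmoid for the output:

```python
import numpy as np

# One RNN forward step, following the equations above:
#   a<t> = tanh(Waa @ a<t-1> + Wax @ x<t> + ba)
#   y<t> = sigmoid(Wya @ a<t> + by)
rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 3, 2          # illustrative sizes
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
Wya = rng.standard_normal((n_y, n_a))
ba = np.zeros(n_a)
by = np.zeros(n_y)

def rnn_step(a_prev, x_t):
    """Compute one time step: new activation and output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_t = 1.0 / (1.0 + np.exp(-(Wya @ a_t + by)))
    return a_t, y_t

a, y = rnn_step(np.zeros(n_a), rng.standard_normal(n_x))
print(a.shape, y.shape)  # (4,) (2,)
```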


Olut went on to talk about the loss function and its derivatives. We can use negative log-likelihood, also known as cross-entropy, as the loss function.
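For one-hot targets, the cross-entropy loss is the negative log-probability the model assigns to the correct word, summed over the sequence. A minimal sketch, with hypothetical softmax outputs over a 3-word vocabulary:

```python
import numpy as np

def cross_entropy(predictions, targets):
    """Negative log-likelihood of the correct words.
    predictions: (T, V) array of per-step softmax probabilities.
    targets: length-T sequence of correct word indices."""
    return -sum(np.log(predictions[t, targets[t]]) for t in range(len(targets)))

# Toy probabilities for a 2-step sequence over a 3-word vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = cross_entropy(probs, [0, 1])
print(loss)  # -(log 0.7 + log 0.8)
```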

As we have stated in our previous sessions, one of the biggest challenges we encounter while training a neural network is exploding or vanishing gradients. We have already learned how to overcome the exploding gradient problem in our previous courses. For recurrent neural networks, which do not handle long-term dependencies well, vanishing gradients can be a real issue.

To deal with this problem, Olut mentioned a way to modify our recurrent neural network using the Gated Recurrent Unit (GRU), which holds a memory cell C that helps the network figure out which information should be passed along for future predictions. This mitigates the vanishing gradient issue, since the network keeps the relevant information that can be of value to the final output, as stated in the following equations:

c̃<t> = tanh(Wc · [c<t-1>, x<t>] + bc)

Γu = sigmoid(Wu · [c<t-1>, x<t>] + bu)

c<t> = Γu ∗ c̃<t> + (1 − Γu) ∗ c<t-1>
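The GRU update can be sketched directly from those equations. This is the simplified GRU (no relevance gate), with illustrative random weights and sizes:

```python
import numpy as np

# One simplified GRU step: the update gate gamma_u decides how much
# of the candidate memory c_tilde replaces the old memory cell.
rng = np.random.default_rng(1)
n_c, n_x = 4, 3                   # illustrative sizes
Wc = rng.standard_normal((n_c, n_c + n_x))
Wu = rng.standard_normal((n_c, n_c + n_x))
bc = np.zeros(n_c)
bu = np.zeros(n_c)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t):
    """One GRU time step (simplified, no relevance gate)."""
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)   # candidate memory
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate, in (0, 1)
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

c = gru_step(np.zeros(n_c), rng.standard_normal(n_x))
print(c.shape)  # (4,)
```

When the gate stays near 0, the old memory passes through almost unchanged, which is what lets gradients survive across many time steps.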
After introducing these concepts, Olut talked about a kind of recurrent neural network called the Bidirectional Recurrent Neural Network, which is known to be a very effective model in circumstances where the full sequence of data is essential.

Bidirectional Recurrent Neural Networks, or BRNNs for short, let us gather information by connecting two hidden layers that run over the sequence in opposite directions at the same time. This way, the output layer receives information from both the past and the future.
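The bidirectional idea can be sketched by running a simple tanh recurrence forward and backward over the same sequence and concatenating the two activations at each step. Sizes and weights are illustrative, not from the session:

```python
import numpy as np

# Run a tanh RNN twice over the sequence (left-to-right and
# right-to-left), then concatenate the activations per step so each
# position sees both past and future context.
rng = np.random.default_rng(2)
n_a, n_x, T = 4, 3, 5                         # illustrative sizes
W_f = rng.standard_normal((n_a, n_a + n_x))   # forward-pass weights
W_b = rng.standard_normal((n_a, n_a + n_x))   # backward-pass weights

def run(W, xs):
    """Run a tanh RNN over xs, returning the activation at each step."""
    a = np.zeros(n_a)
    outs = []
    for x in xs:
        a = np.tanh(W @ np.concatenate([a, x]))
        outs.append(a)
    return outs

xs = [rng.standard_normal(n_x) for _ in range(T)]
fwd = run(W_f, xs)               # left-to-right pass
bwd = run(W_b, xs[::-1])[::-1]   # right-to-left pass, realigned
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(states), states[0].shape)  # 5 (8,)
```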

Some models may require information from the following words in the sentence to function efficiently and estimate what comes next, which makes it important for our model to know the rest of the sentence. The sentence "The place, which we walked before, got destroyed by the government" exemplifies such a case.

Next session, we will continue talking about the areas where we can implement recurrent neural networks and learn how to train our RNNs with embedding layers for purposes such as machine translation.

Guide of the Week: Sahin Olut

Sahin is a senior-year undergraduate student at Istanbul Technical University and a researcher at ITU Vision Lab, where he has conducted various studies, mostly on medical imaging. Work produced in ITU Vision Lab has appeared in well-respected venues such as MICCAI.

He was also a teaching assistant for the first-time-offered graduate-level Deep Learning course led by his advisor. His current research focuses on visual geometry and geometric data processing topics such as learning on manifolds. Apart from his work in academia, he works as a Machine Learning Engineer at Decipher Analytics.


Subscribe to our newsletter here and stay tuned for more hacker-driven activities.

inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.