Study Group II

Report of Week 16

Deep Learning Study Group II is a 16-week-long study group in which we cover an advanced deep learning study series for AI enthusiasts and computer engineers. We study the materials assigned for each week and get together on Saturdays to discuss them.

On March 23, we came to inzva for the 16th week of the study group to conclude our 16-week-long marathon by examining various sequence to sequence models under the guidance of Fatih Mehmet Güler, who had also led the discussion in our very first week.

Güler started the session by commenting on the fact that the sequence to sequence model is the foundation of machine translation. The input to machine translation is sequential: a sentence is sequential data that we encode. As previously stated in the lectures, we can label the input so that each word gets its own tag. Alternatively, we can assign just one label to an entire sentence, a process called sequence classification. In that case, the input does not have to be text; any kind of sequential data will do. Güler reminded us that recurrent neural networks work well with time series data, producing new time series from already existing ones.

This setup is called sequence to sequence, or Seq2Seq for short. Another task that can be done with recurrent neural networks is captioning images. Captioning architectures use a convolutional neural network as the encoder and a recurrent neural network as the decoder. So, instead of ending the network with a softmax classification layer, we can attach a recurrent neural network to generate the desired caption.

As we discussed previously, Seq2Seq models can be used for machine translation. We model the probability of the output sentence Y given the input sentence X. For example, we have a sentence in French and we want to predict a sentence in English. We can translate by selecting the word with the greatest softmax value at each step, a technique called greedy search. However, it is important to note that when it comes to machine translation, greedy search proves to be not as effective as we would like it to be. The reason behind its ineffectiveness lies in the fact that by always selecting the greatest softmax value, more commonly used words risk taking the place of the most appropriate ones. On the other hand, if we try out every possible word sequence, we will have to do far too many calculations.
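The step-by-step selection described above can be sketched in a few lines of Python. This is a minimal illustration, not a real translation model: the four-word vocabulary and the probability tables are entirely made up, standing in for the softmax output of a trained decoder.

```python
VOCAB = ["<eos>", "the", "cat", "sat"]

def step_probs(prefix):
    # Hypothetical softmax outputs, keyed by how many words we have so far.
    # A real decoder would compute these from the encoded input sentence.
    table = [
        [0.05, 0.60, 0.25, 0.10],  # first word: "the" is most likely
        [0.10, 0.05, 0.55, 0.30],  # second word: "cat"
        [0.15, 0.05, 0.10, 0.70],  # third word: "sat"
        [0.90, 0.02, 0.03, 0.05],  # then the end-of-sentence tag
    ]
    return table[min(len(prefix), 3)]

def greedy_decode(max_len=10):
    """Pick the single highest-probability word at every step."""
    prefix = []
    for _ in range(max_len):
        probs = step_probs(prefix)
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        if VOCAB[best] == "<eos>":
            break
        prefix.append(best)
    return [VOCAB[i] for i in prefix]

print(greedy_decode())  # -> ['the', 'cat', 'sat']
```

Greedy decoding commits to one word per step and never revisits a choice, which is exactly why a frequent but suboptimal word can crowd out the right one.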

In such cases where greedy search fails to meet our expectations, Güler mentioned another method named beam search, which gives us results much closer to the most probable translation.

While decoding, we keep track of the top B candidate sentences. At the first step, we select the B most probable words from the softmax output. At each following step, we extend every candidate with each word in the vocabulary and again keep only the B best partial sentences. A sentence finishes when the end-of-sentence tag is produced.
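The procedure above can be sketched as follows. Again, the vocabulary and the conditional probabilities are invented for illustration; they are chosen so that the best full sentence is not the one greedy search would pick.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]

def step_probs(prefix):
    # Toy conditional distributions, standing in for a decoder's softmax.
    if not prefix:
        return [0.05, 0.40, 0.35, 0.20]
    if prefix[-1] == 1:                  # after "a"
        return [0.30, 0.05, 0.05, 0.60]
    if prefix[-1] == 2:                  # after "b"
        return [0.10, 0.05, 0.05, 0.80]
    return [0.90, 0.03, 0.03, 0.04]      # after "c"

def beam_search(beam_width=2, max_len=5):
    """Keep the top-B partial sentences (by summed log probability) per step."""
    beams = [([], 0.0)]                  # (token indices, summed log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for i, p in enumerate(step_probs(prefix)):
                cand = (prefix + [i], score + math.log(p))
                if VOCAB[i] == "<eos>":
                    finished.append(cand)   # hypothesis is complete
                else:
                    candidates.append(cand)
        # Prune: keep only the B best partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    best = max(finished, key=lambda c: c[1])
    return [VOCAB[i] for i in best[0][:-1]]  # drop the <eos> tag

print(beam_search())  # -> ['b', 'c']
```

On these made-up probabilities, beam search finds "b c" (probability 0.35 x 0.8 x 0.9 = 0.252), whereas greedily taking the top word at each step would yield "a c" (0.4 x 0.6 x 0.9 = 0.216): keeping B hypotheses lets a slightly worse first word lead to a better whole sentence.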

After talking about the basics of beam search, Güler moved on to introducing a way to improve the algorithm, named length normalization. Beam search maximizes a product of probabilities, but that product quickly becomes a very small number that cannot be stored accurately, so we take the log of each probability and sum the logs instead. Since every added word makes the log probability more negative, the algorithm tends to suggest shorter sentences; dividing the score by the sentence length counteracts this bias and gives longer sentences a fair chance.
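The scoring function can be written in a few lines. The exponent `alpha` below is a commonly used softening heuristic rather than something fixed by the lecture; with `alpha=1.0` it is a plain average log probability per word.

```python
import math

def length_normalized_score(token_probs, alpha=0.7):
    """Sum log probabilities (avoids underflow from multiplying many tiny
    numbers), then divide by length**alpha so longer sentences are not
    unfairly penalized."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)

# Each word here has probability 0.5; the raw log sums are -1.39 (short)
# vs -3.47 (long), while the normalized scores are much closer together:
# normalization shrinks the penalty that extra words impose.
short = [0.5, 0.5]
long_ = [0.5, 0.5, 0.5, 0.5, 0.5]
print(length_normalized_score(short), length_normalized_score(long_))
```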

Moreover, we can compare the n-grams of the generated sentences against reference translations to select the best possible outcome, and perform error analysis to see whether our algorithm is giving the desired results, by means of a metric called the BLEU score, which tells us the level of accuracy compared to human translation.
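The heart of the BLEU score is modified n-gram precision, which a short sketch can make concrete (full BLEU additionally combines several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped at its count in the reference, so repeating
    a common word cannot inflate the score."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()
candidate = "the the the the".split()   # degenerate machine output
print(modified_ngram_precision(candidate, reference))  # -> 0.5, not 1.0
```

Without clipping, the degenerate output above would score a perfect 1.0, since every "the" appears in the reference; clipping caps it at the reference's two occurrences, giving 2/4.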

Güler then proceeded to talk about the attention model, which can be used for almost all natural language processing tasks. In this model, we do not translate the whole paragraph at once. Instead, the input is processed in smaller pieces, each with its own context vector, and gets translated part by part. At each decoding step of the LSTM, we also search for the most relevant input words. So, rather than decoding the whole paragraph in one shot, we calculate a weighted mean of the encoder outputs, with the weights learned by a smaller neural network inside our main neural network. Güler further stated that in addition to natural language processing tasks, this model can also be implemented for speech recognition tasks through various algorithms.
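The weighted-mean step can be sketched as follows. For simplicity this uses a dot product to score each encoder state against the decoder's current state; in practice the scores are usually produced by the small learned network mentioned above, and the vectors here are toy values.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(query, encoder_states):
    """Score each encoder state against the decoder query, turn the
    scores into attention weights with softmax, and return the
    weighted mean (the context vector)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(len(encoder_states[0]))
    ]
    return weights, context

# Three toy encoder states; the query points in the direction of the
# second one, so it should receive the largest attention weight.
states = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
weights, context = attention_context([0.0, 1.0], states)
print(weights)
```

Because the weights sum to one, the context vector is a proper weighted mean of the encoder states, letting each decoding step focus on a different part of the input.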

The second edition of our Study Group is over but we will keep working on improving our academic prowess in deep learning with the help of our community and our guides. For further information, do not forget to check out the rest of our website and subscribe to our newsletter to know more about our upcoming events!


Fatih Mehmet Güler is a passionate software craftsman and deep learning researcher who has been developing innovative software for over a decade.

His domain of expertise consists of Natural Language Processing and Deep Learning.

To get further information about Fatih’s work, you can visit his company website:


inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.