Study Group II

Report of Week 15

Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning series for AI enthusiasts and computer engineers. We follow the materials each week and get together on Saturdays to discuss them.

On March 16, we came to inzva for the 15th week of the study group to discuss how to train recurrent neural networks using embedding layers and word vector representations for natural language processing.

Our lead Emrah Budur started the session by introducing the concept of word embedding, which is, to put it simply, turning text into vectors of numbers, since neural networks can only operate on vectors of continuous values.

Budur first reminded us of a method covered in the previous sessions called one-hot encoding and explained why we opt for word embeddings instead of one-hot encoding when representing words in natural language processing tasks.

As you may recall from last week’s session, one-hot encoding represents each word as a vector with a single 1 in the position assigned to that word and 0s everywhere else. The method therefore cannot place similar or correlated words closer to each other in the vector space: every pair of distinct words is equally far apart. In other words, books written by the same author, such as “Silmarillion” and “The Two Towers”, will not be placed closer to each other in the vector space than books by unrelated authors, for example “Harry Potter and the Goblet of Fire” and “The Hunger”.
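The limitation above is easy to see numerically. In this small sketch (the four-word vocabulary and shortened book titles are purely illustrative), the dot product, and hence the cosine similarity, between any two distinct one-hot vectors is always 0, no matter how related the words are:

```python
import numpy as np

# Hypothetical 4-word vocabulary; titles shortened for illustration.
vocab = ["silmarillion", "two_towers", "goblet_of_fire", "hunger"]

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Books by the same author score exactly as "similar" as unrelated ones:
# the dot product of any two distinct one-hot vectors is 0.
print(one_hot("silmarillion") @ one_hot("two_towers"))      # 0.0
print(one_hot("silmarillion") @ one_hot("goblet_of_fire"))  # 0.0
```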

Instead, we can conduct transfer learning with word embeddings as follows:

1. Learn word embeddings from a large text corpus (1B–100B words).

2. Transfer the embeddings to the new task with a smaller training set.

The latter approach tends to converge better since the model starts with weights that have already been learned. However, we need to mind the context while doing transfer learning. For example, if we have the word "card" in the banking domain, we need to make sure our neural network learns that "card" means "credit card" in that specific context, whereas in the business domain, "card" will mean "identity card".
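The two-step recipe can be sketched as follows. This is a minimal illustration, not a full training loop: the "pretrained" matrix here is random and merely stands in for step 1, and the averaging classifier is a hypothetical downstream task for step 2.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 100, 16

# Step 1 stand-in: embeddings that would be learned on a large corpus.
pretrained_E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))

# Step 2: reuse the pretrained matrix in a new, smaller task. We freeze
# the embeddings and train only a task-specific classifier on top.
E = pretrained_E.copy()   # frozen: no gradient updates applied to E
w = np.zeros(emb_dim)     # classifier weights, trained from scratch

def predict(word_ids):
    """Score a short text by averaging its (frozen) word embeddings."""
    features = E[word_ids].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(w @ features)))

print(predict([3, 14, 15]))  # untrained classifier outputs 0.5
```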

Budur took a moment to briefly mention the similarities and differences between the concept of encoding in computer vision and embedding, since the two seem related to each other.

Word embedding represents every word in the dictionary with a small, dense feature vector, which helps us perform semantic parsing and find the similarity and relations between words. In the word embedding method, we use a matrix with feature names in the rows and words in the columns, and assign each word a value between -1 and 1 for each feature, representing how strongly the word expresses that feature. For example, we assign -1 to the word “Man” and 1 to the word “Woman” for the “Gender” feature, and after training we expect a high value for the word “Queen” as the answer to analogies such as "What if we have a king that is not a man but a woman?". In this case, the vector difference between “Man” and “Woman” will be similar to the difference between “King” and “Queen”. To measure the similarity, we can use cosine similarity as a metric.
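The analogy can be answered with plain vector arithmetic. The tiny two-feature embeddings below (gender and royalty) are hand-picked illustrative values, not learned ones:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-feature embeddings (gender, royalty); values are illustrative.
emb = {
    "man":   np.array([-1.00, 0.01]),
    "woman": np.array([ 1.00, 0.02]),
    "king":  np.array([-0.95, 0.93]),
    "queen": np.array([ 0.97, 0.95]),
}

# Answer "man is to woman as king is to ?" by vector arithmetic:
target = emb["king"] - emb["man"] + emb["woman"]
best = max(["man", "woman", "queen"], key=lambda w: cosine(emb[w], target))
print(best)  # queen
```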


We can use the embedding matrix together with one-hot encoding to look up the scores of one specific word. The embedding matrix has feature names placed as rows and dictionary words placed as columns. If we multiply the embedding matrix by the one-hot vector of a particular word, we get the embedding scores of that word.
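This lookup is a simple matrix-vector product. In the sketch below the embedding matrix and the four-word vocabulary are hypothetical, with the same two features (gender, royalty) used above:

```python
import numpy as np

# Hypothetical embedding matrix E: rows are features, columns are words.
# Vocabulary order: ["man", "woman", "king", "queen"]
E = np.array([
    [-1.00, 1.00, -0.95, 0.97],   # gender feature
    [ 0.01, 0.02,  0.93, 0.95],   # royalty feature
])

# One-hot vector for "king" (index 2 in the vocabulary).
o_king = np.array([0.0, 0.0, 1.0, 0.0])

# Multiplying E by the one-hot vector selects the column for "king".
e_king = E @ o_king
print(e_king)  # [-0.95  0.93]
```

In practice, libraries implement this as a direct column (or row) lookup, which is equivalent to the multiplication but avoids the wasted work of multiplying by all the zeros.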

Budur then proceeded to explain how our neural network learns the embedding matrix. We feed the embedding scores of each word in the sentence into a softmax layer, which outputs the word that might come after the current one. Moreover, the word we predict may depend on both the previous and the following words; for example, we can use four previous words and four following words to predict the current word.
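A forward pass of such a model can be sketched in a few lines. All sizes and the randomly initialized parameters below are illustrative assumptions; in real training, the embedding matrix and the softmax weights are learned jointly by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, context = 10, 4, 8  # 4 previous + 4 following words

# Randomly initialized parameters (learned jointly in real training).
E = rng.normal(size=(emb_dim, vocab_size))            # embedding matrix
W = rng.normal(size=(vocab_size, emb_dim * context))  # softmax weights
b = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(context_ids):
    """Predict the middle word from the surrounding context word ids."""
    x = np.concatenate([E[:, i] for i in context_ids])  # stacked embeddings
    return softmax(W @ x + b)                           # distribution over vocab

probs = predict([1, 2, 3, 4, 6, 7, 8, 9])  # predict the word at position 5
print(probs.shape)  # (10,)
```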

Though word embedding is very efficient at capturing the correlation between words, a downside of the method is that it uses a huge, high-dimensional matrix, which is costly to store and impossible to inspect directly. To visualize the learned embeddings, we can use t-Distributed Stochastic Neighbor Embedding, in short t-SNE, a non-linear algorithm for dimensionality reduction which helps us represent multi-dimensional data in two or three dimensions that can be observed by humans. After figuring out word embedding thoroughly, we moved on to the applications of the process by means of various algorithms.
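A minimal visualization sketch, assuming scikit-learn is available; the random 10×50 matrix merely stands in for learned word embeddings, and in a real plot each 2-D point would be labeled with its word:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 50))  # 10 words, 50-dimensional embeddings

# Project to 2 dimensions; perplexity must be smaller than the sample count.
coords = TSNE(n_components=2, perplexity=3,
              random_state=0).fit_transform(embeddings)
print(coords.shape)  # (10, 2) — one 2-D point per word, ready to plot
```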

In light of all this, Budur introduced an algorithm named Word2Vec, which helps us obtain vectors that are distributed representations of word features based on the words in the context. Instead of using all the words in the window we have selected, we can randomly sample context words from that window and train the model to predict them from the center word. It is important to note that, while implementing this algorithm, we can reduce the computational complexity by applying the negative sampling technique.
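The skip-gram-with-negative-sampling idea can be sketched as a single update step. This is a bare-bones illustration with assumed sizes, a uniform (rather than frequency-based) negative sampler, and a repeated toy word pair; a real implementation trains over an entire corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 20, 8

# Two embedding tables: one for center words, one for context words.
W_center = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
W_context = rng.normal(scale=0.1, size=(vocab_size, emb_dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.05):
    """One skip-gram step with k negative samples; returns the loss."""
    negatives = rng.integers(0, vocab_size, size=k)  # random "noise" words
    v = W_center[center]
    loss = 0.0
    # One positive (real) pair, k negative (fake) pairs.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_context[word]
        p = sigmoid(v @ u)        # predicted probability of a "real" pair
        g = p - label             # gradient of the logistic loss
        W_context[word] -= lr * g * v
        W_center[center] -= lr * g * u
        loss += -np.log(p) if label == 1.0 else -np.log(1.0 - p)
    return loss

# Repeatedly train on one toy pair: the loss shrinks as the vectors align.
losses = [sgns_step(center=3, context=7) for _ in range(50)]
```

The payoff of negative sampling is that each step touches only k + 1 output vectors instead of the whole vocabulary, avoiding a full softmax over potentially millions of words.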

Lastly, Budur explained another popular algorithm for natural language processing called GloVe, which captures how often words co-occur with one another by using the ratios of co-occurrence probabilities. Through this algorithm, we can exploit global count statistics to learn the dimensions of meaning, and by applying an optimization algorithm afterwards, we can get consistent results.
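The global statistics GloVe starts from are just windowed co-occurrence counts. The toy corpus below is an assumption for illustration; real GloVe counts over billions of tokens and then fits vectors to the counts:

```python
import numpy as np
from collections import defaultdict

# Toy corpus; real GloVe uses global counts from billions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences X[i, j] within a symmetric window of 2 words,
# weighting nearer neighbours more heavily (1/distance), as GloVe does.
X = defaultdict(float)
window = 2
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            X[(idx[w], idx[corpus[j]])] += 1.0 / abs(i - j)

# GloVe then fits word vectors, context vectors, and biases so that
# w_i . w~_j + b_i + b~_j ~= log X_ij, with a weight that damps very
# rare and very frequent pairs.
print(X[(idx["the"], idx["cat"])])  # 2.5
```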

Guide of the Week: EMRAH BUDUR

Emrah is a PhD candidate in Computer Engineering at Boğaziçi University. His field of interest is Conversational AI. Apart from his academic studies, he enjoys applying his research skills to industrial projects at Garanti Technology as a full-time software engineer. He is also a volunteer member of the Turkish Sentence Encoder team in the inzva AI community and tackles the inequality of opportunity faced by low-resource languages, including Turkish.


Subscribe to our newsletter here and stay tuned for more hacker-driven activities.

inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.