Study Group II

Report of Week 9

Deep Learning Study Group II is a 16-week study group in which we cover an advanced deep learning study series for AI enthusiasts and computer engineers. We follow the materials each week and get together on Saturdays to discuss them.

On January 26, we came to inzva for the ninth week of the study group to start a new course, in which we kept learning the means to structure a machine learning project further.

The ninth week’s session was led by Macit Giray Gökırmak, a long-term programmer and drone enthusiast. Gökırmak concluded the second course by giving tips on building a well-functioning machine learning system.

Gökırmak started the discussion by introducing a concept called error analysis, which lets us figure out how to improve our model’s results by manually examining the mistakes that our algorithm is making.

Error analysis helps us estimate how much an improvement can pay off before we invest in it, which saves us a lot of time. Let’s say we have a cat classifier that gives us 90% accuracy and 10% error on the dev set, and we notice that our algorithm mislabels some dog images that look like cats as cat images. In such a case, training our cat classifier to do better on dogs may take a really long time and prove to be unproductive. On the other hand, if we apply error analysis beforehand, we can even evaluate multiple ideas at the same time and choose the best one to implement.

Error analysis can be applied by picking 100 mislabeled dev set examples at random and counting how many of them are dog images. If 5 out of 100 mislabeled examples are dogs, training our classifier to do better on dogs will decrease our error from 10% to 9.5% at best, which may be too small a gain. In contrast, if 50 out of 100 mislabeled examples are dogs, we could bring our error down to as low as 5%, which is a reasonable amount to work on. If there are multiple ideas to evaluate, it is easier to create a spreadsheet and use it to decide how to proceed.
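The arithmetic behind these two cases can be written down as a short sketch. The function name `best_case_error` is our own and does not come from the session; it simply assumes that fixing a category removes every mistake in that category and nothing else.

```python
def best_case_error(dev_error, category_fraction):
    """Error that would remain if every mistake in one category were fixed.

    dev_error: overall dev set error, e.g. 0.10 for 10%.
    category_fraction: share of the sampled mistakes in that category.
    """
    return dev_error * (1.0 - category_fraction)


# Figures from the text: 10% dev error, 5 or 50 dog images per 100 mistakes.
few_dogs = best_case_error(0.10, 5 / 100)
many_dogs = best_case_error(0.10, 50 / 100)

print(f"5/100 dogs:  error falls to {few_dogs:.1%} at best")
print(f"50/100 dogs: error falls to {many_dogs:.1%} at best")
```

The result is a ceiling on the improvement, not a prediction: real work on a category rarely removes all of its mistakes.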

Image     Dog    Great Cats   Blurry Images   Instagram Filters   Comments
1         ✓      -            -               ✓                   Pitbull
2         ✓      -            ✓               ✓                   -
3         -      -            -               -                   Rainy day at zoo
4         -      ✓            -               -                   -
% Total   8%     43%          61%             12%                 -

In the case above, it will be smarter and more productive to work on blurry images and great cats to improve our model’s performance. This quick manual counting procedure takes little time, can be repeated often, and helps us prioritize our decisions and spot the most promising directions to work on.
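The counting itself is easy to automate once the examples have been tagged by hand. Below is a minimal sketch; the tag names and the five reviewed examples are made up for illustration and are not the study group's data.

```python
from collections import Counter


def tally(reviewed):
    """Return the share of reviewed mistakes carrying each tag, largest first."""
    counts = Counter(tag for tags in reviewed for tag in tags)
    total = len(reviewed)
    return {tag: n / total for tag, n in counts.most_common()}


# Hypothetical tags assigned while looking at five misclassified images.
reviewed = [
    ["dog", "instagram_filter"],
    ["blurry", "instagram_filter"],
    ["great_cat", "blurry"],
    ["great_cat"],
    ["blurry"],
]

shares = tally(reviewed)
for tag, share in shares.items():
    print(f"{tag:18s}{share:.0%}")
```

An example can carry several tags, so the shares may add up to more than 100%, just as in the spreadsheet.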

Deep learning algorithms are known to be quite robust to random errors in the training set but less robust to systematic errors. Still, if it is feasible, it is perfectly fine to go through the data and fix the mislabeled examples. To decide whether fixing mislabeled data in the dev/test sets is worthwhile, we add a “Mislabeled” column to the error analysis table.

Image     Dog    Great Cats   Blurry Images   Mislabeled   Comments
1         ✓      -            -               ✓            -
2         ✓      -            ✓               ✓            -
4         -      ✓            -               -            -
% Total   8%     43%          61%             6%           -

For example, if the overall dev set error is 10% in the case presented above, then 6% of that error, or 0.6% overall, is due to incorrect labels, and the remaining 9.4% is due to other causes. It would therefore be wiser to focus on the other causes rather than on the incorrect labels.

Keep the following points in mind when correcting mislabeled examples in the dev and test sets:

  • Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.

  • Consider examining the examples your algorithm got right as well as the ones it got wrong. (There is no need to do this if you reached a good accuracy.)

  • Train and dev/test data may now come from a slightly different distribution.

  • It's very important that the dev and test sets come from the same distribution. It can be acceptable, however, for the training set to come from a slightly different distribution.

Ideally, when we are building a deep learning project, we begin by setting up our dev/test sets and metric and then build an initial system quickly. After that, we use bias/variance analysis and error analysis to prioritize the next steps.

Deep learning algorithms are known for their hunger for data, which often leads teams to build training sets that differ from their dev/test sets. When the training set distribution differs from the dev/test set distribution, there are several strategies we can follow.

One option is to shuffle all the data together and randomly extract the training and dev/test sets. This makes all the sets come from the same distribution, but it is not commonly recommended: the real-world distribution that made up the original dev/test sets will appear much less often in the new dev/test sets, which is probably not what we want to aim at.

Another option is to take some of the dev/test set examples and add them to the training set, while the dev/test sets keep targeting the distribution that we care about. The downside is that the training and dev/test distributions now differ, but this option tends to give better performance in the long run.
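A minimal sketch of this second strategy is given below. The names `split_for_target`, `web` and `app`, and all the sizes, are hypothetical and only illustrate the idea: plentiful data from another source goes entirely into training, while the dev and test sets are drawn only from the scarce data we actually care about.

```python
import random


def split_for_target(source_examples, target_examples, n_target_in_train, seed=0):
    """Train = all source data + part of the target data; dev/test = target only."""
    rng = random.Random(seed)
    target = list(target_examples)
    rng.shuffle(target)

    train = list(source_examples) + target[:n_target_in_train]
    held_out = target[n_target_in_train:]
    half = len(held_out) // 2
    dev, test = held_out[:half], held_out[half:]
    return train, dev, test


# Hypothetical sizes: 1,000 plentiful examples and 100 from the target distribution.
web = [("web", i) for i in range(1000)]
app = [("app", i) for i in range(100)]

train, dev, test = split_for_target(web, app, n_target_in_train=50)
print(len(train), len(dev), len(test))
```

Because the dev and test sets contain only target-distribution examples, improvements measured on them reflect the case we want to do well on.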

Let’s suppose you've worked on a cat example and reached these results:

  • Human error: 0%

  • Train error: 1%

  • Dev error: 10%

Though this seems like a variance problem, you cannot be sure since the distributions are not the same. This may be due to the fact that the train set was easy to train on, but the dev set was more difficult.

This issue can be solved by creating a new set called the train-dev set: a random subset of the training set (so that it has the same distribution) that is held out and not used for training. Measuring the error on it, we get:

  • Human error: 0%

  • Train error: 1%

  • Train-dev error: 9%

  • Dev error: 10%

Now we can be sure that this is, indeed, a high variance problem.

Let’s take a look at another situation:

  • Human error: 0%

  • Train error: 1%

  • Train-dev error: 1.5%

  • Dev error: 10%

In this case we have a problem called data mismatch.

To tell these problems apart, we compare the following error levels in order:

1.    Human-level error (a proxy for Bayes error)

2.    Train error

We calculate avoidable bias by subtracting human level error from the training error.

If the difference is large, it means that there is an avoidable bias problem, so we should use a strategy for high bias.

3.    Train-dev error

We calculate variance by subtracting training error from training-dev error.

If the difference is huge it means that we have a high variance problem, therefore we should use a strategy revolving around that.

4.    Dev error

We calculate the data mismatch by subtracting train-dev error from dev error.

If this difference is large, it’s a data mismatch problem.

5.    Test error

We calculate the degree of overfitting to the dev set by subtracting the dev error from the test error.

If the result is noticeably positive, then maybe you need to find a bigger dev set.
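The subtractions listed above can be collected in one small helper. This is only a sketch: the function names are ours, and picking the largest gap is a simplification of the judgment call about which gap counts as "large".

```python
def error_gaps(human, train, train_dev, dev, test=None):
    """Differences between consecutive error levels, as described in the text."""
    gaps = {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
    }
    if test is not None:
        gaps["dev_overfitting"] = test - dev
    return gaps


def largest_gap(gaps):
    """Name of the biggest gap, a rough hint at what to work on first."""
    return max(gaps, key=gaps.get)


# The two cat-classifier situations from the text.
case_a = error_gaps(human=0.0, train=0.01, train_dev=0.09, dev=0.10)
case_b = error_gaps(human=0.0, train=0.01, train_dev=0.015, dev=0.10)

print(largest_gap(case_a))  # variance
print(largest_gap(case_b))  # data_mismatch
```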

It is important to remember that, unfortunately, there aren't many systematic ways to deal with data mismatch, so we can only try to get better results by experimenting with different approaches, such as:

  • Carry out a manual error analysis to try to understand the difference between training and dev/test sets.

  • Adjust the training data to be more similar to our dev set or collect more data similar to dev/test sets.

  • Use artificial data synthesis to make the training data more similar to our dev set.

  • Combine some of our training data with something that can convert it to the dev/test set distribution.

Gökırmak explained this further and said that we can add noise to our cleanly recorded data. As a result, we can cover more of the cases we will meet in production. For example, if we are building a speech recognition system for cars, we can add honking sounds to our dataset. By combining normal audio with car noise, we can obtain models that can be used in crowded streets full of loud honking.
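As a rough illustration of this kind of data synthesis, the sketch below adds a scaled noise clip to a clean clip sample by sample. Everything in it is an assumption made for the example: the `mix` helper, the 8 kHz sample rate, the sine wave standing in for speech and the random samples standing in for a honk recording.

```python
import math
import random


def mix(clean, noise, noise_gain=0.3):
    """Add a scaled noise clip to a clean clip, looping the noise if it is shorter."""
    return [c + noise_gain * noise[i % len(noise)] for i, c in enumerate(clean)]


sample_rate = 8000  # hypothetical sample rate

# Stand-in for one second of clean speech: a 220 Hz tone.
clean = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]

# Stand-in for a quarter-second honk recording: uniform random samples.
rng = random.Random(0)
honk = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate // 4)]

noisy = mix(clean, honk)
print(len(noisy))
```

With real recordings it helps to use many different noise clips, since a network may otherwise overfit to the few that were reused.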

Reusing the layers of a trained network to solve other problems can speed up the process and help us build well-functioning systems. Gökırmak walked us through the steps of transfer learning, a sequential process in which we learn from task A and then transfer that knowledge to task B.

For example, if you have trained a cat classifier with a lot of data, you can reuse part of the trained NN to solve an x-ray classification problem.

To do transfer learning, delete the last layer of the NN and:

Option 1: if you have a small data set, keep all the other weights fixed. Add one or more new last layers, initialize their weights, feed the new data to the NN and learn only the new weights.

Option 2: if you have enough data, you can retrain all the weights.

Options 1 and 2 are called fine-tuning, and training on task A is called pretraining.
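Option 1 can be sketched in plain Python without any deep learning library. The tiny "pretrained" layer, the four-example task B dataset and all function names below are invented for illustration; the point is only that the pretrained weights are never updated while the new output layer is learned.

```python
import math
import random


def relu_layer(weights, x):
    """Frozen hidden layer: one ReLU unit per weight row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(frozen_weights, head, bias, x):
    h = relu_layer(frozen_weights, x)
    return sigmoid(sum(w * hi for w, hi in zip(head, h)) + bias)


def train_new_head(frozen_weights, data, epochs=200, lr=0.1, seed=0):
    """Learn only a new output layer on top of the fixed pretrained layer."""
    rng = random.Random(seed)
    head = [rng.uniform(-0.01, 0.01) for _ in frozen_weights]
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            h = relu_layer(frozen_weights, x)  # pretrained weights stay untouched
            err = predict(frozen_weights, head, bias, x) - y  # dLoss/dlogit
            head = [w - lr * err * hi for w, hi in zip(head, h)]
            bias -= lr * err
    return head, bias


# Stand-in for weights learned on task A (hypothetical values).
pretrained = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
snapshot = [row[:] for row in pretrained]

# Tiny task B dataset: label 1 when the first input is positive.
task_b = [([1.0, 0.2], 1), ([0.8, -0.5], 1), ([-1.0, 0.3], 0), ([-0.7, -0.4], 0)]

head, bias = train_new_head(pretrained, task_b)
print(predict(pretrained, head, bias, [0.9, 0.0]) > 0.5)
```

Option 2 would differ only in that the gradient step is also applied to the pretrained weights.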

Transfer learning makes sense when:

  • Task A and B have the same input X (e.g. image, audio).

  • You have a lot of data for task A, which you are transferring from, and relatively less data for task B, which you are transferring to.

  • Low-level features from task A could be helpful for learning task B.

At this point Gökırmak showed a video about artistic style transfer, in which the Bosphorus is reimagined in Vincent van Gogh’s artistic style.

Gökırmak mentioned another method called multi-task learning, in which the tasks are learned simultaneously and one neural network does several things at the same time. For example:

Let’s suppose that we want to build an object recognition system that detects pedestrians, cars, stop signs, and traffic lights (which means that the image has multiple labels). We can train four different neural networks for this, but if some of the lower-level features in a neural network are common, then training one neural network to do four things may give better results. In this case:

  • Y will have shape (4, m), because we have 4 classes and each one is a binary label.

  • The cost is then:

Cost = (1/m) * sum_i sum_j L(y_hat(i)_j, y(i)_j), i = 1..m, j = 1..4, where

L(y_hat(i)_j, y(i)_j) = - y(i)_j * log(y_hat(i)_j) - (1 - y(i)_j) * log(1 - y_hat(i)_j)

Multi-task learning will also work if y isn't complete for some labels.

Y = [1 ? 1 ...]
    [0 0 1 ...]
    [? 1 ? ...]

In the example above, multi-task learning can still be trained despite the missing labels, bearing in mind that the cost function changes: the inner sum only runs over the labels that are known.

Cost = (1/m) * sum_i sum_j L(y_hat(i)_j, y(i)_j), i = 1..m, for all j where y(i)_j != ?
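A minimal sketch of this masked cost is shown below. Two choices in it are ours: unknown labels ("?") are represented as `None`, and the labels are stored with one row per example rather than in the (4, m) layout used in the text.

```python
import math


def multitask_cost(y_hat, y):
    """Binary cross-entropy summed over known labels, averaged over examples.

    y_hat[i][j] is the predicted probability for label j of example i.
    y[i][j] is 0, 1, or None when the label is unknown ("?").
    """
    m = len(y)
    total = 0.0
    for preds, labels in zip(y_hat, y):
        for p, t in zip(preds, labels):
            if t is None:
                continue  # skip labels that were never annotated
            total += -t * math.log(p) - (1 - t) * math.log(1 - p)
    return total / m


# Two examples, four labels each (pedestrian, car, stop sign, traffic light).
y = [[1, None, 1, 0], [0, 0, 1, None]]
y_hat = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]

cost = multitask_cost(y_hat, y)
print(round(cost, 4))
```

Predictions at the unknown positions have no effect on the cost, so they contribute no gradient during training.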

It is important to note that, compared to transfer learning, multi-task learning is not as widely used.

When a system has multiple stages, we can use an end-to-end deep learning system to implement them with a single neural network. End-to-end systems require more data than non-end-to-end systems and work best with big datasets, but they also give the data more freedom to shape the solution. With small datasets, on the other hand, non-end-to-end systems work better.

Example 1: Machine translation system

  • English --> Text analysis --> ... --> French    # non-end-to-end system

  • English ----------------------------> French    # end-to-end deep learning system

Here the end-to-end deep learning system works better because we have enough data to build it.

Example 2: Estimating a child's age from an x-ray picture of a hand

  • Image --> Bones --> Age    # non-end-to-end system - best approach for now

  • Image ------------> Age    # end-to-end system

In this example, the non-end-to-end system works better because we don't have enough data to train the end-to-end system.

End-to-end deep learning lets the data speak rather than forcing the system to reflect human preconceptions. Furthermore, we do not have to hand-design the components as much. But it is important to note that it may exclude potentially useful hand-designed components, which could be more helpful for smaller datasets.

Gökırmak concluded the discussion by reminding us of the key question to ask when applying an end-to-end deep learning system.

“Do we have sufficient data to learn a function of the complexity needed to map x to y?”

Remember that when applying supervised learning you should carefully choose what types of X to Y mappings you want to learn depending on what task you can get data for.

At the end of the sessions, the participants examined a Kaggle Challenge under the guidance of Gökırmak.

Gökırmak also mentioned a few different sources where we can further follow up these topics.


Next session, we will start following a new course, in which we will focus on convolutional neural networks.


Macit Giray received his B.Sc in Computer Engineering from Halic University in 2004, and his M.Sc in Computer Engineering in 2012. He currently works as a machine learning engineer in the artificial intelligence tribe at Turkiye Is Bankasi. His main fields of work are financial forecasting, computer vision (tracking/detection/recognition) and natural language processing (chatbots, classification of customer complaints, etc.).


inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.