Questions in Deep Learning

Tram Ho

What is overfitting?

Overfitting is when a model fits the training data very closely, giving a low loss on the training set, but performs poorly on new data. It occurs when the model is complex enough to reproduce the training data, noise included, rather than the underlying pattern. It is especially likely when the amount of training data is small relative to the complexity of the model.

What is underfitting?

Underfitting is when the model fits neither the training set nor new data (the test set). This may be because the model is not complex enough to capture the structure of the data.

What is Good Fitting?

A good fit is a model that performs well on both the training set and the test set. Achieving such a model takes careful experimentation.

How can you tell if you are overfitting or not?

We can use error metrics to assess whether the model is overfitting: we evaluate the model's average error on each dataset. For a dataset with m elements, this is the mean of the per-example errors over those m elements.
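
As a sketch of that average, assuming a per-example error function $\mathrm{err}$ (a hypothetical name; it could be the squared error or the 0/1 misclassification error), a model $h$, and examples $(x^{(i)}, y^{(i)})$, one common form is:

```latex
E = \frac{1}{m} \sum_{i=1}^{m} \mathrm{err}\left(h(x^{(i)}),\, y^{(i)}\right)
```

The exact choice of $\mathrm{err}$ depends on the problem; the symbols $h$, $x^{(i)}$, and $y^{(i)}$ are notation assumed here for illustration.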

Our data is divided into 3 parts: the training set, the cross-validation set, and the test set. Each part has its own average error: $E_{train}$, $E_{cv}$, and $E_{test}$ respectively.

Based on the magnitude of these errors, we can recognize:

  • Underfitting: $E_{train}$, $E_{cv}$, and $E_{test}$ are all large.
  • Overfitting: $E_{train}$ is small, while $E_{cv}$ and $E_{test}$ are large.
  • Good fit: $E_{train}$, $E_{cv}$, and $E_{test}$ are all small.
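
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive test: it assumes mean squared error as the error measure, and the `threshold` that separates "small" from "large" is a hypothetical value that has to be chosen per problem.

```python
def mean_error(predictions, targets):
    """Mean squared error over a dataset with m elements."""
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m


def diagnose(e_train, e_cv, threshold=0.1):
    """Label the fit from the training and cross-validation errors.

    `threshold` is a hypothetical cutoff for "small"; pick it per problem.
    """
    if e_train > threshold:
        return "underfitting"   # error is large even on the training set
    if e_cv > threshold:
        return "overfitting"    # small training error, large validation error
    return "good fit"


e_train = mean_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # exact on the training set
e_cv = mean_error([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])      # off by 1 on validation
print(diagnose(e_train, e_cv))
```

In practice the comparison is usually made by looking at the gap between the training and validation errors over time rather than against a fixed cutoff.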

What are some techniques to avoid overfitting?

We can use validation techniques (hold-out validation, cross-validation) or regularization (early stopping, l2 regularization, …). Another technique that is now widely used is dropout.
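
As one example, early stopping can be sketched as follows. This is a simplified illustration that works on a list of validation losses recorded per epoch; the function name and the `patience` parameter are assumptions made for this sketch, not part of any specific library.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (`patience` is an assumed hyperparameter).
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never run
    return best_epoch


# Validation loss falls, then rises as the model starts to overfit.
losses = [0.90, 0.60, 0.45, 0.50, 0.55, 0.40]
print(early_stopping_epoch(losses))
```

With `patience=2` the loop stops after epoch 4 and never sees the later improvement at epoch 5, which shows the trade-off in choosing the patience value.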

What is an RGB image? In data processing, how is an RGB image represented?

An RGB image is built from 3 colors: red, green, and blue. The value of each color is in the range [0, 255]. Because there are 256 ways to choose r, 256 ways to choose g, and 256 ways to choose b, the total number of colors that can be created in the RGB color system is 256 * 256 * 256 = 16777216.

In data processing, an RGB image is represented by a tensor of shape (Height, Width, channel), where channel = 3.
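
To make the shape concrete, here is a small sketch that builds such a tensor with plain nested lists. The 2 x 3 image size and the pixel values are arbitrary choices for illustration.

```python
HEIGHT, WIDTH, CHANNELS = 2, 3, 3  # a tiny 2x3 image, chosen for illustration

# image[row][col] is a pixel [r, g, b], each value in the range [0, 255].
image = [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]
image[0][0] = [255, 0, 0]      # pure red in the top-left corner
image[1][2] = [255, 255, 255]  # white in the bottom-right corner

shape = (len(image), len(image[0]), len(image[0][0]))
print(shape)      # (Height, Width, channel)
print(256 ** 3)   # number of distinct RGB colors
```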

What is a grayscale image? How is a grayscale image represented?

A grayscale image has only one value per pixel, in the range [0, 255]. It is therefore represented by a single matrix of shape (Height, Width).

How is the convolution calculation performed? What is the meaning of the convolution calculation?

You can learn about convolution calculations here.

Meaning of convolution calculation: the purpose of convolution on an image is to blur it, sharpen it, detect edges, and so on. Each kernel has a different effect. You can refer to some kernels here
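
A minimal sketch of the calculation, assuming no padding and stride 1. As is usual in deep learning, the kernel is not flipped, so strictly speaking this is cross-correlation. The image and the Sobel-like edge kernel are values chosen for illustration.

```python
def conv2d(x, kernel):
    """Slide `kernel` over `x` (no padding, stride 1) and sum the products."""
    k = len(kernel)
    out_h = len(x) - k + 1
    out_w = len(x[0]) - k + 1
    y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            y[i][j] = sum(
                x[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
    return y


# A 4x4 image: dark (0) on the left half, bright (10) on the right half.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]

# A vertical-edge kernel (Sobel-like): responds where left and right differ.
edge_kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

print(conv2d(image, edge_kernel))  # [[40, 40], [40, 40]]
```

Every 3 x 3 window of this image straddles the dark-to-bright boundary, so every output value is large. On a uniformly colored image the same kernel would output zeros.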

What is padding? What is stride?

Padding = k means adding k rows and k columns of zeros to each side of the matrix.


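When the kernel is applied at every element of the matrix X in turn, producing (with suitable padding) a matrix Y of the same size as X, we call this stride = 1. With stride = k (k > 1), the kernel moves k elements at a time, so the convolution is computed only at every k-th position along each dimension.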

Stride is often used to reduce the size of a matrix after a convolution calculation.
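
The effect of padding and stride on the output size can be sketched as below. The helper names are hypothetical, and the size formula floor((n + 2p - k) / s) + 1 is the standard one for an input of length n, kernel size k, padding p, and stride s along one dimension.

```python
def zero_pad(x, k):
    """Add k rows and k columns of zeros to each side of matrix x."""
    width = len(x[0]) + 2 * k
    padded = [[0] * width for _ in range(k)]
    for row in x:
        padded.append([0] * k + list(row) + [0] * k)
    padded.extend([0] * width for _ in range(k))
    return padded


def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


x = [[1, 2], [3, 4]]
print(zero_pad(x, 1))
print(conv_output_size(5, kernel=3, padding=1, stride=1))  # 5: same as the input
print(conv_output_size(5, kernel=3, padding=1, stride=2))  # 3: roughly halved
```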

What is a pooling layer?

Pooling layers are often used between convolutional layers, to reduce data size but still retain important properties. The reduced data size helps reduce computations in the model.

Let the pooling size be K * K. The input of the pooling layer has size H * W * D, which we split into D matrices of size H * W. For each matrix, we take each K * K region, compute the maximum or average of its values, and write it to the result matrix. The rules for stride and padding are the same as for convolution on an image.

In most cases a pooling layer uses size = (2,2), stride = 2, padding = 0. The output width and height are then half those of the input, while the depth is unchanged.

There are 2 common types of pooling layers: max pooling and average pooling.
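
Both types can be sketched with one function applied to a single H * W matrix. This is an illustrative version with padding = 0; the function name and the `mode` parameter are assumptions of the sketch. For an H * W * D input, it would be applied to each of the D matrices separately.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling to one H x W matrix (padding = 0)."""
    reduce_fn = max if mode == "max" else (lambda v: sum(v) / len(v))
    out = []
    for i in range(0, len(x) - size + 1, stride):
        row = []
        for j in range(0, len(x[0]) - size + 1, stride):
            window = [x[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


x = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
print(pool2d(x, mode="max"))      # [[7, 8], [9, 6]]
print(pool2d(x, mode="average"))  # [[4.0, 5.0], [4.5, 3.0]]
```

With size = 2 and stride = 2, the 4 x 4 input becomes a 2 x 2 output: width and height are halved.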

In some models, a convolutional layer with stride > 1 is used to reduce the data size instead of a pooling layer.

Why use many different kernels in the convolution layer?

Different kernels learn different features of the image, so each convolutional layer uses many kernels to learn many features. Since each kernel produces 1 output matrix, k kernels produce k output matrices. We combine these k output matrices into a 3-dimensional tensor of depth k.

The output of this convolutional layer will become the input of the next convolutional layer.

Note: The output of the convolutional layer passes through an activation function before becoming the input of the next convolutional layer.
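
A simplified sketch of one such layer is shown below. It assumes a single-channel input (in a real layer each kernel has the same depth as the input), uses ReLU as the activation, and stores the result depth-first as a list of k feature maps. The kernels are arbitrary values for illustration, not learned weights.

```python
def conv2d(x, kernel):
    """Valid cross-correlation of matrix x with a square kernel, stride 1."""
    k = len(kernel)
    return [
        [
            sum(x[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(len(x[0]) - k + 1)
        ]
        for i in range(len(x) - k + 1)
    ]


def relu(matrix):
    """Element-wise ReLU activation: max(0, value)."""
    return [[max(0, v) for v in row] for row in matrix]


def conv_layer(x, kernels):
    """Apply every kernel, then the activation; returns k feature maps."""
    return [relu(conv2d(x, kernel)) for kernel in kernels]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernels = [
    [[1, 0], [0, 1]],    # sums the main diagonal of each 2x2 window
    [[1, -1], [1, -1]],  # left column minus right column
]
feature_maps = conv_layer(image, kernels)
print(len(feature_maps))  # depth k = 2, one feature map per kernel
print(feature_maps)
```

The second kernel gives negative values everywhere on this image, so ReLU sets its whole feature map to zero before it is passed on.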

Some other questions:

Compare $l_{1}$ regularization and $l_{2}$ regularization.

What is BatchNorm?

Why does BatchNorm help deep learning optimization algorithm converge faster?

List the differences between the training phase and the test phase when implementing BatchNorm.

When performing mini-batch gradient descent, how does a small or large mini-batch size affect the results? Point out the disadvantages of mini-batches that are too small or too large.

In one problem, suppose that mini-batch gradient descent works well with large mini-batches. However, because memory is limited, backpropagation cannot be computed with large mini-batches. Is there a way to use backpropagation with small mini-batches and get the same effect as using large mini-batches?

What is the imbalanced data problem?

What are some ways to handle imbalanced data? (Discuss under-sampling and over-sampling.)

With imbalanced data, which metrics should be used to evaluate the model? (Ask about the confusion matrix and the ROC curve.)

Source : Viblo