Dropout in Deep Learning

Sunday, 13/12/2020

Tram Ho

In this article, I introduce Dropout in neural networks, then walk through some code to see how Dropout affects a network's performance.

1. Theory

1.1. What is Dropout in a neural network?

According to Wikipedia: the term 'Dropout' refers to ignoring hidden and visible units in a neural network.

Put simply, Dropout randomly ignores units (i.e. network nodes) during training. An ignored unit takes no part in the forward or backward pass for that training step. Let p be the probability of keeping a network node at each training step; the probability of dropping it is then (1 - p).
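
As a minimal sketch of the idea above, the snippet below samples a random keep/drop mask for the output of one layer. The array `activations`, the value `p = 0.8`, the seed and the use of NumPy's `default_rng` are all illustrative assumptions, not part of the code used later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, only for repeatability

p = 0.8                                 # assumed probability of keeping a unit
activations = np.ones((5, 4))           # hypothetical layer output: 5 units, 4 examples

# Each entry is True with probability p (unit kept) and False otherwise (unit dropped).
mask = rng.random(activations.shape) < p

# Dropped units contribute 0 to the forward pass for this training step.
dropped = activations * mask

print(mask.astype(int))
print("fraction of units kept:", mask.mean())
```

With a large layer, the fraction of kept units should come out close to p; with only 20 entries, as here, it can deviate noticeably.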

1.2. Why do we need Dropout?

The question is: why would we deliberately turn off some network nodes during training? The answer is: to avoid over-fitting.

A fully connected layer often holds most of a network's parameters. With that many parameters, the nodes in the layer become too dependent on one another during training, which limits the power of each individual node and leads to over-fitting.

1.3. Other techniques

If you only want to know what Dropout is, the two theory parts above are enough. In this part I also introduce some techniques that have the same effect as Dropout.

In machine learning, regularization reduces over-fitting by adding a penalty term to the loss function. With such a penalty, the model is discouraged from learning sets of weights that depend too heavily on one another. Anyone familiar with logistic regression will know that L1 (Laplacian) and L2 (Gaussian) are two such penalty techniques.
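
To make the notion of a penalty concrete, here is a small sketch of L1 and L2 penalties added to a loss value. The function names `l1_penalty` and `l2_penalty`, the strength `lambd`, the toy weights and the `data_loss` value are all assumptions made for illustration; the factor 0.5 in the L2 term is just one common convention.

```python
import numpy as np

def l1_penalty(weights, lambd):
    """Sum of absolute weight values, scaled by lambd."""
    return lambd * sum(np.sum(np.abs(W)) for W in weights)

def l2_penalty(weights, lambd):
    """Sum of squared weight values, scaled by lambd / 2 (one common convention)."""
    return 0.5 * lambd * sum(np.sum(W ** 2) for W in weights)

# Hypothetical weight matrices of a tiny two-layer network.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, -1.0]])]

data_loss = 0.25   # hypothetical loss measured on the training data
lambd = 0.1        # hypothetical regularization strength

# The regularized loss is the data loss plus the penalty on the weights.
loss_l1 = data_loss + l1_penalty(weights, lambd)
loss_l2 = data_loss + l2_penalty(weights, lambd)
print("loss with L1 penalty:", loss_l1)
print("loss with L2 penalty:", loss_l2)
```

Larger weights make both penalties grow, so gradient descent on the regularized loss pushes the weights toward smaller values.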

Dropout is another way to regularize a neural network:

Training phase: for each hidden layer, for each training example, in each iteration, drop each network node at random with probability (1 - p).
Test phase: use all activations, but scale each of them by a factor p (to account for the activations that were dropped during training).

1.4. Some comments

Dropout forces the network to learn more robust features.
Dropout roughly doubles the number of iterations needed to converge. However, the training time per epoch is lower.
With H hidden units, each of which can be dropped with probability (1 - p), there are 2^H possible models. During the test phase, however, all network nodes are used, and each activation is scaled by a factor p.
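
The count of 2^H possible models can be checked by listing every mask for a small layer. The values `H = 3` and `p = 0.8` below are assumptions chosen to keep the output short.

```python
from itertools import product
import math

H = 3          # hypothetical number of hidden units
p = 0.8        # assumed probability of keeping a unit

# Every mask is a tuple of 0/1 flags, one flag per hidden unit (1 = kept, 0 = dropped).
masks = list(product([0, 1], repeat=H))
print("number of sub-networks:", len(masks))   # equals 2 ** H

# Probability of sampling each mask: p per kept unit, (1 - p) per dropped unit.
probs = {
    mask: math.prod(p if keep else 1 - p for keep in mask)
    for mask in masks
}
for mask, prob in probs.items():
    print(mask, round(prob, 3))
```

Each training step samples one of these masks, so training with Dropout can be seen as training many thinned networks that share the same weights.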

2. Practice

Theory alone can be a bit confusing, so I will code two parts to see what Dropout does in practice.

Problem: at a football match, you try to predict where the goalkeeper should kick the ball so that a home-team player can head it.

First, import the necessary libraries:

# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

Load the data and visualize it:

train_X, train_Y, test_X, test_Y = load_2D_dataset()

This gives a scatter plot of the data.

A red dot marks a position where a home player headed the ball; a green dot marks a position where an opposing player headed it. Our task is to predict the region into which the goalkeeper should kick the ball so that a home player can head it. It looks as if a single line separating the two regions would be enough.

2.1. Model without regularization

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True):
    """
    Implement a 3-layer network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- label vector (1 for a green dot / 0 for a red dot), of shape (output size, number of examples)
    learning_rate -- learning rate
    num_iterations -- number of iterations of the training loop
    print_cost -- if True, print the cost every 10000 iterations
    
    Returns:
    parameters -- learned parameters, used later for prediction
    """
        
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
   
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)
        
        # Cost function
        cost = compute_cost(a3, Y)
      
        grads = backward_propagation(X, Y, cache)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

Train the model, then predict on both sets:

parameters = model(train_X, train_Y)
print("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The results:

Cost after iteration 0: 0.6557412523481002
Cost after iteration 10000: 0.16329987525724216
Cost after iteration 20000: 0.13851642423255986
...
On the training set:
Accuracy: 0.947867298578
On the test set:
Accuracy: 0.915

The training accuracy is about 94.8% and the test accuracy is 91.5%, which is already quite high. Next, we plot the decision boundary.

Without regularization, the decision boundary follows the training data in great detail, which means the model is over-fitting.

2.2. Regularized model with Dropout

2.2.1. Forward Propagation process

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implement the forward pass: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    parameters -- dictionary containing "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob -- probability of keeping a unit
    
    Returns:
    A3 -- output of the model, of shape (1, number of examples)
    cache -- values stored for computing the backward propagation
    """
    
    np.random.seed(1)                                 # fixed seed for reproducibility (every call then draws the same masks)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: random matrix with the same shape as A1, values in [0, 1)
    D1 = D1 < keep_prob                               # Step 2: turn it into a 0/1 mask; an entry is 1 if its value is below keep_prob
    A1 = A1 * D1                                      # Step 3: keep the entries of A1 where D1 is 1, set them to 0 where D1 is 0
    A1 = A1 / keep_prob                               # Step 4: divide by keep_prob to compensate for the dropped units
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])     # same four steps for the second hidden layer
    D2 = D2 < keep_prob
    A2 = A2 * D2
    A2 = A2 / keep_prob
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

2.2.2. Backward Propagation process

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implement the backward propagation of the model with dropout.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- labels, of shape (output size, number of examples)
    cache -- output cache of forward_propagation_with_dropout()
    keep_prob -- probability of keeping a unit, same value as in the forward pass
    
    Returns:
    gradients -- gradients with respect to every weight, bias and activation
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    dA2 = dA2 * D2              # Step 1: apply D2 to switch off the same units as in the forward pass
    dA2 = dA2 / keep_prob       # Step 2: divide by keep_prob, matching the scaling in the forward pass
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    dA1 = dA1 * D1              # same two steps for the first hidden layer
    dA1 = dA1 / keep_prob
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

With the forward and backward passes in place, we swap these two functions into the model function from the previous section (which now also takes a keep_prob argument) and train again:

parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)

print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
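
The modified model function itself is not listed in this article, so here is a hedged, self-contained sketch of what swapping in the two dropout functions could look like. It does not use `reg_utils`: the helpers `relu`, `sigmoid`, `compute_cost`, `init_params`, `forward`, `backward`, `model_with_dropout` and this version of `predict` are hypothetical stand-ins written for the sketch, the He-style initialization is an assumption, and the two-blob data set is synthetic rather than the football data. Unlike the code above, it passes a random generator around instead of re-seeding on every forward pass, so each iteration draws fresh masks.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(A3, Y, eps=1e-8):
    # Binary cross-entropy; eps guards against log(0).
    return -np.mean(Y * np.log(A3 + eps) + (1 - Y) * np.log(1 - A3 + eps))

def init_params(layer_dims, rng):
    params = {}
    for l in range(1, len(layer_dims)):
        scale = np.sqrt(2.0 / layer_dims[l - 1])   # He-style scaling (an assumption)
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params, keep_prob, rng):
    # Inverted dropout on both hidden layers; keep_prob=1.0 keeps every unit.
    A1 = relu(params["W1"] @ X + params["b1"])
    D1 = rng.random(A1.shape) < keep_prob
    A1 = A1 * D1 / keep_prob
    A2 = relu(params["W2"] @ A1 + params["b2"])
    D2 = rng.random(A2.shape) < keep_prob
    A2 = A2 * D2 / keep_prob
    A3 = sigmoid(params["W3"] @ A2 + params["b3"])
    return A3, (D1, A1, D2, A2, A3)

def backward(X, Y, params, cache, keep_prob):
    # Gradients of the cross-entropy cost, reusing the masks from the forward pass.
    m = X.shape[1]
    D1, A1, D2, A2, A3 = cache
    grads = {}
    dZ3 = A3 - Y
    grads["dW3"] = dZ3 @ A2.T / m
    grads["db3"] = dZ3.sum(axis=1, keepdims=True) / m
    dA2 = (params["W3"].T @ dZ3) * D2 / keep_prob
    dZ2 = dA2 * (A2 > 0)
    grads["dW2"] = dZ2 @ A1.T / m
    grads["db2"] = dZ2.sum(axis=1, keepdims=True) / m
    dA1 = (params["W2"].T @ dZ2) * D1 / keep_prob
    dZ1 = dA1 * (A1 > 0)
    grads["dW1"] = dZ1 @ X.T / m
    grads["db1"] = dZ1.sum(axis=1, keepdims=True) / m
    return grads

def model_with_dropout(X, Y, keep_prob=0.86, learning_rate=0.3,
                       num_iterations=1000, seed=1):
    rng = np.random.default_rng(seed)
    params = init_params([X.shape[0], 20, 3, 1], rng)
    costs = []
    for i in range(num_iterations):
        A3, cache = forward(X, params, keep_prob, rng)       # dropout switched on
        grads = backward(X, Y, params, cache, keep_prob)
        for key in params:
            params[key] = params[key] - learning_rate * grads["d" + key]
        if i % 100 == 0:
            costs.append(compute_cost(A3, Y))
    return params, costs

def predict(X, params):
    # No dropout at prediction time: keep_prob=1.0 makes every mask all-True.
    A3, _ = forward(X, params, 1.0, np.random.default_rng(0))
    return (A3 > 0.5).astype(int)

# Synthetic two-blob data, purely for illustration (not the article's data set).
data_rng = np.random.default_rng(0)
X = np.hstack([data_rng.normal(-1.0, 0.5, (2, 100)), data_rng.normal(1.0, 0.5, (2, 100))])
Y = np.hstack([np.zeros((1, 100)), np.ones((1, 100))])

params, costs = model_with_dropout(X, Y)
print("train accuracy:", np.mean(predict(X, params) == Y))
```

The only differences from a model without Dropout are the extra keep_prob argument, the masks applied in `forward` and `backward`, and the fact that `predict` runs the network with every unit kept.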

Result:

Cost after iteration 10000: 0.06101698657490559
Cost after iteration 20000: 0.060582435798513114
...
On the train set:
Accuracy: 0.928909952607
On the test set:
Accuracy: 0.95

The test accuracy rises to 95%, although the training accuracy drops to about 92.9%. Visualize the decision boundary:

plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75, 0.40])
axes.set_ylim([-0.75, 0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

We can see that the decision boundary is no longer overly detailed, so over-fitting is avoided.

2.3. Notes

Do not use Dropout at test time.
Apply Dropout in both the forward and the backward pass.
During training, the activations of every layer that uses Dropout must be divided by keep_prob, and the same scaling applies to the gradients in the backward pass.
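
The code in section 2.2 divides by keep_prob during training (often called inverted dropout) instead of scaling by p at test time. The short sketch below, under assumed values (`keep_prob = 0.86`, a hand-written activation vector `a`, 20000 random masks) and with a hypothetical helper `inverted_dropout`, checks numerically why no extra scaling is then needed at test time: the expected value of the activations is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)            # assumed seed, only for repeatability
keep_prob = 0.86                          # same value as in the training call above
a = np.array([[0.5], [1.0], [2.0]])       # hypothetical activations of one layer

def inverted_dropout(a, keep_prob, rng):
    # Drop units at random, then divide the survivors by keep_prob.
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

# Average the training-time activations over many random masks.
avg = np.mean([inverted_dropout(a, keep_prob, rng) for _ in range(20000)], axis=0)

print("average with inverted dropout:", avg.ravel())
print("activations without dropout:  ", a.ravel())
```

Since the average matches the plain activations up to sampling noise, the network can be used at test time with all units kept and no rescaling.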

Source: Medium

Thank you for viewing the article

Source : Viblo
