Going to train ML / DL model but don’t know how to choose parameters? Bayesian Optimization!

Tram Ho


In the training model of ML or neural network, an extremely important and indispensable step is the selection of values ​​for parameters such as learning rate, epochs, number of layers, hidden units, … The logic is often based on experience and for each such set of parameters we have to train the model, observe the results achieved, evaluate the results, adjust the parameters and repeat. To automate the above process, search algorithms such as Grid Search or Random Search are used. However, these algorithms only work effectively with a small number of parameters, usually less than 5 because the search space will increase rapidly when the number of parameters is large, making the search time become very long. Bayesian Optimization (BO) is an algorithm that effectively optimizes target functions with large evaluation costs (such as training a neural network) based on Bayesian’s theorem. BO significantly reduces the number of incorrect attempts to tune parameters compared to Grid Search or Random Search, which with a large deep learning network (ResNet, Inception, Xception, ..) and huge data sets (ImageNet) will cost money. hours even every day. In this article, we will learn how BO works and apply it to optimize the CNN training model parameter for the MNIST problem.

Algorithm overview

BO optimizes the target function based on machine learning. Write as a formula:

max x A f ( x ) max _ {x in A} f (x) x A max f ( x )

Where A is a set of parameters to tune, and the number of parameters should not be more than 20 to ensure the best learning algorithm; f ( x ) f (x) f ( x ) is the target function for biggest / smallest optimization (e.g. accuracy, loss, …), f ( x ) f (x) f ( x ) has some of the following features:

  • f f f is a continuous function
  • Evaluation costs of f f large f
  • f f f is a black box, we do not know the properties of f f f such as linearity, convex function, concave function, …
  • Because f f f is a black box, when evaluate f f f retrieves only the value f ( x ) f (x) f ( x ) , derivative methods such as gradient descent would not be feasible and would not be used.

BO constructs a surrogate function (surrogate function) to model f f f , the goal is to optimize surrogate as much as possible f f f is possible, from that one can find the global minimum / maximum of f f f is easy by taking surrogate’s minimum / maximum. This Surrogate is a Gaussian Process with prior distribution and likelihood defined from the beginning. Every time evaluate f f f of the new point x, we will compute the posterior of surrogate using Bayes’ rule. Based on this posterior, an Acquisition function finds an x ​​’that has the potential to deliver value f f f is largest, x ‘continues to be evaluated and updated posterior. Keep repeating this until we get the result f ( x ) f (x) f ( x ) is desired.

Pseudo-code for BO algorithm:

Preparation step

In this article, I will use GPyTorch to implement the main part of BO, Gaussian Process Regression, the purpose is to provide a better view of the internal components of GP Regression. Alternatively, you can also use scikit-learn’s model GaussianProcessRegressor. Install related libraries:

Import library:

Next is the function to train CNN model on the MNIST set. This function takes a dict of the parameters and corresponding values ​​and trains model with those parameters, the function returns the train model. Here I will only tune 1 single parameter, the learning rate .

Load MNIST data:

The sample_lr generate function generate different learning rate values ​​between 1e-6 and 0.2

Finally, we define the objective function as the function we want to optimize. Specifically, this function takes x as hyperparameter (learning rate), performs train, evaluate CNN model and returns accuracy value.

Gaussian Process Regression

Gausian Process (GP) Regression is a Bayesian statistical method for modeling functions. GP regression works well on small datasets by assuming the data points are on a multivariate normal. First we define a GP model which is a prior distribution on the function. For a random variable X, its distribution is determined by the probability distribution function (pdf):

f ( x μ , σ 2 ) = first 2 π σ 2 e ( x μ ) 2 2 σ 2 f left (x mid mu, sigma ^ {2} right) = frac {1} { sqrt {2 pi sigma ^ {2}}} e ^ {- frac {(x- mu) ^ {2}} {2 sigma ^ {2}}} f ( x μ , σ 2 ) = 2 π σ 2 1 e 2 σ 2 (X – μ) 2

with μ , σ mu, sigma μ , σ are mean and variance respectively. When there is more than one random variable, the mean is represented as a vector and the variance is represented as a matrix, called the covariance matrix. The Covariance matrix is ​​built by evaluating a covariance or kernel function Σ 0 Σ_0 Σ 0 on each pair of values x i , x j x_i, x_j x i , X j of random variables. The kernel is selected to the points x i , x j x_i, x_j x i , X j the closer each other in the input space is the greater the value. Different kernels display different prior, leading to different result functions.

To build a prior (GP model) using GPyTorch, I will inherit the ExactGP class and custom mean and covariance modules. Specifically, I use ConstantMean and RBFKernel :

Class Regressor is a GP Regression including GP model and gaussian likelihood.

Acquisition Function

With the posterior available from GP, the next question is, how do I choose the candidate with the best mean? The simplest way is to choose the maximum position of the mean, but also for the large uncertainty regions (large standard deviation) has the potential to produce good results. The Acquisition function relies on both the current extreme an

d the uncertainty of posterior to deliver the most potential candidate. One of the most commonly used acquisition functions is Expected Improvement (EI), which is calculated using the acquisition tool:

E I ( x ) = { ( μ ( x ) f ( x + ) ξ ) Φ ( Z ) + σ ( x ) ϕ ( Z ) , n e ˆ ˊ u σ ( x ) > 0 0 , n e ˆ ˊ u σ ( x ) = 0 mathrm {EI} ( mathbf {x}) = left { begin {array} {ll} left ( mu ( mathbf {x}) – f left ( mathbf {x} ^ {+ } right) – xi right) Phi (Z) + sigma ( mathbf {x}) phi (Z) & text {, if} sigma ( mathbf {x})> 0 \ 0 & text {, if} sigma ( mathbf {x}) = 0 end {array} right. E I ( x ) = {(Μ (x) f (x +) ξ) Φ (Z) + σ (x) φ (Z) 0 , N f ‘u σ (x)> 0, n e’ u σ (x) = 0

Inside x + mathbf {x} ^ {+} x + is the current best data point; μ ( x ) mu ( mathbf {x}) μ ( x ) and σ ( x ) sigma ( mathbf {x}) σ ( x ) is mean and variance of x x x ; Φ ( Z ) Phi (Z) Φ ( Z ) and ϕ ( Z ) phi (Z) ϕ ( Z ) is the CDF and PDF of the Z distribution:

Z = { μ ( x ) f ( x + ) ξ σ ( x ) , n e ˆ ˊ u σ ( x ) > 0 0 , n e ˆ ˊ u σ ( x ) = 0 Z = left { begin {array} {ll} frac { mu ( mathbf {x}) – f left ( mathbf {x} ^ {+} right) – xi} { sigma ( mathbf {x})} & text {, if} sigma ( mathbf {x})> 0 \ 0 & text {, if} sigma ( mathbf {x}) = 0 end { array} right. Z = { σ ( x ) μ (x) f (x +) ξ 0 , N f ‘u σ (x)> 0, n e’ u σ (x) = 0

Term ( μ ( x ) f ( x + ) ξ ) Φ ( Z ) left ( mu ( mathbf {x}) – f left ( mathbf {x} ^ {+} right) – xi right) Phi (Z) ( μ ( x ) f ( x + ) ξ ) Φ ( Z ) says the “improvement” of x mathbf {x} x compared with x + mathbf {x} ^ {+} x + , shows exploitation while σ ( x ) ϕ ( Z ) sigma ( mathbf {x}) phi (Z) σ ( x ) ϕ ( Z ) “explore” places with high uncertainty (variance), represents exploration . Parameters ξ xi ξ control trade-off between exploration and exploitation. The formula looks complicated, but implementing it is quite easy as follows:

After we have the EI function for any point x, we will have to sample lots of x and choose the point with the largest EI. Although the EI function has two elements exploration and exploitation, it is assumed that sometimes it is still somewhat “greedy” in favor of expoitation. So instead of picking the highest EI score, I’ll choose to follow ϵ epsilon ϵ -greedy: choose the best score with probability (1- ϵ epsilon ϵ ) and select any point with probability ϵ epsilon ϵ , and ϵ epsilon ϵ will decrease gradually with each loop.

Main loop

This loop follows the correct algorithm:

  1. Train GP regression with initial data.
  2. Use the acquisition function to choose the most potential candidate (learning rate).
  3. Evaluate candidate: retrain the CNN model with that candidate
  4. Append candidate and its accuracy go to the original dataset and repeat step 1.

Run the algorithm (it will take some time):

Through 10 loops the result is the best learning rate value is 0.012973 with accuracy equal to 0.9823

Visualize posterior

We will visualize the posterior of GP regression to get a visual view of how BO chooses candidate.

And here is the result. It can be seen that BO mostly selects the point around the area with the highest accuracy value but sometimes it also selects other areas with high variance (the gray area represents posterior’s variance).


This article I introduced and implemented with you the Bayesian Optimization algorithm. This is a pretty good algorithm and useful in optimizing, especially optimizing the deep learning model parameter. If there is anything wrong with the article, please comment to let me know; If it feels good, then why not upvote / clip?


Share the news now

Source : Viblo