Hi guys, today I will talk about the optimizer. So what is the optimizer ?, to answer that question you must answer the following questions:
- What is Optimizer?
- What is its application, why should it be used?
- Optimal algorithms, advantages and disadvantages of each algorithm, what is better than the other?
- When should one optimizer be applied, and when should the other be applied?
Today’s post will explain optimizers in detail according to the layout of the answers to the questions above. This article will not be heavy on calculation, code, I will use a visual example to illustrate it for easy understanding.
What is Optimizer, why use it?
Before we dive in, we need to understand what optimizers are. Basically, the optimization algorithm is the basis for building neural networks for the purpose of “learning” the features. (or pattern) of the input data, from which we can find a pair of weights and bias match to optimize the model. But the problem is “learning” like? Specifically, how weights and bias are found! It’s not just random (weights, bias) a finite number of times and hopefully at a certain step we can find a solution. Obviously not feasible and a waste of resources! We have to find an algorithm to improve the weight and bias step by step, and that is why the optimizer algorithms came in.
1. Gradient Descent (GD)
In optimization problems, we often find the minimum value of a certain function, for which the function reaches its minimum value when its derivative is 0. But it is not always possible to find the derivative of the function, for Multivariate functions, the derivative is very complex, even impossible. Instead, people find the point closest to the minimum and consider that as the solution of the problem. Gradient Descent translates into Vietnamese to gradually reduce the slope, so the approach here is to choose a random solution after each loop (or epoch), then let it progress to the desired point.
Formula: xnew = xold – learningrate.gradient (x)
Question why there is that formula? The above formula is built to update the solution after each loop. The minus ‘-‘ here denotes the opposite of the derivative . Continue to question why the opposite of the derivative?
For example, for the function f (x) = 2x + 5sin (x) as shown below, f ‘(x) = 2x + 5cos (x)
with x_old = -4 then f ‘(- 4) <0 => x_new> x_old so the solution will move to the right and approach the minimum point.
otherwise with x_old = 4 then f ‘(4)> 0 => x_new <x_old so the solution will move left to approach the minimum point.
a) Gradient for function 1 variable:
Through the above images we can see that the gradient descent depends on many factors: if choosing different initial x points will affect the convergence process; or the learning rate is too large or too small, it also affects: if the learning speed is too small, the convergence speed is very slow, affecting the training process, while the learning rate is too large, then quickly reach the destination after a few times. The algorithm however does not converge around the target because the jump is too large.
b) Gradient descent for multivariable functions:
Basic gradient descent algorithm, easy to understand. The algorithm has solved the problem of optimizing the neural network model by updating the weights after each loop.
Because of its simplicity, the Gradient Descent algorithm still has many limitations such as dependence on initial initialization and learning rate.
For example, a function has 2 global minimums, depending on 2 initial initialization points, it will give 2 different final solutions.
Too much learning speed will make the algorithm not converge around the target because the jump is too big; or the small learning speed affects the training speed
2. Stochastic Gradient Descent (SGD)
Stochastic is a variation of Gradient Descent. Instead of after each epoch we will update the Weight 1 time, in each epoch with N data points we will update the weight N times. On the one hand, SGD will decrease the speed of an epoch. However, looking in another direction, SGD will converge very quickly after just a few epochs. SGD formula is similar to GD but performed on each data point.
Looking at the two pictures above, we can see that SGD has quite a zigzag path, not as smooth as GD. It is easy to understand because 1 data point cannot represent all of the data. Ask the question why should I use SGD instead of GD, even though its path is quite zigzag? Here, GD has a limit for large databases (several million data) so the derivative calculation on all data through each loop becomes cumbersome. Besides, education is not suitable for online learning . So what is online learning ? online learning is when the data is constantly updated (for example, adding registered users), each time we add data, we have to recalculate the derivative on all data => calculation time is long, the algorithm is not online anymore. . So SGD was born to solve that problem, because each time new data is added, only one data point needs to be updated, which is suitable for online learning.
An illustrative example: having 10,000 data points, after only 3 epochs we have a good solution, while for GD we have to use up to 90 epochs to achieve that result.
Solving algorithm for large databases that GD cannot do. This optimization algorithm is still commonly used today.
The algorithm has not yet solved 2 major disadvantages of gradient descent (learning rate, initial data point). So we must combine SGD with some other algorithms such as: Momentum, AdaGrad, .. These algorithms will be presented in the following section.
To overcome the above limitations of the Gradient Descent algorithm, we use gradient descent with momentum. So what is gradient with momentum?
To explain the Gradient with Momentum, we should first look from a physical perspective: As shown in Figure b above, if we drop 2 marbles at 2 different points A and B, the ball A will slide down to point C Ball B will slide down to point D, but we don’t want ball B to stop at point D (local minimum) that will continue to roll to point C (global minimum). To do that, we have to give marble B an initial speed that is large enough that it can pass point E to point C. Based on this idea, Momentum algorithm (ie, follow the momentum advance ).
From a mathematical perspective, we have the Momentum formula:
xnew = xold – (gama.v + learningrate.gradient)
xnew: new coordinates
xod: old coordinates
gama: parameter, usually = 0.9
learningrate: learning speed
gradient: the derivative of the function f
Through the two above illustrations of the function f (x) = x.2 + 10sin (x), we see GD without momentum will converge after 5 loops but not a global minimum. But GD with momentum takes many loops, but the solution reaches the global minimum, through the picture we see it will surpass the speed to reach the global minimum point and fluctuate around that point before declaring it stops.
The optimal algorithm solves the problem: Gradient Descent does not reach the global minimum point but only stops at local minimum.
The momentum helps the ball to cross the slope towards the destination, however, when it gets close to the destination, it still takes a lot of time to fluctuate before stopping completely, which is explained because the marble has momentum.
Unlike previous algorithms, the learning rate is almost the same in training process (the learning rate is constant), Adagrad considers the learning rate as a parameter. That is, Adagrad will let the learning rate change after each time t.
gt: gradient at time t
ϵ: error avoidance factor (divide by sample by 0)
G: is the diagonal matrix where each element on the diagonal (i, i) is the square of the derivative of the parameter vector at time t.
One obvious benefit of Adagrad is to avoid manually adjusting the learning rate, just leave the default learning speed to 0.01, the algorithm will automatically adjust.
Adagrad’s weakness is that the variable sum of squares will grow over time until it makes the learning speed extremely small, causing the training to freeze.
RMSprop solves Adagrad’s descending learning rate problem by dividing the learning rate by the mean of the square of the gradient.
The most obvious advantage of RMSprop is that it solves the problem of gradual learning speed of Adagrad (the problem of decreasing learning speed over time will slow training down, possibly leading to freezing)
The RMSprop algorithm can give a result that is only a local minimum, not a global minimum like Momentum. Therefore, we will combine both Momentum algorithms with RMSprop to produce an optimal Adam algorithm. We will cover it in the next section.
As mentioned above Adam is a combination of Momentum and RMSprop. If explained by a physical phenomenon, Momentum is like a ball that plunges downhill, and Adam is like a very heavy ball with friction, so it easily passes the local minimum to global minimum and when it reaches the global minimum it does not. It takes a long time to oscillate back and forth around the target because it has friction so it is easier to stop.
Why has that formula? That is considered as an exercise for you, …….. Actually, I’m shy to find out.
There are many optimization algorithms such as Nesterov (NAG), Adadelta, Nadam, … but I will not cover this in this article, I just focus on the most commonly used optimizers. Currently the most commonly used optimizers are ‘Adam’.
From the picture above we can see that the ‘Adam’ optimizer works quite well, moving to a minimum faster than other methods. Through this article, I hope you can understand and familiarize yourself with the optimizer in machine learning problems, especially deep learning. There are many variations of GD, but I would like to stop writing here.
| Hope the article is useful to you |