Why use activation functions in neural networks


If you work with neural networks, you are no stranger to activation functions such as Sigmoid, ReLU, or Softmax. But why does a neural network need them at all? In this article I will present what I consider the most basic reasons why activation functions are used in neural networks.

1. How neural networks work.

Before going into the explanation, I will briefly go over how a neural network works.

A neural network takes the output of the previous layer, multiplies it by the weights, sums the results, adds a bias, and finally passes the value through an activation function to produce the output of the current layer. The output of one layer then becomes the input of the next layer, as the gif below shows. Through this process the network can learn complex representations of the data. The formula for the output of a layer that receives input from the previous layer is:

A^{l} = activation(W^{l}A^{l-1} + B^{l})
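To make the formula concrete, here is a minimal sketch of one forward step in NumPy. The layer sizes, the choice of sigmoid as the activation, and the random initialization are illustrative assumptions of mine, not part of the original article.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def dense_forward(A_prev, W, B, activation=sigmoid):
    # A^l = activation(W^l A^{l-1} + B^l)
    Z = W @ A_prev + B      # linear part: weights times input, plus bias
    return activation(Z)    # nonlinearity applied element-wise

# Toy example: a layer with 4 inputs and 3 outputs (hypothetical sizes)
rng = np.random.default_rng(0)
A_prev = rng.normal(size=(4, 1))   # output of the previous layer
W = rng.normal(size=(3, 4))        # weight matrix W^l
B = np.zeros((3, 1))               # bias B^l
A = dense_forward(A_prev, W, B)
print(A.shape)  # (3, 1)
```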

2. Why using an activation function is necessary.

2.1 Creating nonlinearity in the model.

As you know, a simple problem in machine learning is linear regression, where a line or a hyperplane represents the model. In reality, however, data often has a complex distribution, and a linear function is not expressive enough to represent it. As shown above, a linear function does not fit the data as well as a quadratic function. For other classes of problems, such as natural language processing or computer vision, modeling with a linear function is almost impossible, so we need a nonlinear model. Consider a model with n layers and assume all layers are linear (no activation function is used). Then the output of the l-th layer is:

A^{l} = W^{l}A^{l-1} + B^{l} = W^{l}(W^{l-1}A^{l-2} + B^{l-1}) + B^{l} = W^{l}W^{l-1}A^{l-2} + W^{l}B^{l-1} + B^{l} = W_{d}A^{l-2} + B_{d}

From the formula above you can see that if the intermediate layers are purely linear, then even a very deep model is effectively a model with no hidden layers: the time spent training it is wasted, because the model learns nothing beyond a single linear map. A small numerical check of this collapse is sketched below.
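The following sketch (with hypothetical layer sizes and random weights) verifies numerically that stacking two purely linear layers is equivalent to one linear layer with W_d = W^l W^{l-1} and B_d = W^l B^{l-1} + B^l.

```python
import numpy as np

rng = np.random.default_rng(42)
A0 = rng.normal(size=(5, 1))                                # input A^{l-2}
W1, B1 = rng.normal(size=(4, 5)), rng.normal(size=(4, 1))   # layer l-1
W2, B2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))   # layer l

# Two stacked linear layers (no activation function)
A2_stacked = W2 @ (W1 @ A0 + B1) + B2

# One equivalent linear layer
W_d = W2 @ W1
B_d = W2 @ B1 + B2
A2_collapsed = W_d @ A0 + B_d

print(np.allclose(A2_stacked, A2_collapsed))  # True: the extra depth adds nothing
```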

2.2 Keeping output values within a certain range.

Suppose we do not use an activation function in a model with millions of parameters. The result of the linear multiplication from equation (1) could then be a very large value (towards positive infinity) or a very small one (towards negative infinity), which can cause numerical problems (NaN) and make the network hard to converge. Using an activation function limits the output to a certain range of values; for example, the sigmoid and softmax functions restrict the output to the interval (0, 1) no matter how large the result of the linear multiplication is.
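As a quick illustration (a toy example of my own, not from the article), even very large pre-activations stay inside (0, 1) after the sigmoid, while the raw linear output grows without bound:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-50.0, -10.0, 0.0, 10.0, 50.0])  # possible linear outputs W A + B
print(z)            # unbounded: spans many orders of magnitude
print(sigmoid(z))   # bounded: every value lies in (0, 1)
```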

3. Some activation functions

3.1 Sigmoid

If you have studied machine learning, you are no stranger to the sigmoid function from the logistic regression problem. The sigmoid function has a nice "S"-shaped curve. It is continuous, differentiable, and bounded in the interval (0, 1). The formula for the sigmoid function is as follows:

f(x) = \frac{1}{1 + e^{-x}}

This was a popular function in the past because it is differentiable at every point, but it is less commonly used nowadays for a number of reasons: its derivative is bounded in the interval (0, 0.25], so it easily causes the vanishing gradient phenomenon, and the exponential makes it slower to compute. In general, the sigmoid function is still used in logistic regression or in attention mechanisms such as CBAM, SE block, ... A small sketch of the function and its derivative follows.
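A minimal sketch of the sigmoid and its derivative (the helper names are my own), showing that the gradient never exceeds 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative: sigma(x) * (1 - sigma(x))

x = np.linspace(-10, 10, 1001)
print(sigmoid(x).min(), sigmoid(x).max())   # stays inside (0, 1)
print(sigmoid_grad(x).max())                # 0.25, reached at x = 0
```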

3.2 Tanh

The tanh function is similar in shape to the sigmoid function; the difference is that it is symmetric about the origin and its output lies in the interval (-1, 1), while its other properties are the same as those of the sigmoid. The formula for the tanh function is as follows:

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
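A quick check (with toy values of my own choosing) that tanh is zero-centered and bounded in (-1, 1):

```python
import numpy as np

def tanh(x):
    # Equivalent to np.tanh(x); written out to match the formula above
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))             # values in (-1, 1), and tanh(0) = 0
print(tanh(-x) + tanh(x))  # ~0 everywhere: symmetric about the origin
```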

3.3 Softmax

This activation function is often used in the last layer of a classification network, where the output is the predicted probability of each class. The softmax formula is as follows:

f_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
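A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick that I am adding here, not something the article discusses:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) so the exponentials cannot overflow; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for C = 3 classes
probs = softmax(logits)
print(probs)          # e.g. [0.659, 0.242, 0.099]
print(probs.sum())    # 1.0: a valid probability distribution
```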

3.4 ReLU

This is a very popular activation function. The formula for the ReLU function is as follows:

f(x) = \max(0, x)

The advantage of ReLU is its simplicity, and it has been shown to speed up training. In addition, it is not bounded like the sigmoid or tanh functions, so it does not cause vanishing gradients through saturation. However, for negative inputs the output of ReLU is zero, which can lead to the "dying ReLU" problem, and in theory it has no derivative at zero; in practice the derivative at 0 is simply defined to be zero, and experiments show that the probability of the input landing exactly at 0 is very small. Because it is unbounded, it also has the downside of possibly causing exploding gradients, but ReLU usually works well in practice.
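A short sketch of ReLU and the derivative convention mentioned above (defining the gradient at exactly 0 as 0 is the common choice assumed here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 for x > 0, 0 for x <= 0 (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```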

3.5 ReLU6

As mentioned above, because ReLU is not bounded above it can cause exploding gradients, so ReLU6 caps the output of ReLU at 6 for inputs greater than 6. Its other properties are the same as those of ReLU presented above. The ReLU6 formula is as follows:

f(x) = \min(\max(0, x), 6)
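A one-line ReLU6 sketch (the inputs are toy values of my own):

```python
import numpy as np

def relu6(x):
    # Like ReLU, but clipped at 6 to keep activations bounded
    return np.minimum(np.maximum(0.0, x), 6.0)

x = np.array([-3.0, 2.0, 6.0, 100.0])
print(relu6(x))   # [0. 2. 6. 6.]
```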

3.6 LeakyReLU

As mentioned above, ReLU has the disadvantage that if the input is less than 0, its output is 0 (dying ReLU), so some nodes die immediately and learn nothing during training. LeakyReLU overcomes this drawback by using a hyperparameter alpha. The formula for the LeakyReLU function is as follows:

f(x) = \max(\alpha x, x)

From the formula above, if alpha equals 1 then LeakyReLU becomes a linear activation function, and as discussed earlier a linear activation function is normally not used. The default alpha is usually 0.01, but you can set alpha to any value you want. A small sketch is shown below.
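A minimal LeakyReLU sketch with the commonly used default alpha = 0.01 (the inputs are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # For 0 < alpha < 1: returns x when x >= 0 and alpha * x when x < 0
    return np.maximum(alpha * x, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(x))   # [-0.1  -0.01  0.    1.   10.  ]
```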

I would like to pause the article here. See you in another article.


