Anyone who works with neural networks is no stranger to activation functions such as sigmoid, ReLU, or softmax. But why do neural networks need them at all? In this article I will present what I consider the most fundamental reasons activation functions are used in neural networks.
1. How neural networks work.
Before going into the explanation, I will briefly review how a neural network works. The output A^l of layer l is computed from the output A^{l-1} of the previous layer, the weight matrix W^l, and the bias B^l:
A^l = activation(W^l A^{l-1} + B^l) \quad (1)
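As a concrete illustration, the layer formula above can be sketched in NumPy. This is a minimal sketch, not a full framework; the function names (`dense_layer`, `sigmoid`) and shapes are my own choices for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(W, A_prev, B, activation=sigmoid):
    # A^l = activation(W^l . A^{l-1} + B^l)
    return activation(W @ A_prev + B)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # weights of layer l (4 units, 3 inputs)
B = rng.normal(size=(4, 1))       # biases of layer l
A_prev = rng.normal(size=(3, 1))  # output of layer l-1
A = dense_layer(W, A_prev, B)
print(A.shape)  # (4, 1)
```

With sigmoid as the activation, every entry of `A` lands in the interval (0, 1), which will matter in section 2.2.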
2. Why an activation function is necessary.
2.1 Creating nonlinearity in the model.
As you know, one of the simplest problems in machine learning is linear regression, where a line (or a hyperplane) represents the model. In practice, however, data often has a complex distribution, and a linear function is not expressive enough to represent it. Now consider what happens when layers are stacked without an activation function:
A^l = W^l A^{l-1} + B^l = W^l(W^{l-1} A^{l-2} + B^{l-1}) + B^l = W^l W^{l-1} A^{l-2} + W^l B^{l-1} + B^l = W_d A^{l-2} + B_d
From the formula above you can see that if the layers are purely linear, then no matter how deep the model is, it is mathematically equivalent to a model with no hidden layers at all: training it wastes time, and the model learns nothing beyond a single linear map.
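The collapse in the derivation above can be checked numerically. A minimal sketch in NumPy, with the folded parameters named `Wd` and `Bd` as in the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
W2, B2 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W1, B1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
A0 = rng.normal(size=(3, 1))

# Two stacked layers WITHOUT an activation function...
A2 = W2 @ (W1 @ A0 + B1) + B2

# ...are equivalent to a single linear layer with folded parameters.
Wd = W2 @ W1
Bd = W2 @ B1 + B2
assert np.allclose(A2, Wd @ A0 + Bd)
```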
2.2 Keeping output values within a certain range.
Suppose we don't use an activation function in a model with millions of parameters. The result of the linear products in equation (1) could then grow towards positive or negative infinity, causing numerical problems (NaN) and making the network hard to converge. An activation function can limit the output to a fixed range: for example, sigmoid and softmax bound the output to the interval (0, 1) no matter how large the result of the linear multiplication is.
3. Some activation functions
3.1 Sigmoid
If you have studied machine learning, you are no stranger to the sigmoid function from logistic regression. The sigmoid function has a nice "S"-shaped curve. It is continuous, differentiable, and bounded in the interval (0, 1). The formula for the sigmoid function is as follows:
f(x) = \frac{1}{1 + e^{-x}}
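A direct translation of this formula into NumPy, as a quick sketch:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}); squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                         # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # close to [0, 0.5, 1]
```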
3.2 Tanh
The tanh function is similar in shape to the sigmoid function, but it differs in that it is symmetric about the origin and bounded in the interval (-1, 1); otherwise it shares the sigmoid's properties (continuous and differentiable). The formula for the tanh function is as follows:
f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
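A small sketch showing that this formula matches NumPy's built-in `np.tanh` and that the function is symmetric about the origin (for moderate inputs; the naive formula overflows for very large |x|):

```python
import numpy as np

def tanh(x):
    # (e^x - e^{-x}) / (e^x + e^{-x})
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-3, 3, 7)
assert np.allclose(tanh(x), np.tanh(x))  # matches the built-in
assert np.allclose(tanh(-x), -tanh(x))   # odd function: symmetric about the origin
```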
3.3 Softmax
This activation function is often used in the last layer of a classification network, where the output is the predicted probability of falling into each class. The softmax function formula is as follows:
f_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, where C is the number of classes.
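A minimal NumPy sketch of this formula. Subtracting `max(z)` before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    # e^{z_i} / sum_j e^{z_j}; shifting by max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # probabilities, largest for the largest logit
print(p.sum())  # 1.0
```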
3.4 ReLU
This is a very popular activation function. The ReLU function formula is as follows:
f(x) = \max(0, x)
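In NumPy this is a one-liner:

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```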
3.5 ReLU6
As mentioned above, because ReLU is not bounded above, it can contribute to exploding gradients. ReLU6 therefore caps the output of the ReLU function at 6 for inputs greater than 6; its other properties are the same as those of ReLU presented above. The ReLU6 function formula is as follows:
f(x) = \min(\max(0, x), 6)
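A sketch in NumPy, directly following the formula:

```python
import numpy as np

def relu6(x):
    # clamp the ReLU output to at most 6
    return np.minimum(np.maximum(0, x), 6)

print(relu6(np.array([-1.0, 3.0, 10.0])))  # [0. 3. 6.]
```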
3.6 Leaky ReLU
As mentioned above, ReLU has the drawback that for inputs < 0 its output is 0 (the "dying ReLU" problem), which can cause some nodes to die immediately and learn nothing during training. Leaky ReLU overcomes this drawback by using a hyperparameter alpha. The formula for the Leaky ReLU function is as follows:
f(x) = \max(\alpha x, x)
From the formula above, note that if alpha equals 1, the function becomes a linear activation, which, as discussed above, is normally not used. The default alpha is usually 0.01, but you can set alpha to any value you want (typically a small positive number less than 1).
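A NumPy sketch with the usual default alpha of 0.01; note how a large negative input is "leaked" through instead of being zeroed out:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): for x < 0 the output is alpha * x, not 0
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-100.0, 5.0])))  # [-1.  5.]
```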
I will pause the article here; see you in another one.