When overfitting occurs, one effective measure is to increase the amount of training data. Popular approaches include data augmentation and adversarial training, which generate more data from the **input**, and more recent methods such as label smoothing and CutMix, which apply augmentation to the **label**. Building on this idea, Xavier Gastaldi devised a technique that applies augmentation to the internal representation (or **feature space**), called Shake-Shake regularization, applied to the ResNet family of architectures. Let's find out!

## 1. Shake-Shake Regularization

Conventional branched ResNet network architecture

$$x_{i+1} = x_i + \mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + \mathcal{F}(x_i, \mathcal{W}_i^{(2)})$$

where $x_i$ is the input of residual block $i$, $x_{i+1}$ is its output, $\mathcal{F}$ is the residual function, and $\mathcal{W}_i^{(1)}$, $\mathcal{W}_i^{(2)}$ are the weights of the two residual branches.

Shake-Shake regularization adds a scaling coefficient $\alpha$:

$$x_{i+1} = x_i + \alpha_i \mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + (1 - \alpha_i) \mathcal{F}(x_i, \mathcal{W}_i^{(2)})$$

where $\alpha_i$ is a random variable drawn from the uniform distribution $U(0, 1)$.

*Note: at test time, $\alpha$ is fixed to 0.5, analogous to the expectation scaling of a Dropout(0.5) layer.*

The coefficient $\alpha$ is redrawn from the distribution before each pass, so the forward pass and the backward pass see different random scalings during training. This randomness reduces the model's ability to memorize the training data; the author also describes it as a form of *gradient augmentation*.
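To make the mechanism concrete, here is a minimal NumPy sketch (an illustration, not the author's implementation): the forward pass blends the two branch outputs with a random `alpha`, while the backward pass redistributes the incoming gradient with an independent random `beta`.

```python
import numpy as np

rng = np.random.default_rng(0)

def shake_shake_forward(x, f1, f2, alpha):
    """Blend the two residual-branch outputs f1 = F(x, W1) and f2 = F(x, W2)."""
    return x + alpha * f1 + (1.0 - alpha) * f2

def shake_shake_backward(grad_out, beta):
    """Scale the gradient flowing into each branch with an independent beta."""
    return beta * grad_out, (1.0 - beta) * grad_out

# toy branch outputs for one sample
x = np.ones(4)
f1 = np.full(4, 2.0)
f2 = np.full(4, 4.0)

alpha = rng.uniform()  # redrawn before every forward pass
beta = rng.uniform()   # redrawn before every backward pass, independent of alpha

y_train = shake_shake_forward(x, f1, f2, alpha)
g1, g2 = shake_shake_backward(np.ones(4), beta)

# at test time the coefficient is fixed to 0.5 (the mean of U(0, 1))
y_test = shake_shake_forward(x, f1, f2, 0.5)
```

Note that `g1 + g2` always equals the incoming gradient; the randomness only redistributes it between the two branches.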

## 2. Analyze the results

For training, the author proposed several modes for setting the scaling coefficients:

**Shake**: all coefficients are redrawn at random (new $\alpha$) before each pass

**Even**: the scaling coefficient is fixed to 0.5 before the forward pass

**Keep**: during the backward pass, the scaling coefficients from the forward pass are reused

Levels of application:

**Batch** level: the same scaling coefficient is used for every image in a **batch**

**Image** level: a different scaling coefficient is drawn for each data point (image)
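The difference between the two levels can be shown with a short NumPy sketch (hypothetical shapes, not the paper's code): batch level draws one coefficient shared by the whole batch, while image level draws an independent coefficient per sample.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size = 8

# Batch level: a single alpha, broadcast to every image in the batch.
alpha_batch = np.full(batch_size, rng.uniform())

# Image level: an independent alpha for each data point (image).
alpha_image = rng.uniform(size=batch_size)
```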

#### Results on the CIFAR-10 dataset

#### Results on the CIFAR-100 dataset

Overall, the configuration with Shake on the forward pass and Shake on the backward pass gives the best results on CIFAR-10, with a much lower error rate (only 2.86). Perhaps this is why the author named the method Shake-Shake (Forward – Backward).

#### Comparison with other SOTA methods

With Shake-Shake regularization, the model achieves a much lower error rate than previous models.

## 3. Shake-Drop Regularization

Shake-Shake regularization works very well, but it still has some weaknesses:

- Designed for 3-branch structured ResNet (ResNeXt)
- The real reason for the effectiveness has not been clarified

Yoshihiro Yamada and colleagues proposed a method to broaden the applicability of Shake-Shake.

Specifically, the team added RandomDrop (also called ResDrop, or Stochastic Depth) to the network. RandomDrop can be understood as a simple form of Dropout, except that it drops entire **layers** (residual branches) instead of individual nodes.

ResNet network architecture:

$$x_{i+1} = G(x) = x + \mathcal{F}(x)$$

Random Drop:

$$G(x) = \begin{cases} x + b_l \mathcal{F}(x) & \text{in train forward} \\ x + b_l \mathcal{F}(x) & \text{in train backward} \\ x + E[b_l]\, \mathcal{F}(x) & \text{in test} \end{cases}$$

where $b_l \in \{0, 1\}$ is a Bernoulli random variable with $P(b_l = 1) = E[b_l] = p_l$. The survival probability follows the linear decay rule $p_l = 1 - \frac{l}{L}(1 - p_L)$ with $p_L = 0.5$, so deeper blocks (larger $l$) are dropped more often.
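A minimal NumPy sketch of RandomDrop (an illustration, assuming the linear decay rule above): during training the whole residual branch is kept or dropped by a Bernoulli draw, and at test time it is scaled by the expectation $p_l$.

```python
import numpy as np

def survival_prob(l, L, p_L=0.5):
    """Linear decay rule p_l = 1 - (l / L) * (1 - p_L): deeper blocks drop more often."""
    return 1.0 - (l / L) * (1.0 - p_L)

def random_drop(x, fx, l, L, rng, training=True):
    p_l = survival_prob(l, L)
    if training:
        b_l = float(rng.uniform() < p_l)  # Bernoulli(p_l): keep or drop the whole branch
        return x + b_l * fx
    return x + p_l * fx                   # test time: scale by E[b_l] = p_l

rng = np.random.default_rng(0)
x, fx = np.ones(3), np.full(3, 2.0)
# middle block of an L = 54 network: p_l = 1 - (27/54) * 0.5 = 0.75
y_eval = random_drop(x, fx, l=27, L=54, rng=rng, training=False)
```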

The ShakeDrop method proposed by the authors:

$$G(x) = \begin{cases} x + (b_l + \alpha - b_l \alpha)\, \mathcal{F}(x) & \text{in train forward} \\ x + (b_l + \beta - b_l \beta)\, \mathcal{F}(x) & \text{in train backward} \\ x + E[b_l + \alpha - b_l \alpha]\, \mathcal{F}(x) & \text{in test} \end{cases}$$

Regularization methods for the ResNet family. a) and b) are the previously used methods, Shake-Shake and RandomDrop; c) is a simple one-branch regularization method, and d) is its variant. “Conv” denotes a convolution layer; $E[x]$ is the expected value of $x$; $\alpha$, $\beta$, and $b_l$ are the random coefficients.
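The effect of the combined coefficient $(b_l + \alpha - b_l \alpha)$ can be checked with a short NumPy sketch (the ranges $\alpha \in [-1, 1]$ and $\beta \in [0, 1]$ follow the ShakeDrop paper; everything else is illustrative): when $b_l = 1$ the block behaves like a plain residual block, and when $b_l = 0$ the branch output is perturbed by $\alpha$ instead of being dropped outright.

```python
import numpy as np

def shake_drop_coeff(b_l, alpha):
    """Combined ShakeDrop coefficient: b_l + alpha - b_l * alpha."""
    return b_l + alpha - b_l * alpha

rng = np.random.default_rng(0)
alpha = rng.uniform(-1.0, 1.0)  # forward noise
beta = rng.uniform(0.0, 1.0)    # independent backward noise

kept = shake_drop_coeff(1, alpha)     # b_l = 1: coefficient is 1, plain residual block
dropped = shake_drop_coeff(0, alpha)  # b_l = 0: coefficient is alpha, perturbed branch
```

Since $E[\alpha] = 0$ over $[-1, 1]$, the test-time expectation $E[b_l + \alpha - b_l \alpha] = p_l$, so inference reduces to the same scaling as RandomDrop.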

## 4. Experiments

According to the experiments, ShakeDrop is most effective with PyramidNet, reaching an error rate of 3.08 on CIFAR-10 and 14.96 on CIFAR-100. ShakeDrop also improves results when applied to the 3-branch ResNeXt model.

## 5. Conclusion

ShakeDrop is a stochastic regularization that can be applied to a ResNet network to limit overfitting. Through experiments, the authors have shown that ShakeDrop outperforms previous methods (Shake-Shake and RandomDrop). You can refer to the original ShakeDrop paper for more of the authors' experimental results on the ResNeXt, Wide-ResNet, and PyramidNet networks with different depths on the ImageNet and COCO datasets, which I have not covered in this Viblo post.

Thank you for reading!