Shake-Shake Regularization and Shake-Drop Regularization for Deep Residual Networks

Tram Ho

When overfitting occurs, one effective remedy is to increase the amount of training data. Popular approaches include data augmentation and adversarial training, which generate additional data from the inputs, and more recent techniques such as label smoothing and CutMix, which also augment the labels. Starting from this idea, Xavier Gastaldi devised a technique that applies augmentation to the internal representation (the feature space): Shake-Shake regularization, applied to the ResNet family of architectures. Let’s find out!

1. Shake-Shake Regularization

A conventional ResNet block with two residual branches is written as:

$$x_{i+1} = x_i + \mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + \mathcal{F}(x_i, \mathcal{W}_i^{(2)})$$

where $x_i$ is the input, $x_{i+1}$ is the output, $\mathcal{F}$ is the residual (aggregated transformation) function, and $\mathcal{W}_i^{(1)}$, $\mathcal{W}_i^{(2)}$ are the weights of the two residual branches.

Shake-Shake regularization adds a scaling coefficient $\alpha$:

$$x_{i+1} = x_i + \alpha_i \, \mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + (1 - \alpha_i) \, \mathcal{F}(x_i, \mathcal{W}_i^{(2)})$$

where $\alpha_i$ is a random variable drawn from the uniform distribution on $(0, 1)$.

Note: at test time, $\alpha$ is fixed at 0.5 (its expected value), which plays the same role as the rescaling a Dropout(0.5) layer applies at inference.

The coefficient $\alpha$ is re-sampled from this distribution before each backward pass, so the scaling seen by the backward pass differs from the one used in the forward pass during training. This randomness reduces the model’s ability to memorize the training data; the author also describes it as a form of gradient augmentation.
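To make the mechanism concrete, here is a minimal sketch of the forward/backward decoupling in PyTorch, using a custom autograd function. The names `ShakeShakeFunction` and `shake_shake` are illustrative, not from the original code:

```python
import torch
from torch.autograd import Function

class ShakeShakeFunction(Function):
    """Mixes two residual branches with independent random coefficients
    in the forward and backward passes (batch-level, for simplicity)."""

    @staticmethod
    def forward(ctx, branch1, branch2):
        # alpha ~ U(0, 1) is used only for the forward pass
        alpha = torch.rand(1, device=branch1.device)
        return alpha * branch1 + (1.0 - alpha) * branch2

    @staticmethod
    def backward(ctx, grad_output):
        # a fresh coefficient beta ~ U(0, 1) "shakes" the gradient,
        # independently of the alpha used in the forward pass
        beta = torch.rand(1, device=grad_output.device)
        return beta * grad_output, (1.0 - beta) * grad_output

def shake_shake(branch1, branch2, training=True):
    if training:
        return ShakeShakeFunction.apply(branch1, branch2)
    # test time: both branches are scaled by the expected value 0.5
    return 0.5 * (branch1 + branch2)
```

The block output is then `x + shake_shake(F1(x), F2(x))`, matching the formula above.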

2. Analyze the results

During training, the author proposes several modes for the scaling coefficient, which can be applied independently to the forward and backward passes:
Shake: all scaling coefficients are re-sampled at random before the pass
Even: all scaling coefficients are set to 0.5 before the pass
Keep: the backward pass keeps the same scaling coefficients used in the forward pass

Application level (illustrated in the sketch below):
Batch level: the same scaling coefficient is used for every image in a mini-batch
Image level: a different scaling coefficient is sampled for each image in the mini-batch
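The only difference between the two levels is the shape of the sampled coefficient. A rough illustration, assuming 4-D image tensors of shape `(N, C, H, W)`:

```python
import torch

branch1 = torch.randn(8, 64, 32, 32)  # output of the first residual branch
branch2 = torch.randn(8, 64, 32, 32)  # output of the second residual branch

# Batch level: a single coefficient shared by every image in the mini-batch
alpha_batch = torch.rand(1, 1, 1, 1)
out_batch = alpha_batch * branch1 + (1 - alpha_batch) * branch2

# Image level: one coefficient per image, broadcast over channels and pixels
alpha_image = torch.rand(branch1.size(0), 1, 1, 1)
out_image = alpha_image * branch1 + (1 - alpha_image) * branch2
```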

Results on CIFAR10 set

Results on CIFAR100 set

Overall, shaking both the forward and backward passes gives the best result on CIFAR-10, with a much lower error rate (only 2.86%). Perhaps this is why the author named the method Shake-Shake, following the Forward–Backward mode naming.

Comparison with other SOTA methods

With Shake-Shake regularization, the model has a much lower error rate than previous models.

3. Shake-Drop Regularization

Shake-Shake regularization works very well, but it still has some weaknesses:

  • It is designed only for 3-branch ResNet architectures (ResNeXt)
  • The real reason for its effectiveness has not been clarified

Yoshihiro Yamada and colleagues proposed a method to extend the applicability of Shake-Shake.
Specifically, the team builds on RandomDrop (also known as ResDrop). RandomDrop can be understood as a simple form of Dropout, except that it drops entire residual layers instead of individual nodes.

ResNet architecture:

$$x_{i+1} = G(x) = x + \mathcal{F}(x)$$

Random Drop:

$$G(x) = \begin{cases} x + b_l \, \mathcal{F}(x) & \text{in train-forward} \\ x + b_l \, \mathcal{F}(x) & \text{in train-backward} \\ x + E[b_l] \, \mathcal{F}(x) & \text{in test} \end{cases}$$

where $b_l \in \{0, 1\}$ is a Bernoulli random variable with $P(b_l = 1) = E[b_l] = p_l$.
The survival probability follows the linear-decay rule $p_l = 1 - \frac{l}{L}(1 - p_L)$, with $p_L = 0.5$.
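As a small sketch of this rule, the survival probability and the per-block Bernoulli gate could be computed as follows (the function names are mine, not from the paper):

```python
import torch

def survival_prob(l, L, p_L=0.5):
    """Linear-decay rule: shallow blocks are almost always kept,
    while the deepest block (l = L) survives with probability p_L."""
    return 1.0 - (l / L) * (1.0 - p_L)

def random_drop_gate(l, L, p_L=0.5, training=True):
    p_l = survival_prob(l, L, p_L)
    if training:
        # b_l ~ Bernoulli(p_l): keep or drop the whole residual branch
        return torch.bernoulli(torch.tensor(p_l))
    # test time: scale the branch by its expected value E[b_l] = p_l
    return torch.tensor(p_l)

# Example: block 25 of a 100-block network survives with p_l = 1 - 0.25 * 0.5 = 0.875
```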

The ShakeDrop method proposed by the authors:

$$G(x) = \begin{cases} x + (b_l + \alpha - b_l \alpha) \, \mathcal{F}(x) & \text{in train-forward} \\ x + (b_l + \beta - b_l \beta) \, \mathcal{F}(x) & \text{in train-backward} \\ x + E[b_l + \alpha - b_l \alpha] \, \mathcal{F}(x) & \text{in test} \end{cases}$$

Figure: regularization methods for ResNet-family networks. (a) Shake-Shake and (b) RandomDrop are previously proposed methods; (c) is a simple one-branch regularization, from which (d) ShakeDrop is derived.
"Conv" represents a convolution layer; $E[x]$ is the expected value of $x$; $\alpha$, $\beta$, and $b_l$ are random coefficients.
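Putting both ideas together, here is a minimal PyTorch-style sketch of a ShakeDrop gate, again using a custom autograd function. The class name is illustrative, and the ranges $\alpha \in [-1, 1]$, $\beta \in [0, 1]$ are assumed here as the commonly used setting:

```python
import torch
from torch.autograd import Function

class ShakeDropFunction(Function):
    """Applies the gate (b_l + alpha - b_l * alpha) to a residual branch in the
    forward pass and an independent (b_l + beta - b_l * beta) in the backward pass."""

    @staticmethod
    def forward(ctx, residual, p_l):
        b_l = torch.bernoulli(torch.tensor(p_l, device=residual.device))
        alpha = torch.empty(1, device=residual.device).uniform_(-1.0, 1.0)
        ctx.b_l = b_l
        # gate = 1 when the block is kept (b_l = 1), alpha when it is dropped
        gate = b_l + alpha - b_l * alpha
        return gate * residual

    @staticmethod
    def backward(ctx, grad_output):
        b_l = ctx.b_l
        beta = torch.empty(1, device=grad_output.device).uniform_(0.0, 1.0)
        gate = b_l + beta - b_l * beta
        # gradient for `residual`; p_l is a plain float, so it gets None
        return gate * grad_output, None

def shake_drop(x, residual, p_l, training=True):
    if training:
        return x + ShakeDropFunction.apply(residual, p_l)
    # test time: E[b_l + alpha - b_l * alpha] = p_l, since E[alpha] = 0 for U(-1, 1)
    return x + p_l * residual
```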

4. Experiments

According to the experiments, ShakeDrop works best with PyramidNet, reaching an error rate of 3.08% on CIFAR-10 and 14.96% on CIFAR-100. ShakeDrop also gives better results when applied to the 3-branch ResNeXt model.

5. Conclusion

ShakeDrop is a stochastic regularization method that can be applied to ResNet-family networks to reduce overfitting. Through experiments, the authors show that ShakeDrop performs better than previous methods (Shake-Shake and RandomDrop). You can refer to the original ShakeDrop paper for more of the authors' experimental results on ResNeXt, Wide-ResNet, and PyramidNet with different depths, as well as on the ImageNet and COCO datasets, which are not covered in this post.

Thank you for reading!

Reference

  1. Shake-Shake regularization
  2. ShakeDrop Regularization for Deep Residual Learning
  3. RandomDrop
  4. Review: Shake-Shake Regularization (Image Classification)
  5. Shake-Shake regularization with Interactive Code [Manual Back Prop with TF]

Source: Viblo