(Paper Explained) Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Tram Ho

Introduction

In super-resolution, CNNs have proven more accurate than traditional methods. With only a few convolution layers, the SRCNN network was able to outperform bicubic interpolation from the very early stages of training. However, in the paper Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, the authors propose a new method that improves on both accuracy and processing speed. To achieve that, they use a technique called the sub-pixel convolution layer.

Problems with the SRCNN network

image.png

In the SRCNN network, a low-resolution (LR) input image is first upsampled with bicubic interpolation so that it has the same size as the high-resolution (HR) image. This has two disadvantages:

  1. Enlarging the input image to the size of the output multiplies the workload. The cost includes both upscaling the image before feeding it to the model and running the model on the upsampled image, which is many times larger than the original. For the model in particular, if each side of the image is enlarged $n$ times, the amount of computation increases $n^2$ times. This gives SRCNN a long runtime and makes it unsuitable for real-time applications [2].
  2. Bicubic interpolation gives the model no additional information. In addition, the output of the model becomes dependent on the result of this interpolation.
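
The $n^2$ growth in the first point can be checked with a rough count of multiply-accumulate operations for a single convolution layer. This is only a sketch: the function name `conv_macs` and all sizes below are hypothetical values chosen for illustration, and the layer is assumed to use stride 1 with padding that preserves the spatial size.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
    """Multiply-accumulate count for one stride-1, size-preserving conv layer."""
    return height * width * in_ch * out_ch * kernel * kernel


# Hypothetical sizes, chosen only for illustration.
h, w, n = 100, 100, 3              # LR height, LR width, upscaling factor
in_ch, out_ch, kernel = 1, 64, 9   # one input channel, 64 filters of size 9x9

lr_cost = conv_macs(h, w, in_ch, out_ch, kernel)          # conv applied to the LR image
hr_cost = conv_macs(n * h, n * w, in_ch, out_ch, kernel)  # conv applied after upscaling

print(hr_cost / lr_cost)  # the ratio equals n ** 2
```

The same factor applies to every layer that runs on the upsampled image, which is why moving the upscaling to the end of the network pays off.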

Therefore, the authors of Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network propose a new method that addresses both weaknesses. Instead of upscaling at the input to match the high-resolution size of the output, they upscale at the end of the network, which reduces the computational cost of the model.

Efficient Sub-Pixel Convolutional Neural Network (ESPCN) Network Architecture

image.png

In the ESPCN network, feature extraction is performed much as in SRCNN. The difference is that the LR input image is not upscaled by bicubic interpolation; instead, it is passed directly through the hidden layers (convolution layers) to extract feature maps. After this step, we obtain feature maps in the low-resolution (LR) space. The next question is how to build the HR image from the extracted LR feature maps. Suppose that from an LR image $I_{LR}$ of size $H \times W \times C$ we want to produce an HR image of size $rH \times rW \times C$, where $r$ is the upscaling factor.

A first method that comes to mind is a deconvolutional layer (also called a transposed convolution). Whereas convolutional layers are often used to reduce the spatial dimensions (height and width), a deconvolutional layer does the reverse: it produces an output with a larger height and width than its input. In fact, the bicubic interpolation in SRCNN can also be regarded as a special case of a deconvolutional layer, because it too enlarges the input.
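
To make the link between interpolation and deconvolution concrete, here is a minimal 1-D sketch in pure Python. It uses linear interpolation as a simpler stand-in for bicubic interpolation; the function name `upsample_1d` and the sample values are made up for illustration. Each input value scatters a scaled copy of the kernel into the output, which is what a transposed convolution does.

```python
def upsample_1d(signal, kernel, stride):
    """Transposed convolution of a 1-D signal with the given stride."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            # Each input sample adds a weighted copy of the kernel,
            # shifted by `stride` positions per input index.
            out[i * stride + j] += value * weight
    return out


# A fixed triangular kernel with stride 2 reproduces linear interpolation.
result = upsample_1d([1.0, 3.0, 5.0], kernel=[0.5, 1.0, 0.5], stride=2)
print(result[1:6])  # interior samples, ignoring the two border values
```

With a learned kernel instead of a fixed one, the same operation becomes a trainable upsampling layer, but it still does its work on the enlarged output grid.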

Sub-pixel

When taking a digital image, the camera’s imaging system projects the scene onto an image plane and then performs sampling and quantization to produce a digital image. Sampling discretizes the coordinates of the pixels, and quantization discretizes the value of each pixel. Because of sensor limitations, an image is limited to a certain resolution, so it carries no information about what lies between two adjacent pixels. In the real world, however, there are many more points between those two pixels. These in-between points are called sub-pixels. In the example below, the square red points are sampled and appear in the image, while the round black points between them are not sampled; these are the sub-pixels.

image.png
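
The same idea can be sketched in one dimension. The list `scene` below is a hypothetical, finely sampled stand-in for the continuous real-world signal, and `r` is an assumed sampling step; neither comes from the paper.

```python
r = 3  # hypothetical sampling step: the sensor keeps 1 of every r points

# A finely sampled 1-D "scene", standing in for the continuous real world.
scene = [x * x for x in range(10)]

pixels = scene[::r]                                     # points the sensor records
sub_pixels = [v for i, v in enumerate(scene) if i % r]  # points lost in between

print(pixels)      # [0, 9, 36, 81]
print(sub_pixels)  # [1, 4, 16, 25, 49, 64]
```

Super-resolution by a factor of `r` amounts to estimating the `sub_pixels` values given only `pixels`.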

Efficient sub-pixel convolution layer

In this paper, the authors introduce a new layer type called the sub-pixel convolution layer. This layer consists of two steps. The first is an ordinary convolution that produces an output of size $H \times W \times r^2 C$. The second shuffles the pixels (pixel shuffle) to give an output of size $rH \times rW \times C$.

image.png
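
The shuffle step can be sketched with NumPy as a reshape followed by a transpose. This is an illustrative sketch, not the authors' code: the function name `pixel_shuffle`, the channels-last layout, and the channel ordering (row offset, then column offset, then colour channel) are assumptions, and libraries may order the channels differently.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange an (H, W, r*r*C) array into an (r*H, r*W, C) array."""
    h, w, ch = x.shape
    if ch % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = ch // (r * r)
    # Assumed channel order: (row offset, column offset, colour channel).
    x = x.reshape(h, w, r, r, c)
    # Move each offset axis next to the spatial axis it refines:
    # (H, row offset, W, column offset, C).
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h * r, w * r, c)


# A 2 x 2 feature map with 4 channels becomes a 4 x 4 single-channel image.
features = np.arange(16).reshape(2, 2, 4)
image = pixel_shuffle(features, r=2)
print(image.shape)     # (4, 4, 1)
print(image[:, :, 0])
```

Each LR position contributes its $r^2$ channel values to one $r \times r$ block of the HR output, so no arithmetic is performed in this step; the values are only moved.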

Using this layer has two main advantages:

  • It avoids the zero-padding that would otherwise affect the output.
  • It keeps the convolution in the low-resolution space. A deconvolution layer, by contrast, increases the computational cost because its convolution is performed in the high-resolution space.
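
To see that every convolution stays at LR size until the final shuffle, the tensor shapes can be traced through the network. The layer sizes used here (5x5 with 64 maps, 3x3 with 32 maps, 3x3 with $r^2 C$ maps) are taken from the ESPCN paper rather than from this article; the function name `espcn_shapes` and the assumption of size-preserving padding are made for illustration only.

```python
def espcn_shapes(h, w, c, r):
    """Return (layer name, output shape) pairs for a 3-layer ESPCN-style net."""
    # (name, output channels); the last conv emits r*r*c maps for the shuffle.
    layers = [("conv 5x5", 64), ("conv 3x3", 32), ("conv 3x3", r * r * c)]
    shapes = [("input (LR)", (h, w, c))]
    for name, out_ch in layers:
        # Assumed size-preserving padding: spatial size stays H x W.
        shapes.append((name, (h, w, out_ch)))
    shapes.append(("pixel shuffle (HR)", (r * h, r * w, c)))
    return shapes


for name, shape in espcn_shapes(h=32, w=32, c=1, r=3):
    print(name, shape)
```

Only the last entry has HR dimensions, and it is produced by rearranging values rather than by another convolution.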

Results

The ESPCN network achieves slightly better results than other networks such as SRCNN and TNRD.

image.png

However, the highlight of ESPCN is its runtime. With an upscaling factor of 3, ESPCN runs much faster than SRCNN and the other networks:

image.png

Conclusion

Thus, with only sub-pixel convolution and pixel shuffle, the ESPCN network reduces super-resolution execution time many times over while still improving accuracy compared with its predecessor, SRCNN.

References:

  1. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
  2. Accelerating the Super-Resolution Convolutional Neural Network

Source : Viblo