# Introduction

Image-to-image translation is a class of computer vision problems whose goal is to learn a mapping between input images and output images. It can be applied to a number of areas such as style transfer, image colorization, sharpening, data generation for segmentation, and face filters.

Normally, to train an image-to-image translation model, you need a large number of input/label image pairs: for example, a color image and its grayscale version, or a blurry image and its sharpened version. However, preparing a dataset this way can be quite expensive in some cases, such as style-transferring photos from summer to winter (you would need landscape shots of the same scene under both conditions) or turning a regular horse into a zebra (it is practically impossible to photograph the same animal both as a horse and as a zebra).

Since paired datasets are virtually non-existent for such tasks, there is a need for a model capable of learning from unpaired data. More specifically, any two unrelated image sets can be used: general features are extracted from each collection and used during image translation. This is called the unpaired image-to-image translation problem.

One successful approach to unpaired image-to-image translation is CycleGAN.

# CycleGAN architecture

CycleGAN is based on the Generative Adversarial Network (GAN). The GAN architecture is an approach to training an image generation model with two neural networks: a generator network and a discriminator network. The generator takes a random vector from the latent space as input and produces a new image; the discriminator takes an image as input and predicts whether it is real (taken from the dataset) or fake (produced by the generator). The two models compete against each other: the generator is trained to produce images that deceive the discriminator, while the discriminator is trained to better distinguish generated images from real ones.

CycleGAN is an extension of the classic GAN architecture that includes 2 generators and 2 discriminators. The first generator, called G, takes an input image from domain X (zebra) and converts it to domain Y (regular horse). The other generator, called F, converts images from domain Y back to domain X. Each generator network has a corresponding discriminator:

- $D_{Y}$: distinguishes images taken from domain Y from translated images $G(x)$.
- $D_{X}$: distinguishes images taken from domain X from translated images $F(y)$.

## Generator

The CycleGAN generator is based on this paper and consists of three components: an encoder, a transformer, and a decoder.

The encoder section consists of 3 convolutional layers; the last 2 have stride 2 to reduce the spatial size of the image and increase the number of channels. The encoder output feeds the transformer, which consists of 6 residual blocks as in ResNet, with batch normalization in the residual blocks replaced by instance normalization. Finally, the decoder consists of 3 transposed convolution layers, which restore the image to the size and channel count of the output domain.
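The encoder–transformer–decoder structure above can be sketched in PyTorch. This is an illustrative sketch, not the authors' exact code: the filter counts, kernel sizes, and padding choices here are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One transformer block: two 3x3 convs with instance norm and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64, n_res=6):
        super().__init__()
        self.model = nn.Sequential(
            # encoder: 3 conv layers, the last 2 with stride 2
            nn.Conv2d(in_ch, base, 7, stride=1, padding=3),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True),
            # transformer: 6 residual blocks
            *[ResidualBlock(base * 4) for _ in range(n_res)],
            # decoder: 3 transposed conv layers back to the input resolution
            nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, in_ch, 7, stride=1, padding=3),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, x):
        return self.model(x)
```

Because the two stride-2 encoder layers are mirrored by two stride-2 transposed convolutions, the output has the same spatial size as the input.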

## Discriminator

The discriminator uses the PatchGAN architecture. Usually in classification problems, the network output is a scalar value: the probability of belonging to some class. In the CycleGAN model, the authors designed the discriminator so that its output is an $N \times N \times 1$ feature map. In effect, the discriminator divides the input image into an $N \times N$ grid of patches and classifies each patch as real or fake.
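A minimal PatchGAN-style discriminator can be sketched in PyTorch as a stack of strided convolutions ending in a 1-channel score map. The layer widths and depth here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: instead of a single scalar, it outputs
    an N x N x 1 map of real/fake scores, one per receptive-field patch."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(2):  # two more downsampling stages
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        # final 1-channel conv produces the N x N x 1 score map (logits)
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```

Each value in the output map scores one patch of the input, so the loss is applied per patch rather than per image.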

# Loss function

## Adversarial loss

During training, generator G tries to minimize the adversarial loss by producing translated images G(x) (where x is an image from domain X) that look as similar as possible to images from domain Y. Conversely, the discriminator $D_{Y}$ tries to maximize the adversarial loss by distinguishing translated images G(x) from real images y from domain Y.

$L_{adv}(G, D_{Y}, X, Y) = \frac{1}{n}\sum_{i} \log D_{Y}(y_{i}) + \frac{1}{n}\sum_{i} \log\left(1 - D_{Y}(G(x_{i}))\right)$

The adversarial loss has the same form for generator F and discriminator $D_{X}$:

$L_{adv}(F, D_{X}, Y, X) = \frac{1}{n}\sum_{i} \log D_{X}(x_{i}) + \frac{1}{n}\sum_{i} \log\left(1 - D_{X}(F(y_{i}))\right)$
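The shared form of the two adversarial losses can be sketched numerically, assuming the discriminator outputs probabilities and the average is taken over a batch of n samples:

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """L_adv = (1/n) sum log D(real) + (1/n) sum log(1 - D(translated)).
    d_real: discriminator scores on real images from the target domain.
    d_fake: discriminator scores on translated (generated) images."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# The discriminator wants this value high; the generator wants it low.
d_real = np.array([0.9, 0.8])  # confident "real" on real images
d_fake = np.array([0.1, 0.2])  # confident "fake" on translated images
print(adversarial_loss(d_real, d_fake))  # negative, approaches 0 as D becomes perfect
```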

## Cycle consistency loss

Adversarial loss alone is not enough for the model to give good results. It only pushes the generator toward producing any output image that plausibly belongs to the target domain, not the specific desired output. For example, in the zebra-to-horse problem, the generator could turn a zebra into a very beautiful regular horse that shares no features with the original zebra.

To solve this problem, cycle consistency loss is introduced. In the paper, the authors argue that if an image x from domain X is translated into domain Y and then translated back to domain X using the two generators G and F, we should recover the original image x: $x \rightarrow G(x) \rightarrow F(G(x)) \approx x$.

$L_{cycle}(G, F) = \frac{1}{n}\sum_{i} \left( |F(G(x_{i})) - x_{i}| + |G(F(y_{i})) - y_{i}| \right)$
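The cycle consistency term is an L1 reconstruction penalty, which can be sketched as follows (a minimal illustration: `x_rec` and `y_rec` stand for the reconstructions F(G(x)) and G(F(y))):

```python
import numpy as np

def cycle_loss(x, x_rec, y, y_rec):
    """L1 cycle-consistency loss: |F(G(x)) - x| + |G(F(y)) - y|,
    averaged over the batch."""
    return np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y))

# A perfect cycle reconstructs the inputs exactly, giving zero loss.
x = np.ones((2, 4, 4))   # batch of images from domain X
y = np.zeros((2, 4, 4))  # batch of images from domain Y
print(cycle_loss(x, x, y, y))  # 0.0
```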

## Full loss

$L = L_{adv}(G, D_{Y}, X, Y) + L_{adv}(F, D_{X}, Y, X) + \lambda L_{cycle}(G, F)$

where $\lambda$ is a hyperparameter, chosen as 10 in the paper.
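One generator update under the full objective can be sketched in PyTorch. The four networks below are hypothetical stand-ins (single convolutions), just to make the loss computation concrete; real training would also update the discriminators on their own objective.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two generators and two discriminators;
# any networks with matching input/output shapes would do here.
G = nn.Conv2d(3, 3, 3, padding=1)    # translates X -> Y
F = nn.Conv2d(3, 3, 3, padding=1)    # translates Y -> X
D_Y = nn.Conv2d(3, 1, 3, padding=1)  # patch scores on domain Y
D_X = nn.Conv2d(3, 1, 3, padding=1)  # patch scores on domain X

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 10.0  # weight of the cycle term, as in the paper

x = torch.randn(1, 3, 32, 32)  # image from domain X
y = torch.randn(1, 3, 32, 32)  # image from domain Y

fake_y, fake_x = G(x), F(y)
# generators try to make the discriminators label translations as real (1)
adv = bce(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
    + bce(D_X(fake_x), torch.ones_like(D_X(fake_x)))
# cycle consistency: translating there and back should recover the input
cyc = l1(F(fake_y), x) + l1(G(fake_x), y)

loss = adv + lam * cyc  # full objective for the generators
loss.backward()
```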

# Some results

Style transfer of paintings to photos

Zebras to regular horses

Apples to oranges

Human faces to dolls