# 1. Preface

Multitask learning lets a single model perform many computer-vision tasks at the same time. For example, a face-analysis model can predict age, emotion and gender at once, or a plant model can both identify a flower and estimate how long ago it was planted. However, multitask problems usually require many labels on the training data, and it is often hard to find a single dataset that contains all the labels we want. So in this article I introduce the **CNN Shared Network** model, built on the TensorFlow framework, to help solve this data-shortage problem.

# 2. How does the CNN Shared Network model work?

We first build a training set composed of several datasets, depending on the purpose. The combined data is fed into a **CNN Shared Network** and then split into separate branches that perform the different tasks; the number of branches equals the number of desired outputs of the model.

The advantage of the CNN Shared Network model is that the shared network lets the model learn low-level features from many different datasets, improving accuracy, especially for tasks with limited data. In addition, the model can be used for both classification and regression.
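The idea of one shared trunk feeding several task heads can be sketched in a few lines of NumPy. This is a toy stand-in (a dense layer instead of the real convolutional backbone, random weights, made-up shapes), not the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_trunk(x, w):
    # Stand-in for the shared CNN: one dense layer + ReLU
    return np.maximum(x @ w, 0.0)

# Hypothetical shapes: 4 samples, 8 input features, 16 shared features
x = rng.normal(size=(4, 8))
w_shared = rng.normal(size=(8, 16))
features = shared_trunk(x, w_shared)   # computed once, shared by every branch

# Each branch is its own small head on top of the same features
w_smile, w_age, w_gender = (rng.normal(size=(16, k)) for k in (2, 5, 2))
smile_logits = features @ w_smile      # shape (4, 2)
age_logits = features @ w_age          # shape (4, 5)
gender_logits = features @ w_gender    # shape (4, 2)
```

The key point is that `features` is computed once and reused by all branches, so gradients from every task update the same shared weights.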

# 3. Building a multitask learning model to predict age, gender and smile

To illustrate the effectiveness of the CNN Shared Network model, I built a demo that predicts age, gender and smile based on the BKNET backbone network. For more information about BKNET, see the BKNET paper.

## 3.1. Load data

Here I use two main datasets: the IMDB-WIKI Age & Gender dataset and the GENKI-4K Smile dataset. For the specific data-loading code, see Multitask Age-Gender-Smile. One very important detail makes it possible for the model to accept combined data from many different datasets: we assign each label type an **index** so the data can be distinguished during processing. Index = 1 for Smile, index = 3 for Age, index = 4 for Gender. I have also normalized the data and converted the labels to one-hot vectors. **Note:** we one-hot encode using the maximum number of classes any label can have. For example, here the largest label space has 7 classes, so every label becomes a length-7 vector.

```python
def convert_data_format(self):
    if self.trainable:
        # Smile datasets
        for i in range(len(self.smile_train) * 10):
            image = (self.smile_train[i % 3000][0] - 128.0) / 255.0
            label = utils.get_one_hot_vector(7, int(self.smile_train[i % 3000][1]))
            index = 1.0
            self.all_data.append((image, label, index))

        # Age datasets
        for i in range(len(self.age_train)):
            image = (self.age_train[i][0] - 128.0) / 255.0
            label = utils.get_one_hot_vector(7, int(self.age_train[i][1]))
            index = 3.0
            self.all_data.append((image, label, index))

        # Gender datasets
        for i in range(len(self.gender_train)):
            image = (self.gender_train[i][0] - 128.0) / 255.0
            label = utils.get_one_hot_vector(7, int(self.gender_train[i][1]))
            index = 4.0
            self.all_data.append((image, label, index))
```
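The `utils.get_one_hot_vector` helper is not shown above; a minimal sketch of what such a function might look like (my assumption, not the repository's actual code):

```python
import numpy as np

def get_one_hot_vector(num_classes, label):
    # Length-num_classes vector with a 1.0 at position `label`
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec
```

So, for example, a smile label of class 2 encoded with 7 classes becomes `[0, 0, 1, 0, 0, 0, 0]`.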

## 3.2. Model

In this demo I use the BKNET model for training. The data first passes through a shared network of four `VGG_BLOCK`s and then splits into 3 branches corresponding to the 3 tasks: a Smile branch, a Gender branch and an Age branch. At the end of each branch a softmax classifier performs multiclass classification for that label based on the extracted features.

```python
# Extract features
x = utils.VGG_ConvBlock('Block1', self.input_images, 1, 32, 2, 1, self.phase_train)
x = utils.VGG_ConvBlock('Block2', x, 32, 64, 2, 1, self.phase_train)
x = utils.VGG_ConvBlock('Block3', x, 64, 128, 2, 1, self.phase_train)
x = utils.VGG_ConvBlock('Block4', x, 128, 256, 3, 1, self.phase_train)

# Smile branch
smile_fc1 = utils.FC('smile_fc1', x, 256, self.keep_prob)
smile_fc2 = utils.FC('smile_fc2', smile_fc1, 256, self.keep_prob)
self.y_smile_conv = utils.FC('smile_softmax', smile_fc2, 2, self.keep_prob, 'softmax')

# Gender branch
gender_fc1 = utils.FC('gender_fc1', x, 256, self.keep_prob)
gender_fc2 = utils.FC('gender_fc2', gender_fc1, 256, self.keep_prob)
self.y_gender_conv = utils.FC('gender_softmax', gender_fc2, 2, self.keep_prob, 'softmax')

# Age branch
age_fc1 = utils.FC('age_fc1', x, 256, self.keep_prob)
age_fc2 = utils.FC('age_fc2', age_fc1, 256, self.keep_prob)
self.y_age_conv = utils.FC('age_softmax', age_fc2, 5, self.keep_prob, 'softmax')
```

## 3.3. Loss function

To train on combined data from many different datasets, the way the data is handled in the **loss function** is very important.

First, we build three masks from the index attached to each sample, as mentioned in **Section 3.1**. The masks distinguish the different types of data within a batch.

```python
self.smile_mask = tf.cast(tf.equal(self.input_indexes, 1.0), tf.float32)
self.age_mask = tf.cast(tf.equal(self.input_indexes, 3.0), tf.float32)
self.gender_mask = tf.cast(tf.equal(self.input_indexes, 4.0), tf.float32)
```
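To see what these masks do, here is a small NumPy illustration with a hypothetical mini-batch of indexes (the sample values are made up):

```python
import numpy as np

# Hypothetical mini-batch: two smile samples, one age, one gender
indexes = np.array([1.0, 3.0, 4.0, 1.0])

smile_mask = (indexes == 1.0).astype(np.float32)   # [1, 0, 0, 1]
age_mask = (indexes == 3.0).astype(np.float32)     # [0, 1, 0, 0]
gender_mask = (indexes == 4.0).astype(np.float32)  # [0, 0, 1, 0]
```

Each mask is 1 exactly where a sample belongs to that task, so multiplying a per-sample quantity by the mask zeroes out samples from the other tasks.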

We then slice each task's label out of the input label according to its number of classes. Smile has 2 classes: `Smile, Not Smile`; Age has 5 classes corresponding to the ranges `1-13, 14-23, 24-39, 40-55, 56-80`; and Gender has 2 classes: `Male, Female`.

```python
self.y_smile = self.input_labels[:, :2]
self.y_age = self.input_labels[:, :5]
self.y_gender = self.input_labels[:, :2]
```
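A quick NumPy illustration of this slicing, using a made-up length-7 one-hot label for a smile sample:

```python
import numpy as np

# A length-7 one-hot label for a "Not Smile" sample (class 1 of 2)
label = np.zeros(7, dtype=np.float32)
label[1] = 1.0
batch_labels = label[None, :]      # shape (1, 7), a batch of one

y_smile = batch_labels[:, :2]      # first 2 columns hold the smile label
y_age = batch_labels[:, :5]        # first 5 columns would hold an age label
```

Because every label was padded to length 7, taking the first `k` columns recovers the `k`-class one-hot label for the task it belongs to; for samples from other tasks the slice is meaningless, but the mask zeroes those out in the loss anyway.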

When computing the correct predictions (`smile_true_pred`, `age_true_pred`, `gender_true_pred`) for each task, we must multiply by the corresponding `mask`: a single batch can contain smile, age and gender samples, so the mask keeps only the predictions belonging to each task. Finally, because the *Model* section uses softmax activation, the loss function here is cross-entropy.

**Note:** `tf.clip_by_value` keeps the softmax outputs away from zero, so the log function never receives 0 and the loss cannot blow up to infinity.

```python
# Extra variables
smile_correct_prediction = tf.equal(tf.argmax(self.y_smile_conv, 1), tf.argmax(self.y_smile, 1))
age_correct_prediction = tf.equal(tf.argmax(self.y_age_conv, 1), tf.argmax(self.y_age, 1))
gender_correct_prediction = tf.equal(tf.argmax(self.y_gender_conv, 1), tf.argmax(self.y_gender, 1))

self.smile_true_pred = tf.reduce_sum(tf.cast(smile_correct_prediction, dtype=tf.float32) * self.smile_mask)
self.age_true_pred = tf.reduce_sum(tf.cast(age_correct_prediction, dtype=tf.float32) * self.age_mask)
self.gender_true_pred = tf.reduce_sum(tf.cast(gender_correct_prediction, dtype=tf.float32) * self.gender_mask)

self.smile_cross_entropy = tf.reduce_mean(
    tf.reduce_sum(-self.y_smile * tf.math.log(tf.clip_by_value(tf.nn.softmax(self.y_smile_conv), 1e-10, 1.0)),
                  axis=1) * self.smile_mask)
self.age_cross_entropy = tf.reduce_mean(
    tf.reduce_sum(-self.y_age * tf.math.log(tf.clip_by_value(tf.nn.softmax(self.y_age_conv), 1e-10, 1.0)),
                  axis=1) * self.age_mask)
self.gender_cross_entropy = tf.reduce_mean(
    tf.reduce_sum(-self.y_gender * tf.math.log(tf.clip_by_value(tf.nn.softmax(self.y_gender_conv), 1e-10, 1.0)),
                  axis=1) * self.gender_mask)
```
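The masked cross-entropy above can be sketched in plain NumPy to make the mechanics explicit. This is an illustration of the same idea, not the repository's code; the clipping constant mirrors the snippet above:

```python
import numpy as np

def masked_cross_entropy(logits, labels, mask):
    # Softmax along the class axis (shifted for numerical stability)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # Clip mirrors tf.clip_by_value: keeps log() away from log(0)
    probs = np.clip(probs, 1e-10, 1.0)
    per_sample = -(labels * np.log(probs)).sum(axis=1)
    # Samples from other tasks have mask 0 and contribute nothing
    return (per_sample * mask).mean()
```

With the mask set to zero, a sample's loss vanishes entirely, which is exactly how one batch can carry samples for three different tasks at once.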

Finally, the total loss is the sum of the losses of the individual tasks, which keeps the tasks balanced. Separating the losses like this lets us use several kinds of loss functions in one model without them interfering with each other. Since we use an l2 regularizer, we also add `l2_loss` to the total loss.

`self.total_loss = self.smile_cross_entropy + self.gender_cross_entropy + self.l2_loss + self.age_cross_entropy`

# 4. Result

You can see the full data processing, model training & prediction, as well as the model's accuracy, in Multitask learning Age-Gender-Smile. Here are some of the results I obtained:

I hope this article solves some of your problems with data shortage, as well as the implementation details of building multitask models. Thank you for taking the time to read this post.

# References

- Effective Deep Multi-source Multi-task Learning Frameworks for Smile Detection, Emotion Recognition and Gender Classification
- Dinh Viet Sang, Le Tran Bao Cuong, Pham Thai Ha, Multi-task learning for smile detection, emotion recognition and gender classification, December 2017