A tiny model with huge accuracy


Why does the model need to be tiny?

Many people care only about training a deep learning model so that its accuracy is as high as possible. The resulting models can be enormously complex, and there is no hard rule for picking a good architecture. Building a model whose accuracy is good enough while keeping the amount of computation small is a problem well worth caring about, isn't it? There is a saying I quite like:

AI models that live only in Jupyter notebooks are dead models.

The point is that no matter how good your model and experiments are, if they stay in a Jupyter notebook they remain research artifacts. To apply a model in practice you need many things beyond accuracy. The statement also emphasizes what comes after modeling: post-processing and deployment on real systems. So for practical systems, we should aim to make our models as simple and as light as possible while still preserving accuracy. This is where the concept of model compression comes from, and these are the main reasons for it:

  • Reducing storage size: If your model is deployed on a powerful server, you can relax; there is little to worry about. Unfortunately, life is not always a dream, and we do not always get to deploy on a big machine. Most projects want to save money, so it is worth researching how to bring AI down to modest hardware: devices with limited storage such as a Raspberry Pi, a mobile phone, or an embedded device. Reducing the model size is therefore very important. Imagine shrinking a 1GB model down to 10MB; that is a different story entirely. If you ensemble 10 variants of the 1GB model you need 10GB of storage, whereas with the compressed versions you only need 100MB. Quite a difference, isn't it?
  • Speeding up inference: This follows naturally. If everyone had a powerful machine, as mentioned above, this would not matter much. However, most large AI models cannot run in real time on low-power hardware, so a lighter model is the better choice.
  • Accuracy changes only slightly: You may wonder whether simply choosing a small model will fail to reach the required accuracy. That is a fair question. But model compression does not simply mean training a smaller model instead of a large one. Compression means starting from a large model and reducing its size while preserving the accuracy of the original. The key point is that accuracy barely changes. This is the technique for designing a model that is small but mighty.

How to make a model that is "small but mighty"

From martial arts training to model training

It sounds odd at first: "small but mighty" brings to mind a certain wuxia series, doesn't it? The protagonists are usually teenage heroes, and what they have in common is that they were trained by a famous master or stumbled upon a secret martial arts manual. Nobody becomes a hero by accident. Back to training our model: if the model we build is tiny from the start and we want it to be extremely powerful, it will be difficult; we would work very hard for little result. Such models can only rely on their own limited strength. So how do we reach the higher realm? There are a few basic paths:

  • Learning the techniques and refining them into your own: This means learning from a superior master but knowing how to simplify creatively and turn the skill into your own style. For model compression, this corresponds to training a large model to good accuracy and then streamlining it, removing the unnecessary components to obtain a tiny model. This path is quite self-reliant: it requires a solid foundation (a good, sufficiently complex network architecture that can learn all the cases), but the results are usually very satisfying.
  • Absorbing internal energy from a predecessor: In wuxia stories there are occasionally lucky youngsters with no martial foundation at all who meet a benevolent master, receive his internal energy, and naturally become masters themselves. Training models is no different. There is a technique called knowledge distillation, in which a smaller model is trained by re-learning the knowledge already learned by a larger model, much like a master devoting a lifetime of skill to his true disciple.

The secret manual

The paths to martial mastery and to a good AI model are similar: there are ups and downs, and many things have to happen before you reach a good model. I hope that much is clear. Now it is time to show you the actual techniques of this model-compression art. I encourage you to read this section carefully and not rush straight into practice; that is an easy way to hurt yourself. Let's get started.

Pruning

Network pruning is not a new concept, especially if you work a lot with deep learning models. The dropout you often use during training is itself a form of pruning that helps avoid overfitting. However, pruning as used in model compression is not quite the same as dropout. Dropout randomly drops a percentage of the connections between layers during training, whereas network pruning means finding the connections in the network that are not necessary (removing them barely affects accuracy) and cutting them away. The algorithm can be implemented in three steps:

  • Train the network: usually a large network with full connections between layers. Choosing a network that is large enough helps the model reach a certain level of accuracy.
  • Remove redundant connections: discard the connections whose weights are smaller than a predetermined threshold (see the sketch after this list). These weights are usually not very important and can be removed. Of course, removing them makes the network sparse, because many connections are set to 0, and this hurts accuracy at first. To recover the original accuracy you need the third step.
  • Retrain the pruned network: after pruning, the accuracy changes; your job is to retrain the remaining parameters of the trimmed model until it reaches roughly the original accuracy. In the end you get a model that is both small and mighty.
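In code, the second step boils down to a simple magnitude threshold. A minimal sketch (PyTorch; the threshold value is an assumption for illustration) looks like this:

```python
import torch

def magnitude_prune(weight: torch.Tensor, threshold: float) -> torch.Tensor:
    # Keep only the connections whose magnitude exceeds the threshold;
    # everything else is zeroed out.
    mask = (weight.abs() > threshold).float()
    return weight * mask
```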

Quantization and weight sharing

After compressing the model with pruning (in my experiment below it removed about 90% of the original network's weights), we use a technique called quantization to reduce the number of bits needed to store the weights. Put simply: imagine cooking for a multi-generation family. Your father-in-law likes 0.9 liters of fish sauce in the soup, your mother-in-law likes 0.95 liters, and your husband likes a full liter. Cooking a separate pot for each would be very laborious, so you pick one amount that suits the whole table, and 0.95 liters (your mother-in-law's preference) is the balanced, diplomatic choice. That is the rule of picking a representative value, and it is the main idea behind quantization and weight sharing: divide the weights of a layer into clusters and let every weight in a cluster share the same value, usually the centroid that best represents the cluster. It is just like picking the mother-in-law as the center of the family. For a clearer picture, see the figure below.

You can identify a few things:

  • The first weight matrix (top left) contains the original weights. This matrix is divided into 4 clusters, shown in blue, green, orange and pink, using k-means with k = 4.
  • Every weight in a cluster is then stored as the centroid of that cluster (top right). So with 4 clusters you only need to store 4 centroid values, plus 16 * 2 = 32 bits for the index matrix (16 weights, 2 bits each to encode one of the 4 clusters).
  • In general, the compression gained from quantization and weight sharing can be computed as follows. Suppose a layer has n weights, each stored with b bits, so the original cost is n*b bits. After quantizing into k clusters you need k*b bits to store the centroids plus n*log2(k) bits to store the indices. The compression ratio is therefore:

    rate = (n * b) / (k * b + n * log2(k))
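To make the formula concrete, here is a tiny helper (a sketch, not from the original article) that computes the ratio; plugging in the 4x4 example above (16 weights stored as 32-bit floats, 4 clusters) gives about 3.2x.

```python
import math

def compression_ratio(n, b, k):
    """Weight-sharing compression: n*b original bits versus k*b bits of
    centroids plus n*ceil(log2(k)) bits of cluster indices (indices take
    whole bits in practice)."""
    return (n * b) / (k * b + n * math.ceil(math.log2(k)))

print(compression_ratio(n=16, b=32, k=4))  # -> 3.2 for the toy example above
```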

Knowledge distillation

Knowledge distillation is a different idea for compressing a network: instead of trimming the old network itself, a larger, already trained neural network is used to teach a smaller one. The small network is taught to reproduce the feature maps after each convolutional layer of the big network, which act as soft labels for it. Put simply, instead of learning directly from the raw data, the small network re-learns how the big network learned: the weight distributions and feature maps that the large network has already generalized. The small network tries to mimic the large network's behaviour at every layer, not just the final loss. Regarding this "transfer of internal energy", I will dedicate a separate article to the techniques and how to implement such networks.
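Since the article defers the details to a later post, here is only a minimal sketch of the classic soft-label distillation loss (in the style of Hinton et al.); the temperature T and mixing weight alpha are illustrative values, not taken from the original article.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student matches the teacher's softened distribution
    # (scaled by T^2 so gradient magnitudes stay comparable), plus the usual
    # hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```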

Time to practice

Now for the part everyone has been waiting for: the code. After all, theory eventually has to meet practice. In this section I will not use any pruning library; instead I will walk you through implementing it from scratch so you understand the essence of the algorithm. It is not too complicated. Let's get started.

Pruning

The first method we will look at is pruning. For the demo in this article I use a simple fully connected architecture. Doing the same for CNN or LSTM models is straightforward once you understand the idea.

Import the necessary libraries

In this article I use PyTorch, so you first need to import a few libraries.
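The original listing is not reproduced here, so this is a plausible set of imports for everything that follows rather than the author's exact list:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
```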

Building base model

The pruning model should inherit from a PyTorch base Module. As discussed in the theory section, the essence of pruning is to pick a threshold and filter out the weights that are smaller than it (the less important weights). For simplicity, I use the standard deviation of the weights to compute the threshold.

You can tune the parameter s = 0.25, which scales the standard deviation into the pruning threshold.
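Below is a minimal sketch of such a base class (names like PruningModule and prune_by_std are my own choices; the original listing is not shown). It walks over the prunable layers and derives each layer's threshold from s times the standard deviation of its weights:

```python
class PruningModule(nn.Module):
    def prune_by_std(self, s=0.25):
        # Prune every maskable layer; the threshold scales with the spread
        # of that layer's weights. Prunable layers (see the next section)
        # expose a `mask` buffer and a `prune` method.
        for name, module in self.named_modules():
            if hasattr(module, "mask") and hasattr(module, "prune"):
                threshold = s * module.weight.data.std().item()
                print(f"pruning {name} with threshold {threshold:.4f}")
                module.prune(threshold)
```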

Build the pruning module

In this module (a sketch is given after the list below), pay attention to the following main components:

  • Mask: notice the self.mask attribute. This is a mask, or filter, that decides which weights take part in the computation and which do not. It is initialized as a matrix of ones; once a weight is pruned, the corresponding mask entry becomes zero.
  • Forward function: instead of multiplying the input by the weight directly, the weight is first multiplied by the mask. This removes the unnecessary weights after pruning.
  • Prune function: this does the actual pruning. On each call it finds the weights smaller than the given threshold and sets the mask and the weight at those positions to 0. Quite simple, isn't it?
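A minimal sketch of MaskedLinear consistent with the three points above (the initialization details are assumptions):

```python
class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # The mask starts as all ones: every connection is kept at first.
        self.register_buffer("mask", torch.ones(out_features, in_features))

    def forward(self, x):
        # Multiply the weight by the mask first, so pruned connections
        # contribute nothing to the output.
        return F.linear(x, self.weight * self.mask, self.bias)

    def prune(self, threshold):
        # Zero out mask and weight wherever |w| falls below the threshold.
        new_mask = (self.weight.data.abs() > threshold).float() * self.mask
        self.weight.data.mul_(new_mask)
        self.mask = new_mask
```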

Set up the fully connected network

Simply connect the modules together, just like in an ordinary classification problem. The network consists of 3 fully connected layers (instances of the MaskedLinear class built above), chained together to form our model.
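A sketch of the network (the layer widths are assumptions; MNIST images give 28 x 28 = 784 inputs and there are 10 classes):

```python
class Net(PruningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = MaskedLinear(28 * 28, 200)
        self.fc2 = MaskedLinear(200, 200)
        self.fc3 = MaskedLinear(200, 10)

    def forward(self, x):
        # Flatten the image and pass it through the three masked layers.
        x = x.view(-1, 28 * 28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```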

Set some hyperparameters

Before training, define the necessary parameters of the model.
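For example (these exact values are assumptions, not the author's):

```python
batch_size = 128
epochs = 10          # epochs for both the initial training and the fine-tuning
lr = 1e-3            # learning rate for Adam
s = 0.25             # scale factor for the std-based pruning threshold
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```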

Set up the dataloader

In this article I use the familiar MNIST dataset so that training is fast. It is set up with torchvision transforms.
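A standard MNIST setup (the normalization constants are the usual MNIST statistics):

```python
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=batch_size, shuffle=True)
test_loader = DataLoader(
    datasets.MNIST("data", train=False, download=True, transform=transform),
    batch_size=batch_size, shuffle=False)
```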

Then instantiate the model to prepare for the training step.
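Something like (device handling is an assumption):

```python
# Instantiate the pruning-aware network defined above.
model = Net().to(device)
```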

Define the optimizer

We use the familiar Adam optimizer, as in most classification problems.
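A sketch, using the assumed learning rate from above:

```python
optimizer = optim.Adam(model.parameters(), lr=lr)
```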

Train the model

This is an important step: once the architecture is in place, how the model is trained matters a great deal, especially when training after pruning. Pay attention to the gradient-masking step in the training loop sketched below.

That step sets the gradients of the pruned weights to zero so that the optimizer ignores them. Note that it has no effect during the first training run; it only matters after the network has been pruned and needs fine-tuning. Its purpose is to make the optimizer update only the remaining (important) weights. With this in place, you can start training.
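A sketch of the training loop consistent with that description; the per-batch gradient masking is the important part, everything else is standard:

```python
def train(model, loader, optimizer, epochs):
    model.train()
    for epoch in range(epochs):
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(data), target)
            loss.backward()
            # Zero the gradients of pruned weights so the optimizer leaves
            # them untouched during fine-tuning.
            for module in model.modules():
                if hasattr(module, "mask"):
                    module.weight.grad.data.mul_(module.mask)
            optimizer.step()
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")

train(model, train_loader, optimizer, epochs)
```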

Make a cup of coffee and wait for the model to run

Test the model

After training you have a model; next you need to evaluate it and check how many of its parameters are non-zero.

The evaluation function is also very basic; you just run it:
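A basic evaluation loop (a sketch; the original listing is not shown):

```python
def test(model, loader):
    model.eval()
    correct, total_loss = 0, 0.0
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += F.cross_entropy(output, target, reduction="sum").item()
            correct += (output.argmax(dim=1) == target).sum().item()
    n = len(loader.dataset)
    accuracy = 100.0 * correct / n
    print(f"test loss: {total_loss / n:.4f}, accuracy: {accuracy:.2f}%")
    return accuracy

test(model, test_loader)
```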

Running it prints the test loss and accuracy; at this point the unpruned model reaches about 98.18% on MNIST.

You can then save this value to a log file so it is easy to track.

And save the model weights.
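For example (the file name is arbitrary):

```python
torch.save(model.state_dict(), "mnist_fc_initial.pth")
```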

Count the non-zero parameters

This figure is meaningful and essential for deciding which layers to prune.
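A small utility along those lines (a sketch):

```python
def count_nonzero(model):
    total, nonzero = 0, 0
    for name, p in model.named_parameters():
        nz = int((p.data != 0).sum().item())
        total += p.numel()
        nonzero += nz
        print(f"{name}: {nz} / {p.numel()} non-zero")
    pruned_pct = 100.0 * (total - nonzero) / total
    print(f"total: {nonzero} / {total} non-zero ({pruned_pct:.2f}% pruned)")

count_nonzero(model)
```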

Running it on the freshly trained model, you can see that before pruning every weight in the network is non-zero.

Proceed to pruning

Pruning itself is a single command.
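Using the names from the sketches above, that single command would be:

```python
model.prune_by_std(s)
```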

Then test again to see how much the accuracy has dropped.

Save the results to the log file and double check the number of parameters of the network

You can see that the accuracy drops quite a lot, from 98.18% to 58.05%, while 94.01% of the parameters have been pruned, which corresponds to a compression ratio of about 16x. The next thing to do is retrain the network, again with a single command.
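With the sketches above, fine-tuning is just another call to train(); using a fresh optimizer so that stale momentum does not disturb the pruned weights is my assumption, not necessarily the author's exact setup:

```python
optimizer = optim.Adam(model.parameters(), lr=lr)
train(model, train_loader, optimizer, epochs)
```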

Sit back, have a coffee, and wait for the results. Once training is complete, retest the accuracy of the model.

Have you seen the magic yet? The accuracy ends up slightly higher than the original model, while the compression ratio is already 16x. Save the logs once more and the pruning stage is finished.

Next, to push the compression ratio further, we move on to quantization and weight sharing.

Quantization and weight sharing

Let me explain this step a little more; there are a few points to note:

  • Storage in compressed sparse row (CSR) or compressed sparse column (CSC) format: these are two formats for storing sparse matrices that save memory and make computation convenient. Remember that after pruning the weight matrix is very sparse, with about 94% of the weights equal to zero, so an appropriate structure is needed to store and compute with it. I will not explain the formats in depth; you can read up on them yourself if needed.
  • Using k-means to cluster: this is done in the function sketched below.

Here the number of bits used to store each weight index is 5, so there are at most 2^5 = 32 k-means clusters. After clustering, each cluster's centroid is shared back to all the weight positions belonging to that cluster.
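A sketch of both steps, clustering and centroid sharing, using scikit-learn's KMeans; the original listing is not shown, so the details are assumptions:

```python
from sklearn.cluster import KMeans

def apply_weight_sharing(model, bits=5):
    for module in model.modules():
        if hasattr(module, "mask"):
            w = module.weight.data.cpu().numpy()
            nonzero = w[w != 0].reshape(-1, 1)
            if nonzero.size == 0:
                continue
            k = min(2 ** bits, nonzero.size)
            kmeans = KMeans(n_clusters=k, n_init=10).fit(nonzero)
            # Replace every surviving weight by the centroid of its cluster.
            shared = kmeans.cluster_centers_[kmeans.predict(nonzero)].flatten()
            w[w != 0] = shared
            module.weight.data = torch.from_numpy(w).to(module.weight.device)

apply_weight_sharing(model, bits=5)
```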

After weight sharing, recompute the accuracy.

You can see that the accuracy after weight sharing is slightly better than before, although it is not always higher. If the result gets worse, you should revisit the pruning threshold so that the accuracy after fine-tuning is as high as possible.

Conclusion

Model compression is a very useful and important technique in the deployment step. It makes it possible to deploy models on low-end hardware and saves storage and memory. For well-known networks, the model size after compression is reduced many times over.

Hopefully, with this article, you will be able to carry out model pruning yourself. Goodbye everyone, and see you in the next post.


Source : Viblo