Implementing a Vietnamese emotion classification model

Tram Ho

Although a bit later than expected, I wish you a happy and prosperous Year of the Rabbit!!!! 

Whether we want to classify and filter out negative comments, or check whether the emotional tone of a message we just typed suits our purpose, both needs come down to the problem of emotion classification for text. Emotion classification is a popular topic in natural language processing (NLP) and deep learning. In this article, I will walk you through the steps to implement a basic deep learning model that classifies the emotion of IMDB movie reviews translated into Vietnamese, using the PyTorch library.
I assume that you already have basic knowledge of neural networks, LSTM networks, and so on, so there are some topics that I will not explain in detail; I hope you will bear with me.

I have a group of friends who always reply with 🙂 in every situation. I think this is also an emotion classification problem, just without labels.

1. Implementation steps

Although the title says implementing a deep learning model, building the model itself is only a small step in the whole project. The most time-consuming part is still the data processing.

  • Step 1: Implement the Vocabulary class. We need a class that converts text into numeric form so it can be stored in a tensor before being fed into the model.
  • Step 2: Implement the IMDBDataset class. After converting the data into numeric form, we need to organize it systematically so that it is easier to retrieve and train on, then divide it into three sets: train, valid, and test.
  • Step 3: Implement the recurrent (RNN) model. I took the model from bentrevett, so this step only needs a few adjustments.
  • Step 4: Train and evaluate. This step is familiar from any deep learning problem.
  • Step 5: Check. Use the test set for a final check, and try the model on sentences you enter yourself.


Main implementation steps

I consider steps 1 and 2 to be the focus of this article, because I had to write code and use libraries that handle Vietnamese specifically. Most of the documents I read are for English, and the ready-made libraries for English could not be reused.

This article mainly explains the code, so if you prefer to read the code directly, you can use the following links.

The source code shown below is taken from my Google Colab notebook.
The repo is for those who need the source code as .py files.

2. Word Embedding

Of course, you can't just stuff a bunch of words into the model and expect it to understand them. We need to convert documents to numbers and store them in tensors, the data structure that PyTorch models take as input.
This is broken down into several subproblems that we often encounter in natural language processing:

  • Word segmentation: from a text string, we split out the individual words. For example, the string "Mình xin cảm ơn" is split into the list ["Mình", "xin", "cảm_ơn"].
  • Converting words to numbers: after getting a list of words, we need to convert them to numbers or vectors so that the model can perform math operations on them.

Word segmentation is made easy by the underthesea library, a library specialized for Vietnamese language processing.

To convert words to numbers, I will use the pre-trained PhoW2V word embeddings. I will explain a bit about this method; if you already know it, you can skip to the next part.

Several different word representation methods

The simplest way is to keep a word-to-number dictionary like {"tôi": 1, "xin_chào": 2} and then look up and replace each word in the sentence. However, a machine learning model can be misled by the order 1, 2, 3, … (numbers have an ordering, while words have no such relationship).
To avoid that, we can use one-hot encoding. Suppose we have 5 words ["tôi", "xin_chào", "cam", "chanh", "táo"]; each word is then represented by a vector of length 5 consisting of 0s and a single 1 marking the position of the word, like "tôi": [1, 0, 0, 0, 0], "xin_chào": [0, 1, 0, 0, 0]. Because any two of these vectors have a dot product of 0, the words are treated as unrelated. Real words are not like that: "east", "west", "south", and "north" are close in meaning, and for the model to learn well we also need to represent the semantics of words rather than leave them completely independent of each other. In addition, training data often has a vocabulary of more than 10,000 words, which makes the vectors very large and wasteful of storage.
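As a small illustration of the one-hot idea, the sketch below builds the vectors for the five example words and shows that every pair of different words has a dot product of 0. The helper names one_hot and dot are my own, chosen only for this example.

```python
# Toy vocabulary taken from the one-hot example in the text.
vocab = ["tôi", "xin_chào", "cam", "chanh", "táo"]

def one_hot(word, vocab):
    """Return a vector of 0s with a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

toi = one_hot("tôi", vocab)
xin_chao = one_hot("xin_chào", vocab)

print(toi)                  # [1, 0, 0, 0, 0]
print(xin_chao)             # [0, 1, 0, 0, 0]
print(dot(toi, xin_chao))   # 0

# Even two fruits look completely unrelated under one-hot encoding.
print(dot(one_hot("cam", vocab), one_hot("chanh", vocab)))  # 0
```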
The method that overcomes the above limitations, and also the most popular one, is word embedding. As with one-hot encoding, words are stored as vectors, but instead of only 0s and 1s the entries are real numbers. For example "tôi": [0.4, 0.23, 0.13, 0.58]. These numbers represent the semantics of the word: words that are closer in meaning have a smaller Euclidean distance between their vectors. This method is not perfect; it still has problems such as out-of-vocabulary (OOV) words and the lack of context-dependent representations. But it is good enough for our word representation needs.
If you wonder where the real numbers in word vectors come from, there are two ways. The first is to train them with a neural network such as Skip-Gram or CBOW. The second is to initialize them randomly and let them change while training on another problem. Here, to save time and improve training, I use the pre-trained PhoW2V vectors: 1,587,507 words with vectors of size 100, stored in vi_word2vec.txt.
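Here is a minimal sketch of reading such a file and comparing words by Euclidean distance. It assumes that vi_word2vec.txt uses the common word2vec text format (a header line with the word count and the vector size, then one word and its numbers per line); I use a tiny in-memory file with made-up 2-dimensional vectors so the example runs on its own. The function names are hypothetical.

```python
import io
import math

def load_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd' lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(word, vectors, k=1):
    """Return the k words whose vectors are closest to the given word."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: euclidean(vectors[word], vectors[w]))[:k]

# Made-up stand-in for the real file; the real vectors have 100 dimensions.
toy_file = io.StringIO(
    "3 2\n"
    "cam 0.9 0.1\n"
    "chanh 0.8 0.2\n"
    "tôi -0.5 0.7\n"
)
vectors, dim = load_word2vec_text(toy_file)
print(dim)                     # 2
print(nearest("cam", vectors)) # ['chanh']
```

With the real file, the toy handle would be replaced by something like open("vi_word2vec.txt", encoding="utf-8").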

Check the word embedding by finding the words closest in meaning to "Việt_Nam":

The returned results show that the word "Việt_Nam" is close to many words that are also names of other countries.

3. Vocabulary class

The Vocabulary class is created to segment words and convert text into numbers stored in a tensor (these numbers are later used to look up the word embedding). I implemented this class based on the source code of Assignment 4 in the Stanford CS224n course.

In the constructor, Vocabulary contains two tokens by default: "<unk>", which represents words that are not in the dictionary, and "<pad>", which is used as padding so that sentences have the same length, as I will explain later.
Special methods like __getitem__, __contains__, and __len__ are implemented to support expressions like vocab[idx], word in vocab, and len(vocab), which makes the class simpler to work with.

The class was made to convert words to numbers, which is done through __getitem__ and the word2id property. However, to check that the class converts properly, we also define the id2word property and method for the reverse direction.
The add method adds a word to the dictionary. Later it is used to add the words contained in the word embedding.
From this point on, I use the word document to refer to a text string (type string), and corpus to refer to a list of documents (type list(string)). The static method tokenize_corpus uses the word_tokenize function of the underthesea library to split the documents of a corpus into lists of separate words.

The final job of Vocabulary is to convert a corpus into a tensor and back. The corpus_to_tensor method takes an is_tokenized parameter: set it to True to skip the word segmentation step for a corpus that has already been segmented, and to False otherwise.
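The pieces described in this section can be sketched roughly as follows. This is not the exact class from the notebook: to keep it runnable without extra libraries, I assume the ids 0 and 1 for <pad> and <unk>, replace underthesea's word_tokenize with a plain whitespace split, and return nested Python lists instead of tensors (the methods are therefore named corpus_to_ids and ids_to_corpus here).

```python
class Vocabulary:
    """Minimal word <-> id mapping with <pad> and <unk> tokens."""

    def __init__(self):
        # Assumed ids: 0 for padding, 1 for unknown words.
        self.word2id = {"<pad>": 0, "<unk>": 1}
        self.id2word = {0: "<pad>", 1: "<unk>"}
        self.pad_id = 0
        self.unk_id = 1

    def __getitem__(self, word):
        # Unknown words fall back to the <unk> id.
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        return word in self.word2id

    def __len__(self):
        return len(self.word2id)

    def add(self, word):
        """Add a word if it is new and return its id."""
        if word not in self.word2id:
            idx = len(self.word2id)
            self.word2id[word] = idx
            self.id2word[idx] = word
        return self.word2id[word]

    @staticmethod
    def tokenize_corpus(corpus):
        # Stand-in tokenizer: the article uses underthesea's word_tokenize here.
        return [document.split() for document in corpus]

    def corpus_to_ids(self, corpus, is_tokenized=False):
        if not is_tokenized:
            corpus = self.tokenize_corpus(corpus)
        return [[self[word] for word in document] for document in corpus]

    def ids_to_corpus(self, ids):
        return [[self.id2word[i] for i in row] for row in ids]


vocab = Vocabulary()
for word in ["mình", "xin", "cảm_ơn"]:
    vocab.add(word)

ids = vocab.corpus_to_ids(["mình xin cảm_ơn thuần_hóa"])
print(ids)                       # [[2, 3, 4, 1]]
print(vocab.ids_to_corpus(ids))  # [['mình', 'xin', 'cảm_ơn', '<unk>']]
```

In a PyTorch setting, the nested lists would be padded to equal length and wrapped in a tensor before being passed to the model.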

To check that the class works properly, we create a vocab object and add the words contained in the word_embedding created above. Then we convert a sentence into a tensor and convert that tensor back into a sentence.

The class works as expected: the word thuần_hóa is not in the dictionary, so it comes back as <unk>.

4. IMDBDataset class

The data used to train the model is taken from the IMDB dataset, which includes 50,000 movie reviews labeled with a positive or negative sentiment. These reviews have been translated into Vietnamese with Google Translate to serve the purpose of the model.

From this data, I need to create an IMDBDataset class that can do the following:

  • Load and store the data from the CSV file VI_IMDB.csv.
  • Report the size of the dataset (the number of review-sentiment pairs).
  • Convert the reviews and sentiments to tensors so that they can be fed into the model.
  • Return the tuple (review, sentiment) at index idx, converted to tensors, when calling dataset[idx].

Here, I inherited from PyTorch's Dataset class with the aim of creating a PyTorch DataLoader later. But because I would have had to rewrite many things, I wrote a generator function instead of using DataLoader, so inheriting from Dataset is not actually necessary.

The class above does what I just listed. However, the following snippet deserves attention:

The sentiments_type variable stores the sentiment types and is used to create the sentiment2id property. Because the iteration order of a set is not guaranteed, I convert it to a list and sort it, so that sentiment2id always has the value {'negative': 0, 'positive': 1}.
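The idea can be shown in a few lines. The sample list of labels is made up; only the names sentiments_type and sentiment2id come from the article.

```python
# Made-up labels standing in for the sentiment column of the CSV file.
sentiments = ["positive", "negative", "negative", "positive"]

# sorted() turns the set into a list with a fixed, alphabetical order.
sentiments_type = sorted(set(sentiments))
sentiment2id = {sentiment: idx for idx, sentiment in enumerate(sentiments_type)}

print(sentiments_type)  # ['negative', 'positive']
print(sentiment2id)     # {'negative': 0, 'positive': 1}
```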

We initialize the dataset object. Segmenting the words of 50,000 reviews takes more than 15 minutes. To save time, I download a version of the VI_IMDB.csv file in which the sentences are already segmented, which makes creating the object much faster.

After loading all the data into dataset, we split it into 3 datasets, train_dataset, valid_dataset, and test_dataset, for training and testing.
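One simple way to make such a split is to shuffle the indices once and cut them into three parts. The sketch below is only an illustration: the 80/10/10 ratios, the fixed seed, and the function name split_indices are my assumptions, not values from the notebook.

```python
import random

def split_indices(n, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train, valid and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(n * test_ratio)
    n_valid = int(n * valid_ratio)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(50_000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 40000 5000 5000
```

Each index list can then be used to pick the corresponding (review, sentiment) pairs out of the full dataset.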

5. Create Batch Iterator from IMDBDataset

One epoch trains on the whole Dataset, and within an epoch the data is divided into many small batches. Because I use the packed padded sequences method (explained later), the sentences in a batch need to be sorted by length from longest to shortest, and these lengths are also passed to the model as input. Padding <pad> is then added so that all sentences have equal length and can be stacked into a tensor for training.

The batch_iterator function is a generator that takes a large Dataset and yields one batch at a time.
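A minimal sketch of such a generator is shown below. It works on plain Python lists of (review_ids, sentiment_id) pairs and assumes a padding id of 0; the version in the notebook works on the IMDBDataset object and returns tensors, so treat the signature here as hypothetical.

```python
def batch_iterator(examples, batch_size, pad_id=0):
    """Yield (padded_reviews, lengths, sentiments) for each batch.

    examples: list of (review_ids, sentiment_id) pairs.
    """
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        # Longest review first, which is what packing expects by default.
        batch = sorted(batch, key=lambda example: len(example[0]), reverse=True)
        lengths = [len(review) for review, _ in batch]
        max_len = lengths[0]
        # Right-pad every review to the length of the longest one in the batch.
        reviews = [review + [pad_id] * (max_len - len(review)) for review, _ in batch]
        sentiments = [sentiment for _, sentiment in batch]
        yield reviews, lengths, sentiments


examples = [([5, 6], 1), ([7, 8, 9, 10], 0), ([11], 1)]
batches = list(batch_iterator(examples, batch_size=2))
print(batches[0])  # ([[7, 8, 9, 10], [5, 6, 0, 0]], [4, 2], [0, 1])
print(batches[1])  # ([[11]], [1], [1])
```

Padding per batch rather than per dataset keeps short batches short, which saves computation.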

6. RNN class

From this point on, most of my source code is based on bentrevett's pytorch-sentiment-analysis tutorial. This tutorial is free and explained in great detail, and it inspired me to write this article.

The simple model includes an embedding layer that converts the tensor of word indices into a tensor of embedding vectors. This is then passed through the recurrent layer. Finally, the output of the recurrent layer is passed through a linear layer to return a number for each review representing positive (close to 1) or negative (close to 0) emotion.

As I said, I will not explain LSTM recurrent models or the Dropout layer, and will only emphasize the important points. The first is the packed padded sequences method, also known as packing.
A batch contains sentences of different lengths; for example, it may contain a 50-word sentence and a 100-word sentence. The 50-word sentence then needs 50 <pad> tokens appended to match. Since the padding carries no meaning, learning from and processing it only degrades the performance of the model.

Padding is only used to bring sentences to equal length so that they can form a tensor. When training the recurrent network, the padding tokens carry no meaning.

PyTorch provides the pack_padded_sequence function so that the recurrent network skips the padding positions. This function takes the tensor of padded sentences and a tensor with the original length of each sentence. The output of the recurrent network then needs to be "unpacked" by the pad_packed_sequence function so that it can be passed to other network layers.

In addition, when initializing the embedding layer, we have to specify the id of the padding token, so that during training the embedding layer does not change the embedding of this token.
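Putting the embedding layer, packing, the LSTM, and the linear layer together, a model along these lines might look like the sketch below. It is a simplified assumption of mine (a single unidirectional LSTM layer, batch-first tensors, toy sizes), not the exact class from the notebook, and it returns raw scores (logits); applying a sigmoid maps them to the 0 to 1 range described above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, pad_idx, dropout=0.5):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed during training.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, lengths):
        # text: [batch, seq_len] word ids, sorted from longest to shortest.
        embedded = self.dropout(self.embedding(text))
        # lengths must be a CPU tensor (or list) of the unpadded lengths.
        packed = pack_padded_sequence(embedded, lengths, batch_first=True)
        packed_output, (hidden, cell) = self.rnn(packed)
        # Unpack in case later layers need the per-token outputs.
        output, _ = pad_packed_sequence(packed_output, batch_first=True)
        # hidden[-1] is the last real (non-padding) state of each sentence.
        return self.fc(self.dropout(hidden[-1])).squeeze(1)


model = RNN(vocab_size=10, embedding_dim=4, hidden_dim=8, pad_idx=0)
text = torch.tensor([[2, 3, 4, 5], [6, 7, 0, 0]])  # second sentence is padded
lengths = torch.tensor([4, 2])
logits = model(text, lengths)
print(logits.shape)  # torch.Size([2])
```

Because of packing, the final hidden state of the second sentence is taken after its second word, not after the two padding positions.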

After initializing the model, we need to copy the pre-trained word embedding into the embedding layer of the model. This helps the model reach good results faster than training the embedding layer from scratch.
In addition, the vectors of <unk> and <pad> are initialized to zero vectors as a way of telling the model that these two tokens provide no information for the training process.
Unlike <pad>, the embedding vector of <unk> will be updated during training.
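The following sketch illustrates these three points on a toy embedding layer. The ids 0 and 1 for <pad> and <unk> and the random matrix standing in for the PhoW2V vectors are assumptions for the example only.

```python
import torch
import torch.nn as nn

PAD_IDX, UNK_IDX = 0, 1          # assumed ids of <pad> and <unk>
pretrained = torch.randn(6, 4)   # stand-in for PhoW2V: one row per vocabulary id

embedding = nn.Embedding(6, 4, padding_idx=PAD_IDX)
with torch.no_grad():
    embedding.weight.copy_(pretrained)           # load the pre-trained vectors
    embedding.weight[PAD_IDX] = torch.zeros(4)   # <pad> starts as a zero vector
    embedding.weight[UNK_IDX] = torch.zeros(4)   # <unk> starts as a zero vector

# One toy backward pass shows which rows receive a gradient.
ids = torch.tensor([[UNK_IDX, 2, PAD_IDX]])
embedding(ids).sum().backward()
print(embedding.weight.grad[PAD_IDX])  # all zeros: <pad> is never updated
print(embedding.weight.grad[UNK_IDX])  # all ones: <unk> is updated like any word
```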

7. Model training

This is an indispensable stage when working with neural networks. I use the Adam optimizer and the binary cross-entropy loss function (BCELoss), because this is a binary classification problem. I compute the loss and accuracy of the model after each epoch. Since this stage is quite simple, I only record the training results here. You can see the source code at the following Google Colab link.
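For readers who want the general shape of this loop, here is a minimal sketch under several assumptions of mine: a tiny linear layer and made-up data stand in for the RNN model and the review batches, and I use nn.BCEWithLogitsLoss, which applies the sigmoid internally. With nn.BCELoss, as named above, the model output would need to pass through a sigmoid first.

```python
import torch
import torch.nn as nn

def binary_accuracy(logits, labels):
    """Fraction of examples whose sigmoid score lands on the right side of 0.5."""
    preds = (torch.sigmoid(logits) >= 0.5).float()
    return (preds == labels).float().mean().item()

torch.manual_seed(0)
model = nn.Linear(3, 1)  # stand-in for the RNN model
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

# Made-up features and labels (1.0 = positive, 0.0 = negative).
features = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                         [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

losses = []
for epoch in range(50):
    model.train()
    optimizer.zero_grad()                    # clear gradients from the last step
    logits = model(features).squeeze(1)
    loss = criterion(logits, labels)
    loss.backward()                          # compute gradients
    optimizer.step()                         # update the parameters
    losses.append(loss.item())

model.eval()
with torch.no_grad():
    accuracy = binary_accuracy(model(features).squeeze(1), labels)
print(losses[0], losses[-1], accuracy)
```

In the real loop, the inner step runs once per batch from the batch iterator, and the loss and accuracy are averaged over the batches of each epoch.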

The model achieves nearly the same accuracy, over 80%, on the training set and the validation set. To confirm that the model is not overfitting, we evaluate it on the test set.

The test set also achieved an accuracy of over 80%. Great!

8. Checking with your own reviews

I will write two movie reviews with two different emotions. Recall that the emotions are labeled as follows:

That is, the closer to 0, the more negative the review, the closer to 1 the more positive.

As expected!

9. Conclusion

Through this article, we have gone through some important content as follows:

  • Representing Vietnamese words as word embeddings.
  • Converting Vietnamese text into tensors to train a deep learning model.
  • The packing method used in natural language processing.
  • Training a Vietnamese emotion classification model.

For those of you who need a .py version rather than a Jupyter notebook file, you can refer to this repo of mine.

This article is a bit long, so thank you for reading this far. I hope it helps you. If something is wrong or can be improved, please let me know in the comments.



Source : Viblo