Text data processing and model fine-tuning

Tram Ho

Hello everyone, today I will learn with you how to process text data and fine-tune a model for a text classification task.

Libraries for data processing
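Roughly, the imports below cover everything in this post; treat the exact list as my assumption (pandas and matplotlib for EDA, NLTK for stopwords, scikit-learn for the split, PyTorch and Hugging Face transformers for the model):

```python
import string
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords

import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from transformers import RobertaModel, RobertaTokenizer

# Download the English stopword list used in the EDA below.
nltk.download('stopwords')
```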

Exploratory Data Analysis

The dataset I used for this task is the Disaster Tweets dataset on Kaggle; you can download it via this link: https://www.kaggle.com/c/nlp-getting-started/data
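Loading the data might look like this (the competition's train.csv has a text column with the tweet and a target column that is 1 for disaster tweets; the name train_df is my choice):

```python
# Load the competition's training file; 'text' is the tweet,
# 'target' is 1 for disaster tweets and 0 otherwise.
train_df = pd.read_csv('train.csv')
print(train_df.shape)
print(train_df.head())
```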

Class distribution
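One way to plot the class counts (the labels and colors here are arbitrary):

```python
# Count tweets per class and plot the (im)balance.
counts = train_df['target'].value_counts()
plt.bar(['Not disaster (0)', 'Disaster (1)'],
        [counts[0], counts[1]],
        color=['tab:blue', 'tab:red'])
plt.title('Class distribution')
plt.ylabel('Number of tweets')
plt.show()
```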

We can see that the distribution of the training set is slightly skewed towards non-disaster tweets. But that's okay; the imbalance is only slight.

Average word length in a tweet
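For example, histograms of the average word length per tweet, split by class (avg_word_len is my own helper name):

```python
# Average word length of a single tweet.
def avg_word_len(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, target, title in zip(axes, [0, 1], ['Not disaster', 'Disaster']):
    lengths = train_df[train_df['target'] == target]['text'].apply(avg_word_len)
    ax.hist(lengths, bins=30)
    ax.set_title(title)
fig.suptitle('Average word length in a tweet')
plt.show()
```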

We can see that disaster tweets tend to have a higher average word length, i.e. longer words are used (probably due to emphasis).

Consider the amount of stopwords in the samples

Create a separate corpus for each target class for easy comparison.
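Something like this (the name create_corpus is my own; tokens are just lower-cased whitespace splits):

```python
# Collect all tokens from tweets of one class into a flat list.
def create_corpus(target):
    corpus = []
    for text in train_df[train_df['target'] == target]['text']:
        corpus.extend(word.lower() for word in text.split())
    return corpus

corpus_0 = create_corpus(0)  # non-disaster tweets
corpus_1 = create_corpus(1)  # disaster tweets
```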

Let’s try to see which stopwords are used the most
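We can count stopword frequencies against NLTK's English stopword list, here for target = 0:

```python
stop_words = set(stopwords.words('english'))

# Frequency of each stopword in a corpus, most common first.
def top_stopwords(corpus, n=10):
    counts = Counter(word for word in corpus if word in stop_words)
    return counts.most_common(n)

print(top_stopwords(corpus_0))
```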

Similarly, for target = 1 we have:

In the text classification task, stopwords do not play a very important role, so we can consider removing them to reduce sentence length.

Consider the amount of punctuation
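For example, counting punctuation characters with string.punctuation, here for target = 1:

```python
# Count punctuation characters appearing in a corpus.
def count_punctuation(corpus):
    counts = Counter()
    for word in corpus:
        for ch in word:
            if ch in string.punctuation:
                counts[ch] += 1
    return counts

print(count_punctuation(corpus_1).most_common(10))
```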

And the same for target = 0:

Data Cleaning

Since I find that removing stopwords slightly reduces model performance, and my RAM can still handle the full sentences, I will keep the stopwords, but everyone can try removing them.

Here we will expand some abbreviated words into their full forms so that the tokenizer works more effectively (I referred to this link: https://www.kaggle.com/ghaiyur/ensemble-models-versiong).
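A shortened sketch of the idea; the linked notebook defines a much longer dictionary, and the entries and the name convert_abbrev below are only illustrative:

```python
# A few illustrative entries; the referenced notebook has many more.
abbreviations = {
    "u": "you",
    "r": "are",
    "ur": "your",
    "ppl": "people",
    "b4": "before",
    "asap": "as soon as possible",
}

# Replace each abbreviated token with its full form.
def convert_abbrev(text):
    return ' '.join(abbreviations.get(word.lower(), word)
                    for word in text.split())

train_df['text'] = train_df['text'].apply(convert_abbrev)
```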

Train-test split

I will split the data into train and test sets with a 0.7:0.3 ratio due to the relatively small amount of data (I referred to Andrew Ng's video: https://www.youtube.com/watch?v=1waHlpKiNyY).
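With scikit-learn this could look like the following (random_state and stratify are my own choices; stratifying keeps the class ratio the same in both splits):

```python
X = train_df['text'].values
y = train_df['target'].values

# 70/30 split, preserving the class distribution in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```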

Model

Here I use PyTorch together with Hugging Face's transformers library to load a pretrained RoBERTa model. There are two approaches: fine-tuning and feature-based. Here I will fine-tune, so the RoBERTa weights are also updated during training.

Here I use a bidirectional LSTM for the downstream task (text classification), fed with the embeddings produced by the pretrained RoBERTa, extracted from its 12th (final) layer.
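A sketch of such a model, assuming roberta-base (whose last hidden layer is layer 12) and a single bidirectional LSTM layer; the hidden size of 128 is an assumed hyperparameter:

```python
class RobertaLSTMClassifier(nn.Module):
    def __init__(self, hidden_size=128, n_classes=2):
        super().__init__()
        # Pretrained RoBERTa; its weights are updated during fine-tuning.
        self.roberta = RobertaModel.from_pretrained('roberta-base')
        # BiLSTM over the token embeddings from RoBERTa's last layer.
        self.lstm = nn.LSTM(input_size=768, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, n_classes)

    def forward(self, input_ids, attention_mask):
        # last_hidden_state: (batch, seq_len, 768), i.e. layer 12 of roberta-base.
        outputs = self.roberta(input_ids=input_ids,
                               attention_mask=attention_mask)
        _, (h_n, _) = self.lstm(outputs.last_hidden_state)
        # h_n: (2, batch, hidden) -> concatenate forward and backward states.
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.fc(pooled)
```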

Model initialization
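Something like the following; the learning rate, batch size, and max sequence length of 128 are assumed values:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaLSTMClassifier().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Tokenize once up front and wrap everything in DataLoaders.
def encode(texts):
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors='pt')
    return enc['input_ids'], enc['attention_mask']

train_ids, train_mask = encode(X_train)
test_ids, test_mask = encode(X_test)

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(train_ids, train_mask, torch.tensor(y_train)),
    batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(test_ids, test_mask, torch.tensor(y_test)),
    batch_size=32)
```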

Create a function that trains the model for one epoch
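A sketch, assuming the train_loader defined above:

```python
# Train for one epoch and return the average loss.
def train_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0.0
    for input_ids, attention_mask, labels in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
```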

Create a test (evaluation) function
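A matching evaluation function that reports accuracy on the test set:

```python
# Evaluate accuracy without computing gradients.
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in loader:
            logits = model(input_ids.to(device), attention_mask.to(device))
            preds = logits.argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```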

Training
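Putting the pieces together (the number of epochs is an assumed value):

```python
n_epochs = 3  # assumed; tune as needed
for epoch in range(n_epochs):
    loss = train_epoch(model, train_loader, criterion, optimizer)
    acc = evaluate(model, test_loader)
    print(f'Epoch {epoch + 1}: train loss = {loss:.4f}, test accuracy = {acc:.4f}')
```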

Summary

So, I have learned with everyone how to process text data and fine-tune a pretrained model. I hope this post is useful to everyone.

References

https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert

https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove
