In this article, I will present a first classification model based on the features extracted with the TF-IDF technique that my previous article covered. The model's task is to classify the emotional nuance of the movie review sentences in the data set: sentiment classification.
Introducing the data set
The data set I use is the IMDB movie reviews dataset, which you can download and use for free. It consists of 25,000 positive review sentences and 25,000 negative review sentences. Each review is labeled with a star rating, from 1 star to 10 stars. As such, this data set is well suited to a classification task where the input is text and the output is a star rating expressing the reviewer's feelings.
If a review has at least 7 stars, you can label it as positive. If it has at most 4 stars, you can label it as negative.
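The labeling rule above can be sketched as a small helper function. This is an illustrative snippet, not code from the article; the function name is my own, and the treatment of 5–6 star ratings (raising an error) is an assumption based on the two rules stated above.

```python
def label_from_stars(stars: int) -> str:
    """Map an IMDB star rating to a sentiment label.

    Following the rules above: 7+ stars is positive, 4 or fewer is
    negative. Ratings of 5-6 match neither rule, so we reject them
    (an assumption; the article does not say how they are handled).
    """
    if stars >= 7:
        return "positive"
    if stars <= 4:
        return "negative"
    raise ValueError("rating matches neither labeling rule")
```

For example, `label_from_stars(9)` yields `"positive"` and `label_from_stars(2)` yields `"negative"`.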
The data set includes at most 30 reviews per movie, which helps it avoid bias toward any single movie. It is split 50/50 into train and test sets. A fairly ideal data set in practice.
To evaluate the model, we will use accuracy, because the data set contains the same number of positive and negative examples. That is, the data set is balanced.
Extracting features from text
As a first step, let's transform the text into features using the TF-IDF technique mentioned in the previous article. Assume the n-grams we first use have n = 1. The result is a feature matrix with 25,000 rows, one per review in the training set, and 75,000 columns, one per extracted feature. This is a relatively large feature matrix.
One more thing: this is an extremely sparse matrix. Many positions contain zeros; in fact, about 99.8% of the entries in the matrix are zero.
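The extraction and sparsity check described above can be sketched with scikit-learn's `TfidfVectorizer`. The four short texts below are placeholders of my own; on the real 25,000 IMDB training reviews, the sparsity comes out around the 99.8% figure mentioned above.

```python
# Sketch: extracting unigram TF-IDF features and measuring sparsity.
# The texts are illustrative stand-ins for the IMDB training reviews.
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "a wonderful, moving film",
    "an awful waste of two hours",
    "great acting and a great story",
    "boring plot and terrible dialogue",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
X = vectorizer.fit_transform(train_texts)         # sparse CSR matrix

# Fraction of entries that are zero; X.nnz counts nonzero entries.
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, round(sparsity, 3))
```

`fit_transform` returns a SciPy sparse matrix, which is exactly why this representation scales to tens of thousands of features without exhausting memory.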
For sparse features like this, you can use linear models: they are fast and work well with millions of features, and the same is true of Naive Bayes. On the other hand, you should not use a Decision Tree, because it performs an exhaustive search over all the features to choose each split, which takes a lot of time on data like this; the same applies to Gradient Boosted Trees.
The model we can use is the logistic regression model. It works as follows:
It tries to predict the probability that a review is positive based on the features of that review. Using the TF-IDF feature values, the model learns a parameter (weight) for each feature: it multiplies each feature value by its weight, sums the results, and passes this total through the sigmoid activation function. That is the logistic regression model. It extends the plain linear model: when the data is not linearly separable and a linear output cannot represent it, logistic regression can fit a nonlinear decision function while still keeping the linear idea inside. It is a linear classification model, and it is good because it handles sparse data well.
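The mechanics just described (weighted sum of TF-IDF values, then sigmoid) can be written out in a few lines. The feature values and weights below are made-up numbers for illustration; a trained model would store one learned weight per vocabulary feature.

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_positive_prob(tfidf_values, weights, bias=0.0):
    """Dot product of TF-IDF features and weights, passed through sigmoid.

    tfidf_values and weights are illustrative lists; in a real model
    there is one weight per feature in the vocabulary.
    """
    score = bias + sum(v * w for v, w in zip(tfidf_values, weights))
    return sigmoid(score)

# Positive-leaning features push the score up; negative ones pull it down.
p = predict_positive_prob([0.4, 0.7], [2.1, -0.3])  # weights are made up
print(round(p, 3))
```

Note that `sigmoid(0.0)` is exactly 0.5, which is the ambiguous midpoint discussed next.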
This model trains fast, and furthermore its weights can be interpreted after training.
Looking at the graph of the sigmoid activation function: if the linear combination of the TF-IDF values and their corresponding weights (denote it X) equals 0, then the output of the sigmoid function is 0.5, and with this value we cannot determine whether the input review is positive or negative. As X increases, the probability of the review being positive increases, and vice versa.
You can imagine that, for a review to come out positive, its features' weights mostly pull the score in the positive direction, while for a negative review they mostly pull it in the negative direction.
Suppose we train the logistic regression model on the bag of 1-grams with TF-IDF; we get an accuracy of 88.5% on the test set. Let's look at the weight values of the n-grams, since a linear model's weights can fully explain what it has learned:
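The training step above can be sketched as a scikit-learn pipeline. The four toy texts and labels are my own placeholders; only on the real IMDB split does this reach the 88.5% figure quoted above.

```python
# Sketch: TF-IDF unigrams feeding a logistic regression classifier.
# Toy data stands in for the 25,000 IMDB training reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great wonderful film", "awful terrible movie",
         "wonderful story", "terrible boring plot"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["a wonderful great film"]))
```

Bundling the vectorizer and classifier in one pipeline means new raw text can be passed straight to `predict`, with the same TF-IDF transform applied automatically.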
If you look at the top-positive table, you can see that those weights are positive and relatively large, while in the top-negative table the weights are negative and relatively large in magnitude. Even without knowing English, you could still conclude whether a review is positive or negative just from the weights the model has learned for these examples.
Next, suppose we train the model on slightly larger n-grams, say 2-grams. The value of n in n-grams usually does not exceed 5, because with a large n the n-grams often come from spelling errors or are simply meaningless, and they can make the model overfit to very specific cases. So we use n = 2 together with a minimum frequency threshold, and from there we get a large feature matrix with 25,000 rows for the reviews and 156,821 columns for the features.
Although this is a relatively large matrix, we still use a linear model and it still works:
With the addition of 2-grams, accuracy increases by 1.5% over the original, to 89.9%, nearly 90%. And looking at the weights, you can see that with n = 2 the weight values separate the classes better: the positive weights are larger and the negative weights are more negative.
A few other tricks to increase accuracy:
- Review sentences often contain emoticons; building features from these icons can be quite effective.
- Try normalizing tokens with stemming or lemmatization techniques.
- Try other classification models: SVM, Naive Bayes, …
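The token-normalization trick in the list above can be illustrated with a deliberately crude suffix-stripper. This is a toy of my own, not the Porter stemmer; a real pipeline would use NLTK's `PorterStemmer` or a lemmatizer such as spaCy's.

```python
# Toy sketch of the idea behind stemming: strip common suffixes so
# related word forms ("film", "films") map to one feature. Not a real
# stemming algorithm -- use NLTK or spaCy in practice.
def crude_stem(token: str) -> str:
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip when enough of the word remains to stay meaningful.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            if suffix == "ies":
                return token[:-len(suffix)] + "y"
            return token[:-len(suffix)]
    return token

print([crude_stem(t) for t in ["films", "acting", "loved"]])
```

Merging word forms like this shrinks the vocabulary, so the TF-IDF matrix gets denser and each surviving feature receives more training signal.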
So, I have presented a sentiment classification model for review sentences. There is currently no full code for this model; I will add it at the end of the article in the near future. I hope you find this NLP topic interesting. Thanks for reading.
I based this content on the Feature extraction from text lesson in the Natural Language Processing course created by National Research University Higher School of Economics. See you in the next article in the NLP series, on the subject: Hashing trick in spam filtering.