Feature extraction from text


Hello everyone, in this article I would like to talk about converting the tokens extracted from input text into features. This step is important for creating the input features used to train NLP models.

Transforming tokens into features

Bag of words

One of the simplest ways to transform tokens into features, and the core idea behind the methods that follow, is the Bag of Words technique. The idea is to record which specific tokens occur in our input text and how often.

The implementation is as follows: we look for the words that appear in a sentence and mark them with the value 1, and mark the words that do not appear with the value 0.

Let’s look at an example with 3 review sentences:

Now, we search for all the tokens that occur in our documents and write them as the column headings of a matrix. Each row holds the values that mark whether each token appears in the review sentence corresponding to that row. The result is a relatively large feature matrix. This is how we translate each text sentence into a vector.

This is called text vectorization.
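As an illustration, here is a minimal sketch of this binary bag-of-words marking using sklearn's CountVectorizer. The three review sentences are made-up stand-ins, since the article's original example table is not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up review sentences standing in for the article's example.
reviews = ["good movie", "not a good movie", "did not like"]

# binary=True marks presence (1) or absence (0) of each token
# instead of counting occurrences.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the tokens used as column headings
print(X.toarray())                         # one 1/0 vector per review sentence
```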

However, this method has several disadvantages:

  1. First, the order of the tokens is not preserved, which affects the semantics of the sentence. That is why it is called a bag of words: a bag is unordered.
  2. Second, the counters are not normalized, so their raw values do not carry much meaning.

To address these problems, new techniques were introduced, starting with one that preserves some of the token ordering.

TF-IDF

In fact, a simple way to preserve some of the token order is to look at pairs of tokens, triples of consecutive tokens, and so on, instead of single tokens. This approach is called extracting n-grams: a 1-gram corresponds to 1 token, a 2-gram corresponds to 2 consecutive tokens, and so on.

Let’s take a look at how this works.

We still have the same 3 review sentences. Now each column is not a single token but a pair of tokens. The review sentences are converted to vectors just as in the Bag of Words method, with a value of 1/0 indicating whether or not the token pair appears in the corresponding review.
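Below is a similar minimal sketch, this time asking CountVectorizer for pairs of tokens via ngram_range=(2, 2); the review sentences are again illustrative stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same made-up review sentences as before.
reviews = ["good movie", "not a good movie", "did not like"]

# ngram_range=(2, 2) builds the columns from pairs of consecutive tokens
# instead of single tokens; binary=True keeps the 1/0 marking.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), binary=True)
X = bigram_vectorizer.fit_transform(reviews)

print(bigram_vectorizer.get_feature_names_out())  # token pairs such as "good movie"
print(X.toarray())
```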

Notice that this representation only captures local ordering within the sentence, while what we want is a richer way to analyze the input text. Another problem is that taking pairs of tokens produces a very large number of features: if the vocabulary reaches 100,000 tokens, the number of possible features explodes.

To solve this problem, we eliminate some n-grams based on their frequency in our set of input reviews (the corpus). There are three cases: n-grams with a high frequency of occurrence, n-grams with a low frequency of occurrence, and n-grams with a medium frequency of occurrence.

  • n-grams with a high frequency of occurrence: these are n-grams you can see in most documents. For English, these are articles, prepositions, and so on (a, an, the, …). Because they mostly serve grammatical structure, they do not carry much meaning. They are called stop-words; they do not really help us distinguish texts from each other, so it helps to remove them.
  • n-grams with a low frequency of occurrence: these often come from user typos, or from n-grams that are genuinely rare across the reviews in our dataset. Both cases are bad for later models: if we do not remove these n-grams, the model will likely overfit.
  • n-grams with a medium frequency of occurrence: these are the best n-grams, because they exclude both stop-words and misspelled or rare terms. The problem is that there are still a huge number of such n-grams, spread over many different frequency ranges. It would be useful to rely on frequency to decide which of these n-grams are better and which are worse, that is, to rank them by importance. The idea is that, among the medium-frequency n-grams, the rarer ones should receive larger weights, because they point to more specific cases in the reviews. A short sketch after this list illustrates this kind of frequency-based filtering.
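As a rough illustration of frequency-based filtering, here is a sketch using CountVectorizer's min_df and max_df thresholds on an assumed toy corpus; the threshold values are arbitrary and chosen only to show the mechanism.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy corpus; the thresholds below are arbitrary example values.
corpus = ["good movie", "not a good movie", "did not like"]

# min_df removes n-grams that occur in too few documents (typos, very rare terms);
# max_df removes n-grams that occur in too large a fraction of documents
# (stop-word-like terms).
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # only the medium-frequency n-grams remain
```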

To implement this idea, the term TF is used to denote the frequency of occurrence of an n-gram.

TF (Term Frequency)

We use TF to denote the frequency of a term t. A term t here can be understood as an n-gram token in a document d:

tf(t, d): the frequency of term t in document d

There are several ways to calculate this tf value:

The first and easiest way is to use binary values (0, 1): assign 1 if the term appears in your input text and 0 if it does not.

The second way is to use the raw count, i.e. how many times term t appears in the document. We denote this value by f.

The third way is to compute the total number of occurrences of all terms in document d and divide f by this sum. This normalizes tf into the range (0, 1) and indicates what share of the document's term occurrences belongs to term t.

Finally, you can use a logarithmic weighting scheme. This moves the counts onto a logarithmic scale, which can help with subsequent tasks.
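To make the four schemes concrete, here is a small sketch in plain Python; the document and the exact form of the logarithmic scheme (1 + log f) are assumptions for illustration.

```python
import math
from collections import Counter

# A single illustrative document, already tokenized.
document = ["good", "movie", "good", "plot"]
counts = Counter(document)      # raw count f of each term in the document
total = sum(counts.values())    # total number of term occurrences in d

term = "good"
f = counts[term]

tf_binary = 1 if f > 0 else 0              # way 1: present / absent
tf_raw = f                                 # way 2: raw count f
tf_normalized = f / total                  # way 3: f divided by the total count
tf_log = 1 + math.log(f) if f > 0 else 0   # way 4: one common logarithmic scheme

print(tf_binary, tf_raw, tf_normalized, tf_log)
```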

IDF (Inverse Document Frequency)

Another component is the inverse document frequency (IDF for short), which adjusts the weight of a term based on how many documents it appears in.

First, some notation: N is the total number of documents in our dataset, and D is the dataset itself, i.e. the collection of all documents. Put simply, a document here can be a single review sentence in the dataset.

Now, we need to find the number of documents that term t appears in:

|{d ∈ D : t ∈ d}|

Now you have a document frequency: you count the number of documents in which term t appears and divide it by the total number of documents N, which gives the fraction of the dataset in which term t occurs. Inverting this quotient and taking the logarithm gives the formula:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

This shows that the more documents a term t appears in, the lower its importance, and the rarer the term, the higher its importance: the larger the quotient, the larger the logarithm, and the smaller the quotient, the smaller the logarithm. This formula is called the inverse document frequency.

TF-IDF

Now we combine TF and IDF to get the TF-IDF value. One factor is the frequency of term t in document d, and the other reflects how rare term t is across the entire dataset. The TF-IDF value is the product of TF and IDF: tfidf(t, d, D) = tf(t, d) · idf(t, D). It gives us a view of both the whole dataset and the particular document containing the term t we are considering.

  • The TF-IDF value is high when the TF value is high (i.e., term t occurs frequently in document d) and the number of documents containing term t across the dataset is low (when this document frequency is low, the IDF value is large). This means that term t is concentrated almost exclusively in the document we are considering, and the high TF-IDF value highlights this term as a way to distinguish that document from the rest of the dataset, which is exactly the idea we are after. A small hand-computed sketch follows this list.
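Here is a small hand-computed sketch of TF-IDF on an assumed toy corpus, using the normalized term frequency and the logarithmic IDF defined above; the token lists and helper functions are purely illustrative.

```python
import math
from collections import Counter

# Illustrative tokenized corpus: each inner list is one document (one review).
corpus = [
    ["good", "movie"],
    ["not", "a", "good", "movie"],
    ["did", "not", "like"],
]

def tf(term, document):
    # Normalized term frequency: raw count divided by the document length.
    return Counter(document)[term] / len(document)

def idf(term, corpus):
    # Number of documents containing the term, then inverse and logarithm.
    df = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / df)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# "did" occurs in only one review, so it is weighted more heavily there
# than "movie", which is spread over two reviews.
print(tf_idf("did", corpus[2], corpus))    # ~0.37
print(tf_idf("movie", corpus[0], corpus))  # ~0.20
```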

Take a look at how TF-IDF works:

We replace the counter values of the bag-of-words token representation with TF-IDF values. We can then normalize each row, for example with the L2 norm; you can divide by the L2 norm or by the row sum, whichever you prefer:

You can see that the value 0.47 for did not is the highest, since it appears only in the third review sentence, while the n-grams good movie and movie get smaller values because they appear in two review sentences. The 0.47 value thus highlights what is special about that review in our dataset.
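As a quick illustration of the L2 row normalization mentioned above, here is a small numpy sketch; the TF-IDF values in the matrix are made up.

```python
import numpy as np

# A tiny made-up matrix of raw TF-IDF values: one row per review,
# one column per n-gram.
tfidf_matrix = np.array([
    [0.3, 0.0, 0.7],
    [0.2, 0.5, 0.0],
])

# Divide each row by its L2 norm so every review vector has unit length.
row_norms = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)
normalized = tfidf_matrix / row_norms

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # each row norm is now 1.0
```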

The sklearn library provides a class that implements text vectorization based on TF-IDF; you can simply import and use it. You can see the following code:
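The original code listing is not reproduced here, so below is a minimal sketch of how such a vectorizer might be set up with sklearn's TfidfVectorizer; the corpus and parameter values are placeholders, not the article's exact settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative reviews standing in for the article's example corpus.
texts = ["good movie", "not a good movie", "did not like"]

tfidf = TfidfVectorizer(
    min_df=1,            # drop n-grams that appear in fewer documents than this
    max_df=0.9,          # drop n-grams that appear in more than 90% of documents
    ngram_range=(1, 2),  # build features from 1-grams and 2-grams
)
features = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())  # the surviving n-grams (column headings)
print(features.toarray())             # L2-normalized TF-IDF rows, one per review
```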

There are a few parameters you may need to pass, including:

  • min_df: the minimum document frequency, essentially a threshold used to cut n-grams with low document frequency.
  • max_df: the maximum document frequency, i.e. the largest number (or fraction) of documents containing term t that we allow; if a term exceeds this value, it is probably a stop-word.
  • ngram_range: the range of n values for which the TF-IDF vectorizer builds n-gram features. Note that not all of these n-grams end up as features: some are filtered out by the thresholds above.

In this article I have covered techniques for extracting features from input text. This is an important foundation for creating the input features used to train NLP models later. I hope the explanation was easy to follow. Thank you for reading.

This content is based on the Feature extraction from text lesson in the Natural Language Processing course created by the National Research University Higher School of Economics. See you in my next post on Linear models for sentiment analysis (using linear models to analyze the sentiment of review sentences) with features created from the techniques above.


Source: Viblo