Keyword Extraction: A quick solution for information filtering


Overview

With the growth of technology, social platforms, newspapers, and other media, we are constantly exposed to many different sources of information, and the need to filter and select that information keeps increasing. Applications such as user recommendations, trend detection, and chatbots continue to improve and evolve. So how do we extract the essential information from all this text? In this article, I would like to present several methods for extracting keywords from text that are used in many natural language processing (NLP) problems.

1. Spacy

When talking about keyword extraction, it is impossible not to mention spaCy. One of the most popular Python NLP libraries, spaCy comes with pretrained pipelines and currently supports tokenization and training for more than 60 different languages. It includes neural network models for tasks such as tagging, parsing, named entity recognition, and text classification.

The basic steps for extracting keywords with spaCy are as follows (a short sketch is given after the list):

  • Split the input text into tokens.
  • Extract keywords from the token list:
    • Keep tokens whose POS tags are in a configurable list, e.g. “PROPN”, “ADJ”, “VERB”, “NOUN”.
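A minimal sketch of this flow, assuming the `en_core_web_sm` model is installed; the POS-tag list, the stop-word/punctuation filtering, and the example text are illustrative choices rather than fixed parts of spaCy's API:

```python
import spacy

# Requires the small English model:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_keywords(text, pos_tags=("PROPN", "ADJ", "VERB", "NOUN")):
    """Keep tokens whose coarse POS tag is in the allowed list."""
    doc = nlp(text)
    return [
        token.text
        for token in doc
        if token.pos_ in pos_tags and not token.is_stop and not token.is_punct
    ]

print(extract_keywords("SpaCy ships pretrained pipelines that make keyword extraction easy."))
```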

You can learn more about spacy here: Spacy

2. Rake_NLTK

RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction method that finds the most relevant words or phrases in a piece of text using a set of stop words and phrase delimiters. The algorithm is domain independent and identifies key phrases by analyzing how often words occur and how often they co-occur with other words in the text. rake_nltk is an implementation of RAKE built on top of NLTK. The basic steps of the algorithm are as follows (a sketch follows the list):

  • Split the input text into candidate phrases using stop words and phrase delimiters
  • Build a matrix of co-occurring words
  • Score each word – the score can be the word’s degree in the matrix, its frequency, or its degree divided by its frequency
  • Score keyword phrases by combining the scores of the keywords they contain
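A minimal sketch using the `rake_nltk` package; the default `Rake()` configuration relies on NLTK's English stop words and punctuation as phrase delimiters, and the example text is made up:

```python
from rake_nltk import Rake

# rake_nltk needs NLTK data (one-time download):
#   import nltk; nltk.download("stopwords"); nltk.download("punkt")
rake = Rake()  # English stop words and punctuation as delimiters by default

text = "RAKE finds key phrases by analysing word frequency and co-occurrence with other words."
rake.extract_keywords_from_text(text)

# List of (score, phrase) pairs, highest score first
print(rake.get_ranked_phrases_with_scores())
```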

You can take a closer look at the Rake nltk library here: RAKE_NLTK

3. TextRank

TextRank is a graph-based algorithm, available through Python libraries, that supports keyword extraction and text summarization. It estimates how closely words are related by checking whether they follow one another in the text, and then sorts the most important terms using a ranking algorithm similar to PageRank. TextRank integrates well with spaCy’s pipelines; a short example sketch is given below.
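A minimal sketch assuming the `pytextrank` package, which registers a `textrank` component on a spaCy pipeline; the package choice and example text are illustrative and not taken from the original article:

```python
import spacy
import pytextrank  # registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("TextRank builds a graph of words and ranks the most important terms in the text.")

# Top-ranked phrases with their TextRank scores
for phrase in doc._.phrases[:5]:
    print(phrase.text, round(phrase.rank, 3))
```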

You can learn more about the ideas and algorithms implemented in this library here: TextRank

4. KeyBert

KeyBERT is a simple, easy-to-use keyword extraction technique that finds the keywords and key phrases most similar to a given document using BERT embeddings. It relies on BERT embeddings and cosine similarity to locate the words and phrases in a document that are most similar to the document itself.

BERT is first used to compute a document embedding that represents the whole document. Embeddings are then extracted for candidate words and phrases, and KeyBERT uses cosine similarity to pick the candidates that are closest to the document embedding.
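A minimal sketch with the KeyBERT package; the default constructor downloads a sentence-transformers model on first use, and the n-gram range, `top_n`, and example text are illustrative choices:

```python
from keybert import KeyBERT

# Downloads a default sentence-transformers model on first use
kw_model = KeyBERT()

doc = ("KeyBERT compares the embeddings of candidate words and phrases "
       "with the embedding of the whole document.")

# Each result is a (phrase, cosine similarity) pair
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=5)
print(keywords)
```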

You can learn more about the ideas and algorithms implemented in this library here: KeyBert

5. Word cloud

A word cloud is a tool for visualizing textual data, often used to highlight the most important terms in a text.

Also known as a tag cloud or text cloud, a word cloud renders a term larger the more often it appears in the source text (such as a speech, blog post, or database): the more frequent and important a term is, the larger and bolder it is drawn. This makes it a quick way to surface the most important parts of textual data.
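A minimal sketch with the `wordcloud` package (plus matplotlib for display); the size and colour settings and the example text are illustrative:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "keyword extraction keyword cloud keyword text data visualization"

# Font size of each word reflects how often it occurs in the text
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```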

You can learn more about the ideas and algorithms implemented in this library here: Word cloud

6. Yet Another Keyword Extractor (Yake)

YAKE is an unsupervised automatic keyword extraction method that identifies the most relevant keywords in a document using statistical features computed from that single document. The technique does not rely on dictionaries, external corpora, text size, or language, and it does not require training on a particular set of documents. The main features of the YAKE algorithm are as follows (a sketch is given after the list):

  • Unsupervised approach
  • Corpus-Independent
  • Domain and Language Independent
  • Single-Document
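A minimal sketch with the `yake` package; the language, n-gram size, `top` count, and example text are illustrative settings (note that in YAKE, lower scores mean more relevant keywords):

```python
import yake

# Extract up to 2-word keywords; lower score = more relevant
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)

text = "YAKE extracts keywords from a single document using only statistics computed from that document."
for keyword, score in extractor.extract_keywords(text):
    print(keyword, round(score, 4))
```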

You can learn more about the ideas and algorithms implemented in this library here: Yake

7. Textrazor API

In addition to the Python libraries above, a hosted API is also a good option for this task. The TextRazor API can be accessed from a variety of programming languages, including Python, Java, PHP, and others. After creating a TextRazor account, we receive an API key that lets us extract keywords from text.

TextRazor is a good choice for developers who need fast extraction tools with comprehensive customization options. It is a keyword extraction service that can be used locally or in the cloud, exposes an API for extracting meaning from text, and is easy to call from a programming language. Beyond extracting keywords and entities in 12 different languages, we can design custom extractors and extract synonyms and entity relationships. A minimal sketch is shown below.
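A minimal sketch with the official `textrazor` Python client; the API key is a placeholder you must replace with your own, and the chosen extractors and example text are illustrative:

```python
import textrazor

# Placeholder: use the key issued for your TextRazor account
textrazor.api_key = "YOUR_API_KEY"

client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze("TextRazor extracts keywords, entities and topics from raw text.")

# Entities recognised in the text, with their relevance scores
for entity in response.entities():
    print(entity.id, entity.relevance_score)
```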

You can learn more about the API here: Textrazor API

Summary

With the current pace of technological development, many keyword extraction solutions have been built and refined for both speed and accuracy, and larger problems are being developed on top of them. In this article I have only introduced a few solutions that are easy to install and use. I hope this gives you some additional options when building and developing projects related to natural language processing.

References

Spacy

RAKE_NLTK

TextRank

KeyBert

Word cloud

Yake

Textrazor API

 
