Overview
With the development of technology (social platforms, newspapers, and other media), we constantly have access to many different sources of information, so people's need to select and use that information keeps growing. Applications such as user recommendations, trend detection, and chatbots are continually being improved and developed. So how do we extract the relevant information? In this article, I would like to present some methods for extracting keywords from text that are used in many natural language processing (NLP) problems.
1. Spacy
When it comes to keyword extraction, it is impossible not to mention spaCy. One of the most popular Python NLP libraries, spaCy comes with pretrained pipelines and currently supports tokenization and training for over 60 different languages. It includes neural network models for tasks such as tagging, parsing, named entity recognition, and text classification.
A basic keyword-extraction workflow with spaCy is as follows:
- Split the input text into tokens
- Extract keywords from the token list
- Keep tokens whose POS tags are in a customizable list, e.g. "PROPN", "ADJ", "VERB", "NOUN"
```python
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
text = (
    "When Sebastian Thrun started working on self-driving cars at Google "
    "in 2007, few people outside of the company took him seriously."
)
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
```
```
Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him']
Verbs: ['start', 'work', 'drive', 'take']
Sebastian Thrun PERSON
2007 DATE
```
You can learn more about spacy here: Spacy
2. Rake_NLTK
RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction method that finds the most relevant words or phrases in a piece of text using a set of stop words and phrase delimiters. RAKE is domain-independent and identifies key phrases by analyzing the frequency of word occurrences and their co-occurrence with other words in the text. rake-nltk is an implementation of RAKE built on top of NLTK. Its basic processing steps are:
- Split the input text into candidate phrases using stop words and phrase delimiters
- Create a matrix of co-occurring words
- Score each word; that score can be its degree in the matrix, its frequency, or its degree divided by its frequency
- Combine the scored words into keyword phrases
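As a toy illustration of the steps above (not rake-nltk's actual implementation, and using a deliberately tiny stop-word list), each word can be scored as its co-occurrence degree divided by its frequency, and each candidate phrase as the sum of its word scores:

```python
import re
from collections import defaultdict

def rake_scores(text, stopwords):
    """Score candidate phrases as the sum of degree(word) / frequency(word)."""
    words = re.findall(r"[a-z][a-z\-']*", text.lower())
    # Split the word sequence into candidate phrases at stop words
    phrases, current = [], []
    for w in words:
        if w in stopwords:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word frequency and co-occurrence degree (phrase length counts toward degree)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

text = ("When Sebastian Thrun started working on self-driving cars at Google "
        "in 2007, few people outside of the company took him seriously.")
stops = {"when", "on", "at", "in", "of", "the", "few", "him"}
print(rake_scores(text, stops))
```

Longer phrases made of well-connected words score higher, which is why RAKE tends to surface multi-word phrases first.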
```python
from rake_nltk import Rake
import nltk

nltk.download('stopwords')
nltk.download('punkt')

r = Rake()
my_text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
r.extract_keywords_from_text(my_text)

keywordList = []
rankedList = r.get_ranked_phrases_with_scores()
for keyword in rankedList:
    keyword_updated = keyword[1].split()
    keyword_updated_string = " ".join(keyword_updated[:2])
    keywordList.append(keyword_updated_string)
print(keywordList)
```
```
['sebastian thrun', 'people outside', 'driving cars', 'company took', 'seriously', 'self', 'google', '2007']
```
You can take a closer look at the Rake nltk library here: RAKE_NLTK
3. TextRank
TextRank is a graph-based ranking algorithm with Python implementations for keyword extraction and text summarization. The algorithm determines how closely related words are by checking whether they follow each other, then sorts the most important terms using a PageRank-style ranking. The pytextrank implementation plugs directly into spaCy's pipelines. Here is an example you can refer to.
```python
import spacy
import pytextrank

# example text
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)
```
```
few people
Google
self-driving cars
Sebastian Thrun
the company
2007
him
```
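The underlying graph-ranking idea can also be sketched in a few lines of plain Python: build a co-occurrence graph over the words and run a PageRank-style iteration on it. This is a toy illustration, not pytextrank's implementation:

```python
def textrank(words, window=2, damping=0.85, iterations=30):
    """Rank words by a PageRank iteration over an undirected co-occurrence graph."""
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        # Connect each word to the words that follow it within the window
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w]
            )
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)

# Words that co-occur with many well-connected words rank highest
print(textrank("the cat sat on the mat".split()))
```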
You can learn more about the ideas and algorithms implemented in this library here: TextRank
4. KeyBert
KeyBERT is a simple, easy-to-use keyword extraction technique that generates the keywords and key phrases most similar to a given document using BERT embeddings. It uses BERT embeddings and cosine similarity to locate the sub-phrases in a document that are most similar to the document itself.
BERT is used to extract a document embedding that represents the document as a whole; embeddings are then extracted for the candidate words/phrases. KeyBERT then uses cosine similarity to find the words/phrases that are most similar to the document.
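The cosine-similarity ranking step can be illustrated with toy vectors (the vectors below are invented for the example; KeyBERT computes real ones with BERT):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: one for the document, one per candidate phrase
doc_vec = np.array([0.9, 0.1, 0.0])
candidates = {
    "self-driving cars": np.array([0.8, 0.2, 0.1]),
    "few people":        np.array([0.1, 0.9, 0.3]),
}

# Rank candidates by their similarity to the document vector
ranked = sorted(candidates, key=lambda c: cosine(doc_vec, candidates[c]), reverse=True)
print(ranked)
```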
```python
from keybert import KeyBERT

doc = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
print(keywords)
```
```
[('sebastian', 0.3796), ('driving', 0.3548), ('google', 0.3379), ('thrun', 0.3156), ('cars', 0.2946)]
```
You can learn more about the ideas and algorithms implemented in this library here: KeyBert
5. Word cloud
A word cloud (also known as a tag cloud or text cloud) is a tool for visualizing textual data, often used to highlight the most important terms.
The more often a term appears in a text data source (such as a speech, blog post, or database), the larger and bolder it is rendered in the word cloud. This makes word clouds a quick way to surface the most important parts of textual data.
```python
import collections
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

all_headlines = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
stopwords = STOPWORDS

# Generate and display the word cloud
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000).generate(all_headlines)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Count the most common non-stopword terms
filtered_words = [word for word in all_headlines.split() if word not in stopwords]
counted_words = collections.Counter(filtered_words)

words = []
counts = []
for word, count in counted_words.most_common(10):
    words.append(word)
    counts.append(count)
print(words)
```
```
['When', 'Sebastian', 'Thrun', 'started', 'working', 'self-driving', 'cars', 'Google', '2007,', 'people']
```
You can learn more about the ideas and algorithms implemented in this library here: Word cloud
6. Yet Another Keyword Extractor (Yake)
YAKE is an unsupervised automatic keyword extraction method that identifies the most relevant keywords in a document using statistical features computed from that single document. The technique does not rely on dictionaries, external corpora, text size, or language, and it does not require training on a particular set of documents. The main features of the YAKE algorithm are as follows:
- Unsupervised approach
- Corpus-Independent
- Domain and Language Independent
- Single-Document
```python
import yake

doc = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(doc)
for kw in keywords:
    print(kw)
```
```
('Sebastian Thrun started', 0.006884150060415161)
('Thrun started working', 0.015042304599106411)
('Sebastian Thrun', 0.02140921543860024)
('Thrun started', 0.04498862876540802)
('cars at Google', 0.04498862876540802)
('started working', 0.09700399286574239)
('working on self-driving', 0.09700399286574239)
('self-driving cars', 0.09700399286574239)
('Sebastian', 0.1447773057422032)
('Thrun', 0.1447773057422032)
('Google', 0.1447773057422032)
('started', 0.29736558256021506)
('working', 0.29736558256021506)
('self-driving', 0.29736558256021506)
('cars', 0.29736558256021506)
('people', 0.29736558256021506)
('company', 0.29736558256021506)
```
You can learn more about the ideas and algorithms implemented in this library here: Yake
7. Textrazor API
In addition to the Python libraries above, an API is also a good option for this task. The TextRazor API can be accessed from a variety of programming languages, including Python, Java, and PHP. After creating a TextRazor account, we receive an API key we can use to extract keywords from text.
TextRazor is a good choice for developers who need fast extraction with comprehensive customization options. It is a keyword extraction service that can be used locally or in the cloud, and it is easy to connect to from a programming language. Beyond extracting keywords and entities in 12 different languages, we can design custom extractors and extract synonyms and entity relationships.
```python
import textrazor

textrazor.api_key = "your_api_key"

client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze_url("https://www.textrazor.com/docs/python")
for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score)
```
You can learn more about the API here: Textrazor API
Summary
With the rapid pace of technological development today, many keyword extraction solutions have been built that offer high speed and accuracy, and the applications developed on top of them keep growing as well. In this article, I only introduced some solutions that are easy to install and use. I hope it gives you a few more options when building and developing projects related to natural language.