Overview
With the development of technology (social platforms, newspapers, and other media), we constantly have access to many different sources of information, so people's need to select and use that information keeps growing. Applications such as user recommendations, trend detection, and chatbots are continually being improved and developed. So how do we extract the relevant information? In this article, I would like to present some methods for extracting keywords from text that are used in many natural language processing (NLP) problems.
1. Spacy
When it comes to keyword extraction, it is impossible not to mention spaCy. One of the most popular Python NLP libraries, spaCy comes with pretrained pipelines and currently supports tokenization and training for over 60 different languages. It includes neural network models for tasks such as tagging, parsing, named entity recognition, and text classification.
A basic keyword-extraction workflow with spaCy is as follows:
- Split the input text into tokens
- Extract keywords from the token list
- Keep tokens whose POS tags are in a customizable list, e.g. "PROPN", "ADJ", "VERB", "NOUN"
```python
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
text = (
    "When Sebastian Thrun started working on self-driving cars at Google "
    "in 2007, few people outside of the company took him seriously."
)
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
```
```
Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him']
Verbs: ['start', 'work', 'drive', 'take']
Sebastian Thrun PERSON
2007 DATE
```
You can learn more about spacy here: Spacy
2. Rake_NLTK
RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction method that finds the most relevant words or phrases in a piece of text using a set of stop words and phrase delimiters. RAKE is domain-independent and identifies key phrases by analyzing the frequency of word occurrences and their co-occurrence with other words in the text. rake-nltk is an implementation of RAKE built on top of NLTK. Its basic processing steps are:
- Split the input text into candidate phrases using stop words and phrase delimiters
- Create a matrix of co-occurring words
- Score each word; that score can be its degree in the matrix, its frequency, or its degree divided by its frequency
- Combine the scored words into keyword phrases
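As a toy illustration of the steps above (not rake-nltk's actual implementation, and using a deliberately tiny stop-word list), each word can be scored as its co-occurrence degree divided by its frequency, and each candidate phrase as the sum of its word scores:

```python
import re
from collections import defaultdict

def rake_scores(text, stopwords):
    """Score candidate phrases as the sum of degree(word) / frequency(word)."""
    words = re.findall(r"[a-z][a-z\-']*", text.lower())
    # Split the word sequence into candidate phrases at stop words
    phrases, current = [], []
    for w in words:
        if w in stopwords:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word frequency and co-occurrence degree (phrase length counts toward degree)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

text = ("When Sebastian Thrun started working on self-driving cars at Google "
        "in 2007, few people outside of the company took him seriously.")
stops = {"when", "on", "at", "in", "of", "the", "few", "him"}
print(rake_scores(text, stops))
```

Longer phrases made of well-connected words score higher, which is why RAKE tends to surface multi-word phrases first.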
```python
from rake_nltk import Rake
import nltk

nltk.download('stopwords')
nltk.download('punkt')

r = Rake()
my_text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
r.extract_keywords_from_text(my_text)

keywordList = []
rankedList = r.get_ranked_phrases_with_scores()
for keyword in rankedList:
    keyword_updated = keyword[1].split()
    keyword_updated_string = " ".join(keyword_updated[:2])
    keywordList.append(keyword_updated_string)
print(keywordList)
```
```
['sebastian thrun', 'people outside', 'driving cars', 'company took', 'seriously', 'self', 'google', '2007']
```
You can take a closer look at the Rake nltk library here: RAKE_NLTK
3. TextRank
TextRank is a graph-based ranking algorithm with Python implementations for keyword extraction and text summarization. The algorithm determines how closely related words are by checking whether they follow each other, then sorts the most important terms using a PageRank-style ranking. The pytextrank implementation plugs directly into spaCy's pipelines. Here is an example you can refer to.
```python
import spacy
import pytextrank

# example text
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)
```
```
few people
Google
self-driving cars
Sebastian Thrun
the company
2007
him
```
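The underlying graph-ranking idea can also be sketched in a few lines of plain Python: build a co-occurrence graph over the words and run a PageRank-style iteration on it. This is a toy illustration, not pytextrank's implementation:

```python
def textrank(words, window=2, damping=0.85, iterations=30):
    """Rank words by a PageRank iteration over an undirected co-occurrence graph."""
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        # Connect each word to the words that follow it within the window
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w]
            )
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)

# Words that co-occur with many well-connected words rank highest
print(textrank("the cat sat on the mat".split()))
```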
You can learn more about the ideas and algorithms implemented in this library here: TextRank
4. KeyBert
KeyBERT is a simple, easy-to-use keyword extraction technique that generates the keywords and key phrases most similar to a given document using BERT embeddings. It uses BERT embeddings and cosine similarity to locate the sub-phrases in a document that are most similar to the document itself.
BERT is used to extract a document embedding that represents the document as a whole; embeddings are then extracted for the candidate words/phrases. KeyBERT then uses cosine similarity to find the words/phrases that are most similar to the document.
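The cosine-similarity ranking step can be illustrated with toy vectors (the vectors below are invented for the example; KeyBERT computes real ones with BERT):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: one for the document, one per candidate phrase
doc_vec = np.array([0.9, 0.1, 0.0])
candidates = {
    "self-driving cars": np.array([0.8, 0.2, 0.1]),
    "few people":        np.array([0.1, 0.9, 0.3]),
}

# Rank candidates by their similarity to the document vector
ranked = sorted(candidates, key=lambda c: cosine(doc_vec, candidates[c]), reverse=True)
print(ranked)
```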
```python
from keybert import KeyBERT

doc = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
print(keywords)
```
```
[('sebastian', 0.3796), ('driving', 0.3548), ('google', 0.3379), ('thrun', 0.3156), ('cars', 0.2946)]
```
You can learn more about the ideas and algorithms implemented in this library here: KeyBert
5. Word cloud
A word cloud (also known as a tag cloud or text cloud) is a tool for visualizing textual data, often used to highlight the most important terms.
The more often a term appears in a text data source (such as a speech, blog post, or database), the larger and bolder it is rendered in the word cloud. This makes word clouds a quick way to surface the most important parts of textual data.
```python
import collections
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

all_headlines = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
stopwords = STOPWORDS

# Generate and display the word cloud
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000).generate(all_headlines)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Count the most common non-stopword terms
filtered_words = [word for word in all_headlines.split() if word not in stopwords]
counted_words = collections.Counter(filtered_words)

words = []
counts = []
for word, count in counted_words.most_common(10):
    words.append(word)
    counts.append(count)
print(words)
```
```
['When', 'Sebastian', 'Thrun', 'started', 'working', 'self-driving', 'cars', 'Google', '2007,', 'people']
```
You can learn more about the ideas and algorithms implemented in this library here: Word cloud
6. Yet Another Keyword Extractor (Yake)
YAKE is an unsupervised automatic keyword extraction method that identifies the most relevant keywords in a document using statistical features computed from that single document. The technique does not rely on dictionaries, external corpora, text size, or language, and it does not require training on a particular set of documents. The main features of the YAKE algorithm are as follows:
- Unsupervised approach
- Corpus-Independent
- Domain and Language Independent
- Single-Document
```python
import yake

doc = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(doc)
for kw in keywords:
    print(kw)
```
```
('Sebastian Thrun started', 0.006884150060415161)
('Thrun started working', 0.015042304599106411)
('Sebastian Thrun', 0.02140921543860024)
('Thrun started', 0.04498862876540802)
('cars at Google', 0.04498862876540802)
('started working', 0.09700399286574239)
('working on self-driving', 0.09700399286574239)
('self-driving cars', 0.09700399286574239)
('Sebastian', 0.1447773057422032)
('Thrun', 0.1447773057422032)
('Google', 0.1447773057422032)
('started', 0.29736558256021506)
('working', 0.29736558256021506)
('self-driving', 0.29736558256021506)
('cars', 0.29736558256021506)
('people', 0.29736558256021506)
('company', 0.29736558256021506)
```
You can learn more about the ideas and algorithms implemented in this library here: Yake
7. Textrazor API
In addition to the Python libraries above, an API is also a good option for this task. The TextRazor API can be accessed from a variety of programming languages, including Python, Java, and PHP. After creating a TextRazor account, we receive an API key we can use to extract keywords from text.
TextRazor is a good choice for developers who need fast extraction with comprehensive customization options. It is a keyword extraction service that can be used locally or in the cloud, and it is easy to connect to from a programming language. Beyond extracting keywords and entities in 12 different languages, we can design custom extractors and extract synonyms and entity relationships.
```python
import textrazor

textrazor.api_key = "your_api_key"

client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze_url("https://www.textrazor.com/docs/python")
for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score)
```
You can learn more about the API here: Textrazor API
Summary
With the rapid pace of technological development today, many keyword extraction solutions have been built that offer high speed and accuracy, and the applications developed on top of them keep growing as well. In this article, I only introduced some solutions that are easy to install and use. I hope it gives you a few more options when building and developing projects related to natural language.