1. Analyzer
An analyzer processes data submitted to Elasticsearch at index time, and also processes the input of full-text queries such as the match query. As I mentioned in the previous article, an analyzer consists of the following components: zero or more character filters to normalize the data before analysis, exactly one tokenizer to split the data into tokens, and zero or more token filters to normalize the tokens after analysis.
2. Built-in analyzers
Elasticsearch currently provides a number of built-in analyzers.
2.1 Standard
This is the default analyzer. It consists of the standard tokenizer plus the standard, lowercase, and stop token filters (the stop filter is disabled by default); token filters are discussed later.
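For example, analyzing the sample sentence used throughout this article:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone]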
2.2 Simple
The simple analyzer splits the text into tokens whenever it encounters a non-letter character and lowercases them. It consists of the lowercase tokenizer.
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
2.3 Whitespace
Splits the text into tokens on whitespace. It consists of the whitespace tokenizer.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone.]
2.4 Stop
Like the simple analyzer, it splits the text into tokens whenever it encounters a non-letter character, but it additionally removes stopword tokens (a, an, the, ...; stopwords are discussed later) thanks to an extra token filter, the stop token filter.
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [quick, brown, foxes, jumped, lazy, dog, s, bone]
2.5 Keyword
Puts the entire text into a single token. It is recommended to set the field to not_analyzed instead of using the keyword analyzer.
curl -XPOST 'localhost:9200/my_index' -d '
{
  "mappings": {
    "document": {
      "properties": {
        "field_name": {
          "type": "string",
          "analyzer": "keyword"
        }
      }
    }
  }
}'
The recommended alternative:
curl -XPOST 'localhost:9200/my_index' -d '
{
  "mappings": {
    "document": {
      "properties": {
        "field_name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
2.6 Pattern
Splits the text into tokens based on a regular expression (regex). The default pattern is \W+, which splits the text on every run of non-word characters.
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
2.7 Language analyzer
A set of analyzers for specific languages (arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai).
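A language analyzer can be tried out directly with the _analyze API, for example:

POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dogs"
}

The english analyzer lowercases the text, removes English stopwords such as "the", and stems each token ("foxes" becomes "fox", "jumped" becomes "jump").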
3. Custom analyzers
In addition to the built-in analyzers, we can create our own analyzers to suit the requirements of the problem by combining Elasticsearch character filters, tokenizers, and token filters.
For example, a custom analyzer built from a character filter, a tokenizer, and token filters:
curl -X PUT "localhost:9200/my_index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}'

curl -X POST "localhost:9200/my_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_custom_analyzer",
  "text": "I'\''m a :) person, and you?"
}'
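The second request should return the tokens [i'm, _happy_, person, you]: the emoticon was mapped to _happy_ by the character filter, and the stopwords "a" and "and" were removed.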
3.1 Tokenizer
Elasticsearch tokenizers are currently divided into the following three groups:
3.1.1 Word-oriented tokenizers
Tokenizers in this group split full text into individual words.
Standard tokenizer
This is the default tokenizer. It uses the Unicode Text Segmentation word-boundary algorithm (UAX #29), so it works well for most European languages.
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone]
The options
- max_token_length: the maximum allowed token length; if a token exceeds it, the token is split at max_token_length intervals. The default is 255
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
Letter tokenizer
Splits the text into tokens whenever it encounters a non-letter character.
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone]
Lowercase tokenizer
Like the letter tokenizer, it splits the text into tokens whenever it encounters a non-letter character, but it also lowercases the resulting tokens.
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Results: [the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
Whitespace tokenizer
Splits the text into tokens whenever it encounters a whitespace character.
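For example:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results: [The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone.]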
The options
- max_token_length: the maximum allowed token length; the default is 255
3.1.2 Structured-text tokenizers
Tokenizers in this group split structured text, such as email addresses and file paths, into smaller pieces according to the structure of the text rather than treating it as natural language.
Keyword tokenizer
Outputs the entire input text as a single term. It is usually combined with token filters to normalize the output, and is often used when the whole value should be matched as-is.
For example:

POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}

Results: [New York]
The options
- buffer_size: the number of characters read into the term buffer in a single pass; the default is 256
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "keyword",
          "buffer_size": 200
        }
      }
    }
  }
}
UAX URL email tokenizer
Same as the standard tokenizer, except that it recognizes URLs and email addresses and keeps each as a single token.
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at [email protected] or visit https://john-smith.com"
}
Results: [Email, me, at, [email protected], or, visit, https://john-smith.com]
The options
- max_token_length: the maximum allowed token length; the default is 255
Path hierarchy tokenizer
Used to handle directory and file paths. It splits the path into tokens, one for each level of the hierarchy starting from the root.
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/var/log/elasticsearch.log"
}
Results: [/usr, /usr/local, /usr/local/var, /usr/local/var/log, /usr/local/var/log/elasticsearch.log]
Options:
- delimiter: the path separator; the default is '/'
- replacement: an optional replacement character for the delimiter in the output tokens; defaults to the delimiter itself
- buffer_size: the number of characters read into the term buffer in a single pass; the default is 1024
- reverse: if true, tokens are emitted in child-to-parent order instead; the default is false
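Putting these options together, here is a minimal configuration sketch (my_index, my_analyzer, and my_tokenizer are illustrative names):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "reverse": true
        }
      }
    }
  }
}

With this setup, a value such as "one-two-three" is split on '-', the delimiter is replaced with '/' in the output, and, because reverse is true, the tokens are built from the end of the path.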
3.1.3 Partial-word tokenizers
These split words and text into smaller fragments, which is useful for partial-word matching.
N-gram tokenizer
Slides a window over the string and emits the continuous character sequences (grams) it covers. This is useful for languages with long compound words that are not separated by spaces, such as German.
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

Results: [Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x]
The options
- min_gram: the minimum length of a gram; the default is 1
- max_gram: the maximum length of a gram; the default is 2
- token_chars: the character classes allowed inside a token; characters outside these classes are treated as gram boundaries. The classes are letter, digit, whitespace (e.g. " " or "\n"), punctuation, and symbol. By default the list is empty, meaning all characters are kept.
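For example, the following sketch (names are illustrative) restricts tokens to letters and digits and fixes the gram length at 3, a common setup for partial matching:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}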
Edge N-gram tokenizer
Same as the n-gram tokenizer, but it only emits grams anchored at the beginning of each word, which makes it a good fit for autocomplete-style matching.
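For example, with the default settings (min_gram 1, max_gram 2):

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

Results: [Q, Qu]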
3.2 Token filter
Token filters normalize the tokens produced by the tokenizer before indexing. They are used to customize an analyzer; an analyzer can have zero or more token filters.
3.2.1 Standard token filter
In older versions it trimmed a trailing 's from words; in later versions it simply does nothing and remains only for backward compatibility.
3.2.2 Lowercase / Uppercase token filter
Lowercases / uppercases the tokens.
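The effect is easy to see with the _analyze API, which lets you combine a tokenizer with token filters directly:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "The QUICK Brown FOXES"
}

Results: [the, quick, brown, foxes]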
3.2.3 Stopword token filter
Removes tokens that are stopwords, such as the words a, an, the, and, ... in English. Many different languages are supported; see https://www.ranks.nl/stopwords for example lists.
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
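A filter defined this way takes effect only when an analyzer references it; a minimal sketch (my_analyzer is an illustrative name):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      }
    }
  }
}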
3.2.4 Stemming token filters
These filters reduce tokens to their grammatical root form; for example, "walks", "walking", and "walked" all share the root "walk". Elasticsearch provides three stemming algorithms as three token filters: porter_stem, snowball, and kstem.
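For example, applying the porter_stem filter directly with the _analyze API:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["porter_stem"],
  "text": "walks walking walked"
}

Results: [walk, walk, walk]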
4. Conclusion
Above is my introduction to the analyzers of Elasticsearch and the components that make them up. Thank you for your interest ^ _ ^
Reference source:
- Tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
- Token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
- Ebook "Elasticsearch in Action", Part 1, Chapter 5: https://github.com/BlackThursdays/https-github.com-TechBookHunter-Free-Elasticsearch-Books/blob/master/book/Elasticsearch%20in%20Action.pdf