Elasticsearch and data analysis process (Components of an analyzer)


1. Analyzer

The analyzer is the component that processes data sent to Elasticsearch for indexing, and it also processes the query text of full-text query types such as the match query. As I mentioned in the previous article, an analyzer consists of the following components: zero or more character filters that normalize the data before analysis, exactly one tokenizer that breaks the data into tokens, and zero or more token filters that normalize the tokens after analysis.

2. Built-in analyzers

Currently, Elasticsearch provides a number of built-in analyzers:

2.1 Standard

This is the default analyzer. It is built from the standard tokenizer, the lowercase token filter, and the stop token filter (disabled by default), all discussed later.
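
As a quick illustration, the _analyze API can be used to see what the standard analyzer produces; the sample sentence below is the same one that the result lists in the following sections refer to:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Result: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone]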

2.2 Simple

The simple analyzer splits the text into tokens whenever it encounters a non-letter character. It consists of just the lowercase tokenizer.

Result: [the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]

2.3 Whitespace

Splits the text into tokens on whitespace. It consists of just the whitespace tokenizer.

Result: [The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone.]

2.4 Stop

Like the simple analyzer, it splits the text into tokens when it encounters a non-letter character, but it additionally removes stopword tokens (a, an, the, ...; these will be mentioned later) thanks to an extra token filter, the stop token filter.

Result: [quick, brown, foxes, jumped, lazy, dog, s, bone]
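
The stopword list can be customized when the stop analyzer is configured in the index settings. A minimal sketch, assuming a hypothetical index my_index and analyzer name my_stop_analyzer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["a", "an", "the"]
        }
      }
    }
  }
}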

2.5 Keyword

Puts the entire text into a single token. It is generally recommended to map the field as not_analyzed (i.e. as a keyword field) instead of using the keyword analyzer.


2.6 Pattern

Splits the text into tokens based on a regular expression (regex). The default pattern is \W+, which splits on every run of non-word characters.

Result: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
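
A custom pattern can be supplied when the analyzer is defined in the index settings. A minimal sketch that splits on commas (the index name my_index and analyzer name comma_analyzer are only placeholders):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "red,green,blue"
}

Result: [red, green, blue]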

2.7 Language analyzer

A set of analyzers for different languages (arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Each combines tokenization, stopword removal, and stemming appropriate to its language.
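
For example, the english analyzer removes English stopwords and stems the remaining tokens (the exact stemmed forms depend on its stemmer, so treat this output as indicative):

POST _analyze
{
  "analyzer": "english",
  "text": "jumping foxes"
}

Result: [jump, fox]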

3. Custom analyzers

In addition to the built-in analyzers, we can create our own analyzer to suit the requirements of the problem by combining Elasticsearch character filters, tokenizers, and token filters.

For example, a custom analyzer can be defined from a character filter, a tokenizer, and token filters:
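
A minimal sketch of such a definition, assuming a hypothetical index my_index and analyzer name my_custom_analyzer; it combines the html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

Text indexed into a field that uses my_custom_analyzer will have HTML tags stripped, be split on word boundaries, lowercased, and have stopwords removed.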

3.1 Tokenizer

Elasticsearch tokenizers are currently divided into the following 3 groups:

3.1.1 Word oriented tokenizers

Tokenizers in this group split full text into individual words.

Standard tokenizer

This is the default tokenizer. It uses the grammar-based word segmentation defined in Unicode Standard Annex #29 (UAX #29), so it works well for most European languages.

Result: [The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone]

The options

  • max_token_length: the maximum allowed length of a token; if a token exceeds this length it is split at max_token_length intervals. The default is 255
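
A minimal sketch of configuring this option, assuming a hypothetical index my_index with a custom tokenizer named my_tokenizer (with max_token_length set to 5, a word like "jumped" is split into "jumpe" and "d"):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}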

Letter tokenizer

Splits the text into tokens whenever it encounters a non-letter character.

Result: [The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone]

Lowercase tokenizer

Like the letter tokenizer, it splits the text into tokens whenever a non-letter character is encountered, but it also lowercases the resulting tokens.

Result: [the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]

Whitespace tokenizer

Works like the letter tokenizer, but it splits the text into tokens whenever it encounters a whitespace character instead (the output on the sample sentence is the same as for the whitespace analyzer above).

The options

  • max_token_length: the maximum allowed length of a token; if a token exceeds this length it is split at max_token_length intervals. The default is 255

3.1.2 Structured text tokenizers

Tokenizers in this group split structured text, such as email addresses or file paths, rather than free text.

Keyword tokenizer

Outputs the entire input text as a single term. It is usually combined with token filters to normalize the output (an example follows below) and is often used when the exact value needs to be searchable.

For example

POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}

Result: [New York]
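
A sketch of combining the keyword tokenizer with the lowercase token filter in the _analyze API, which is a common way to get case-insensitive exact values:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New York"
}

Result: [new york]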

The options

  • buffer_size: the number of characters read into the term buffer in a single pass, the default is 256

UAX URL email tokenizer

Same as the standard tokenizer, except that it recognizes URLs and email addresses and keeps each of them as a single token.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com or visit https://john-smith.com"
}

Result: [Email, me, at, john.smith@global-international.com, or, visit, https://john-smith.com]

The options

  • max_token_length: the maximum allowed length of a token; if a token exceeds this length it is split at max_token_length intervals. The default is 255

Path hierarchy tokenizer

Used to handle directory and file paths. It splits the text into the hierarchy of paths, starting from the root directory.

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/var/log/elasticsearch.log"
}

Result: [/usr, /usr/local, /usr/local/var, /usr/local/var/log, /usr/local/var/log/elasticsearch.log]

Options:

  • delimiter: the character used to split the path, the default is '/'
  • replacement: an optional replacement character used in place of the delimiter in the output tokens, defaults to the delimiter itself (see the example below)
  • buffer_size: the number of characters read into the term buffer in a single pass, the default is 1024
  • reverse: if true, emits the tokens in reverse order (child -> parent), the default is false
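
A minimal sketch using a custom delimiter and replacement, assuming a hypothetical index my_index with a tokenizer named my_path_tokenizer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_path_analyzer": {
          "tokenizer": "my_path_tokenizer"
        }
      },
      "tokenizer": {
        "my_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_path_analyzer",
  "text": "one-two-three"
}

Result: [one, one/two, one/two/three]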

3.1.3 Partial word tokenizers

Tokenizers in this group break words and text into smaller fragments.

N-gram tokenizer

Splits the text into a sliding window of grams (substrings) of configurable length, which is useful for languages that do not separate words with spaces or that form long compound words, as German does.

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

Result: [Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x]

The options

  • min_gram: the minimum length of a gram, the default is 1
  • max_gram: the maximum length of a gram, the default is 2
  • token_chars: the character classes kept in a token; characters outside the listed classes are treated as separators. Possible classes are letter, digit, whitespace (e.g. " " or "\n"), punctuation, and symbol; the default is [] (keep all characters). See the example below
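
A minimal sketch restricting grams to letters and digits and fixing the gram length at 3, assuming a hypothetical index my_index with a tokenizer named my_ngram_tokenizer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "2 Quick Foxes."
}

Result: [Qui, uic, ick, Fox, oxe, xes]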

Edge N-gram tokenizer

Same as the N-gram tokenizer, but it only emits the grams that are anchored to the beginning of a word, which makes it useful for prefix (search-as-you-type) matching.
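
With the default settings (min_gram 1, max_gram 2 and no token_chars), the whole input is treated as one word, so only the leading grams are produced:

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

Result: [Q, Qu]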

3.2 Token filter

Token filters are used to normalize the tokens produced by the tokenizer before they are indexed. They are used to customize an analyzer, and an analyzer can contain zero or more token filters.

3.2.1 Standard token filter

In old versions it trimmed a trailing 's from words; in later versions it simply does nothing and has since been deprecated.

3.2.2 Lowercase / Uppercase token filter

Lowercases or uppercases the tokens.
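
A quick sketch with the _analyze API, applying the uppercase token filter on top of the standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": "quick Fox"
}

Result: [QUICK, FOX]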

3.2.3 Stop token filter

Removes tokens that are stopwords, such as a, an, the, and in English. Many different languages are supported: https://www.ranks.nl/stopwords

3.2.4 Stemming token filters

These token filters convert tokens back to their root form in a grammatical way; for example, "walks", "walking", and "walked" all share the root word "walk". In Elasticsearch there are stemming token filters corresponding to different algorithms, such as porter_stem, snowball, and kstem.
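
A quick sketch with the _analyze API, using the porter_stem token filter (other stemmers can produce slightly different forms):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["porter_stem"],
  "text": "walks walking walked"
}

Result: [walk, walk, walk]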

4. Conclusion

Above is my introduction to the analyzers of Elasticsearch as well as the components that make them up. Thank you for your interest ^ _ ^
