Elasticsearch in action

Tram Ho

What is Elasticsearch?

  • Full-text search engine.
  • NoSQL database.
  • Analytics engine.
  • Written in Java.
  • Lucene-based.
  • Inverted indices.
  • Easy to scale.
  • RESTful interface (HTTP / JSON).
  • Schemaless.
  • Real-time.
  • ELK stack.

Download Elasticsearch.

This article uses Elasticsearch 7.5.

https://www.elastic.co/downloads/elasticsearch

After downloading and installing, run Elasticsearch.
You can point your browser to http://localhost:9200 (or use curl, which I prefer) to check that Elasticsearch started successfully. It responds with a JSON document describing the node, including its name and version.
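
The same check can be scripted. This is a minimal Python sketch using only the standard library; it assumes a default local install on port 9200, and the helper names root_request and fetch_version are made up for illustration.

```python
import json
import urllib.request

# Assumption: a default local install listening on port 9200.
ES_URL = "http://localhost:9200"


def root_request(base_url: str = ES_URL) -> urllib.request.Request:
    """Build (without sending) the GET request for the root endpoint."""
    return urllib.request.Request(base_url + "/", method="GET")


def fetch_version(base_url: str = ES_URL) -> str:
    """Send the request and return the version number reported by the node."""
    with urllib.request.urlopen(root_request(base_url), timeout=5) as resp:
        info = json.load(resp)
    return info["version"]["number"]
```

With a node running, fetch_version() should return the installed version as a string; if nothing is listening on the port, urlopen raises an error instead.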

Some concepts.

Compared to a relational database (RDBMS), the following terms can be read as rough equivalents.

RDBMS    | Elasticsearch
Database | Index
Table    | Type
Row      | Document

Index.

To create a database (called an index in Elasticsearch) we send a PUT request whose path is the index name. For example, to create an index named post:
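
A minimal Python sketch of that request, assuming the node runs at http://localhost:9200. The function name create_index_request is hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def create_index_request(index: str, base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the PUT request that creates an empty index with default settings."""
    return urllib.request.Request(f"{base_url}/{index}", method="PUT")


req = create_index_request("post")
print(req.get_method(), req.full_url)  # PUT http://localhost:9200/post
# To actually create the index: urllib.request.urlopen(req)
```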

Document.

To create a document, just pass a JSON body and assign it an id:
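
As a hedged sketch, the request could be built like this in Python. The path follows the index/type/id pattern described in this section (post, doc, 1); the document fields and the helper name index_document_request are assumptions for illustration. On 7.x the typeless form PUT /post/_doc/1 is the preferred spelling.

```python
import json
import urllib.request


def index_document_request(index, doc_type, doc_id, document,
                           base_url="http://localhost:9200"):
    """Build the PUT request that stores `document` under the given id."""
    return urllib.request.Request(
        f"{base_url}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical document: field names and values are illustrative only.
post = {"title": "Hello Elasticsearch", "category": "apps"}
req = index_document_request("post", "doc", 1, post)
print(req.get_method(), req.full_url)  # PUT http://localhost:9200/post/doc/1
```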

In the above request, post is the name of the index, doc is the type, and 1 is the id.
A little more about types: in Elasticsearch, every saved document belongs to one index and one mapping type. For example, an index twitter can have the types user and tweet, each with its own fields: user has user_name and email, while tweet has content, tweeted_at, and also user_name.
(Documents are created the same way: PUT /twitter/user/1, PUT /twitter/tweet/1.)

In Elasticsearch, people often treat an index like a database in SQL and a type like a table. This is a poor analogy with bad consequences. In an SQL database, tables are independent: two columns with the same name in two different tables are unrelated. In Elasticsearch that is not the case: fields with the same name in different types of the same index are backed by the same Lucene field internally. There are two alternatives:

  • Give each type its own index.
  • Or use a custom type field.

Therefore, from Elasticsearch 7.x, specifying the type in the indexing API is no longer needed. From Elasticsearch 8, type declarations in the API will not be supported. See details at: https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html

Returning to the example, after the post has been created we can retrieve it with the GET method:
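
A small Python sketch of the lookup, assuming the same index, type, and id as in this section. The helper names are hypothetical. extract_source relies on the found and _source keys that Elasticsearch includes in a GET response.

```python
import urllib.request


def get_document_request(index, doc_type, doc_id, base_url="http://localhost:9200"):
    """Build the GET request that fetches one document by id."""
    return urllib.request.Request(f"{base_url}/{index}/{doc_type}/{doc_id}", method="GET")


def extract_source(response: dict):
    """Return the stored JSON from a parsed GET response, or None if not found."""
    return response["_source"] if response.get("found") else None


req = get_document_request("post", "doc", 1)
print(req.get_method(), req.full_url)  # GET http://localhost:9200/post/doc/1
```

The original JSON body comes back unchanged under _source, next to metadata such as _index, _id, and _version.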

Mapping.

At the start of the article I mentioned that Elasticsearch is schemaless; that is not quite true. Check the mapping of the index post from the previous example:
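
The mapping is available at GET /post/_mapping. The sketch below, with a hypothetical helper field_types, reduces a parsed 7.x typeless mapping response to a field-to-type table. The sample response in the usage lines is illustrative, not output captured from a real cluster.

```python
MAPPING_PATH = "/post/_mapping"  # send as GET to http://localhost:9200


def field_types(mapping_response: dict, index: str) -> dict:
    """Map each top-level field name to its declared type."""
    properties = mapping_response[index]["mappings"].get("properties", {})
    # Object fields carry no explicit "type" key, so default to "object".
    return {name: spec.get("type", "object") for name, spec in properties.items()}


# Illustrative response shape for a dynamically mapped string field.
sample = {
    "post": {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                }
            }
        }
    }
}
print(field_types(sample, "post"))  # {'title': 'text'}
```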

(text: analyzed, keyword: not analyzed)

We see that every field is text: if no mapping is specified, Elasticsearch guesses the data type for us.
This is not always what we want; for example, fields that should be a date/time or numeric type can end up as text when their values arrive as strings.
Instead, we can define the mapping manually when creating the index, by passing it as JSON:
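
A hedged sketch of such a request body, written as a Python dict. Only title and category are field names taken from this article; published_at, views, the raw sub-field name, and the choice of the built-in english analyzer are assumptions for illustration. The body is sent with PUT /post, and the index must not exist yet.

```python
import json


def post_index_body() -> dict:
    """Explicit 7.x typeless mapping for the post index (illustrative fields)."""
    return {
        "mappings": {
            "properties": {
                # Full-text field, plus a non-analyzed "raw" copy for exact matching.
                "title": {
                    "type": "text",
                    "analyzer": "english",
                    "fields": {"raw": {"type": "keyword"}},
                },
                "category": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                },
                # Hypothetical fields showing non-text types.
                "published_at": {"type": "date"},
                "views": {"type": "integer"},
            }
        }
    }


print(json.dumps(post_index_body(), indent=2))
```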

Analyzers.

First, look again at the mapping request (the same as above), which declares an analyzer for the title field.

What is an analyzer?
Analyzed vs. non-analyzed <=> full-text vs. exact value.
An analyzer usually has these steps:

  • Character filter. (replace character)
  • Tokenizer. (Breaking text into terms)
  • Token filters. (Add / delete / correct tokens)
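
The three steps can be imitated in a few lines of plain Python. This is a toy pipeline for intuition only, not how Lucene implements analysis: the stopword list is a tiny made-up sample and there is no stemming.

```python
import re

# Tiny illustrative stopword list (an assumption, not Lucene's list).
STOPWORDS = {"a", "an", "are", "is", "the"}


def char_filter(text: str) -> str:
    """Character filter: rewrite characters before tokenizing."""
    return text.replace("&", " and ")


def tokenizer(text: str) -> list:
    """Tokenizer: break the text into terms on non-word characters."""
    return re.findall(r"\w+", text)


def token_filters(tokens: list) -> list:
    """Token filters: lowercase every token, then drop stopwords."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]


def analyze(text: str) -> list:
    """Run the three stages in order."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("Hey man, how are you doing?"))
# ['hey', 'man', 'how', 'you', 'doing']
```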

See the Elasticsearch built-in analyzers here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
Analyzer example:

Hey man, how are you doing?

  • Whitespace analyzer: Hey | man, | how | are | you | doing? |
  • English analyzer: hei | man | how | you | do |

Test the newly created analyzer as follows:
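
Elasticsearch exposes an _analyze endpoint for this. The sketch below builds its request body and pulls the token strings out of a parsed response. It assumes the built-in english analyzer; the helper names are hypothetical, and the sample response is trimmed to the token key (real responses also carry offsets, type, and position).

```python
ANALYZE_PATH = "/post/_analyze"  # send as POST to http://localhost:9200


def analyze_body(text: str, analyzer: str = "english") -> dict:
    """Request body asking Elasticsearch to run `analyzer` over `text`."""
    return {"analyzer": analyzer, "text": text}


def token_strings(response: dict) -> list:
    """Extract just the token text from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


# Trimmed illustrative response using the tokens listed in this article.
sample = {"tokens": [{"token": t} for t in ["hei", "man", "how", "you", "do"]]}
print(analyze_body("Hey man, how are you doing?"))
print(token_strings(sample))  # ['hei', 'man', 'how', 'you', 'do']
```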

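Now perform a search: assuming there are many documents, search for the word working. With an analyzer such as english on the field, the query term is stemmed too (working becomes work), so documents containing work or works can also match.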

Search

First, import this data set into Elasticsearch: https://gist.githubusercontent.com/lumosnysm/664e4b76c81eacefaa515c7c1133823c/raw/ebbd60808a868bc3626497d77e3f984747dfd9bb/post.json

To retrieve all documents, use the GET method on the _search endpoint.
The results are paginated; by default only the first 10 hits are returned.
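
A sketch of the request and of reading the paginated response in Python. The helper summarize_hits is hypothetical; it assumes the 7.x response format, where hits.total is an object with a value key. The sample response is illustrative only.

```python
SEARCH_PATH = "/post/_search"  # send as GET to http://localhost:9200


def summarize_hits(response: dict) -> tuple:
    """Return (total matching documents, ids on the returned page)."""
    hits = response["hits"]
    return hits["total"]["value"], [h["_id"] for h in hits["hits"]]


# Illustrative response: 25 documents match, but only one page comes back.
sample = {
    "hits": {
        "total": {"value": 25, "relation": "eq"},
        "hits": [{"_id": str(i), "_score": 1.0} for i in range(10)],
    }
}
total, page_ids = summarize_hits(sample)
print(total, len(page_ids))  # 25 10
```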

Alternatively, use POST and pass the query as a JSON body; the results are identical:
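
A minimal sketch of the POST variant, using the match_all query, which selects every document. The function name match_all_request is an assumption.

```python
import json
import urllib.request


def match_all_request(index: str, base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build a POST search whose JSON body matches every document."""
    body = {"query": {"match_all": {}}}
    return urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = match_all_request("post")
print(req.get_method(), req.full_url)  # POST http://localhost:9200/post/_search
```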

Similarly, the two approaches below are equivalent:

To count documents, use the _count endpoint:
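
A short sketch, assuming the post index. The response of _count carries the number under the count key; the helper names are hypothetical.

```python
import urllib.request


def count_request(index: str, base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the GET request that counts all documents in an index."""
    return urllib.request.Request(f"{base_url}/{index}/_count", method="GET")


def read_count(response: dict) -> int:
    """Pull the document count out of a parsed _count response."""
    return response["count"]


req = count_request("post")
print(req.get_method(), req.full_url)  # GET http://localhost:9200/post/_count
```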

Filter and Query

Filter :

  • Does the document match? (Yes or no).
  • Not interested in relevance.
  • Fast and cacheable.
  • Used for non-analyzed fields (as above, the raw field).

Query :

  • Is the matched document good?
  • Full-text search.
  • Used for analyzed fields.

Example of using a filter:
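
One way to write such a filter on 7.x is a bool query placed in filter context. The sketch below is an assumption about the shape of the request, not the article's original: it presumes a keyword sub-field called category.raw, and the category value games is made up. The body is sent to POST /post/_search.

```python
import json


def bool_filter_search(must=None, must_not=None, should=None) -> dict:
    """Wrap term-level clauses in a bool query that runs in filter context."""
    clauses = {}
    if must:
        clauses["must"] = must
    if must_not:
        clauses["must_not"] = must_not
    if should:
        clauses["should"] = should
    return {"query": {"bool": {"filter": {"bool": clauses}}}}


body = bool_filter_search(
    must=[{"term": {"category.raw": "apps"}}],
    must_not=[{"term": {"category.raw": "games"}}],  # hypothetical value
)
print(json.dumps(body, indent=2))
```

Because everything sits in filter context, no relevance is computed; a document either passes the clauses or it does not.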

Inside a filter we can use must, must_not, and should.
Put simply: must is AND, must_not is NOT, and should is OR.

Relevance

See the following example using a query:
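
A hedged sketch of such a query: a match on the title field for the words good news. The helper names are hypothetical, and the sample response is illustrative: it reuses the two scores quoted in this section with made-up ids.

```python
def title_match(text: str) -> dict:
    """Full-text query: documents containing any of the words match."""
    return {"query": {"match": {"title": text}}}


def ranked(response: dict) -> list:
    """Return (id, score) pairs in the order Elasticsearch returned them."""
    return [(h["_id"], h["_score"]) for h in response["hits"]["hits"]]


# Illustrative response; the ids "a" and "b" are placeholders.
sample = {
    "hits": {
        "max_score": 9.7,
        "hits": [{"_id": "a", "_score": 9.7}, {"_id": "b", "_score": 4.9}],
    }
}
print(title_match("good news"))
print(ranked(sample))  # [('a', 9.7), ('b', 4.9)]
```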

Note the max_score and _score fields. The first document contains both ‘good’ and ‘news’, so its score (9.7) is higher than that of the second document (4.9), which contains only the word ‘news’. The results are returned sorted by score, from high to low.

In addition, we can use the following request, in which every hit gets the same score of 1.0, so we are free to sort the results by any field we like. Used this way, Elasticsearch behaves more like a NoSQL database than a full-text search engine.
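
One request with this behaviour is a constant_score query, which wraps a filter and assigns every hit the same score (1.0 by default). This sketch assumes that is the kind of request meant here; the sort field published_at is hypothetical.

```python
import json


def constant_score_search(filter_clause: dict, sort=None) -> dict:
    """Give every document matching the filter the same constant score."""
    body = {"query": {"constant_score": {"filter": filter_clause}}}
    if sort:
        # With an explicit sort, Elasticsearch orders hits by these fields and
        # does not compute scores unless "track_scores" is enabled.
        body["sort"] = sort
    return body


body = constant_score_search(
    {"match": {"title": "news"}},
    sort=[{"published_at": {"order": "desc"}}],  # hypothetical date field
)
print(json.dumps(body, indent=2))
```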

See another example:
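
A sketch of a request of this kind, assuming the title and category fields used in this section: the must clause decides which documents match, and the should clause only adds to the score. The function name is hypothetical.

```python
import json


def boosted_by_category(title_text: str, category: str) -> dict:
    """Require a title match; reward documents in the given category."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": title_text}}],
                # Optional clause: it never excludes a hit, it only raises _score.
                "should": [{"match": {"category": category}}],
            }
        }
    }


print(json.dumps(boosted_by_category("good news", "apps"), indent=2))
```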

The request above uses should, which behaves differently in a query than in a filter. In a filter, should is simply an OR operation. In a query that also has a must clause, documents are returned whether or not the should clause matches, but a match increases the relevance of that document.
As in the request above, we still search for posts whose title contains ‘good news’; in addition, the score is boosted if the document has the category ‘apps’. Running it shows that the document with id 5895, whose category contains ‘apps’, now scores 7.7, higher than the 4.9 it gets without the should clause.

In addition, we can set the boost of a query clause manually, as follows:
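
A hedged sketch using the boost parameter of the match query. The factor 2.0 is an arbitrary illustrative value, and the function name is an assumption.

```python
import json


def boosted_should(title_text: str, category: str, boost: float = 2.0) -> dict:
    """Same shape as a must/should query, with an explicit boost on the should clause."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": title_text}}],
                "should": [
                    # The long form of match accepts "query" plus options such as "boost".
                    {"match": {"category": {"query": category, "boost": boost}}}
                ],
            }
        }
    }


print(json.dumps(boosted_should("good news", "apps"), indent=2))
```

A boost above 1.0 makes the clause count for more in the final score; a value between 0 and 1.0 makes it count for less.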

Aggregation

A basic aggregation is like GROUP BY in an SQL database, but more powerful.
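
For example, a terms aggregation groups documents by the distinct values of a field. The sketch assumes a keyword sub-field category.raw (terms aggregations need a non-analyzed field); the aggregation name by_category, the helper names, and the sample response are illustrative.

```python
def category_counts_body(field: str = "category.raw") -> dict:
    """Count documents per distinct value of `field`, returning no hits."""
    return {
        "size": 0,  # only the aggregation result is needed, not the documents
        "aggs": {"by_category": {"terms": {"field": field}}},
    }


def bucket_counts(response: dict) -> dict:
    """Turn the buckets of the by_category aggregation into {value: count}."""
    buckets = response["aggregations"]["by_category"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


# Illustrative response with made-up counts.
sample = {
    "aggregations": {
        "by_category": {
            "buckets": [
                {"key": "apps", "doc_count": 3},
                {"key": "news", "doc_count": 2},
            ]
        }
    }
}
print(bucket_counts(sample))  # {'apps': 3, 'news': 2}
```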

We can even nest aggregations for deeper analysis:
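
A sketch of a nested aggregation: documents are first grouped by category, and an average is then computed inside each bucket. The numeric field views and the keyword sub-field category.raw are assumptions for illustration, as are the aggregation names.

```python
import json


def avg_views_per_category() -> dict:
    """Terms aggregation with an avg sub-aggregation inside every bucket."""
    return {
        "size": 0,
        "aggs": {
            "by_category": {
                "terms": {"field": "category.raw"},
                # Sub-aggregations are evaluated once per parent bucket.
                "aggs": {"avg_views": {"avg": {"field": "views"}}},
            }
        },
    }


print(json.dumps(avg_views_per_category(), indent=2))
```

In the response, each bucket then carries its own avg_views object next to key and doc_count.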

That covers some of the basics of Elasticsearch; I hope it helps you.

References
https://www.elastic.co/
https://github.com/ThijsFeryn/elasticsearch_tutorial

Source: Viblo