ElasticSearch Features and Installation

Tram Ho

This is lesson 3 of the series “What is Elastic? Should I apply to the project?”

The main features of ElasticSearch

ElasticSearch gives you access to Lucene’s indexing and search functionality. On the indexing side, you have many options for how text is processed and how the processed documents are stored.

When searching, you have many queries and filters to choose from. ElasticSearch exposes this functionality through a REST API, which lets you build queries in JSON and adjust most settings through that same API.

In addition to what Lucene provides, ElasticSearch adds higher-level features of its own, from caching to real-time analytics.

Another level of abstraction is the way you organize your documents: multiple indexes can be searched individually or together, and you can put different types of documents in each index.
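As a small illustration of searching indexes individually or together, the sketch below builds the address of the search endpoint for one or several indexes. It assumes a node on localhost and port 9200; the index names blog and forum and the helper search_url are hypothetical.

```python
from urllib.parse import quote

def search_url(indexes, host="localhost", port=9200):
    """Build the URL of the _search endpoint for one or more indexes."""
    if not indexes:
        raise ValueError("at least one index name is required")
    # Several index names are joined with commas in the URL path.
    path = ",".join(quote(name, safe="") for name in indexes)
    return f"http://{host}:{port}/{path}/_search"

print(search_url(["blog"]))           # one index
print(search_url(["blog", "forum"]))  # two indexes searched together
```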

ElasticSearch is exactly what its name suggests: elastic. It is clustered by default (you can call it a cluster even if you only have one server), and you can always add more servers to increase capacity or fault tolerance.

Extending Lucene functionality

In many cases, users search based on many criteria.

For example, you may have search criteria on multiple fields, some of them mandatory and some optional.

One of the most appreciated features of ElasticSearch is its well-structured REST API: you can build queries in JSON and combine different types of queries in many ways.

Through the same REST API you can read and change settings, as well as index documents.
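A minimal sketch of such a combined query, assuming the bool query of the ElasticSearch query DSL, where must holds mandatory criteria and should holds optional ones. The field names title and tags and their values are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical field names and values, chosen only for illustration.
query = {
    "query": {
        "bool": {
            # Mandatory criterion: a document must match to be returned.
            "must": [{"match": {"title": "bicycle race"}}],
            # Optional criterion: matching documents score higher.
            "should": [{"match": {"tags": "cycling"}}],
        }
    }
}

# The JSON text below is what would be sent as the body of a search request.
body = json.dumps(query, indent=2)
print(body)
```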

What about Apache Solr?

If you’ve heard of Lucene, then you’ve probably also heard of Solr, an open source, distributed search engine based on Lucene.

In fact, Lucene and Solr merged into a single Apache project in 2010, so you might be wondering how ElasticSearch compares with Solr.

Both search engines offer similar functionality, and their features evolve quickly with each new version.

You can find comparisons on the web, but I recommend treating them with caution. They are tied to specific versions, so they become obsolete within a few months, and many of them are biased for other reasons.

That said, a few historical facts help explain the origin of the two products. Solr was created in 2004 and ElasticSearch in 2010.

When ElasticSearch appeared, its distributed model (discussed in the section below) made scaling easier than with its competitors, which is where the “elastic” in its name comes from. In the meantime, however, Solr added sharding in version 4.0, which lets it compete with ElasticSearch on the distributed model as well as in other respects.

When it comes to how documents are indexed, an important aspect is analysis. Through analysis, the words in the text you index become terms in ElasticSearch.

For example, if you index “bicycle race”, the analysis might produce the terms “bicycle”, “race”, “cycling” and “racing”. When you search for any of these terms, the corresponding document is included in the results.

The same analysis is applied to the text you search for. If you search for “bicycle race”, you probably don’t want only exact matches of that string; you more likely want documents that contain the two words anywhere.

The default analyzer first splits the text into words wherever it finds a separator such as a comma or a space. It then converts the words to lowercase, so “Bicycle Race” becomes “bicycle” and “race”. There are many different analyzers, and you can also build your own.
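The two steps described above can be sketched in a few lines. This is a simplified illustration, not the analyzer ElasticSearch actually ships: it assumes that commas and whitespace are the only separators, and the function name analyze is hypothetical.

```python
import re

def analyze(text):
    """Split text on commas and whitespace, then lowercase each word."""
    words = re.split(r"[,\s]+", text)
    # Drop the empty strings produced by leading or trailing separators.
    return [word.lower() for word in words if word]

print(analyze("Bicycle Race"))  # ['bicycle', 'race']
```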

At this point, you might want to know more about the “data index”, which the figure below illustrates.

[Figure: the data index]

Data is organized in documents. By default, ElasticSearch stores your original documents, and it also stores the analyzed terms in an inverted index so that searches are both fast and relevant.
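To make the idea of an inverted index concrete, here is a toy sketch. It assumes a very simple analysis step (split on commas and whitespace, lowercase) and keeps everything in memory; the function names and sample documents are hypothetical, and Lucene's real index structures are far more elaborate.

```python
import re
from collections import defaultdict

def analyze(text):
    """Simplified analysis: split on commas and whitespace, then lowercase."""
    return [word.lower() for word in re.split(r"[,\s]+", text) if word]

def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = analyze(query)  # the query goes through the same analysis
    if not terms:
        return set()
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)

# Hypothetical documents, for illustration only.
documents = {1: "Bicycle Race", 2: "Race report", 3: "Bicycle repair"}
index = build_inverted_index(documents)
print(sorted(search(index, "bicycle race")))  # [1]
print(sorted(search(index, "RACE")))          # [1, 2]
```

Looking up a term is a single dictionary access, which is why searching the index is fast compared with scanning every document.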

Next, we will look at why ElasticSearch stores data in a document-oriented way and how documents are grouped by type and index.

Data structure in ElasticSearch

Unlike relational databases, which store data in rows, ElasticSearch stores data in documents.

To some extent, though, the two concepts are similar. A row in a table has columns, and each column holds one value. A document likewise has keys and values.

The difference is that a document is more flexible than a row, mainly because documents in ElasticSearch can be hierarchical.

For example, in the same way that you associate a key with a string value, as in “author”: “Joe”, a document can hold an array of strings, such as “tags”: [“cycling”, “bicycles”], or even nested key-value pairs, such as “author”: {“first_name”: “Joe”, “last_name”: “Smith”}.

This flexibility is important because it encourages you to keep everything that belongs to one logical entity in the same document, as opposed to spreading it across different rows and tables.

For example, the easiest (and possibly the fastest) way to store an article is to keep all of its data in the same document. Searches are then fast because you don’t need to join with other tables.
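As an illustration, a hierarchical document of this kind can be written as a nested dictionary and serialized to JSON. The title and the other values are hypothetical.

```python
import json

# A hypothetical article document: the title and the values are illustrative.
article = {
    "title": "Bicycle race report",
    "tags": ["cycling", "bicycles"],
    "author": {"first_name": "Joe", "last_name": "Smith"},
}

# The nested value is reached by following the keys, with no join needed.
print(article["author"]["last_name"])  # Smith

# Documents are exchanged with ElasticSearch as JSON.
print(json.dumps(article, indent=2))
```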

If you have a SQL background, you will have to do without join functionality, at least in version 1.76.

When everything is ready, you can download and install ElasticSearch and get started.

Install Java

If you do not already have a JRE (Java Runtime Environment) available, you will need to install one first. Any JRE will work, as long as it is version 1.7 or later.

Usually you install the one from Oracle ( www.java.com/en/download/index.jsp ).

Troubleshoot “Java not found”

With ElasticSearch, as with other Java applications, it can happen that you have downloaded and installed Java but the application fails to start and reports “Java not found.”

ElasticSearch’s startup script looks for Java in two places: the JAVA_HOME environment variable and the system path (PATH).

To check whether JAVA_HOME is set, use the env command on UNIX systems and the set command on Windows. To check whether Java is on the path, run the java -version command.

If it works, Java is already on the path (PATH). If it doesn’t, you must set the JAVA_HOME environment variable, or leave a comment below and I will help you.
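The same two checks can be scripted. The sketch below is only a diagnostic aid, not part of ElasticSearch: the helper find_java is hypothetical, and it simply reports the value of JAVA_HOME and the location of a java executable on the path, if any.

```python
import os
import shutil

def find_java(environ=None):
    """Report where Java can be found: JAVA_HOME and the system path."""
    environ = os.environ if environ is None else environ
    return {
        # Value of JAVA_HOME, or None when the variable is not set.
        "java_home": environ.get("JAVA_HOME"),
        # Full path of the java executable found on PATH, or None.
        "java_on_path": shutil.which("java", path=environ.get("PATH")),
    }

if __name__ == "__main__":
    print(find_java())
```

If both values come back as None, neither lookup would find Java, which matches the “Java not found” situation described above.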

Download and launch ElasticSearch

With Java already installed, you need to download ElasticSearch and launch it.

Download the package that best suits your environment. The following packages are available from www.elastic.co/downloads/elasticsearch: TAR, ZIP, RPM, and DEB.

For UNIX operating systems

If you run Linux, Mac, or another UNIX-like operating system, you can download the tar.gz package. Then unpack it and launch ElasticSearch with the following commands:

tar zxf elasticsearch-*.tar.gz

cd elasticsearch-*

bin/elasticsearch

Package management with Homebrew for OS X

If you need a simpler installation for a Mac, you can install Homebrew. Installation instructions can be found at http://brew.sh .

With Homebrew, you can install ElasticSearch with the following command:

brew install elasticsearch

Then launch ElasticSearch with:

elasticsearch

ZIP package

If you are running Windows, download the ZIP package, unzip it, and then run the elasticsearch.bat file in the bin directory.

RPM or DEB package

If you are running Red Hat Linux, CentOS, SUSE, or any other system that works with RPMs, or Debian, Ubuntu, or any other system that works with DEBs, you can use the RPM and DEB packages provided by Elastic.

You can see how to install them at www.elastic.co/guide/en/elasticsearch/reference/current/setup-repositories.html .

Once you have added the repository to your list and installed ElasticSearch, start it with the following command:

systemctl start elasticsearch.service

If your system doesn’t have systemctl, you can run the following command instead:

/etc/init.d/elasticsearch start

If you want to see what ElasticSearch is doing, take a look at the logs in /var/log/elasticsearch/ . If you installed from the TAR or ZIP package, you can find them in the logs/ directory of the unpacked archive.

Check the operation of ElasticSearch

ElasticSearch is now installed and running. Next, we will check the log generated during startup and connect to the REST API.

Check the log file

When you first run ElasticSearch, you will see a series of log lines showing what is happening. Let’s look at what these lines mean.

The first lines provide information about the node. The default name of the node here is kJhkPLX, which you can change to any name you like.

They also show the usable disk space, total disk space, heap size, node ID, PID, version, operating system, and JVM arguments.

[2020-01-17T15:49:46,058][INFO ][o.e.e.NodeEnvironment ] [kJhkPLX] using [1] data paths, mounts [[/ (/dev/disk1s1)]], net usable_space [24.7gb], net total_space [233.5gb], types [apfs]
[2020-01-17T15:49:46,061][INFO ][o.e.e.NodeEnvironment ] [kJhkPLX] heap size [989.8mb], compressed ordinary object pointers [true]
[2020-01-17T15:49:46,063][INFO ][o.e.n.Node ] [kJhkPLX] node name derived from node ID [kJhkPLXXQBq8pnV3lhX28w]; set [node.name] to override
[2020-01-17T15:49:46,063][INFO ][o.e.n.Node ] [kJhkPLX] version[6.8.5], pid[83413], build[oss/tar/78990e9/2019-11-13T20:04:24.100411Z], OS[Mac OS X/10.14.5/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_202/25.202-b08]
[2020-01-17T15:49:46,064][INFO ][o.e.n.Node ] [kJhkPLX] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/var/folders/c8/8nmz64pn7fs0x8yp50xldv300000gn/T/elasticsearch-7046467428247345908, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.path.home=/usr/local/Cellar/elasticsearch/6.8.5/libexec, -Des.path.conf=/usr/local/etc/elasticsearch, -Des.distribution.flavor=oss, -Des.distribution.type=tar]

Plugins and modules are loaded at startup:

[2020-01-17T15:49:46,968][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [aggs-matrix-stats]
[2020-01-17T15:49:46,968][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [analysis-common]
[2020-01-17T15:49:46,968][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [ingest-common]
[2020-01-17T15:49:46,968][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [ingest-geoip]
[2020-01-17T15:49:46,968][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [ingest-user-agent]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [lang-expression]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [lang-mustache]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [lang-painless]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [mapper-extras]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [parent-join]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [percolator]
[2020-01-17T15:49:46,969][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [rank-eval]
[2020-01-17T15:49:46,970][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [reindex]
[2020-01-17T15:49:46,970][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [repository-url]
[2020-01-17T15:49:46,970][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [transport-netty4]
[2020-01-17T15:49:46,970][INFO ][o.e.p.PluginsService ] [kJhkPLX] loaded module [tribe]

Port 9300 is used by default for communication between nodes. If you use the Java API instead of the REST API, this is the port you need to connect to.

[2020-01-17T15:49:50,473][INFO ][o.e.t.TransportService ] [kJhkPLX] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}

The next line shows that a master node has been elected and that its name is kJhkPLX. Each cluster has a master node, which is responsible for knowing which nodes are in the cluster and where all the shards are located. When the master node becomes unavailable, a new one is elected. In this case, because you started the first node, that node is the master.

[2020-01-17T15:49:53,578][INFO ][o.e.c.s.MasterService ] [kJhkPLX] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {kJhkPLX}{kJhkPLXXQBq8pnV3lhX28w}{auKqCeD6Q5OKnODTFkpgKw}{127.0.0.1}{127.0.0.1:9300}

Port 9200 is used by default for HTTP communication. This is where you connect to the REST API:

[2020-01-17T15:49:53,627][INFO ][o.e.h.n.Netty4HttpServerTransport] [kJhkPLX] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}

The next line tells you that your node has started. At this point you can connect to it and send requests.

[2020-01-17T15:49:53,627][INFO ][o.e.n.Node ] [kJhkPLX] started

GatewayService is the component of ElasticSearch that persists data to disk, so you don’t lose data if the node goes down:

[2020-01-17T15:49:53,633][INFO ][o.e.g.GatewayService ] [kJhkPLX] recovered [0] indices into cluster_state

When you start a node, the gateway looks on the disk to see whether any data was saved, so that it can restore it. In this case, there are no indexes to restore.

Much of the information we saw in the logs, from the node name to the gateway settings, is configurable. We will look at configuration and its implications in another article. For now you do not need to change anything, because the default values are suitable for development.

Note: If you start ElasticSearch on another computer on the same network, it may join the same cluster, which can lead to unexpected results, such as shards being balanced between the two machines. To prevent this, change the name of the cluster in the elasticsearch.yml configuration file.

Use the REST API

The easiest way to connect to the REST API is to point your browser to http://localhost:9200/ .

If ElasticSearch is not installed on your local machine, replace localhost with the address of the machine it runs on.

By default, ElasticSearch listens for HTTP requests on port 9200.

If your request is received, a JSON response is returned, as in the figure below:

[Figure: JSON response from the REST API]
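The same request can also be sent from a script instead of the browser. This is a minimal sketch using only the Python standard library; it assumes a node listening on localhost and port 9200, and the helper names root_url and fetch_root are hypothetical.

```python
import json
import urllib.error
import urllib.request

def root_url(host="localhost", port=9200):
    """Build the address of the REST API root."""
    return f"http://{host}:{port}/"

def fetch_root(url, timeout=5):
    """Send a GET request and decode the JSON body of the response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

if __name__ == "__main__":
    url = root_url()
    try:
        print(json.dumps(fetch_root(url), indent=2))
    except urllib.error.URLError as error:
        # Raised, for example, when nothing is listening on the port.
        print(f"Could not reach {url}: {error.reason}")
```

To query a node on another machine, pass its address to root_url in place of localhost.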

Summary

  1. ElasticSearch is an open source, distributed search engine built on Apache Lucene.
  2. A typical use case for ElasticSearch is indexing large amounts of data so that you can run full-text searches and real-time statistics on it.
  3. Beyond full-text search, you can tune the relevance of your searches and offer search suggestions.
  4. To get started, download the appropriate package, unpack it, and run ElasticSearch.
  5. To index and search data, as well as to configure the cluster, you send JSON over the HTTP API and get JSON back.
  6. You can think of ElasticSearch as a NoSQL data store with real-time search and analytics capabilities. It is document-oriented and scalable by default.
  7. ElasticSearch automatically divides data into shards, which are balanced across the servers in the cluster. This makes it easy to add and remove servers. Shards can also be replicated to guard against server failures.

References

Elasticsearch In Action (Matthew Lee Hinman and Radu Gheorghe)

Feedback

Please take a minute to leave a comment. Your feedback helps later readers understand the article better.

Thank you for your interest in this article. I wish you a good day!

Source : Viblo