Reindex Elasticsearch data with zero downtime

Tram Ho

Elasticsearch is a great search engine for any project that wants to add search functionality to its product, with features like near-real-time search, auto-complete, suggestions, and so on. Its distributed architecture also makes it easy to scale and to handle failures.

However, along with its powerful features, Elasticsearch also comes with trade-offs. Once an index is in operation and holds many documents, you cannot change the data type of an existing field, and it is hard to change the mapping structure or the number of shards when you want to optimize your system.

When we want to do any of the above, the only way is to reindex all of our data into a new index. This is no problem with an insignificant amount of data, but if your system runs in production with a large amount of searchable text (say more than 100 MiB and around 1 million documents), then reindexing the data from a relational database into ES, or even reindexing directly within ES, becomes a real challenge.

So in this article, let’s look at a zero-downtime reindexing method. The reference article is at: https://medium.com/craftsmenltd/rebuild-elasticsearch-index-without-downtime-168363829ea4.

It is imperative to use an index alias for your system

The first prerequisite for keeping the reindex process painless is the use of an alias. If you are not yet familiar with aliases, please refer to the Elasticsearch documentation on aliases first.

Simply put, an alias is just another name for your index. So why do we have to use an alias when the index name already seems to be enough?

  • Within the same cluster, index names must be unique: once a name is taken by a live index, you cannot create another index with that name. If you want to reuse the name anyway, about the only thing you can do is build another cluster and migrate the data over, which is very expensive.
  • If we only use the index name, then every time we rebuild the index we have to give it a new name, for example:
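For instance (these versioned names are just illustrative, following the naming used later in this article), every rebuild means creating a brand-new index under a new name:

```
PUT /article_ver1.0
PUT /article_ver1.1
PUT /article_ver1.2
```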

And with that, each time we have to update the application code to use the new index name. Moreover, just by looking at the index names it is hard to tell which index is currently in use and which is not.

  • When using an alias, we can give it a generic, stable name and point it to whichever index we want (the index name itself can now include a version element or a timestamp so we always know the current version), for example with the Kibana dev tool:
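A minimal sketch, assuming the index article_ver1.0 and the alias article_production that appear below:

```
PUT /article_ver1.0

POST /_aliases
{
  "actions": [
    { "add": { "index": "article_ver1.0", "alias": "article_production" } }
  ]
}
```

From now on the application only ever refers to article_production, regardless of which physical index sits behind it.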

  • When we build a new index article_ver1.1 to optimize article_ver1.0, the only thing we need to do is re-point the alias to the new index name; in the application code we keep using the alias name (article_production) as the index.
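A sketch of that switch with the Kibana dev tool; the remove and add actions in a single _aliases call are applied atomically, so there is no moment when the alias points to nothing:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "article_ver1.0", "alias": "article_production" } },
    { "add": { "index": "article_ver1.1", "alias": "article_production" } }
  ]
}
```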

So the application will communicate with ES through the alias name, and the alias keeps pointing to the old index while the index is being rebuilt. After the reindex is done, we re-point the alias to the new index and can use it immediately, without having to redeploy the application.
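For example, a search request from the application only ever references the alias (the title field here is just a hypothetical field of the document):

```
GET /article_production/_search
{
  "query": { "match": { "title": "zero downtime reindex" } }
}
```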

The reference diagram uses AWS services: changes in the DB are written to the old index, searches are still served normally from the old index, and the new index is rebuilt during this process.

Don’t use the _reindex API

Although ES provides a built-in reindex feature, we should not use it as the tool for this job. I had also assumed that the built-in mechanism would be the better way to reindex. But in essence, the _reindex API just takes a snapshot of the data at the moment we call the API and copies it into the new index (see the example call after this list), which has two disadvantages:

  • The new index may require data that the old index did not have.
  • Taking a snapshot at the time the API is called does not guarantee that the data in the new index stays up to date with the old index: during a zero-downtime reindex, data keeps being written to the old index, so after the reindex finishes and we switch to the new index, those updates would be lost.
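For reference, the built-in call looks roughly like this (using the index names from earlier); it simply copies whatever the source index contains at snapshot time:

```
POST /_reindex
{
  "source": { "index": "article_ver1.0" },
  "dest": { "index": "article_ver1.1" }
}
```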

Of course, even with the reference system in the diagram above, the new index would not receive the newly updated data on its own (updates still go to the old index), so we have to modify our system a little more to keep the data up to date.

Updating data in the index while reindexing

In the system diagram above, we can see two aliases: one for writing and one for reading (a sketch of this alias setup follows the list below).

  • The reindex is done into the new index.
  • When data changes in the DB, it is updated in both the old and the new index.
  • Reading data is still done from the old index.
  • After the reindex is done, we point the read alias to the new index as well, so reads and writes now go to the same place. At this point the data consists of documents indexed with the new structure, and the data carried over from the old index is also up to date.
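A minimal sketch of this setup with the Kibana dev tool, assuming the alias names article_read and article_write (these two names are my own illustration; the index names are the ones used earlier). Note that a single alias cannot fan one write out to two indices, so keeping both indexes updated means the application writes to the old index as before and additionally writes through the write alias:

```
# During the rebuild: reads stay on the old index, the write alias targets the new one
POST /_aliases
{
  "actions": [
    { "add": { "index": "article_ver1.0", "alias": "article_read" } },
    { "add": { "index": "article_ver1.1", "alias": "article_write" } }
  ]
}

# After the reindex finishes: point reads at the new index as well
POST /_aliases
{
  "actions": [
    { "remove": { "index": "article_ver1.0", "alias": "article_read" } },
    { "add": { "index": "article_ver1.1", "alias": "article_read" } }
  ]
}
```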

In the reference diagram in the article linked above, we can also see that using a separate write alias is a way to avoid having to change the code between index rebuilds.

Applying this to a Rails application using the searchkick gem

Recently, in some projects I was involved in, I also ran into performance-related issues, generally due to unoptimized indexes and data types, and those projects also wanted to reindex a not-so-small amount of data with no downtime.

For a Rails project using the searchkick gem, I tried a demo of a zero-downtime reindex process and it worked. My approach mainly relies on overriding some code and a few settings, without requiring a lot of Lambda code triggered by AWS events, so it does not add much in terms of resources. But if your project has a large amount of data and may have to change its indexes a lot, I would still recommend building a proper reindex pipeline like the one in the reference article above.

If I have the chance, I will cover the implementation details, with demo code, in a follow-up article.


Source: Viblo