The art of MongoDB indexing: 5 strategies you may not know

Tram Ho

Surely you backend folks are no strangers to the wildly popular database called MongoDB, and everyone has a belt full of flashy moves for this sacred DB, right? But what I'd like to introduce below is a half-volume of Mongo tactics, out of the 360 tactics once handed down in the martial world but long since lost, which I obtained only with great difficulty. Master the light moves and your peers will squint in envy; master the heavy ones and even the grandmasters will bow in admiration.

Let’s start!


First things first

Wait, don’t hurry. First, meet the great strategist you’ve never heard of (it’s me). My name is Minh Monmen, though rumor has attached many strange names to me, like Minh Fat, Minh Minh Man, Minh Chubby, … With a good-looking appearance and skills I won’t brag about here, many heroes of the world have read this line and wondered what the fuss is about. But never mind that: just this New Year, while tidying up my messy memory, I realized it needed rearranging, so let’s open the year with a tidy piece on MongoDB indexes. This is some of the knowledge I treasure most, drawn from 4 years of using MongoDB.

Follow along, friends.

Strategy 1: One index fells a pair of fields (aka one arrow, two eagles)

According to the ancient MongoDB scriptures, this divine DB normally uses only one index per query. (Don’t confuse this with not supporting multiple indexes per query at all; index intersection does exist, but only in a few strict cases.) Because of that, indexing in MongoDB leans heavily on the compound index (unlike SQL systems such as MySQL, where single-field indexes are common). This technique lets you build one index over multiple fields instead of one, so a single index can comfortably serve your queries.

One index fells a pair of fields is the strategy of exploiting a compound index to handle queries that touch multiple fields.

Execution conditions

  • Your queries frequently filter and sort on multiple fields
  • Queries use all of the indexed fields, or a prefix of them (in both filter and sort)

Method of implementation

First, determine which fields are frequently queried and should go into the index.

  • Recommendation: at most 4 fields. If you find yourself needing more than 4, consider another approach.
  • The fields included in the index usually play one of three roles: exact match, range match, or sort.

Next, determine the order of the fields in the index. In general, there are several ways to prioritize the ordering:

  • By query shape: exact match > range > sort
  • By data characteristics: many distinct values > few distinct values (high cardinality first)
  • By usage frequency: fields that appear in many queries > fields that are rarely queried (this ordering is optimal when you want to reuse the index prefix)

Finally, determine the direction of each field in the index. This matters for sorting. Specifically:

  • If you only sort on one field, the direction does not matter (the index can be scanned from either end)
  • If you sort on two or more fields, the sort directions must match what the index supports: either the exact index pattern or its exact reverse. For example, index {a: 1, b: -1} only supports sort({a: 1, b: -1}) or sort({a: -1, b: 1}), as the sketch below shows
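
A minimal sketch of the direction rule (the events collection and fields a, b are made up for illustration):

```js
// Compound index with mixed directions.
db.events.createIndex({ a: 1, b: -1 });

// Sorts supported by this index: the exact pattern or its exact reverse.
db.events.find().sort({ a: 1, b: -1 }); // uses the index
db.events.find().sort({ a: -1, b: 1 }); // also uses it (reverse scan)

// Not supported: direction combinations matching neither pattern.
db.events.find().sort({ a: 1, b: 1 }); // in-memory sort, no index help
```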

Note

  • Do not throw every field into the index; it makes the index heavy and hard to keep entirely in RAM.
  • Pay close attention to the order of the fields.
  • Balance changing the field order against creating another index, based on how frequently each index would be used.

The art here is weighing the benefits of each field ordering against the others. Here are some observations you can refer to (a sketch follows the list).

  • Fields like user_id, parent_id, … usually go first, following the business logic. They typically have moderate cardinality and are compared exactly.
  • Fields like type, status, … usually sit in the middle. They are searched exactly but have low cardinality, and queries on them usually include one of the fields above. The exception is admin or reporting features, where these fields are queried on their own.
  • Fields like score usually go in the middle or at the end. They are typically used in range searches or sorts.
  • Continuous datetime fields usually go last. They are mostly used for sorting. When a datetime is used to filter by interval, it usually gets an index of its own.
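
Putting the heuristics above together, here is a sketch of a typical ordering (collection and field names are made up): a business id first, a low-cardinality flag in the middle, a range/sort field last.

```js
// exact (user_id) > exact, low cardinality (status) > range/sort (score)
db.items.createIndex({ user_id: 1, status: 1, score: -1 });

// One index serves the whole query: exact prefix, then range + sort on score.
db.items.find({ user_id: 42, status: 1, score: { $gt: 5 } }).sort({ score: -1 });
```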

See more at https://docs.mongodb.com/manual/core/index-compound/

Strategy 2: The cicada escapes with hashed (aka the golden cicada sheds its shell)

Sometimes you face fields that hold long pieces of text, like slug, login token, session token, security token, … whose defining trait is being very long. These fields usually appear in exact searches. But indexing such text fields runs into problems:

  • Large index size, which may no longer fit in RAM, reducing performance.
  • MongoDB’s 1024-byte limit on an indexed key.

So how do we take down a fierce tiger like this?

Very simple: let the cicada shed its shell with hashed. We sidestep the enemy’s strongest weapon, the length, by creating what looks like an ordinary index on the field while actually hashing each value as it enters the index. A value longer than 1024 bytes still indexes normally, and the index’s storage efficiency improves markedly because its size drops significantly.

A hash function maps a value to a hash, usually of a fixed length.

Execution conditions

  • You search the text field by exact match
  • Only one field is used in the search, so the index suits fields whose values are unique or nearly so.

Method of implementation

Fortunately, MongoDB supports this out of the box; all we need to do is create a hashed index like the following (a minimal sketch, with collection and field names assumed):
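
```js
// Hashed index on a long token field (the sessions collection is assumed).
db.sessions.createIndex({ token: "hashed" });

// Exact-match lookup: MongoDB hashes the query value and hits the index,
// no matter how long the stored token is.
db.sessions.find({ token: "some-very-long-session-token" });
```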

Note

  • Hashed indexes cannot be part of a compound index
  • Hashed indexes do not support range search, prefix search, or sort
  • Hashed indexes do not support the unique constraint

The most common use case for this technique is in authentication systems, when looking up a record in the DB by a long token string.

See more at https://docs.mongodb.com/manual/core/index-hashed/

Strategy 3: Borrow the unique index’s prefix (aka borrow a corpse to revive the soul)

A unique index on a string is considered a pain among indexes: it slows down inserts and updates because uniqueness must be verified, and on top of that a unique index cannot be hashed (a hashed value does not guarantee uniqueness). So a unique index on a string gets processed like an index on a normal field, and it is very expensive.

This is an enemy we cannot avoid; when the problem demands uniqueness, there is no going around it. However, I have a trick that makes this heavy, otherwise idle unique index more useful: borrow the unique index’s prefix.

The essence of the technique is to reuse the information already inside the unique index for a second purpose: exploiting the ability to search by the unique index’s prefix, which increases its usefulness.

Execution conditions

  • The unique key is formed from a combination of several pieces of information.
  • The unique index can be a compound index or a plain index on a string field.

Method of implementation

Create the unique key by placing the fields with high search value and high cardinality at the front of the key. For example, suppose the unique key combines several pieces of information: item_id, other_id, …, type, user_id. Because the number of elements is not fixed, I use a single string key instead of separate fields in a compound index.

Unique key: item_123:other_2:type_1:user_1:

This arrangement mirrors the field order of a compound index: I put the data I need to search by at the front and push the less important data toward the back. Here, searching by type or user_id alone is not needed, because that data is also stored as normal fields outside the key (and could get its own index if it ever became necessary).
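
A minimal sketch of building and indexing such a key (the links collection and the document shape are assumptions for illustration):

```js
// High-search-value, high-cardinality parts go first in the key.
var key = ["item_123", "other_2", "type_1", "user_1", ""].join(":");

db.links.insertOne({ key: key /* , ...the rest of the document */ });

// One unique index both enforces uniqueness and powers prefix searches.
db.links.createIndex({ key: 1 }, { unique: true });
```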

Thus, searching for all records belonging to item_123 takes the following anchored-regex form (collection name assumed):
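
```js
// Anchored prefix regex: this can walk the unique index.
db.links.find({ key: /^item_123:/ });
```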

Or drilling down further to other_2:
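
```js
db.links.find({ key: /^item_123:other_2:/ });
```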

With a unique index built as a compound index, the same approach applies: exploit the fields that form the index prefix.

Note

  • Only applicable to anchored prefix regex searches. You cannot search for a value that is not a prefix.
  • The index size does not shrink; this is simply reclaiming value from capacity the unique index already costs you.

This strategy is a favorite of the thrifty for squeezing out resources, used when you need to bulk update or delete based on a piece of data that appears in the unique key but does not need to exist (or be indexed) as a separate field.

See more at https://docs.mongodb.com/manual/reference/operator/query/regex/#index-use

Strategy 4: Lure the value away from the index (aka lure the tiger off the mountain)

Remember when Ton Sach wanted to take Lu Giang but ran into the mighty army of Luu Huan defending it. Ton Sach used bait to lure Luu Huan off to Thuong Lieu, then seized the opening to capture Lu Giang. That is called luring the tiger off the mountain.

Today we face an equally powerful enemy: indexing low-cardinality fields such as status, state, type, … is very inefficient and consumes a lot of space on large collections. I use the lure-the-value strategy to slim the index down, so it takes less space and performs better.

This kind of index is called a partial index.

A partial index is an index built on a condition: only the records that satisfy the condition are indexed, so the index size shrinks. Records that do not meet the condition are left out entirely.

Execution conditions

This strategy works well in the following battlefield situations:

  • The collection has a large number of records (several million or more)
  • A traditional index would be ineffective:
    • Indexes on low-cardinality flag fields like status, state, type, … where the values are unevenly distributed. For example, status is 10% value 0, 20% value 1, and 70% value 2.
    • Indexes on evenly distributed fields where the business logic only ever uses part of the range. For example, an index on the score field.
  • Queries always use one kind of condition, usually a minority condition. For example, always querying status = 1, or score > 5

Method of implementation

Identify the minority and set it as the filter condition. If the index is a compound index:

  • If the minority condition is a single value, that field does not need to appear in the index keys at all.
  • If the minority condition is a range, include the field in the index with the normal field ordering, depending on your needs. (A sketch follows this list.)
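
A minimal sketch (the orders collection, the status = 1 minority, and the score field are assumptions): only the minority enters the index, and queries must repeat the filter condition so the planner knows the partial index applies.

```js
// Index only documents with status = 1; everything else stays out.
db.orders.createIndex(
  { score: -1 },
  { partialFilterExpression: { status: 1 } }
);

// The query must include status = 1 to be eligible for the partial index.
db.orders.find({ status: 1, score: { $gt: 5 } }).sort({ score: -1 });
```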

Note

  • This indexing scheme is very effective when you need to search and sort within a fixed minority. Remember: minority and fixed are the two factors that decide the success of this tactic.
  • It is ineffective for majority cases, because then the index size barely shrinks.

I currently run this lure-the-value index on collections of tens to hundreds of millions of records, with index sizes of only a few megabytes once the minority condition has filtered everything else out. Keeping the index small matters: it helps MongoDB fit our indexes in RAM, which raises index processing efficiency.

See more at https://docs.mongodb.com/manual/core/index-partial/

Strategy 5: No index beats having an index (aka the move-less winning move)

The last strategy, and the secret kept by sect leaders everywhere, is winning without an index. Though concise, this scheme has countless transformations and is very elusive. Your humble servant grasps only a few of its basic forms, and even those I practice shakily.

In short: the most efficient index is NO INDEX.

In plain language: the most effective way to index is to not index.

What does that mean? Indexes always come with tradeoffs, namely:

Write performance

Do some basic math:

1 IO = one read or write operation against the disk

insert 1 record = 1 IO (for the data) + n IO (n = number of indexes)
delete 1 record = 1 IO (for the data) + n IO (n = number of indexes)
update 1 record = 1 IO (remove old data) + 1 IO (write new data) + n IO (n = number of indexes)

The more indexes we have, the more IO operations every action costs. This reduces write speed.

Therefore:

  • Be frugal with the number of indexes.
  • No indiscriminate indexing.
  • Create indexes that can be reused across more cases.

Storage

Setting aside disk storage for the index (disk is very cheap), an index needs to live in RAM for searches to be fast, so the more indexes you have, and the heavier they are, the more RAM they occupy. Indexes overflowing RAM is the last thing you want to happen to your Mongo.

How do we save RAM?

  • Again: create effective indexes and reuse them.
  • Use strategies 2, 3, and 4 above to shrink any index that cannot simply be removed.

Personal experience

After a road full of thorns and suffering, I have learned the following:

  • On write-heavy systems, for example tracking and logging systems, DO NOT CREATE INDEXES. Exploit the one default index: the _id index. Do not think _id is useless: a MongoDB ObjectId is built from a timestamp plus random components, which lets us query and sort by time very effectively. For batch operations in data-analytics systems, I always process data by spanning _id ranges (see the sketch after this list). Extremely effective, with no extra index at all.
  • Another variation of no index is sharding by hand: instead of creating an index on some field, split the data into different collections. This is quite tricky and depends on the system. For example, move deleted records to another collection instead of keeping an isDeleted field, or split type x and type y into two collections x and y and drop the type field entirely, …
  • Another variation of no index depends on how many records a query returns. If the collection holds several hundred thousand to several million records and a query pulls most of them out for computation (for example an aggregate), then indexing (and hoping the planner uses that index) can be slower than a collection scan. The reason: a collection scan reads records in storage-block order, while an index scan reads them by random access. When the number of matching records is large, reading blocks sequentially in a collection scan beats reading through the index, so removing the index is the right move.
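
A minimal sketch of the _id-spanning trick (the tracking_logs collection is an assumption): since an ObjectId begins with a 4-byte unix timestamp, we can build boundary ids from dates and range-scan the default index.

```js
// Build an ObjectId whose timestamp part is the given date (rest zeroed).
function objectIdFromDate(date) {
  var seconds = Math.floor(date.getTime() / 1000);
  return ObjectId(seconds.toString(16) + "0000000000000000");
}

// Time-window query and sort using nothing but the default _id index.
db.tracking_logs.find({
  _id: {
    $gte: objectIdFromDate(new Date("2021-01-01")),
    $lt: objectIdFromDate(new Date("2021-01-02"))
  }
}).sort({ _id: 1 });
```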

Summary

Thanks to these divine strategies, I successfully ran the database of GAPO, the heaviest-traffic social network I have operated, on just 2 CPUs + 12GB RAM + a 3000 IOPS disk. The data currently sits in the hundreds of millions of records, weighing hundreds of GB, with tens of millions of records added, updated, or deleted daily, sustaining ~2000 IOPS.

At this point I have run out of strength to conclude this long-lost manuscript. If any wandering hero out there possesses some great technique or secret art, please share it and open this one’s eyes. Thank you.

Wait a minute, the main ideas in the post:

  • Use compound indexes of 2-4 fields
  • Use hashed indexes for long string fields
  • Exploit the unique index for prefix searches
  • Reduce index size with partial indexes
  • And please, stop indexing everything

Source: Viblo