Notes on handling large amounts of data in Rails

Tram Ho

Rails is a great framework that helps individuals and businesses build their products in a very short amount of time. Rails is used as the backend for many web projects and as an API server for startups' mobile applications. I myself like Rails' syntax and the speed at which it lets you develop a project.

However, the question is: how fast is Rails when dealing with large amounts of data? You've probably read the story "How We Went from 30 Servers to 2: Go"; if not, try reading it here. Basically, the story is about a company that built its backend system with Rails to run the products it makes for customers, and needed 30 servers to keep it going. As the number of customers increased, the amount of data overloaded the system, and the company was forced to switch from Rails to Go. They succeeded, and notably they needed only 2 servers to keep the system running.

Does Rails have a bottleneck when we try to process huge amounts of data? Probably not, if we use the following tips.

1. Do not use ActiveRecord if possible

ActiveRecord makes everything very easy, but it was not built for raw data processing. When you want to apply a series of simple operations to millions of records, you should use plain SQL. If you feel you need an ORM tool to make the work easier, try Sequel.

2. Use update_all to update all records

The following is a common mistake made by people who want to walk through an entire table and update each record individually:

The code is easy to understand, but it has a fatal drawback: if there are 100,000 users with the city "Houston", it can run for 24 hours. Quite a while, right? There is a much faster and more effective solution:

And this code runs in about 30 seconds on the same amount of data.

3. Get only data from the columns you need

The code User.where(city: "Houston") loads every column of the matching users from the database. If you do not need extra information such as age, gender, marital status, and so on, you should not fetch it in the first place. Use select (or pluck) when you only need data from a few columns:

4. Replace Model.all.each with find_in_batches

For small systems, a change like this does not matter much. But on a system with 100,000 records, the command above can easily occupy 5 GB or more of memory, and the server can easily crash. So find_in_batches should be used to solve this problem:

5. Don’t use transactions too much

A transaction runs every time an object is saved, so it can run millions of times over the course of a big job. Even if we use find_in_batches, the only way to effectively limit transactions is to group the saves. The code in Part 4 can still be optimized as follows:

This way, instead of committing every single record, we commit only once per 500 records, which is much more efficient.

6. Don't forget to add indexes

Always index the most important columns, or groups of columns, that you query the most. Otherwise, your queries will take a lifetime to run.

7. Destroy consumes a lot of resources

Destroy in ActiveRecord is a very heavy operation; make sure you know what you are doing. One thing you must know: although destroy and delete both remove records, destroy runs all callbacks, which is very time consuming. The same goes for destroy_all versus delete_all. So, if you just want to delete records without touching anything else, use delete_all. Another case is when you want to empty an entire table. For example, if you want to delete all users, you can use TRUNCATE:

Even so, deleting at the database level is still very time consuming. This is why we sometimes use a "soft delete" instead: just set a flag like deleted = 1 on the record you want to delete.

8. Not everything has to run immediately

Use background jobs. Resque and Sidekiq are always there for you: use them to run tasks asynchronously and on a schedule, and everything will be easier.
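The idea can be sketched with nothing but the standard library: instead of doing slow work inside the request, push a job description onto a queue and let a worker pick it up later. This toy in-process version only illustrates the concept; in production, Resque or Sidekiq provides the (Redis-backed) queue and the worker processes, and you would define a worker class with a perform method and enqueue it via perform_async:

```ruby
# A toy in-process job queue, for illustration only.
jobs    = Queue.new
results = Queue.new

# The "worker": runs jobs later, outside the web request.
worker = Thread.new do
  while (job = jobs.pop) != :shutdown
    results << "welcome email queued for user ##{job[:user_id]}"
  end
end

# The "controller": enqueue and return immediately instead of doing
# the slow work inline.
[1, 2, 3].each { |id| jobs << { user_id: id } }
jobs << :shutdown
worker.join

sent = []
sent << results.pop until results.empty?
```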

In short, if you have a large amount of data, do your best to optimize system performance. Convenient as it is, we have to admit that ActiveRecord slows the system down a bit. With the tips above, however, you can keep the other strengths of Rails without giving up too much performance. Enjoy Rails as much as possible!

Reference: https://chaione.com/blog/dealing-massive-data-rails/


Source: Viblo