Notes on handling large amounts of data in Rails

Tuesday, 11/02/2020

Tram Ho

Rails is a great framework that helps individuals and businesses build their products in a very short amount of time. It is used as the backend for many web projects and as an API server for startups' mobile applications. I also like Rails' syntax and how quickly projects can be developed with it.

However, the question is: how fast is Rails when dealing with large amounts of data? You have probably read the story “How We Went from 30 Servers to 2: Go”; if not, it is worth reading. In short, a company built a backend system in Rails to run the products it made for its customers, and needed 30 servers to maintain it. As the number of customers grew, the volume of data overloaded the system and the company was forced to switch from Rails to Go. They succeeded and, more notably, needed only 2 servers to keep the system running.

Does Rails become a bottleneck when we try to process huge amounts of data? Probably not, if we follow the tips below.

1. Do not use ActiveRecord if possible

ActiveRecord makes everything very easy, but it was not designed for processing raw data. When you want to apply a series of simple operations to millions of records, you should use plain SQL. If you feel you need an ORM to make the work easier, try Sequel.
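
As a rough illustration of the plain-SQL approach, the sketch below builds one multi-row INSERT statement with placeholders instead of issuing one INSERT per record. The helper name `bulk_insert_sql`, the `users` table, and its columns are hypothetical, and the sketch assumes a database driver that accepts `?` bind parameters. The resulting statement and values would be handed to whatever raw-SQL interface the project uses.

```ruby
# Hypothetical helper: builds a single multi-row INSERT with "?" placeholders.
# Table and column names are interpolated as-is, so they must come from
# trusted code, never from user input.
def bulk_insert_sql(table, columns, rows)
  raise ArgumentError, "rows must not be empty" if rows.empty?

  row_placeholder = "(" + (["?"] * columns.size).join(", ") + ")"
  placeholders = ([row_placeholder] * rows.size).join(", ")
  sql = "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{placeholders}"

  # Flatten the values in column order so they line up with the placeholders.
  binds = rows.flat_map { |row| columns.map { |col| row.fetch(col) } }
  [sql, binds]
end

rows = [
  { name: "Ann", city: "Houston" },
  { name: "Bob", city: "Austin" }
]
sql, binds = bulk_insert_sql("users", [:name, :city], rows)
puts sql
p binds
```

One statement carrying many rows means one round trip to the database rather than one per record, which is where most of the saving comes from.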

2. Use update_all to update all records

A common mistake is to loop over an entire table and update each record individually:

  User.where(city: "Houston").each do |user|
    user.note = "Houstonian"
    user.save
  end

The code is easy to understand, but it has a fatal drawback: every record is loaded into memory and saved with its own UPDATE. If there are 100,000 users whose city is “Houston”, the code can run for 24 hours. That is a long time. There is a much faster and more effective solution:

  User.where(city: "Houston").update_all(note: "Houstonian")

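This code finishes within 30 seconds on the same amount of data, because it sends a single UPDATE statement to the database instead of one per record. Keep in mind that update_all skips validations and callbacks.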

3. Get only data from the columns you need

The code User.where(city: "Houston") loads every column of each matching user. If you do not need fields such as age, gender, or marital status, do not fetch them in the first place. Use select to retrieve only the columns you need:

  User.select(:city, :state).where(age: 29)

4. Replace the Model.all.each command with find_in_batches

For small systems, a change like this hardly matters. But on a table with 100,000 records, Model.all.each loads every record at once and can easily take up 5 GB of memory or more, which may well crash the server. find_in_batches avoids the problem by loading the records one batch at a time:

  # Assumes a hypothetical has_many :courses association on User.
  User.where(grade: 2).find_in_batches(batch_size: 500) do |students|
    students.each do |student|
      student.courses.find_or_create_by(class_name: "PE")
    end
  end
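
Conceptually, find_in_batches avoids loading the whole table by walking it in primary-key order and fetching one slice at a time. The plain-Ruby sketch below imitates that idea on an in-memory array. `FAKE_TABLE` and `each_batch` are stand-ins invented for illustration, not Rails APIs.

```ruby
# In-memory stand-in for a table of 1,200 rows with ascending ids.
FAKE_TABLE = (1..1200).map { |id| { id: id, grade: 2 } }

# Hypothetical imitation of batch loading: each pass fetches at most
# batch_size rows whose id is greater than the last id already seen,
# roughly "WHERE id > last_id ORDER BY id LIMIT batch_size".
def each_batch(table, batch_size:)
  last_id = 0
  loop do
    batch = table.select { |row| row[:id] > last_id }
                 .sort_by { |row| row[:id] }
                 .first(batch_size)
    break if batch.empty?

    yield batch
    last_id = batch.last[:id]
  end
end

batch_sizes = []
each_batch(FAKE_TABLE, batch_size: 500) { |batch| batch_sizes << batch.size }
p batch_sizes
```

Only one batch is held at a time, so memory use depends on the batch size rather than on the size of the table.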

5. Don’t use transactions too much

  (0.2ms) BEGIN
  (0.4ms) COMMIT

Every time an object is saved, ActiveRecord wraps the save in its own transaction, as in the log lines above. Over a large job that happens millions of times. Using find_in_batches alone does not change this; the effective way to cut down the number of transactions is to group the saves. The code from tip 4 can be optimized further as follows:

  # Assumes a hypothetical has_many :courses association on User.
  User.where(grade: 2).find_in_batches(batch_size: 500) do |students|
    User.transaction do
      students.each do |student|
        student.courses.find_or_create_by(class_name: "PE")
      end
    end
  end

This way, instead of committing after every single record, the code commits once per batch of 500 records, which is much more efficient.
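
The saving is easy to estimate. The small calculation below uses the 500-record batch size from the example and an assumed table of 100,000 matching records; both figures are only illustrative.

```ruby
# Assumed figures: 100,000 matching records, batches of 500.
record_count = 100_000
batch_size = 500

# One transaction per save: every record pays its own BEGIN/COMMIT.
commits_per_record = record_count

# One transaction per batch: round up so a partial last batch still counts.
commits_per_batch = (record_count.to_f / batch_size).ceil

puts "per record: #{commits_per_record} commits"
puts "per batch:  #{commits_per_batch} commits"
```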

6. Don’t forget to add indexes

Always index the columns, or groups of columns, that you query most often. Otherwise, your queries will take forever to run.
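
In a Rails project the index itself is normally created in a migration, for example with `add_index :users, :city`. The effect can be pictured with a plain-Ruby analogy: without an index every lookup scans all rows, while an index works like a precomputed lookup table. The sketch below is only an analogy with made-up data; real database indexes are typically B-trees, not hashes.

```ruby
# Made-up rows standing in for a users table.
users = (1..10_000).map do |id|
  { id: id, city: (id % 100 == 0) ? "Houston" : "Elsewhere" }
end

# Without an index: every query inspects all 10,000 rows.
scanned = 0
houston_by_scan = users.select do |user|
  scanned += 1
  user[:city] == "Houston"
end

# With an "index": group rows by city once, then look them up directly.
# Like a real index, it costs extra space and must be kept in sync on writes.
city_index = users.group_by { |user| user[:city] }
houston_by_index = city_index.fetch("Houston", [])

puts "rows scanned without index: #{scanned}"
puts "rows found via index: #{houston_by_index.size}"
```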

7. Destroy uses a lot of resources

Destroy in ActiveRecord is a very heavy operation, so make sure you know what you are doing. Although destroy and delete both remove records, destroy runs all callbacks, which is very time consuming. The same goes for destroy_all versus delete_all. So if you just want to delete records without touching anything else, use delete_all. Another case is deleting an entire table. For example, to delete all users you can use TRUNCATE:

  ActiveRecord::Base.connection.execute("TRUNCATE TABLE users")

Even so, deleting at the database level can still be very time consuming. This is why it is sometimes better to use a “soft delete”: instead of removing the row, just set a flag such as deleted = 1 on the record you want to delete.

8. Not everything has to run immediately

Use background jobs. Resque and Sidekiq are there for you: use them to run tasks in the background or to schedule them for later, and everything becomes easier.
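
The underlying pattern is to enqueue the work and return right away while a separate worker processes it. The sketch below uses Ruby's built-in Queue and Thread as a stand-in to show that pattern. It is not the Resque or Sidekiq API, and the job payloads are made up. Those libraries add what this toy version lacks, such as keeping jobs in Redis so they survive a restart.

```ruby
jobs = Queue.new
results = Queue.new

# Worker thread: pulls jobs until it sees the :stop sentinel.
worker = Thread.new do
  while (job = jobs.pop) != :stop
    # Stand-in for slow work such as sending an email or a bulk update.
    results << "processed #{job[:task]} for user #{job[:user_id]}"
  end
end

# The "request" side only enqueues and moves on immediately.
jobs << { task: "welcome_email", user_id: 1 }
jobs << { task: "export_report", user_id: 2 }
jobs << :stop

# Wait for the worker here only so the script can print the results.
worker.join
processed = []
processed << results.pop until results.empty?
puts processed
```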

In short, if you have a large amount of data, do your best to optimize system performance. Convenient as it is, we have to admit that ActiveRecord slows the system down a bit. With the tips above, however, you can keep the other strengths of Rails without giving up too much performance. Enjoy Rails as much as possible!

Reference: https://chaione.com/blog/dealing-massive-data-rails/

Source: Viblo
