Import large amounts of data in RoR

Tram Ho

During software development, we have all added or updated data many times, but usually with a relatively small amount of changes, so the process does not take much time.
That assumption breaks down quickly. By default, Rails inserts data row by row: every time you create or change a record, it issues one corresponding query. Imagine importing a few million records; how many queries, and how much time, would that take?
Solving this in Ruby is not difficult. While researching the problem, I came across the gem activerecord-import . But wait, before that, which ways of importing data do you already know?

1. Create a record by creating one by one (row by row)

This approach works fine for a small amount of data, but as mentioned above, the problem here is importing a few million records into your database. In the usual way, you read the rows of a CSV file one by one and insert them into your database via the seed.rb file.

It works, but the import performance is far too slow; you cannot sit around all day just to import a few million rows.
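As a sketch of this row-by-row approach (the Product model, CSV contents, and column names here are hypothetical; a tiny in-memory stand-in replaces the real ActiveRecord model so the snippet runs outside a Rails app):

```ruby
require "csv"

# Hypothetical stand-in for `class Product < ApplicationRecord`.
# In a real Rails app, every create! below issues its own INSERT query.
class Product
  def self.all
    @all ||= []
  end

  def self.create!(attrs)
    all << attrs
  end
end

# seed.rb style import: one create! (one query) per CSV row.
csv_text = <<~CSV
  name,price
  Pen,1.5
  Notebook,3.0
CSV

CSV.parse(csv_text, headers: true).each do |row|
  Product.create!(name: row["name"], price: row["price"].to_f)
end

puts Product.all.size  # => 2
```

With two rows this is instant; with millions of rows, the one-query-per-row pattern is what kills the run time.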

2. Using SQL INSERT statements (code that is hard to read and unsafe)
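A rough sketch of this approach (the table and column names are made up). Building the statement by string interpolation is exactly why it is hard to read and unsafe; without proper escaping it is open to SQL injection:

```ruby
# Hand-building one multi-row INSERT. Table and column names are hypothetical.
rows = [["Pen", 1.5], ["Notebook", 3.0]]

values = rows.map { |name, price| "('#{name}', #{price})" }.join(", ")
sql = "INSERT INTO products (name, price) VALUES #{values};"

puts sql
# In Rails this would be run with ActiveRecord::Base.connection.execute(sql),
# and real code would have to escape every value to avoid SQL injection.
```

It is fast, because everything goes out as a single statement, but the string-building code quickly becomes unreadable and fragile.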

3. Using activerecord-import gem (faster)

ActiveRecord-import is a Ruby gem written by Zach Dennis. It is much faster than the usual row-by-row insert and is also very easy to use.
To use it, add gem "activerecord-import" to your Gemfile and remember to run bundle install . The gem works by collapsing your many queries into a single one: instead of millions of queries, there is now only one.

In addition, you can also use it to insert data into new columns, add entirely new data to a table, update a whole data set, and more. Here is example code for a large data import from a CSV file:
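Since the original example code did not survive, here is a hedged reconstruction of what a CSV import with activerecord-import could look like (the Product model and CSV contents are hypothetical; the import call itself is left in a comment so the snippet runs without a database):

```ruby
require "csv"

# seed.rb style import using the activerecord-import gem.
# Hypothetical CSV data with a name and price column.
csv_text = <<~CSV
  name,price
  Pen,1.5
  Notebook,3.0
CSV

columns = [:name, :price]
values  = CSV.parse(csv_text, headers: true).map do |row|
  [row["name"], row["price"].to_f]
end

# One single multi-row INSERT instead of one query per row:
# Product.import columns, values, validate: false

p values  # => [["Pen", 1.5], ["Notebook", 3.0]]
```

Passing `validate: false` skips per-row validations, which speeds things up further at the cost of trusting the input data.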

By using activerecord-import, importing about 500,000 records went from over an hour to less than 3 minutes. That is fast, isn't it?

Understand more about database operations
At this point, I was quite happy to be able to import large data, but do you understand why many small queries are much slower than one large query?
My simple understanding is that the round trip for each query is quite time consuming, while the execution itself is relatively fast.
Indeed, when ActiveRecord inserts row by row, it hits the database once per INSERT statement. Running the SQL for a single insert does not take much time, but each hit to the database has to open a transaction and then commit it, and that adds up. The result is a flood of database round trips that is terrible for performance.
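A toy calculation makes this concrete. The numbers below are invented purely for illustration: assume each round trip to the database carries a fixed overhead (opening and committing a transaction) on top of the actual insert work:

```ruby
# Illustrative, made-up costs in microseconds.
ROUND_TRIP_US = 2_000 # hypothetical per-query overhead (transaction open/commit)
INSERT_US     = 10    # hypothetical time to actually insert one row

def row_by_row(n)
  n * (ROUND_TRIP_US + INSERT_US) # pays the overhead n times
end

def single_query(n)
  ROUND_TRIP_US + n * INSERT_US   # pays the overhead once
end

n = 500_000
puts row_by_row(n)   # => 1005000000 (about 17 minutes)
puts single_query(n) # => 5002000 (about 5 seconds)
```

However you tune the made-up constants, the row-by-row version pays the fixed round-trip cost n times while the single query pays it once, which is the whole story.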

4. Bulk insert with Rails 6

If you are using Rails 6, then congratulations: support for inserting large amounts of data is already built in.
Starting with Rails 6, insert_all , insert_all! and upsert_all were added to ActiveRecord::Persistence .
Using insert_all makes it possible to perform bulk inserts like the example below.
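For instance (the User model and attributes are hypothetical; the call itself is shown in a comment since it needs a Rails app and a database to run):

```ruby
# Rails 6 bulk insert: the whole array becomes a single INSERT statement.
rows = [
  { id: 1, name: "Sam" },
  { id: 2, name: "Jill" },
  { id: 3, name: "Tom" }
]

# User.insert_all(rows)
# => one query along the lines of:
#    INSERT INTO "users" ("id","name") VALUES (...), (...), (...)
#    ON CONFLICT DO NOTHING RETURNING "id"

puts rows.size  # => 3
```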

As mentioned above, notice ON CONFLICT DO NOTHING RETURNING "id" in the generated query. This is supported by the SQLite and PostgreSQL databases. If there is a conflict or unique-key violation during the bulk insert, the conflicting record is skipped and the next record is inserted.
If we need to make sure all rows are inserted, we can use insert_all!
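A hedged sketch of the difference (hypothetical data; the call is commented out because it only raises against a real database with a unique constraint):

```ruby
rows = [
  { id: 1, name: "Sam" },
  { id: 1, name: "Sam new" } # duplicate primary key
]

# User.insert_all!(rows)
# => raises ActiveRecord::RecordNotUnique instead of silently skipping
#    the conflicting row, so a failed import does not go unnoticed.

puts rows.map { |r| r[:id] }.inspect  # => [1, 1]
```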

If a record exists we want to update it, and otherwise create a new record; this is called an upsert, and Rails 6 provides upsert_all for it.
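The original example code was lost, so here is a hedged reconstruction consistent with the explanation that follows. Assume the users table already contains id = 1 ("Sam") and id = 2; the model name and data are hypothetical, and the call is commented out because it needs a real database:

```ruby
rows = [
  { id: 1, name: "Sam" },
  { id: 1, name: "Sam new" },  # duplicates id = 1 within the array
  { id: 2, name: "Jill new" }, # id = 2 already exists: updated in place
  { id: 3, name: "Tom" }       # id = 3 is new: inserted
]

# User.upsert_all(rows, unique_by: :id)
# After the call, user 1 is named "Sam new", user 2 is "Jill new",
# and user 3 is newly created.

puts rows.length  # => 4
```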

The second row in the input array duplicates id = 1, so that user's name ends up as "Sam new" instead of "Sam".
The third row is not duplicated within the array, and since its id already exists in the database it simply performs an update.
The fourth row, with id = 3, is not in the database, so a new record is created.



Source : Viblo