Convert data between two different databases via a real-world example with Ruby

Tram Ho

4 years ago

Huge data is not something most of us worry about every day. But sometimes, you may be provided with tons of data that you have to import into your project. The following specific example poses practical problems that you may encounter at any time.

Backstory

Imagine running the Instagram-esque application with Ruby on Rails. Users can upload photos and videos there and other users can leave a comment. Our database model might look like this:

While we were designing the database, there was another project that was successful and people were using their applications. They have 100 000 users, each with 10 photos, 10 videos and 10 posts. Each post has been commented 10 times. But their database looks like this:

We have somehow purchased the application of another party and now have to integrate their data into the data we have above. The problem starts, they are running MySQL while we are using PostgreSQL. Because those two database schemas seem to be quite different and our database already has some data inside, pgloader will not be feasible in this case. We could try writing a simple task to enter data. Although there are hundreds of ways to do this (most of them may not work really well), we will just look at ways that are thought to be the best performers.

Basic approach

To tranfer data, we may want to create a rake task, but to do this, we must first access “foreign data”. Can we create classes for models simply? First, we have to configure the second database in config/database.yml by doing like this:

competitor:
  adapter: mysql2
  database: competitor_prod
  username: secret_user
  password: secret_password

competitor:

adapter: mysql2

database: competitor_prod

username: secret_user

password: secret_password

Next we create some classes using this connection:

class OtherUser &lt; ApplicationRecord
  establish_connection :competitor
  self.table_name = :users
  
  has_many :video_posts, class_name: 'OtherVideoPost', foreign_key: 'user_id'
  has_many :image_posts, class_name: 'OtherImagePost', foreign_key: 'user_id'
  has_many :text_posts, class_name: 'OtherTextPost', foreign_key: 'user_id'
  has_many :profile_comments, class_name: 'OtherProfileComment', foreign_key: 'user_id'
end

class OtherUser < ApplicationRecord

establish_connection :competitor

self.table_name = :users

has_many :video_posts, class_name: 'OtherVideoPost', foreign_key: 'user_id'

has_many :image_posts, class_name: 'OtherImagePost', foreign_key: 'user_id'

has_many :text_posts, class_name: 'OtherTextPost', foreign_key: 'user_id'

has_many :profile_comments, class_name: 'OtherProfileComment', foreign_key: 'user_id'

end

Now create a task rake:

task :import do
  OtherUser.find_each do |other_user|
    user = User.create!(email: other_user.email,
                        name: other_user.name)
    other_user.image_posts.each do |other|
      image = Image.create!(file_path: other.file_path,
                            file_size: other.file_size,
                            height: other.height,
                            width: other.width)
      Post.create!(postable: image,
                   user: user,
                   text: other.text)
      ...
    end
    ...
  end
end

task :import do

OtherUser.find_each do |other_user|

user = User.create!(email: other_user.email,

name: other_user.name)

other_user.image_posts.each do |other|

image = Image.create!(file_path: other.file_path,

file_size: other.file_size,

height: other.height,

width: other.width)

Post.create!(postable: image,

user: user,

text: other.text)

...

end

...

end

The problem begins to arise here, the Comment is associated with both User and Post , how to add this relationship in rake task. Comment of us may have User and lack Post or vice versa. This can be easily solved by keeping the original IDs in the old database and adding them, so they do not conflict with existing data. May look like this:

offset = User.maximum(:id)
...
uid = comment.user_id + offset
Comment.new(commentable: postable,
            text: comment.text,
            user_id: uid).save(validate: false)

offset = User.maximum(:id)

...

uid = comment.user_id + offset

Comment.new(commentable: postable,

text: comment.text,

user_id: uid).save(validate: false)

Warning!

The biggest problem here is that when you use this method, the application must pause if not during running, any new records created will cause serious consequences.

Now, when creating a Comment , we don’t really have to pass the User and Post , there – we can only use the ID, which exists or last (when creating Users , we have to put their ID along one way). But this will still fail because Rails by default keeps our data safe and adds database constraints. We can disable it at the top of the script with:

ALTER TABLE comments DISABLE TRIGGER ALL;

1 2	ALTER TABLE comments DISABLE TRIGGER ALL;

Warning!

Don’t forget to turn it back on when the process is over.

Now we can run the rake, but it can take up to 60 hours to complete, stopping the application for too long can actually cause damage so we need to consider!

Parallel batch inserts with eager loading and Sequel

A lot of things need to be said through the headline above, it’s best we go into solving each problem, looking from the end to the beginning:

Sequel

Sequel is one of the ORM available for Ruby (like ActiveRecord but has 3 times less memory). With ActiveRecord, we will have to allocate nearly 100 GB during our import, while with Sequel, we can keep the level below 35 GB. Fewer objects to build = faster processing.

Eager Loading

If not careful, we may encounter N + 1 query problems. However, it can be easily avoided with Eager Loading. In ActiveRecord we do with .includes and in Sequel we have .eager. But both have the same result, which is to reduce the number of select which we perform.

Batch Insert

Since we have reduced the number of select , we should also try to reduce the number of insert . The easiest way is to avoid saving individual objects, but rather saving them in larger batches. There is a Gem for this, called activerecord-import . In activerecord-import , it creates SQL queries like this:

INSERT INTO `foo` (`bar`, `baz`) VALUES (1, 2), (3, 4), …

1 2	INSERT INTO `foo` (`bar`, `baz`) VALUES (1, 2), (3, 4), …

Parallel

After performing the above optimizations, we can look at the logfile and see that we spend time processing data between reading and writing them to the database. We might be able to run our task in several parallel processes to use the entire database (when one process is writing others have time to prepare the data). You may never have done parallel processing in Ruby, but it’s actually quite easy:

4.times do |i|
  fork do
    ...
  end
end
Process.waitall

4.times do |i|

fork do

...

end

Process.waitall

Now we can run the import process and it completes after only 3.6 hours. With just a few easy changes, we have achieved a speed increase of 15 times. Is there any faster way?

Skipping Ruby

The previous method was really useful because it allowed us to perform any kind of data processing – we could have done things like create a password for the User or extract something from the previous JSON. when saving to database. But if we take a closer look at the problem we are solving, we can see that the only processing we need is the addition of an ID. If so, we can work directly with SQL.

So what happens if we can bypass ActiveRecord, Sequel and all that reduces the above performance and try to move our data with SQL? To load data into Postgres, we can use COPY , which is even faster than batch insert and allows reading data from text files. On the other hand, we have MySQL that can save the selected result to a CSV file. Will we try to make them work together?

file = "/tmp/#{SecureRandom.uuid}"

select = &lt;&lt;~MySQL
  SELECT id + #{@user_offset}, email, name, created_at, updated_at
  FROM users
  INTO OUTFILE '#{file}_users'
  FIELDS TERMINATED BY ','
  OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY 'n';
MySQL

copy = &lt;&lt;~PostgreSQL
  COPY users (id, email, name, created_at, updated_at)
  FROM '#{file}_users'
  WITH CSV;
  SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));
PostgreSQL

OLD.run(select)
ActiveRecord::Base.connection.execute(copy)

file = "/tmp/#{SecureRandom.uuid}"

select = <<~MySQL

SELECT id + #{@user_offset}, email, name, created_at, updated_at

FROM users

INTO OUTFILE '#{file}_users'

FIELDS TERMINATED BY ','

OPTIONALLY ENCLOSED BY '"'

LINES TERMINATED BY 'n';

MySQL

copy = <<~PostgreSQL

COPY users (id, email, name, created_at, updated_at)

FROM '#{file}_users'

WITH CSV;

SELECT setval('users_id_seq', (SELECT MAX(id) FROM users));

PostgreSQL

OLD.run(select)

ActiveRecord::Base.connection.execute(copy)

Could not be easier. And the results exceed expectations. All data entry (takes 60 hours for the first method) is completed in just 8 minutes. More than 430 times!

The article was translated from author Maciek Głowacki

The Fastest Way of Importing Data with Ruby?

Share the news now