Journey from NoSQL to SQL (Part II) – Data Transfer

Part 2: What the Data Transfer Journey from NoSQL to SQL Looks Like!

Content posted here with the permission of the author Meenakshi Kumari, who is currently employed at Josh Software. Original post available here.

In my last blog, I shared my experience of the gem changes and the preparation of the schema for the PostgreSQL database. In this blog I’ll cover the process of transferring data from MongoDB to PostgreSQL without introducing any inconsistency, along with the challenges we faced and their solutions.

Data transfer from MongoDB to the PostgreSQL database.

With our project updated to PostgreSQL and the database schema prepared, the next hurdle in front of us was how to import the MongoDB data into PostgreSQL without affecting data quality or disturbing the old data.

After a lot of research, we found the Sequel gem. It is a simple, flexible, and powerful SQL database access toolkit for Ruby: it can connect to many different SQL databases (in our case PostgreSQL) and read from and write to them. Sequel includes a comprehensive Object Relational Mapping (ORM) layer for mapping records to Ruby objects and handling associated records. We used it to connect to the destination PostgreSQL database and write the data that was read from MongoDB.
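To make that concrete, here is a minimal sketch of the kind of access Sequel gives you (database name, credentials, and table are illustrative):

require 'sequel'

## Connect to the destination PostgreSQL database
DB = Sequel.connect('postgres://username:password@localhost:5432/database_name')

## Write a row and read it back as a plain Ruby hash
DB[:companies].insert(name: 'Acme')
company = DB[:companies].where(name: 'Acme').first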

So let’s start with the data transfer. The steps we followed were:

  1. Switch to the MongoDB branch and install the sequel and pg gems there (see the sketch below). In my previous blog I mentioned that we had two separate branches for the MongoDB and PostgreSQL code in the same GitHub project repository; all of this data transfer was done on the MongoDB branch.
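For reference, here is what that step implies in the Gemfile (a sketch; versions omitted):

## Gemfile (MongoDB branch)
gem 'sequel' ## SQL toolkit used to write into PostgreSQL
gem 'pg'     ## PostgreSQL driver that Sequel uses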
  2. Now we have to write rake tasks for the data mapping. I’ll take the previous blog’s table example, i.e. Address. You have already seen the table schema there; now I’ll show how to write a rake task that imports data for this particular table using Sequel. The rake task below covers the following:
  • The database connection is established using:
DB = Sequel.postgres('database_name', user: 'username', password: 'password', host: 'localhost', port: port_number)
  • The pg_array and pg_hstore extensions are added so that Sequel supports the PostgreSQL array and hstore types.
DB.extension :pg_array
DB.extension :pg_hstore
  • Loop over the table in batches (for fast execution) and build a hash containing a one-to-one mapping of the MongoDB record fields to their respective PostgreSQL table fields; a sketch of the batching loop follows.
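A sketch of that batching loop, assuming the Mongoid models are still available on this branch (batch size illustrative; no_timeout keeps the MongoDB cursor alive during a long scan):

Address.all.no_timeout.each_slice(1000) do |batch|
  batch.each do |address|
    ## Build the mapping hash for this record (shown below)
  end
end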
  • In the rake task below you will see a “safe_task” method; it handles exceptions and prints an error message. Here is the code for safe_task:
## Run the given block, returning true on success and false on any error
def safe_task
  yield
  true
rescue StandardError => e
  puts "Exception Message: #{e.message}"
  puts "Exception Class: #{e.class.name}"
  false
end
  • Handle associations: in our schema, Address belongs_to Company and Company has_many Addresses. To store company_id inside the address table, we had to import the company records before the address data. We then fetch the associated company’s id from the PostgreSQL companies table, update the address row accordingly, and also keep the Mongo id of that company in the relation_ids field of the address.
## Retrieve associated company data from the PostgreSQL database
company = DB[:companies].select(:id).where(mongo_id: address.company_id.to_s).first
  • Hstore and array fields are first mapped directly to their PostgreSQL fields, but since plain Ruby hashes and arrays cannot be inserted through Sequel as-is, we handled them as shown in the following part of the code:
## Delete keys whose values are empty arrays
mapping.delete_if { |key, value| value.is_a?(Array) && value.empty? }

## Convert remaining hash values to Sequel hstore values
mapping.each do |key, value|
  mapping[key] = Sequel.hstore(value) if value.is_a?(Hash)
end
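Array values that survive the cleanup need similar treatment; a minimal sketch using Sequel’s pg_array support (loaded via the extension above):

## Wrap remaining Ruby arrays so Sequel inserts them as PostgreSQL arrays
mapping.each do |key, value|
  mapping[key] = Sequel.pg_array(value) if value.is_a?(Array)
end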
  • Finally, we insert the mapping hash into the PostgreSQL database:
record_no = DB[:addresses].insert(mapping)

Below is the consolidated rake task for the Address table:

desc 'Address Data Migrate'
task :addresses => :environment do

  ## Establish Connection
  ## Add extension for hstore and array

  ## Loop over address table and start one-to-one field mapping
  Address.all.each do |address|
    ## Retrieve associated company data from the PostgreSQL database
    company = DB[:companies].select(:id).where(mongo_id: address.company_id.to_s).first

    task = safe_task do
      mapping = {
        mongo_id:    address.id.to_s,         ## Actual table mongo_id
        ## Map PostgreSQL fields: MongoDB fields
        company_id:  company[:id],            ## Storing PostgreSQL parent_id

        ## For storing associated table mongo_id
        relation_ids: {
          company_id: address.company_id.to_s ## Store company mongo_id
        }
      }

      ## Handle hstore and array fields(if any) before insertion
      mapping[:relation_ids] = Sequel.hstore(mapping[:relation_ids])

      ## Insert record
      record_no = DB[:addresses].insert(mapping)
    end
  end
end
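With the task in place, it runs like any other rake task on the MongoDB branch, e.g. bundle exec rake addresses (it depends on :environment, so the full Rails app is loaded).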

You will have noticed the following in the above rake task:

  • The “mongo_id” and “relation_ids” fields are used as described in my previous blog: we store the record’s own MongoDB id in mongo_id, and the associated tables’ Mongo ids in relation_ids, for future reference. If anything goes wrong during the data transfer task it can be traced through these fields, and they also let us cross-check the consistency of the records transferred from MongoDB to PostgreSQL, as the spot check below illustrates.
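For example, a quick consistency spot check along these lines (table and model names as above):

## Compare record counts between the two stores
puts "MongoDB addresses:    #{Address.count}"
puts "PostgreSQL addresses: #{DB[:addresses].count}"

## Look up a migrated record by its original MongoDB id
address = Address.first
DB[:addresses].where(mongo_id: address.id.to_s).first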

TIP: You can print the number of records imported, along with their respective mongo_ids and the reason for any failed insertions, just to keep track of the task’s status.

NOTE: We needed to store the PostgreSQL id of each parent table in its child tables, so the parent data had to be populated first. That is why we started with the rake tasks for the parent tables, then moved on to their children, and so on down to the leaf tables. We followed the same sequence when running the rake tasks.
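One way to encode that ordering is a small wrapper task that invokes the per-table tasks parents-first (task names illustrative):

desc 'Run all data migration tasks in dependency order'
task :migrate_all => :environment do
  %w[companies addresses].each do |name|
    Rake::Task[name].invoke
  end
end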

One of the biggest challenges we faced: we had a table with more than 10 million records, and our rake task was taking more than 30 hours to import the data, breaking at many points with ‘cursor not found’ errors and SIGHUP signals as we tried to fetch the data in batches over such a huge table. The import couldn’t run in a single flow because loading the whole table took too much time, and even loading in batches didn’t help.

Solution: Initially we were iterating over the whole table, mapping the fields, and then inserting. Instead, we switched to a top-down approach. Let me explain with the following example:

class Grandparent < ApplicationRecord
  has_many :parents
end
class Parent < ApplicationRecord
  has_many   :children
  belongs_to :grandparent
end
class Child < ApplicationRecord
  belongs_to :parent
end

Let’s say we have to transfer the ‘Child’ table records from MongoDB to PostgreSQL. Instead of looping directly over the ‘Child’ table, we start with the grandparent table; for each grandparent we loop over its parents, and then proceed to their children. Have a look at the pseudocode below for more detail:

desc 'Child Data Migrate'
task :children => :environment do

  puts "Rake task 'children' start time #{Time.now}"

  ## Establish Connection

  ## Add extension for hstore and array

  total_count = Child.count
  success_count = 0

  Grandparent.each do |grandparent|
    grandparent.parents.each do |parent|
      parent.children.each do |child|
        ## Map all the fields of Child table and then insert into DB
        ## Increment success_count on each successful insertion
      end
    end
  end

  puts "Imported #{success_count} of #{total_count} child records"
end

We followed this approach and it reduced the data import time to 5 hours, and with that our data transfer was done. Sequel really made the whole procedure easy for us.

After this phase we faced many issues related to the updated Ruby on Rails version and the changed database, which I’ll cover in the next part of this blog.
