Embulk: Tool to help reduce data conversion pain

Tram Ho

What is Embulk?

Concept

Embulk is an open-source bulk data loader: its basic job is to read records from one data store and import them into another. Support for specific sources and destinations, as well as for conversion and processing steps, comes from plugins, which makes data conversion simpler and more convenient.

Embulk runs on the Java platform, so it works on many different operating systems.

Advantages

  • Simple installation; works on many different operating systems
  • Functionality is provided as plugins, so each specific task has its own plugins.
  • Plugins are updated regularly by their developers.
  • Embulk and its plugins are free, and users can customize them to suit their own requirements.

Structure of Embulk

Embulk is divided into 3 main parts:

  • Input plugins, decoder plugins, parser plugins: provide different ways to read input data. Examples: MySQL, PostgreSQL, Amazon S3, HDFS, HTTP (fetch data from an API)
  • Data processing plugins: filter plugins that filter and transform records
  • Output plugins, formatter plugins, encoder plugins: write the processed data to its destination

Some plugins:

Input plugins:

  • RDBMS (MySQL, PostgreSQL, JDBC …)
  • NoSQL (Redis, MongoDB)
  • Cloud services (Redshift, S3)
  • Files (CSV, JSON …)
  • Etc. (HDFS, HTTP, Elasticsearch, slack-history, Google Analytics)

Output plugins:

  • RDBMS (MySQL, PostgreSQL, Oracle, JDBC …)
  • Cloud services (Redshift, S3, BigQuery)
  • NoSQL (Redis)
  • Files
  • Etc. (Elasticsearch, HDFS, Swift)

Filter plugins:

  • column: cut out columns
  • insert: add columns, such as a host name, at a specified position
  • row: extract only rows that meet certain conditions
  • rearrange: reconstruct one row of data into multiple rows

File parser plugins:

  • json
  • xml
  • csv
  • apache log
  • query_string
  • regex

File formatter plugins:

  • json
    • Formats each record as JSONL (one JSON object per line)
  • poi_excel
    • Converts data to Excel format (xls, xlsx)

Executor plugins:

  • mapreduce
    • Runs Embulk tasks on Hadoop

Details of these plugins can be found at https://plugins.embulk.org/, which lists many other useful plugins as well.

How it works

When moving data from one database to another, the flow is as follows:

The input plugin reads the data, the filter plugins process it, and the output plugin imports the result into the new database.

In the more general case, the data can also be read from a file or from some other specific data source.
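To make the flow concrete, here is a minimal sketch that puts all three stages into one config file. It assumes Embulk is already installed and uses only plugins that, as far as I know, ship with Embulk itself (`file` input with the `csv` parser, the `rename` filter, and `stdout` output). The `embulk-demo` directory, the file names, and the columns are made up for illustration.

```shell
# Hypothetical working directory and sample data, for illustration only.
mkdir -p embulk-demo
printf 'id,name\n1,alice\n2,bob\n' > embulk-demo/sample_01.csv

# One config file describes the whole pipeline: in -> filters -> out.
cat > embulk-demo/pipeline.yml <<'EOF'
in:                      # input plugin: where the records come from
  type: file
  path_prefix: ./embulk-demo/sample_
  parser:                # parser plugin: how raw bytes become records
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: name, type: string}
filters:                 # filter plugins: optional processing steps
  - type: rename
    columns:
      name: user_name
out:                     # output plugin: where the records go
  type: stdout
EOF

# preview reads a sample through the input and filters without running the output.
if command -v embulk >/dev/null 2>&1; then
  embulk preview embulk-demo/pipeline.yml
else
  echo "embulk not found on PATH; config written to embulk-demo/pipeline.yml"
fi
```

Swapping the `in` or `out` section for another plugin type changes the source or destination without touching the rest of the file, which is the main appeal of this layout.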

Install Embulk

To install Embulk, you first need to install Java on your machine. Note that Embulk only works on Java 8 for now.

Install Embulk
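As a rough sketch, the installation on Linux or macOS can look like the following. The download URL and the `~/.embulk/bin` location follow Embulk's quick-start instructions as I remember them, so treat both as assumptions and check the official site for your version. On newer releases the jar may not be directly executable, in which case running it with `java -jar` should still work.

```shell
# Embulk needs a Java runtime; this article targets Java 8.
if command -v java >/dev/null 2>&1; then
  java -version
else
  echo "java not found: install a Java 8 runtime before continuing" >&2
fi

# Download the Embulk jar into ~/.embulk/bin (URL assumed from the quick start).
EMBULK_BIN="$HOME/.embulk/bin/embulk"
mkdir -p "$(dirname "$EMBULK_BIN")"
if curl -fsSL --connect-timeout 10 -o "$EMBULK_BIN" "https://dl.embulk.org/embulk-latest.jar"; then
  chmod +x "$EMBULK_BIN"
else
  echo "download failed: check the URL and your network connection" >&2
fi

# Make the embulk command available in the current shell session.
export PATH="$HOME/.embulk/bin:$PATH"
```

To keep the command available in new terminals, add the same `export PATH=...` line to your shell profile (for example `~/.bashrc`), then confirm the installation with `embulk --version`.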

Install plugins for Embulk
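Plugins are distributed as gems and installed through Embulk's own `gem` subcommand. The sketch below assumes the two plugins used in the examples of this article: `embulk-output-postgresql` for writing to PostgreSQL and `embulk-input-http` for reading from an API. As far as I know, CSV needs no extra plugin because its parser ships with Embulk.

```shell
# Plugins assumed for this article's examples (adjust to your own needs).
PLUGINS="embulk-output-postgresql embulk-input-http"

if command -v embulk >/dev/null 2>&1; then
  for plugin in $PLUGINS; do
    embulk gem install "$plugin"
  done
  embulk gem list   # show what is installed
else
  echo "embulk not found on PATH; would install: $PLUGINS" >&2
fi
```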

Importing data from CSV files or a JSON API into a PostgreSQL database requires the corresponding input and output plugins. For other purposes, see the plugin lists above.

Run Embulk with a config file
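A typical run, sketched under the assumption that your config is saved as `config.yml` (a hypothetical file name), first previews the data and then executes the load:

```shell
CONFIG="config.yml"   # hypothetical config file name

if ! command -v embulk >/dev/null 2>&1; then
  echo "embulk not found on PATH" >&2
elif [ ! -f "$CONFIG" ]; then
  echo "config file not found: $CONFIG" >&2
else
  embulk preview "$CONFIG"   # dry run: show sample rows, write nothing
  embulk run "$CONFIG"       # execute the full load
fi
```

Running `preview` first is a cheap way to catch a wrong column type or delimiter before anything is written to the destination.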

Some config examples for Embulk

Import data from CSV into a PostgreSQL database
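Here is a minimal sketch of such a config, written to a file from the shell. The option names follow the CSV parser and `embulk-output-postgresql` as I understand them; the file paths, column definitions, credentials, and table name are all placeholders to replace with your own.

```shell
cat > csv_to_postgresql.yml <<'EOF'
in:
  type: file
  path_prefix: ./data/users_        # placeholder: matches e.g. ./data/users_01.csv
  parser:
    type: csv
    charset: UTF-8
    delimiter: ","
    skip_header_lines: 1            # skip the header row
    columns:                        # placeholder schema
      - {name: id, type: long}
      - {name: name, type: string}
      - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:
  type: postgresql
  host: localhost                   # placeholder connection settings
  port: 5432
  user: embulk_user
  password: change_me
  database: embulk_demo
  table: users
  mode: insert                      # append rows to the table
EOF
```

Once the placeholders are filled in, the load is started with `embulk run csv_to_postgresql.yml`.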

Import data from a JSON API into a PostgreSQL database
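The sketch below shows the general shape of such a config, assuming the `embulk-input-http` plugin for the input side and Embulk's built-in `json` parser. The URL, query parameter, credentials, and table name are hypothetical. As far as I know, the built-in parser emits each record as a single JSON-typed column, so the target table ends up with one column unless you add a filter to expand it.

```shell
cat > http_to_postgresql.yml <<'EOF'
in:
  type: http
  url: https://api.example.com/users   # hypothetical endpoint
  method: get
  params:                              # hypothetical query parameter
    - {name: page, value: 1}
  parser:
    type: json                         # assumed: built-in JSON parser
out:
  type: postgresql
  host: localhost                      # placeholder connection settings
  port: 5432
  user: embulk_user
  password: change_me
  database: embulk_demo
  table: api_users
  mode: insert
EOF
```

Before a real load, `embulk preview http_to_postgresql.yml` is a quick way to check what the API response looks like once it has been parsed.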

For details, see Embulk's HTTP input plugin: https://github.com/takumakanari/embulk-input-http

Summary

In this article I have introduced Embulk, an excellent data conversion tool. I hope it helps ease your pain.

See more at:

https://www.embulk.org/

https://dev.embulk.org/customization.html

https://plugins.embulk.org

https://www.embulk.org/docs/

https://qiita.com/tashiro_gaku/items/f7fa0f1a99c759d947a7


Source: Viblo