Change Data Capture (CDC) with Debezium

Tram Ho

Preamble

Working with database systems has always been demanding and tedious work. Recently I had the opportunity to try a new approach for detecting changes in one database and copying that data to another database of the same or a different type. The technique I used is the one in the title: Change Data Capture, or CDC for short.


What is Change Data Capture?

True to its name, Change Data Capture is a technique for capturing changes made to the data stored in a database. Capturing these changes helps us solve quite a few data-processing problems, which we will explore in the next section.

There are many ways to capture data changes. The most primitive is to use the TRIGGER mechanism that most databases already support to catch INSERT, UPDATE, and DELETE actions. A more convenient option is to use a dedicated tool, the most prominent of which is Debezium.
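To illustrate the trigger-based approach, here is a minimal MySQL sketch; the customers table and the audit table are made up for this example:

-- Audit table that records change events (illustrative schema)
CREATE TABLE customers_audit (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  action      VARCHAR(10) NOT NULL,
  customer_id INT NOT NULL,
  changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Fires after every INSERT on customers and logs the change
CREATE TRIGGER customers_after_insert
AFTER INSERT ON customers
FOR EACH ROW
INSERT INTO customers_audit (action, customer_id) VALUES ('INSERT', NEW.id);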

Benefits of CDC

The first benefit everyone will see is copying data to other systems. Some of you may object: database systems already support replication, so why bring in an external, possibly less stable tool? Fair enough! If you only need to copy data between databases of the same type (MySQL => MySQL, Mongo => Mongo), the database's built-in feature is the best choice. However, if you want to copy data from MySQL to MongoDB or from MySQL to PostgreSQL, there is no such built-in mechanism. Here CDC sits in the middle: it detects changes in the monitored database, and your code can then process those change events and push the data to the system that needs the copy.


Another equally important benefit is data recovery. Change events are stored, so if your database is accidentally dropped at 9 a.m., you can restore the 3 a.m. backup and replay the changes recorded between 3 a.m. and 9 a.m. In theory, as long as no events are missed, your data can be fully recovered to its state just before it was dropped.


Building on the first benefit, once the data has been copied to another system, we can use that copy for testing instead of querying the production database directly. It is not uncommon for developers to run test queries that take minutes to process, or worse, lock the system. In mild cases this degrades performance; in severe cases it causes an outage. CDC gives us a way to reduce the chance of this happening.


In addition, CDC also helps with problems specific to individual systems and with Big Data processing; if you have applied CDC to such problems, please share in the comments below.

Debezium – CDC Tool

Theory is of little use without an example, so I will introduce a tool I have worked with for a while and found quite pleasant: Debezium. At its core, Debezium uses Kafka to produce messages corresponding to data change events. Debezium uses connectors to connect to database systems and capture changes; Debezium 1.9 supports MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, Db2, Cassandra, and Vitess. You can find instructions for each connector in the official documentation: https://debezium.io/documentation/reference/2.0/connectors/index.html

In this section, I will walk through the setup for MySQL, the database most people work with.

For MySQL, Debezium relies on the binlog to detect data changes, so on the system you want to monitor you need to enable the binlog and make sure the user Debezium connects with has the permissions required to read it.
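As a rough sketch, enabling the binlog usually comes down to settings like these in the MySQL server configuration (values are illustrative):

[mysqld]
server-id        = 184054
log_bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL

and granting the connector user (here a hypothetical debezium user) the replication-related privileges the Debezium documentation lists:

GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'debezium'@'%';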

I work with Kubernetes a lot, so I will show how to set this up on K8s; for other environments such as a VM or Docker, the components are essentially the same.

When running on K8s, Debezium uses the Strimzi Operator (an Operator for running Kafka). First we create a separate namespace for this application:

kubectl create ns debezium-example

Then we need to install the Strimzi Operator. The command below installs the Operator Lifecycle Manager (OLM), which we will then use to install Strimzi:

curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.20.0/install.sh | bash -s v0.20.0
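With OLM in place, the Strimzi Operator itself can be installed from OperatorHub; a minimal sketch (the manifest URL may differ between versions):

kubectl create -f https://operatorhub.io/install/strimzi-kafka-operator.yaml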

Create a Secret holding the credentials for the demo database:
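A minimal sketch of such a Secret; the base64 values below encode the example credentials debezium / dbz and should be replaced with your own:

cat << EOF | kubectl create -n debezium-example -f -
apiVersion: v1
kind: Secret
metadata:
  name: debezium-secret
  namespace: debezium-example
type: Opaque
data:
  username: ZGViZXppdW0=   # base64 of "debezium"
  password: ZGJ6           # base64 of "dbz"
EOF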

Create a Role and RoleBinding so that Debezium is allowed to read that Secret:
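A sketch of the Role and RoleBinding, assuming the Secret above and a Kafka Connect cluster named debezium-connect-cluster (Strimzi creates a service account named <cluster-name>-connect for it):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: connector-configuration-role
  namespace: debezium-example
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["debezium-secret"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: connector-configuration-role-binding
  namespace: debezium-example
subjects:
- kind: ServiceAccount
  name: debezium-connect-cluster-connect
  namespace: debezium-example
roleRef:
  kind: Role
  name: connector-configuration-role
  apiGroup: rbac.authorization.k8s.io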

Now for the important part: we create a Kafka cluster to store the change events. The configuration below creates one Kafka pod (a single broker) and one ZooKeeper pod.
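A sketch of a single-broker Strimzi Kafka resource; the cluster name and storage sizes are illustrative:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: debezium-cluster
  namespace: debezium-example
spec:
  kafka:
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      # single broker, so internal topics cannot be replicated
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      default.replication.factor: 1
      min.insync.replicas: 1
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
  zookeeper:
    replicas: 1
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}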

Next we deploy a MySQL database to test with; the user and password of this DB are mysqluser / mysqlpw.
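One quick way to spin it up is the Debezium example image, which ships with a sample inventory database (the image tag may need adjusting to your Debezium version):

kubectl run mysql -n debezium-example \
  --image=quay.io/debezium/example-mysql:1.9 \
  --env=MYSQL_ROOT_PASSWORD=debezium \
  --env=MYSQL_USER=mysqluser \
  --env=MYSQL_PASSWORD=mysqlpw

# expose it inside the cluster so the connector can reach it as "mysql:3306"
kubectl expose pod mysql -n debezium-example --port=3306 --name=mysql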

Now we will deploy the components responsible for connecting to MySQL and detecting changes. First we create a KafkaConnect cluster to do that job:
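A sketch of the KafkaConnect resource; it lets Strimzi build a Connect image that bundles the Debezium MySQL connector plugin, so the output registry and the plugin version are placeholders you need to adapt:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect-cluster
  namespace: debezium-example
  annotations:
    # allow managing connectors via KafkaConnector resources
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: debezium-cluster-kafka-bootstrap:9092
  config:
    group.id: debezium-connect-cluster
    offset.storage.topic: connect-offsets
    config.storage.topic: connect-configs
    status.storage.topic: connect-status
    # -1 = use the broker's default replication factor (we only have one broker)
    offset.storage.replication.factor: -1
    config.storage.replication.factor: -1
    status.storage.replication.factor: -1
    # read DB credentials from the Kubernetes Secret created earlier
    config.providers: secrets
    config.providers.secrets.class: io.strimzi.kafka.KubernetesSecretConfigProvider
  build:
    output:
      type: docker
      image: <your-registry>/debezium-connect-mysql:latest   # placeholder
    plugins:
      - name: debezium-mysql-connector
        artifacts:
          - type: tgz
            url: https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/1.9.4.Final/debezium-connector-mysql-1.9.4.Final-plugin.tar.gz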

Then we deploy a KafkaConnector that connects to MySQL and is attached to the KafkaConnect cluster created above.
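A sketch of the KafkaConnector; the hostname assumes the mysql Service created earlier, the credentials are pulled from the Secret via the config provider, and the property names follow Debezium 1.9:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: debezium-connector-mysql
  namespace: debezium-example
  labels:
    # must match the KafkaConnect cluster name
    strimzi.io/cluster: debezium-connect-cluster
spec:
  class: io.debezium.connector.mysql.MySqlConnector
  tasksMax: 1
  config:
    database.hostname: mysql
    database.port: 3306
    database.user: ${secrets:debezium-example/debezium-secret:username}
    database.password: ${secrets:debezium-example/debezium-secret:password}
    database.server.id: 184054
    database.server.name: mysql
    database.include.list: inventory
    database.history.kafka.bootstrap.servers: debezium-cluster-kafka-bootstrap:9092
    database.history.kafka.topic: schema-changes.inventory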

That’s it, the setup is done and we can now monitor changes in the database. Run the following command to listen for messages in Kafka.
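For example, using the Kafka image shipped with Strimzi to consume the topic for the customers table (topic names follow <server.name>.<database>.<table>; the image tag may need adjusting):

kubectl run -n debezium-example -it --rm --restart=Never kafka-consumer \
  --image=quay.io/strimzi/kafka:latest-kafka-3.1.0 -- \
  bin/kafka-console-consumer.sh \
    --bootstrap-server debezium-cluster-kafka-bootstrap:9092 \
    --topic mysql.inventory.customers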

Open another terminal. Now we access the DB and add a record to test:
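Assuming the example image above, which ships with a sample inventory database containing a customers table:

kubectl exec -n debezium-example -it mysql -- \
  mysql -u mysqluser -pmysqlpw inventory

-- inside the MySQL shell:
INSERT INTO customers (first_name, last_name, email)
VALUES ('Tram', 'Ho', 'tram.ho@example.com');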

Add one more record:
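Another insert into the same sample table, just to produce a second change event:

INSERT INTO customers (first_name, last_name, email)
VALUES ('Debezium', 'Demo', 'demo@example.com');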

If you see a JSON message like the following in the consumer, the setup was successful:
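A trimmed, illustrative example of the payload of a Debezium change event; the exact fields and values will differ in your setup:

{
  "before": null,
  "after": {
    "id": 1005,
    "first_name": "Tram",
    "last_name": "Ho",
    "email": "tram.ho@example.com"
  },
  "source": {
    "version": "1.9.4.Final",
    "connector": "mysql",
    "name": "mysql",
    "db": "inventory",
    "table": "customers",
    "file": "mysql-bin.000003",
    "pos": 775
  },
  "op": "c",
  "ts_ms": 1654321098765
}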

The output message has three main parts: source (information about the data source, e.g., which binlog file and position the event came from), before (the data before the change), and after (the data after the change).

Conclusion

During the installation I ran into quite a few problems, partly because I did not have much experience with Kafka, so if you hit errors you are not alone. Feel free to comment with any error and I will try to help. I hope this article has given you some useful knowledge about Change Data Capture.

Some sources I referred to:

https://luminousmen.com/post/change-data-capture

https://www.striim.com/tutorial/streaming-data-integration-using-cdc-to-stream-database-changes/


Source: Viblo