Introduction to Apache NiFi
What is Apache NiFi?
Apache NiFi is an open source tool built to automate the flow of data between systems.
Main features:
- Browser-based user interface
- Data provenance tracking
- Extensive configuration
- Extensible design
- Secure communication
NiFi Architecture
Figure 1: NiFi Standalone
NiFi executes in a JVM on the host operating system. The main components of NiFi on the JVM are as follows:
- Web server: The web server hosts NiFi's HTTP-based command and control API.
- Flow controller: The flow controller is the brains of the operation. It provides threads for extensions to run on and manages the schedule of when extensions receive resources to execute.
- Extensions: There are many types of NiFi extensions, described in other documents. The key point here is that extensions operate and execute within the JVM.
- FlowFile Repository: The FlowFile repository is where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The repository implementation is pluggable; the default approach is a persistent Write-Ahead Log located on a specified disk partition.
- Content Repository: The content repository is where the actual content bytes of a given FlowFile are stored. The repository implementation is pluggable; the default approach is a fairly simple mechanism that stores blocks of data in the file system. More than one file-system storage location can be specified so that different physical partitions share the load, reducing contention on any single drive.
- Provenance Repository: The provenance repository is where all provenance event data is stored. The repository implementation is pluggable, with the default using one or more physical disk volumes. Within each location, event data is indexed and searchable.
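The locations of the three repositories above are set in nifi.properties. A minimal sketch using NiFi's default relative paths (adjust the paths for your own partitions):

```properties
# FlowFile repository (persistent Write-Ahead Log)
nifi.flowfile.repository.directory=./flowfile_repository
# Content repository; additional named locations can be added to spread I/O
nifi.content.repository.directory.default=./content_repository
# Provenance repository
nifi.provenance.repository.directory.default=./provenance_repository
```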
NiFi can also operate in a cluster:
Figure 2: NiFi Cluster
- Operation model: Zero-Master Clustering.
- Cluster Coordinator: the node in a NiFi cluster responsible for maintaining information about the state of the cluster.
- Primary Node: the node on which isolated processors (those configured to run on the primary node only) are executed.
- Each node in the NiFi cluster performs the same tasks on the data, but each node operates on a different set of data.
Scheduling a daily data fetch from MongoDB to Hadoop
Figure 3: Description of how to handle the problem
Environment settings
- Install Apache Hadoop
- Install MongoDB, MongoDB Compass
- Install Apache NiFi
Processing in NiFi
We will use two NiFi processors:
- GetMongo
- PutHDFS
Figure 4: Processor flow in NiFi
GetMongo processor
Set up the Scheduling tab:
- Scheduling Strategy: CRON driven
- Run Schedule:

59 59 23 ? * MON-SUN

This schedules the processor to run at 23:59:59 every day, Monday through Sunday.
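Note that NiFi's CRON schedule uses Quartz field ordering, which starts with a seconds field, unlike classic Unix cron. A quick sketch of how the expression above breaks down:

```javascript
// Split the Quartz cron expression used above into its six named fields.
// Quartz order: seconds, minutes, hours, day-of-month, month, day-of-week.
var fields = ["seconds", "minutes", "hours", "day-of-month", "month", "day-of-week"];
var parts = "59 59 23 ? * MON-SUN".split(" ");
var schedule = {};
for (var i = 0; i < fields.length; i++) {
  schedule[fields[i]] = parts[i];
}
console.log(schedule);
// schedule.seconds === "59", schedule.hours === "23",
// schedule["day-of-week"] === "MON-SUN" (every day of the week)
```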
Figure 5: Daily scheduler configuration
Set up the Properties tab:
- Mongo URI: the MongoDB connection string, e.g. mongodb://host1[:port1]
- Mongo Database Name: the database name
- Mongo Collection Name: the collection name
- Query: a MongoDB query in JSON format. In this case we assume that each document in the MongoDB collection has a time_created field, so that the data can be queried by day.
```json
{
    "$where": "find_data_by_date_range(this.time_created)"
}
```
Figure 6: MongoDB connection and query configuration
Create the find_data_by_date_range stored function (a server-side function saved in the system.js collection):
```javascript
db.system.js.insertOne({
    _id: "find_data_by_date_range",
    value: function (created_time) {
        var fromDate = new Date(new Date().setHours(0, 0, 0, 0));
        var toDate = new Date(new Date().setHours(23, 59, 59, 999));
        if ((created_time >= fromDate) && (created_time <= toDate)) {
            return true;
        }
        return false;
    }
});
```
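To sanity-check the date-range logic outside MongoDB, the same function can be run as plain JavaScript (a local sketch; created_time stands in for a document's time_created value):

```javascript
// Returns true only if created_time falls within the current calendar day.
function find_data_by_date_range(created_time) {
  var fromDate = new Date(new Date().setHours(0, 0, 0, 0));    // today 00:00:00.000
  var toDate = new Date(new Date().setHours(23, 59, 59, 999)); // today 23:59:59.999
  return (created_time >= fromDate) && (created_time <= toDate);
}

console.log(find_data_by_date_range(new Date()));                          // true
console.log(find_data_by_date_range(new Date(Date.now() - 2 * 86400000))); // false (two days ago)
```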
Figure 7: Data query processing flow from MongoDB
PutHDFS processor
- Hadoop Configuration Resources: paths to the Hadoop configuration files (core-site.xml and hdfs-site.xml)
- Directory: the destination directory on HDFS
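For example, the two properties might be set as follows (the paths here are illustrative assumptions; use the locations from your own Hadoop installation):

```
Hadoop Configuration Resources: /opt/hadoop/etc/hadoop/core-site.xml,/opt/hadoop/etc/hadoop/hdfs-site.xml
Directory: /user/nifi/mongo_daily
```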
Figure 8: Configuring put data on HDFS
Figure 9: Job completion
Summary
This post described my way of transferring data from MongoDB to Hadoop. I hope it helps you solve this problem. If you have a better solution, please share it in the comments. Thank you!
References
- Apache NiFi : https://nifi.apache.org/docs.html
- Install Hadoop: https://github.com/DoManhQuang/datasciencecoban/tree/master/blog/hadoop/install-hadoop
- GetMongo Processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-mongodb-nar/1.19.1/org.apache.nifi.processors.mongodb.GetMongo/index.html
- PutHDFS Processor: https://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.19.1/org.apache.nifi.processors.hadoop.PutHDFS/index.html