Introduction to Apache NiFi
What is Apache NiFi?
Apache NiFi is an open source tool built to automate the flow of data between systems.
Main features:
- Browser-based user interface
- Data provenance tracking
- Extensive configuration
- Extensible design
- Secure communication
NiFi Architecture
Figure 1: NiFi Standalone
NiFi executes in a JVM on the host operating system. The main components of NiFi on the JVM are as follows:
- Web server: The web server hosts NiFi's HTTP-based command and control API.
- Flow controller: The flow controller is the brains of the operation. It provides threads for extensions to run on and manages the schedule of when extensions receive resources to execute.
- Extensions: There are many types of NiFi extensions, described in other documents. The key point here is that extensions operate and execute within the JVM.
- FlowFile Repository: The FlowFile repository is where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The repository implementation is pluggable; the default approach is a persistent Write-Ahead Log located on a specified disk partition.
- Content Repository: The content repository is where the actual content bytes of a given FlowFile are stored. The repository implementation is pluggable; the default approach is a fairly simple mechanism that stores blocks of data in the file system. More than one file-system storage location can be specified so that different physical partitions share the load, reducing contention on any single drive.
- Provenance Repository: The provenance repository is where all provenance event data is stored. The repository implementation is pluggable, with the default using one or more physical disk volumes. Within each location, event data is indexed and searchable.
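The locations of the three repositories above are set in nifi.properties. A minimal sketch using NiFi's default relative paths (adjust the paths for your own partitions):

```properties
# FlowFile repository (persistent Write-Ahead Log)
nifi.flowfile.repository.directory=./flowfile_repository
# Content repository; additional named locations can be added to spread I/O
nifi.content.repository.directory.default=./content_repository
# Provenance repository
nifi.provenance.repository.directory.default=./provenance_repository
```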
NiFi can also operate in a cluster:
Figure 2: NiFi Cluster
- Operation model: Zero-Master Clustering.
- Cluster Coordinator: the node in a NiFi cluster responsible for maintaining information about the state of the cluster.
- Primary Node: the node on which isolated processors (those configured to run on the primary node only) are executed.
- Each node in the NiFi cluster performs the same tasks on the data, but each node operates on a different set of data.
Scheduling a daily data fetch from MongoDB to Hadoop
Figure 3: Description of how to handle the problem
Environment settings
- Install Apache Hadoop
- Install MongoDB, MongoDB Compass
- Install Apache NiFi
Processing in NiFi
We will use two NiFi processors:
- GetMongo
- PutHDFS
Figure 4: Processor flow in NiFi
GetMongo processor
Set up the Scheduling tab:
- Scheduling Strategy: CRON driven
- Run Schedule:

59 59 23 ? * MON-SUN

This schedules the processor to run at 23:59:59 every day, Monday through Sunday.
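Note that NiFi's CRON schedule uses Quartz field ordering, which starts with a seconds field, unlike classic Unix cron. A quick sketch of how the expression above breaks down:

```javascript
// Split the Quartz cron expression used above into its six named fields.
// Quartz order: seconds, minutes, hours, day-of-month, month, day-of-week.
var fields = ["seconds", "minutes", "hours", "day-of-month", "month", "day-of-week"];
var parts = "59 59 23 ? * MON-SUN".split(" ");
var schedule = {};
for (var i = 0; i < fields.length; i++) {
  schedule[fields[i]] = parts[i];
}
console.log(schedule);
// schedule.seconds === "59", schedule.hours === "23",
// schedule["day-of-week"] === "MON-SUN" (every day of the week)
```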
Figure 5: Daily scheduler configuration
Set up the Properties tab:
- Mongo URI: the MongoDB connection string, e.g. mongodb://host1[:port1]
- Mongo Database Name: the database name
- Mongo Collection Name: the collection name
- Query: a MongoDB query in JSON format. In this case we assume that each document in the MongoDB collection has a time_created field, so that the data can be queried by day.
```json
{
    "$where": "find_data_by_date_range(this.time_created)"
}
```
Figure 6: MongoDB connection and query configuration
Create the find_data_by_date_range stored function (a server-side function saved in the system.js collection):
```javascript
db.system.js.insertOne({
    _id: "find_data_by_date_range",
    value: function (created_time) {
        var fromDate = new Date(new Date().setHours(0, 0, 0, 0));
        var toDate = new Date(new Date().setHours(23, 59, 59, 999));
        if ((created_time >= fromDate) && (created_time <= toDate)) {
            return true;
        }
        return false;
    }
});
```
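To sanity-check the date-range logic outside MongoDB, the same function can be run as plain JavaScript (a local sketch; created_time stands in for a document's time_created value):

```javascript
// Returns true only if created_time falls within the current calendar day.
function find_data_by_date_range(created_time) {
  var fromDate = new Date(new Date().setHours(0, 0, 0, 0));    // today 00:00:00.000
  var toDate = new Date(new Date().setHours(23, 59, 59, 999)); // today 23:59:59.999
  return (created_time >= fromDate) && (created_time <= toDate);
}

console.log(find_data_by_date_range(new Date()));                          // true
console.log(find_data_by_date_range(new Date(Date.now() - 2 * 86400000))); // false (two days ago)
```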
Figure 7: Data query processing flow from MongoDB
PutHDFS processor
- Hadoop Configuration Resources: paths to the Hadoop configuration files (core-site.xml and hdfs-site.xml)
- Directory: the destination directory on HDFS
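For example, the two properties might be set as follows (the paths here are illustrative assumptions; use the locations from your own Hadoop installation):

```
Hadoop Configuration Resources: /opt/hadoop/etc/hadoop/core-site.xml,/opt/hadoop/etc/hadoop/hdfs-site.xml
Directory: /user/nifi/mongo_daily
```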
Figure 8: Configuring put data on HDFS
Figure 9: Job completion
Summary
This post described my way of transferring data from MongoDB to Hadoop. I hope it helps you solve this problem. If you have a better solution, please share it in the comments. Thank you!
References
- Apache NiFi : https://nifi.apache.org/docs.html
- Install Hadoop: https://github.com/DoManhQuang/datasciencecoban/tree/master/blog/hadoop/install-hadoop
- GetMongo Processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-mongodb-nar/1.19.1/org.apache.nifi.processors.mongodb.GetMongo/index.html
- PutHDFS Processor: https://nifi.incubator.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.19.1/org.apache.nifi.processors.hadoop.PutHDFS/index.html