Scheduling a same-day data fetch from MongoDB to Hadoop in Apache NiFi

Tram Ho

Introduction to Apache NiFi

What is Apache NiFi?

Apache NiFi is an open source tool built to automate the flow of data between systems.

Main features:

  • Browser-based user interface
  • Data provenance tracking
  • Extensive configuration
  • Extensible design
  • Secure communication

NiFi Architecture


Figure 1: NiFi Standalone


NiFi executes in a JVM on the host operating system. The main components of NiFi on the JVM are as follows:

  • Web server: The purpose of a web server is to host NiFi’s HTTP-based command and control APIs.
  • Flow controller: The flow controller is the brains of the operation. It provides threads for extensions to run on and manages the schedule when extensions get resources for execution.
  • Extensions: There are many types of NiFi extensions described in other documents. The key point here is that the extensions work and execute inside the JVM.
  • FlowFile Repository: The FlowFile repository is where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The repository implementation is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.
  • Content Repository: The content repository is where the actual content bytes of a given FlowFile are stored. The repository implementation is pluggable. The default approach is a fairly simple mechanism that stores blocks of data in the file system. More than one file system storage location can be specified, so that different physical partitions are engaged and contention on any single volume is reduced.
  • Provenance Repository: The provenance repository is where all provenance event data is stored. The storage structure is pluggable, with the default implementation using one or more physical disk volumes. Within each location, event data is indexed and searchable.

NiFi can also operate in a cluster:

Figure 2: NiFi Cluster


  • Operation model: Zero-Master Clustering.
  • Cluster Coordinator: the node in a NiFi cluster responsible for maintaining information about the state of the cluster and for managing which nodes are allowed to join.
  • Primary Node: the node on which isolated processors (those configured to run on the primary node only) are executed.
  • Each node in the NiFi cluster performs the same tasks on the data, but each node operates on a different set of data.

Scheduling a same-day data fetch from MongoDB to Hadoop


Figure 3: Description of how to handle the problem


Environment settings

  • Install Apache Hadoop
  • Install MongoDB, MongoDB Compass
  • Install Apache NiFi

Processing in NiFi

We will use two NiFi processors:

  • GetMongo
  • PutHDFS


Figure 4: Processor flow in NiFi

GetMongo processor

Set up the Scheduling tab:

  • Scheduling Strategy: CRON driven
  • Run Schedule: 59 59 23 ? * MON-SUN schedules the processor to run at 23:59:59 every day, Monday through Sunday
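NiFi's CRON-driven scheduler uses the Quartz cron format, which has six fields (seconds, minutes, hours, day of month, month, day of week) plus an optional seventh year field. The sketch below is a hypothetical helper, not part of NiFi, that just labels each field of the expression used above so you can see what it means:

```python
# Hypothetical helper: label the fields of a Quartz-style cron expression
# (the six-field format used by NiFi's CRON-driven scheduling strategy:
#  seconds, minutes, hours, day-of-month, month, day-of-week, [year]).
FIELD_NAMES = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]

def parse_quartz_cron(expr: str) -> dict:
    """Split a Quartz cron expression into named fields."""
    parts = expr.split()
    if len(parts) not in (6, 7):  # the optional 7th field is the year
        raise ValueError("Quartz cron expressions have 6 or 7 fields")
    return dict(zip(FIELD_NAMES + ["year"], parts))

fields = parse_quartz_cron("59 59 23 ? * MON-SUN")
print(fields)
# hours=23, minutes=59, seconds=59 -> the processor fires at 23:59:59,
# and day_of_week=MON-SUN covers every day of the week.
```

Reading the fields this way makes it easy to adapt the schedule, e.g. changing the hours field to run the fetch at a different time of day.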


Figure 5: Daily scheduler configuration


Set up the Properties tab:

  • Mongo URI: the MongoDB connection string, e.g. mongodb://host1[:port1]
  • Mongo Database Name: the database name
  • Mongo Collection Name: the collection name
  • Query: the MongoDB query, in JSON format. Here we assume the collection has a time_created field, so that documents can be queried by day.
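As a sketch of what the Query property could contain, the snippet below builds a same-day range query on the assumed time_created field. It assumes time_created is stored as an ISO-8601 string; if it is a native BSON date you would use extended JSON ({"$date": ...}) instead:

```python
import json
from datetime import datetime, time, timedelta

def same_day_query(day: datetime) -> str:
    """Build a JSON query selecting documents whose assumed
    time_created field falls within the given calendar day."""
    start = datetime.combine(day.date(), time.min)   # 00:00:00 of that day
    end = start + timedelta(days=1)                  # 00:00:00 of the next day
    return json.dumps({
        "time_created": {
            "$gte": start.isoformat(),  # inclusive lower bound
            "$lt": end.isoformat(),     # exclusive upper bound
        }
    })

print(same_day_query(datetime(2021, 5, 1)))
# {"time_created": {"$gte": "2021-05-01T00:00:00", "$lt": "2021-05-02T00:00:00"}}
```

Using $gte on the day's start and $lt on the next day's start avoids edge cases around 23:59:59, since the upper bound is exclusive.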


Figure 6: MongoDB connection and query configuration


Create a find_data_by_date_range helper (a function/procedure that queries documents by date range):
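The post does not show the body of find_data_by_date_range, so the following is only a plain-Python re-creation of the logic such a helper would implement: keep the documents whose time_created falls inside a half-open date range.

```python
from datetime import datetime

def find_data_by_date_range(docs, start, end, field="time_created"):
    """Hypothetical re-creation of find_data_by_date_range:
    return the documents whose `field` timestamp lies in [start, end)."""
    return [d for d in docs if start <= d[field] < end]

# Two sample documents standing in for a MongoDB collection.
docs = [
    {"_id": 1, "time_created": datetime(2021, 5, 1, 10, 0)},
    {"_id": 2, "time_created": datetime(2021, 5, 2, 9, 30)},
]

same_day = find_data_by_date_range(docs, datetime(2021, 5, 1), datetime(2021, 5, 2))
print([d["_id"] for d in same_day])  # -> [1]
```

In the real flow this filtering happens server-side in MongoDB via the GetMongo Query property, not in client code; the sketch only mirrors the selection logic.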


Figure 7: Data query processing flow from MongoDB


PutHDFS processor

  • Hadoop Configuration Resources: paths to the Hadoop configuration files (core-site.xml and hdfs-site.xml)
  • Directory: the destination directory on HDFS
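As an illustration, the two properties might be filled in like this (the paths below are placeholders, not values from the post; the Directory property accepts NiFi Expression Language, so a date-stamped folder per day is possible):

```
Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Directory                      : /data/mongo_export/${now():format('yyyy-MM-dd')}
```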


Figure 8: Configuring the PutHDFS processor


Figure 9: Job completion


This post describes my way of transferring data from MongoDB to Hadoop. I hope it helps you solve the same problem. If you have a better solution, please share it in the comments. Thank you!



Source: Viblo