Introducing Apache NiFi


Apache NiFi Tutorial

For those of you who have had to perform tasks such as crawling data from websites, collecting application logs, or pulling data from FTP servers, DBMSs, or Kafka to process and load into a file storage system (e.g. HDFS), you have probably run into one of the following problems:

  • How to move data from the source to the storage destination fast enough for the downstream tasks.
  • How to build handlers for a variety of data formats and data sources.
  • How to guarantee data safety, so that data is never lost, corrupted, or duplicated.

After being introduced to it and studying it on my own, I found Apache NiFi to be an effective tool for the tasks above. If you have read the article on big data system architecture (1) posted in the group, you will know that NiFi belongs to the data ingestion group of tools: it plays an important role in collecting, processing, and transferring data from data sources (Data Source) to data storage components (Data Storage) or to real-time message ingestion components (Real-Time Message Ingestion).

What is Apache NiFi?

Apache NiFi is open-source software written in Java, created to automate the flow of data between software systems. It began in 2006 as “NiagaraFiles”, software developed by the NSA, and was released as open source in 2014 (Wiki Source).

NiFi is known for its ability to build automated data transfer flows between systems. In particular, it supports many different types of sources and destinations, such as:

  • Relational databases: Oracle, MySQL, PostgreSQL, …
  • NoSQL databases: MongoDB, HBase, Cassandra, …
  • Web sources such as HTTP and WebSocket endpoints
  • Pulling streaming data from, or pushing it into, Kafka
  • FTP servers and log files

Beyond extracting and delivering data, NiFi can also route data by attributes or content, and transform data in flight: filtering, editing, enriching, or trimming content before sending it on to storage.
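In NiFi this kind of routing is done with processors such as RouteOnAttribute. As a rough sketch of the idea only (the FlowFile dictionaries and rule names below are illustrative, not NiFi's actual Java API):

```python
# Sketch of attribute-based routing, similar in spirit to NiFi's
# RouteOnAttribute processor. A flowfile is modeled as a dict with
# "attributes" and "content"; relationship names are hypothetical.

def route_on_attribute(flowfile, rules):
    """Return the first relationship whose predicate matches the attributes."""
    for relationship, predicate in rules.items():
        if predicate(flowfile["attributes"]):
            return relationship
    return "unmatched"

rules = {
    "errors":   lambda a: a.get("log.level") == "ERROR",
    "warnings": lambda a: a.get("log.level") == "WARN",
}

ff = {"attributes": {"log.level": "ERROR"}, "content": b"disk failure"}
print(route_on_attribute(ff, rules))  # -> errors
```

Each relationship would then connect to a different downstream processor, just as connections fan out from a processor on the NiFi canvas.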

What’s good about Apache NiFi?

NiFi’s three prominent feature sets are data flow management, ease of use and operation, and scalability. When deciding whether to use NiFi for the data ingestion layer, you can weigh these features against your project’s needs:

The first is data flow management:

  • Guaranteed safety: each unit of data in your flow is represented by an object called a FlowFile. It records all information about the data in the stream, such as which processor is handling it and where it is being transferred. The processing history of each FlowFile is stored in the provenance repository so we can trace it. Combined with a copy-on-write mechanism, NiFi keeps the data at each step of the flow before processing it, making it easy to replay.
  • Data buffering: this solves the problem of a producer and a consumer running at different rates in two connected systems. It works through a queue between two processors in the flow. Queued data is kept in RAM, but once it exceeds a threshold you configure, it spills to disk.
  • Prioritization: in some cases we need to process certain data before the rest. For example, a log labeled error should usually be handled immediately, ahead of warnings.
  • Trade-off between latency and fault tolerance: some streams require absolute integrity and safety of the data and can accept high latency, while others need the data at the destination as quickly as possible. NiFi provides settings to balance these two factors.
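The queue-plus-prioritizer behavior described above can be sketched as a toy model (the class, threshold, and attribute names below are illustrative, not NiFi's API; real NiFi connections also spill to disk rather than simply rejecting data):

```python
import heapq

# Toy model of a NiFi connection queue: a prioritizer puts ERROR
# flowfiles ahead of others, and a backpressure object threshold
# signals the upstream processor to stop when the queue is full.

class ConnectionQueue:
    def __init__(self, backpressure_threshold=3):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority
        self.backpressure_threshold = backpressure_threshold

    def offer(self, flowfile):
        """Enqueue a flowfile; return False (apply backpressure) when full."""
        if len(self._heap) >= self.backpressure_threshold:
            return False
        priority = 0 if flowfile["attributes"].get("level") == "ERROR" else 1
        heapq.heappush(self._heap, (priority, self._counter, flowfile))
        self._counter += 1
        return True

    def poll(self):
        """Dequeue the highest-priority flowfile, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

q = ConnectionQueue(backpressure_threshold=3)
q.offer({"attributes": {"level": "WARN"},  "content": b"w1"})
q.offer({"attributes": {"level": "ERROR"}, "content": b"e1"})
q.offer({"attributes": {"level": "WARN"},  "content": b"w2"})
print(q.offer({"attributes": {"level": "WARN"}, "content": b"w3"}))  # False: backpressure
print(q.poll()["attributes"]["level"])  # ERROR is dequeued first
```

In real NiFi, the prioritizer and the backpressure thresholds are configured per connection in the web interface rather than in code.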

The second is ease of use and operation:

  • Flows are created entirely in the web interface; with a few drag-and-drop actions you can quickly build a simple flow.
  • Reuse is supported: you can create a template containing a basic flow and reuse it as needed.
  • You can visually trace the processing history of data when debugging errors.
  • You can replay data at any processing step.
  • You can easily program a custom processing, controller, reporting, or UI component in NiFi when needed, for example a processor to encode or decode data.
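A real custom NiFi processor is a Java class extending AbstractProcessor; as a language-neutral sketch of what its onTrigger() step does for the encode/decode example above (all names here are illustrative, not NiFi's actual API):

```python
import base64

# Sketch of the core transform a custom "encode" processor would
# perform on a flowfile's content, modeled as a plain function.

def encode_processor(flowfile):
    """Return a new flowfile whose content is Base64-encoded."""
    encoded = base64.b64encode(flowfile["content"])
    return {
        # keep the original attributes and record what was done
        "attributes": {**flowfile["attributes"], "encoding": "base64"},
        "content": encoded,
    }

ff = {"attributes": {"filename": "a.txt"}, "content": b"hello"}
out = encode_processor(ff)
print(out["content"])  # b'aGVsbG8='
```

In NiFi the equivalent logic reads the flowfile's content stream, writes the transformed bytes back, updates attributes, and transfers the flowfile to a success relationship.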

Finally, it is worth mentioning an important feature of applications in distributed systems: horizontal scalability (adding servers to the cluster). If the data flow above can handle 100 MB/s on a single NiFi server but the actual requirement is 500 MB/s, you can install a cluster of several servers to process data in parallel, without upgrading each server’s configuration.

Who is using Apache NiFi?

With the advantages and features above, Apache NiFi is increasingly used in the technology stacks of large companies such as Cloudera, Looker, …

  • Micron: a company in the semiconductor manufacturing industry. They use NiFi to collect production data from their sites worldwide and feed it into centralized data warehouses. The data is then mined to build a comprehensive view of manufacturing activity. For processing-heavy streams, NiFi’s site-to-site protocol delivers data continuously to Spark jobs on the Hadoop cluster. They also use NiFi’s REST API to automate the creation and management of flows.
  • Macquarie Telecom Group: operates in cloud and telecommunications. The company uses NiFi to transfer millions of event records between data centers, and to transform and enrich the data.
  • Dovestech: a cybersecurity company. They use NiFi to enrich and normalize millions of cybersecurity records and feed them into a centralized database for their cybersecurity visualization product, ThreatPop.
  • Looker: a company providing SaaS analytics software. All of their new data pipelines are now built with NiFi, and old pipelines are gradually being migrated to it. The company deploys NiFi clusters to collect, transform, and transfer data to platforms such as Google BigQuery, Amazon Redshift, and Amazon S3.

You can see more companies that use NiFi here.

Besides Apache NiFi, what other options are there?

For the data collection component of a big data system there are other frameworks as well, and you can choose the appropriate tool for your specific problem. However, through research and practical deployment, I found NiFi easy to install, use, and administer, and its scalability met the requirements of my big data system.

  • Apache Sqoop: a command-line tool for bidirectional data transfer between Hadoop and RDBMSs.
  • Apache Flume: a tool for pushing data one way from several data sources into storage such as Hadoop, HBase, … Flume handles real-time data well, with high throughput and low latency. However, flows are configured in a Java properties file, which is quite cumbersome. In addition, Flume’s data guarantees are weaker than those of other systems (data duplication can occur (2)) because it prioritizes processing time.
  • Apache Flink: a framework for distributed stream processing. It offers high processing speed, performance, and availability, and works very well for online data analysis applications (e.g. live reporting, anomaly detection). However, its ability to collect data from many different sources is not as diverse as NiFi’s, so one approach is to combine NiFi’s data collection capabilities with Flink’s high-speed, complex computation.

That is a brief introduction to the Apache NiFi tool; I look forward to hearing about your experience with NiFi or other tools.


Source: Viblo