1. Things to know before you learn Kafka
Publish / Subscribe messaging
Before discussing the specifics of Apache Kafka, we need to understand the structure of pub / sub messaging and why it is important. Publish / subscribe messaging is a pattern that is characterized by sending (publisher) data (message) without explicit recipient designation. Instead, the sender will classify the message into classes and somehow the subscriber must register in a certain class to receive messages in that class. The pub / sub system usually has a broker, which is the center where messages are delivered.
How to start
There are a lot of use cases for Pub / sub, the simplest is to start with a message queue (message queue) or interprocess communication channel. For example, you create an application You need to send data to track website activity such as you make a direct connection from your website to an application that can display that data on a table to be able to track. For example in the picture here
This is just a simple example when starting out with data tracking. Later you decide to try data analysis for a long time and vs the current model does not work well. So you’ll start needing a service that can take data, store them, statistic them. To support this, you modify your application so that you can send data to both systems. So now you have new applications that will generate data. And of course it also needs to make the same connection as the previous service. Gradually due to the need you will have many applications and services to serve different purposes (my eyes @@)
The inadequacies of this system building technique are obvious, so you decide to do something. You set up a single application that receives data from all external applications and provides a server to maintain and query those data for any system that needs them.
Individual Queue Systems (personal queuing system)
At the same time you want to conduct other activities such as checking or providing user actions on your website for ML developers to collect and analyze customer habits. And then you realize the similarity of those systems and the picture below shows three pub / sub systems models
This way it is much better to take advantage of point to point connections but it does have a lot of repetition. Suppose your company maintains multiple queuing data systems (all with their own bugs and limitations). You also know there will be many new use cases for messages in the future. So you want a single centralized system that allows you to create the same kind of data that can grow as your business grows.
What is Kafka?
Apache Kafka is a publish / subscribe messaging system designed to solve this problem. It is often described as a “commit log distribution system” or more recently known as a “distributed distribution platform”. Literally a file system or commit log database is designed to provide a consistent record of all transactions to make the system consistent, the data in Kafka is stored. long term storage, in order. In addition, data can be distributed within the system to provide additional or safeguard against system errors that make data inconsistent, as well as the ability to scaling performance.
Messages and Batches
The unit of data in Kafka is called the
message . If you approach Kafka from the perspective of the database platform, you can think of a message similar to a
row or a
record . A message is simply an array. bytes, so the data contained therein has no specific format or meaning. A message may have a bit of metadata option , called a key (meaning that this bit is a key for metadata: v). This key is also a byte array and like the message it has no specific meaning. The keys are used when the message is written to different partitions in a more manageable way. The simplest is to create a consistent hash function of the key and then enter the partition with the number as a result of the key’s hash. This ensures that messages with the same key are always logged in the same areas.
To be effective, messages are written to Kafka in batches (Batches).
batch is just a collection of messages, all of which are being created for the same topic ( partition ) and partition (partition). Saving every message that runs individually on the network is too expensive, so aggregating these messages into a batch (Batches) to minimize this cost. Of course, this is a trade-off between latency and throughput, just two sides of the same coin, the bigger the lots the more messages will have to wait for enough new blocks to be sent so the latency will be large. Batches are also often compressed thus providing more efficient data transfer and storage
Although the messages are byte arrays, they are not meaningful, so we should apply the structure or schema to the message content to make it easier to understand. There are many options available for message schema, depending on what your application needs. Simple systems, such as (JSON) and (XML), are easy to use and read to humans. However, they lack features such as powerful type handling and compatibility between schemas. Many Kafka developers support the use of Apache Avro, a serialization framework originally developed for Hadoop. Avro provides a compact serialization format, Schemas are separate from message payloads and do not require any re-coding when they change.
A consistent data format is important in Kafka, because it allows writing and reading separate messages. When these tasks are tightly combined, applications that register to receive messages must be updated to handle the new data format in parallel with the old format. After that, applications need to fire newly updated messages to use the new format. By using well-defined shemas messages in Kafka can be more easily understood. (If it is too hard to understand, think of it as rpc sending data and need a proto file to reformat the data when receiving comments)
Topics and Partitions (topics and partitions)
Messages in Kafka are categorized into
topics . The closest example to that topic is the table in the db or the folder in the filesystem. The topics are broken down into small partitions. Returning to the description of the
commit log , a partition is interpreted as a unique record of messages. Messages are written in a table of contents and are read from start to finish. Note that since a topic often has multiple partitions, there is no guarantee of timing the messages on the topic in a partition. only. The figure below shows a topic with four partitions, with messages being added to the end of each partition. Each partition can be stored on a different server, which means that a topic can be scaled horizontally across multiple servers to provide performance far beyond the capabilities of a single server.
Stream is the term commonly used when talking about data in systems like Kafka. Typically, a
stream is considered a
single topic of data, regardless of the number of partitions in the topic. This represents a unique
stream data transferred from producers to consumers. The way these messages are referred to is quite common when describing the processing of
stream , which is when some frameworks, such as Kafka Streams, Apache Samza and Storm, operate on messages in real time. This approach can be compared to how offline frameworks, specifically Hadoop, are designed to work on processed data (not real time like the ones above).
Producers and Consumers
Kafka clients are Kafka system users and there are two basic types: producers and consumers. There are also advandced client APIs –
Kafka Connect API to integrate data and
Kafka Streams for stream processing.
Producers are the components that create messages. In a pub / sub system it may be called
writers . In general it creates messages and provides them with a specific topic. By default, producers don’t care which messages the topic is put into. In some cases, producers will send messages to specific partitions. This is usually done by using the key and the partitioner will generate a key hash of the key and map it to a specific partition. This ensures that all messages generated with a given key will be written to the same partition.
Consumers are components that read messages. In other pub / sub systems, consumers can be called subscribers or readers. Consumers subscribe to one or more topics and read the messages in the order in which they were created. Consumers keep track of the messages it has registered by tracking offsets (which are integer values that increase each time a message is created). Each message in a partition has a unique number. Using this offset value enables consumers to stop or restart the message to continue reading without losing their previously read position. In addition, consumers will act as a group. In this way, consumers can expand horizontally to handle topics with a large number of messages. In addition, if a consumer is deactivated, the remaining members of the group will rebalance the partitions being used and take over the work of the deactivated member.
Brokers and Clusters
A Kafka server is called a broker . Broker receives messages from producers, assigns offsets to them and stores them on disk. It also serves the comsumer, responds to partition requests and returns the committed messages on the drive. Depending on the specific hardware and its performance characteristics, a broker can handle thousands of partitions and millions of messages per second.
Kafka brokers are designed to act as part of a cluster . In a cluster of brokers will have a broker acting as a cluster controller (it is automatically elected from the members of that cluster). The controller is responsible for administrative activities, including assigning partitions to brokers and monitoring their failures. A partition that is owned by a single broker in the cluster, that broker is called the leader of that partition. A partition can be assigned to multiple brokers, which will result in the partition being copied. This allows another broker to become the leader if the current broker leader goes down.
The main feature of Apache Kafka is the ability to store messages for extended periods of time. Brokers are configured by default to retain messages in a topic for a period of time (eg 7 days) or until the topic reaches a certain size in bytes (eg 1 GB). When these limits are reached, the messages will expire and be deleted to store new messages. Individual topics can also be configured for separate archiving so that messages will be archived as long as they are useful. For example, a follow-up message may be retained for a few days, while less important messages will be deleted and replaced in just a few hours.
As Kafka grows, it is usually more effective to have multiple Clusters. We have a few reasons to understand why this works:
- Split data types
- Separated due to security requirements
- More datacenter (fix an error that causes the whole system to crash)
When working with multiple datacenter in particular, the message must often make copies between them. For example, if a user changes the public information in their profile, that change will need to be updated on any datacenter. However, the replication mechanisms in Kafka clusters are designed to work in a single cluster, not between multiple clusters.
Kafka has a tool called MirrorMaker , which is used for this purpose. At its core, MirrorMaker is simply a Kafka consumer and producer linked together by a queue so messages from one cluster are delivered to another. The figure below shows an example of an architecture that uses MirrorMaker, aggregates messages from two local clusters into an aggregated cluster, then copies that cluster to other datacenter.
There are many options for creating publish / subscribe messaging systems, so what makes Apache Kafka a good choice?
Kafka can handle multiple producers seamlessly, whether those clients are using multiple topics or the same topic. This makes for an ideal system for aggregating data from multiple frontends. For example, a website that provides content to users through so many microservice means that there will be many producers who shoot data collected from user activities so there will be many topics like that so many formats. However, Kafka can handle all services and can record and use a common format for all. The consumer application can receive a stream of data without having to receive it from multiple topics.
In addition to being able to add more producers, Kafka is designed for many consumers to read any stream of messages without affecting each other. This may be the opposite of queuing systeam, meaning that in that queue when a message is used by one consumer, it is not available to other consumers. Multiple Kafka consumers can choose to act as part of a group and share a stream, ensuring that the entire group processes the given message in one go.
Disk-Based Retention (Disk-Based Retention)
Kafka can not only handle multiple consumers, but can also store messages for a long time (durable) because consumers don’t always need to work in real time. Messages are committed to the drive, and will be archived with certain rules. These options can be selected on a per-topic basis, allowing different message flows to vary in retention count depending on consumer needs. Persistence means that if consumers lag behind, due to slow processing or disruption during transmission, there is no risk of data loss. It also means that maintenance can be performed on consumers, taking offline applications for a short period of time, regardless of the backup or producer messages. The consumer may be stopped and the messages will be retained at Kafka. This allows it to reboot and receive processing messages where they exit without losing data.
Kafka’s flexible scalability makes it easy to handle any amount of data. Users can start with a single broker as a proof of concept , expanding from a development from a small cluster of 3 brokers and then moving to a larger model with a large cluster with hundreds of thousands of brokers developing. Over time as the data increases. Scale can be done inside the cluster while online without affecting the availability of the entire system. This also means that a group of multiple brokers can handle the failure of an individual broker and continue to serve customers. Clusters experiencing more failures can be configured with higher replication factors.
All of these features work together to turn Apache Kafka into a publish / subscribe messaging system with great performance under high load. Producers, consumers and brokers can all scale to handle very large message flows with ease.
The Data Ecosystem (data ecosystem)
Apache Kafka provides computing systems for the data ecosystem. It carries messages between different components of the infrastructure, providing a consistent interface for all clients. When combined with a system to deliver messages, producers and consumers no longer require a close connection or direct connection of any kind. Components can be added and deleted in a variety of situations, and producers need not care who is using the data or how many applications are taking the data.
The most common use case for Kafka, as designed at LinkedIn, is to track user activity. A user on the site interacts with the frontend application, creating information regarding the actions the user is performing. This could be passive information, such as page views and click tracking, or it could be more complex actions, such as information that users add to their profile. Messages are publiched to one or more topics, and then the applications use them in the backend. These applications can create reports, learn ML systems, update search results or perform other activities needed to provide an enriching user experience.
Kafka is also used for messaging, where apps need to send notifications (like email) to users. Those applications can create messages regardless of the format or how the message will actually be sent. An application can then read all the messages sent and handle them consistently, including:
- Format the messages
- Collect multiple messages into one before sending
- Apply user preferences on how they want to receive messages
Use a single application for this to avoid the need to copy functionality in multiple applications, as well as to allow operations such as aggregation that cannot be performed.
Metrics and logging
Kafka is also ideal for data collection and logging. This is a use in the case that it is possible for multiple applications to create the same type of message. Public data applications are frequently used for the Kafka topic and these data can be used by systems to monitor and alert. They can also be used in an offline system like Hadoop to perform longer-term analysis, such as growth forecasts. Messages can be public in the same way, and can be delivered to specialized logging search systems like Elastisearch or security analysis applications.
Because Kafka is based on the concept of a commit log, database changes can be made public on Kafka and applications can easily follow this stream to receive live updates as they occur. This flow of changes can also be used to copy database updates into a remote system or to merge changes from multiple applications into one view. Long-term retention is useful here to provide a buffer for changes, meaning it can be played back in case the application retrieves corrupted data. Alternately, compressed topics can be used to provide longer retention
Oaaa introduction seems long and reading can improve sleep. But rest assured immediately following the installation instructions will be in the following article:
Link: tu bi bi niu