In today’s fast-paced world, software applications must be able to deal with large amounts of data in real time. Data producers must have a way to transfer new information to all applications or people that need to consume it. And this transfer must happen reliably, concurrently, and quickly.
Apache Kafka lets you achieve all this and more. It’s a fault-tolerant, distributed, highly scalable platform for real-time data ingestion, processing, and streaming. In this article, we will take a deep dive into Kafka’s architecture. We’ll cover its core components, explain how it processes messages, and share some of its use cases.
Apache Kafka is an event streaming platform that can be used to store high-volume data, build fault-tolerant data pipelines, and scale low-latency application clusters. The platform can also reliably transmit multi-source data to consumer groups, and perform real-time data analytics.
Kafka was originally developed at LinkedIn and open-sourced in 2011 to solve scalability and performance issues. The goal was to build a distributed message publishing platform that would allow LinkedIn to move from a monolithic architecture to one driven by microservices.
Today, Kafka powers the infrastructures of more than 80% of all Fortune 100 companies. From finance to social media, IoT to healthcare, and insurance to telecom, Kafka is used across all sorts of industries.
Apache Kafka is a platform for capturing, storing, processing, and streaming event-driven data. Events can originate from any source, such as a database, API, cloud service, mobile device, IoT sensor, conversational interface, or software application.
These events must either be stored for future retrieval and analysis, processed in real time, or routed to their intended destinations. Kafka is an all-in-one platform that delivers all these functionalities.
A typical event has a key, a value, a timestamp, and optional headers to store metadata. For example:
Key: “Peer2PeerTransfer”
Value: “User Alice made a payment of $200 to Bob”
Timestamp: “November 12, 2020 at 1400”
Metadata: “…”
A Kafka cluster has multiple components that contribute to its overall reliability, fault tolerance, scalability, and high availability. In the following sections, we will explore all these components.
In a typical Kafka-based infrastructure, data producers publish events, and those events are delivered to the applications and services that have subscribed to them. Events are organized, stored, and streamed as topics. A topic is a mechanism to map an event stream to an event category.
For example, you may create a topic named transactions to stream all your financial events. You may have another topic, registrations, used to transmit events related to new registrations. Kafka topics are multi-subscriber and multi-producer, which means that a topic can have zero or more publishers that push events to it, and zero or more consumers that read those events.
Topics are spread over different partitions that exist across different Kafka servers. This data distribution allows for concurrent writing and reading of events, which adds to the system’s scalability and reliability. You can specify the number of partitions for a topic while creating it.
When a producer pushes a new event to a topic, it’s written to one of the topic’s partitions. Within each partition, Kafka ensures that the subscribers/readers of the topic always receive the events in the order they were written.
Kafka brokers are standalone servers that make up the storage layer. They store event data directly on the file system. A new subdirectory is created for each topic-partition. Producers connect to a broker to write events, whereas consumers connect to it to read events. Each broker is uniquely identifiable by an integer ID. If there is more than one broker in a cluster (which is usually the case), a broker may contain only some of the data related to a specific topic.
Brokers play a significant role in the cluster’s scalability, fault tolerance, and availability. By adding more brokers, we can create more partitions per topic, which increases the cluster’s ability to handle concurrent connections. If a broker fails, other brokers take over its work, and ensure that the cluster’s operations are not disrupted.
One of the brokers acts as the controller of the cluster. The controller is responsible for maintaining the state of other brokers and reassigning work when an existing broker fails or a new one joins the cluster.
Producers are external applications that connect to brokers and push events to topic-partitions. Examples include a database, a cloud service, or an application using a Kafka client library. Client libraries are available for almost all programming languages and frameworks, including C++, Java, Python, and Node.js.
Typically, producers distribute events across different partitions so that no single broker is overburdened and performance doesn’t degrade. However, there are situations where a producer may write to a specific partition: for example, you may want to store and send related events in the exact order they are generated.
In the Kafka world, producers are completely decoupled from consumers. Producers don’t need to wait for consumers to process events or send acknowledgments. This design characteristic lets Kafka be as scalable as it is.
Kafka consumers are applications that subscribe to topics published by producers. An example is a reporting service that receives events for new transactions and records them in a database. A consumer can read from more than one topic at once.
Kafka ensures that the events read from a partition are in the same order as they were sent. However, it can’t guarantee order if a consumer reads from multiple partitions simultaneously. By default, consumers only process data that was produced after they connected to the cluster. However, it’s also possible to request historical data.
To increase the horizontal scalability and performance of the cluster, different consumers of the same application can be grouped together to form a consumer group. A consumer group can be configured to distribute the message consumption load among its members.
Kafka implements a “pull-based” consumption mechanism instead of a “push-based” one. In a push-based implementation, event messages are pushed to the consumers as soon as the publisher generates them. A push-based approach is less flexible, as it assumes that all consumers have the same consumption capabilities.
Conversely, in a pull-based approach, consumers must explicitly request data from the broker to receive it. This allows different consumers to process events at their own pace, without the broker or publisher dictating the transfer rate. Kafka also gives consumers the flexibility to choose between real-time and batch processing of events.
In the following sections, we will explore how messaging works in the Kafka world.
A typical Kafka event message has the following fields:
Key: an optional identifier that determines which partition the message is written to
Value: the event payload
Headers: optional key-value pairs for metadata
Timestamp: the time the event was created or ingested
Topic: the topic the message belongs to
Partition: the partition the message is written to
The first step to publishing event messages in Kafka is creating a topic. You can either use the official command line tool to create a topic or do it programmatically. For example, for Java applications, the NewTopic class can be used to create a new topic with the desired replication factor and number of partitions.
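As a rough illustration, here is a minimal Java sketch that creates a topic programmatically using Kafka’s AdminClient and the NewTopic class. The broker address, topic name, partition count, and replication factor are placeholder values.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

Properties config = new Properties();
config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

try (AdminClient admin = AdminClient.create(config)) {
    // "transactions" topic with 3 partitions and a replication factor of 1
    NewTopic topic = new NewTopic("transactions", 3, (short) 1);
    admin.createTopics(Collections.singletonList(topic)).all().get();
}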
The second step is to create a new publisher. To do that in Java, you can use the KafkaProducer class. Kafka offers several parameters to configure a producer, including compression type, retries, and available memory.
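For example, a producer could be configured along these lines; the compression, retry, and buffer values shown are illustrative, not recommendations.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");  // compression type
props.put(ProducerConfig.RETRIES_CONFIG, 3);                // retries
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);  // available buffer memory in bytes

KafkaProducer<String, String> producer = new KafkaProducer<>(props);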
The third step is to send the data. Kafka’s Java client has a class ProducerRecord, which lets us create a message object. You can use this class to set the key, value, headers, timestamp, topic name, and partition number for the message.
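Continuing the sketch above, a record for the transactions topic could be created and sent like this; the key and value mirror the event example from earlier and are purely illustrative.

import org.apache.kafka.clients.producer.ProducerRecord;

// Build a message for the "transactions" topic; Kafka derives the partition from the key
ProducerRecord<String, String> record = new ProducerRecord<>(
    "transactions", "Peer2PeerTransfer", "User Alice made a payment of $200 to Bob");

producer.send(record); // asynchronous send
producer.flush();      // block until buffered records have been transmitted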
The fourth step is to create a consumer. The KafkaConsumer class can be used for this purpose. The consumer can also be configured using parameters like heartbeat interval, session timeout, and max amount of fetched data.
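A consumer might be configured as follows; the group name and the timeout and fetch values are placeholders for illustration.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "reporting-service");        // placeholder consumer group
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);          // heartbeat interval
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);            // session timeout
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);   // max data fetched per partition

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);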
The fifth step is to subscribe to the topic we created in the first step. We can use the subscribe() function exposed by the KafkaConsumer class to do so. Then we can start polling for new events using KafkaConsumer’s poll function. Kafka will now start relaying any published data to us.
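Putting the last two steps together, a bare-bones consumption loop might look like this (again, a sketch rather than production code):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import java.time.Duration;
import java.util.Collections;

consumer.subscribe(Collections.singletonList("transactions"));

while (true) {
    // Poll for new events; the timeout controls how long to wait when no data is available
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                record.key(), record.value(), record.partition(), record.offset());
    }
}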
A key is an optional field in an event message. If it’s not specified, Kafka uses the round-robin routing technique to evenly distribute messages across different partitions. If a key is specified, all messages with the same key are sent to and stored in the same partition. Publishers can use any arbitrary value as a key.
Kafka uses message key hashing to determine the appropriate partition for a message. Keys are hashed using the murmur2 algorithm, as follows:
Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
Using keys is recommended in situations where the message order must be preserved.
Example: Suppose we are building an order-tracking system. We want the real-time data feed to be processed on a per-order basis. For such a use case, we can use the order_id field as the key for messages. This way, data related to individual orders will always be chronologically organized in separate partitions.
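To illustrate, a producer in this hypothetical order-tracking system could key every message by its order ID; the orders topic and the values below are invented for the example and reuse the producer sketched earlier.

// Both events share the key "order-1042", so they land in the same partition
producer.send(new ProducerRecord<>("orders", "order-1042", "ORDER_CREATED"));
producer.send(new ProducerRecord<>("orders", "order-1042", "ORDER_SHIPPED"));
// A consumer of that partition is therefore guaranteed to see ORDER_CREATED before ORDER_SHIPPED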
Serialization is the process of converting a data structure into a format that’s easily transmittable, like the binary format. Deserialization is the inverse of serialization: it converts binary data back into a data structure.
Kafka expects the key and value fields of a message to be serialized into binary data before transmission. For example, suppose you have set “1234” as the key for your message. Before publishing the message, you will have to serialize the key, i.e., convert “1234” into its binary representation, “10011010010”. Similarly, when the consumer receives the message from the broker, it will have to deserialize the key, i.e., convert “10011010010” back into “1234”.
Most Kafka client libraries have helper classes to serialize and deserialize different data structures. For example, the Java library has classes for serializing/deserializing integers, strings, longs, doubles, and UUIDs.
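For instance, the Java client’s IntegerSerializer and IntegerDeserializer can round-trip the key from the earlier example; the topic name passed to these helpers is only used as context and is illustrative here.

import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.IntegerSerializer;

// Serialize the integer key 1234 into its binary form, then convert it back
byte[] keyBytes = new IntegerSerializer().serialize("transactions", 1234);
Integer key = new IntegerDeserializer().deserialize("transactions", keyBytes); // 1234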
A consumer must periodically commit the last message it has processed so that Kafka knows how far the consumer has read into a topic-partition. This position is known as the consumer offset. Kafka uses an internal topic named __consumer_offsets to track these offsets for different consumer groups.
Most Kafka client libraries have a built-in mechanism to regularly commit consumer offsets, so developers don’t have to write additional code. A designated broker (the group coordinator) writes the committed offsets to the __consumer_offsets topic for tracking. Consumer offsets are a great way to build a resilient system: if a client fails and another one enters the cluster, it can use the committed offset to know where to resume reading.
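If automatic commits are disabled (by setting enable.auto.commit to false when creating the consumer), offsets can also be committed manually after processing, as in this sketch that reuses the consumer from the earlier steps:

// Process a batch of records, then commit the consumer offset explicitly
ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : batch) {
    // ... process the record, e.g. write it to a database ...
}
consumer.commitSync(); // the group coordinator stores the offset in __consumer_offsets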
Zookeeper is an open-source service for maintaining configuration and monitoring the state of distributed applications. In the Kafka world, Zookeeper is used for the following purposes:
Controller election: choosing the broker that acts as the cluster controller
Cluster membership: tracking which brokers join, leave, or fail
Topic configuration: storing metadata such as the topics that exist, their partition counts, and replica placement
Access control lists (ACLs): storing permissions that govern who can read from and write to topics
Quotas: tracking how much data each client is allowed to produce and consume
Before Kafka v2.8.0 was released in 2021, it wasn’t possible to use Kafka without Zookeeper. In v2.8.0, the Kafka team removed the hard dependency on Zookeeper by making its responsibilities part of the Kafka core. The new mode was named Kafka Raft Metadata mode, or KRaft. However, many Zookeeper-backed features, like ACLs and other security controls, were missing from the initial KRaft release. Therefore, its use wasn’t recommended in a production setup.
In subsequent versions, the Kafka team continued to replicate the remaining Zookeeper features inside the Kafka broker. As of v3.3.1, KRaft has been marked as production-ready, which means it’s now safe to use Apache Kafka without Zookeeper.
In this section, we’ll look at some of Kafka’s most prominent use cases.
Microservices are loosely coupled applications that rely on inter-process communication for data exchange. Kafka can be used to build a fault-tolerant microservices architecture where data producers can transmit information to consumer applications with minimal to no latency.
The traditional methods of log aggregation involve collecting files from different servers and putting them in a central location. Using Kafka, you can build a more robust log aggregation system that relies on message streams instead of files. This is not only a faster and less error-prone approach, but also makes it easier to organize multi-source data (using topics and partitions).
An event-driven application executes its business logic in response to events. Such applications are performant, scalable, and resource-efficient. You can define different topics in Kafka to drive the execution of your event-driven applications. For example, suppose your application records financial transactions across different databases. Creating a separate topic for each database will allow you to quickly identify where to record an event.
Kafka is an excellent tool for tracking user activity on a website, such as searches, page visits, and average time spent. You can create different topics for different activity types and aggregate data for processing. Insights generated from data analysis can be used to optimize user experience and conversion.
If you have a large, distributed infrastructure, it’s a good idea to collect, aggregate, and track metrics from different applications and environments. Set up different topics in Kafka for different applications or for monitoring use cases. You can also plug in a dashboard application that reads metrics from Kafka in real time.
Kafka is a great choice for building systems that rely on multi-stage pipeline processing. For example, an application may extract data from multiple databases and publish it to a raw_data topic. Another application may subscribe to the raw_data topic, pull data from it, transform the data based on the configured business logic, and push it to the semi_processed_data topic. Another round of processing may follow, in which the data gets moved to the fully_processed_data topic.
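A sketch of the middle stage of such a pipeline, reusing the producer and consumer objects configured earlier, might look like this; the transformation is a placeholder for real business logic.

consumer.subscribe(Collections.singletonList("raw_data"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        String transformed = record.value().toUpperCase(); // placeholder transformation
        producer.send(new ProducerRecord<>("semi_processed_data", record.key(), transformed));
    }
}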
Whether you want to capture and analyze IoT sensor data in real time, simultaneously process thousands of financial transactions, or distribute source-agnostic data across cloud platforms, Kafka is the way to go. In this article, we aimed to help you understand Kafka’s architecture, how its messaging works, its diverse use cases, and how it relates to Zookeeper. We hope it has served its purpose.