Apache Kafka is a distributed streaming platform used for building real-time data
pipelines and streaming applications. It follows a publish-subscribe messaging
pattern and is known for its scalability, reliability, and fault tolerance. Here’s
a detailed look at Kafka’s architecture:
---
### 1. **Core Components**
Kafka’s architecture includes the following core components:
#### a. **Broker**
- A **Kafka broker** is a server that stores data and serves client requests.
- Kafka is designed to be distributed, so a **cluster** consists of multiple
brokers.
- Each broker is identified by a unique **ID**.
- Brokers handle:
- **Message storage**: Persisting data on disk.
- **Message retrieval**: Serving producer and consumer requests.
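To make the broker's role concrete, here is a minimal sketch using the Java `AdminClient` to list the brokers in a cluster and their IDs. The bootstrap address `localhost:9092` is an illustrative assumption, not a requirement:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

import java.util.Collection;
import java.util.Properties;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed address of one reachable broker in the cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() returns metadata about every live broker.
            Collection<Node> nodes = admin.describeCluster().nodes().get();
            for (Node node : nodes) {
                System.out.printf("Broker id=%d host=%s port=%d%n",
                        node.id(), node.host(), node.port());
            }
        }
    }
}
```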
#### b. **Topic**
- A **topic** is a category or stream to which records are sent.
- **Producers** write data to topics, and **consumers** read from them.
- Topics are:
- **Partitioned** for scalability.
- **Replicated** for fault-tolerance.
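As a sketch of how topics are created programmatically, the Java `AdminClient` call below creates the `topic-A` topic used in the cluster example later in this document; the partition and replication counts are illustrative choices:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "topic-A" with 3 partitions, each replicated to 2 brokers.
            NewTopic topic = new NewTopic("topic-A", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```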
#### c. **Partition**
- Each topic is divided into one or more **partitions**.
- A **partition** is an ordered, **append-only** log of messages (stored on disk as a sequence of segment files).
- Messages in a partition have a sequential **offset**.
- Partitioning provides parallelism by spreading data across brokers.
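The following is a deliberately simplified model of key-based partition selection, not Kafka's actual implementation (the default partitioner hashes the serialized key with murmur2). It only illustrates the invariant that matters: the same key always maps to the same partition, which preserves per-key ordering.

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner, which really
    // uses a murmur2 hash of the serialized key bytes.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 3; // e.g., P0, P1, P2 as in the cluster example below
        System.out.println(partitionFor("user-42", partitions)); // some partition
        System.out.println(partitionFor("user-42", partitions)); // same partition
    }
}
```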
#### d. **Producer**
- Producers are clients that publish messages to Kafka topics.
- Producers:
- Choose the target partition (key-based hashing, or round-robin/sticky distribution when no key is set).
- Write data asynchronously for high throughput.
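A minimal Java producer sketch; the broker address, topic, keys, and values are illustrative assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the key determines the partition, so all events
            // for "user-42" land in the same partition, preserving their order.
            producer.send(new ProducerRecord<>("topic-A", "user-42", "clicked"));
            // Unkeyed record: the producer spreads these across partitions.
            producer.send(new ProducerRecord<>("topic-A", "anonymous page view"));
        } // close() flushes any batched, asynchronously sent records
    }
}
```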
#### e. **Consumer**
- Consumers are clients that read messages from Kafka topics.
- They use **consumer groups**:
- Within a group, each partition is read by exactly one consumer at a time.
- Multiple consumers in a group can process different partitions in parallel.
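A minimal consumer-group sketch in Java; the group id `analytics-group` and topic name are illustrative. Running a second copy of this program with the same `group.id` would split the topic's partitions between the two instances:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-A"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```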
#### f. **ZooKeeper / KRaft Quorum Controller**
- Historically, **ZooKeeper** managed cluster coordination (e.g., controller election and
metadata storage).
- Newer versions run in **KRaft** mode, where a built-in Raft-based **quorum controller**
eliminates the ZooKeeper dependency.
- This simplifies deployment and management and improves scalability.
#### g. **Replication**
- Kafka ensures data availability via **replication**.
- Each partition has one **leader** and multiple **followers**.
- The leader handles all writes and, by default, all reads.
- Followers replicate the leader’s data and take over if the leader fails.
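To see replication in action, the sketch below asks the cluster for each partition's leader, replica set, and in-sync replicas (ISR). It assumes a broker at `localhost:9092` and a kafka-clients version of 3.1 or later for `allTopicNames()`:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class ShowReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("topic-A"))
                    .allTopicNames().get().get("topic-A");
            for (TopicPartitionInfo p : desc.partitions()) {
                // isr() lists the replicas currently caught up with the leader;
                // one of these takes over if the leader fails.
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```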
---
### 2. **Key Features**
#### a. **Log-Based Storage**
- Kafka stores messages as logs.
- Each partition maintains an immutable sequence of messages.
#### b. **Offset Management**
- Each message in a partition is assigned a unique **offset**.
- Consumers keep track of their progress using these offsets.
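A sketch of committing offsets manually (i.e., with `enable.auto.commit` turned off); the helper name is hypothetical. Note that the committed value is the offset of the *next* record to read, hence the `+ 1`:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;

public class OffsetCommit {
    // Commit "everything up to and including this record" for one partition.
    static void commitThrough(KafkaConsumer<String, String> consumer,
                              String topic, int partition, long lastProcessedOffset) {
        consumer.commitSync(Map.of(
                new TopicPartition(topic, partition),
                new OffsetAndMetadata(lastProcessedOffset + 1)));
    }
}
```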
#### c. **Durability**
- Kafka persists data to disk, ensuring reliability.
- Configurable **retention policies** allow users to control how long messages are
stored.
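As a sketch, retention can be changed per topic at runtime through the `AdminClient`; the 7-day value here is an illustrative choice:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topic-A");
            // Keep messages for 7 days (retention.ms is in milliseconds).
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```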
#### d. **High Throughput**
- Kafka achieves high throughput by batching and compressing data.
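The producer settings that control batching and compression can be layered onto the configuration from the earlier producer sketch; the values below are illustrative starting points, not recommendations:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTuning {
    static Properties batchingProps() {
        Properties props = new Properties();
        // Wait up to 20 ms so more records accumulate into one batch.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");
        // Target batch size per partition, in bytes.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");
        // Compress whole batches; lz4 trades a little CPU for much less I/O.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return props;
    }
}
```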
#### e. **Scalability**
- Adding brokers and partitions enables horizontal scaling.
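A sketch of the partition side of scaling out (brokers are added at the infrastructure level); the target count is illustrative. Note the caveat in the comment about keyed ordering:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class ScaleOut {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow topic-A from 3 to 6 partitions. Existing key-to-partition
            // mappings change, so keyed ordering only holds from this point on.
            admin.createPartitions(Map.of("topic-A", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}
```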
---
### 3. **Data Flow in Kafka**
#### a. **Producer Workflow**
1. Producers send records to a topic.
2. Partitions are chosen based on:
- A hash of the record key, if one is set.
- Round-robin (or sticky) distribution otherwise.
3. The leader broker for the chosen partition appends the records to its log.
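To observe this workflow end to end, a send can take a callback that fires once the partition leader acknowledges the write. This sketch reuses a producer configured as in section 1d:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CallbackSend {
    // The callback reports exactly where the record landed: which
    // partition was chosen and the offset it was assigned.
    static void sendWithCallback(KafkaProducer<String, String> producer) {
        producer.send(new ProducerRecord<>("topic-A", "user-42", "clicked"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
    }
}
```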
#### b. **Broker Workflow**
1. Messages are stored in partitions on the broker.
2. The leader broker replicates data to follower brokers.
3. Metadata (e.g., topic configuration) is shared among brokers.
#### c. **Consumer Workflow**
1. Consumers poll the broker for new messages.
2. Each consumer in a group gets assigned specific partitions.
3. Consumers commit their offsets to track consumption progress.
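Step 2 can be observed directly by subscribing with a rebalance listener; this sketch assumes a consumer configured as in section 1e:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.List;

public class AssignmentLogger {
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("topic-A"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // The group coordinator has handed this consumer its share.
                System.out.println("assigned: " + partitions);
            }

            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // A good place to commit offsets before losing the partitions.
                System.out.println("revoked: " + partitions);
            }
        });
    }
}
```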
---
### 4. **Kafka Cluster Example**
```
[ Producer ] ----> [ Kafka Broker Cluster ] ----> [ Consumer Group ]
(Partitions spread across Brokers)
```
- **Producer** writes data to topic `topic-A`.
- Topic `topic-A` is divided into three partitions: `P0`, `P1`, `P2`.
- In a cluster:
- Broker 1 might handle `P0` (leader), replicate `P1` (follower).
- Broker 2 might handle `P1` (leader), replicate `P2` (follower).
- Broker 3 might handle `P2` (leader), replicate `P0` (follower).
---
### 5. **Advanced Features**
#### a. **Kafka Connect**
- Used to integrate Kafka with external systems (databases, object stores, search indexes, etc.) through reusable source and sink connectors.
#### b. **Kafka Streams**
- A client library for building stream-processing applications (filtering, transforming, joining, and aggregating topics) directly on top of Kafka.
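A minimal Kafka Streams sketch, assuming a local broker: it reads `topic-A`, upper-cases each value, and writes the result to a hypothetical `topic-A-upper` topic.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume topic-A, transform each value, write to an output topic.
        KStream<String, String> source = builder.stream("topic-A");
        source.mapValues(value -> value.toUpperCase()).to("topic-A-upper");

        new KafkaStreams(builder.build(), props).start();
    }
}
```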
#### c. **Schema Registry**
- Manages message schemas (e.g., Avro, JSON Schema, Protobuf) so producers and consumers can evolve compatibly; provided by the Confluent ecosystem rather than core Kafka itself.
#### d. **Security**
- Kafka supports:
- **Authentication** (SASL mechanisms such as PLAIN, SCRAM, and Kerberos/GSSAPI, or mutual TLS).
- **Encryption** in transit (SSL/TLS).
- **Authorization** (ACLs).
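As a sketch, a client enables these features purely through configuration; the mechanism, username, and password below are illustrative placeholders:

```java
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;

import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        // TLS-encrypted connections plus SASL authentication.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        // Placeholder credentials; supply real ones via your secret store.
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"alice\" password=\"alice-secret\";");
        return props;
    }
}
```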
---