
Stream Data Platform with Apache Kafka

Gautam Pal
• How often do you see this screen?
• What is your response when you see this screen?
• Are you curious to know what goes behind the scenes to get this screen?
2
Agenda

• Stream Data Platforms
• Role of Apache Kafka in Stream Data Platforms
• Characteristics of a Stream Data Platform
• Apache Kafka
– Kafka Cluster Components
– Kafka Topologies
– Kafka Scalability & Reliability
• Demo

3
Stream Data Platforms

• What is Stream Data?
– Stream data can be:
• Click streams (navigation of a product catalogue on an e-commerce site)
• Logs (database logs, middleware logs, application logs)
• System metrics (CPU utilization of a host in a data centre)
• Why is stream data important?
– Tracking click streams can be used to recommend relevant products to a customer on an e-commerce site
– Tracking logs from the various systems in the stack and correlating them can help root-cause issues in a system in real time
– Tracking system metrics can help monitor the health of a business-critical system

4
What is a Stream Data Platform used for?

• What is a Stream Data Platform?
– It's a central hub where all stream data for the organization or a line of business is available for consumption for various business needs
– Stream data is fed into the platform by business applications such as:
• Sales channels
• Inventory management systems
• CRM systems
• Business transaction systems
– Various applications are in turn fed by stream data from the platform, such as:
• Data warehouses
• Recently viewed products in e-commerce
• Monitoring systems
• Analytics applications

5
Other possible uses of a Stream Data Platform

• Replicate data from on-premise to cloud applications
– Many on-premise enterprise software applications are gradually moving to the cloud, and customers are in a transition phase where some functionality is still on-premise and some has migrated to the cloud.
• Reduce the complexity of data warehousing systems
– Move from "batch-oriented" to "real-time" ETL
• Keep production and pre-production/test environments in sync
– If pre-production or test environments are synced in near real time, it becomes easier to replay and reproduce production issues.

6
Typical Stream Data Platform in a Social Networking Site

[Diagram: source systems (Profile Manager, Media Sharing Manager, Instant Messenger, Activity Manager, System Monitoring) feed the Stream Data Platform, which in turn feeds consuming systems (Social Graph, Recently Viewed Products, Targeted Ads, Fraud Detection, Smart Infra Management).]

7
Kafka Real Time Integration

8
Characteristics of a Stream Data Platform

• Reliability: Critical data streams should be delivered without any loss
• High Throughput: Should be able to handle large volumes of log or event data streams
• Low Latency: Should be able to provide data with low latency to real-time applications
• Persistence: Should persist data for long durations so it can be consumed by batch systems such as Hadoop, which may only perform their loads and processing periodically
• Scalability: Should be able to scale horizontally as the stream data sources and sinks increase in number and size
• Integration Points: Should have easy integration points for stream processing frameworks like Spark, Flink, Storm, and Samza

9
Apache Kafka

• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
• It is used in place of traditional message brokers like RabbitMQ and IBM MQ because of its higher throughput, reliability, and replication capabilities.
• It was originally developed at LinkedIn.
• It was subsequently open-sourced in early 2011.

10
Components of Kafka Cluster

• Broker
– The actual Kafka server process
– A Kafka cluster can have one or more brokers
• Topic
– A feed name to which messages are published by producers
– A topic can have partitions to increase the degree of parallelism
• Producer
– An application which publishes data to a particular topic, and optionally to a particular partition within that topic
• Consumer
– An application which subscribes to a given topic and consumes the feed of published messages
• Zookeeper
– The coordination interface between the Kafka brokers and consumers

11
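
To make these components concrete, here is a minimal sketch (not from the original deck) of one producer and one consumer, assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic named 'clickstream':

  from confluent_kafka import Producer, Consumer

  # Producer: publishes messages to a topic hosted by the broker
  producer = Producer({'bootstrap.servers': 'localhost:9092'})
  producer.produce('clickstream', key='user-42', value='viewed product 123')
  producer.flush()  # wait for the broker to acknowledge the message

  # Consumer: subscribes to the topic and reads the published feed
  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'demo-group',
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['clickstream'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      print(msg.key(), msg.value())
  consumer.close()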
Kafka Cluster Topologies

• A Kafka cluster can be configured in the following topologies:
– Single Node Single Broker Cluster
– Single Node Multiple Brokers Cluster
– Multiple Nodes Multiple Brokers Cluster
• Here, "broker" refers to the Kafka server process

12
Single Node Single Broker Cluster

[Diagram: several producers publish to a single Kafka broker running on one node, coordinated by Zookeeper; several consumers read from that broker.]

13
Single Node Multiple Broker Cluster

[Diagram: producers publish to three Kafka brokers (Broker 1, 2, and 3) running on a single node, coordinated by Zookeeper; consumers read from each broker.]

14
Multiple Node Multiple Broker Cluster

[Diagram: producers publish to two Kafka nodes, each running two brokers (Broker 1 and Broker 2), all coordinated by Zookeeper; consumers read from the brokers on both nodes.]

15
Topic Partitions

• For each topic, the Kafka cluster maintains a partitioned log
• Each partition is an ordered, immutable sequence of messages that is continually appended to, like a commit log
• The messages in a partition are each assigned a sequential id number called the offset that uniquely identifies each message within the partition
• The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time

Image Source: kafka.apache.org

16
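
As an illustration of partitioning (not from the original deck), here is a sketch that creates a topic with three partitions, assuming the confluent-kafka Python client and a broker on localhost:9092; the topic name 'clickstream' is hypothetical:

  from confluent_kafka.admin import AdminClient, NewTopic

  admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
  # Three partitions let up to three consumers in one group read in
  # parallel; messages with the same key always land in the same
  # partition, so per-key ordering is preserved.
  result = admin.create_topics([NewTopic('clickstream',
                                         num_partitions=3,
                                         replication_factor=1)])
  result['clickstream'].result()  # block until created; raises on failure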
Partition and Consumer Groups

Each message is consumed by only one consumer within each consumer group. If a message has to be consumed by multiple consumers, then the consumers have to be part of different consumer groups.
17
Consumer Groups & Offset Management

• If multiple consumers that are part of the same consumer group are hooked to the same topic, then a message is delivered to only one consumer within the group
• If a message has to be consumed by multiple consumers, then the consumers have to be part of different consumer groups
• Since the same message can be consumed by multiple consumer groups, each at its own pace, it is the consumer's responsibility to keep track of what has been consumed so far
• Each consumer maintains an offset, which is the position in the log up to which messages have been consumed
– In earlier versions of Kafka, the offsets were maintained in Zookeeper
– From v0.8.2 onwards, the offsets are maintained in an internal topic created by Kafka (__consumer_offsets)

18
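
A sketch of offset management (not from the original deck), assuming the confluent-kafka Python client and a broker on localhost:9092; the topic, group id, and handle() function are hypothetical:

  from confluent_kafka import Consumer

  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'analytics',          # offsets are tracked per group.id
      'enable.auto.commit': False,      # commit explicitly after processing
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['clickstream'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      handle(msg)  # hypothetical application logic
      # record the position in the internal __consumer_offsets topic
      consumer.commit(message=msg, asynchronous=False)
  consumer.close()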
Partition Replication

• Kafka uses partition replication to ensure resilience.

[Diagram: a topic with two partitions replicated across two brokers on one node. Kafka Broker 1 holds Partition 1 (leader) and Partition 2 (in-sync replica, ISR); Kafka Broker 2 holds Partition 1 (ISR) and Partition 2 (leader).]

19
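
Replication is configured per topic (its replication factor); on the producer side, the acks setting controls how many replicas must have a write before it is acknowledged. A sketch (not from the original deck), assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic:

  from confluent_kafka import Producer

  # acks='all': the leader waits until all in-sync replicas (ISR) have
  # the message before acknowledging, trading latency for durability
  producer = Producer({'bootstrap.servers': 'localhost:9092',
                       'acks': 'all'})

  def on_delivery(err, msg):
      if err is not None:
          print('delivery failed:', err)
      else:
          print('stored in partition', msg.partition(),
                'at offset', msg.offset())

  producer.produce('clickstream', value='order placed',
                   callback=on_delivery)
  producer.flush()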
Review Questions

1. Describe some common application areas where Apache Kafka is used. Provide examples of industries or use cases where Kafka has been successfully applied.
2. Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.
3. How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.
4. Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?

20
Describe some common application areas where Apache Kafka is used. Provide examples of industries or use cases where Kafka has been successfully applied.

1. Real-time Data Processing and Analytics
• Industry: E-commerce, Telecommunications
• Use Cases:
• Real-time recommendation systems and recently viewed products for e-commerce platforms.
• Network monitoring and anomaly detection in telecommunications.

2. Fraud Detection and Real-time Monitoring
• Industry: Banking, Insurance, E-commerce
• Use Cases:
• Real-time fraud detection in banking transactions and credit card payments.
• Monitoring and alerting for security breaches and cyberattacks.

3. IT Infrastructure Monitoring
• Industry: Technology
• Use Case: Real-time server and network monitoring of cloud or on-premise devices to ensure optimal performance, detect anomalies, and enhance infrastructure security.

4. IoT Data Ingestion and Processing
• Industry: Smart Cities, Agriculture, Energy
• Use Cases:
• Smart city initiatives for collecting and analyzing data from sensors for traffic management, waste management, and environmental monitoring.

5. Complex Event Processing (CEP)
• Industry: Healthcare
• Use Cases:
• Real-time patient monitoring and alerting in healthcare systems.

21
Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

1. Producers
• Applications responsible for publishing data records (messages) to Kafka topics.
• Create messages and send them to Kafka brokers.

2. Brokers
• Core components of a Kafka cluster.
• Responsible for storing and serving data records.
• Manage partitions, data replication, and message persistence.
• Receive messages from producers, store them on disk, and serve them to consumers.

3. Topics
• Logical channels or categories representing streams of data in Kafka.
• Primary mechanism for organizing and categorizing data records.
• Divided into one or more partitions for scalability and parallel processing.
• Messages are published to specific topics by producers and consumed by consumers.

4. Partitions
• Individual segments of a topic's data stream.
• Enable parallel processing and scalability by distributing data across multiple brokers.
• Ordered and immutable sequences of messages.
• Kafka guarantees strict message ordering within each partition based on message offset.
22
Contd…Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

5. Consumers
• Applications responsible for reading data records from Kafka topics.
• Subscribe to one or more topics and consume messages produced by producers.
• Each message is consumed by only one consumer within each consumer group.
• Consumers within a consumer group enable parallel processing and load balancing.
• Multiple consumer groups allow different processing logic to be applied to the same data and provide fault tolerance.

6. ZooKeeper
• Centralized coordination service used by Kafka for cluster management.
• Manages metadata about Kafka brokers, topics, partitions, and consumer groups.
• Handles tasks such as leader election and configuration management.
23
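
To show the consumer-group behavior described above (a sketch, not from the original deck; assumes the confluent-kafka Python client, a broker on localhost:9092, and hypothetical topic and group names):

  from confluent_kafka import Consumer

  common = {'bootstrap.servers': 'localhost:9092',
            'auto.offset.reset': 'earliest'}

  # Two consumers with the same group.id split the topic's partitions:
  # each message is delivered to only one of them (load balancing)
  worker_a = Consumer({**common, 'group.id': 'billing'})
  worker_b = Consumer({**common, 'group.id': 'billing'})

  # A consumer with a different group.id sees every message again, so a
  # second application can apply its own logic to the same data
  auditor = Consumer({**common, 'group.id': 'audit'})

  for c in (worker_a, worker_b, auditor):
      c.subscribe(['transactions'])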
Contd…Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

[Diagram, same as slide 15: producers publish to two Kafka nodes, each running two brokers, coordinated by Zookeeper; consumers read from the brokers on both nodes.]

24
How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.

Kafka ensures fault tolerance and high throughput through the following mechanisms: partitions, replication, and leader/follower replicas. Here's how each concept contributes to Kafka's reliability:
1. Partitions:
• Kafka divides each topic into multiple partitions.
• Partitions enable horizontal scalability and parallel processing by allowing messages to be distributed across multiple brokers.
• By distributing data across partitions, Kafka can handle large volumes of data and spread the workload across multiple servers.
• Partitions enable Kafka to achieve higher throughput by allowing consumers to process messages in parallel across different partitions.
2. Replication:
• Kafka replicates partitions across multiple brokers for fault tolerance and data durability.
• Each partition has one leader replica and zero or more follower replicas.
• The leader replica is responsible for handling read and write requests for the partition, while follower replicas replicate data from the leader.
• Replication ensures that data is not lost in case of broker failures. If a broker hosting a partition fails, one of the follower replicas can be promoted to become the new leader, ensuring continuous availability of data.
• Replication also provides durability guarantees by ensuring that data is replicated across multiple brokers. Even if one or more brokers fail, data remains available and can be accessed from the remaining replicas.

25
Contd…How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.

[Diagram, same as slide 19: Kafka Broker 1 holds Partition 1 (leader) and Partition 2 (ISR); Kafka Broker 2 holds Partition 1 (ISR) and Partition 2 (leader).]

26
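
The leader and ISR assignment shown above can be inspected from cluster metadata. A sketch (not from the original deck), assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic:

  from confluent_kafka.admin import AdminClient

  admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
  metadata = admin.list_topics(topic='clickstream', timeout=10)
  for pid, p in metadata.topics['clickstream'].partitions.items():
      # leader: broker id serving reads/writes for this partition
      # replicas: broker ids holding a copy
      # isrs: replicas currently caught up with the leader
      print(f'partition {pid}: leader={p.leader} '
            f'replicas={p.replicas} isrs={p.isrs}')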
Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?
1. At Most Once:
• In at most once semantics, Kafka ensures that messages are delivered to consumers at most once. This means that there may be cases where messages are lost or not processed by consumers.
• In this mode, receipt of a message is acknowledged (the offset is committed) as soon as it is received, without waiting for confirmation that it has been successfully processed; a failure after the commit but before processing means the message is lost.
• At most once semantics prioritize low latency and eliminate the risk of message duplication.
• Use cases: These semantics are suitable for scenarios where message loss is acceptable or can be handled gracefully, such as real-time monitoring or telemetry applications where occasional loss is tolerable.
2. At Least Once:
• In at least once semantics, Kafka ensures that messages are delivered to consumers at least once. This means that messages may be duplicated, but they are not lost.
• Receipt of a message is acknowledged (the offset is committed) only after it has been successfully processed by the consumer. If a consumer fails before committing a message it has processed, the message will be redelivered, so it is eventually processed at least once.
• At least once semantics prioritize data integrity and ensure that all messages are consumed, even if there is a risk of duplication.
• Use cases: These semantics are suitable for scenarios where message duplication can be handled gracefully or where data completeness is critical, for example in financial transactions or data analytics applications where it is important to process every message, even if some duplicates are encountered.

27
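
The difference between the two modes comes down to when the consumer commits its offset relative to processing. A sketch (not from the original deck), assuming the confluent-kafka Python client and a broker on localhost:9092; the topic, group id, and handle() function are hypothetical:

  from confluent_kafka import Consumer

  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'telemetry',
      'enable.auto.commit': False,
  })
  consumer.subscribe(['metrics'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      # At most once: commit BEFORE processing; a crash after the commit
      # but before handle() loses the message, but never duplicates it.
      consumer.commit(message=msg, asynchronous=False)
      handle(msg)  # hypothetical application logic
      # At least once inverts the order: call handle(msg) first and
      # commit afterwards; a crash before the commit causes redelivery.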
Contd…Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?

3. Exactly Once:
• In exactly once semantics, Kafka ensures that each message is delivered to consumers exactly once. This means that messages are neither lost nor duplicated.
• Achieving exactly once semantics requires coordination between producers and consumers, as well as idempotent processing and transactional guarantees.
• Kafka provides transactional capabilities that allow producers to write messages and consumers to consume messages within atomic transactions, ensuring that messages are either fully processed and committed or not processed at all.
• Exactly once semantics prioritize both data integrity and consistency, guaranteeing that each message is processed exactly once. This mode has the highest latency.
• Use cases: These semantics are suitable for scenarios where data consistency and integrity are paramount, such as financial transactions, database replication, or complex data processing pipelines where duplicates or omissions are not acceptable.

28
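
A sketch of the transactional producer API that underpins exactly once delivery (not from the original deck; assumes the confluent-kafka Python client, a broker on localhost:9092, and hypothetical topic names and transactional id):

  from confluent_kafka import Producer

  # Setting a transactional.id enables idempotent, transactional writes
  producer = Producer({'bootstrap.servers': 'localhost:9092',
                       'transactional.id': 'payments-app-1'})
  producer.init_transactions()

  producer.begin_transaction()
  try:
      producer.produce('payments', value='debit account A')
      producer.produce('ledger', value='credit account B')
      producer.commit_transaction()  # both messages become visible atomically
  except Exception:
      # neither message is ever seen by consumers that read with
      # isolation.level=read_committed
      producer.abort_transaction()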
