
Stream Data Platform with Apache Kafka

Gautam Pal
• How often do you see this screen?
• What is your response when you see this screen?
• Are you curious to know what goes behind the scenes to get this screen?
2
Agenda

• Stream Data Platforms
• Role of Apache Kafka in Stream Data Platforms
• Characteristics of a Stream Data Platform
• Apache Kafka
– Kafka Cluster Components
– Kafka Topologies
– Kafka Scalability & Reliability
• Demo

3
Stream Data Platforms

• What is Stream Data?
– Stream data can be:
• Click streams (navigation of a product catalogue on an e-commerce site)
• Logs (database logs, middleware logs, application logs)
• System metrics (CPU utilization of a host in a data centre)
• Why is stream data important?
– Tracking click streams can be used to recommend relevant products to a customer on an e-commerce site
– Tracking logs from the various systems in the stack and correlating them can help root-cause issues in a system in real time
– Tracking system metrics can help monitor the health of a business-critical system

4
What is a Stream Data Platform used for?

• What is a Stream Data Platform?
– It's a central hub where all stream data for the organization or a line of business is available for consumption for various business needs
– Stream data is fed into the platform by business applications such as:
• Sales channels
• Inventory management systems
• CRM systems
• Business transaction systems
– Various applications are in turn fed by stream data from the platform, such as:
• Data warehouses
• Recently viewed products in e-commerce
• Monitoring systems
• Analytics applications

5
Other possible uses of a Stream Data Platform

• Replicate data from on-premise to cloud applications
– Many on-premise enterprise software applications are gradually moving to the cloud, and customers are in a transition phase where some functionality is still on-premise and some has migrated to the cloud.
• Reduce the complexity of data warehousing systems
– Move from "batch-oriented" to "real-time" ETL
• Keep production and pre-production/test environments in sync
– If pre-production or test environments are synced in near real time, it becomes easier to replay and reproduce production issues.

6
Typical Stream Data Platform in a Social Networking Site

[Diagram: source systems (Profile Manager, Media Sharing Manager, Instant Messenger, Activity Manager, System Monitoring) feed the Stream Data Platform, which in turn feeds consuming systems (Social Graph, Recently Viewed Products, Targeted Ads, Fraud Detection, Smart Infra Management).]

7
Kafka Real Time Integration

8
Characteristics of a Stream Data Platform

• Reliability: Critical data streams should be delivered without any loss
• High Throughput: Should be able to handle large volumes of log or event data streams
• Low Latency: Should be able to provide data with low latency to real-time applications
• Persistence: Should persist data for long durations so it can be consumed by batch systems such as Hadoop, which may only perform their loads and processing periodically
• Scalability: Should be able to scale horizontally as the stream data sources and sinks increase in number and size
• Integration Points: Should have easy integration points for stream processing frameworks like Spark, Flink, Storm, and Samza

9
Apache Kafka

• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
• It is used in place of traditional message brokers like RabbitMQ and IBM MQ because of its higher throughput, reliability, and replication capabilities.
• It was originally developed at LinkedIn.
• It was subsequently open-sourced in early 2011.

10
Components of Kafka Cluster

• Broker
– The actual Kafka server process
– A Kafka cluster can have one or more brokers
• Topic
– A feed name to which messages are published by producers
– A topic can have partitions to increase the degree of parallelism
• Producer
– An application which publishes data to a particular topic, and optionally to a particular partition within that topic
• Consumer
– An application which subscribes to a given topic and consumes the feed of published messages
• Zookeeper
– The coordination interface between the Kafka brokers and consumers

11
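
To make these components concrete, here is a minimal sketch (not from the original deck) of one producer and one consumer, assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic named 'clickstream':

  from confluent_kafka import Producer, Consumer

  # Producer: publishes messages to a topic hosted by the broker
  producer = Producer({'bootstrap.servers': 'localhost:9092'})
  producer.produce('clickstream', key='user-42', value='viewed product 123')
  producer.flush()  # wait for the broker to acknowledge the message

  # Consumer: subscribes to the topic and reads the published feed
  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'demo-group',
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['clickstream'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      print(msg.key(), msg.value())
  consumer.close()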
Kafka Cluster Topologies

• A Kafka cluster can be configured in the following topologies:
– Single Node Single Broker Cluster
– Single Node Multiple Brokers Cluster
– Multiple Nodes Multiple Brokers Cluster
• Here, "broker" refers to the Kafka server process

12
Single Node Single Broker Cluster

[Diagram: several producers publish to a single Kafka broker running on one node, coordinated by Zookeeper; several consumers read from that broker.]

13
Single Node Multiple Broker Cluster

[Diagram: producers publish to three Kafka brokers (Broker 1, 2, and 3) running on a single node, coordinated by Zookeeper; consumers read from each broker.]

14
Multiple Node Multiple Broker Cluster

[Diagram: producers publish to two Kafka nodes, each running two brokers (Broker 1 and Broker 2), all coordinated by Zookeeper; consumers read from the brokers on both nodes.]

15
Topic Partitions

• For each topic, the Kafka cluster maintains a partitioned log
• Each partition is an ordered, immutable sequence of messages that is continually appended to, like a commit log
• The messages in a partition are each assigned a sequential id number called the offset that uniquely identifies each message within the partition
• The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time

Image Source: kafka.apache.org

16
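
As an illustration of partitioning (not from the original deck), here is a sketch that creates a topic with three partitions, assuming the confluent-kafka Python client and a broker on localhost:9092; the topic name 'clickstream' is hypothetical:

  from confluent_kafka.admin import AdminClient, NewTopic

  admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
  # Three partitions let up to three consumers in one group read in
  # parallel; messages with the same key always land in the same
  # partition, so per-key ordering is preserved.
  result = admin.create_topics([NewTopic('clickstream',
                                         num_partitions=3,
                                         replication_factor=1)])
  result['clickstream'].result()  # block until created; raises on failure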
Partition and Consumer Groups

Each message is consumed by only one consumer within each consumer group. If a message has to be consumed by multiple consumers, then the consumers have to be part of different consumer groups.
17
Consumer Groups & Offset Management

• If multiple consumers that are part of the same consumer group are hooked to the same topic, then a message is delivered to only one consumer within the group
• If a message has to be consumed by multiple consumers, then the consumers have to be part of different consumer groups
• Since the same message can be consumed by multiple consumer groups, each at its own pace, it is the consumer's responsibility to keep track of what has been consumed so far
• Each consumer maintains an offset, which is the position in the log up to which messages have been consumed
– In earlier versions of Kafka, the offsets were maintained in Zookeeper
– From v0.8.2 onwards, the offsets are maintained in an internal topic created by Kafka (__consumer_offsets)

18
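
A sketch of offset management (not from the original deck), assuming the confluent-kafka Python client and a broker on localhost:9092; the topic, group id, and handle() function are hypothetical:

  from confluent_kafka import Consumer

  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'analytics',          # offsets are tracked per group.id
      'enable.auto.commit': False,      # commit explicitly after processing
      'auto.offset.reset': 'earliest',
  })
  consumer.subscribe(['clickstream'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      handle(msg)  # hypothetical application logic
      # record the position in the internal __consumer_offsets topic
      consumer.commit(message=msg, asynchronous=False)
  consumer.close()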
Partition Replication

• Kafka uses partition replication to ensure resilience.

[Diagram: a topic with two partitions replicated across two brokers on one node. Kafka Broker 1 holds Partition 1 (leader) and Partition 2 (in-sync replica, ISR); Kafka Broker 2 holds Partition 1 (ISR) and Partition 2 (leader).]

19
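
Replication is configured per topic (its replication factor); on the producer side, the acks setting controls how many replicas must have a write before it is acknowledged. A sketch (not from the original deck), assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic:

  from confluent_kafka import Producer

  # acks='all': the leader waits until all in-sync replicas (ISR) have
  # the message before acknowledging, trading latency for durability
  producer = Producer({'bootstrap.servers': 'localhost:9092',
                       'acks': 'all'})

  def on_delivery(err, msg):
      if err is not None:
          print('delivery failed:', err)
      else:
          print('stored in partition', msg.partition(),
                'at offset', msg.offset())

  producer.produce('clickstream', value='order placed',
                   callback=on_delivery)
  producer.flush()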
Review Questions

1. Describe some common application areas where Apache Kafka is used. Provide examples of industries or use cases where Kafka has been successfully applied.
2. Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.
3. How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.
4. Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?

20
Describe some common application areas where Apache Kafka is used. Provide examples of industries or use cases where Kafka has been successfully applied.

1. Real-time Data Processing and Analytics
• Industry: E-commerce, Telecommunications
• Use Cases:
• Real-time recommendation systems and recently viewed products for e-commerce platforms.
• Network monitoring and anomaly detection in telecommunications.

2. Fraud Detection and Real-time Monitoring
• Industry: Banking, Insurance, E-commerce
• Use Cases:
• Real-time fraud detection in banking transactions and credit card payments.
• Monitoring and alerting for security breaches and cyberattacks.

3. IT Infrastructure Monitoring
• Industry: Technology
• Use Case: Real-time server and network monitoring of cloud or on-premise devices to ensure optimal performance, detect anomalies, and enhance infrastructure security.

4. IoT Data Ingestion and Processing
• Industry: Smart Cities, Agriculture, Energy
• Use Cases:
• Smart city initiatives for collecting and analyzing data from sensors for traffic management, waste management, and environmental monitoring.

5. Complex Event Processing (CEP)
• Industry: Healthcare
• Use Cases:
• Real-time patient monitoring and alerting in healthcare systems.

21
Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

1. Producers
• Applications responsible for publishing data records (messages) to Kafka topics.
• Create messages and send them to Kafka brokers.

2. Brokers
• Core components of a Kafka cluster.
• Responsible for storing and serving data records.
• Manage partitions, data replication, and message persistence.
• Receive messages from producers, store them on disk, and serve them to consumers.

3. Topics
• Logical channels or categories representing streams of data in Kafka.
• Primary mechanism for organizing and categorizing data records.
• Divided into one or more partitions for scalability and parallel processing.
• Messages are published to specific topics by producers and consumed by consumers.

4. Partitions
• Individual segments of a topic's data stream.
• Enable parallel processing and scalability by distributing data across multiple brokers.
• Ordered and immutable sequences of messages.
• Kafka guarantees strict message ordering within each partition based on message offset.
22
Contd…Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

5. Consumers
• Applications responsible for reading data records from Kafka topics.
• Subscribe to one or more topics and consume messages produced by producers.
• Each message is consumed by only one consumer within each consumer group.
• Consumers within a consumer group enable parallel processing and load balancing.
• Multiple consumer groups allow different processing logic to be applied to the same data and provide fault tolerance.

6. ZooKeeper
• Centralized coordination service used by Kafka for cluster management.
• Manages metadata about Kafka brokers, topics, partitions, and consumer groups.
• Handles tasks such as leader election and configuration management.
23
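
To show the consumer-group behavior described above (a sketch, not from the original deck; assumes the confluent-kafka Python client, a broker on localhost:9092, and hypothetical topic and group names):

  from confluent_kafka import Consumer

  common = {'bootstrap.servers': 'localhost:9092',
            'auto.offset.reset': 'earliest'}

  # Two consumers with the same group.id split the topic's partitions:
  # each message is delivered to only one of them (load balancing)
  worker_a = Consumer({**common, 'group.id': 'billing'})
  worker_b = Consumer({**common, 'group.id': 'billing'})

  # A consumer with a different group.id sees every message again, so a
  # second application can apply its own logic to the same data
  auditor = Consumer({**common, 'group.id': 'audit'})

  for c in (worker_a, worker_b, auditor):
      c.subscribe(['transactions'])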
Contd…Describe the high-level multi-node, multi-broker architecture of Apache Kafka. Include the main components and their roles in the Kafka ecosystem.

[Diagram, same as slide 15: producers publish to two Kafka nodes, each running two brokers, coordinated by Zookeeper; consumers read from the brokers on both nodes.]

24
How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.

Kafka ensures fault tolerance and high throughput through the following mechanisms: partitions, replication, and leader/follower replicas. Here's how each concept contributes to Kafka's reliability:
1. Partitions:
• Kafka divides each topic into multiple partitions.
• Partitions enable horizontal scalability and parallel processing by allowing messages to be distributed across multiple brokers.
• By distributing data across partitions, Kafka can handle large volumes of data and spread the workload across multiple servers.
• Partitions enable Kafka to achieve higher throughput by allowing consumers to process messages in parallel across different partitions.
2. Replication:
• Kafka replicates partitions across multiple brokers for fault tolerance and data durability.
• Each partition has one leader replica and zero or more follower replicas.
• The leader replica is responsible for handling read and write requests for the partition, while follower replicas replicate data from the leader.
• Replication ensures that data is not lost in case of broker failures. If a broker hosting a partition fails, one of the follower replicas can be promoted to become the new leader, ensuring continuous availability of data.
• Replication also provides durability guarantees by ensuring that data is replicated across multiple brokers. Even if one or more brokers fail, data remains available and can be accessed from the remaining replicas.

25
Contd…How does Kafka ensure fault tolerance and high throughput? Discuss the concepts of partitions, replication, and leader/follower replicas.

[Diagram, same as slide 19: Kafka Broker 1 holds Partition 1 (leader) and Partition 2 (ISR); Kafka Broker 2 holds Partition 1 (ISR) and Partition 2 (leader).]

26
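
The leader and ISR assignment shown above can be inspected from cluster metadata. A sketch (not from the original deck), assuming the confluent-kafka Python client, a broker on localhost:9092, and a hypothetical topic:

  from confluent_kafka.admin import AdminClient

  admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
  metadata = admin.list_topics(topic='clickstream', timeout=10)
  for pid, p in metadata.topics['clickstream'].partitions.items():
      # leader: broker id serving reads/writes for this partition
      # replicas: broker ids holding a copy
      # isrs: replicas currently caught up with the leader
      print(f'partition {pid}: leader={p.leader} '
            f'replicas={p.replicas} isrs={p.isrs}')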
Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?
1. At Most Once:
• In at most once semantics, Kafka ensures that messages are delivered to consumers at most once. This means that there may be cases where messages are lost or not processed by consumers.
• In this mode, receipt of a message is acknowledged (the offset is committed) as soon as it is received, without waiting for confirmation that it has been successfully processed; a failure after the commit but before processing means the message is lost.
• At most once semantics prioritize low latency and eliminate the risk of message duplication.
• Use cases: These semantics are suitable for scenarios where message loss is acceptable or can be handled gracefully, such as real-time monitoring or telemetry applications where occasional loss is tolerable.
2. At Least Once:
• In at least once semantics, Kafka ensures that messages are delivered to consumers at least once. This means that messages may be duplicated, but they are not lost.
• Receipt of a message is acknowledged (the offset is committed) only after it has been successfully processed by the consumer. If a consumer fails before committing a message it has processed, the message will be redelivered, so it is eventually processed at least once.
• At least once semantics prioritize data integrity and ensure that all messages are consumed, even if there is a risk of duplication.
• Use cases: These semantics are suitable for scenarios where message duplication can be handled gracefully or where data completeness is critical, for example in financial transactions or data analytics applications where it is important to process every message, even if some duplicates are encountered.

27
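
The difference between the two modes comes down to when the consumer commits its offset relative to processing. A sketch (not from the original deck), assuming the confluent-kafka Python client and a broker on localhost:9092; the topic, group id, and handle() function are hypothetical:

  from confluent_kafka import Consumer

  consumer = Consumer({
      'bootstrap.servers': 'localhost:9092',
      'group.id': 'telemetry',
      'enable.auto.commit': False,
  })
  consumer.subscribe(['metrics'])
  msg = consumer.poll(timeout=10.0)
  if msg is not None and msg.error() is None:
      # At most once: commit BEFORE processing; a crash after the commit
      # but before handle() loses the message, but never duplicates it.
      consumer.commit(message=msg, asynchronous=False)
      handle(msg)  # hypothetical application logic
      # At least once inverts the order: call handle(msg) first and
      # commit afterwards; a crash before the commit causes redelivery.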
Contd…Explain the differences between Kafka's at most once, at least once, and exactly once message delivery semantics. When would you use each of these guarantees?

3. Exactly Once:
• In exactly once semantics, Kafka ensures that each message is delivered to consumers exactly once. This means that messages are neither lost nor duplicated.
• Achieving exactly once semantics requires coordination between producers and consumers, as well as idempotent processing and transactional guarantees.
• Kafka provides transactional capabilities that allow producers to write messages and consumers to consume messages within atomic transactions, ensuring that messages are either fully processed and committed or not processed at all.
• Exactly once semantics prioritize both data integrity and consistency, guaranteeing that each message is processed exactly once. This mode has the highest latency.
• Use cases: These semantics are suitable for scenarios where data consistency and integrity are paramount, such as financial transactions, database replication, or complex data processing pipelines where duplicates or omissions are not acceptable.

28
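
A sketch of the transactional producer API that underpins exactly once delivery (not from the original deck; assumes the confluent-kafka Python client, a broker on localhost:9092, and hypothetical topic names and transactional id):

  from confluent_kafka import Producer

  # Setting a transactional.id enables idempotent, transactional writes
  producer = Producer({'bootstrap.servers': 'localhost:9092',
                       'transactional.id': 'payments-app-1'})
  producer.init_transactions()

  producer.begin_transaction()
  try:
      producer.produce('payments', value='debit account A')
      producer.produce('ledger', value='credit account B')
      producer.commit_transaction()  # both messages become visible atomically
  except Exception:
      # neither message is ever seen by consumers that read with
      # isolation.level=read_committed
      producer.abort_transaction()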
