Himani Arora & Prabhat Kashyap
Software Consultant
@_himaniarora @pk_official
Who are we?
Himani Arora
@_himaniarora
Software Consultant @ Knoldus Software LLP
Contributed to Apache Kafka, Jupyter, Apache CarbonData, Lightbend Lagom, etc.
Currently learning Apache Kafka

Prabhat Kashyap
@pk_official
Software Consultant @ Knoldus Software LLP
Contributed to Apache Kafka, Apache CarbonData, and Lightbend Templates
Currently learning Apache Kafka
Agenda
What is Stream processing
Paradigms of programming
Stream Processing with Kafka
What are Kafka Streams
Inside Kafka Streams
Demonstration of stream processing using Kafka Streams
Overview of Kafka Connect
Demo with Kafka Connect
What is stream processing?
Real-time processing of data
Does not treat data as static tables or files
Data has to be processed fast, so that a firm can
react to changing business conditions in real time.
This is required for trading, fraud detection, system
monitoring, and many other examples.
An architecture that delivers results too late cannot realize these use cases.
BIG DATA VERSUS FAST DATA
3 PARADIGMS OF PROGRAMMING
REQUEST/RESPONSE
BATCH SYSTEMS
STREAM PROCESSING
STREAM PROCESSING WITH KAFKA
2 APPROACHES:
DO IT YOURSELF (DIY!) STREAM PROCESSING
STREAM PROCESSING FRAMEWORK
DIY STREAM PROCESSING
Major Challenges:
FAULT TOLERANCE
PARTITIONING AND SCALABILITY
TIME
STATE
REPROCESSING
STREAM PROCESSING FRAMEWORK
Some of the many stream processing frameworks already available are:
SPARK
STORM
SAMZA
FLINK, ETC.
KAFKA STREAMS: ANOTHER WAY OF STREAM PROCESSING
Let's start with Kafka Streams... but wait, what is KAFKA?
Hello Apache Kafka
Apache Kafka is an open-source project under the Apache License 2.0.
Apache Kafka was originally developed by LinkedIn.
On 23 October 2012, Apache Kafka graduated from the Apache Incubator to a top-level project.
Components of Apache Kafka
Producer
Consumer
Broker
Topic
Data
Parallelism
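As a quick sketch of how these pieces fit together (the broker address and topic name here are assumptions for illustration), a minimal Java producer publishes one keyed record to a topic; consumers in a consumer group then read it back from the brokers in parallel:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HelloProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The record key determines which partition of the topic the
                // record is routed to, which is the basis of Kafka's parallelism.
                producer.send(new ProducerRecord<>("greetings", "alice", "Hello Apache Kafka"));
            }
        }
    }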
Enterprises that use Kafka
What is Kafka Streams
It is the Streams API of Apache Kafka, available as a Java library.
Kafka Streams is built on top of functionality provided by Kafka.
It is, by deliberate design, tightly integrated with Apache Kafka.
It can be used to build highly scalable, elastic, fault-tolerant, distributed applications and microservices.
The Kafka Streams API allows you to create real-time applications.
It is the easiest, yet the most powerful, technology to process data stored in Kafka.
If we look closer
A key motivation of the Kafka Streams API is to bring stream
processing out of the Big Data niche into the world of
mainstream application development.
Using the Kafka Streams API you can implement standard Java
applications to solve your stream processing needs.
Your applications are fully elastic: you can run one or more
instances of your application.
This lightweight and integrative approach of the Kafka Streams API means you build applications, not infrastructure!
Deployment-wise, you are free to choose any technology that can deploy Java applications.
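To make "standard Java application" concrete, here is a minimal sketch of a complete Kafka Streams app (written against the Kafka Streams 1.0+ API; the broker address and topic names are assumptions) that uppercases values from one topic into another:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            // The application id names the app's consumer group and local state.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic"); // hypothetical topics
            input.mapValues(value -> value.toUpperCase()).to("output-topic");

            // Elasticity in practice: starting more instances of this same
            // application spreads the input partitions across them.
            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }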
Capabilities of Kafka Streams
Powerful
Makes your applications highly scalable, elastic,
distributed, fault-tolerant.
Stateful and stateless processing
Event-time processing with windowing, joins,
aggregations
Lightweight
Low barrier to entry
No processing cluster required
No external dependencies other than Apache Kafka
Capabilities of Kafka Streams
Real-time
Millisecond processing latency
Record-at-a-time processing (no micro-batching)
Seamlessly handles late-arriving and out-of-order data
High throughput
Fully integrated
100% compatible with Apache Kafka 0.10.2 and 0.10.1
Easy to integrate into existing applications and microservices
Runs everywhere: on-premises, public clouds, private clouds,
containers, etc.
Integrates with databases through continuous change data
capture (CDC) performed by Kafka Connect
Key concepts of Kafka Streams
Stateful Stream Processing
KStream
KTable
Time
Aggregations
Joins
Windowing
Key concepts of Kafka Streams
Stateful Stream Processing
Some stream processing applications don't require state: they are stateless.
In practice, however, most applications require state: they are stateful.
The state must be managed in a fault-tolerant
manner.
An application is stateful whenever, for example, it needs to join, aggregate, or window its input data.
Key concepts of Kafka Streams
KStream
A KStream is an abstraction of a record stream.
Each data record represents a self-contained datum in
the unbounded data set.
Using the table analogy, data records in a record stream are always interpreted as an INSERT.
Let's imagine the following two data records are being sent to the stream:
("alice", 1) --> ("alice", 3)
Key concepts of Kafka Streams
KTable
A KTable is an abstraction of a changelog stream.
Each data record represents an update.
Using the table analogy, data records in a changelog stream are always interpreted as an UPDATE of the last value for the same key.
Let's imagine the following two data records are being sent to the stream:
("alice", 1) --> ("alice", 3)
Key concepts of Kafka Streams
Time
A critical aspect in stream processing is the notion of time.
Kafka Streams supports the following notions of time:
Event Time
Processing Time
Ingestion Time
Kafka Streams assigns a timestamp to every data
record via so-called timestamp extractors.
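For instance, switching an application from the default event time to processing time is a one-line configuration change (a sketch; this config key is from Kafka Streams 1.0+, and older releases named it differently):

    // The default extractor reads event time from the timestamp embedded in
    // each record; WallclockTimestampExtractor assigns processing time instead.
    props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
              org.apache.kafka.streams.processor.WallclockTimestampExtractor.class);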
Key concepts of Kafka Streams
Aggregations
An aggregation operation takes one input stream or
table, and yields a new table.
It is done by combining multiple input records into a
single output record.
In the Kafka Streams DSL, an input stream of an
aggregation operation can be a KStream or a KTable,
but the output stream will always be a KTable.
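A sketch of this stream-in, table-out shape, counting records per key (a fragment continuing the earlier builder setup, with hypothetical topic names):

    KStream<String, String> words = builder.stream("words-topic");

    // groupByKey() prepares the stream for a per-key aggregation; count()
    // combines all records sharing a key into one continuously updated record.
    KTable<String, Long> counts = words.groupByKey().count();

    // The resulting table's change stream can be written back to a topic.
    counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));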
Key concepts of Kafka Streams
Joins
A join operation merges two input streams and/or
tables based on the keys of their data records, and
yields a new stream/table.
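A common pattern is enriching a KStream of events with a KTable of reference data, matched on the record key (again a fragment with hypothetical topic names):

    KStream<String, String> pageViews = builder.stream("page-views");
    KTable<String, String> userProfiles = builder.table("user-profiles");

    // For each view event, look up the viewer's profile by key and combine
    // the two values into one enriched output record.
    KStream<String, String> enriched =
            pageViews.join(userProfiles, (view, profile) -> view + ", user=" + profile);
    enriched.to("enriched-page-views");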
Key concepts of Kafka Streams
Windowing
Windowing lets you control how records that have the same key are grouped into so-called windows for stateful operations such as aggregations or joins.
Windows are tracked per record key.
When working with windows, you can specify a
retention period for the window.
This retention period controls how long Kafka Streams
will wait for out-of-order or late-arriving data records
for a given window.
If a record arrives after the retention period of a
window has passed, the record is discarded and will not
be processed in that window.
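A sketch counting clicks per user in tumbling five-minute windows (Kafka Streams 2.x API; how the retention period is set differs across versions, on the window itself or on its state store):

    KStream<String, String> clicks = builder.stream("user-clicks");

    // Windows are tracked per record key: each user gets one count per
    // five-minute window, keyed by user and window.
    KTable<Windowed<String>, Long> clicksPerWindow = clicks
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            .count();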
Inside Kafka Streams
Processor Topology
Stream Partitions and Tasks
Each stream partition is a totally ordered sequence of data
records and maps to a Kafka topic partition.
A data record in the stream maps to a Kafka message from that
topic.
The keys of data records determine the partitioning of data in
both Kafka and Kafka Streams, i.e., how data is routed to
specific partitions within topics.
Threading Model
Kafka Streams allows the user to configure the number of
threads that the library can use to parallelize processing within
an application instance.
Each thread can execute one or more stream tasks with their
processor topologies independently.
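The thread count itself is plain configuration, for example (a sketch using the standard config key):

    // Run four stream threads in this instance; Kafka Streams assigns each
    // thread one or more tasks, each task owning a group of input partitions.
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);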
State
Kafka Streams provides so-called state stores.
State can be used by stream processing applications to store
and query data, which is an important capability when
implementing stateful operations.
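Aggregations such as the count shown earlier are backed by such a store, and the store can be queried from inside the running application (a sketch; the store name is hypothetical, and the KafkaStreams#store call shown is the pre-2.5 signature):

    // Materialize the count into a named, queryable state store:
    KTable<String, Long> counts = words.groupByKey()
            .count(Materialized.as("counts-store"));

    // Later, query the store directly from the running application:
    ReadOnlyKeyValueStore<String, Long> store =
            streams.store("counts-store", QueryableStoreTypes.keyValueStore());
    Long aliceCount = store.get("alice");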
Backpressure
Kafka Streams does not use a backpressure mechanism
because it does not need one.
It uses a depth-first processing strategy.
Each record consumed from Kafka will go through the whole
processor (sub-)topology for processing and for (possibly) being
written back to Kafka before the next record will be processed.
No records are being buffered in-memory between two
connected stream processors.
Kafka Streams leverages Kafka's consumer client behind the scenes.
DEMO
Kafka Streams
HOW TO GET DATA IN AND OUT OF KAFKA?
KAFKA CONNECT
Kafka Connect
So-called Sources import data into Kafka, and Sinks export data
from Kafka.
An implementation of a Source or Sink is a Connector, and users deploy connectors to enable data flows on Kafka.
All Kafka Connect sources and sinks map to partitioned streams
of records.
This is a generalization of Kafka's concept of topic partitions: a stream refers to the complete set of records that are split into independent, infinite sequences of records.
Configuring connectors
Connector configurations are key-value mappings.
For standalone mode these are defined in a properties file
and passed to the Connect process on the command line.
In distributed mode, they will be included in the JSON
payload sent over the REST API for the request that
creates the connector.
Configuring connectors
A few settings are common to all connectors:
name - Unique name for the connector. Attempting to register
again with the same name will fail.
connector.class - The Java class for the connector
tasks.max - The maximum number of tasks that should be
created for this connector. The connector may create fewer
tasks if it cannot achieve this level of parallelism.
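For example, the file source connector that ships with Kafka can be configured for standalone mode with a properties file like this (the path and topic are illustrative):

    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=/tmp/test.txt
    topic=connect-test

In distributed mode, the same keys travel as the "config" object of the JSON payload POSTed to the REST API.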
DEMO
Kafka Connect
REFERENCES
https://www.slideshare.net/ConfluentInc/demystifying-stream-processing-with-apache-kafka-69228952
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
http://docs.confluent.io/3.2.0/streams/index.html
http://docs.confluent.io/3.2.0/connect/index.html
Any Questions?
Thank You