INF4101-Big Data et NoSQL
8. STREAM PROCESSING
Dr Mouhim Sanaa
INTRODUCTION TO STREAM PROCESSING
• Stream processing is a data management technique that involves ingesting a
continuous data stream to quickly analyze, filter, transform or enhance the
data in real time.
INTRODUCTION TO STREAM PROCESSING
Common stream processing use cases include:
• Fraud detection
• Detecting anomalous events
• Tuning business application features
• Managing location data
• Personalizing customer experience
• Stock market trading
• Analyzing and responding to IT infrastructure events
• Digital experience monitoring
• Customer journey mapping
• Predictive analytics
STREAM PROCESSING: KEY CONCEPTS
• At its core, stream processing embodies an event-driven architecture that processes
data as it arrives.
• Event
an event refers to an individual data unit generated by a real-time source and
transmitted in a continuous data stream.
Concrete examples of events:
• A user clicks on a button.
• A sensor sends a temperature reading.
• A message is received on a server.
• A transaction is made on an e-commerce site.
• A file is modified or downloaded.
Each event often contains:
• A value (e.g., temperature = 25°C),
• A timestamp,
• And sometimes a context (e.g., sensor #12, user ID 9876…).
Example event:
{
  "event_type": "order_created",
  "order_id": "ORD123",
  "customer_id": "CUST567",
  "timestamp": "2025-04-16T12:05:00Z"
}
STREAM PROCESSING: KEY CONCEPTS
• At its core, stream processing embodies an event-driven architecture that processes
data as it arrives.
• Streaming data, also known as real-time data, event data, or data-in-motion, refers to a continuous flow of information generated by various sources, such as sensors, applications, social media, or other digital platforms.
•Sensor data: Information from IoT devices, like temperature readings, GPS coordinates, or
health metrics.
•Social media updates: Posts, tweets, comments, and likes from platforms like Twitter,
Facebook, and Instagram.
•Financial transactions: Real-time stock prices, cryptocurrency exchanges, and credit card
transactions.
•Application logs: Records of user interactions, system events, and error messages.
•Web clickstreams: Data related to user interactions on websites, such as page views, clicks,
and session durations.
STREAM PROCESSING: KEY CONCEPTS
• Windowing in Stream Processing
Windowing is a technique used in stream processing to divide the infinite stream of
data into finite chunks or “windows”. These windows can be based on time
intervals, number of events, or custom triggers. It allows the system to perform
computations like counting, averaging, or joining on manageable subsets of the
stream.
Example:
Imagine a temperature sensor that sends a reading every second. If you want
to calculate the average temperature every 5 minutes, you would use a
tumbling window of 5 minutes.
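To make the idea concrete, here is a minimal sketch in plain Java (framework-independent; the Reading type and method names are illustrative) that assigns each reading to its 5-minute tumbling window and computes the average per window.

import java.util.*;

// Minimal tumbling-window sketch: each reading falls into exactly one
// 5-minute window, identified by the window's start timestamp.
public class TumblingWindowAverage {
    static final long WINDOW_MS = 5 * 60 * 1000L;   // 5-minute windows

    record Reading(long timestampMs, double temperature) {}

    // Returns the average temperature per window (key = window start time in ms)
    public static Map<Long, Double> averagePerWindow(List<Reading> readings) {
        Map<Long, double[]> acc = new HashMap<>();   // windowStart -> [sum, count]
        for (Reading r : readings) {
            long windowStart = (r.timestampMs / WINDOW_MS) * WINDOW_MS; // floor to window boundary
            double[] a = acc.computeIfAbsent(windowStart, k -> new double[2]);
            a[0] += r.temperature;
            a[1] += 1;
        }
        Map<Long, Double> averages = new HashMap<>();
        acc.forEach((windowStart, a) -> averages.put(windowStart, a[0] / a[1]));
        return averages;
    }
}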
STREAM PROCESSING: KEY CONCEPTS
• Windowing in Stream Processing
There are several types of windows:
• Tumbling window: Fixed, non-overlapping windows (e.g., every 5 minutes).
• Sliding window: Overlapping windows with a defined step (e.g., every 5 minutes with a 1-minute step).
• Session window: Dynamic windows based on periods of inactivity between events.
• Count-based window: Windows that group data by the number of events instead of time.
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
At its core, stream processing embodies an event-driven architecture that processes data
as it arrives.
Event-driven architecture
• It operates on the principle of responding to events as they occur, such as data
arriving from sensors, user interactions, or other sources.
• A stream processing pipeline refers to a series of stages or components that process data in real time as it flows through the system, typically from a continuous stream of events or data sources. It allows for the extraction, transformation, and analysis of data as it arrives, without having to store it first in a database or other persistent storage.
Streaming data pipeline
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Input sources
Data is read from various input sources like social media platforms and IoT sensors or
data ingestion platforms like Redpanda. These sources generate the events that serve
as a trigger for the system and are ingested into the event-driven system.
Components of an event-driven architecture
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Stream processor
The stream processor is the heart of the system. It efficiently handles continuous data
flow, reacting to events in real-time.
• Processing logic: This logic defines event-driven rules, enabling dynamic responses to incoming data.
• State management: The stream processor is also responsible for state management, which allows the architecture to keep track of its progress and maintain state information.
Components of an event-driven architecture
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Outputs
Once the data is processed, it is typically written to output streams or storage systems
like HDFS, Cassandra, HBase, or other databases. These output streams can be
consumed by downstream applications or used for analysis and reporting.
Components of an event-driven architecture
IT INFRASTRUCTURE FOR STREAMING DATA
Managing streaming data requires a robust IT infrastructure capable of handling data
ingestion, storage, processing, and scalability. Here are the key components of this
infrastructure:
• Data Ingestion: Data ingestion is the process of collecting streaming data from various
sources and making it available for processing. Popular tools for data ingestion include
Apache Kafka, Amazon Kinesis, and RabbitMQ.
• Data Storage: Most traditional relational databases are typically ill-suited for streaming data
due to their static nature. Instead, organizations have turned to specialized data stores that
can handle high volumes of real-time data such as NoSQL databases.
• Data processing: Apache Flink, Apache Storm, Apache Spark and Hazelcast Platform are
examples of popular data processing frameworks used to analyze streaming data in real-
time or near real-time.
• Scalability: The volume of streaming data can fluctuate significantly, and organizations need to be able to dynamically scale their systems to handle the load in a cost-effective and efficient way. Cloud-based solutions, such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure, offer elasticity and scalability features that allow organizations to scale resources up or down on demand.
KAFKA STREAMS
• The term "streaming" describes continuous, never-ending data streams
with no beginning or end, providing a constant feed of data that can be
utilized/acted upon without needing to be downloaded first.
• Kafka Streams is a client library for building applications and microservices,
where the input and output data are stored in Kafka clusters.
• You don’t need to ask for new records; you just receive them.
• Records are key-value pairs.
KAFKA STREAMS
• The Kafka Streams API supports per-record stream processing with millisecond latency.
• Write standard Java applications and microservices to process data in real time.
• No separate processing cluster required
KAFKA STREAMS
• Call the Kafka Streams API from your Java or Scala applications
• The Kafka Streams application interacts with a Kafka cluster.
• No separate processing cluster required
• The application does not run directly on Kafka Brokers
CONFIGURING THE KAFKA STREAMS APPLICATION
1. Configure your Kafka Streams application with a StreamsConfig instance
• Specify configuration parameters:
APPLICATION_ID_CONFIG
BOOTSTRAP_SERVERS_CONFIG
………
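A minimal configuration sketch, assuming a local broker at localhost:9092 and String keys/values (the application id is a placeholder):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // unique id of the application
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // Kafka broker(s) to connect to
// Default serializers/deserializers (SerDes) for record keys and values
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());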
CONFIGURING THE KAFKA STREAMS APPLICATION
2. Define a Topology
A topology refers to the logical structure or data processing pipeline that defines how the stream processing application processes, transforms, and routes the data from input to output. It is essentially a graph of stream processing components, where each node represents a transformation, aggregation, or operation on the data, and the edges represent the flow of data between the operations.
Diagram: a StreamsBuilder builds the Topology (Source → Transformations → Sink)
CONFIGURING THE KAFKA STREAMS APPLICATION
2. Define a Topology
StreamsBuilder builder = new StreamsBuilder();
// Consume data from the "input-topic" topic
KStream<String, String> sourceStream = builder.stream("input-topic");
// Apply a transformation: convert the messages to uppercase
KStream<String, String> transformedStream = sourceStream.mapValues(value -> value.toUpperCase());
// Write the results to another topic, "output-topic"
transformedStream.to("output-topic");
CONFIGURING THE KAFKA STREAMS APPLICATION
3. Define a KafkaStreams
KafkaStreams is the core engine of a Kafka Streams application. It is responsible for:
• Taking the topology you define using StreamsBuilder
• Using a set of configuration properties (like bootstrap servers, application ID,
SerDes, etc.)
• And executing the streaming application on data coming from Kafka topics.
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
CONFIGURING THE KAFKA STREAMS APPLICATION
3. Define a KafkaStreams
The main KafkaStreams functions:
• builder.build(): Builds the logical topology that KafkaStreams will execute.
• KafkaStreams(topology, props): Instantiates the application with the topology and configuration.
• start(): Starts the stream processing in the background.
• close(): Gracefully stops the Kafka Streams application.
• cleanUp(): Cleans up local state stores (used for testing or resetting state).

KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
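In practice these lifecycle methods are often combined with a JVM shutdown hook so the application closes cleanly; a minimal sketch (common practice, not from the slides):

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();   // optional: reset local state stores (useful during development)
streams.start();     // begin processing in background threads
// Ensure close() is called when the JVM shuts down, so state and offsets are committed cleanly
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));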
DATA MODELS FOR STREAM PROCESSING
In Kafka Streams, the KStream and KTable objects represent two fundamental data models
for stream processing. They allow you to express different logic depending on how you want
to interpret your data.
KStream – a stream of events
A KStream is a continuous sequence of records, like an event log arriving in real-time.
• Each message is independent.
• Used for stateless transformations (e.g., mapping, filtering, routing).
• Ideal for immutable data (e.g., transactions, clicks, logs).
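A minimal KStream sketch (topic names and record schema are assumptions): each click event is processed independently, with a stateless filter routing only checkout-page clicks to an output topic.

// Every record is an independent event: key = user id, value = page visited (assumed schema)
KStream<String, String> clicks = builder.stream("click-events");
clicks
    .filter((userId, page) -> page.startsWith("/checkout"))  // keep checkout pages only
    .to("checkout-clicks");                                   // write matching events downstream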
DATA MODELS FOR STREAM PROCESSING
In Kafka Streams, the KStream and KTable objects represent two fundamental data models
for stream processing. They allow you to express different logic depending on how you want
to interpret your data.
KTable – a state table
A KTable is a materialized view of the latest state for
each key, which updates over time.
• Represents the current value for each key.
• Each new record updates the previous value
(like a database update).
• Useful for joins and stateful aggregations
(e.g., inventory count, user status).
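To contrast the two models, here is a minimal sketch (topic names and value formats are assumptions) that enriches a stream of purchase events with the latest profile per customer held in a KTable:

// KStream: each purchase is an independent event, keyed by customer id (assumed)
KStream<String, String> purchases = builder.stream("purchases");

// KTable: a changelog view keeping only the latest profile per customer id
KTable<String, String> customers = builder.table("customer-profiles");

// Stream-table join: each purchase event is enriched with the customer's current profile
KStream<String, String> enriched = purchases.join(
        customers,
        (purchase, profile) -> purchase + " by " + profile);

enriched.to("enriched-purchases");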
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateless transformations
• Stateful transformations
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateless transformations
flatMap: Takes one record and produces zero, one, or more records. You can modify the record keys and values, including their types.
groupByKey: Groups the records by key.
map: Takes one record and produces one record. You can modify the record key and value, including their types.
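A minimal sketch of these stateless operations on a stream of sentences (topic name and schema are assumptions; imports from java.util and org.apache.kafka.streams.kstream are assumed):

KStream<String, String> sentences = builder.stream("sentences");

// map: one record in, one record out (lowercase the value, keep the key)
KStream<String, String> lowercased =
        sentences.map((key, value) -> KeyValue.pair(key, value.toLowerCase()));

// flatMap: one record in, zero or more records out (one record per word, re-keyed by the word)
KStream<String, String> words = lowercased.flatMap((key, value) -> {
    List<KeyValue<String, String>> out = new ArrayList<>();
    for (String word : value.split("\\s+")) {
        out.add(KeyValue.pair(word, word));
    }
    return out;
});

// groupByKey: group records sharing the same key (here, the word) for later aggregation
KGroupedStream<String, String> grouped = words.groupByKey();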
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateful transformations
aggregate: Aggregates the values of (non-windowed) records by the grouped key.
count: Counts the number of records by the grouped key.
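Continuing the word example above, a minimal sketch of the two stateful operations (the concatenation aggregator is purely illustrative):

// count: running count of records per grouped key, backed by a state store
KTable<String, Long> wordCounts = grouped.count();

// aggregate: the general form; here it simply concatenates all values seen per key
KTable<String, String> concatenated = grouped.aggregate(
        () -> "",                                             // initializer: starting value per key
        (key, value, aggregate) -> aggregate + value,         // aggregator: fold each new record into the state
        Materialized.with(Serdes.String(), Serdes.String())); // serdes for the state store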
TRANSFORMATION OPERATIONS
Example: splitting each line of text into individual words with flatMapValues:
textLines.flatMapValues(textLine -> Arrays.asList(textLine.split("\\s+")))