INF4101-Big Data et NoSQL
8. STREAM PROCESSING
Dr Mouhim Sanaa
INTRODUCTION TO STREAM PROCESSING
• Stream processing is a data management technique that involves ingesting a
continuous data stream to quickly analyze, filter, transform or enhance the
data in real time.
INTRODUCTION TO STREAM PROCESSING
Common stream processing use cases include:
• Fraud detection
• Detecting anomalous events
• Tuning business application features
• Managing location data
• Personalizing customer experience
• Stock market trading
• Analyzing and responding to IT infrastructure events
• Digital experience monitoring
• Customer journey mapping
• Predictive analytics
STREAM PROCESSING: KEY CONCEPTS
• At its core, stream processing embodies an event-driven architecture that processes
data as it arrives.
• Event
an event refers to an individual data unit generated by a real-time source and
transmitted in a continuous data stream.
Concrete examples of events:
• A user clicks on a button.
• A sensor sends a temperature reading.
• A message is received on a server.
• A transaction is made on an e-commerce site.
• A file is modified or downloaded.
Each event often contains:
• A value (e.g., temperature = 25°C),
• A timestamp,
• And sometimes a context (e.g., sensor #12, user ID 9876…).
Example event:
{
  "event_type": "order_created",
  "order_id": "ORD123",
  "customer_id": "CUST567",
  "timestamp": "2025-04-16T12:05:00Z"
}
STREAM PROCESSING: KEY CONCEPTS
• At its core, stream processing embodies an event-driven architecture that processes
data as it arrives.
• Streaming data, also known as real-time data, event data, or data-in-motion, refers to a continuous flow of information generated by various sources, such as sensors, applications, social media, or other digital platforms.
•Sensor data: Information from IoT devices, like temperature readings, GPS coordinates, or
health metrics.
•Social media updates: Posts, tweets, comments, and likes from platforms like Twitter,
Facebook, and Instagram.
•Financial transactions: Real-time stock prices, cryptocurrency exchanges, and credit card
transactions.
•Application logs: Records of user interactions, system events, and error messages.
•Web clickstreams: Data related to user interactions on websites, such as page views, clicks,
and session durations.
STREAM PROCESSING: KEY CONCEPTS
• Windowing in Stream Processing
Windowing is a technique used in stream processing to divide the infinite stream of
data into finite chunks or “windows”. These windows can be based on time
intervals, number of events, or custom triggers. It allows the system to perform
computations like counting, averaging, or joining on manageable subsets of the
stream.
Example:
Imagine a temperature sensor that sends a reading every second. If you want
to calculate the average temperature every 5 minutes, you would use a
tumbling window of 5 minutes.
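To make the idea concrete, here is a minimal sketch in plain Java (framework-independent; the Reading type and method names are illustrative) that assigns each reading to its 5-minute tumbling window and computes the average per window.

import java.util.*;

// Minimal tumbling-window sketch: each reading falls into exactly one
// 5-minute window, identified by the window's start timestamp.
public class TumblingWindowAverage {
    static final long WINDOW_MS = 5 * 60 * 1000L;   // 5-minute windows

    record Reading(long timestampMs, double temperature) {}

    // Returns the average temperature per window (key = window start time in ms)
    public static Map<Long, Double> averagePerWindow(List<Reading> readings) {
        Map<Long, double[]> acc = new HashMap<>();   // windowStart -> [sum, count]
        for (Reading r : readings) {
            long windowStart = (r.timestampMs / WINDOW_MS) * WINDOW_MS; // floor to window boundary
            double[] a = acc.computeIfAbsent(windowStart, k -> new double[2]);
            a[0] += r.temperature;
            a[1] += 1;
        }
        Map<Long, Double> averages = new HashMap<>();
        acc.forEach((windowStart, a) -> averages.put(windowStart, a[0] / a[1]));
        return averages;
    }
}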
STREAM PROCESSING: KEY CONCEPTS
• Windowing in Stream Processing
There are several types of windows:
• Tumbling window: Fixed, non-overlapping windows (e.g., every 5 minutes).
• Sliding window: Overlapping windows with a defined step (e.g., every 5 minutes with a 1-minute step).
• Session window: Dynamic windows based on periods of inactivity between events.
• Count-based window: Windows that group data by the number of events instead of time.
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
At its core, stream processing embodies an event-driven architecture that processes data
as it arrives.
Event-driven architecture
• It operates on the principle of responding to events as they occur, such as data
arriving from sensors, user interactions, or other sources.
• A stream processing pipeline refers to a series of stages or components that process data in real time as it flows through the system, typically from a continuous stream of events or data sources. It allows for the extraction, transformation, and analysis of data as it arrives, without having to store it first in a database or other persistent storage.
Streaming data pipeline
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Input sources
Data is read from various input sources like social media platforms and IoT sensors or
data ingestion platforms like Redpanda. These sources generate the events that serve
as a trigger for the system and are ingested into the event-driven system.
Components of an event-driven architecture
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Stream processor
The stream processor is the heart of the system. It efficiently handles continuous data
flow, reacting to events in real-time.
• Processing logic: This logic defines event-driven rules, enabling dynamic responses to incoming data.
• State management: The stream processor is also responsible for state management, which allows the architecture to keep track of its progress and maintain state information.
Components of an event-driven architecture
GENERAL ARCHITECTURE OF A STREAM PROCESSING
SYSTEM
Event-driven architecture
Outputs
Once the data is processed, it is typically written to output streams or storage systems
like HDFS, Cassandra, HBase, or other databases. These output streams can be
consumed by downstream applications or used for analysis and reporting.
Components of an event-driven architecture
IT INFRASTRUCTURE FOR STREAMING DATA
Managing streaming data requires a robust IT infrastructure capable of handling data
ingestion, storage, processing, and scalability. Here are the key components of this
infrastructure:
• Data Ingestion: Data ingestion is the process of collecting streaming data from various
sources and making it available for processing. Popular tools for data ingestion include
Apache Kafka, Amazon Kinesis, and RabbitMQ.
• Data Storage: Most traditional relational databases are typically ill-suited for streaming data
due to their static nature. Instead, organizations have turned to specialized data stores that
can handle high volumes of real-time data such as NoSQL databases.
• Data processing: Apache Flink, Apache Storm, Apache Spark and Hazelcast Platform are
examples of popular data processing frameworks used to analyze streaming data in real-
time or near real-time.
• Scalability: The volume of streaming data can fluctuate significantly, and organizations need to be able to dynamically scale their systems to handle the load in a cost-effective and efficient way. Cloud-based solutions, such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure, offer elasticity and scalability features that allow organizations to scale resources up or down on demand.
KAFKA STREAMS
• The term "streaming" describes continuous, never-ending data streams
with no beginning or end, providing a constant feed of data that can be
utilized/acted upon without needing to be downloaded first.
• Kafka Streams is a client library for building applications and microservices,
where the input and output data are stored in Kafka clusters.
• You don’t need to ask for new records; you just receive them.
• Records are key-value pairs.
KAFKA STREAMS
• The Kafka Streams API supports per-record stream processing with millisecond latency.
• Write standard Java applications and microservices to process data in real time.
• No separate processing cluster required
KAFKA STREAMS
• Call the Kafka Streams API from your Java or Scala applications
• The Kafka Streams application interacts with a Kafka cluster.
• No separate processing cluster required
• The application does not run directly on Kafka Brokers
CONFIGURING THE KAFKA STREAMS APPLICATION
1. Configure your Kafka Streams application with a StreamsConfig instance
• Specify configuration parameters:
APPLICATION_ID_CONFIG
BOOTSTRAP_SERVERS_CONFIG
………
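A minimal configuration sketch, assuming a local broker at localhost:9092 and String keys/values (the application id is a placeholder):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // unique id of the application
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // Kafka broker(s) to connect to
// Default serializers/deserializers (SerDes) for record keys and values
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());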
CONFIGURING THE KAFKA STREAMS APPLICATION
2. Define a Topology
A topology refers to the logical structure or data processing pipeline that defines how the stream processing application processes, transforms, and routes the data from input to output. It is essentially a graph of stream processing components, where each node represents a transformation, aggregation, or operation on the data, and the edges represent the flow of data between the operations.
Diagram: a StreamsBuilder builds the Topology (Source → Transformations → Sink)
CONFIGURING THE KAFKA STREAMS APPLICATION
2. Define a Topology
StreamsBuilder builder = new StreamsBuilder();
// Consume data from the "input-topic" topic
KStream<String, String> sourceStream = builder.stream("input-topic");
// Apply a transformation: convert the messages to uppercase
KStream<String, String> transformedStream = sourceStream.mapValues(value -> value.toUpperCase());
// Write the results to another topic, "output-topic"
transformedStream.to("output-topic");
CONFIGURING THE KAFKA STREAMS APPLICATION
3. Define a KafkaStreams
KafkaStreams is the core engine of a Kafka Streams application. It is responsible for:
• Taking the topology you define using StreamsBuilder
• Using a set of configuration properties (like bootstrap servers, application ID,
SerDes, etc.)
• And executing the streaming application on data coming from Kafka topics.
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
CONFIGURING THE KAFKA STREAMS APPLICATION
3. Define a KafkaStreams
The main KafkaStreams functions:
• builder.build(): Builds the logical topology that KafkaStreams will execute.
• KafkaStreams(topology, props): Instantiates the application with the topology and configuration.
• start(): Starts the stream processing in the background.
• close(): Gracefully stops the Kafka Streams application.
• cleanUp(): Cleans up local state stores (used for testing or resetting state).

KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
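In practice these lifecycle methods are often combined with a JVM shutdown hook so the application closes cleanly; a minimal sketch (common practice, not from the slides):

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();   // optional: reset local state stores (useful during development)
streams.start();     // begin processing in background threads
// Ensure close() is called when the JVM shuts down, so state and offsets are committed cleanly
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));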
DATA MODELS FOR STREAM PROCESSING
In Kafka Streams, the KStream and KTable objects represent two fundamental data models
for stream processing. They allow you to express different logic depending on how you want
to interpret your data.
KStream – a stream of events
A KStream is a continuous sequence of records, like an event log arriving in real-time.
• Each message is independent.
• Used for stateless transformations (e.g., mapping, filtering, routing).
• Ideal for immutable data (e.g., transactions, clicks, logs).
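A minimal KStream sketch (topic names and record schema are assumptions): each click event is processed independently, with a stateless filter routing only checkout-page clicks to an output topic.

// Every record is an independent event: key = user id, value = page visited (assumed schema)
KStream<String, String> clicks = builder.stream("click-events");
clicks
    .filter((userId, page) -> page.startsWith("/checkout"))  // keep checkout pages only
    .to("checkout-clicks");                                   // write matching events downstream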
DATA MODELS FOR STREAM PROCESSING
In Kafka Streams, the KStream and KTable objects represent two fundamental data models
for stream processing. They allow you to express different logic depending on how you want
to interpret your data.
KTable – a state table
A KTable is a materialized view of the latest state for
each key, which updates over time.
• Represents the current value for each key.
• Each new record updates the previous value
(like a database update).
• Useful for joins and stateful aggregations
(e.g., inventory count, user status).
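To contrast the two models, here is a minimal sketch (topic names and value formats are assumptions) that enriches a stream of purchase events with the latest profile per customer held in a KTable:

// KStream: each purchase is an independent event, keyed by customer id (assumed)
KStream<String, String> purchases = builder.stream("purchases");

// KTable: a changelog view keeping only the latest profile per customer id
KTable<String, String> customers = builder.table("customer-profiles");

// Stream-table join: each purchase event is enriched with the customer's current profile
KStream<String, String> enriched = purchases.join(
        customers,
        (purchase, profile) -> purchase + " by " + profile);

enriched.to("enriched-purchases");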
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateless transformations
• Stateful transformations
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateless transformations
flatMap: Takes one record and produces zero, one, or more records. You can modify the record keys and values, including their types.
groupByKey: Groups the records by key.
map: Takes one record and produces one record. You can modify the record key and value, including their types.
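A minimal sketch of these stateless operations on a stream of sentences (topic name and schema are assumptions; imports from java.util and org.apache.kafka.streams.kstream are assumed):

KStream<String, String> sentences = builder.stream("sentences");

// map: one record in, one record out (lowercase the value, keep the key)
KStream<String, String> lowercased =
        sentences.map((key, value) -> KeyValue.pair(key, value.toLowerCase()));

// flatMap: one record in, zero or more records out (one record per word, re-keyed by the word)
KStream<String, String> words = lowercased.flatMap((key, value) -> {
    List<KeyValue<String, String>> out = new ArrayList<>();
    for (String word : value.split("\\s+")) {
        out.add(KeyValue.pair(word, word));
    }
    return out;
});

// groupByKey: group records sharing the same key (here, the word) for later aggregation
KGroupedStream<String, String> grouped = words.groupByKey();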
TRANSFORMATION OPERATIONS
The KStream and KTable interfaces support a variety of transformation operations.
• Stateful transformations
aggregate: Aggregates the values of (non-windowed) records by the grouped key.
count: Counts the number of records by the grouped key.
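Continuing the word example above, a minimal sketch of the two stateful operations (the concatenation aggregator is purely illustrative):

// count: running count of records per grouped key, backed by a state store
KTable<String, Long> wordCounts = grouped.count();

// aggregate: the general form; here it simply concatenates all values seen per key
KTable<String, String> concatenated = grouped.aggregate(
        () -> "",                                             // initializer: starting value per key
        (key, value, aggregate) -> aggregate + value,         // aggregator: fold each new record into the state
        Materialized.with(Serdes.String(), Serdes.String())); // serdes for the state store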
TRANSFORMATION OPERATIONS
Example: splitting each line of text into individual words with flatMapValues:
textLines.flatMapValues(textLine -> Arrays.asList(textLine.split("\\s+")))