Big Data Processing
Jiaul Paik
Lecture 16
Streaming Data Processing
• High Velocity
• continuously incorporating new data to compute a result.
• Data is unbounded
• no predetermined beginning or end
• High volume
• Millions of records/second
Stream Processing: Use Cases
• Notifications and alerting
• Given a series of events, a notification or alert should be triggered when a
specific event or sequence of events occurs.
• Real-time reporting
• real-time dashboards for employees to look at (server load, # of users visiting, etc.)
• Online machine learning
• Credit card fraud detection: company may want to continuously update a model
from all customers’ behavior and test each transaction against it
Stream Processing: Issues
• Processing out-of-order data based on event time
• You may want to trigger an action only when a specific sequence of
values is received, for example 2 -> 10 -> 5
• Why is this challenging?
• the streaming system receives each event individually
• data can arrive out of order because of network delays
• the volume of data is also very high
Requirements for Stream Processing System
§ Scalable to large clusters
§ Quick response time
§ Simple programming model
§ Integrated batch & interactive processing
Stateful Stream Processing
§ Traditional streaming systems have a record-at-a-time processing model
- Each node has mutable state
- For each record, update state & send new records
§ State is lost if a node dies!
(Figure: input records flow into worker nodes 1–3, each maintaining its own mutable state.)
Distributed Stream Processing
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
§ Chop up the live stream into batches of t seconds
§ Spark Streaming treats each batch of data as RDDs and processes them using RDD operations
§ Finally, the processed results of the RDD operations are returned in batches
(Figure: live data stream → Spark Streaming → batches of t seconds → Spark → processed results)
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
§ Batch sizes as low as ½ second, latency ~ 1 second
§ Potential for combining batch processing and streaming processing in the same system
Spark Streaming Context
• StreamingContext is the main module for all streaming operations
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// Create a local StreamingContext
val conf = new SparkConf().setMaster("local[10]").setAppName("myApp")
val ssc = new StreamingContext(conf, Seconds(2))
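The context by itself only fixes the batch interval; a computation is defined on it and the context is then started. A minimal sketch, assuming a socket text source on port 9999 purely for illustration:

// Define a simple computation on the context (the socket source is an assumption)
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).print()   // print a few elements of every batch

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming computation is stopped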
Example: Find hashtags from Twitter
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
hashtags.saveAsTextFile("output")
(Figure: the tweets DStream is a sequence of RDDs at times t, t+1, t+2, ...; flatMap produces the hashtags DStream, which is written out with saveAsTextFile.)
Fault-tolerance
§ RDDs remember the sequence of operations that created them from the original fault-tolerant input data
§ Batches of input data are replicated in memory of multiple worker nodes, therefore fault-tolerant
§ Data lost due to worker failure can be recomputed from the input data
(Figure: the tweets input RDD is replicated in memory; lost partitions of the hashTags RDD are recomputed on other workers by re-running flatMap.)
Key concepts
• DStream – sequence of RDDs representing a stream of data
• Twitter, HDFS, Kafka
• Transformations – modify data from one DStream to another
• Standard RDD operations – map, countByValue, reduce, join, …
• Stateful operations – window, countByValueAndWindow, …
• Output Operations – send data to external storage/show on screen
• saveAsHadoopFiles – saves to HDFS
• foreach – do anything with each batch of results
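A minimal sketch tying the three concepts together, reusing twitterStream and getTags from the earlier slides (foreachRDD is the per-batch output operation in recent Spark versions):

val tweets = ssc.twitterStream(username, password)              // input DStream
val tagCounts = tweets.flatMap(t => getTags(t)).countByValue()  // transformations
tagCounts.foreachRDD(rdd => rdd.take(10).foreach(println))      // output operation: act on each batch of results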
Example: Count the hashtags
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
val count = hashtags.map(t => (t, 1)).reduceByKey(_ + _)
Windowed computations
• Allow you to apply transformations over a sliding window of data.
• window length - The duration of the window
• sliding interval - The interval at which the window operation is
performed
(Figure: RDDs 1–5 of the original DStream are grouped into overlapping windows to form the windowed DStream.)
Window Operations
(WL = window length, SI = sliding interval)
window(WL, SI) – Return a new DStream based on windowed batches of the source DStream.
countByWindow(WL, SI) – Return a sliding window count of elements in the stream.
reduceByWindow(Func, WL, SI) – Return a new single-element stream, by aggregating elements over the sliding interval using Func.
reduceByKeyAndWindow(Func, WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function Func over batches in a sliding window.
countByValueAndWindow(WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the value of each key is its frequency within a sliding window.
Count the hashtags over last n minutes
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
val wtags = hashtags.window(Minutes(10), Minutes(2))
val tagcount = wtags.countByValue()
Structured Stream Processing
Structured Stream in Spark: Overview
• Structured Streaming treats a live data stream as a table that is being continuously appended.
• Similar to a batch processing model.
• You express the streaming computation as a standard batch-like query, as if on a static table.
• Spark runs it as an incremental query on the unbounded input table.
Stream as Unbounded Table
(Figure: each new record arriving on the data stream is appended as a new row to the unbounded input table.)
Output Modes
• Complete Mode
• The entire updated Result Table is written to the external storage.
• Append Mode
• Only the new rows appended in the Result Table since the last trigger will
be written to the external storage.
• Update Mode
• Only the rows that were updated in the Result Table since the last trigger
will be written to the external storage
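The output mode is chosen when the streaming query is started. A minimal sketch in Scala, where resultTable stands for some streaming aggregation result and the console sink is used only for illustration:

// resultTable is assumed to be a streaming DataFrame produced by an aggregation
val query = resultTable.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")        // write each trigger's output to the console
  .start()
query.awaitTermination()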
Querying on Structured Stream: Example
Word count example
Time (1-second intervals):  t = 1           t = 2                  t = 3
Input (unbounded table):    cat, dog        cat, dog, dog, cat     cat, dog, dog, cat, rat
Result (word count):        cat 1, dog 1    cat 2, dog 2           cat 2, dog 2, rat 1
Output (complete mode):     output at t=1   output at t=2          output at t=3
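This word count can be written as a Structured Streaming query. A minimal sketch in Scala, assuming lines arrive on a socket source (localhost:9999) and results go to the console sink:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Each line received on the socket becomes a new row of the unbounded input table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Batch-like query: split lines into words and count each word
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Spark runs it as an incremental query; complete mode re-emits the whole result table
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()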
Creating streaming DataFrames
• Streaming DataFrames can be created through
• DataStreamReader interface provided by SparkSession.readStream()
Streaming DataFrames: Input Sources
• File source –
• Reads files written in a directory as a stream of data.
• Files are processed in the order of file modification time.
• Supported file formats are text, CSV, JSON, etc.
• Kafka source
• Reads data from Kafka.
• Socket source
• Reads UTF8 text data from a socket connection.
• Rate source (for testing)
• Generates data at the specified number of rows per second, each output row contains
a timestamp and value.
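As an illustration of the file source, a minimal sketch in Scala; the directory path and the two-column schema are assumptions, since streaming file sources require a schema to be given up front (spark is the SparkSession from the next slide):

import org.apache.spark.sql.types._

// Hypothetical schema of the incoming CSV files
val csvSchema = new StructType()
  .add("name", StringType)
  .add("value", IntegerType)

// Watch a directory for new CSV files and read them as a streaming DataFrame
val csvDF = spark.readStream
  .schema(csvSchema)
  .option("header", "true")
  .csv("/path/to/streaming/dir")   // hypothetical directory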
Input Sources: Reading from socket
spark = SparkSession. ...

socketdf = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketdf.printSchema()
Basic Operations of Stream Dataframes
• Selection, aggregation
• df = ...
• df.select("name").where("value > 10")
• df.groupBy("name").count()
Window operations on Structured Streaming
• Aggregations over a sliding event-time window
• Tumbling window
• Sliding window
• Session window
Window Operation Types
• Tumbling windows
• Fixed-size windows that do not overlap
(Figure: 5-minute tumbling windows w1–w4 covering 12:00–12:05, 12:05–12:10, 12:10–12:15, ...)
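Passing only a window duration to the window function yields tumbling windows. A minimal sketch in Scala, where the events streaming DataFrame and its timestamp column are assumptions:

import org.apache.spark.sql.functions.{window, col}

// Count events per non-overlapping 5-minute window
val tumblingCounts = events
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()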
Window Operations
• Sliding windows (10 min, 5 min)
• Fixed-size window length
• Windows overlap
Window Operations
• Session windows
• A session window has a dynamic length, depending on the input.
• A session window starts with an input and keeps expanding as long as
subsequent inputs arrive within the gap duration (see the sketch below).
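Since Spark 3.2, session windows can be expressed with the session_window function. A minimal sketch in Scala, with the events DataFrame, its timestamp column, and the 5-minute gap duration as assumptions:

import org.apache.spark.sql.functions.{session_window, col}

// Group events into sessions that close after 5 minutes of inactivity
val sessionCounts = events
  .groupBy(session_window(col("timestamp"), "5 minutes"))
  .count()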
Window Operations
Count words within 10 minute windows, updating every 5 minutes
from pyspark.sql.functions import window

words = ...  # streaming DataFrame with columns: timestamp, word
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
Thank you!