Big Data Processing
Jiaul Paik
Lecture 16
Streaming Data Processing
• High Velocity
• continuously incorporating new data to compute a result.
• Data is unbounded
• no predetermined beginning or end
• High volume
• Millions of records/second
Stream Processing: Use Cases
• Notifications and alerting
• Given a series of events, a notification or alert should be triggered when a
specific event or sequence of events occurs.
• Real-time reporting
• real-time dashboards for employees to look at (server load, # of users visiting, etc.)
• Online machine learning
• Credit card fraud detection: company may want to continuously update a model
from all customers’ behavior and test each transaction against it
Stream Processing: Issues
• Processing out-of-order data based on event time
• You may want to trigger an action only when a specific sequence of
values is received, for example 2 -> 10 -> 5
• Why is this challenging?
• the streaming system receives each event individually
• data can arrive out of order because of network delays
• the volume of data is also very high
Requirements for Stream Processing System
§ Scalable to large clusters
§ Quick response time
§ Simple programming model
§ Integrated batch & interactive processing
Stateful Stream Processing
§ Traditional streaming systems have a record-at-a-time processing model
- Each node has mutable state
- For each record, update state & send new records
§ State is lost if a node dies!
(Figure: input records flow into worker nodes 1–3, each maintaining its own mutable state.)
Distributed Stream Processing
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
§ Chop up the live stream into batches of t seconds
§ Spark Streaming treats each batch of data as RDDs and processes them using RDD operations
§ Finally, the processed results of the RDD operations are returned in batches
(Figure: live data stream → Spark Streaming → batches of t seconds → Spark → processed results)
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
§ Batch sizes as low as ½ second, latency ~ 1 second
§ Potential for combining batch processing and streaming processing in the same system
Spark Streaming Context
• StreamingContext is the main module for all streaming operations
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// Create a local StreamingContext
val conf = new SparkConf().setMaster("local[10]").setAppName("myApp")
val ssc = new StreamingContext(conf, Seconds(2))
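The context by itself only fixes the batch interval; a computation is defined on it and the context is then started. A minimal sketch, assuming a socket text source on port 9999 purely for illustration:

// Define a simple computation on the context (the socket source is an assumption)
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).print()   // print a few elements of every batch

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming computation is stopped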
Example: Find hashtags from Twitter
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
hashtags.saveAsTextFile("output")
(Figure: the tweets DStream is a sequence of RDDs at times t, t+1, t+2, ...; flatMap produces the hashtags DStream, which is written out with saveAsTextFile.)
Fault-tolerance
§ RDDs remember the sequence of operations that created them from the original fault-tolerant input data
§ Batches of input data are replicated in memory of multiple worker nodes, therefore fault-tolerant
§ Data lost due to worker failure can be recomputed from the input data
(Figure: the tweets input RDD is replicated in memory; lost partitions of the hashTags RDD are recomputed on other workers by re-running flatMap.)
Key concepts
• DStream – sequence of RDDs representing a stream of data
• Twitter, HDFS, Kafka
• Transformations – modify data from one DStream to another
• Standard RDD operations – map, countByValue, reduce, join, …
• Stateful operations – window, countByValueAndWindow, …
• Output Operations – send data to external storage/show on screen
• saveAsHadoopFiles – saves to HDFS
• foreach – do anything with each batch of results
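A minimal sketch tying the three concepts together, reusing twitterStream and getTags from the earlier slides (foreachRDD is the per-batch output operation in recent Spark versions):

val tweets = ssc.twitterStream(username, password)              // input DStream
val tagCounts = tweets.flatMap(t => getTags(t)).countByValue()  // transformations
tagCounts.foreachRDD(rdd => rdd.take(10).foreach(println))      // output operation: act on each batch of results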
Example: Count the hashtags
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
val count = hashtags.map(t => (t, 1)).reduceByKey(_ + _)
Windowed computations
• Allow you to apply transformations over a sliding window of data.
• window length - The duration of the window
• sliding interval - The interval at which the window operation is
performed
(Figure: RDDs 1–5 of the original DStream are grouped into overlapping windows to form the windowed DStream.)
Window Operations
(WL = window length, SI = sliding interval)
window(WL, SI) – Return a new DStream based on windowed batches of the source DStream.
countByWindow(WL, SI) – Return a sliding window count of elements in the stream.
reduceByWindow(Func, WL, SI) – Return a new single-element stream, by aggregating elements over the sliding interval using Func.
reduceByKeyAndWindow(Func, WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function Func over batches in a sliding window.
countByValueAndWindow(WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the value of each key is its frequency within a sliding window.
Count the hashtags over last n minutes
val tweets = ssc.twitterStream(username, password)
val hashtags = tweets.flatMap(t => getTags(t))
val wtags = hashtags.window(Minutes(10), Minutes(2))
val tagcount = wtags.countByValue()
Structured Stream Processing
Structured Stream in Spark: Overview
• Structured Streaming treats a live data stream as a table that is being continuously appended.
• Similar to a batch processing model.
• You express the streaming computation as a standard batch-like query, as if on a static table.
• Spark runs it as an incremental query on the unbounded input table.
Stream as Unbounded Table
(Figure: each new record arriving on the data stream is appended as a new row to the unbounded input table.)
Output Modes
• Complete Mode
• The entire updated Result Table is written to the external storage.
• Append Mode
• Only the new rows appended in the Result Table since the last trigger will
be written to the external storage.
• Update Mode
• Only the rows that were updated in the Result Table since the last trigger
will be written to the external storage
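The output mode is chosen when the streaming query is started. A minimal sketch in Scala, where resultTable stands for some streaming aggregation result and the console sink is used only for illustration:

// resultTable is assumed to be a streaming DataFrame produced by an aggregation
val query = resultTable.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")        // write each trigger's output to the console
  .start()
query.awaitTermination()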
Querying on Structured Stream: Example
Word count example
Time (1-second intervals):  t = 1           t = 2                  t = 3
Input (unbounded table):    cat, dog        cat, dog, dog, cat     cat, dog, dog, cat, rat
Result (word count):        cat 1, dog 1    cat 2, dog 2           cat 2, dog 2, rat 1
Output (complete mode):     output at t=1   output at t=2          output at t=3
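This word count can be written as a Structured Streaming query. A minimal sketch in Scala, assuming lines arrive on a socket source (localhost:9999) and results go to the console sink:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Each line received on the socket becomes a new row of the unbounded input table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Batch-like query: split lines into words and count each word
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Spark runs it as an incremental query; complete mode re-emits the whole result table
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()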
Creating streaming DataFrames
• Streaming DataFrames can be created through
• DataStreamReader interface provided by SparkSession.readStream()
Streaming DataFrames: Input Sources
• File source –
• Reads files written in a directory as a stream of data.
• Files are processed in the order of file modification time.
• Supported file formats are text, CSV, JSON, etc.
• Kafka source
• Reads data from Kafka.
• Socket source
• Reads UTF8 text data from a socket connection.
• Rate source (for testing)
• Generates data at the specified number of rows per second, each output row contains
a timestamp and value.
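As an illustration of the file source, a minimal sketch in Scala; the directory path and the two-column schema are assumptions, since streaming file sources require a schema to be given up front (spark is the SparkSession from the next slide):

import org.apache.spark.sql.types._

// Hypothetical schema of the incoming CSV files
val csvSchema = new StructType()
  .add("name", StringType)
  .add("value", IntegerType)

// Watch a directory for new CSV files and read them as a streaming DataFrame
val csvDF = spark.readStream
  .schema(csvSchema)
  .option("header", "true")
  .csv("/path/to/streaming/dir")   // hypothetical directory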
Input Sources: Reading from socket
spark = SparkSession. ...

socketdf = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketdf.printSchema()
Basic Operations of Stream Dataframes
• Selection, aggregation
• df = ...
• df.select("name").where("value > 10")
• df.groupBy("name").count()
Window operations on Structured Streaming
• Aggregations over a sliding event-time window
• Tumbling window
• Sliding window
• Session window
Window Operation Types
• Tumbling windows
• Fixed-size windows that do not overlap
(Figure: 5-minute tumbling windows w1–w4 covering 12:00–12:05, 12:05–12:10, 12:10–12:15, ...)
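Passing only a window duration to the window function yields tumbling windows. A minimal sketch in Scala, where the events streaming DataFrame and its timestamp column are assumptions:

import org.apache.spark.sql.functions.{window, col}

// Count events per non-overlapping 5-minute window
val tumblingCounts = events
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()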
Window Operations
• Sliding windows (10 min, 5 min)
• Fixed-size window length
• Windows overlap
Window Operations
• Session windows
• A session window has a dynamic length, depending on the input.
• A session window starts with an input and keeps expanding as long as
subsequent inputs arrive within the gap duration (see the sketch below).
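Since Spark 3.2, session windows can be expressed with the session_window function. A minimal sketch in Scala, with the events DataFrame, its timestamp column, and the 5-minute gap duration as assumptions:

import org.apache.spark.sql.functions.{session_window, col}

// Group events into sessions that close after 5 minutes of inactivity
val sessionCounts = events
  .groupBy(session_window(col("timestamp"), "5 minutes"))
  .count()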
Window Operations
Count words within 10 minute windows, updating every 5 minutes
from pyspark.sql.functions import window

words = ...  # streaming DataFrame with columns: timestamp, word
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
Thank you!