
Big Data Processing

Jiaul Paik
Lecture 16
Streaming Data Processing

• High Velocity
• continuously incorporating new data to compute a result.

• Data is unbounded
• no predetermined beginning or end

• High volume
• Millions of records/second
Stream Processing: Use Cases
• Notifications and alerting
• A notification or alert should be triggered when some specific event or
sequence of events occurs.

• Real-time reporting
• real-time dashboards for employees to look at (server load, number of users visiting, etc.)

• Online machine learning


• Credit card fraud detection: a company may want to continuously update a model
from all customers’ behavior and test each transaction against it
Stream Processing: Issues

• Processing out-of-order data based on event time

• You may want to trigger some action only when a specific sequence of
values is received, say, for example, 2 -> 10 -> 5

• Why is this challenging?

• the streaming system receives each event individually
• the data can arrive out of order because of network delays
• the volume of data is also very high
Requirements for Stream Processing System

§ Scalable to large clusters

§ Quick response time

§ Simple programming model

§ Integrated batch & interactive processing


Stateful Stream Processing

§ Traditional streaming systems have a record-at-a-time processing model
  - Each node has mutable state
  - For each record, update the state & send new records

§ State is lost if a node dies!

[Diagram: input records flow into node 1 and node 2, each holding mutable state; their output records flow on to node 3]
Distributed Stream Processing

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

§ Chop up the live stream into batches of t seconds
§ Spark treats each batch of data as RDDs and processes them using RDD operations
§ Finally, the processed results of the RDD operations are returned in batches

[Diagram: Spark Streaming chops the live data stream into batches of t seconds; Spark processes each batch and returns the processed results in batches]
Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

§ Batch sizes as low as ½ second, latency ~ 1 second

§ Potential for combining batch processing and streaming processing in the same system
Spark Streaming Context
• StreamingContext is the main module for all streaming operations

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// Create a local StreamingContext
val conf = new SparkConf().setMaster("local[10]").setAppName("myApp")
val ssc = new StreamingContext(conf, Seconds(2))
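
The slide stops after constructing the StreamingContext; a minimal sketch of the rest of the lifecycle (define an input DStream, attach an output operation, then start the context and block until it stops) might look as follows. The socket source on localhost:9999 and the whitespace split are illustrative assumptions, not part of the lecture.

// Continuing from the ssc created above (illustrative sketch)
val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source

val words = lines.flatMap(_.split(" "))
words.print()            // output operation: print a few elements of each batch

ssc.start()              // start receiving and processing data
ssc.awaitTermination()   // block until the streaming computation is stopped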
Example: Find hashtags from Twitter
val tweets = ssc.twitterStream(username, password)

val hashtags = tweets.flatMap(t => getTags(t))

hashtags.saveAsTextFile("output")

[Diagram: at each batch interval (time t, t+1, t+2), the tweets DStream is transformed by flatMap into the hashtags DStream, and each resulting batch is written out by saveAsTextFile]
Fault-tolerance

§ RDDs remember the sequence of operations that created them from the original fault-tolerant input data

§ Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant

§ Data lost due to worker failure can be recomputed from the input data

[Diagram: the tweets input-data RDD is replicated in memory; if partitions of the derived hashTags RDD are lost (flatMap lineage), they are recomputed on other workers]
Key concepts

• DStream – sequence of RDDs representing a stream of data


• Twitter, HDFS, Kafka

• Transformations – modify data from one DStream to another


• Standard RDD operations – map, countByValue, reduce, join, …
• Stateful operations – window, countByValueAndWindow, …

• Output Operations – send data to external storage/show on screen


• saveAsHadoopFiles – saves to HDFS
• foreach – do anything with each batch of results (see the sketch below)
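
As a concrete illustration of an output operation, here is a minimal sketch using foreachRDD (the per-batch variant of foreach in current Spark Streaming APIs). It assumes the hashtags DStream from the earlier example; collecting a few elements to the driver is for illustration only.

// Inspect each batch of the hashtags DStream as it is produced
hashtags.foreachRDD { (rdd, time) =>
  val sample = rdd.take(10)                       // a few elements; illustration only
  println(s"Batch at $time: ${sample.mkString(", ")}")
}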
Example: Count the hashtags

val tweets = ssc.twitterStream(username, password)

val hashtags = tweets.flatMap(t => getTags(t))

val count = hashtags.map(t => (t, 1)).reduceByKey(_ + _)


Windowed computations

• Allow you to apply transformations over a sliding window of data.


• window length - The duration of the window

• sliding interval - The interval at which the window operation is


performed
[Diagram: batches 1-5 of the original DStream; consecutive batches are grouped into windows 1 and 2 of the windowed DStream, which slide forward by the sliding interval]
Window Operations

(WL = window length, SI = sliding interval)

• window(WL, SI) – Return a new DStream computed from windowed batches of the source DStream.

• countByWindow(WL, SI) – Return a sliding window count of elements in the stream.

• reduceByWindow(Func, WL, SI) – Return a new single-element stream, created by aggregating elements over a sliding interval using Func.

• reduceByKeyAndWindow(Func, WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function Func over batches in a sliding window.

• countByValueAndWindow(WL, SI, [numTasks]) – Return a new DStream of (K, V) pairs where the value of each key is its frequency within a sliding window.
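
A brief sketch of how one of these operations might be used; the words DStream, the 30-second window, and the 10-second slide are illustrative assumptions, not values from the slides.

// Per-word counts over a 30-second window, recomputed every 10 seconds
// (assumes a DStream[String] named words already exists)
val pairs = words.map(word => (word, 1))

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce function Func applied within the window
  Seconds(30),                 // window length (WL)
  Seconds(10)                  // sliding interval (SI)
)

windowedCounts.print()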
Count the hashtags over last n minutes

val tweets = ssc.twitterStream(username, password)

val hashtags = tweets.flatMap(t => getTags(t))

val wtags = hashtags.window(Minutes(10), Minutes(2))

val tagcount = wtags.countByValue()

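The same result can typically be obtained with the combined operation from the table above; a one-line sketch, assuming the hashtags DStream defined in this example:

// Equivalent sketch: hashtag frequencies over a 10-minute window sliding every 2 minutes
// (incremental window counting may require a checkpoint directory, e.g. ssc.checkpoint(...))
val tagcountAlt = hashtags.countByValueAndWindow(Minutes(10), Minutes(2))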
Structured Stream Processing
Structured Stream in Spark: Overview

• Structured Streaming treats a live data stream as a table that is being continuously appended to.

• Similar to a batch processing model.

• The streaming computation is expressed as a standard batch-like query, as if on a static table,

• and Spark runs it as an incremental query on the unbounded input table.


Stream as Unbounded Table

[Diagram: records arriving on the data stream are appended as new rows to an unbounded input table]
Output Modes

• Complete Mode
• The entire updated Result Table is written to the external storage.

• Append Mode
• Only the new rows appended in the Result Table since the last trigger will
be written to the external storage.

• Update Mode
• Only the rows that were updated in the Result Table since the last trigger
will be written to the external storage
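
The output mode is selected when the streaming query is started on a sink, via DataStreamWriter.outputMode. A minimal sketch, assuming an aggregated streaming DataFrame named counts and the console sink for illustration:

// Start a streaming query with an explicit output mode
val query = counts.writeStream
  .outputMode("complete")   // or "append" / "update", as described above
  .format("console")        // illustrative sink
  .start()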
Querying on Structured Stream: Example

Word count example (1-second trigger interval, complete output mode):

• t = 1: input table so far: cat, dog → result (word count): cat 1, dog 1 → output at t=1
• t = 2: input table so far: cat, dog, dog, cat → result: cat 2, dog 2 → output at t=2
• t = 3: input table so far: cat, dog, dog, cat, rat → result: cat 2, dog 2, rat 1 → output at t=3
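
The query behind this example can be written with the ordinary DataFrame API on a streaming DataFrame. A minimal Scala sketch, assuming the words arrive as lines of text on a local socket (host, port, and the whitespace split are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
import spark.implicits._

// Unbounded input table: one row per line received on the socket
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split lines into words and count occurrences of each word
val words  = lines.as[String].flatMap(_.split(" "))
val counts = words.groupBy("value").count()

// Complete mode: the full result table (all word counts) is emitted at every trigger
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()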
Creating streaming DataFrames

• Streaming DataFrames can be created through

• the DataStreamReader interface provided by SparkSession.readStream()


Streaming DataFrames: Input Sources

• File source –
• Reads files written in a directory as a stream of data.
• Files are processed in the order of file modification time.
• Supported file formats are text, CSV, JSON, etc.

• Kafka source
• Reads data from Kafka.

• Socket source
• Reads UTF8 text data from a socket connection.

• Rate source (for testing)


• Generates data at the specified number of rows per second; each output row contains
a timestamp and a value.
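
For the file source, the schema normally has to be supplied up front, since schema inference is not enabled for streaming file reads by default. A minimal sketch, assuming an existing SparkSession named spark, JSON event files landing in a directory of our choosing, and an assumed three-column schema:

import org.apache.spark.sql.types._

// Illustrative schema and input directory (not from the slides)
val eventSchema = new StructType()
  .add("user", StringType)
  .add("action", StringType)
  .add("ts", TimestampType)

// Every new file appearing in the directory contributes new rows to the stream
val events = spark.readStream
  .schema(eventSchema)
  .json("/data/incoming/events")

events.printSchema()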
Input Sources: Reading from socket

spark = SparkSession. ...

socketdf = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

socketdf.printSchema()
Basic Operations of Stream Dataframes

• Selection, aggregation

• df = ...

• df.select("name").where("value > 10")

• df.groupBy("name").count()
Window operations on Structured Streaming

• Aggregations over a sliding event-time window

• Tumbling window

• Sliding window

• Session window
Window Operation Types

• Tumbling windows: fixed-size, non-overlapping windows

[Diagram: 5-minute tumbling windows w1, w2, w3, w4 laid end to end along the timeline 12:00, 12:05, 12:10, 12:15]


Window Operations

• Sliding windows (10 min, 5 min)


• Fixed-size window length
• Windows overlap (a new window starts every sliding interval)
Window Operations

• Session windows
• A session window has a dynamic window length, depending on the inputs.

• A session window starts with an input and expands if the following input
is received within the gap duration.
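
Session windows have a dedicated helper in newer Spark releases (session_window, added around Spark 3.2). A minimal Scala sketch, assuming a streaming DataFrame events with userId and timestamp columns and a 5-minute gap duration, all of which are illustrative assumptions:

import org.apache.spark.sql.functions.{col, session_window}

// Count events per user session; a session closes after 5 minutes of inactivity
val sessionCounts = events
  .groupBy(
    session_window(col("timestamp"), "5 minutes"),   // window keeps growing while events arrive within the gap
    col("userId")
  )
  .count()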
Window Operations

Count words within 10 minute windows, updating every 5 minutes

from pyspark.sql.functions import window

words = ...  # streaming DataFrame with columns: timestamp, word

# Group by a 10-minute window sliding every 5 minutes, and by word
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
Thank you!
