Spark Streaming
Spark Streaming is Spark’s module for real-time applications (for
example, tracking statistics about page views in real time or
automatically detecting anomalies). It lets users write streaming
applications with an API very similar to that of batch jobs, and thus
reuse much of the skills and even code they built for those.
Spark Streaming is a distributed data stream processing
framework. It makes it easy to develop distributed
applications that process live data streams in near real time. It
not only provides a simple programming model but also enables
an application to process high-velocity stream data. It also
allows data streams to be combined with historical data for
processing.
Spark Streaming is an extension of the core Spark API that
enables scalable, high-throughput, fault-tolerant stream
processing of live data streams. Data can be ingested from many
sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed
with high-level functions like map, reduce, join, and window.
Finally, processed data can be pushed out to file systems,
databases, and live dashboards.
Process Flow in Spark Streaming
Spark Streaming works in the following fashion:
1. Spark Streaming receives live input data streams and divides the data into batches.
2. The Spark engine processes each batch of data.
3. Once processing is done, the Spark engine generates the final stream of results in batches.
High-Level Architecture
Spark Streaming processes a data stream in micro-batches.
It splits a data stream into batches of very small, fixed-sized
time intervals.
Data in each micro-batch is stored as an RDD, which is
then processed using Spark core. Any RDD operation can be
applied to an RDD created by Spark Streaming.
The results of the RDD operations are streamed out in
batches.
StreamingContext
StreamingContext, a class defined in the Spark Streaming
library, is the main entry point into the Spark
Streaming library.
It allows a Spark Streaming application to connect to a Spark
cluster.
It also provides methods for creating an instance of the data
stream abstraction provided by Spark Streaming.
Every Spark Streaming application must create an instance
of this class.
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("spark://host:port")
.setAppName("big streaming app")
val batchInterval = 10
val ssc = new StreamingContext(conf, Seconds(batchInterval))
NOTE: The batch size can be as small as 500 milliseconds. The upper
bound for the batch size is determined by the latency requirements
of your application and the available memory.
Starting Stream Computation
The start method begins stream computation. Nothing really
happens in a Spark Streaming application until the start method is
called on an instance of the StreamingContext class.
A Spark Streaming application begins receiving data after it calls
the start method.
ssc.start()
Waiting for Stream Computation to
Finish
The awaitTermination method in the StreamingContext class
makes an application thread wait for stream computation to stop.
Its syntax is:
ssc.awaitTermination()
DStreams or discretized streams
Just as Spark is built on the concept of RDDs, Spark Streaming provides
an abstraction called DStreams, or discretized streams.
DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams. Internally, a DStream is
represented as a sequence of RDDs.
A DStream is a sequence of data arriving over time. Internally,
each DStream is represented as a sequence of RDDs arriving at
each time step (hence the name “discretized”).
Spark Streaming provides a high-level abstraction
called a discretized stream or DStream, which represents a
continuous stream of data.
A DStream is represented by a continuous series of RDDs, Spark’s
abstraction of an immutable, distributed dataset. Each
RDD in a DStream contains data from a certain interval, as shown
in the following figure.
DStreams offer two types of operations:
1. Transformations: which yield a new DStream.
2. Output operations: which write data to an external system.
NOTE: DStreams provide many of the same operations available
on RDDs, plus new operations related to time, such as sliding
windows.
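For illustration, the following minimal sketch shows both kinds of operations; the StreamingContext ssc, the socket source on port 9999, and the output path are assumptions for the example.
// Transformation: yields a new DStream from an existing one.
val lines = ssc.socketTextStream("localhost", 9999)
val upper = lines.map(_.toUpperCase)
// Output operations: write the data of each batch to an external system.
upper.print()
upper.saveAsTextFiles("hdfs://namenode:8020/out/upper")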
Unlike batch programs, Spark Streaming applications need
additional setup in order to operate 24/7.
Checkpointing is the main mechanism Spark Streaming provides for
this purpose; it lets the application store data in a reliable file
system such as HDFS.
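A minimal sketch of enabling checkpointing follows; the HDFS path is illustrative.
// Point the StreamingContext at a directory on a reliable file system so
// that metadata and state can be checkpointed (path is illustrative).
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")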
NOTE: In Spark 1.1, Spark Streaming is available only in Java
and Scala. Experimental Python support was added in Spark 1.2.
Transform
The transform method returns a DStream by applying an
RDD => RDD function to each RDD in the source DStream.
It takes as argument a function that takes an RDD as
argument and returns an RDD.
Thus, it gives us direct access to the underlying RDDs of a
DStream.
This method allows you to use methods provided by the RDD
API that do not have equivalent operations in the
DStream API. For example, sortBy is a transformation
available in the RDD API, but not in the DStream API.
If you want to sort the elements within each RDD of a
DStream, you can use the transform method as shown in the
following example.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap{line => line.split(" ")}
val sorted = words.transform{rdd => rdd.sortBy((w)=> w)}
NOTE: The transform method is also useful for applying machine
learning and graph computation algorithms to data streams. The machine
learning and graph processing libraries provide classes and
methods that operate at the RDD level. Within the transform
method, you can use the API provided by these libraries.
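As an illustration, the sketch below applies a toy MLlib KMeansModel inside transform, reusing the lines DStream from the previous snippet; the two cluster centers and the comma-separated input format are assumptions made for the example.
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors
// A toy model with two fixed cluster centers (in practice the model would
// be trained offline with MLlib's KMeans).
val model = new KMeansModel(Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0)))
// Assume each incoming line carries two comma-separated numeric features.
val points = lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
// transform exposes the underlying RDD, so MLlib's RDD-level predict
// method can be applied to each micro-batch.
val clusterIds = points.transform(rdd => model.predict(rdd))
clusterIds.print()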
Spark Streaming Use Case:
We will receive a stream of newline-delimited lines of text from a
server running at port 7777, filter only the lines that contain the
word error, and print them.
Maven coordinates for Spark Streaming
groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.2.0
Scala streaming imports
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
StreamingContext:
Main entry point for streaming functionality.
This also sets up an underlying SparkContext that it will use
to process the data.
It takes as input a batch interval specifying how often to
process new data, which we set to 1 second.
socketTextStream()
We use socketTextStream() to create a
DStream based on text data received on port 7777 of the
local machine.
Then we transform the DStream with filter() to get only the lines
that contain error
Finally, we apply the output operation print() to print some of
the filtered lines.
// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream using data received after connecting to port 7777
// on the local machine
val lines = ssc.socketTextStream("localhost", 7777)
// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines with errors
errorLines.print()
For example, in the earlier example of converting a stream of
lines to words, the flatMap operation is applied to each RDD in
the lines DStream to generate the RDDs of the words DStream.
This is shown in the following figure.
Input DStreams and Receivers
Input DStreams are DStreams representing the stream of input
data received from streaming sources. In the earlier example of
converting a stream of lines into words, lines was an input
DStream, as it represented the stream of data received from the
netcat server.
Every input DStream is associated with a Receiver object, which
receives the data from a source and stores it in Spark’s memory
for processing.
Spark Streaming provides two categories of built-in streaming
sources.
Basic sources: Sources directly available in the
StreamingContext API. Example: file systems, socket
connections, and Akka actors.
Advanced sources: Sources like Kafka, Flume, Kinesis,
Twitter, etc. are available through extra utility classes. These
require linking against extra dependencies.
Basic Sources
File Streams: For reading data from files on any file system
compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a
DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and
process any files created in that directory (files written in nested
directories are not supported). Note that:
The files must have the same data format.
The files must be created in dataDirectory by
atomically moving or renaming them into the data directory.
Once moved, the files must not be changed; if the files are
being continuously appended, the new data will not be read.
For simple text files, there is an easier
method, streamingContext.textFileStream(dataDirectory).
File streams also do not require running a receiver, so they do
not require allocating cores.
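A minimal sketch of the text-file variant follows; the directory path is illustrative.
// Watch a directory of plain-text files and count the lines that arrive
// in each batch (path is illustrative).
val logLines = ssc.textFileStream("hdfs://namenode:8020/incoming/logs")
logLines.count().print()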
Advanced Sources
This category of sources requires interfacing with external
non-Spark libraries, some of them with complex dependencies
(e.g., Kafka and Flume).
Hence, to minimize issues related to version conflicts of
dependencies, the functionality to create DStreams from these
sources has been moved to separate libraries.
Example:
If you want to create a DStream using data from Twitter’s stream
of tweets, you have to do the following.
Linking: Add the artifact spark-streaming-twitter_2.10 to the
SBT/Maven project dependencies
Programming: Import the TwitterUtils class and create a
DStream with TwitterUtils.createStream as shown below
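A minimal sketch follows, assuming the Twitter4J OAuth credentials have already been provided (for example, via the twitter4j.oauth.* system properties), so None is passed for the authorization; the filter keywords are illustrative.
import org.apache.spark.streaming.twitter.TwitterUtils
// Create a DStream of tweets matching the given (illustrative) keywords.
val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "streaming"))
// Print the text of a few tweets from each batch.
tweets.map(status => status.getText).print()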
Some of these advanced sources are as
follows.
Twitter: Spark Streaming’s TwitterUtils uses Twitter4j 3.0.3 to
get the public stream of tweets using Twitter’s Streaming API.
Authentication information can be provided by any of
the methods supported by the Twitter4J library. You can either get
the public stream, or get a filtered stream based on keywords.
Flume: Spark Streaming 1.2.0 can receive data from Flume
1.4.0.
Kafka: Spark Streaming 1.2.0 can receive data from Kafka
0.8.0.
Window Operations
Spark Streaming also provides windowed computations, which
allow you to apply transformations over a sliding window of data.
The following figure explains this sliding window.
Scala Code
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
NOTE: The above code generates word counts over the last 30
seconds of data, every 10 seconds. To do this, we apply the
reduceByKey operation on the pairs DStream of (word, 1) pairs
over the last 30 seconds of data. This is done using the
operation reduceByKeyAndWindow.
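For context, here is a fuller sketch showing where a pairs DStream like the one above could come from; the socket source on port 9999 is an assumption for the example.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Word counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()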
Performance Considerations
Batch and Window Sizes
The most common question is what minimum batch size Spark
Streaming can use. In general, 500 milliseconds has proven to be
a good minimum size for many applications.
The best approach is to start with a larger batch size (around 10
seconds) and work your way down to a smaller batch size. If the
processing times reported in the Streaming UI remain consistent,
then you can continue to decrease the batch size.
Level of Parallelism
A common way to reduce the processing time of batches is to
increase the parallelism. There are three ways to increase the
parallelism.
1. Increasing the number of receivers: receive data in parallel by
creating multiple input DStreams, each with its own receiver.
2. Explicitly repartitioning received data: repartition the input
stream (or the union of multiple streams) using DStream.repartition.
3. Increasing parallelism in aggregation: for operations like
reduceByKey(), you can specify the level of parallelism as a second
parameter, as shown in the sketch below.
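In the sketch below, the source, port, and partition counts are illustrative.
// 1. Increase the number of receivers: create several input DStreams
//    (one receiver each) and union them into a single stream.
val inputStreams = (1 to 3).map(_ => ssc.socketTextStream("localhost", 7777))
val unioned = ssc.union(inputStreams)
// 2. Explicitly repartition the received data across more cores.
val repartitioned = unioned.repartition(8)
// 3. Increase parallelism in aggregation by passing the number of
//    partitions as a second parameter to reduceByKey().
val counts = repartitioned
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 8)
counts.print()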