Spark Streaming

Spark Streaming is Spark’s module for real-time applications (for example: tracking statistics about page views in real time, or automatically detecting anomalies). It lets users write streaming applications using an API very similar to that of batch jobs, and thus reuse much of the skills and even code they built for those.

Spark Streaming is a distributed data stream processing framework. It makes it easy to develop distributed applications for processing live data streams in near real time. It not only provides a simple programming model but also enables an application to process high-velocity stream data. It also allows the combining of data streams and historical data for processing.

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards.
Process Flow in Spark Streaming

Spark Streaming works in the following fashion:

- Spark Streaming receives live input data streams and divides the data into batches.
- The Spark engine processes each batch of data.
- Once processing is done, the Spark engine generates the final stream of results in batches.

High-Level Architecture

- Spark Streaming processes a data stream in micro-batches.
- It splits a data stream into batches of very small, fixed-size time intervals.
- Data in each micro-batch is stored as an RDD, which is then processed using Spark core. Any RDD operation can be applied to an RDD created by Spark Streaming.
- The results of the RDD operations are streamed out in batches.

StreamingContext

- StreamingContext, a class defined in the Spark Streaming library, is the main entry point into Spark Streaming.
- It allows a Spark Streaming application to connect to a Spark cluster.
- It also provides methods for creating an instance of the data stream abstraction provided by Spark Streaming.
- Every Spark Streaming application must create an instance of this class.

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setMaster("spark://host:port")
  .setAppName("big streaming app")
val batchInterval = 10
val ssc = new StreamingContext(conf, Seconds(batchInterval))

NOTE: The batch size can be as small as 500 milliseconds. The upper bound for the batch size is determined by the latency requirements of your application and the available memory.

Starting Stream Computation

The start method begins stream computation. Nothing really happens in a Spark Streaming application until the start method is called on an instance of the StreamingContext class.

A Spark Streaming application begins receiving data after it calls the start method.

ssc.start()

Waiting for Stream Computation to Finish

The awaitTermination method in the StreamingContext class makes an application thread wait for stream computation to stop.

Its syntax is:

ssc.awaitTermination()

DStreams or Discretized Streams

Just as Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or discretized streams.

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step (hence the name “discretized”).

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

DStreams offer two types of operations:

1. Transformations, which yield a new DStream.
2. Output operations, which write data to an external system.
NOTE: DStreams provide many of the same operations available on RDDs, plus new operations related to time, such as sliding windows.

Unlike batch programs, Spark Streaming applications need additional setup in order to operate 24/7.

Checkpointing: the main mechanism Spark Streaming provides for this purpose (operating 24/7), which lets it store data in a reliable file system such as HDFS.
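
A minimal sketch of enabling checkpointing (the HDFS directory below is only a placeholder):

// Point the StreamingContext at a reliable directory for checkpoint data;
// the path is a placeholder and depends on your cluster.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")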

NOTE: In Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental Python support was added in Spark 1.2.

Transform
- The transform method returns a DStream by applying an RDD => RDD function to each RDD in the source DStream.
- It takes as argument a function that takes an RDD as argument and returns an RDD.
- Thus, it gives us direct access to the underlying RDDs of a DStream.
- This method allows you to use methods provided by the RDD API that do not have equivalent operations in the DStream API. For example, sortBy is a transformation available in the RDD API, but not in the DStream API.
- If you want to sort the elements within each RDD of a DStream, you can use the transform method, as shown in the following example.

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap { line => line.split(" ") }
val sorted = words.transform { rdd => rdd.sortBy(w => w) }
NOTE: The transform method is also useful for applying machine learning and graph computation algorithms to data streams. The machine learning and graph processing libraries provide classes and methods that operate at the RDD level. Within the transform method, you can use the API provided by these libraries.

Spark Streaming Use Case

We will receive a stream of newline-delimited lines of text from a server running at port 7777, filter only the lines that contain the word "error", and print them.

Maven coordinates for Spark Streaming

groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.2.0
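
In an SBT build, these Maven coordinates would typically be declared as a dependency like the following (a sketch; adjust the Scala and Spark versions to match your build):

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"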

Scala streaming imports

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds

StreamingContext:

- Main entry point for streaming functionality.
- This also sets up an underlying SparkContext that it will use to process the data.
- It takes as input a batch interval specifying how often to process new data, which we set to 1 second.

socketTextStream()

- We use socketTextStream() to create a DStream based on text data received on port 7777 of the local machine.

Then we transform the DStream with filter() to get only the lines that contain "error".

Finally, we apply the output operation print() to print some of the filtered lines.

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()
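
The code above only sets up the computation; as described earlier, nothing happens until the streaming context is started and awaited:

// Start the streaming computation, then block until it terminates
ssc.start()
ssc.awaitTermination()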

For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied to each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
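
For reference, that lines-to-words step is the flatMap transformation shown in the transform example above:

// Split each line into words; flatMap is applied to every RDD in the lines DStream
val words = lines.flatMap(line => line.split(" "))
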
Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the above example of converting a stream of lines into words, lines was an input DStream, as it represented the stream of data received from the netcat server.

Every input DStream is associated with a Receiver object (see the Scala and Java API docs) which receives the data from a source and stores it in Spark’s memory for processing.

Spark Streaming provides two categories of built-in streaming sources:

- Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, socket connections, and Akka actors.
- Advanced sources: Sources like Kafka, Flume, Kinesis, Twitter, etc. are available through extra utility classes. These require linking against extra dependencies.

Basic Sources

File Streams: For reading data from files on any file system
compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a
DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). Note that:

- The files must have the same data format.
- The files must be created in dataDirectory by atomically moving or renaming them into that directory.
- Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.

For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and hence do not require allocating cores.
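
A minimal sketch of a text file stream (the directory path below is only a placeholder):

// Monitor a directory for new plain-text files; each new file's lines join the stream
val logLines = ssc.textFileStream("hdfs://namenode:8020/incoming/logs")
logLines.print()
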
Advanced Sources

This category of sources requires interfacing with external non-Spark libraries, some of them with complex dependencies (e.g., Kafka and Flume).

Hence, to minimize issues related to version conflicts of dependencies, the functionality to create DStreams from these sources has been moved to separate libraries.

Example:

If you want to create a DStream using data from Twitter’s stream of tweets, you have to do the following.

Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies.

Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream, as shown below.
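
A minimal sketch of that step (assumptions: passing None makes Twitter4J read OAuth credentials from system properties, and the "spark" filter keyword is only an illustration):

import org.apache.spark.streaming.twitter._

// Create a DStream of tweets filtered by a keyword; None means the OAuth
// credentials are taken from Twitter4J system properties.
val tweets = TwitterUtils.createStream(ssc, None, Seq("spark"))
tweets.map(status => status.getText).print()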

Some of these advanced sources are as follows.

- Twitter: Spark Streaming’s TwitterUtils uses Twitter4j 3.0.3 to get the public stream of tweets using Twitter’s Streaming API. Authentication information can be provided by any of the methods supported by the Twitter4J library. You can either get the public stream, or get a filtered stream based on keywords.
- Flume: Spark Streaming 1.2.0 can receive data from Flume 1.4.0.
- Kafka: Spark Streaming 1.2.0 can receive data from Kafka 0.8.0.

Window Operations

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure explains this sliding window.
Scala Code

// Reduce the last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

NOTE: The above code generates word counts over the last 30 seconds of data, every 10 seconds. To do this, we apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.
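
The pairs DStream is not defined in the snippet above; a minimal sketch of how it is typically built from a words DStream:

// Map each word to a (word, 1) pair so counts can be reduced by key and window
val pairs = words.map(word => (word, 1))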

Performance Considerations
Batch and Window Sizes

The most common question is what minimum batch size Spark Streaming can use. In general, 500 milliseconds has proven to be a good minimum size for many applications.

The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size. If the processing times reported in the Streaming UI remain consistent, then you can continue to decrease the batch size.

Level of Parallelism

A common way to reduce the processing time of batches is to increase the parallelism. There are three ways to increase the parallelism (see the sketch after this list).

1. Increasing the number of receivers.
2. Explicitly repartitioning received data: repartition the input stream (or the union of multiple streams) using DStream.repartition.
3. Increasing parallelism in aggregation: for operations like reduceByKey(), you can specify the parallelism as a second parameter.
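
A minimal sketch of the second and third approaches (the partition counts are placeholders):

// 2. Explicitly repartition received data across more partitions (and hence more cores)
val repartitioned = lines.repartition(8)

// 3. Increase parallelism in aggregations by passing a partition count to reduceByKey
val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 10)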
