Spark Streaming
Spark Streaming is Spark’s module for real-time applications (for
example, tracking statistics about page views in real time or
automatically detecting anomalies). It lets users write streaming
applications with an API very similar to that of batch jobs, and thus
reuse much of the skills and even code they built for those.
Spark Streaming is a distributed data stream processing
framework. It makes it easy to develop distributed
applications that process live data streams in near real time. It
not only provides a simple programming model but also enables
an application to process high-velocity stream data. It also
allows data streams to be combined with historical data for
processing.
Spark Streaming is an extension of the core Spark API that
enables scalable, high-throughput, fault-tolerant stream
processing of live data streams. Data can be ingested from many
sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed
with high-level functions like map, reduce, join, and window.
Finally, processed data can be pushed out to file systems,
databases, and live dashboards.
Process Flow in Spark Streaming
Spark Streaming works in the following fashion:
1. Spark Streaming receives live input data streams and divides the data into batches.
2. The Spark engine processes each batch of data.
3. Once processing is done, the Spark engine generates the final stream of results in batches.
High-Level Architecture
Spark Streaming processes a data stream in micro-batches.
It splits a data stream into batches of very small, fixed-sized
time intervals.
Data in each micro-batch is stored as an RDD, which is
then processed using Spark core. Any RDD operation can be
applied to an RDD created by Spark Streaming.
The results of the RDD operations are streamed out in
batches.
StreamingContext
StreamingContext, a class defined in the Spark Streaming
library, is the main entry point into the Spark
Streaming library.
It allows a Spark Streaming application to connect to a Spark
cluster.
It also provides methods for creating an instance of the data
stream abstraction provided by Spark Streaming.
Every Spark Streaming application must create an instance
of this class.
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("spark://host:port")
.setAppName("big streaming app")
val batchInterval = 10
val ssc = new StreamingContext(conf, Seconds(batchInterval))
NOTE: The batch size can be as small as 500 milliseconds. The upper
bound for the batch size is determined by the latency requirements
of your application and the available memory.
Starting Stream Computation
The start method begins stream computation. Nothing really
happens in a Spark Streaming application until the start method is
called on an instance of the StreamingContext class.
A Spark Streaming application begins receiving data after it calls
the start method.
ssc.start()
Waiting for Stream Computation to
Finish
The awaitTermination method in the StreamingContext class
makes an application thread wait for stream computation to stop.
Its syntax is:
ssc.awaitTermination()
DStreams or discretized streams
Just as Spark is built on the concept of RDDs, Spark Streaming provides
an abstraction called DStreams, or discretized streams.
DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams. Internally, a DStream is
represented as a sequence of RDDs.
A DStream is a sequence of data arriving over time. Internally,
each DStream is represented as a sequence of RDDs arriving at
each time step (hence the name “discretized”).
Spark Streaming provides a high-level abstraction
called a discretized stream or DStream, which represents a
continuous stream of data.
A DStream is represented by a continuous series of RDDs, Spark’s
abstraction of an immutable, distributed dataset. Each
RDD in a DStream contains data from a certain interval, as shown
in the following figure.
DStreams offer two types of operations:
1. Transformations: which yield a new DStream.
2. Output operations: which write data to an external system.
NOTE: DStreams provide many of the same operations available
on RDDs, plus new operations related to time, such as sliding
windows.
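For illustration, the following minimal sketch shows both kinds of operations; the StreamingContext ssc, the socket source on port 9999, and the output path are assumptions for the example.
// Transformation: yields a new DStream from an existing one.
val lines = ssc.socketTextStream("localhost", 9999)
val upper = lines.map(_.toUpperCase)
// Output operations: write the data of each batch to an external system.
upper.print()
upper.saveAsTextFiles("hdfs://namenode:8020/out/upper")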
Unlike batch programs, Spark Streaming applications need
additional setup in order to operate 24/7.
Checkpointing is the main mechanism Spark Streaming provides for
this purpose; it lets the application store data in a reliable file
system such as HDFS.
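A minimal sketch of enabling checkpointing follows; the HDFS path is illustrative.
// Point the StreamingContext at a directory on a reliable file system so
// that metadata and state can be checkpointed (path is illustrative).
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")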
NOTE: In Spark 1.1, Spark Streaming is available only in Java
and Scala. Experimental Python support was added in Spark 1.2.
Transform
The transform method returns a DStream by applying an
RDD => RDD function to each RDD in the source DStream.
It takes as argument a function that takes an RDD as
argument and returns an RDD.
Thus, it gives us direct access to the underlying RDDs of a
DStream.
This method allows you to use methods provided by the RDD
API that do not have equivalent operations in the
DStream API. For example, sortBy is a transformation
available in the RDD API, but not in the DStream API.
If you want to sort the elements within each RDD of a
DStream, you can use the transform method as shown in the
following example.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap{line => line.split(" ")}
val sorted = words.transform{rdd => rdd.sortBy((w)=> w)}
NOTE: The transform method is also useful for applying machine
learning and graph computation algorithms to data streams. The machine
learning and graph processing libraries provide classes and
methods that operate at the RDD level. Within the transform
method, you can use the API provided by these libraries.
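As an illustration, the sketch below applies a toy MLlib KMeansModel inside transform, reusing the lines DStream from the previous snippet; the two cluster centers and the comma-separated input format are assumptions made for the example.
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors
// A toy model with two fixed cluster centers (in practice the model would
// be trained offline with MLlib's KMeans).
val model = new KMeansModel(Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0)))
// Assume each incoming line carries two comma-separated numeric features.
val points = lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
// transform exposes the underlying RDD, so MLlib's RDD-level predict
// method can be applied to each micro-batch.
val clusterIds = points.transform(rdd => model.predict(rdd))
clusterIds.print()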
Spark Streaming Use Case:
We will receive a stream of newline-delimited lines of text from a
server running at port 7777, filter only the lines that contain the
word error, and print them.
Maven coordinates for Spark Streaming
groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.2.0
Scala streaming imports
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
StreamingContext:
Main entry point for streaming functionality.
This also sets up an underlying SparkContext that it will use
to process the data.
It takes as input a batch interval specifying how often to
process new data, which we set to 1 second.
socketTextStream()
We use socketTextStream() to create a
DStream based on text data received on port 7777 of the
local machine.
Then we transform the DStream with filter() to get only the lines
that contain error
Finally, we apply the output operation print() to print some of
the filtered lines.
// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream using data received after connecting to port 7777
// on the local machine
val lines = ssc.socketTextStream("localhost", 7777)
// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines with errors
errorLines.print()
For example, in the earlier example of converting a stream of
lines to words, the flatMap operation is applied to each RDD in
the lines DStream to generate the RDDs of the words DStream.
This is shown in the following figure.
Input DStreams and Receivers
Input DStreams are DStreams representing the stream of input
data received from streaming sources. In the earlier example of
converting a stream of lines into words, lines was an input
DStream, as it represented the stream of data received from the
netcat server.
Every input DStream is associated with a Receiver object, which
receives the data from a source and stores it in Spark’s memory
for processing.
Spark Streaming provides two categories of built-in streaming
sources.
Basic sources: Sources directly available in the
StreamingContext API. Example: file systems, socket
connections, and Akka actors.
Advanced sources: Sources like Kafka, Flume, Kinesis,
Twitter, etc. are available through extra utility classes. These
require linking against extra dependencies.
Basic Sources
File Streams: For reading data from files on any file system
compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a
DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and
process any files created in that directory (files written in nested
directories are not supported). Note that:
The files must have the same data format.
The files must be created in dataDirectory by
atomically moving or renaming them into the data directory.
Once moved, the files must not be changed; if the files are
being continuously appended, the new data will not be read.
For simple text files, there is an easier
method, streamingContext.textFileStream(dataDirectory).
File streams also do not require running a receiver, so they do
not require allocating cores.
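A minimal sketch of the text-file variant follows; the directory path is illustrative.
// Watch a directory of plain-text files and count the lines that arrive
// in each batch (path is illustrative).
val logLines = ssc.textFileStream("hdfs://namenode:8020/incoming/logs")
logLines.count().print()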
Advanced Sources
This category of sources requires interfacing with external
non-Spark libraries, some of them with complex dependencies
(e.g., Kafka and Flume).
Hence, to minimize issues related to version conflicts of
dependencies, the functionality to create DStreams from these
sources has been moved to separate libraries.
Example:
If you want to create a DStream using data from Twitter’s stream
of tweets, you have to do the following.
Linking: Add the artifact spark-streaming-twitter_2.10 to the
SBT/Maven project dependencies
Programming: Import the TwitterUtils class and create a
DStream with TwitterUtils.createStream as shown below
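A minimal sketch follows, assuming the Twitter4J OAuth credentials have already been provided (for example, via the twitter4j.oauth.* system properties), so None is passed for the authorization; the filter keywords are illustrative.
import org.apache.spark.streaming.twitter.TwitterUtils
// Create a DStream of tweets matching the given (illustrative) keywords.
val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "streaming"))
// Print the text of a few tweets from each batch.
tweets.map(status => status.getText).print()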
Some of these advanced sources are as
follows.
Twitter: Spark Streaming’s TwitterUtils uses Twitter4j 3.0.3 to
get the public stream of tweets using Twitter’s Streaming API.
Authentication information can be provided by any of
the methods supported by the Twitter4J library. You can either get
the public stream, or get a filtered stream based on keywords.
Flume: Spark Streaming 1.2.0 can receive data from Flume
1.4.0.
Kafka: Spark Streaming 1.2.0 can receive data from Kafka
0.8.0.
Window Operations
Spark Streaming also provides windowed computations, which
allow you to apply transformations over a sliding window of data.
The following figure explains this sliding window.
Scala Code
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
NOTE: The above code generates word counts over the last 30
seconds of data, every 10 seconds. To do this, we apply the
reduceByKey operation on the pairs DStream of (word, 1) pairs
over the last 30 seconds of data. This is done using the
operation reduceByKeyAndWindow.
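For context, here is a fuller sketch showing where a pairs DStream like the one above could come from; the socket source on port 9999 is an assumption for the example.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Word counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()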
Performance Considerations
Batch and Window Sizes
The most common question is what minimum batch size Spark
Streaming can use. In general, 500 milliseconds has proven to be
a good minimum size for many applications.
The best approach is to start with a larger batch size (around 10
seconds) and work your way down to a smaller batch size. If the
processing times reported in the Streaming UI remain consistent,
then you can continue to decrease the batch size.
Level of Parallelism
A common way to reduce the processing time of batches is to
increase the parallelism. There are three ways to increase the
parallelism.
1. Increasing the number of receivers: receive data in parallel by
creating multiple input DStreams, each with its own receiver.
2. Explicitly repartitioning received data: repartition the input
stream (or the union of multiple streams) using DStream.repartition.
3. Increasing parallelism in aggregation: for operations like
reduceByKey(), you can specify the level of parallelism as a second
parameter, as shown in the sketch below.
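In the sketch below, the source, port, and partition counts are illustrative.
// 1. Increase the number of receivers: create several input DStreams
//    (one receiver each) and union them into a single stream.
val inputStreams = (1 to 3).map(_ => ssc.socketTextStream("localhost", 7777))
val unioned = ssc.union(inputStreams)
// 2. Explicitly repartition the received data across more cores.
val repartitioned = unioned.repartition(8)
// 3. Increase parallelism in aggregation by passing the number of
//    partitions as a second parameter to reduceByKey().
val counts = repartitioned
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 8)
counts.print()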