ScaleByte
APACHE SPARK AND SCALA
www.scalebyte.com
Introduction
Spark is a framework for distributed processing that also provides in-memory computation.
It is an open source project used for fast data analytics.
It is one of Apache's top-level projects.
It provides high-level APIs in Java, Python, and Scala, with a rich set of built-in libraries.
Introduction (Contd.)
Spark runs on a cluster and can access data sources such as HDFS as well as Cassandra.
Fast processing is needed because waiting a long time, online or offline, for results is no longer acceptable; Spark was introduced to address this.
It provides high-level tools such as Spark SQL, and MLlib for machine learning.
Batch Vs Real-time (Stream) Scenario
Analytics Type Based on Input Data
Spark Real-Time Streaming
Chop up the live stream into batches of X seconds.
Spark treats each batch of data as an RDD and processes it using RDD operations.
Finally, the processed results of the RDD operations are returned in batches (a sketch follows).
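A minimal sketch of this micro-batch model, assuming Spark Streaming's DStream API; the host, port, and 2-second batch interval are placeholder values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Chop the live socket stream into 2-second micro-batches.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

// Each batch is treated as an RDD and processed with RDD-style operations.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// The processed results come back batch by batch.
counts.print()

ssc.start()
ssc.awaitTermination()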
Why Spark?
It is a powerful and scalable open source processing engine.
Features
Provides in-memory computation
Uses RDDs
Works on HDFS, S3, Cassandra, etc.
Schedules jobs faster than MapReduce
Set a new world record in large-scale sorting (the 2014 Daytona GraySort benchmark)
In-Memory Computations
Spark unified stack
What is an RDD?
Resilient Distributed Dataset [the primary core abstraction]
A collection of (potentially huge) data with the following properties (a short sketch follows the list):
Immutable (read-only)
Fault tolerant, thanks to the RDD lineage DAG
Distributed and partitioned across the cluster
Lazily evaluated
Type inferred
Cacheable
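A small sketch, assuming an active SparkContext `sc`, that ties a few of these properties together:

val rdd = sc.parallelize(1 to 1000, numSlices = 8)   // distributed and partitioned (8 partitions)
println(rdd.getNumPartitions)                        // 8
val doubled = rdd.map(_ * 2)                         // immutable: map returns a new RDD
doubled.cache()                                      // cacheable; still lazy until an action runs
println(doubled.count())                             // action: triggers the job and fills the cache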
RDD Lineage — Logical Execution Plan
It is built as transformations are applied to an RDD and forms the RDD's logical execution plan; the sketch below shows one way to inspect it.
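A quick way to see this plan, assuming an active SparkContext `sc`, is the toDebugString method on an RDD:

val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
println(evens.toDebugString)   // prints the lineage (DAG) of transformations behind this RDD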
RDD Basics
In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations (actions) on RDDs to compute a result.
There are two ways of creating RDDs:
By loading an external dataset:
val lines = sc.textFile("Readme.md")
By distributing a collection of elements/objects:
val data = 1 to 100
val dataRDD = sc.parallelize(data)
RDD Basics
Once created, there are two kinds of operations that can be performed on RDDs (a short sketch follows):
Transformations
Apply a function to an existing RDD to create a new RDD
Build up the DAG
Lazily evaluated
Do not return a value to the driver
Actions
Compute a result by applying a function to an RDD
Return a value to the driver (or write it to storage)
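A minimal sketch of the difference, assuming an active SparkContext `sc`:

val nums    = sc.parallelize(1 to 5)
val squared = nums.map(n => n * n)    // transformation: builds the DAG, nothing runs yet
val total   = squared.reduce(_ + _)   // action: runs the job and returns 55 to the driver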
RDD Basics
In some cases you want to use the contents of an RDD repeatedly; in such cases you can persist the data by using the persist() method, e.g. in Python (a Scala equivalent follows below):
lines = sc.textFile("Readme.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.persist()
pythonLines.count()
pythonLines.first()
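The same idea as a hedged Scala sketch, using an explicit storage level (MEMORY_ONLY is what cache() uses by default):

import org.apache.spark.storage.StorageLevel

val lines       = sc.textFile("Readme.txt")
val pythonLines = lines.filter(line => line.contains("Python"))
pythonLines.persist(StorageLevel.MEMORY_ONLY)   // mark the RDD for caching
println(pythonLines.count())                    // first action computes and caches the RDD
println(pythonLines.first())                    // second action reuses the cached data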
RDD Operations (Transformations)
Transformations
Transformations are operations that return new RDDs, e.g. map(), filter(), etc.
filter() transformation in Scala:
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
filter() transformation in Python:
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
RDD Operations (Transformations)
Let's use inputRDD again to search for lines with the word "warning".
We'll use another transformation, union(), to combine the lines that contain either "error" or "warning":
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
RDD Operations (Transformations)
union() is a bit different from filter(), in that it operates on two RDDs instead of one (a tiny example follows).
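A tiny sketch of union() on two small RDDs, assuming an active SparkContext `sc`:

val errors   = sc.parallelize(Seq("error: disk full"))
val warnings = sc.parallelize(Seq("warning: low memory", "warning: high load"))
val bad      = errors.union(warnings)   // one RDD holding all three lines (duplicates are kept)
println(bad.count())                    // 3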
RDD Operations (Actions)
Actions
Actions are operations performed on an RDD that return results to the driver program or store them in external storage, e.g. count(), first(), etc.
Suppose we want to print out some information about badLinesRDD; see the examples on the next slide.
RDD Operations (Actions)
Scala error count using actions:
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
Python error count using actions:
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)
RDD Operations (Actions)
Note: It is important to know that each time we call a new action, the RDD is recomputed from scratch, which is time consuming.
That is why the persist() / cache() methods are available: they let us store intermediate results / RDDs in memory for reuse.
Lazy Evaluation
Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
For the statements below, Spark does no work yet, because none of the three statements is an action:
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
Only when an action such as badLinesRDD.count() is called does Spark actually read log.txt and run the filters.
Passing Functions to Spark
word = rdd.filter(lambda s: "error" in s)   # pass an anonymous function (lambda)

def containsError(s):
    return "error" in s

word = rdd.filter(containsError)            # or pass a named function
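A hedged Scala sketch of the same idea; `rdd` is assumed to be an existing RDD[String]:

// Pass an anonymous function ...
val words1 = rdd.filter(s => s.contains("error"))

// ... or pass a named function.
def containsError(s: String): Boolean = s.contains("error")
val words2 = rdd.filter(containsError)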
RDD Transformations
RDD Actions
Quick architectural overview
Your program acts as the driver; so does your Spark shell.
The driver program is just one part of a Spark application (a minimal sketch follows).
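A minimal sketch of what makes your program a driver, assuming you run it as a standalone Scala application (spark-shell already creates a SparkContext for you as `sc`):

import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver program owns the SparkContext and coordinates the job.
    val conf = new SparkConf().setAppName("MyDriverApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // work is scheduled from the driver
    sc.stop()
  }
}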
SPARK Architecture
A Spark application starts on a two-node cluster.
The driver program contacts the master for resources.
Next, the master contacts the worker nodes.
The worker nodes create executors.
The executors then connect directly to the driver, and all further communication happens between the driver and the executors.
Major Industries leveraging Analytics
Before We Go Ahead
Our area of interest is real-time analytics.
Let's explore tools that can give low latency and high throughput for real-time analytics.
Most Popular Real-Time Analytics Tools
Ideal Tool for Real-Time Analytics
Apache Flink: Ideal Tool for Real-Time Analytics
Apache Flink
Apache Flink is an open source platform: a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams.
Flink is a top-level project of Apache.
Flink is a scalable data analytics framework that is fully compatible with Hadoop.
Flink can execute both stream processing and batch processing easily (see the sketch below).
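A hedged sketch of a Flink streaming job in Scala (the socket source, host, and port are placeholders; the API shown assumes the Flink 1.3-era Scala DataStream API):

import org.apache.flink.streaming.api.scala._

object FlinkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env  = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)        // key by the word
      .sum(1)          // running count per word

    counts.print()
    env.execute("Flink Word Count Sketch")
  }
}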
Apache Flink
Flink is still at an early stage, i.e. it has not yet been explored by much of the analytics community.
Many companies are migrating to it.
Version available for download: Apache Flink 1.3.0
Features of Apache Flink
i. Low Latency and High Performance
Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate. It processes data at lightning-fast speed and is also called the 4G of Big Data.
ii. Fault Tolerance
The fault-tolerance mechanism provided by Apache Flink is based on Chandy-Lamport distributed snapshots; this mechanism provides strong consistency guarantees.
iii. Memory Management
Memory management in Apache Flink provides control over how much memory is used by certain runtime operations.
iv. Iterations
Apache Flink provides dedicated support for iterative algorithms (machine learning, graph processing).
v. Integration
Apache Flink can be easily integrated with other open source data processing ecosystems: it can be integrated with Hadoop, can stream data from Kafka, and can run on YARN.
The Strength of Flink comes from its Architecture
Lambda Architecture
Other Users of Apache Flink
Conclusion: Now is the Time for Apache Flink
Big Data Processing Tool