Preface
Content of this lecture:
In this lecture, we will discuss the framework of Apache Spark, Resilient Distributed Datasets (RDDs), and Spark execution.
Need of Spark
Apache Spark is a big data analytics framework that was originally developed at the University of California, Berkeley's AMPLab, in 2009. Since then, it has gained a lot of traction both in academia and in industry.
It is another system for big data analytics.
Isn't MapReduce good enough? MapReduce simplifies batch processing on large commodity clusters, but it falls short for other kinds of workloads.
Need of Spark
[Figure: the MapReduce dataflow, Input -> Map -> Reduce -> Output. Each stage saves its results to disk; this save to disk provides fault tolerance but is expensive.]
Need of Spark
MapReduce can be expensive for some applications, e.g.:
Iterative
Interactive
It lacks efficient data sharing, so specialized frameworks evolved for different programming models:
Bulk Synchronous Processing (e.g., Pregel)
Iterative MapReduce (e.g., HaLoop) ...
Solution: Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs):
Immutable, partitioned collections of records
Built through coarse-grained transformations (map, join, ...)
Can be cached for efficient reuse
Need of Spark
[Figure: a chain of RDDs. Data is read from HDFS into an RDD, transformed by Map and Reduce steps, and intermediate RDDs are kept in an in-memory cache instead of being written back to disk.]
Solution: Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs):
Immutable, partitioned collections of records
Built through coarse-grained transformations (map, join, ...)
Fault recovery? Lineage!
Log the coarse-grained operations applied to a partitioned dataset.
Simply recompute the lost partition if a failure occurs!
No cost if no failure.
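As a minimal sketch (the HDFS path and filter string are placeholders), the lineage of an RDD can be inspected from the Spark shell:

    val lines  = sc.textFile("hdfs://...")           // base RDD backed by HDFS blocks
    val errors = lines.filter(_.contains("ERROR"))   // coarse-grained transformation
    println(errors.toDebugString)                    // prints the chain of transformations (the lineage)
    // If a partition of `errors` is lost, Spark re-reads the corresponding
    // HDFS block and re-applies the filter to that partition only.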
[Figure: the RDD chain again (Read from HDFS -> Map -> Reduce, with a Cache). RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.]
What can you do with Spark?
RDD operations:
Transformations, e.g., filter, join, map, groupBy, ...
Actions, e.g., count, print, ...
Control:
Partitioning: Spark also gives you control over how your RDDs are partitioned.
Persistence: allows you to choose whether to persist an RDD in memory or on disk (see the sketch below).
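A minimal sketch of these operation classes (the data, names, and partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()                        // transformation: lazy
    grouped.count()                                         // action: triggers execution

    val byKey = pairs.partitionBy(new HashPartitioner(8))   // control: partitioning
    byKey.persist(StorageLevel.MEMORY_AND_DISK)             // control: persistence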
Spark Applications
i. Twitter spam classification
ii. EM algorithm for traffic prediction
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
vi. SQL on Spark
Reading Material
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, "Spark: Cluster Computing with Working Sets".
Matei Zaharia, Mosharaf Chowdhury, et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing".
https://spark.apache.org/
Spark Execution
Distributed Programming (Broadcast)
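The original slide shows a figure; as a minimal sketch (the lookup table is made up), a broadcast variable ships a read-only value to every worker once:

    val countryNames = sc.broadcast(Map("CH" -> "Switzerland", "FR" -> "France"))
    val codes = sc.parallelize(Seq("CH", "FR", "CH"))
    // each task reads the broadcast value locally instead of shipping
    // the table with every closure
    codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()
    // => Array(Switzerland, France, Switzerland)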
Distributed Programming (Take)
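Again a minimal, illustrative sketch: take brings only the first n elements back to the driver, unlike collect:

    val numbers = sc.parallelize(1 to 1000000)
    numbers.take(5)        // fetches just 5 elements to the driver
    // numbers.collect() would pull the entire dataset to the driver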
Distributed Programming (DAG Action)
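A minimal sketch (the path is a placeholder): transformations only grow the DAG; the action at the end submits it for execution:

    val lines  = sc.textFile("hdfs://...")
    val errors = lines.filter(_.contains("ERROR"))   // nothing runs yet
    val words  = errors.flatMap(_.split(" "))        // the DAG keeps growing
    words.count()                                    // action: submits the whole DAG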
Distributed Programming (Shuffle)
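A minimal sketch with made-up data: reduceByKey must co-locate all values of a key, so it shuffles data across the network between stages:

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val counts = pairs.reduceByKey(_ + _)   // shuffle boundary between stages
    counts.collect()                        // => Array((a,4), (b,2))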
DAG (Directed Acyclic Graph)
Actions: Count, Take, Foreach
Transformations: Map, ReduceByKey, GroupByKey, Join
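A compact tour of the operations listed above, with made-up data:

    val words  = sc.parallelize(Seq("spark", "rdd", "spark"))
    val pairs  = words.map(w => (w, 1))       // transformation
    val counts = pairs.reduceByKey(_ + _)     // transformation (shuffles)
    val groups = pairs.groupByKey()           // transformation (shuffles)
    val joined = counts.join(groups)          // transformation: join on key
    counts.count()                            // action
    counts.take(2)                            // action
    counts.foreach(println)                   // action (println runs on the workers)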
FlumeJava
Spark Implementation
Spark ideas
An expressive computing system, not limited to the map-reduce model.
Exploits system memory:
avoid saving intermediate results to disk
cache data for repetitive queries (e.g., for machine learning; see the sketch below)
Compatible with Hadoop.
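A minimal, made-up sketch of why caching pays off for iterative jobs:

    val data = sc.parallelize(1 to 1000).map(_.toDouble).cache()
    var weight = 0.0
    for (i <- 1 to 10) {
      // every iteration re-reads `data` from memory, not from disk
      weight += data.map(x => x * 0.001).sum() / data.count()
    }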
RDD abstraction
Resilient Distributed Datasets
Partitioned collection of records
Spread across the cluster
Read-only
Datasets can be cached in memory:
different storage levels are available
fallback to disk is possible (see the sketch below)
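A minimal sketch of choosing a storage level (the path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs://...")
    lines.persist(StorageLevel.MEMORY_AND_DISK)   // fall back to disk when memory is short
    // other levels exist, e.g. MEMORY_ONLY (the default) or DISK_ONLY
    lines.count()                                 // first action materializes the cache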
RDD operations
Transformations build RDDs through deterministic operations on other RDDs:
transformations include map, filter, join
lazy operations
Actions return a value to the driver or export data:
actions include count, collect, save
an action triggers execution (see the sketch below)
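A minimal sketch of the kinds of actions named above (paths are placeholders):

    val lines  = sc.textFile("hdfs://...")
    val errors = lines.filter(_.contains("ERROR"))   // lazy: not executed yet
    errors.count()                        // action: returns a value to the driver
    errors.collect()                      // action: fetches all records to the driver
    errors.saveAsTextFile("hdfs://...")   // action: exports data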
Spark Components
Job example
val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
errors.filter(_.contains("I/O")).count()       // Action!
errors.filter(_.contains("timeout")).count()   // Action!
[Figure: the driver runs the queries above; each worker holds a cache (Cache1, Cache2, Cache3) over its HDFS block (Block1, Block2, Block3).]
RDD partition-level view
[Figure, dataset-level view: log is a HadoopRDD (path = hdfs://...); errors is a FilteredRDD (func = _.contains(...), shouldCache = true). Partition-level view: each partition of errors is computed by its own task (Task 1, Task 2, ...).]
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Job scheduling
RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(...).filter(...)
DAGScheduler: splits the graph into stages of tasks; submits each stage as it becomes ready
TaskScheduler: launches tasks (as TaskSets) via the cluster manager; retries failed or straggling tasks
Worker: threads execute the tasks; the block manager stores and serves blocks
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Available APIs
You can write in Java, Scala, or Python.
Interactive interpreter: Scala and Python only.
Standalone applications: any of the three.
Performance: Java and Scala are faster thanks to static typing.
Hands-on: interpreter
script: http://cern.ch/kacper/spark.txt
run the Scala Spark interpreter:
$ spark-shell
or the Python interpreter:
$ pyspark
Hands-on: build and submission
download and unpack the source code:
$ wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
build definition in:
GvaWeather/gvaweather.sbt
source code:
GvaWeather/src/main/scala/GvaWeather.scala
building:
$ cd GvaWeather
$ sbt package
job submission:
$ spark-submit --master local --class GvaWeather \
    target/scala-2.10/gva-weather_2.10-1.0.jar
Summary
The concept is not limited to single-pass map-reduce.
Avoid storing intermediate results on disk or HDFS.
Speed up computations when reusing datasets.
Conclusion
Resilient Distributed Datasets (RDDs) provide a simple and efficient programming model.
It generalizes to a broad set of applications.
It leverages the coarse-grained nature of parallel algorithms for failure recovery.