Introduction to Apache Spark
1
Recap
MapReduce
• For easily writing applications that process vast amounts of data in parallel
on large clusters in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: run on the same set of nodes, so compute nodes and
storage nodes are the same (keeping data close to the computation) → very
high throughput
YARN & MapReduce: a single master resource manager, one slave node
manager per node, and an AppMaster per application
2
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
3
History of Hadoop and Spark
4
Apache Hadoop & Apache Spark
[Diagram: the Hadoop and Spark stacks side by side. Processing layer: MapReduce, Hive, Pig (Hadoop) vs. Spark Streaming, Spark SQL, and other Spark applications (Spark). Resource manager: Yet Another Resource Negotiator (YARN), Mesos, etc. vs. Spark Core. Data storage: Hadoop Distributed File System (HDFS), Hadoop Database (HBase), Cassandra, and other storage systems. Data ingestion systems: e.g., Apache Kafka, Flume.]
5
Apache Spark
** Spark can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN)
[Diagram: the Spark stack. Processing layer: Spark Streaming, Spark SQL, Spark ML, and other Spark applications on top of Spark Core (standalone scheduler), which can also run on Yet Another Resource Negotiator (YARN) or Mesos. Data storage: Hadoop Distributed File System (HDFS), Hadoop NoSQL Database (HBase), Cassandra, S3, and other storage systems. Data ingestion systems: e.g., Apache Kafka, Flume.]
6
Apache Hadoop: No Unified Vision
• Sparse Modules
• Diversity of APIs
• Higher Operational Costs
7
Spark Ecosystem: A Unified Pipeline
8
Spark vs MapReduce: Data Flow
9
Data Access Rates
• Within a node:
CPU to memory: 10 GB/sec
CPU to SSD: 0.6 GB/sec
CPU to hard disk: 0.1 GB/sec
• Between nodes over the network:
Nodes in the same rack: 0.125 GB/sec to 1 GB/sec
Nodes between racks: 0.1 GB/sec
10
Spark: High Performance & Simple Data Flow
11
Performance: Spark vs MapReduce (1)
• Iterative algorithms
Spark is faster due to a simplified data flow
Avoids materializing intermediate data on HDFS after each iteration
• Example: k-means algorithm, 1 iteration
HDFS Read
Map (assign each sample to the closest centroid)
GroupBy(Centroid_ID) → shuffle over the network
Reduce (compute new centroids)
HDFS Write
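A minimal PySpark sketch of the contrast: the points are read once and cached, and the iterations reuse the in-memory RDD instead of rereading and rewriting HDFS. parse_point, closest_centroid, and mean_point are hypothetical helper functions, and the HDFS path is a placeholder.
points = sc.textFile("hdfs://...").map(parse_point).cache()   # read input once, keep it in memory
centroids = points.takeSample(False, 3)                        # e.g. k = 3 initial centroids
for i in range(10):                                            # each iteration reuses the cached RDD
    grouped = points.map(lambda p: (closest_centroid(p, centroids), p)).groupByKey()
    centroids = grouped.mapValues(mean_point).values().collect()   # no HDFS write between iterations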
12
Performance: Spark vs MapReduce (2)
13
Code: Hadoop vs Spark (e.g., Word Count)
• Simpler, less code
• Multi-stage pipeline
• Operations
Transformations: apply user code to distributed data in parallel
Actions: assemble final output from distributed data
14
Motivation (1)
MapReduce: The original scalable, general, processing engine
of the Hadoop ecosystem
• Disk-based data processing framework (HDFS files)
• Persists intermediate results to disk
• Data is reloaded from disk with every query → Costly I/O
• Best for ETL-like workloads (batch processing)
• Costly I/O → Not appropriate for iterative or stream processing
workloads
15
Motivation (2)
Spark: General purpose computational framework that
substantially improves performance of MapReduce, but
retains the basic model
• Memory based data processing framework → avoids costly I/O by
keeping intermediate results in memory
• Leverages distributed memory
• Remembers operations applied to dataset
• Data locality based computation → High Performance
• Best for both iterative (or stream processing) and batch workloads
16
Motivation - Summary
• Software engineering point of view
Hadoop code base is huge
Contributions/Extensions to Hadoop are cumbersome
Java-only hinders wide adoption, but Java support is fundamental
• System/Framework point of view
Unified pipeline
Simplified data flow
Faster processing speed
• Data abstraction point of view
New fundamental abstraction RDD
Easy to extend with new operators
More descriptive computing model
17
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
18
Spark Basics (1)
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (easier to use):
Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
Batch, Streaming, Machine Learning, Graph
• Low Latency (Performance): in-memory processing and caching
• Fault-tolerance: faults shouldn’t be a special case
19
Spark Basics (2)
There are two ways to manipulate data in Spark
• Spark Shell:
Interactive – for learning or data exploration
Python or Scala
• Spark Applications
For large scale data processing
Python, Scala, or Java
20
Spark Shell
The Spark Shell provides interactive data exploration (REPL)
REPL: Read/Evaluate/Print Loop
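For example, the Python shell is started with the pyspark command (spark-shell for Scala) and comes with sc already defined; a minimal session:
$ pyspark
>>> sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).collect()
[2, 4, 6, 8]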
21
Spark Core: Code Base (2012)
22
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions
23
Spark Context (1)
• Every Spark application requires a spark context: the main entry
point to the Spark API
• Spark Shell provides a preconfigured Spark Context called “sc”
25
Spark Context (2)
• In a standalone application (driver program), the driver code creates the Spark Context
• The Spark Context represents the connection to a Spark cluster
26
Spark Context (3)
• The Spark Context works as a client and represents the connection to a Spark
cluster
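A minimal sketch of the driver code in a standalone PySpark application (the application name is an arbitrary example):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp")   # example configuration for this application
sc = SparkContext(conf=conf)             # the driver connects to the cluster through this context
# ... create and operate on RDDs via sc ...
sc.stop()                                # release the cluster resources when done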
27
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions
28
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark:
an immutable collection of objects (or records, or elements) that can be
operated on “in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recovers from node failures
• An RDD keeps its lineage information → it can be recreated from its parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions → more
parallelism)
Dataset -- the initial data can come from a file or be created programmatically
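A small PySpark sketch of these properties (the HDFS path is a placeholder): the second RDD remembers its parent, and both are split into partitions processed in parallel.
lines = sc.textFile("hdfs://...", minPartitions=4)   # distributed across at least 4 partitions
words = lines.flatMap(lambda line: line.split())     # new immutable RDD; its parent is remembered
words.getNumPartitions()                             # number of partitions
words.toDebugString()                                # lineage used to recreate lost partitions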
29
RDDs
Key Idea: Write applications in terms of transformations on
distributed datasets
• Collections of objects spread across a cluster and stored in a memory caching
layer: a distributed, fault-tolerant cache
• Can fall back to disk when a dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by, join, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g., caching in RAM)
30
RDDs -- Immutability
• Immutability → with lineage information an RDD can be recreated at
any time → fault tolerance
• Avoids data inconsistency problems (no simultaneous
updates) → correctness
• Can live in memory as easily as on disk → caching; safe to share
across processes/tasks → improves performance
• Tradeoff: (fault tolerance & correctness) vs (disk, memory & CPU)
31
Creating an RDD
Three ways to create an RDD (example below)
• From a file or set of files
• From data in memory
• From another RDD
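A minimal PySpark sketch of the three ways (the file path is a placeholder):
nums = sc.parallelize([1, 2, 3, 4])           # from data in memory
lines = sc.textFile("hdfs://.../input.txt")   # from a file or set of files
upper = lines.map(lambda s: s.upper())        # from another RDD, via a transformation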
32
Example: A File-based RDD
33
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions
34
RDD Operations
Two types of operations
Transformations: Define a new
RDD based on current RDD(s)
Actions: return values
35
RDD Transformations
• Set of operations on an RDD that define how its data should be
transformed
• As in relational algebra, applying a transformation to an RDD
yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows optimizations
to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(), etc.
36
Example: map and filter Transformations
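A minimal PySpark sketch of these two transformations, assuming a small RDD of integers:
nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2)          # map: apply a function to every element
small = doubled.filter(lambda x: x < 8)      # filter: keep elements satisfying a predicate
small.collect()                              # => [2, 4, 6]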
37
RDD Actions
• Apply the chain of transformations on an RDD and then perform some
additional operation (e.g., counting)
• Some actions only store data to an external data source (e.g. HDFS),
others fetch data from the RDD (and its transformation chain) upon
which the action is applied, and convey it to the driver
• Some common actions
count() – return the number of elements
take(n) – return an array of the first n elements
collect()– return an array of all elements
saveAsTextFile(file) – save to text file(s)
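A short PySpark sketch of these actions (the output path is a placeholder):
rdd = sc.parallelize(["a", "b", "c"])
rdd.count()                            # => 3
rdd.take(2)                            # => ['a', 'b']
rdd.collect()                          # => ['a', 'b', 'c']
rdd.saveAsTextFile("hdfs://.../out")   # writes one text file per partition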
38
Lazy Execution of RDDs
Data in RDDs is not processed until an
action is performed
39
Example: Log Mining
Load error messages from a log into memory, then interactively
search for various patterns:
lines = sc.textFile("hdfs://...")                        # base RDD (HadoopRDD)
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD (FilteredRDD)
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()                                         # keep the messages in memory
messages.filter(lambda s: "foo" in s).count()            # action: triggers execution
Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
44
RDD and Partitions (More Parallelism)
45
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with (input)
partitions
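The two views can be seen in PySpark with glom(), which turns each partition into a list (a minimal sketch):
rdd = sc.parallelize(range(8), 4)   # dataset view: the elements 0..7, split into 4 partitions
rdd.collect()                       # => [0, 1, 2, 3, 4, 5, 6, 7]
rdd.glom().collect()                # partition view: [[0, 1], [2, 3], [4, 5], [6, 7]]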
46
RDDs: Data Locality
•Data Locality Principle
Same as for Hadoop MapReduce
Avoids network I/O: workers should work on locally stored data
•Data Locality and Caching
First run: data not in cache, so use HadoopRDD’s locality
preferences (from HDFS)
Second run: FilteredRDD is in cache, so use its locations
If something falls out of cache, go back to HDFS
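A minimal sketch of this behaviour, assuming a log file on HDFS (placeholder path):
errors = sc.textFile("hdfs://.../log.txt").filter(lambda s: "ERROR" in s)
errors.cache()      # mark the filtered RDD for in-memory caching
errors.count()      # first run: reads from HDFS, using the HadoopRDD's locality preferences
errors.count()      # second run: served from the cached partitions' locations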
47
RDDs -- Summary
• RDDs are partitioned, locality-aware, distributed collections
RDDs are immutable
• RDDs are data structures that:
either point to a direct data source (e.g., HDFS), or
apply transformations to their parent RDD(s) to generate new
data elements
• Computations on RDDs
are represented by lazily evaluated lineage DAGs of chained RDDs
48
Lifetime of a Job in Spark
49
Anatomy of a Spark Application
[Figure: components of a Spark application with the cluster manager (YARN/Mesos).]
50
Typical RDD pattern of use
• A Hadoop job uses an RDD to transform some input object; the RDD is like a
“recipe” for generating a cooked version of the object.
• The task might further transform the RDD with additional
RDDs, in the style of a functional program.
• Eventually, some task consumes the RDD output (or perhaps
several of these RDDs) as part of a MapReduce-style
computation.
51
Spark: Key Techniques for Performance
• Spark is an “execution engine for computing RDDs” but also decides
when to perform the actual computation, where to place tasks (on the
Hadoop Cluster), and whether to cache RDD output.
• It avoids recomputing an RDD by saving its output if it will be needed
again, and arranges for tasks to run close to these cached RDDs (or
in a place where later tasks will use the same RDD output)
52
Why is this a good strategy?
• If MapReduce jobs were arbitrary programs, this wouldn’t help.
• But in fact the MapReduce model is valuable because it often applies
the same transformations again and again on input files.
• Also, MapReduce is often run again and again until a machine learning
model converges, or some huge batch of input is consumed, and by
caching RDDs, Spark can avoid wasteful effort.
53
Iterative Algorithms: Spark vs MapReduce
54
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming
55
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
56
Spark Programming (2)
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x * x)            # => {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)    # => {4}
57
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
58
Spark Programming (4)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on RDDs of
key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
59
Spark Programming (5)
Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}
pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
60
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
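Because transformations are lazy, nothing is computed until an action is applied to counts, e.g. (the output path is a placeholder):
counts.take(5)                               # triggers the computation, returns five (word, count) pairs
counts.saveAsTextFile("hdfs://.../counts")   # or write all the counts back to HDFS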
61
Example: Spark Streaming
Represents streams as a series of RDDs over time (typically sub-second
intervals, but this is configurable)
val spammers = sc.sequenceFile("hdfs://spammers.seq")
sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
62
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...)
.map(lambda t: (model.predict(t.location), 1))
.reduceByWindow("5s", lambda a, b: a + b)
63
Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for
number of tasks
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
64
MapReduce vs Spark (Summary)
• Performance:
Spark performs better when all the data fits in main memory (especially on
dedicated clusters), while MapReduce is designed for data that doesn’t fit in
memory
• Ease of Use:
Spark is easier to use compared to Hadoop MapReduce as it comes with
user-friendly APIs for Scala (its native language), Java, Python, and Spark
SQL.
• Fault-tolerance:
Batch processing: Spark relies on HDFS replication
Stream processing: Spark replicates the RDDs
65
Summary
• Spark is a powerful “manager” for big data computing.
• It centers on a job scheduler for Hadoop (MapReduce) that is smart
about where to run each task: co-locating tasks with their data.
• The data objects are “RDDs”: a kind of recipe for generating a file
from an underlying data collection. RDD caching allows Spark to run
mostly from memory-mapped data, for speed.
• Online tutorials: spark.apache.org/docs/latest
66