Spark Introduction
Thomas Ropars
http://tropars.github.io/
2018
References
Goals of the lecture
Agenda
MapReduce
Spark internals
Distributed computing: Definition
Distributed computing: Motivation
There are several reasons why one may want to distribute data and processing:
• Scalability
  – The data do not fit in the memory/storage of one node
  – The processing power of more processors can reduce the time to solution
• Latency
  – Put computing resources close to the users to decrease latency
Increasing the processing power
Goals
• Increasing the amount of data that can be processed (weak scaling)
• Decreasing the time needed to process a given amount of data (strong scaling)
Two solutions
• Scaling up
• Scaling out
Vertical scaling (scaling up)
Idea
Increase the processing power by adding resources to existing nodes:
• Upgrade the processor (more cores, higher frequency)
• Increase memory capacity
• Increase storage capacity
Horizontal scaling (scaling out)
Idea
Increase the processing power by adding more nodes to the system
• Cluster of commodity servers
Large scale infrastructures
Programming for large-scale infrastructures
Challenges
• Performance
  – How to take full advantage of the available resources?
  – Moving data is costly
    • How to maximize the ratio between computation and communication?
• Scalability
  – How to take advantage of a large number of distributed resources?
• Fault tolerance
  – The more resources, the higher the probability of failure
  – MTBF (Mean Time Between Failures)
    • MTBF of one server = 3 years
    • MTBF of 1000 servers ≈ 19 hours (beware: over-simplified computation)
Programming in the Clouds
Cloud computing
• A service provider gives access to computing resources through an internet connection.
Architecture of a data center
Simplified
(Figure: servers grouped in racks and connected through switches)
A shared-nothing architecture
• Horizontal scaling
• No specific hardware
A hierarchical infrastructure
• Resources clustered in racks
• Communication inside a rack is more efficient than between racks
• Resources can even be geographically distributed over several datacenters
A warning about distributed computing
"You can have a second computer once you've shown you know how to use the first one." (P. Barham)
Examples
• Processing a few 10s of GB of data is often more efficient on a single machine than on a cluster of machines
• Sometimes a single-threaded program outperforms a cluster of machines (F. McSherry et al. "Scalability! But at what COST?". HotOS 2015.)
Summary of the challenges
Context of execution
• Large number of resources
• Resources can crash (or disappear)
  – Failure is the norm rather than the exception.
• Resources can be slow
Objectives
• Run until completion
  – And obtain a correct result :-)
• Run fast
Shared memory and message passing
Shared memory
• Entities share a global memory
• Communication by reading and writing to the globally shared memory
• Examples: Pthreads, OpenMP, etc.
Message passing
Dealing with failures: Checkpointing
Checkpointing
(Figure: an application's execution timeline with periodic checkpoints ckpt 1 to ckpt 4; after a failure, execution restarts from the most recent checkpoint)
About checkpointing
Limits
• Performance cost
• Difficult to implement
• The alternatives (passive or active replication) are even more costly and difficult to implement in most cases
About slow resources (stragglers)
Performance variations
• Both for the nodes and the network
• Resources shared with other users
Example: a process blocked on a receive from a slow node B cannot make progress:

Do some computation
new_data = Recv(from B)  /* blocking */
Resume computing with new_data
MapReduce at Google
References
• The Google file system, S. Ghemawat et al. SOSP 2003.
• MapReduce: simplified data processing on large clusters, J. Dean and S. Ghemawat. OSDI 2004.
Main ideas
• Data represented as key-value pairs
• Two main operations on data: Map and Reduce
• A distributed file system
  – Compute where the data are located
Use at Google
• Compute the index of the World Wide Web
• Google has moved on to other technologies
Apache Hadoop
In a few words
• Built on top of the ideas of Google
• A full data processing stack
• The core elements
  – A distributed file system: HDFS (Hadoop Distributed File System)
  – A programming model and execution framework: Hadoop MapReduce
MapReduce
• Allows expressing many parallel/distributed computational algorithms simply
MapReduce
Hadoop MapReduce
Key/Value pairs
• MapReduce manipulates sets of Key/Value pairs
• Keys and values can be of any type
Functions to apply
• The user defines the functions to apply
• In Map, the function is applied independently to each pair
• In Reduce, the function is applied to all values with the same key
A first MapReduce program
Word Count
Description
• Input: A set of lines including words
  – Pairs < line number, line content >
  – The initial keys are ignored in this example
• Output: A set of pairs < word, nb of occurrences >

Input                     Output
< 1, "aaa bb ccc" >       < "aaa", 2 >
< 2, "aaa bb" >           < "bb", 2 >
                          < "ccc", 1 >
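To make the Map and Reduce steps concrete, here is a minimal plain-Python sketch of the word-count logic (illustrative code, not the actual Hadoop API; the group-by-key step normally performed by the framework is emulated with a dictionary):

from collections import defaultdict

def map_fn(key, value):
    # key: line number (ignored), value: line content
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all the 1s emitted for that word
    yield (key, sum(values))

inputs = [(1, "aaa bb ccc"), (2, "aaa bb")]
groups = defaultdict(list)
for k, v in inputs:
    for word, one in map_fn(k, v):
        groups[word].append(one)   # emulates the shuffle/group-by-key phase
for word, values in groups.items():
    print(list(reduce_fn(word, values)))   # [('aaa', 2)], [('bb', 2)], [('ccc', 1)]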
A first MapReduce program
Word Count
(Figure: word-count dataflow. The input pairs < 1, "aaa bb ccc" >, < 2, "bb bb d" >, < 3, "d aaa bb" >, < 4, "d" > are mapped to < word, 1 > pairs, grouped by key, and reduced to < "aaa", 2 >, < "bb", 4 >, < "ccc", 1 >, < "d", 3 >)

(Figure: the same example with a combiner. On node A, the pairs < 1, "aa bb" > and < 2, "aa aa" > are mapped to < "aa", 1 >, < "bb", 1 >, < "aa", 1 >, < "aa", 1 >; a combiner pre-aggregates them locally into < "aa", 3 > and < "bb", 1 > before the shuffle and reduce phases)
Example: Web index
Description
Construct an index of the pages in which a word appears.
• Input: A set of web pages
  – Pairs < URL, content of the page >
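As an illustration, the Map and Reduce functions for this example could look as follows (plain-Python sketch, not the actual Hadoop API; function names are hypothetical):

def map_fn(url, content):
    # Emit a < word, URL > pair for every distinct word on the page
    for word in set(content.split()):
        yield (word, url)

def reduce_fn(word, urls):
    # Collect the set of pages in which the word appears
    yield (word, sorted(set(urls)))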
Running at scale
How to distribute data?
• Partitioning
• Replication

Partitioning
• Splitting the data into partitions
• Partitions are assigned to different nodes
• Main goal: Performance
  – Partitions can be processed in parallel

Replication
• Several nodes host a copy of the data
• Main goal: Fault tolerance
  – No data lost if one node crashes
Hadoop Distributed File System (HDFS)
Main ideas
• Running on a cluster of commodity servers
  – Each node has a local disk
  – A node may fail at any time
HDFS architecture
Figure from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Hadoop data workflow
Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle
Hadoop workflow: a few comments
Data movements
• Map tasks are executed on the nodes where the data blocks are hosted
  – Or on close nodes
  – It is less expensive to move computation than to move data
Hadoop workflow: a few comments
I/O operations
• Map tasks read data from disk
• Outputs of the mappers are stored in memory if possible
  – Otherwise they are flushed to disk
• The results of reduce tasks are written into HDFS

Fault tolerance
• Execution of tasks is monitored by the master node
  – Tasks are launched again on other nodes if they crash or are too slow
Apache Spark
Spark vs Hadoop
Spark added value
• Performance
  – Especially for iterative algorithms
• Interactive queries
• Supports more operations on data
• A full ecosystem (high-level libraries)
• Running on your machine or at scale

Main novelties
• Computing in memory
• A new computing abstraction: Resilient Distributed Datasets (RDD)
Programming with Spark
Many resources to get started
• https://spark.apache.org/
• https://sparkhub.databricks.com/
Starting with Spark
A very first example with pyspark
Counting lines
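The code shown on the slide is not reproduced here; a minimal pyspark session that counts the lines of a file might look like this (the file name is illustrative; sc is the SparkContext created automatically by the shell):

lines = sc.textFile("README.md")   # an RDD with one record per line
print(lines.count())               # action: returns the number of lines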
The Spark Web UI
The Spark built-in libraries
In-memory computing: Insights
See Latency Numbers Every Programmer Should Know
In-memory computing: Insights
Graph by P. Johnson
Efficient iterative computation
Main challenge
Fault Tolerance
Resilient Distributed Datasets
Transformations and actions
Fault tolerance with Lineage
Fault tolerance
• RDD partition lost
  – Replay all transformations on the subset of input data or the most recent RDD available
• Deal with stragglers
  – Generate a new copy of a partition on another node
Spark runtime
Figure by M. Zaharia et al.
• Driver
  – Executes the user program
  – Defines RDDs and invokes actions
  – Tracks RDD lineage
• Workers
  – Store RDD partitions
  – Perform transformations and actions
    • Run tasks
Persistence and partitioning
See https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

Partitions
• RDDs are automatically partitioned based on:
  – The configuration of the target platform (nodes, CPUs)
  – The size of the RDD
  – The user can also specify their own partitioning
• Tasks are created for each partition
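As a small illustration (getNumPartitions() is a standard RDD method):

rdd = sc.parallelize(range(100))     # number of partitions chosen automatically
print(rdd.getNumPartitions())
rdd4 = sc.parallelize(range(100), 4) # user-specified number of partitions
print(rdd4.getNumPartitions())       # 4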
RDD dependencies
Transformations create dependencies between RDDs.

Two kinds of dependencies
• Narrow dependencies
  – Each partition in the parent is used by at most one partition in the child
• Wide (shuffle) dependencies
  – Each partition in the parent is used by multiple partitions in the child

Impact of dependencies
• Scheduling: Which tasks can be run independently
• Fault tolerance: Which partitions are needed to recreate a lost partition
• Communication: Shuffling implies large amounts of data exchanged
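A small pyspark sketch contrasting the two kinds of dependencies (toDebugString() prints an RDD's lineage, including shuffle boundaries):

rdd = sc.parallelize(range(10), 4)                # 4 partitions
mapped = rdd.map(lambda x: (x % 3, x))            # narrow dependency: no data movement
reduced = mapped.reduceByKey(lambda a, b: a + b)  # wide dependency: requires a shuffle
print(reduced.toDebugString())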
RDD dependencies
Figure by M. Zaharia et al.
Executing transformations and actions
Lazy evaluation
• Transformations are executed only when an action is called on the corresponding RDD
• Examples of optimizations allowed by lazy evaluation
  – Read file from disk + action first(): no need to read the whole file
  – Read file from disk + transformation filter(): no need to create an intermediate object that contains all lines
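A sketch of the first() optimization mentioned above (the file name is illustrative):

lines = sc.textFile("data.txt")                # nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)  # still nothing is executed
print(errors.first())   # action: the file is only read until a matching line is found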
Persist an RDD
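The slide's example is not reproduced here; a minimal sketch using the persist()/cache() API might be (the file name is illustrative):

lines = sc.textFile("data.txt")
words = lines.flatMap(lambda s: s.split(' '))
words.persist()         # or words.cache() for the default storage level
print(words.count())    # first action: computes the RDD and caches its partitions
print(words.first())    # later actions reuse the cached partitions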
The SparkContext
What is it?
• An object representing a connection to an execution cluster
• We need a SparkContext to build RDDs

Creation
• Automatically created when running in a shell (variable sc)
• To be initialized when writing a standalone application

Initialization
• Run in local mode with nb threads = nb cores: local[*]
• Run in local mode with 2 threads: local[2]
• Run on a Spark cluster: spark://HOST:PORT
The SparkContext
Python shell
• The SparkContext is created automatically and made available as the variable sc

Python program
import pyspark
sc = pyspark.SparkContext("local[*]")
The first RDDs
Create an RDD from an existing collection
• Use of SparkContext.parallelize()
• Optional second argument to define the number of partitions

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

Create an RDD from a text file
• Use of SparkContext.textFile()

data = sc.textFile("myfile.txt")
hdfsData = sc.textFile("hdfs://myhdfsfile.txt")
Some transformations
see https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
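The slide's table of transformations is not reproduced here; as an illustration, a few common ones:

rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * 2).collect())          # [2, 4, 6, 8, 10]
print(rdd.filter(lambda x: x % 2 == 0).collect())  # [2, 4]
words = sc.parallelize(["aa bb", "cc"])
print(words.flatMap(lambda s: s.split(' ')).collect())  # ['aa', 'bb', 'cc']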
Some transformations with <K,V> pairs
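Again the slide's table is not reproduced; a sketch of a few common pair-RDD transformations:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)] (order may vary)
print(pairs.groupByKey().mapValues(list).collect())     # [('a', [1, 3]), ('b', [2])]
print(pairs.sortByKey().collect())                      # [('a', 1), ('a', 3), ('b', 2)]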
Some actions
see https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
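A sketch of a few common actions:

rdd = sc.parallelize([5, 3, 1, 2])
print(rdd.collect())                   # [5, 3, 1, 2]
print(rdd.count())                     # 4
print(rdd.first())                     # 5
print(rdd.take(2))                     # [5, 3]
print(rdd.reduce(lambda x, y: x + y))  # 11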
An example
An example with key-value pairs
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda s: s.split(' '))   # one record per word
pairs = words.map(lambda s: (s, 1))             # < word, 1 > pairs
counts = pairs.reduceByKey(lambda a, b: a + b)  # < word, nb of occurrences >
Another example with key-value pairs
Shared Variables
see https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables

Broadcast variables
• Use-case: A read-only large variable should be made available to all tasks (e.g., used in a map function)
• Costly to be shipped with each task
• Declare a broadcast variable
  – Spark will make the variable available to all tasks in an efficient way
Example with a Broadcast variable
b = sc.broadcast([1, 2, 3, 4, 5])
print(b.value)
# [1, 2, 3, 4, 5]
print(sc.parallelize([0, 0])
      .flatMap(lambda x: b.value).collect())
# [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
b.unpersist()   # remove the broadcast value from the executors
Shared Variables
Accumulator
• Use-case: Accumulate values over all tasks
• Declare an Accumulator on the driver
  – Updates by the tasks are automatically propagated to the driver.
• Default accumulator: operator '+=' on int and float
  – The user can define custom accumulator functions
Example with an Accumulator
file = sc.textFile(inputFile)
# Create an Accumulator[Int] initialized to 0
blankLines = sc.accumulator(0)

def splitLine(line):
    # Make the global variable accessible
    global blankLines
    if not line:
        blankLines += 1
    return line.split(" ")

words = file.flatMap(splitLine)
words.count()   # an action is needed: flatMap alone is lazy, so the accumulator would stay at 0
print(blankLines.value)
Additional slides
Job scheduling
Main ideas
• Tasks are run when the user calls an action
Stages in an RDD's DAG
Figure by M. Zaharia et al.