Chapter 6
Batch processing - part 2
Apache Spark
A unified analytics engine for large-scale data processing
MapReduce: Iterative jobs
• Iterative jobs involve a lot of disk I/O for each repetition
• → Disk I/O is very slow!
[Figure: typical hardware bandwidths and costs — CPU to memory: ~10 GB/s; disk: ~100 MB/s, 3-12 ms random access, ~$0.025 per GB; SSD: ~600 MB/s, ~0.1 ms random access, ~$0.35 per GB; network to nodes in the same or another rack: ~1 Gb/s (125 MB/s), down to ~0.1 Gb/s]
RAM is the new disk
A unified analytics engine for large-scale
data processing
• Better support for
• Iterative algorithms
• Interactive data mining
• Fault tolerance, data locality, scalability
• Hide complexity: spare users from coding the distributed machinery themselves
Memory instead of disk
[Figure: iterative processing — MapReduce writes intermediate results to HDFS after each step, whereas Spark keeps them in memory]
Spark and MapReduce differences

                   Apache Hadoop MR    Apache Spark
Storage            Disk only           In-memory or on disk
Operations         Map and Reduce      Many transformations and actions, including Map and Reduce
Execution model    Batch               Batch, iterative, streaming
Languages          Java                Scala, Java, Python and R
Apache Spark vs Apache Hadoop
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Resilient Distributed Dataset (RDD)
• RDDs are fault-tolerant, parallel data structures that
let users explicitly persist intermediate results in
memory, control their partitioning to optimize data
placement, and manipulate them using a rich set of
operators.
• Coarse-grained transformations vs. fine-grained updates
• Operators such as map, filter and join apply the same operation to many data items at once (see the sketch below)
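As a quick illustration, a minimal PySpark sketch of such coarse-grained operations; it assumes an existing SparkContext sc, and the data is made up:

# Each transformation applies one function to every element of the
# distributed dataset, rather than updating individual items in place.
nums = sc.parallelize([1, 2, 3, 4, 5, 6])         # distributed collection
doubled = nums.map(lambda x: x * 2)               # same operation on all items
multiples = doubled.filter(lambda x: x % 4 == 0)  # keep a subset
print(multiples.collect())                        # [4, 8, 12]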
More partitions = more parallelism
[Figure: an RDD of 25 items split into 5 partitions, distributed across executors (Ex) on worker nodes (W)]
RDD with 4 partitions
[Figure: logLinesRDD with 4 partitions, each holding a mix of Error, Warn, and Info log lines]
A base RDD can be created in 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc.)
Parallelize
• Take an existing in-memory collection and pass it to SparkContext's parallelize method
• Not generally used outside of prototyping and testing, since it requires the entire dataset to be in memory on one machine

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
Read from Text File
• There are other methods to read data from HDFS, C*, S3, HBase, etc.

// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")

# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Operations on Distributed Data
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformations are executed when an action is run
• Persist (cache) distributed data in memory or on disk
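A minimal PySpark sketch of this lazy behaviour, using the log-lines example from the following slides (the file path and the SparkContext sc are assumptions):

logLinesRDD = sc.textFile("/path/to/app.log")                           # transformation: nothing is read yet
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))   # transformation: still lazy
cleanedRDD = errorsRDD.coalesce(2)                                      # transformation: still lazy
results = cleanedRDD.collect()                                          # action: the whole chain executes now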
Transformation: Filter
[Figure: logLinesRDD (input/base RDD) → .filter(λ) → errorsRDD; only the Error lines from each partition are kept]
Action: Collect
[Figure: errorsRDD → .coalesce(2) → cleanedRDD (2 partitions) → .collect() returns the results to the Driver]
DAG execution
[Figure: calling .collect() makes the Driver build and execute the DAG]
Logical plan
logLinesRDD → .filter(λ) → errorsRDD → .coalesce(2) → cleanedRDD → .collect() → Driver
Physical plan
[Figure: the Driver schedules tasks that compute the partitions of logLinesRDD, errorsRDD, and cleanedRDD on the executors]
DAG
[Figure: lineage graph — logLinesRDD → errorsRDD → cleanedRDD; cleanedRDD is written out with .saveAsTextFile(), and a further .filter(λ) produces errorMsg1RDD, on which .count() and .collect() are run]
Cache
[Figure: the same lineage graph with .cache() called on errorsRDD, so errorsRDD is computed once and reused by the later .saveAsTextFile(), .count(), and .collect() actions]
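In code, the effect of .cache() might look like this sketch (it reuses the hypothetical errorsRDD from the earlier sketch):

errorsRDD.cache()                                  # mark errorsRDD for in-memory persistence
errorsRDD.count()                                  # first action computes errorsRDD and caches it
errorsRDD.filter(lambda l: "msg1" in l).count()    # later actions reuse the cached partitions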
Partition → Task → Partition
[Figure: each partition of logLinesRDD (HadoopRDD) is processed by its own task (Task-1 … Task-4) running .filter(λ), producing the corresponding partition of errorsRDD (filteredRDD)]
RDD Lineage
Resilient Distributed Dataset (RDD)
• Initial RDDs reside on disk (HDFS, etc.)
• Intermediate RDDs reside in RAM
• Fault recovery is based on lineage
• RDD operations are distributed
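The lineage that recovery relies on can be inspected from the driver; a small sketch (again using the hypothetical errorsRDD from earlier):

# toDebugString() returns a description of this RDD and its parent RDDs;
# lost partitions are rebuilt by replaying exactly this chain.
print(errorsRDD.toDebugString())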
DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Track lineage information to efficiently re-compute lost data
• Enable operations on collections of elements in parallel
• To construct DataFrame
• By parallelizing existing Python collections (lists)
• By transforming an existing Spark or pandas DataFrame
• From files in HDFS or other storage system
Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df1.collect()
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]
Transformations
• Create a new DataFrame from an existing one
• Use lazy evaluation
• Nothing executes immediately; Spark saves a recipe for transforming the source
Transformation       Description
select(*cols)        Selects columns from this DataFrame
drop(col)            Returns a new DataFrame that drops the specified column
filter(func)         Returns a new DataFrame formed by selecting those rows of the source on which func returns true
where(func)          where is an alias for filter
distinct()           Returns a new DataFrame that contains the distinct rows of the source DataFrame
sort(*cols, **kw)    Returns a new DataFrame sorted by the specified columns and in the sort order specified by kw
Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
>>> df2.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort("age", ascending=False)
>>> df3.collect()
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]
Actions
• Cause Spark to execute recipe to transform source
• Mechanisms for getting results out of Spark
Action               Description
show(n, truncate)    Prints the first n rows of this DataFrame
take(n)              Returns the first n rows as a list of Row
collect()            Returns all the records as a list of Row (*)
count()              Returns the number of rows in this DataFrame
describe(*cols)      Exploratory data analysis function that computes statistics (count, mean, stddev, min, max) for numeric columns
Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print(linesDF.count(), commentsDF.count())
>>> commentsDF.cache()
Spark Programming Routine
• Create DataFrames from external data, or with createDataFrame from a collection in the driver program
• Lazily transform them into new DataFrames
• cache() some DataFrames for reuse
• Perform actions to execute parallel computation and produce results (see the sketch below)
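The routine end to end, as a minimal PySpark sketch; the input path is made up, and the isComment predicate from the Caching slide is assumed to be a Column expression defined elsewhere:

linesDF = sqlContext.read.text("/path/to/input.txt")   # 1. create a DataFrame from external data
commentsDF = linesDF.filter(isComment)                 # 2. lazily transform it
commentsDF.cache()                                     # 3. cache a DataFrame that is reused
print(commentsDF.count())                              # 4. actions run the parallel computation
commentsDF.show(5)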
DataFrames versus RDDs
• For new users familiar with data frames in other
programming languages, this API should make them
feel at home
• For existing Spark users, the API will make Spark
easier to program than using RDDs
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation
Write Less Code: Input & Output
Unified interface to reading and writing data in a variety of formats.

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")
• read and write create new builders for doing I/O
• Builder methods specify the format, partitioning, and handling of existing data
• load(…), save(…), or saveAsTable(…) finish the I/O specification
Data Sources supported by DataFrames
[Figure: built-in and external data sources — JSON, JDBC, and more …]
Write Less Code: High-Level Operations
• Solve common problems concisely with DataFrame
functions:
• selecting columns and filtering
• joining different data sources
• aggregation (count, sum, average, etc.)
• plotting results (e.g., with Pandas)
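A short PySpark sketch of these operations; the people and cities DataFrames and their columns are hypothetical:

from pyspark.sql import functions as F

adults = people.select("name", "age", "city_id").filter(F.col("age") >= 18)   # select columns + filter rows
joined = adults.join(cities, adults.city_id == cities.id)                     # join two data sources
stats = joined.groupBy("city").agg(F.count("*").alias("n"),                   # aggregation per group
                                   F.avg("age").alias("avg_age"))
stats.toPandas().plot(x="city", y="avg_age", kind="bar")                      # plot the small result via pandas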
Write Less Code: Compute an Average

Hadoop MapReduce (Java):
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}
----------------------------------------------------------------------------------
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Scala):
rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Spark (Python):
rdd = sc.textFile(...).map(lambda s: s.split())
rdd.map(lambda x: (x[0], (float(x[1]), 1))).\
  reduceByKey(lambda t1, t2: (t1[0] + t2[0], t1[1] + t2[1])).\
  map(lambda t: (t[0], t[1][0] / t[1][1])).\
  collect()
Write Less Code: Compute an Average

Using RDDs (Scala):
rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Using DataFrames (Scala):
import org.apache.spark.sql.functions._
val df = rdd.map(a => (a(0), a(1))).toDF("key", "value")
df.groupBy("key")
  .agg(avg("value"))
  .collect()

Full API docs: Scala, Java, Python, R
Architecture
• A master-worker type architecture
• A driver or master node
• Worker nodes
• The master sends work to the workers and either instructs them to pull data from memory or from disk (or from another source like S3 or HDFS)
Architecture(2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a cluster
• The master parameter for a SparkContext determines which
type and size of cluster to use
Master parameter     Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster
mesos://HOST:PORT    Connect to a Mesos cluster
yarn                 Connect to a YARN cluster
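For example, a driver program might create its SparkContext like this sketch (the app name and thread count are arbitrary choices):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("chapter6-demo")   # 4 local worker threads
sc = SparkContext(conf=conf)                                           # connects the driver to the chosen cluster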
Lifetime of a Job in Spark
Demo
References
• Zaharia, Matei, et al. "Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing." 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012.
• Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.
• Zaharia, Matei, et al. "Discretized Streams: Fault-tolerant streaming computation at scale." Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013.
• Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big data processing made simple. O'Reilly Media, Inc., 2018.
Thank you for your attention!