Chapter 6
Batch processing - part 2
Apache Spark
A unified analytics engine for large-scale data processing
MapReduce: Iterative jobs
• Iterative jobs involve a lot of disk I/O for each repetition
• → Disk I/O is very slow!
[Figure: typical hardware bandwidths and costs — CPU to memory: ~10 GB/s; disk: ~100 MB/s, 3-12 ms random access, ~$0.025 per GB; SSD: ~600 MB/s, ~0.1 ms random access, ~$0.35 per GB; network to nodes in the same or another rack: ~1 Gb/s (125 MB/s), down to ~0.1 Gb/s]
RAM is the new disk
A unified analytics engine for large-scale
data processing
• Better support for
• Iterative algorithms
• Interactive data mining
• Fault tolerance, data locality, scalability
• Hide complexity: spare users from coding the distributed machinery themselves
Memory instead of disk
[Figure: iterative processing — MapReduce writes intermediate results to HDFS after each step, whereas Spark keeps them in memory]
Spark and MapReduce differences

                   Apache Hadoop MR    Apache Spark
Storage            Disk only           In-memory or on disk
Operations         Map and Reduce      Many transformations and actions, including Map and Reduce
Execution model    Batch               Batch, iterative, streaming
Languages          Java                Scala, Java, Python and R
Apache Spark vs Apache Hadoop
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
Resilient Distributed Dataset (RDD)
• RDDs are fault-tolerant, parallel data structures that
let users explicitly persist intermediate results in
memory, control their partitioning to optimize data
placement, and manipulate them using a rich set of
operators.
• Coarse-grained transformations vs. fine-grained updates
• Operators such as map, filter and join apply the same operation to many data items at once (see the sketch below)
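As a quick illustration, a minimal PySpark sketch of such coarse-grained operations; it assumes an existing SparkContext sc, and the data is made up:

# Each transformation applies one function to every element of the
# distributed dataset, rather than updating individual items in place.
nums = sc.parallelize([1, 2, 3, 4, 5, 6])         # distributed collection
doubled = nums.map(lambda x: x * 2)               # same operation on all items
multiples = doubled.filter(lambda x: x % 4 == 0)  # keep a subset
print(multiples.collect())                        # [4, 8, 12]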
More partitions = more parallelism
[Figure: an RDD of 25 items split into 5 partitions, distributed across executors (Ex) on worker nodes (W)]
RDD with 4 partitions
[Figure: logLinesRDD with 4 partitions, each holding a mix of Error, Warn, and Info log lines]
A base RDD can be created in 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc.)
Parallelize
• Take an existing in-memory collection and pass it to SparkContext's parallelize method
• Not generally used outside of prototyping and testing, since it requires the entire dataset to be in memory on one machine

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
Read from Text File
• There are other methods to read data from HDFS, C*, S3, HBase, etc.

// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")

# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Operations on Distributed Data
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformations are executed when an action is run
• Persist (cache) distributed data in memory or on disk
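A minimal PySpark sketch of this lazy behaviour, using the log-lines example from the following slides (the file path and the SparkContext sc are assumptions):

logLinesRDD = sc.textFile("/path/to/app.log")                           # transformation: nothing is read yet
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))   # transformation: still lazy
cleanedRDD = errorsRDD.coalesce(2)                                      # transformation: still lazy
results = cleanedRDD.collect()                                          # action: the whole chain executes now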
Transformation: Filter
[Figure: logLinesRDD (input/base RDD) → .filter(λ) → errorsRDD; only the Error lines from each partition are kept]
Action: Collect
[Figure: errorsRDD → .coalesce(2) → cleanedRDD (2 partitions) → .collect() returns the results to the Driver]
DAG execution
[Figure: calling .collect() makes the Driver build and execute the DAG]
Logical plan
logLinesRDD → .filter(λ) → errorsRDD → .coalesce(2) → cleanedRDD → .collect() → Driver
Physical plan
[Figure: the Driver schedules tasks that compute the partitions of logLinesRDD, errorsRDD, and cleanedRDD on the executors]
DAG
[Figure: lineage graph — logLinesRDD → errorsRDD → cleanedRDD; cleanedRDD is written out with .saveAsTextFile(), and a further .filter(λ) produces errorMsg1RDD, on which .count() and .collect() are run]
Cache
[Figure: the same lineage graph with .cache() called on errorsRDD, so errorsRDD is computed once and reused by the later .saveAsTextFile(), .count(), and .collect() actions]
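In code, the effect of .cache() might look like this sketch (it reuses the hypothetical errorsRDD from the earlier sketch):

errorsRDD.cache()                                  # mark errorsRDD for in-memory persistence
errorsRDD.count()                                  # first action computes errorsRDD and caches it
errorsRDD.filter(lambda l: "msg1" in l).count()    # later actions reuse the cached partitions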
Partition → Task → Partition
[Figure: each partition of logLinesRDD (HadoopRDD) is processed by its own task (Task-1 … Task-4) running .filter(λ), producing the corresponding partition of errorsRDD (filteredRDD)]
RDD Lineage
Resilient Distributed Dataset (RDD)
• Initial RDDs reside on disk (HDFS, etc.)
• Intermediate RDDs reside in RAM
• Fault recovery is based on lineage
• RDD operations are distributed
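The lineage that recovery relies on can be inspected from the driver; a small sketch (again using the hypothetical errorsRDD from earlier):

# toDebugString() returns a description of this RDD and its parent RDDs;
# lost partitions are rebuilt by replaying exactly this chain.
print(errorsRDD.toDebugString())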
DataFrame
• A primary abstraction in Spark 2.0
• Immutable once constructed
• Track lineage information to efficiently re-compute lost data
• Enable operations on collections of elements in parallel
• To construct DataFrame
• By parallelizing existing Python collections (lists)
• By transforming an existing Spark or pandas DataFrame
• From files in HDFS or other storage system
Using DataFrame
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df1.collect()
[Row(name=u'Alice', age=1),
 Row(name=u'Bob', age=2),
 Row(name=u'Bob', age=2)]
Transformations
• Create a new DataFrame from an existing one
• Use lazy evaluation
• Nothing executes immediately; Spark saves a recipe for transforming the source
Transformation       Description
select(*cols)        Selects columns from this DataFrame
drop(col)            Returns a new DataFrame that drops the specified column
filter(func)         Returns a new DataFrame formed by selecting those rows of the source on which func returns true
where(func)          where is an alias for filter
distinct()           Returns a new DataFrame that contains the distinct rows of the source DataFrame
sort(*cols, **kw)    Returns a new DataFrame sorted by the specified columns and in the sort order specified by kw
Using Transformations
>>> data = [('Alice', 1), ('Bob', 2), ('Bob', 2)]
>>> df1 = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df2 = df1.distinct()
>>> df2.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df3 = df2.sort("age", ascending=False)
>>> df3.collect()
[Row(name=u'Bob', age=2), Row(name=u'Alice', age=1)]
Actions
• Cause Spark to execute recipe to transform source
• Mechanisms for getting results out of Spark
Action               Description
show(n, truncate)    Prints the first n rows of this DataFrame
take(n)              Returns the first n rows as a list of Row
collect()            Returns all the records as a list of Row (*)
count()              Returns the number of rows in this DataFrame
describe(*cols)      Exploratory data analysis function that computes statistics (count, mean, stddev, min, max) for numeric columns
Using Actions
>>> data = [('Alice', 1), ('Bob', 2)]
>>> df = sqlContext.createDataFrame(data, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1), Row(name=u'Bob', age=2)]
>>> df.count()
2
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+
Caching
>>> linesDF = sqlContext.read.text('…')
>>> linesDF.cache()
>>> commentsDF = linesDF.filter(isComment)
>>> print(linesDF.count(), commentsDF.count())
>>> commentsDF.cache()
Spark Programming Routine
• Create DataFrames from external data, or with createDataFrame from a collection in the driver program
• Lazily transform them into new DataFrames
• cache() some DataFrames for reuse
• Perform actions to execute parallel computation and produce results (see the sketch below)
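The routine end to end, as a minimal PySpark sketch; the input path is made up, and the isComment predicate from the Caching slide is assumed to be a Column expression defined elsewhere:

linesDF = sqlContext.read.text("/path/to/input.txt")   # 1. create a DataFrame from external data
commentsDF = linesDF.filter(isComment)                 # 2. lazily transform it
commentsDF.cache()                                     # 3. cache a DataFrame that is reused
print(commentsDF.count())                              # 4. actions run the parallel computation
commentsDF.show(5)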
DataFrames versus RDDs
• For new users familiar with data frames in other
programming languages, this API should make them
feel at home
• For existing Spark users, the API will make Spark
easier to program than using RDDs
• For both sets of users, DataFrames will improve
performance through intelligent optimizations and
code-generation
Write Less Code: Input & Output
Unified interface to reading and writing data in a variety of formats.

val df = sqlContext.
  read.
  format("json").
  option("samplingRatio", "0.1").
  load("/Users/spark/data/stuff.json")

df.write.
  format("parquet").
  mode("append").
  partitionBy("year").
  saveAsTable("faster-stuff")
• read and write create new builders for doing I/O
• Builder methods specify the format, partitioning, and handling of existing data
• load(…), save(…), or saveAsTable(…) finish the I/O specification
Data Sources supported by DataFrames
[Figure: built-in and external data sources — JSON, JDBC, and more …]
Write Less Code: High-Level Operations
• Solve common problems concisely with DataFrame
functions:
• selecting columns and filtering
• joining different data sources
• aggregation (count, sum, average, etc.)
• plotting results (e.g., with Pandas)
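A short PySpark sketch of these operations; the people and cities DataFrames and their columns are hypothetical:

from pyspark.sql import functions as F

adults = people.select("name", "age", "city_id").filter(F.col("age") >= 18)   # select columns + filter rows
joined = adults.join(cities, adults.city_id == cities.id)                     # join two data sources
stats = joined.groupBy("city").agg(F.count("*").alias("n"),                   # aggregation per group
                                   F.avg("age").alias("avg_age"))
stats.toPandas().plot(x="city", y="avg_age", kind="bar")                      # plot the small result via pandas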
Write Less Code: Compute an Average

Hadoop MapReduce (Java):
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}
----------------------------------------------------------------------------------
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Scala):
rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Spark (Python):
rdd = sc.textFile(...).map(lambda s: s.split())
rdd.map(lambda x: (x[0], (float(x[1]), 1))).\
  reduceByKey(lambda t1, t2: (t1[0] + t2[0], t1[1] + t2[1])).\
  map(lambda t: (t[0], t[1][0] / t[1][1])).\
  collect()
Write Less Code: Compute an Average

Using RDDs (Scala):
rdd = sc.textFile(...).map(_.split(" "))
rdd.map { x => (x(0), (x(1).toFloat, 1)) }.
  reduceByKey { case ((num1, count1), (num2, count2)) =>
    (num1 + num2, count1 + count2)
  }.
  map { case (key, (num, count)) => (key, num / count) }.
  collect()

Using DataFrames (Scala):
import org.apache.spark.sql.functions._
val df = rdd.map(a => (a(0), a(1))).toDF("key", "value")
df.groupBy("key")
  .agg(avg("value"))
  .collect()

Full API docs: Scala, Java, Python, R
Architecture
• A master-worker type architecture
• A driver or master node
• Worker nodes
• The master sends work to the workers and either instructs them to pull data from memory or from disk (or from another source like S3 or HDFS)
Architecture(2)
• A Spark program first creates a SparkContext object
• SparkContext tells Spark how and where to access a cluster
• The master parameter for a SparkContext determines which
type and size of cluster to use
Master parameter     Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster
mesos://HOST:PORT    Connect to a Mesos cluster
yarn                 Connect to a YARN cluster
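For example, a driver program might create its SparkContext like this sketch (the app name and thread count are arbitrary choices):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("chapter6-demo")   # 4 local worker threads
sc = SparkContext(conf=conf)                                           # connects the driver to the chosen cluster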
Lifetime of a Job in Spark
Demo
References
• Zaharia, Matei, et al. "Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing." 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012.
• Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.
• Zaharia, Matei, et al. "Discretized Streams: Fault-tolerant streaming computation at scale." Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013.
• Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big data processing made simple. O'Reilly Media, Inc., 2018.
Thank you for your attention!