Working With RDDs in Spark
Chapter 11
201509
Course Chapters

Course Introduction
  1. Introduction
Introduction to Hadoop
  2. Introduction to Hadoop and the Hadoop Ecosystem
  3. Hadoop Architecture and HDFS
Importing and Modeling Structured Data
  4. Importing Relational Data with Apache Sqoop
  5. Introduction to Impala and Hive
  6. Modeling and Managing Data with Impala and Hive
  7. Data Formats
  8. Data File Partitioning
Ingesting Streaming Data
  9. Capturing Data with Apache Flume
Distributed Data Processing with Spark
  10. Spark Basics
  11. Working with RDDs in Spark (this chapter)
  12. Aggregating Data with Pair RDDs
  13. Writing and Deploying Spark Applications
  14. Parallel Processing in Spark
  15. Spark RDD Persistence
  16. Common Patterns in Spark Data Processing
  17. Spark SQL and DataFrames
Course Conclusion
  18. Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Working With RDDs

In this chapter you will learn:
- How RDDs are created from files or data in memory
- How to handle file formats with multi-line records
- How to use some additional operations on RDDs
Chapter Topics

Working With RDDs in Spark
Course section: Distributed Data Processing with Spark

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark
RDDs

RDDs can hold any type of element:
- Primitive types: integers, characters, booleans, etc.
- Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
- Scala/Java objects (if serializable)
- Mixed types

Some types of RDDs have additional functionality:
- Pair RDDs: RDDs consisting of key-value pairs
- Double RDDs: RDDs consisting of numeric data
Creating RDDs From Collections

You can create RDDs from collections instead of files:

    sc.parallelize(collection)

    > myData = ["Alice","Carlos","Frank","Barbara"]
    > myRdd = sc.parallelize(myData)
    > myRdd.take(2)
    ['Alice', 'Carlos']

Useful when:
- Testing
- Generating data programmatically
- Integrating
Creating RDDs from Files (1)

For file-based RDDs, use SparkContext.textFile:
- Accepts a single file, a wildcard list of files, or a comma-separated list of files. Examples:

    sc.textFile("myfile.txt")
    sc.textFile("mydata/*.log")
    sc.textFile("myfile1.txt,myfile2.txt")

- Each line in the file(s) is a separate record in the RDD

Files are referenced by absolute or relative URI:
- Absolute URI:
    file:/home/training/myfile.txt
    hdfs://localhost/loudacre/myfile.txt
- Relative URI (uses the default file system):
    myfile.txt
Creating RDDs from Files (2)

textFile maps each line in a file to a separate RDD element. For example, this four-line file:

    I've never seen a purple cow.\n
    I never hope to see one;\n
    But I can tell you, anyhow,\n
    I'd rather see than be one.\n

becomes an RDD of four string elements:

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

textFile only works with line-delimited text files. What about other formats?
Input and Output Formats (1)

Spark uses Hadoop InputFormat and OutputFormat Java classes. Some examples from core Hadoop:
- TextInputFormat / TextOutputFormat: newline-delimited text files
- SequenceFileInputFormat / SequenceFileOutputFormat
- FixedLengthInputFormat

Many implementations are available in additional libraries, e.g. AvroInputFormat / AvroOutputFormat in the Avro library.
Input and Output Formats (2)

- Specify any input format using sc.hadoopFile (or sc.newAPIHadoopFile for New API classes)
- Specify any output format using rdd.saveAsHadoopFile (or rdd.saveAsNewAPIHadoopFile for New API classes)

textFile and saveAsTextFile are convenience functions:
- textFile just calls hadoopFile, specifying TextInputFormat
- saveAsTextFile calls saveAsHadoopFile, specifying TextOutputFormat
Whole File-Based RDDs (1)

sc.textFile maps each line in a file to a separate RDD element. What about files with a multi-line input format, e.g. XML or JSON?

    sc.wholeTextFiles(directory)

- Maps the entire contents of each file in a directory to a single RDD element
- Works only for small files (each element must fit in memory)

For example, given file1.json:

    {
      "firstName":"Fred",
      "lastName":"Flintstone",
      "userid":"123"
    }

and file2.json:

    {
      "firstName":"Barney",
      "lastName":"Rubble",
      "userid":"234"
    }

wholeTextFiles produces an RDD of (filename, content) pairs:

    (file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"})
    (file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"})
    (file3.xml, )
    (file4.xml, )
Whole File-Based RDDs (2)

Python:

    > import json
    > myrdd1 = sc.wholeTextFiles(mydir)
    > myrdd2 = myrdd1.map(lambda (fname,s): json.loads(s))
    > for record in myrdd2.take(2):
    >     print record["firstName"]

Output:

    Fred
    Barney

Scala:

    > import scala.util.parsing.json.JSON
    > val myrdd1 = sc.wholeTextFiles(mydir)
    > val myrdd2 = myrdd1.map(pair =>
        JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
    > for (record <- myrdd2.take(2))
        println(record.getOrElse("firstName",null))
Some Other General RDD Operations

Single-RDD transformations:
- flatMap: maps one element in the base RDD to multiple elements
- distinct: filter out duplicates
- sortBy: use a provided function to sort

Multi-RDD transformations:
- intersection: create a new RDD with all elements found in both original RDDs
- union: add all elements of two RDDs into a single new RDD
- zip: pair each element of the first RDD with the corresponding element of the second
Example: flatMap and distinct

Python:

    > sc.textFile(file) \
        .flatMap(lambda line: line.split()) \
        .distinct()

Scala:

    > sc.textFile(file).
        flatMap(line => line.split(' ')).
        distinct()

Given the input file:

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

flatMap produces one element per word, duplicates included (I've, never, seen, ..., never, hope, to, ...); distinct then removes the duplicates.
Examples: Multi-RDD Transformations

    rdd1:               rdd2:
    Chicago             San Francisco
    Boston              Boston
    Paris               Amsterdam
    San Francisco       Mumbai
    Tokyo               McMurdo Station

    rdd1.subtract(rdd2):
    Chicago
    Paris
    Tokyo

    rdd1.zip(rdd2):
    (Chicago,San Francisco)
    (Boston,Boston)
    (Paris,Amsterdam)
    (San Francisco,Mumbai)
    (Tokyo,McMurdo Station)

    rdd1.union(rdd2):
    Chicago
    Boston
    Paris
    San Francisco
    Tokyo
    San Francisco
    Boston
    Amsterdam
    Mumbai
    McMurdo Station
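The same results can be reproduced with plain Python lists, as a local sketch of the semantics only (not Spark code; real Spark output order may differ):

```python
# Plain-Python illustration of subtract/zip/union on the city data above
# (local lists, not Spark).

rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# subtract: elements of rdd1 that do not appear in rdd2
subtracted = [city for city in rdd1 if city not in set(rdd2)]

# zip: pair corresponding elements (both datasets must line up element-for-element)
zipped = list(zip(rdd1, rdd2))

# union: concatenation of both datasets -- duplicates are NOT removed
unioned = rdd1 + rdd2

print(subtracted)    # ['Chicago', 'Paris', 'Tokyo']
print(zipped[0])     # ('Chicago', 'San Francisco')
print(len(unioned))  # 10
```

Note that in Spark, zip additionally requires the two RDDs to have the same number of partitions and the same number of elements per partition.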
Some Other General RDD Operations

Other RDD operations:
- first: return the first element of the RDD
- foreach: apply a function to each element in an RDD
- top(n): return the largest n elements, using natural ordering

Sampling operations:
- sample: create a new RDD with a sampling of elements
- takeSample: return an array of sampled elements

Double RDD operations:
- Statistical functions, e.g. mean, sum, variance, stdev
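As a local sketch of what top(n) and the Double RDD statistics compute (plain Python, not Spark; the data is illustrative, and the assumption here is that Spark's variance/stdev are the population statistics, with sampleVariance/sampleStdev as the sample versions):

```python
import statistics

# Local Python sketch of top(n) and Double RDD statistics (not Spark code).

nums = [3.0, 1.0, 4.0, 1.0, 5.0]

# top(n): the largest n elements by natural ordering, in descending order
top2 = sorted(nums, reverse=True)[:2]   # [5.0, 4.0]

mean = sum(nums) / len(nums)            # 2.8
total = sum(nums)                       # 14.0

# Population variance/stdev, matching (assumed) Spark variance()/stdev()
variance = statistics.pvariance(nums)   # 2.56
stdev = statistics.pstdev(nums)         # 1.6

print(top2, mean, total, variance, stdev)
```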
Essential Points

- RDDs can be created from files, parallelized data in memory, or other RDDs
- sc.textFile reads newline-delimited text, one line per RDD record
- sc.wholeTextFiles reads entire files into single RDD records
- Generic RDDs can consist of any type of data
- Generic RDDs provide a wide range of transformation operations
Homework: Process Data Files with Spark

In this homework assignment you will:
- Process a set of XML files using wholeTextFiles
- Reformat a dataset to standardize its format (bonus)

Please refer to the Homework description.