
Working With RDDs in Spark

Chapter 11

201509

Course Chapters

Course Introduction
1   Introduction

Introduction to Hadoop
2   Introduction to Hadoop and the Hadoop Ecosystem
3   Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4   Importing Relational Data with Apache Sqoop
5   Introduction to Impala and Hive
6   Modeling and Managing Data with Impala and Hive
7   Data Formats
8   Data File Partitioning

Ingesting Streaming Data
9   Capturing Data with Apache Flume

Distributed Data Processing with Spark
10  Spark Basics
11  Working with RDDs in Spark (this chapter)
12  Aggregating Data with Pair RDDs
13  Writing and Deploying Spark Applications
14  Parallel Processing in Spark
15  Spark RDD Persistence
16  Common Patterns in Spark Data Processing
17  Spark SQL and DataFrames

Course Conclusion
18  Conclusion


Working With RDDs

In this chapter you will learn
How RDDs are created from files or data in memory
How to handle file formats with multi-line records
How to use some additional operations on RDDs


Chapter Topics

Working With RDDs in Spark
Distributed Data Processing with Spark

Creating RDDs
Other General RDD Operations
Conclusion
Homework: Process Data Files with Spark


RDDs

RDDs can hold any type of element
Primitive types: integers, characters, booleans, etc.
Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
Scala/Java objects (if serializable)
Mixed types

Some types of RDDs have additional functionality
Pair RDDs: RDDs consisting of key-value pairs
Double RDDs: RDDs consisting of numeric data
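For illustration, a minimal sketch in the pyspark shell (where sc is predefined; the data values here are invented):

> pairs = sc.parallelize([("user1", 3), ("user2", 7)])  # a pair RDD of key-value tuples
> pairs.keys().take(2)
['user1', 'user2']
> nums = sc.parallelize([1.0, 2.0, 3.0])  # a double RDD of numeric data
> nums.mean()
2.0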


Creating RDDs From Collections

You can create RDDs from collections instead of files
sc.parallelize(collection)

> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']

Useful when
Testing
Generating data programmatically (see the sketch below)
Integrating
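For instance, a minimal hedged sketch of generating test data programmatically (the size is arbitrary):

> testRdd = sc.parallelize(range(1, 1001))  # 1,000 generated integers
> testRdd.count()
1000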


Creating RDDs from Files (1)

For file-based RDDs, use SparkContext.textFile
Accepts a single file, a wildcard list of files, or a comma-separated list of files
Examples:
sc.textFile("myfile.txt")
sc.textFile("mydata/*.log")
sc.textFile("myfile1.txt,myfile2.txt")
Each line in the file(s) is a separate record in the RDD

Files are referenced by absolute or relative URI
Absolute URI:
file:/home/training/myfile.txt
hdfs://localhost/loudacre/myfile.txt
Relative URI (uses the default file system): myfile.txt
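Putting the pieces together, a short sketch in the pyspark shell (using the illustrative HDFS path above):

> myrdd = sc.textFile("hdfs://localhost/loudacre/myfile.txt")
> myrdd.take(1)  # the first line of the file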


Creating RDDs from Files (2)

textFile maps each line in a file to a separate RDD element

File contents:
I've never seen a purple cow.\n
I never hope to see one;\n
But I can tell you, anyhow,\n
I'd rather see than be one.\n

Resulting RDD elements:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

textFile only works with line-delimited text files
What about other formats?


Input and Output Formats (1)

Spark uses Hadoop InputFormat and OutputFormat Java classes
Some examples from core Hadoop:
TextInputFormat / TextOutputFormat: newline-delimited text files
SequenceFileInputFormat / SequenceFileOutputFormat: Hadoop sequence files
FixedLengthInputFormat: fixed-width records
Many implementations are available in additional libraries,
e.g. AvroInputFormat / AvroOutputFormat in the Avro library


Input and Output Formats (2)

Specify any input format using sc.hadoopFile
(or newAPIHadoopFile for New API classes)
Specify any output format using rdd.saveAsHadoopFile
(or saveAsNewAPIHadoopFile for New API classes)

textFile and saveAsTextFile are convenience functions
textFile just calls hadoopFile, specifying TextInputFormat
saveAsTextFile calls saveAsHadoopFile, specifying TextOutputFormat
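To make the equivalence concrete, a hedged sketch (pyspark shell assumed) that reads a text file through the generic hadoopFile call; TextInputFormat produces (byte offset, line) pairs, so the key is dropped to match what textFile returns:

> pairs = sc.hadoopFile("myfile.txt",
      "org.apache.hadoop.mapred.TextInputFormat",
      "org.apache.hadoop.io.LongWritable",
      "org.apache.hadoop.io.Text")
> lines = pairs.map(lambda kv: kv[1])  # drop the byte-offset key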


Whole File-Based RDDs (1)

sc.textFile maps each line in a file to a separate RDD element
What about files with a multi-line input format, e.g. XML or JSON?

sc.wholeTextFiles(directory)
Maps the entire contents of each file in a directory to a single RDD element
Works only for small files (each element must fit in memory)

file1.json:
{
  "firstName":"Fred",
  "lastName":"Flintstone",
  "userid":"123"
}

file2.json:
{
  "firstName":"Barney",
  "lastName":"Rubble",
  "userid":"234"
}

Resulting RDD of (filename, contents) pairs:
(file1.json, {"firstName":"Fred","lastName":"Flintstone","userid":"123"})
(file2.json, {"firstName":"Barney","lastName":"Rubble","userid":"234"})
(file3.xml, ...)
(file4.xml, ...)


Whole File-Based RDDs (2)

Python:
> import json
> myrdd1 = sc.wholeTextFiles(mydir)
> myrdd2 = myrdd1.map(lambda (fname,s): json.loads(s))
> for record in myrdd2.take(2):
>     print record["firstName"]

Output:
Fred
Barney

Scala:
> import scala.util.parsing.json.JSON
> val myrdd1 = sc.wholeTextFiles(mydir)
> val myrdd2 = myrdd1.map(pair =>
    JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
> for (record <- myrdd2.take(2))
    println(record.getOrElse("firstName", null))


Chapter Topics

Working With RDDs in Spark
Distributed Data Processing with Spark

Creating RDDs
Other General RDD Operations
Conclusion
Homework: Process Data Files with Spark


Some Other General RDD Operations

Single-RDD transformations
flatMap: maps one element in the base RDD to multiple elements
distinct: filters out duplicates
sortBy: sorts using the provided function (see the sketch below)
Multi-RDD transformations
intersection: creates a new RDD with all elements present in both original RDDs
union: adds all elements of two RDDs into a single new RDD
zip: pairs each element of the first RDD with the corresponding element of the second
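The next pages illustrate flatMap, distinct, and the multi-RDD operations; sortBy and intersection, which are not shown there, are sketched here (pyspark shell assumed, data invented):

> cities1 = sc.parallelize(["Chicago", "Boston", "Paris"])
> cities2 = sc.parallelize(["Paris", "Tokyo", "Boston"])
> cities1.intersection(cities2).collect()  # element order may vary
['Boston', 'Paris']
> cities1.sortBy(lambda city: city).collect()
['Boston', 'Chicago', 'Paris']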


Example: flatMap and distinct

Python:
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()

Scala:
> sc.textFile(file).
    flatMap(line => line.split(' ')).
    distinct()

Input lines:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

After flatMap, each word is a separate RDD element, including repeats (e.g. "never" and "see" each appear twice); distinct then removes the duplicate elements.


Examples: Multi-RDD Transformations

rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.subtract(rdd2):
Chicago, Paris, Tokyo

rdd1.zip(rdd2):
(Chicago,San Francisco), (Boston,Boston), (Paris,Amsterdam),
(San Francisco,Mumbai), (Tokyo,McMurdo Station)

rdd1.union(rdd2):
Chicago, Boston, Paris, San Francisco, Tokyo,
San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station
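A minimal sketch reproducing the example above (pyspark shell assumed; collect order for subtract and union may vary):

> rdd1 = sc.parallelize(["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"])
> rdd2 = sc.parallelize(["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"])
> rdd1.subtract(rdd2).collect()  # elements of rdd1 not also in rdd2
> rdd1.zip(rdd2).collect()       # requires equal element counts and partitioning
> rdd1.union(rdd2).collect()     # keeps duplicates; chain .distinct() to remove them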


Some Other General RDD Operations

Other RDD operations
first: returns the first element of the RDD
foreach: applies a function to each element in an RDD
top(n): returns the largest n elements using natural ordering
Sampling operations (see the sketch below)
sample: creates a new RDD with a sampling of elements
takeSample: returns an array of sampled elements
Double RDD operations
Statistical functions, e.g. mean, sum, variance, stdev
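A brief hedged sketch of these operations (pyspark shell assumed; sampled output is omitted because it is random):

> nums = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0])
> nums.top(2)                        # [5.0, 4.0]
> nums.sample(False, 0.5).collect()  # ~50% sample, without replacement
> nums.takeSample(False, 3)          # list of 3 sampled elements
> nums.mean(), nums.stdev()          # double RDD statistics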


Chapter Topics

Working With RDDs in Spark
Distributed Data Processing with Spark

Creating RDDs
Other General RDD Operations
Conclusion
Homework: Process Data Files with Spark


Essential Points

RDDs can be created from files, parallelized data in memory, or other RDDs
sc.textFile reads newline-delimited text, one line per RDD record
sc.wholeTextFiles reads entire files into single RDD records
Generic RDDs can consist of any type of data
Generic RDDs provide a wide range of transformation operations


Chapter Topics

Working With RDDs in Spark
Distributed Data Processing with Spark

Creating RDDs
Other General RDD Operations
Conclusion
Homework: Process Data Files with Spark


Homework: Process Data Files with Spark

In this homework assignment you will
Process a set of XML files using wholeTextFiles
Reformat a dataset to standardize its format (bonus)
Please refer to the Homework description

