Working With RDDs in Spark
Chapter 11
201509
Course Chapters

Course Introduction
  1. Introduction
Introduction to Hadoop
  2. Introduction to Hadoop and the Hadoop Ecosystem
  3. Hadoop Architecture and HDFS
Importing and Modeling Structured Data
  4. Importing Relational Data with Apache Sqoop
  5. Introduction to Impala and Hive
  6. Modeling and Managing Data with Impala and Hive
  7. Data Formats
  8. Data File Partitioning
Ingesting Streaming Data
  9. Capturing Data with Apache Flume
Distributed Data Processing with Spark
  10. Spark Basics
  11. Working with RDDs in Spark (this chapter)
  12. Aggregating Data with Pair RDDs
  13. Writing and Deploying Spark Applications
  14. Parallel Processing in Spark
  15. Spark RDD Persistence
  16. Common Patterns in Spark Data Processing
  17. Spark SQL and DataFrames
Course Conclusion
  18. Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Working With RDDs

In this chapter you will learn:
- How RDDs are created from files or data in memory
- How to handle file formats with multi-line records
- How to use some additional operations on RDDs
Chapter Topics

Working With RDDs in Spark
Course section: Distributed Data Processing with Spark

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark
RDDs

RDDs can hold any type of element:
- Primitive types: integers, characters, booleans, etc.
- Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
- Scala/Java objects (if serializable)
- Mixed types

Some types of RDDs have additional functionality:
- Pair RDDs: RDDs consisting of key-value pairs
- Double RDDs: RDDs consisting of numeric data
Creating RDDs From Collections

You can create RDDs from collections instead of files:

    sc.parallelize(collection)

    > myData = ["Alice","Carlos","Frank","Barbara"]
    > myRdd = sc.parallelize(myData)
    > myRdd.take(2)
    ['Alice', 'Carlos']

Useful when:
- Testing
- Generating data programmatically
- Integrating
Creating RDDs from Files (1)

For file-based RDDs, use SparkContext.textFile:
- Accepts a single file, a wildcard list of files, or a comma-separated list of files. Examples:

    sc.textFile("myfile.txt")
    sc.textFile("mydata/*.log")
    sc.textFile("myfile1.txt,myfile2.txt")

- Each line in the file(s) is a separate record in the RDD

Files are referenced by absolute or relative URI:
- Absolute URI:
    file:/home/training/myfile.txt
    hdfs://localhost/loudacre/myfile.txt
- Relative URI (uses the default file system):
    myfile.txt
Creating RDDs from Files (2)

textFile maps each line in a file to a separate RDD element. For example, this four-line file:

    I've never seen a purple cow.\n
    I never hope to see one;\n
    But I can tell you, anyhow,\n
    I'd rather see than be one.\n

becomes an RDD of four string elements:

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

textFile only works with line-delimited text files. What about other formats?
Input and Output Formats (1)

Spark uses Hadoop InputFormat and OutputFormat Java classes. Some examples from core Hadoop:
- TextInputFormat / TextOutputFormat: newline-delimited text files
- SequenceFileInputFormat / SequenceFileOutputFormat
- FixedLengthInputFormat

Many implementations are available in additional libraries, e.g. AvroInputFormat / AvroOutputFormat in the Avro library.
Input and Output Formats (2)

- Specify any input format using sc.hadoopFile (or sc.newAPIHadoopFile for New API classes)
- Specify any output format using rdd.saveAsHadoopFile (or rdd.saveAsNewAPIHadoopFile for New API classes)

textFile and saveAsTextFile are convenience functions:
- textFile just calls hadoopFile, specifying TextInputFormat
- saveAsTextFile calls saveAsHadoopFile, specifying TextOutputFormat
Whole File-Based RDDs (1)

sc.textFile maps each line in a file to a separate RDD element. What about files with a multi-line input format, e.g. XML or JSON?

    sc.wholeTextFiles(directory)

- Maps the entire contents of each file in a directory to a single RDD element
- Works only for small files (each element must fit in memory)

For example, given file1.json:

    {
      "firstName":"Fred",
      "lastName":"Flintstone",
      "userid":"123"
    }

and file2.json:

    {
      "firstName":"Barney",
      "lastName":"Rubble",
      "userid":"234"
    }

wholeTextFiles produces an RDD of (filename, content) pairs:

    (file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"})
    (file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"})
    (file3.xml, )
    (file4.xml, )
Whole File-Based RDDs (2)

Python:

    > import json
    > myrdd1 = sc.wholeTextFiles(mydir)
    > myrdd2 = myrdd1.map(lambda (fname,s): json.loads(s))
    > for record in myrdd2.take(2):
    >     print record["firstName"]

Output:

    Fred
    Barney

Scala:

    > import scala.util.parsing.json.JSON
    > val myrdd1 = sc.wholeTextFiles(mydir)
    > val myrdd2 = myrdd1.map(pair =>
        JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
    > for (record <- myrdd2.take(2))
        println(record.getOrElse("firstName",null))
Some Other General RDD Operations

Single-RDD transformations:
- flatMap: maps one element in the base RDD to multiple elements
- distinct: filter out duplicates
- sortBy: use a provided function to sort

Multi-RDD transformations:
- intersection: create a new RDD with all elements found in both original RDDs
- union: add all elements of two RDDs into a single new RDD
- zip: pair each element of the first RDD with the corresponding element of the second
Example: flatMap and distinct

Python:

    > sc.textFile(file) \
        .flatMap(lambda line: line.split()) \
        .distinct()

Scala:

    > sc.textFile(file).
        flatMap(line => line.split(' ')).
        distinct()

Given the input file:

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

flatMap produces one element per word, duplicates included (I've, never, seen, ..., never, hope, to, ...); distinct then removes the duplicates.
Examples: Multi-RDD Transformations

    rdd1:               rdd2:
    Chicago             San Francisco
    Boston              Boston
    Paris               Amsterdam
    San Francisco       Mumbai
    Tokyo               McMurdo Station

    rdd1.subtract(rdd2):
    Chicago
    Paris
    Tokyo

    rdd1.zip(rdd2):
    (Chicago,San Francisco)
    (Boston,Boston)
    (Paris,Amsterdam)
    (San Francisco,Mumbai)
    (Tokyo,McMurdo Station)

    rdd1.union(rdd2):
    Chicago
    Boston
    Paris
    San Francisco
    Tokyo
    San Francisco
    Boston
    Amsterdam
    Mumbai
    McMurdo Station
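The same results can be reproduced with plain Python lists, as a local sketch of the semantics only (not Spark code; real Spark output order may differ):

```python
# Plain-Python illustration of subtract/zip/union on the city data above
# (local lists, not Spark).

rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# subtract: elements of rdd1 that do not appear in rdd2
subtracted = [city for city in rdd1 if city not in set(rdd2)]

# zip: pair corresponding elements (both datasets must line up element-for-element)
zipped = list(zip(rdd1, rdd2))

# union: concatenation of both datasets -- duplicates are NOT removed
unioned = rdd1 + rdd2

print(subtracted)    # ['Chicago', 'Paris', 'Tokyo']
print(zipped[0])     # ('Chicago', 'San Francisco')
print(len(unioned))  # 10
```

Note that in Spark, zip additionally requires the two RDDs to have the same number of partitions and the same number of elements per partition.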
Some Other General RDD Operations

Other RDD operations:
- first: return the first element of the RDD
- foreach: apply a function to each element in an RDD
- top(n): return the largest n elements, using natural ordering

Sampling operations:
- sample: create a new RDD with a sampling of elements
- takeSample: return an array of sampled elements

Double RDD operations:
- Statistical functions, e.g. mean, sum, variance, stdev
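As a local sketch of what top(n) and the Double RDD statistics compute (plain Python, not Spark; the data is illustrative, and the assumption here is that Spark's variance/stdev are the population statistics, with sampleVariance/sampleStdev as the sample versions):

```python
import statistics

# Local Python sketch of top(n) and Double RDD statistics (not Spark code).

nums = [3.0, 1.0, 4.0, 1.0, 5.0]

# top(n): the largest n elements by natural ordering, in descending order
top2 = sorted(nums, reverse=True)[:2]   # [5.0, 4.0]

mean = sum(nums) / len(nums)            # 2.8
total = sum(nums)                       # 14.0

# Population variance/stdev, matching (assumed) Spark variance()/stdev()
variance = statistics.pvariance(nums)   # 2.56
stdev = statistics.pstdev(nums)         # 1.6

print(top2, mean, total, variance, stdev)
```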
Essential Points

- RDDs can be created from files, parallelized data in memory, or other RDDs
- sc.textFile reads newline-delimited text, one line per RDD record
- sc.wholeTextFiles reads entire files into single RDD records
- Generic RDDs can consist of any type of data
- Generic RDDs provide a wide range of transformation operations
Homework: Process Data Files with Spark

In this homework assignment you will:
- Process a set of XML files using wholeTextFiles
- Reformat a dataset to standardize its format (bonus)

Please refer to the Homework description.