
Common Patterns in Spark Data Processing
Chapter 16

201509
Course Chapters

Course Introduction
  1  Introduction

Introduction to Hadoop
  2  Introduction to Hadoop and the Hadoop Ecosystem
  3  Hadoop Architecture and HDFS
  4  Importing Relational Data with Apache Sqoop

Importing and Modeling Structured Data
  5  Introduction to Impala and Hive
  6  Modeling and Managing Data with Impala and Hive
  7  Data Formats
  8  Data File Partitioning

Ingesting Streaming Data
  9  Capturing Data with Apache Flume

Distributed Data Processing with Spark
  10 Spark Basics
  11 Working with RDDs in Spark
  12 Aggregating Data with Pair RDDs
  13 Writing and Deploying Spark Applications
  14 Parallel Processing in Spark
  15 Spark RDD Persistence
  16 Common Patterns in Spark Data Processing
  17 Spark SQL and DataFrames

Course Conclusion
  18 Conclusion



Common Patterns in Spark Programming

In this chapter you will learn


■ What kinds of processing and analysis Spark is best at
■ How to implement an iterative algorithm in Spark
■ How MLlib works with Spark



Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



Common Spark Use Cases (1)

■ Spark is especially useful when working with any combination of:
  – Large amounts of data
  – Distributed storage
  – Intensive computations
  – Distributed computing
  – Iterative algorithms
  – In-memory processing and pipelining



Common Spark Use Cases (2)

■ Examples
  – Risk analysis
    – “How likely is this borrower to pay back a loan?”
  – Recommendations
    – “Which products will this customer enjoy?”
  – Predictions
    – “How can we prevent service outages instead of simply reacting to them?”
  – Classification
    – “How can we tell which mail is spam and which is legitimate?”



Spark Examples

■ Spark includes many example programs that demonstrate some common Spark programming patterns and algorithms
  – k-means
  – Logistic regression
  – Calculating pi
  – Alternating least squares (ALS)
  – Querying Apache web logs
  – Processing Twitter feeds
■ Examples
  – $DEV1/examples/spark
  – spark-examples-version.jar – Java and Scala examples
  – python.tar.gz – PySpark examples



Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



Example: PageRank

■ PageRank gives web pages a ranking score based on links from other pages
  – Higher scores are given for more links, and for links from other high-ranking pages
■ Why do we care?
  – PageRank is a classic example of big data analysis (like WordCount)
  – Lots of data – it needs an algorithm that is distributable and scalable
  – Iterative – the more iterations, the better the answer



PageRank Algorithm (1)

1. Start each page with a rank of 1.0

[Diagram: Page 1, Page 2, Page 3, and Page 4, each starting with rank 1.0]



PageRank Algorithm (2)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib_p = rank_p / neighbors_p

[Diagram: all four pages at rank 1.0, with contributions flowing along their links]



PageRank Algorithm (3)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib_p = rank_p / neighbors_p
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new_rank = Σ(contribs) × 0.85 + 0.15

[Diagram, iteration 1: Page 1 = 1.85, Page 2 = 0.58, Page 3 = 1.0, Page 4 = 0.58]
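As a concrete check, here is the iteration-1 arithmetic for Page 1, using the example link graph shown later (page2, page3, and page4 all link to page1; page3 and page4 each have two outgoing links, page2 has one). A minimal sketch in Python:

# Contributions flowing into page1 on iteration 1 (all ranks start at 1.0)
contribs_to_page1 = 1.0/1 + 1.0/2 + 1.0/2   # from page2, page3, page4
new_rank_page1 = contribs_to_page1 * 0.85 + 0.15
print new_rank_page1                         # 2.0 * 0.85 + 0.15 = 1.85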



PageRank Algorithm (4)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib_p = rank_p / neighbors_p
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new_rank = Σ(contribs) × 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

[Diagram, iteration 2: Page 1 = 1.31, Page 2 = 0.39, Page 3 = 1.7, Page 4 = 0.57]



PageRank Algorithm (5)

1. Start each page with a rank of 1.0
2. On each iteration:
   a. Each page contributes to its neighbors its own rank divided by the number of its neighbors: contrib_p = rank_p / neighbors_p
   b. Set each page’s new rank based on the sum of its neighbors’ contributions: new_rank = Σ(contribs) × 0.85 + 0.15
3. Each iteration incrementally improves the page ranking

[Diagram, iteration 10 (final): Page 1 = 1.43, Page 2 = 0.46, Page 3 = 1.38, Page 4 = 0.73]



PageRank in Spark: Neighbor Contribution Function

def computeContribs(neighbors, rank):
    for neighbor in neighbors:
        yield (neighbor, rank / len(neighbors))

Example: for neighbors = [page1, page2] and rank = 1.0, the function yields (page1, 0.5) and (page2, 0.5).

[Diagram: Page 4, with rank 1.0, contributing 0.5 to each of its two neighbors, Page 1 and Page 2]
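A quick check of the generator (hypothetical standalone usage, outside Spark):

print list(computeContribs(["page1", "page2"], 1.0))
# [('page1', 0.5), ('page2', 0.5)]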



PageRank in Spark: Example Data

Data format: one link per line, as "source-page destination-page"

Example input file:
  page1 page3
  page2 page1
  page4 page1
  page3 page1
  page4 page2
  page3 page4

[Diagram: the link graph formed by Page 1 through Page 4]



PageRank in Spark: Pairs of Page Links

def computeContribs(neighbors, rank): …

links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct()

Result:
  (page1,page3)
  (page2,page1)
  (page4,page1)
  (page3,page1)
  (page4,page2)
  (page3,page4)



PageRank in Spark: Page Links Grouped by Source Page

def computeContribs(neighbors, rank): …

links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct() \
    .groupByKey()

links:
  (page4, [page2,page1])
  (page2, [page1])
  (page3, [page1,page4])
  (page1, [page3])



PageRank in Spark: Persisting the Link Pair RDD

def computeContribs(neighbors, rank): …

links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct() \
    .groupByKey() \
    .persist()

links:
  (page4, [page2,page1])
  (page2, [page1])
  (page3, [page1,page4])
  (page1, [page3])



PageRank in Spark: Set Initial Ranks

def computeContribs(neighbors, rank): …

links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct() \
    .groupByKey() \
    .persist()

ranks = links.map(lambda (page, neighbors): (page, 1.0))

links:
  (page4, [page2,page1])
  (page2, [page1])
  (page3, [page1,page4])
  (page1, [page3])

ranks:
  (page4, 1.0)
  (page2, 1.0)
  (page3, 1.0)
  (page1, 1.0)


PageRank in Spark: First Iteration (1)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links \
        .join(ranks)

Result of links.join(ranks):
  (page4, ([page2,page1], 1.0))
  (page2, ([page1], 1.0))
  (page3, ([page1,page4], 1.0))
  (page1, ([page3], 1.0))



PageRank in Spark: First Iteration (2)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links \
        .join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)): \
            computeContribs(neighbors, rank))

contribs:
  (page2,0.5)
  (page1,0.5)
  (page1,1.0)
  (page1,0.5)
  (page4,0.5)
  (page3,1.0)


PageRank in Spark: First Iteration (3)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links \
        .join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)): \
            computeContribs(neighbors, rank))
    ranks = contribs \
        .reduceByKey(lambda v1, v2: v1 + v2)

contribs:
  (page2,0.5)
  (page1,0.5)
  (page1,1.0)
  (page1,0.5)
  (page4,0.5)
  (page3,1.0)

Result of the reduceByKey:
  (page4,0.5)
  (page2,0.5)
  (page3,1.0)
  (page1,2.0)


PageRank in Spark: First Iteration (4)

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links \
        .join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)): \
            computeContribs(neighbors, rank))
    ranks = contribs \
        .reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda (page, contrib): \
            (page, contrib * 0.85 + 0.15))

ranks after the first iteration:
  (page4,0.58)
  (page2,0.58)
  (page3,1.0)
  (page1,1.85)


PageRank in Spark: Second Iteration

def computeContribs(neighbors, rank): …
links = …
ranks = …

for x in xrange(10):
    contribs = links \
        .join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)): \
            computeContribs(neighbors, rank))
    ranks = contribs \
        .reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda (page, contrib): \
            (page, contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank

ranks entering the second iteration:
  (page4,0.58)
  (page2,0.58)
  (page3,1.0)
  (page1,1.85)

ranks after the second iteration (matching the iteration-2 diagram):
  (page4,0.57)
  (page2,0.39)
  (page3,1.7)
  (page1,1.31)
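Putting the pieces together, a minimal end-to-end sketch of the PageRank example (Python 2, as in the course; assumes an existing SparkContext sc and an input path in file):

def computeContribs(neighbors, rank):
    for neighbor in neighbors:
        yield (neighbor, rank / len(neighbors))

# One (source, destination) pair per distinct link, grouped by source
# and persisted because the loop below reuses it on every iteration
links = sc.textFile(file) \
    .map(lambda line: line.split()) \
    .map(lambda pages: (pages[0], pages[1])) \
    .distinct() \
    .groupByKey() \
    .persist()

# Every page starts with rank 1.0
ranks = links.map(lambda (page, neighbors): (page, 1.0))

for x in xrange(10):
    # Each page sends rank/len(neighbors) to each of its neighbors...
    contribs = links.join(ranks) \
        .flatMap(lambda (page, (neighbors, rank)): \
            computeContribs(neighbors, rank))
    # ...and each page's new rank is a damped sum of what it received
    ranks = contribs.reduceByKey(lambda v1, v2: v1 + v2) \
        .map(lambda (page, contrib): (page, contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank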



Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



Machine Learning

■ Most programs tell computers exactly what to do
  – Database transactions and queries
  – Controllers
    – Phone systems, manufacturing processes, transport, weaponry, etc.
  – Media delivery
  – Simple search
  – Social systems
    – Chat, blogs, email, etc.
■ An alternative technique is to have computers learn what to do
■ Machine Learning refers to programs that leverage collected data to drive future program behavior
■ This represents another major opportunity to gain value from data



The ‘Three Cs’

■ Machine Learning is an active area of research and new applications
■ There are three well-established categories of techniques for exploiting data:
  – Collaborative filtering (recommendations)
  – Clustering
  – Classification



Collaborative Filtering

■ Collaborative Filtering is a technique for recommendations
■ Example application: given people who each like certain books, learn to suggest what someone may like in the future based on what they already like
■ Helps users navigate data by expanding to topics that have affinity with their established interests
■ Collaborative Filtering algorithms are agnostic to the different types of data items involved
  – Useful in many different domains



Clustering

■ Clustering algorithms discover structure in collections of data
  – Where no formal structure previously existed
■ They discover what clusters, or groupings, naturally occur in data
■ Examples
  – Finding related news articles
  – Computer vision (groups of pixels that cohere into objects)



Classification

■ The previous two techniques are considered ‘unsupervised’ learning
  – The algorithm discovers groups or recommendations itself
■ Classification is a form of ‘supervised’ learning
■ A classification system takes a set of data records with known labels
  – Learns how to label new records based on that information
■ Examples
  – Given a set of e-mails identified as spam/not spam, label new e-mails as spam/not spam
  – Given tumors identified as benign or malignant, classify new tumors


Machine Learning Challenges

■ Highly computation-intensive and iterative
■ Many traditional numerical processing systems do not scale to very large datasets
  – e.g., MATLAB



MLlib: Machine Learning on Spark

■ MLlib is part of Apache Spark
■ Includes many common ML functions
  – ALS (alternating least squares)
  – k-means
  – Logistic regression
  – Linear regression
  – Gradient descent
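
As a quick illustration, a minimal sketch of calling MLlib’s k-means from PySpark (Spark 1.x mllib API; assumes an existing SparkContext sc, and the input path and parameter values here are hypothetical):

from pyspark.mllib.clustering import KMeans

# Each line of the (hypothetical) input file holds comma-separated coordinates
data = sc.textFile("hdfs:///path/to/points.txt") \
    .map(lambda line: [float(x) for x in line.split(',')])

# Train a k-means model with 5 clusters and up to 10 iterations
model = KMeans.train(data, k=5, maxIterations=10)
print model.clusterCenters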



Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



k-means Clustering

■ k-means Clustering
  – A common iterative algorithm used in graph analysis and machine learning
  – You will implement a simplified version in the homework assignment


Clustering (1)



Clustering (2)

Goal: find “clusters” of data points



Example: k-‐means Clustering (1)

1. Choose K random points as starting centers



Example: k-‐means Clustering (2)

1. Choose K random points as starting centers
2. Find all points closest to each center



Example: k-‐means Clustering (3)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster



Example: k-‐means Clustering (4)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again



Example: k-‐means Clustering (5)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again



Example: k-‐means Clustering (6)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again



Example: k-‐means Clustering (7)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again



Example: k-‐means Clustering (8)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again



Example: k-‐means Clustering (9)

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed, iterate again
5. Done!



Example: Approximate k-‐means Clustering

1. Choose K random points as starting centers
2. Find all points closest to each center
3. Find the center (mean) of each cluster
4. If the centers changed by more than c, iterate again
5. Close enough!
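
A compact sketch of this loop in PySpark (Python 2, matching the course examples; closestPoint, addPoints, and distanceSquared are helpers like the ones provided for the homework, and convergeDist is an assumed threshold):

# Assumes: points  - an RDD of [x, y] coordinate pairs (persisted)
#          kPoints - a list of K starting centers, e.g. from takeSample
tempDist = float("inf")
while tempDist > convergeDist:
    # Key each point by the index of its nearest center, with a count of 1
    closest = points.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    # Per center: sum of coordinates and number of points
    pointStats = closest.reduceByKey(
        lambda (p1, n1), (p2, n2): (addPoints(p1, p2), n1 + n2))
    # Average to get each new center
    newPoints = pointStats.map(
        lambda (i, (p, n)): (i, [p[0] / n, p[1] / n])).collect()
    # Total squared movement of the centers in this iteration
    tempDist = sum(distanceSquared(kPoints[i], p) for (i, p) in newPoints)
    for (i, p) in newPoints:
        kPoints[i] = p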



Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



Essential Points

■ Spark is especially suited to big data problems that require iteration
  – In-memory persistence makes this very efficient
■ Iteration is common in many types of analysis
  – e.g., common algorithms such as PageRank and k-means
■ Spark includes specialized libraries to implement many common functions
■ MLlib
  – Efficient, scalable functions for machine learning (e.g., logistic regression, k-means)


Chapter Topics

Common Patterns in Spark Data Processing (Distributed Data Processing with Spark)

■ Common Spark Use Cases
■ Iterative Algorithms in Spark
■ Machine Learning
■ Example: k-means
■ Conclusion
■ Homework: Implement an Iterative Algorithm with Spark



Homework

■ Iterative Processing in Spark
  – In this homework assignment you will:
    – Implement k-means in Spark in order to identify clustered location data points from Loudacre device status logs
    – Find the geographic centers of device activity
■ Please refer to the Homework description



Preparation Exercise

# Input file: /home/training/training_materials/data/devicestatus.txt
devstatus = sc.textFile("file:/home/training/training_materials/data/devicestatus.txt")

# Discard short/corrupt lines, split each line on its delimiter character
# (the character in position 19, which varies by line), and keep only
# records with exactly 14 fields
cleanstatus = devstatus. \
    filter(lambda line: len(line) > 20). \
    map(lambda line: line.split(line[19:20])). \
    filter(lambda values: len(values) == 14)

# Keep the first three fields (trimming field 2 to its first token)
# plus latitude and longitude (fields 13 and 14)
devicedata = cleanstatus. \
    map(lambda values: (values[0], values[1].split(' ')[0], values[2], values[12], values[13]))

# Save to a CSV file as a comma-delimited string (join the fields rather
# than relying on the tuple's string form, which adds parentheses)
devicedata. \
    map(lambda values: ','.join(values)). \
    saveAsTextFile("/loudacre/devicestatus_etl")



Exercise 1

■ Parse the input files, which are delimited by ‘,’, into (latitude, longitude) pairs (the 4th and 5th fields in each line). The files are located at: /loudacre/devicestatus_etl/*
■ Only include known locations (that is, filter out (0,0) locations)
■ Tip: use the sum() function to filter out (0,0) locations
■ Name the resulting RDD points and persist it
■ Print the first five elements of points

■ Expected output



Exercise 1 Solution

filename = "/loudacre/devicestatus_etl/*"
points = sc.textFile(filename)\
.map(lambda line: line.split(","))\
.map(lambda fields: [float(fields[3]),float(fields[4])])\
.filter(lambda point: sum(point) != 0)\
.persist()
points.take(5)



Exercise 2

■ Take a sample of 5 from the RDD points, without replacement, using a seed value of 34. Name the resulting array kPoints.
■ Print the resulting array

■ Expected output



Exercise 2 Solution

K=5
kPoints = points.takeSample(False, K, 34)
print kPoints



Exercise 3

For each coordinate point, use the provided closestPoint function to map each point to the index, in the kPoints array, of the location closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). Name the resulting RDD closest.
Print the first five elements of closest.

■ Expected output



Exercise 3 Solution

closest = points.map(lambda p : (closestPoint(p, kPoints), (p, 1)))
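
For reference, a plausible sketch of the provided closestPoint helper (an assumption; the actual course file may differ, and this version relies on a distanceSquared helper like the one used in Exercise 6):

# Hypothetical helper: index in kPoints of the center nearest to point p
def closestPoint(p, kPoints):
    bestIndex = 0
    closestDist = float("inf")
    for i in range(len(kPoints)):
        dist = distanceSquared(p, kPoints[i])
        if dist < closestDist:
            closestDist = dist
            bestIndex = i
    return bestIndex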



Exercise 4

Reduce the result: for each center in the kPoints array, sum the latitudes and the longitudes, respectively, of all the points closest to that center, along with the number of closest points. That is, for each key (k-point index), reduce by adding the coordinates and the number of points.
Name the resulting RDD pointStats and print it.
Tip: use the provided addPoints function to sum the points.

■ Expected output



Exercise 4 Solution

pointStats = closest.reduceByKey(
    lambda (point1, n1), (point2, n2): (addPoints(point1, point2), n1 + n2))

pointStats.collect()
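
A plausible sketch of the provided addPoints helper (an assumption; the actual course file may differ):

# Hypothetical helper: coordinate-wise sum of two [x, y] points
def addPoints(p1, p2):
    return [p1[0] + p2[0], p1[1] + p2[1]]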



Exercise 5

The reduced RDD should have (at most) K members. Map each to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).
Perform a collect on the resulting RDD to produce an array. Name the array newPoints.

■ Expected output



Exercise 5 Solution

newPoints = pointStats.map(
    lambda (i, (point, n)): (i, [point[0] / n, point[1] / n])).collect()



Exercise 6

Use the provided distanceSquared method to calculate how much each center “moved” between the current iteration and the last: that is, for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That is the delta between iterations; when the delta is less than convergeDist, stop iterating.
Print the new center points.

■ Expected output



Exercise 6 Solution

# Calculate the total squared distance between the current and new center points
tempDist = 0
for (i, point) in newPoints:
    tempDist += distanceSquared(kPoints[i], point)
print "Distance between iterations:", tempDist

# Copy the new points into the kPoints array for the next iteration
for (i, point) in newPoints:
    kPoints[i] = point
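
A plausible sketch of the provided distanceSquared helper (an assumption; the actual course file may differ):

# Hypothetical helper: squared Euclidean distance between two [x, y] points
def distanceSquared(p1, p2):
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2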

