Chapter 16: Spark Algorithms
Course Chapters
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
■ Examples
– Risk analysis
– “How likely is this borrower to pay back a loan?”
– Recommendations
– “Which products will this customer enjoy?”
– Predictions
– “How can we prevent service outages instead of simply reacting to them?”
– Classification
– “How can we tell which mail is spam and which is legitimate?”
■ PageRank gives web pages a ranking score based on links from other pages
– Higher scores are given for more links, and for links from other high-ranking pages
■ Why do we care?
– PageRank is a classic example of big data analysis (like WordCount)
– Lots of data – needs an algorithm that is distributable and scalable
– Iterative – the more iterations, the better the answer
■ Example: a network of four pages, each starting with a rank of 1.0

Page     Initial   Iteration 1   Iteration 2   Iteration 10 (Final)
Page 1   1.0       1.85          1.31          1.43
Page 2   1.0       0.58          0.39          0.46
Page 3   1.0       1.0           1.7           1.38
Page 4   1.0       0.58          0.57          0.73
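These ranks come from repeatedly applying the simplified update rule that the code later in this chapter implements (damping factor 0.85); as a sketch of the recurrence:

rank(p) = 0.15 + 0.85 × Σ rank(q) / outdegree(q)

where the sum runs over every page q that links to p, and outdegree(q) is the number of pages q links to.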
■ PageRank in PySpark
– Step 1: read pairs of pages (source page, linked-to page) from the input file
  into an RDD of distinct (page, neighbor) tuples

Input file:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()

Result:
(page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)
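The body of computeContribs is elided above; a minimal sketch of what it might look like, assuming each neighbor receives an equal share of the page's current rank:

# Hypothetical implementation (only the signature appears above):
# give each neighbor an equal share of this page's current rank
def computeContribs(neighbors, rank):
    num_neighbors = len(neighbors)
    for neighbor in neighbors:
        yield (neighbor, rank / num_neighbors)

With this definition, a page with rank 1.0 and neighbors [page1, page4] would yield (page1, 0.5) and (page4, 0.5), matching the contribs values shown later in the walkthrough.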
– Step 2: group the pairs by source page to build each page's neighbor list

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()

links:
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
– Step 3: persist the links RDD; it never changes and is reused in every iteration

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .persist()
– Step 4: create a ranks RDD; every page starts with a rank of 1.0

links = …
ranks = …

links:                        ranks:
(page4, [page2,page1])        (page4, 1.0)
(page2, [page1])              (page2, 1.0)
(page3, [page1,page4])        (page3, 1.0)
(page1, [page3])              (page1, 1.0)

– Step 5: iterate; each iteration starts by joining links with ranks

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        …

Joined elements look like: (page4, ([page2,page1], 1.0))
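The ranks = … line is elided above; a minimal sketch of one way to build it, assuming every page keyed in links starts at 1.0 (Python 2 tuple-unpacking style, as in the rest of the chapter):

# Hypothetical initialization (the slides only show the resulting RDD)
ranks = links.map(lambda (page, neighbors): (page, 1.0))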
– Step 6: compute each page's contributions to its neighbors with computeContribs,
  then sum the contributions received by each page

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)

contribs:                 ranks (after reduceByKey):
(page2,0.5)               (page4,0.5)
(page1,0.5)               (page2,0.5)
(page1,1.0)               (page3,1.0)
(page1,0.5)               (page1,2.0)
(page4,0.5)
(page3,1.0)
– Step 7: apply the damping factor to produce each page's new rank

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

ranks after iteration 1:
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)
– Step 8: after the loop, collect and print the ranks

def computeContribs(neighbors, rank): …

links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank

ranks:
(page4,0.57)
(page2,0.21)
(page3,1.0)
(page1,0.77)
■ k-means Clustering
– A common iterative algorithm used in graph analysis and machine learning
– You will implement a simplified version in the homework assignment
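– The homework follows the usual k-means loop, outlined here for orientation (the detailed steps appear below):
  1. Choose K starting center points (a random sample of the data)
  2. Assign each point to its closest current center
  3. Recompute each center as the mean of the points assigned to it
  4. Repeat steps 2–3 until the centers move less than a convergence threshold (convergeDist)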
devstatus = sc.textFile("file:/home/training/training_materials/data/devicestatus.txt")

cleanstatus = devstatus. \
    …    # parsing/cleanup transformations omitted

devicedata = cleanstatus. \
    …    # field-selection transformations omitted

# Save to a CSV file as a comma-delimited string (trim parentheses from the tuple's string form)
devicedata. \
    saveAsTextFile("/loudacre/devicestatus_etl")
■ Parse the input files, which are comma-delimited, into (latitude, longitude) pairs (the 4th and 5th fields in each line). The files are located at /loudacre/devicestatus_etl/*
■Expected output
filename = "/loudacre/devicestatus_etl/*"
points = sc.textFile(filename)\
    .map(lambda line: line.split(","))\
    .map(lambda fields: [float(fields[3]),float(fields[4])])\
    .filter(lambda point: sum(point) != 0)\
    .persist()
points.take(5)
■ Take a sample of K = 5 points from the RDD points without replacement, using a seed value of 34. Name the resulting array kPoints.
■ Print the resulting array
■Expected output
K=5
kPoints = points.takeSample(False, K, 34)
print kPoints
■ For each coordinate point, use the provided closestPoint function to map each point to the index in the kPoints array of the center closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). Name the resulting RDD closest.
■ Print the first five elements of closest
■Expected output
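A sketch of this step, assuming the provided closestPoint(point, kPoints) returns the index of the center in kPoints nearest to point:

# Key each point by the index of its closest center, with value (point, 1)
closest = points.map(lambda point: (closestPoint(point, kPoints), (point, 1)))
print closest.take(5)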
■ Reduce the result: for each center in the kPoints array, sum the latitudes and longitudes, respectively, of all the points closest to that center, and count those points. That is, for each key (k-point index), reduce by adding the coordinates and the number of points.
■ Name the resulting RDD pointStats and print it.
Tip: use the provided addPoints function to sum the points.
■Expected output
pointStats.collect()
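A sketch of how pointStats might be built, assuming the provided addPoints sums two (latitude, longitude) pairs element-wise:

# For each center index, sum the coordinates and count the points assigned to it
pointStats = closest.reduceByKey(
    lambda (point1, n1), (point2, n2): (addPoints(point1, point2), n1 + n2))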
■ The reduced RDD should have (at most) K members. Map each one to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).
■ Perform a collect on the resulting RDD to produce an array, and name the array newPoints
■Expected output
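A sketch of the averaging step described above:

# Divide the summed coordinates by the number of points to get each new center,
# then collect the (at most K) results into a local array
newPoints = pointStats.map(
    lambda (i, (point, n)): (i, [point[0] / n, point[1] / n])).collect()
print newPoints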
■ Use the provided distanceSquared function to calculate how much each center “moved” between the current iteration and the last: for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That sum is the delta between iterations; when the delta is less than convergeDist, stop iterating.
■ Print the new center points
■Expected output
# Calculate the total distance between the current center points and the new points
tempDist = 0
for (i, point) in newPoints:
    tempDist += distanceSquared(kPoints[i], point)

print "Distance between iterations:", tempDist

# Copy the new points into the kPoints array for the next iteration
for (i, point) in newPoints:
    kPoints[i] = point
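Putting the homework steps together, the whole iteration might be wrapped in a convergence loop along the following lines. This is a sketch, not the assignment's reference solution; convergeDist, closestPoint, addPoints, and distanceSquared are the threshold and helper functions provided with the assignment.

# Iterate until the centers move less than convergeDist in total
tempDist = float("inf")
while tempDist > convergeDist:
    # Assign each point to the index of its closest current center
    closest = points.map(lambda point: (closestPoint(point, kPoints), (point, 1)))
    # Sum the coordinates and count the points assigned to each center
    pointStats = closest.reduceByKey(
        lambda (p1, n1), (p2, n2): (addPoints(p1, p2), n1 + n2))
    # Average to get the new center points
    newPoints = pointStats.map(
        lambda (i, (p, n)): (i, [p[0] / n, p[1] / n])).collect()
    # Total movement of the centers in this iteration
    tempDist = 0
    for (i, point) in newPoints:
        tempDist += distanceSquared(kPoints[i], point)
    print "Distance between iterations:", tempDist
    # Copy the new centers for the next iteration
    for (i, point) in newPoints:
        kPoints[i] = point

print "Final center points:", kPoints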