Chapter 16: Spark Algorithms
Course Chapters
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
■ Examples
– Risk analysis
– “How likely is this borrower to pay back a loan?”
– Recommendations
– “Which products will this customer enjoy?”
– Predictions
– “How can we prevent service outages instead of simply reacting to them?”
– Classification
– “How can we tell which mail is spam and which is legitimate?”
■ PageRank gives web pages a ranking score based on links from other pages
– Higher scores are given for more links, and for links from other high-ranking pages
■ Why do we care?
– PageRank is a classic example of big data analysis (like WordCount)
– Lots of data – needs an algorithm that is distributable and scalable
– Iterative – the more iterations, the better the answer
■ Example: a network of four pages, each starting with a rank of 1.0

Page     Initial   Iteration 1   Iteration 2   Iteration 10 (Final)
Page 1   1.0       1.85          1.31          1.43
Page 2   1.0       0.58          0.39          0.46
Page 3   1.0       1.0           1.7           1.38
Page 4   1.0       0.58          0.57          0.73
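These ranks come from repeatedly applying the simplified update rule that the code later in this chapter implements (damping factor 0.85); as a sketch of the recurrence:

rank(p) = 0.15 + 0.85 × Σ rank(q) / outdegree(q)

where the sum runs over every page q that links to p, and outdegree(q) is the number of pages q links to.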
■ PageRank in PySpark
– Step 1: read pairs of pages (source page, linked-to page) from the input file
  into an RDD of distinct (page, neighbor) tuples

Input file:
page1 page3
page2 page1
page4 page1
page3 page1
page4 page2
page3 page4

def computeContribs(neighbors, rank): …

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()

Result:
(page1,page3)
(page2,page1)
(page4,page1)
(page3,page1)
(page4,page2)
(page3,page4)
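The body of computeContribs is elided above; a minimal sketch of what it might look like, assuming each neighbor receives an equal share of the page's current rank:

# Hypothetical implementation (only the signature appears above):
# give each neighbor an equal share of this page's current rank
def computeContribs(neighbors, rank):
    num_neighbors = len(neighbors)
    for neighbor in neighbors:
        yield (neighbor, rank / num_neighbors)

With this definition, a page with rank 1.0 and neighbors [page1, page4] would yield (page1, 0.5) and (page4, 0.5), matching the contribs values shown later in the walkthrough.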
– Step 2: group the pairs by source page to build each page's neighbor list

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()

links:
(page4, [page2,page1])
(page2, [page1])
(page3, [page1,page4])
(page1, [page3])
– Step 3: persist the links RDD; it never changes and is reused in every iteration

links = sc.textFile(file)\
    .map(lambda line: line.split())\
    .map(lambda pages: (pages[0],pages[1]))\
    .distinct()\
    .groupByKey()\
    .persist()
– Step 4: create a ranks RDD; every page starts with a rank of 1.0

links = …
ranks = …

links:                        ranks:
(page4, [page2,page1])        (page4, 1.0)
(page2, [page1])              (page2, 1.0)
(page3, [page1,page4])        (page3, 1.0)
(page1, [page3])              (page1, 1.0)

– Step 5: iterate; each iteration starts by joining links with ranks

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        …

Joined elements look like: (page4, ([page2,page1], 1.0))
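The ranks = … line is elided above; a minimal sketch of one way to build it, assuming every page keyed in links starts at 1.0 (Python 2 tuple-unpacking style, as in the rest of the chapter):

# Hypothetical initialization (the slides only show the resulting RDD)
ranks = links.map(lambda (page, neighbors): (page, 1.0))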
– Step 6: compute each page's contributions to its neighbors with computeContribs,
  then sum the contributions received by each page

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)

contribs:                 ranks (after reduceByKey):
(page2,0.5)               (page4,0.5)
(page1,0.5)               (page2,0.5)
(page1,1.0)               (page3,1.0)
(page1,0.5)               (page1,2.0)
(page4,0.5)
(page3,1.0)
– Step 7: apply the damping factor to produce each page's new rank

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

ranks after iteration 1:
(page4,.58)
(page2,.58)
(page3,1.0)
(page1,1.85)
– Step 8: after the loop, collect and print the ranks

def computeContribs(neighbors, rank): …

links = …
ranks = …

for x in xrange(10):
    contribs = links\
        .join(ranks)\
        .flatMap(lambda (page,(neighbors,rank)): \
            computeContribs(neighbors,rank))
    ranks = contribs\
        .reduceByKey(lambda v1,v2: v1+v2)\
        .map(lambda (page,contrib): \
            (page,contrib * 0.85 + 0.15))

for rank in ranks.collect(): print rank

ranks:
(page4,0.57)
(page2,0.21)
(page3,1.0)
(page1,0.77)
■ k-means Clustering
– A common iterative algorithm used in graph analysis and machine learning
– You will implement a simplified version in the homework assignment
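– The homework follows the usual k-means loop, outlined here for orientation (the detailed steps appear below):
  1. Choose K starting center points (a random sample of the data)
  2. Assign each point to its closest current center
  3. Recompute each center as the mean of the points assigned to it
  4. Repeat steps 2–3 until the centers move less than a convergence threshold (convergeDist)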
devstatus = sc.textFile("file:/home/training/training_materials/data/devicestatus.txt")

cleanstatus = devstatus. \
    …    # parsing/cleanup transformations omitted

devicedata = cleanstatus. \
    …    # field-selection transformations omitted

# Save to a CSV file as a comma-delimited string (trim parentheses from the tuple's string form)
devicedata. \
    saveAsTextFile("/loudacre/devicestatus_etl")
■ Parse the input files, which are comma-delimited, into (latitude, longitude) pairs (the 4th and 5th fields in each line). The files are located at /loudacre/devicestatus_etl/*
■Expected output
filename = "/loudacre/devicestatus_etl/*"
points = sc.textFile(filename)\
    .map(lambda line: line.split(","))\
    .map(lambda fields: [float(fields[3]),float(fields[4])])\
    .filter(lambda point: sum(point) != 0)\
    .persist()
points.take(5)
■ Take a sample of K = 5 points from the RDD points without replacement, using a seed value of 34. Name the resulting array kPoints.
■ Print the resulting array
■Expected output
K=5
kPoints = points.takeSample(False, K, 34)
print kPoints
■ For each coordinate point, use the provided closestPoint function to map each point to the index in the kPoints array of the center closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). Name the resulting RDD closest.
■ Print the first five elements of closest
■Expected output
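A sketch of this step, assuming the provided closestPoint(point, kPoints) returns the index of the center in kPoints nearest to point:

# Key each point by the index of its closest center, with value (point, 1)
closest = points.map(lambda point: (closestPoint(point, kPoints), (point, 1)))
print closest.take(5)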
■ Reduce the result: for each center in the kPoints array, sum the latitudes and longitudes, respectively, of all the points closest to that center, and count those points. That is, for each key (k-point index), reduce by adding the coordinates and the number of points.
■ Name the resulting RDD pointStats and print it.
Tip: use the provided addPoints function to sum the points.
■Expected output
pointStats.collect()
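A sketch of how pointStats might be built, assuming the provided addPoints sums two (latitude, longitude) pairs element-wise:

# For each center index, sum the coordinates and count the points assigned to it
pointStats = closest.reduceByKey(
    lambda (point1, n1), (point2, n2): (addPoints(point1, point2), n1 + n2))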
■ The reduced RDD should have (at most) K members. Map each one to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).
■ Perform a collect on the resulting RDD to produce an array, and name the array newPoints
■Expected output
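A sketch of the averaging step described above:

# Divide the summed coordinates by the number of points to get each new center,
# then collect the (at most K) results into a local array
newPoints = pointStats.map(
    lambda (i, (point, n)): (i, [point[0] / n, point[1] / n])).collect()
print newPoints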
■ Use the provided distanceSquared function to calculate how much each center “moved” between the current iteration and the last: for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That sum is the delta between iterations; when the delta is less than convergeDist, stop iterating.
■ Print the new center points
■Expected output
# Calculate the total distance between the current center points and the new points
tempDist = 0
for (i, point) in newPoints:
    tempDist += distanceSquared(kPoints[i], point)

print "Distance between iterations:", tempDist

# Copy the new points into the kPoints array for the next iteration
for (i, point) in newPoints:
    kPoints[i] = point
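Putting the homework steps together, the whole iteration might be wrapped in a convergence loop along the following lines. This is a sketch, not the assignment's reference solution; convergeDist, closestPoint, addPoints, and distanceSquared are the threshold and helper functions provided with the assignment.

# Iterate until the centers move less than convergeDist in total
tempDist = float("inf")
while tempDist > convergeDist:
    # Assign each point to the index of its closest current center
    closest = points.map(lambda point: (closestPoint(point, kPoints), (point, 1)))
    # Sum the coordinates and count the points assigned to each center
    pointStats = closest.reduceByKey(
        lambda (p1, n1), (p2, n2): (addPoints(p1, p2), n1 + n2))
    # Average to get the new center points
    newPoints = pointStats.map(
        lambda (i, (p, n)): (i, [p[0] / n, p[1] / n])).collect()
    # Total movement of the centers in this iteration
    tempDist = 0
    for (i, point) in newPoints:
        tempDist += distanceSquared(kPoints[i], point)
    print "Distance between iterations:", tempDist
    # Copy the new centers for the next iteration
    for (i, point) in newPoints:
        kPoints[i] = point

print "Final center points:", kPoints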