03 MapReduce
Dimitris Kotzinos
Today’s Agenda
¢ MapReduce algorithm design
l How do you express everything in terms of m (map), r (reduce), c (combine), and p (partition)?
l Toward “design patterns”
¢ Indexing / Retrieval
l Basics of indexing and retrieval
l Inverted indexing in MapReduce
¢ Graph Algorithms:
l Graph problems and representations
l Parallel breadth-first search
l PageRank
¢ Relational Algorithms:
l Selection, projection, aggregation
l Group by
l Joins
MapReduce Algorithm Design
MapReduce: Recap
¢ Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v'*) → <k’, v’>*
l All values with the same key are reduced together
¢ Optionally, also:
partition (k’, number of partitions) → partition for k’
l Often a simple hash of the key, e.g., hash(k’) mod n
l Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
l Mini-reducers that run in memory after the map phase
l Used as an optimization to reduce network traffic
¢ The execution framework handles everything else…
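These four functions are all the programmer controls. As a self-contained illustration (a plain-Python simulation of the framework, not an actual Hadoop job; all names are illustrative), here is word count expressed in terms of m, r, c, p:

```python
from collections import defaultdict

def map_fn(docid, text):
    # map(k, v) -> list of (k', v') pairs
    for word in text.split():
        yield (word, 1)

def combine_fn(key, values):
    # mini-reducer: pre-aggregate map output locally
    yield (key, sum(values))

def partition_fn(key, num_partitions):
    # default partitioner: hash of the key mod number of reducers
    return hash(key) % num_partitions

def reduce_fn(key, values):
    yield (key, sum(values))

def run_mapreduce(inputs, num_reducers=2):
    # map + local combine, routed to partitions
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for docid, text in inputs:
        local = defaultdict(list)
        for k, v in map_fn(docid, text):
            local[k].append(v)
        for k, vs in local.items():
            for k2, v2 in combine_fn(k, vs):
                partitions[partition_fn(k2, num_reducers)][k2].append(v2)
    # shuffle/sort + reduce
    out = {}
    for part in partitions:
        for k in sorted(part):
            for k2, v2 in reduce_fn(k, part[k]):
                out[k2] = v2
    return out
```

The combiner here is safe because addition is associative and commutative; the result is the same with or without it.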
“Everything Else”
¢ The execution framework handles everything else…
l Scheduling: assigns workers to map and reduce tasks
l “Data distribution”: moves processes to data
l Synchronization: gathers, sorts, and shuffles intermediate data
l Errors and faults: detects worker failures and restarts
¢ Limited control over data and execution flow
l All algorithms must be expressed in m, r, c, p
¢ You don’t know:
l Where mappers and reducers run
l When a mapper or reducer begins or finishes
l Which input a particular mapper is processing
l Which intermediate key a particular reducer is processing
Tools for Synchronization
¢ Cleverly-constructed data structures
l Bring partial results together
¢ Sort order of intermediate keys
l Control order in which reducers process keys
¢ Partitioner
l Control which reducer processes which keys
¢ Preserving state in mappers and reducers
l Capture dependencies across multiple keys and values
Preserving State
[Figure: a Mapper object preserves internal state across successive calls to map() within one task; the same applies to Reducer objects]
f(B|A): “Stripes”
(a, b) → 1
(a, c) → 2
(a, d) → 5 a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
(a, e) → 3
(a, f) → 2
f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)
¢ Easy!
l One pass to compute (a, *)
l Another pass to directly compute f(B|A)
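The stripe representation above maps directly onto code. A minimal sketch (plain Python standing in for the framework; names are illustrative): each mapper emits one associative array per left element, and the reducer sums stripes element-wise, then normalizes to obtain f(B|A):

```python
from collections import Counter, defaultdict

def stripes_map(pairs):
    # emit one "stripe" (associative array of co-occurrence counts)
    # per left element a
    stripes = defaultdict(Counter)
    for a, b in pairs:
        stripes[a][b] += 1
    return stripes.items()

def stripes_reduce(a, stripes):
    # element-wise sum of all stripes for key a, then normalize
    total = Counter()
    for s in stripes:
        total.update(s)
    z = sum(total.values())
    return {b: c / z for b, c in total.items()}
```

Normalization happens entirely inside one reducer call, since the whole stripe for a arrives together.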
f(B|A): “Pairs”
[Figure: the search process — query formulation, search, selection from results, examination of documents, and information delivery; along the way the user performs system, vocabulary, concept, and document discovery]
The Central Problem in Search
[Figure: an author encodes concepts as documents (offline) and a searcher encodes concepts as a query (online); representation functions map both into a common form, and a comparison function against the index produces hits]
How do we represent text?
¢ Remember: computers don’t “understand” anything!
¢ “Bag of words”
l Treat all the words in a document as index terms
l Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
l Disregard order, structure, meaning, etc. of the words
l Simple, yet effective!
¢ Assumptions
l Term occurrence is independent
l Document relevance is independent
l “Words” are well-defined
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。
這是他今年第二度因同樣的病因住院。 وﻗﺎل ﻣﺎرك رﯾﺠﯿﻒ- اﻟﻨﺎطﻖ ﺑﺎﺳﻢ
اﻟﺨﺎرﺟﯿﺔ اﻹﺳﺮاﺋﯿﻠﯿﺔ - إن ﺷﺎرون ﻗﺒﻞ
اﻟﺪﻋﻮة وﺳﯿﻘﻮم ﻟﻠﻤﺮة اﻷوﻟﻰ ﺑﺰﯾﺎرة
اﻟﺘﻲ ﻛﺎﻧﺖ ﻟﻔﺘﺮة طﻮﯾﻠﺔ اﻟﻤﻘﺮ ،ﺗﻮﻧﺲ
اﻟﺮﺳﻤﻲ ﻟﻤﻨﻈﻤﺔ اﻟﺘﺤﺮﯾﺮ اﻟﻔﻠﺴﻄﯿﻨﯿﺔ ﺑﻌﺪ ﺧﺮوﺟﮭﺎ ﻣﻦ ﻟﺒﻨﺎن ﻋﺎم 1982.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात
फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर
दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
[Figure: documents are reduced to a bag of words (discarding syntax, semantics, word knowledge, etc.), from which an inverted index is built]
Boolean Retrieval
¢ Users express queries as a Boolean expression
l AND, OR, NOT
l Can be arbitrarily nested
¢ Retrieval is based on the notion of sets
l Any given query divides the collection into two sets:
retrieved, not-retrieved
l Pure Boolean systems do not define an ordering of the results
Inverted Index: Boolean Retrieval
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
term → postings (docnos, sorted)
blue → 2
cat → 3
egg → 4
fish → 1, 2
green → 4
ham → 4
hat → 3
one → 1
red → 2
two → 1
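The toy index above can be reproduced in a few lines. A single-machine sketch (illustrative Python; no stemming is applied, so “eggs” would stay unstemmed, unlike the slide’s “egg”):

```python
from collections import defaultdict

def build_index(docs):
    # docs: {docno: text}; postings lists stay sorted by docno
    # because docnos are processed in order
    index = defaultdict(list)
    for docno in sorted(docs):
        for term in sorted(set(docs[docno].split())):
            index[term].append(docno)
    return dict(index)
```

Example: `build_index({1: "one fish two fish", 2: "red fish blue fish", 3: "cat in the hat", 4: "green eggs and ham"})` yields `fish → [1, 2]`, `hat → [3]`, and so on.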
Boolean Retrieval
¢ To execute a Boolean query:
l Build query syntax tree, e.g., for ( blue AND fish ) OR ham:

        OR
       /  \
     ham   AND
          /   \
       blue   fish

l For each clause, look up postings:
blue → 2
fish → 1, 2
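Postings lookup is followed by list operations: AND is an intersection of sorted postings lists, OR a union. A sketch (illustrative Python):

```python
def intersect(p1, p2):
    # linear merge of two sorted postings lists (Boolean AND)
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    # Boolean OR: set union, re-sorted by docno
    return sorted(set(p1) | set(p2))
```

For the query above: `union(intersect([2], [1, 2]), [4])` evaluates ( blue AND fish ) OR ham.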
[Figure: documents d1, d3, d4, d5 and a query as vectors in the space spanned by terms t1, t2; θ and φ are the angles between vectors]

sim(d_j, d_k) = d_j · d_k = Σ_{i=1..n} w_{i,j} · w_{i,k}
Term Weighting
¢ Term weights consist of two components
l Local: how important is the term in this document?
l Global: how important is the term in the collection?
¢ Here’s the intuition:
l Terms that appear often in a document should get high weights
l Terms that appear in many documents should get low weights
¢ How do we capture this mathematically?
l Term frequency (local)
l Inverse document frequency (global)
TF.IDF Term Weighting
w_{i,j} = tf_{i,j} × log(N / n_i)

l w_{i,j}: weight assigned to term i in document j
l tf_{i,j}: number of occurrences of term i in document j
l N: total number of documents in the collection
l n_i: number of documents containing term i
term   tf (docs 1–4)   df   postings (docno, tf)
blue   – 1 – –         1    (2, 1)
cat    – – 1 –         1    (3, 1)
egg    – – – 1         1    (4, 1)
fish   2 2 – –         2    (1, 2), (2, 2)
green  – – – 1         1    (4, 1)
ham    – – – 1         1    (4, 1)
hat    – – 1 –         1    (3, 1)
one    1 – – –         1    (1, 1)
red    – 1 – –         1    (2, 1)
two    1 – – –         1    (1, 1)
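Plugging the table’s numbers into the weighting formula is mechanical. A sketch (Python; the logarithm base is a convention, natural log is used here):

```python
import math

def tfidf(tf, df, N):
    # w_ij = tf_ij * log(N / n_i), with n_i = df
    return tf * math.log(N / df)
```

Note that a term appearing in every document (df = N) gets weight 0, which matches the intuition that such terms carry no discriminating power.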
Positional Indexes
¢ Store term position in postings
¢ Supports richer queries (e.g., proximity)
¢ Naturally, leads to larger indexes…
Inverted Index: Positional Information
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
Example posting with positions: blue → (docno 2, tf 1, positions [3]), df 1
Retrieval: Document-at-a-Time
¢ Tradeoffs
l Small memory footprint (good)
l Must read through all postings (bad), but skipping possible
l More disk seeks (bad), but reading in blocks possible
Retrieval: Query-At-A-Time
fish → (1, 2) (9, 1) (21, 3) (34, 1) (35, 2) (80, 3) … — postings as (docno, tf) pairs
¢ Tradeoffs
l Early termination heuristics (good)
l Large memory footprint (bad), but filtering heuristics possible
MapReduce it?
¢ The indexing problem
l Scalability is critical
l Must be relatively fast, but need not be real time
l Fundamentally a batch operation
l Incremental updates may or may not be important
l For the web, crawling is a challenge in itself
¢ The retrieval problem
l Must have sub-second response time
l For the web, only need relatively few results
Indexing: Performance Analysis
¢ Fundamentally, a large sorting problem
l Terms usually fit in memory
l Postings usually don’t
¢ How is it done on a single machine?
¢ How can it be done with MapReduce?
¢ First, let’s characterize the problem size:
l Size of vocabulary
l Size of postings
Vocabulary Size: Heaps’ Law
M = kT^b
l M is the vocabulary size
l T is the collection size (number of tokens)
l k and b are constants; here k = 44, b = 0.49
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996-August 19, 1997)
cf_i = c / i (Zipf’s law)
l cf_i is the collection frequency of the i-th most common term
l c is a constant
Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996-August 19, 1997)
Map output after shuffle and sort:
fish → (1, 2), (2, 2)
cat → (3, 1)
blue → (2, 1)
hat → (3, 1)
one → (1, 1)
two → (1, 1)
red → (2, 1)
Reduce merges each term’s values into a postings list, e.g., fish → [(1, 2), (2, 2)]
Inverted Indexing: Pseudo-Code
Positional Indexes
Doc 1 Doc 2 Doc 3
one fish, two fish red fish, blue fish cat in the hat
cat → (docno 3, tf 1, positions [1])
blue → (docno 2, tf 1, positions [3])
fish → (1, 2) (9, 1) (21, 3) (34, 1) (35, 2) (80, 3) …
In Practice:
• Don’t encode docnos, encode gaps (or d-gaps)
• But it’s not obvious that this will save space…
fish → (1, 2) (8, 1) (12, 3) (13, 1) (1, 2) (45, 3) … — the same postings with docnos encoded as d-gaps
Overview of Index Compression
¢ Byte-aligned vs. bit-aligned
¢ Non-parameterized bit-aligned
l Unary codes
l γ codes
l δ codes
¢ Parameterized bit-aligned
l Golomb codes (local Bernoulli model)
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
Unary Codes
¢ x ≥ 1 is coded as x−1 one bits, followed by 1 zero bit
l 3 = 110
l 4 = 1110
¢ Great for small numbers… horrible for large numbers
l Overly-biased for very small gaps
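Unary and γ codes can be generated directly from their definitions. A sketch (Python; bit strings are returned as text for readability, and the `:` separators in the table below are omitted):

```python
from math import floor, log2

def unary(x):
    # x >= 1: (x - 1) one bits followed by a single zero bit
    return "1" * (x - 1) + "0"

def gamma(x):
    # Elias gamma: unary code of the length, then binary offset
    if x == 1:
        return "0"
    k = floor(log2(x))
    offset = bin(x - (1 << k))[2:].rjust(k, "0")
    return unary(k + 1) + offset
```

For example, gamma(9) produces the bits written as 1110:001 in the table.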
x   Unary        γ          δ           Golomb (b=3)  Golomb (b=6)
1   0            0          0           0:0           0:00
2   10           10:0       100:0       0:10          0:01
3   110          10:1       100:1       0:11          0:100
4   1110         110:00     101:00      10:0          0:101
5   11110        110:01     101:01      10:10         0:110
6   111110       110:10     101:10      10:11         0:111
7   1111110      110:11     101:11      110:0         10:00
8   11111110     1110:000   11000:000   110:10        10:01
9   111111110    1110:001   11000:001   110:11        10:100
10  1111111110   1110:010   11000:010   1110:0        10:101
Bible TREC
Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2070 MB)
(key) (value)
fish 1 → [2,4]
fish 9 → [9]
fish 21 → [1,8,22]
fish 34 → [23]
fish 35 → [8,41]
fish 80 → [2,9,76]

But wait! How do we set the Golomb parameter b?
Recall: optimal b ≈ 0.69 (N/df)
We need the df to set b… but we don’t know the df until we’ve seen all postings!
Sound familiar?
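Golomb encoding itself is short once b is known; the hard part, as noted, is that b depends on the df. A sketch (Python; the `:` separators from the code table are omitted, remainders use truncated binary, and the 0.69·(N/df) rule is the heuristic from the slides):

```python
from math import ceil, log2

def golomb(x, b):
    # Golomb code for x >= 1 with parameter b:
    # quotient in unary, remainder in truncated binary
    q, r = divmod(x - 1, b)
    prefix = "1" * q + "0"
    if b == 1:
        return prefix
    k = ceil(log2(b))
    cutoff = (1 << k) - b
    if r < cutoff:
        rem = bin(r)[2:].rjust(k - 1, "0")
    else:
        rem = bin(r + cutoff)[2:].rjust(k, "0")
    return prefix + rem

def golomb_b(N, df):
    # heuristic parameter choice from the slides: b ≈ 0.69 * (N / df)
    return max(1, round(0.69 * N / df))
```

This reproduces the table’s entries, e.g., golomb(7, 3) gives the bits written there as 110:0.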
Getting the df
¢ In the mapper:
l Emit “special” key-value pairs to keep track of df
¢ In the reducer:
l Make sure “special” key-value pairs come first: process them to
determine df
¢ Remember: proper partitioning!
Getting the df: Modified Mapper
Doc 1
one fish, two fish Input document…
(key) (value)
one 1 → [1]
two 1 → [3]
one ★ → [1]
two ★ → [1]
(★ denotes the “special” key used to accumulate the df)
Getting the df: Modified Reducer
(key) (value)
fish 9 [9]
fish 35 [8,41]
fish 80 [2,9,76]
[Figure: term partitioning — the index is split by terms (T1, T2, T3, …), each partition covering all documents D — versus document partitioning — the index is split by document subsets (D1, D2, D3), each covering all terms T]
Katta Architecture
(Distributed Lucene)
http://katta.sourceforge.net/
Graph Algorithms in MapReduce
Some Graph Problems
¢ Finding shortest paths
l Routing Internet traffic and UPS trucks
¢ Finding minimum spanning trees
l Telco laying down fiber
¢ Finding Max Flow
l Airline scheduling
¢ Identifying “special” nodes and communities
l Breaking up terrorist cells, spread of avian flu
¢ Bipartite matching
l Monster.com, Match.com
¢ And of course... PageRank
Graphs and MapReduce
¢ Graph algorithms typically involve:
l Performing computations at each node: based on node
features, edge features, and local link structure
l Propagating computations: “traversing” the graph
¢ Key questions:
l How do you represent graph data in MapReduce?
l How do you traverse a graph in MapReduce?
Representing Graphs
¢ G = (V, E)
¢ Two common representations
l Adjacency matrix
l Adjacency list
Adjacency Matrices
Represent a graph as an n x n square matrix M
l n = |V|
l Mij = 1 means a link from node i to j
[Graph on nodes 1–4, matching the adjacency list on the next slide]
    1  2  3  4
1   0  1  0  1
2   1  0  1  1
3   1  0  0  0
4   1  0  1  0
Adjacency Matrices: Critique
¢ Advantages:
l Amenable to mathematical manipulation
l Iteration over rows and columns corresponds to
computations on outlinks and inlinks
¢ Disadvantages:
l Lots of zeros for sparse matrices
l Lots of wasted space
Adjacency Lists
Take adjacency matrices… and throw away all the zeros
1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
Adjacency Lists: Critique
¢ Advantages:
l Much more compact representation
l Easy to compute over outlinks
¢ Disadvantages:
l Much more difficult to compute over inlinks
Single Source Shortest Path
¢ Problem: find shortest path from a source node to one
or more target nodes
l Shortest might also mean lowest weight or cost
¢ First, a refresher: Dijkstra’s Algorithm
Dijkstra’s Algorithm Example
[Figure: Dijkstra’s algorithm on a five-node weighted graph. The source starts at distance 0 and all other nodes at ∞; at each step the minimum-distance unsettled node is chosen and its outgoing edges relaxed, shrinking tentative distances (e.g., 10 → 8, 14 → 13 → 9) until the final distances 0, 8, 5, 9, 7 are reached]
Visualizing Parallel BFS
[Figure: BFS frontier expanding hop by hop from source n0 through nodes n1–n9]
From Intuition to Algorithm
¢ Data representation:
l Key: node n
l Value: d (distance from start), adjacency list (list of nodes
reachable from n)
l Initialization: for all nodes except for the start node, d = ∞
¢ Mapper:
l ∀ m ∈ adjacency list: emit (m, d + 1)
¢ Sort/Shuffle
l Groups distances by reachable nodes
¢ Reducer:
l Selects minimum distance path for each reachable node
l Additional bookkeeping needed to keep track of actual path
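One iteration of this mapper/sort/reducer cycle can be simulated in-process. A sketch (plain Python; the "graph"/"dist" tags are an illustrative way to pass graph structure through the shuffle, and every node is assumed to appear as a key):

```python
INF = float("inf")

def bfs_iteration(graph):
    # graph: node -> (distance, adjacency_list)
    emitted = {}
    # mapper: re-emit the graph structure, plus a tentative
    # distance d + 1 for every reachable neighbor
    for n, (d, adj) in graph.items():
        emitted.setdefault(n, []).append(("graph", adj))
        emitted.setdefault(n, []).append(("dist", d))
        if d < INF:
            for m in adj:
                emitted.setdefault(m, []).append(("dist", d + 1))
    # reducer: keep the minimum distance, recover the adjacency list
    new_graph = {}
    for n, values in emitted.items():
        adj, best = [], INF
        for tag, v in values:
            if tag == "graph":
                adj = v
            else:
                best = min(best, v)
        new_graph[n] = (best, adj)
    return new_graph

def parallel_bfs(graph, iterations):
    for _ in range(iterations):
        graph = bfs_iteration(graph)
    return graph
```

Each call advances the known frontier by one hop, which is why multiple iterations are needed.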
Multiple Iterations Needed
¢ Each MapReduce iteration advances the “known
frontier” by one hop
l Subsequent iterations include more and more reachable
nodes as frontier expands
l Multiple iterations are needed to explore entire graph
¢ Preserving graph structure:
l Problem: Where did the adjacency list go?
l Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Comparison to Dijkstra
¢ Dijkstra’s algorithm is more efficient
l At any step it only pursues edges from the minimum-cost
path inside the frontier
¢ MapReduce explores all paths in parallel
l Lots of “waste”
l Useful work is only done at the “frontier”
¢ Why can’t we do better using MapReduce?
Weighted Edges
¢ Now add positive weights to the edges
l Why can’t edge weights be negative?
¢ Simple change: adjacency list now includes a weight
w for each edge
l In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m, where w is the weight of the edge to m
¢ That’s it?
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Additional Complexities
[Figure: a search frontier where the direct edge to a node has weight 10, but a longer path of weight-1 edges through nodes p, q, r, s outside the frontier is cheaper — later iterations may revise a node’s distance even after it is first discovered]
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Graphs and MapReduce
¢ Graph algorithms typically involve:
l Performing computations at each node: based on node
features, edge features, and local link structure
l Propagating computations: “traversing” the graph
¢ Key questions:
l How do you represent graph data in MapReduce?
l How do you traverse a graph in MapReduce?
Random Walks Over the Web
¢ Random surfer model:
l User starts at a random Web page
l User randomly clicks on links, surfing from page to page
¢ PageRank
l Characterizes the amount of time spent on any given page
l Mathematically, a probability distribution over pages
¢ PageRank captures notions of page importance
l Correspondence to human intuition?
l One of thousands of features used in web search
l Note: query-independent
PageRank: Defined
Given page x with inlinks t1…tn, where
l C(t) is the out-degree of t
l α is the probability of a random jump
l N is the total number of nodes in the graph

PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)
Computing PageRank
¢ Properties of PageRank
l Can be computed iteratively
l Effects at each iteration are local
¢ Sketch of algorithm:
l Start with seed PRi values
l Each page distributes PRi “credit” to all pages it links to
l Each target page adds up “credit” from multiple in-bound
links to compute PRi+1
l Iterate until values converge
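The credit-passing sketch above, for the simplified case with no random jump and no dangling nodes, looks like this (illustrative Python):

```python
def pagerank_iteration(graph, pr):
    # graph: node -> list of outlinks; pr: node -> current PageRank
    new_pr = {n: 0.0 for n in graph}
    for n, outlinks in graph.items():
        # each page splits its credit evenly across its outlinks
        # (assumes no dangling nodes, i.e. outlinks is non-empty)
        share = pr[n] / len(outlinks)
        for m in outlinks:
            new_pr[m] += share
    return new_pr
```

Because all mass is redistributed and none is lost, the values still sum to 1 after each iteration.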
Simplified PageRank
¢ First, tackle the simple case:
l No random jump factor
l No dangling links
¢ Then, factor in these complexities…
l Why do we need the random jump?
l Where do dangling links come from?
Sample PageRank Iteration (1)
[Figure: iteration 1 — every node starts with PR 0.2 and splits it evenly across its outlinks (edge contributions of 0.1 and 0.066 shown); summing at each target gives n1 = 0.066, n3 = 0.166, n4 = 0.3, n5 = 0.3]
Sample PageRank Iteration (2)
[Figure: iteration 2 — repeating the redistribution gives n1 = 0.1, n3 = 0.183, n4 = 0.2, n5 = 0.383]
PageRank in MapReduce
[Figure: in the map phase, each node’s PageRank mass is divided across its outlinks and emitted keyed by target node; the shuffle groups contributions by target, and the reduce phase sums them]
p′ = α (1/|G|) + (1 − α) (m/|G| + p)

l p is the PageRank value from before, p′ is the updated PageRank value
l |G| is the number of nodes in the graph
l m is the missing PageRank mass
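The reducer-side correction for each node is a direct transcription of the update for p′ (illustrative Python):

```python
def pagerank_update(p, m, alpha, G):
    # p' = alpha * (1/|G|) + (1 - alpha) * (m/|G| + p)
    # p: summed PageRank mass for this node from the reduce phase
    # m: missing mass (from dangling nodes), spread evenly
    # alpha: random-jump probability; G: number of nodes (|G|)
    return alpha * (1.0 / G) + (1 - alpha) * (m / G + p)
```

With α = 0 and no missing mass the update is the identity, which recovers the simplified algorithm.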
PageRank Convergence
¢ Alternative convergence criteria
l Iterate until PageRank values don’t change
l Iterate until PageRank rankings don’t change
l Fixed number of iterations
¢ Convergence for web graphs?
Beyond PageRank
¢ Link structure is important for web search
l PageRank is one of many link-based features: HITS, SALSA,
etc.
l One of many thousands of features used in ranking…
¢ Adversarial nature of web search
l Link spamming
l Spider traps
l Keyword stuffing
l …
Efficient Graph Algorithms
¢ Sparse vs. dense graphs
¢ Graph topologies
Local Aggregation
¢ Use combiners!
l In-mapper combining design pattern also applicable
¢ Maximize opportunities for local aggregation
l Simple tricks: sorting the dataset in specific ways
Relational Processing
on MapReduce
ETL
(Extract, Transform, and Load)
OLTP OLAP
OLTP/OLAP Integration
¢OLTP database for user-facing transactions
l Retain records of all activity
l Periodic ETL (e.g., nightly)
¢Extract-Transform-Load (ETL)
l Extract records from source
l Transform: clean data, check integrity, aggregate, etc.
l Load into OLAP database
¢OLAP database for data warehousing
l Business intelligence: reporting, ad hoc queries, data mining, etc.
l Feedback to improve OLTP services
Business Intelligence
¢Premise: more data leads to better business decisions
l Periodic reporting as well as ad hoc queries
l Analysts, not programmers (importance of tools and dashboards)
¢Examples:
l Slicing-and-dicing activity by different dimensions to better
understand the marketplace
l Analyzing log data to improve OLTP experience
l Analyzing log data to better optimize ad placement
l Analyzing purchasing trends for better supply-chain management
l Mining for correlations between otherwise unrelated activities
OLTP/OLAP Architecture: Hadoop?
OLTP OLAP
Hadoop here?
OLTP/OLAP/Hadoop Architecture
ETL
(Extract, Transform, and Load)
“Everything Else”
¢The execution framework handles everything else…
l Scheduling: assigns workers to map and reduce tasks
l “Data distribution”: moves processes to data
l Synchronization: gathers, sorts, and shuffles intermediate data
l Errors and faults: detects worker failures and restarts
¢Limited control over data and execution flow
l All algorithms must be expressed in m, r, c, p
¢You don’t know:
l Where mappers and reducers run
l When a mapper or reducer begins or finishes
l Which input a particular mapper is processing
l Which intermediate key a particular reducer is processing
MapReduce algorithms
for processing relational data
Design Pattern: Secondary Sorting
¢MapReduce sorts input to reducers by key
l Values are arbitrarily ordered
¢What if we also want to sort values?
l E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
Secondary Sorting: Solutions
¢Solution 1:
l Buffer values in memory, then sort
l Why is this a bad idea?
¢Solution 2:
l “Value-to-key conversion” design pattern: form composite
intermediate key, (k, v1)
l Let execution framework do the sorting
l Preserve state across multiple key-value pairs to handle
processing
l Anything else we need to do?
Value-to-Key Conversion
Before
k → (v1, r), (v4, r), (v8, r), (v3, r)…
Values arrive in arbitrary order…
After
(k, v1) → (v1, r) Values arrive in sorted order…
(k, v3) → (v3, r) Process by preserving state across multiple keys
Remember to partition correctly!
(k, v4) → (v4, r)
(k, v8) → (v8, r)
…
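Value-to-key conversion can be demonstrated in a few lines. A sketch (plain Python standing in for the framework’s partition-then-sort behavior; names are illustrative):

```python
def secondary_sort(pairs, num_reducers=2):
    # pairs: list of ((k, v), record)
    # partition on k ONLY, so all values for k land in one reducer;
    # sort within each partition on the full composite key (k, v)
    partitions = [[] for _ in range(num_reducers)]
    for (k, v), r in pairs:
        partitions[hash(k) % num_reducers].append(((k, v), r))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions
```

The framework does the sorting; the reducer then sees each key’s records in value order, and must preserve state across the composite keys that share the same k.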
Working Scenario
¢Two tables:
l User demographics (gender, age, income, etc.)
l User page visits (URL, time spent, etc.)
¢Analyses we might want to perform:
l Statistics on demographic characteristics
l Statistics on page visits
l Statistics on page visits by URL
l Statistics on page visits by demographic characteristic
l …
Relational Algebra
¢Primitives
l Projection (π)
l Selection (σ)
l Cartesian product (×)
l Set union (∪)
l Set difference (−)
l Rename (ρ)
¢Other operations
l Join (⋈)
l Group by… aggregation
l …
Projection
R1 R1
R2 R2
R3 R3
R4 R4
R5 R5
Projection in MapReduce
¢Easy!
l Map over tuples, emit new tuples with appropriate attributes
l No reducers, unless for regrouping or resorting tuples
l Alternatively: perform in reducer, after some other processing
¢Basically limited by HDFS streaming speeds
l Speed of encoding/decoding tuples becomes important
l Relational databases take advantage of compression
l Semistructured data? No problem!
Selection
R1
R2
R1
R3
R3
R4
R5
Selection in MapReduce
¢Easy!
l Map over tuples, emit only tuples that meet criteria
l No reducers, unless for regrouping or resorting tuples
l Alternatively: perform in reducer, after some other processing
¢Basically limited by HDFS streaming speeds
l Speed of encoding/decoding tuples becomes important
l Relational databases take advantage of compression
l Semistructured data? No problem!
Group by… Aggregation
¢Example: What is the average time spent per URL?
¢In SQL:
l SELECT url, AVG(time) FROM visits GROUP BY url
¢In MapReduce:
l Map over tuples, emit time, keyed by url
l Framework automatically groups values by keys
l Compute average in reducer
l Optimize with combiners
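Because a sum and a count are both associative, the combiner can pre-aggregate safely (averaging raw values in a combiner would be wrong). A sketch (plain Python; map, combine, and reduce collapsed into one in-process pass for brevity):

```python
from collections import defaultdict

def average_time_per_url(visits):
    # visits: (url, time) pairs
    # combiner-friendly trick: carry (sum, count) partials,
    # never partial averages
    partials = defaultdict(lambda: (0.0, 0))
    for url, t in visits:          # map + combine
        s, c = partials[url]
        partials[url] = (s + t, c + 1)
    # reduce: merge partial (sum, count) pairs, divide once at the end
    return {url: s / c for url, (s, c) in partials.items()}
```

This mirrors SELECT url, AVG(time) FROM visits GROUP BY url.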
Relational Joins
[Figure: joining relations R and S — tuples R1…R4 paired with tuples S1…S4 that share the same join key]
Types of Relationships
[Figure: one-to-one and one-to-many relationships between R tuples and S tuples]
Reduce-side Join: 1-to-1
In the reducer, each key’s value list holds at most one tuple from R and one from S, e.g., key → [R1, S2] or [S3, R4] (arrival order is not guaranteed without secondary sorting)
Reduce-side Join: 1-to-many
In the reducer, each key’s value list holds one R tuple and many S tuples, e.g., key → [R1, S2, S3, S9, …]; with value-to-key conversion the R tuple arrives first and is held while the S tuples are streamed
Reduce-side Join: many-to-many
In the reducer, buffer all R tuples for the key in memory (R1, R5, R8, …), then stream each S tuple (S3, S9, …) and emit its pairing with every buffered R tuple
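The buffering strategy above can be sketched in-process (plain Python; a dict stands in for the shuffle, and tuples are grouped by relation within each key):

```python
from collections import defaultdict

def reduce_side_join(R, S):
    # R: (key, r_tuple) pairs; S: (key, s_tuple) pairs
    # the shuffle brings all tuples for a key to one reducer;
    # tagging by relation plays the role of value-to-key conversion
    grouped = defaultdict(lambda: ([], []))
    for k, r in R:
        grouped[k][0].append(r)
    for k, s in S:
        grouped[k][1].append(s)
    out = []
    for k in sorted(grouped):
        rs, ss = grouped[k]
        # hold the R side in memory, stream the S side (many-to-many)
        for r in rs:
            for s in ss:
                out.append((k, r, s))
    return out
```

Holding the smaller relation in memory per key is what makes the many-to-many case tractable.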