
Algorithms in MapReduce

Dimitris Kotzinos
Today’s Agenda
¢ MapReduce algorithm design
l How do you express everything in terms of m, r, c, p?
l Toward “design patterns”
¢ Indexing / Retrieval
l Basics of indexing and retrieval
l Inverted indexing in MapReduce

¢ Graph Algorithms:
l Graph problems and representations
l Parallel breadth-first search
l PageRank
¢ Relational Algorithms:
l Selection, projection, aggregation
l Group by
l Joins
MapReduce Algorithm Design
MapReduce: Recap
¢ Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v'*) → <k’, v’>*
l All values with the same key are reduced together
¢ Optionally, also:
partition (k’, number of partitions) → partition for k’
l Often a simple hash of the key, e.g., hash(k’) mod n
l Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
l Mini-reducers that run in memory after the map phase
l Used as an optimization to reduce network traffic
¢ The execution framework handles everything else…
“Everything Else”
¢ The execution framework handles everything else…
l Scheduling: assigns workers to map and reduce tasks
l “Data distribution”: moves processes to data
l Synchronization: gathers, sorts, and shuffles intermediate data
l Errors and faults: detects worker failures and restarts
¢ Limited control over data and execution flow
l All algorithms must be expressed in terms of m, r, c, p
¢ You don’t know:
l Where mappers and reducers run
l When a mapper or reducer begins or finishes
l Which input a particular mapper is processing
l Which intermediate key a particular reducer is processing
Tools for Synchronization
¢ Cleverly-constructed data structures
l Bring partial results together
¢ Sort order of intermediate keys
l Control order in which reducers process keys
¢ Partitioner
l Control which reducer processes which keys
¢ Preserving state in mappers and reducers
l Capture dependencies across multiple keys and values
Preserving State

[Figure: mapper and reducer object lifecycle. One object is created per task and can hold state across calls: configure (API initialization hook) runs once, map is called once per input key-value pair (reduce once per intermediate key), and close (API cleanup hook) runs when the task ends.]
Scalable Hadoop Algorithms: Themes
¢ Avoid object creation
l Inherently costly operation
l Garbage collection
¢ Avoid buffering
l Limited heap size
l Works for small datasets, but won’t scale!
Importance of Local Aggregation
¢ Ideal scaling characteristics:
l Twice the data, twice the running time
l Twice the resources, half the running time
¢ Why can’t we achieve this?
l Synchronization requires communication
l Communication kills performance
¢ Thus… avoid communication!
l Reduce intermediate data via local aggregation
l Combiners can help
Word Count: Baseline

What’s the impact of combiners?
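The baseline word-count code does not survive in this text version of the slides; below is a minimal Python-style sketch of the same logic (not the actual Hadoop Java API), with a toy in-process driver standing in for the framework's shuffle:

```python
from collections import defaultdict

def map_fn(docid, doc):
    # Emit (term, 1) for every token in the document.
    for term in doc.split():
        yield term, 1

def reduce_fn(term, counts):
    # Sum all partial counts for this term.
    yield term, sum(counts)

def run(documents):
    # Toy stand-in for the MapReduce shuffle: group map output by key.
    groups = defaultdict(list)
    for docid, doc in documents.items():
        for k, v in map_fn(docid, doc):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs))

print(run({1: "one fish two fish", 2: "red fish blue fish"}))
# {'blue': 1, 'fish': 4, 'one': 1, 'red': 1, 'two': 1}
```

Without a combiner, every individual (term, 1) pair is shuffled across the network; a combiner (or the in-mapper combining pattern discussed below) pre-aggregates those pairs per map task.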


Word Count: Version 1

Are combiners still needed?


Word Count: Version 2

Are combiners still needed?


Design Pattern for Local Aggregation
¢ “In-mapper combining”
l Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
¢ Advantages
l Speed
l Why is this faster than actual combiners?
¢ Disadvantages
l Explicit memory management required
l Potential for order-dependent bugs
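The Version 1 and Version 2 word-count slides show this evolution as images; the sketch below assumes hypothetical setup/map/cleanup hooks that mirror the configure/map/close lifecycle described earlier (not the real Hadoop method names):

```python
class InMapperCombiningMapper:
    """Word count mapper that folds the combiner into the mapper.

    State (the counts dict) is preserved across map() calls within one task;
    nothing is emitted until cleanup(), so far fewer intermediate pairs cross
    the network.  This is a sketch, not the actual Hadoop API.
    """

    def setup(self):
        self.counts = {}                 # term -> partial count, held in memory

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1

    def cleanup(self, emit):
        # Emit one (term, partial_count) pair per distinct term seen by this task.
        for term, count in self.counts.items():
            emit(term, count)
```

Because aggregation happens in the mapper's own memory, no separate combiner pass (with its object creation and serialization) is needed; the price is explicit memory management and the possibility of order-dependent bugs.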
Combiner Design
¢ Combiners and reducers share same method signature
l Sometimes, reducers can serve as combiners
l Often, not…
¢ Remember: combiners are optional optimizations
l Should not affect algorithm correctness
l May be run 0, 1, or multiple times
¢ Example: find average of all integers associated with the
same key
Computing the Mean: Version 1

Why can’t we use reducer as combiner?


Computing the Mean: Version 2

Why doesn’t this work?


Computing the Mean: Version 3

Fixed?
Computing the Mean: Version 4

Are combiners still needed?
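The four mean-computation versions are shown as images in the original deck; the essential fix, sketched below, is to make the intermediate value a (sum, count) pair, so the combiner's output has the same type as the mapper's output and partial aggregation stays associative. A reducer that directly averaged values could not be reused as a combiner, because a mean of means is not the overall mean.

```python
def map_fn(key, value):
    # Emit a (sum, count) pair for each observed integer.
    yield key, (value, 1)

def combine_fn(key, pairs):
    # Combiner and reducer can share this aggregation, because adding
    # (sum, count) pairs is associative and commutative.
    total, n = 0, 0
    for s, c in pairs:
        total, n = total + s, n + c
    yield key, (total, n)

def reduce_fn(key, pairs):
    total, n = 0, 0
    for s, c in pairs:
        total, n = total + s, n + c
    yield key, total / n        # only the final reducer computes the mean
```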


Algorithm Design: Running Example
¢ Term co-occurrence matrix for a text collection
l M = N x N matrix (N = vocabulary size)
l Mij: number of times i and j co-occur in some context
(for concreteness, let’s say context = sentence)
¢ Why?
l Distributional profiles as a way of measuring semantic distance
l Semantic distance useful for many language processing tasks
MapReduce: Large Counting Problems
¢ Term co-occurrence matrix for a text collection
= specific instance of a large counting problem
l A large event space (number of terms)
l A large number of observations (the collection itself)
l Goal: keep track of interesting statistics about the events
¢ Basic approach
l Mappers generate partial counts
l Reducers aggregate partial counts

How do we aggregate partial counts efficiently?


First Try: “Pairs”
¢ Each mapper takes a sentence:
l Generate all co-occurring term pairs
l For all pairs, emit (a, b) → count
¢ Reducers sum up counts associated with these pairs
¢ Use combiners!
Pairs: Pseudo-Code
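The pairs pseudo-code on this slide is an image; a rough Python-style sketch (treating every pair of tokens in the same sentence as a co-occurrence, tokenization details glossed over):

```python
from itertools import permutations

def map_fn(docid, sentence):
    # Emit a count of 1 for every ordered pair of co-occurring terms.
    terms = sentence.split()
    for a, b in permutations(terms, 2):
        yield (a, b), 1

def reduce_fn(pair, counts):
    # Sum partial counts for each (a, b) pair; the same function works as a combiner.
    yield pair, sum(counts)
```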
“Pairs” Analysis
¢ Advantages
l Easy to implement, easy to understand
¢ Disadvantages
l Lots of pairs to sort and shuffle around (upper bound?)
l Not many opportunities for combiners to work
Another Try: “Stripes”
¢ Idea: group together pairs into an associative array

(a, b) → 1
(a, c) → 2
(a, d) → 5 a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
(a, e) → 3
(a, f) → 2

¢ Each mapper takes a sentence:


l Generate all co-occurring term pairs
l For each term, emit a → { b: countb, c: countc, d: countd … }
¢ Reducers perform element-wise sum of associative arrays
a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
Stripes: Pseudo-Code
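Likewise, a sketch of the stripes approach, emitting one associative array per term per sentence:

```python
from collections import defaultdict

def map_fn(docid, sentence):
    # For each term, build an associative array of its co-occurrence counts
    # within the sentence, then emit the whole stripe at once.
    terms = sentence.split()
    for i, a in enumerate(terms):
        stripe = defaultdict(int)
        for j, b in enumerate(terms):
            if i != j:
                stripe[b] += 1
        yield a, dict(stripe)

def reduce_fn(term, stripes):
    # Element-wise sum of stripes; also usable as a combiner.
    total = defaultdict(int)
    for stripe in stripes:
        for b, count in stripe.items():
            total[b] += count
    yield term, dict(total)
```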
“Stripes” Analysis
¢ Advantages
l Far less sorting and shuffling of key-value pairs
l Can make better use of combiners
¢ Disadvantages
l More difficult to implement
l Underlying object more heavyweight
l Fundamental limitation in terms of size of event space
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Relative Frequencies
¢ How do we estimate relative frequencies from counts?

f(B | A) = count(A, B) / count(A) = count(A, B) / Σ_{B'} count(A, B')

¢ Why do we want to do this?


¢ How do we do this with MapReduce?
f(B|A): “Stripes”

a → {b1:3, b2 :12, b3 :7, b4 :1, … }

¢ Easy!
l One pass to compute (a, *)
l Another pass to directly compute f(B|A)
f(B|A): “Pairs”

(a, *) → 32 Reducer holds this value in memory

(a, b1) → 3 (a, b1) → 3 / 32


(a, b2) → 12 (a, b2) → 12 / 32
(a, b3) → 7 (a, b3) → 7 / 32
(a, b4) → 1 (a, b4) → 1 / 32
… …

¢ For this to work:


l Must emit extra (a, *) for every bn in mapper
l Must make sure all a’s get sent to same reducer (use partitioner)
l Must make sure (a, *) comes first (define sort order)
l Must hold state in reducer across different key-value pairs
“Order Inversion”
¢ Common design pattern
l Computing relative frequencies requires marginal counts
l But marginal cannot be computed until you see all counts
l Buffering is a bad idea!
l Trick: getting the marginal counts to arrive at the reducer before
the joint counts
¢ Optimizations
l Apply in-memory combining pattern to accumulate marginal counts
l Should we apply combiners?
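A sketch of the reducer side of this pattern for f(B|A) with pairs, assuming the partitioner routes every key by its left element and the sort order delivers the special marginal key (a, '*') before the joint keys, exactly the requirements listed on the previous slide (names are illustrative):

```python
class RelativeFrequencyReducer:
    """Sketch of the order-inversion pattern for computing f(B|A) with pairs.

    Assumes: (1) the partitioner routes every key (a, b) by the left element a,
    and (2) the sort order places the special marginal key (a, '*') before all
    joint keys (a, b).  Method and key names are illustrative, not a real API.
    """

    def __init__(self):
        self.marginal = 0                     # count(A) for the current left term

    def reduce(self, key, counts, emit):
        a, b = key
        if b == '*':
            self.marginal = sum(counts)       # marginal arrives first: remember count(A)
        else:
            emit((a, b), sum(counts) / self.marginal)   # then emit count(A,B)/count(A)
```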
Synchronization: Pairs vs. Stripes
¢ Approach 1: turn synchronization into an ordering problem
l Sort keys into correct order of computation
l Partition key space so that each reducer gets the appropriate set
of partial results
l Hold state in reducer across multiple key-value pairs to perform
computation
l Illustrated by the “pairs” approach
¢ Approach 2: construct data structures that bring partial
results together
l Each reducer receives all the data it needs to complete the
computation
l Illustrated by the “stripes” approach
Secondary Sorting
¢ MapReduce sorts input to reducers by key
l Values may be arbitrarily ordered
¢ What if we want to sort the value also?
l E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
Secondary Sorting: Solutions
¢ Solution 1:
l Buffer values in memory, then sort
l Why is this a bad idea?
¢ Solution 2:
l “Value-to-key conversion” design pattern: form composite
intermediate key, (k, v1)
l Let execution framework do the sorting
l Preserve state across multiple key-value pairs to handle
processing
l Anything else we need to do?
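A sketch of value-to-key conversion, assuming the framework sorts composite keys lexicographically and that we supply a partitioner on the original key only (function names are illustrative):

```python
def map_fn(k, record):
    # Value-to-key conversion: move the value we want sorted into the key.
    v, r = record
    yield (k, v), r              # framework sorts composite keys (k, v)

def partition_fn(composite_key, num_reducers):
    # Partition on the original key only, so all (k, *) pairs go to the same
    # reducer and arrive there in sorted order of v.
    k, _ = composite_key
    return hash(k) % num_reducers
```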
Recap: Tools for Synchronization
¢ Cleverly-constructed data structures
l Bring data together
¢ Sort order of intermediate keys
l Control order in which reducers process keys
¢ Partitioner
l Control which reducer processes which keys
¢ Preserving state in mappers and reducers
l Capture dependencies across multiple keys and values
Issues and Tradeoffs
¢ Number of key-value pairs
l Object creation overhead
l Time for sorting and shuffling pairs across the network
¢ Size of each key-value pair
l De/serialization overhead
¢ Local aggregation
l Opportunities to perform local aggregation vary
l Combiners make a big difference
l Combiners vs. in-mapper combining
l RAM vs. disk vs. network
Debugging at Scale
¢ Works on small datasets, won’t scale… why?
l Memory management issues (buffering and object creation)
l Too much intermediate data
l Mangled input records
¢ Real-world data is messy!
l Word count: how many unique words in Wikipedia?
l There’s no such thing as “consistent data”
l Watch out for corner cases
l Isolate unexpected behavior, bring it local for debugging
First, nomenclature…
¢ Information retrieval (IR)
l Focus on textual information (= text/document retrieval)
l Other possibilities include image, video, music, …
¢ What do we search?
l Generically, “collections”
l Less-frequently used, “corpora”
¢ What do we find?
l Generically, “documents”
l Even though we may be referring to web pages, PDFs,
PowerPoint slides, paragraphs, etc.
Information Retrieval Cycle
[Figure: the information retrieval cycle: source selection (resource), query formulation (query), search (results), selection (documents), examination (information), delivery. Iterating through the cycle supports system discovery, vocabulary discovery, concept discovery, and document discovery.]
The Central Problem in Search
[Figure: an author and a searcher each start from concepts. The searcher expresses them as query terms (“tragic love story”); the author expresses them as document terms (“fateful star-crossed romance”).]
Do these represent the same concepts?


Abstract IR Architecture

[Figure: queries and documents each pass through a representation function (queries online, documents offline) to produce a query representation and a document representation. A comparison function matches the query representation against the index of document representations to produce hits.]
How do we represent text?
¢ Remember: computers don’t “understand” anything!
¢ “Bag of words”
l Treat all the words in a document as index terms
l Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
l Disregard order, structure, meaning, etc. of the words
l Simple, yet effective!
¢ Assumptions
l Term occurrence is independent
l Document relevance is independent
l “Words” are well-defined
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。
這是他今年第二度因同樣的病因住院。 ‫ وﻗﺎل ﻣﺎرك رﯾﺠﯿﻒ‬- ‫اﻟﻨﺎطﻖ ﺑﺎﺳﻢ‬
‫اﻟﺨﺎرﺟﯿﺔ اﻹﺳﺮاﺋﯿﻠﯿﺔ‬ - ‫إن ﺷﺎرون ﻗﺒﻞ‬
‫اﻟﺪﻋﻮة وﺳﯿﻘﻮم ﻟﻠﻤﺮة اﻷوﻟﻰ ﺑﺰﯾﺎرة‬
‫اﻟﺘﻲ ﻛﺎﻧﺖ ﻟﻔﺘﺮة طﻮﯾﻠﺔ اﻟﻤﻘﺮ‬ ،‫ﺗﻮﻧﺲ‬
‫اﻟﺮﺳﻤﻲ ﻟﻤﻨﻈﻤﺔ اﻟﺘﺤﺮﯾﺮ اﻟﻔﻠﺴﻄﯿﻨﯿﺔ ﺑﻌﺪ ﺧﺮوﺟﮭﺎ ﻣﻦ ﻟﺒﻨﺎن ﻋﺎم‬ 1982.

Выступая в Мещанском суде Москвы экс-глава ЮКОСа


заявил не совершал ничего противозаконного, в чем
обвиняет его генпрокуратура России.

भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात
फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर
दिया है

日米連合で台頭中国に対処…アーミテージ前副長官提言

조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안


에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의
보도를 부인했다.
Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

“Bag of Words”
14 × McDonalds
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

Counting Words…

[Figure: Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → (syntax, semantics, word knowledge, etc.) → Inverted Index.]
Boolean Retrieval
¢ Users express queries as a Boolean expression
l AND, OR, NOT
l Can be arbitrarily nested
¢ Retrieval is based on the notion of sets
l Any given query divides the collection into two sets:
retrieved, not-retrieved
l Pure Boolean systems do not define an ordering of the results
Inverted Index: Boolean Retrieval
Doc 1: “one fish, two fish”   Doc 2: “red fish, blue fish”   Doc 3: “cat in the hat”   Doc 4: “green eggs and ham”

term    1  2  3  4    postings
blue       1          blue → 2
cat           1       cat → 3
egg              1    egg → 4
fish    1  1          fish → 1, 2
green            1    green → 4
ham              1    ham → 4
hat           1       hat → 3
one     1             one → 1
red        1          red → 2
two     1             two → 1
Boolean Retrieval
¢ To execute a Boolean query:
l Build the query syntax tree
   ( blue AND fish ) OR ham   ⇒   OR( ham, AND( blue, fish ) )
l For each clause, look up postings
   blue → 2
   fish → 1, 2
l Traverse postings and apply the Boolean operator
¢ Efficiency analysis
l Postings traversal is linear (assuming sorted postings)
l Start with the shortest postings list first
Strengths and Weaknesses
¢ Strengths
l Precise, if you know the right strategies
l Precise, if you have an idea of what you’re looking for
l Implementations are fast and efficient
¢ Weaknesses
l Users must learn Boolean logic
l Boolean logic insufficient to capture the richness of language
l No control over size of result set: either too many hits or none
l When do you stop reading? All documents in the result set are
considered “equally good”
l What about partial matches? Documents that “don’t quite match”
the query may be useful also
Ranked Retrieval
¢ Order documents by how likely they are to be relevant to
the information need
l Estimate relevance(q, di)
l Sort documents by relevance
l Display sorted results
¢ User model
l Present hits one screen at a time, best results first
l At any point, users can decide to stop looking
¢ How do we estimate relevance?
l Assume document is relevant if it has a lot of query terms
l Replace relevance(q, di) with sim(q, di)
l Compute similarity of vector representations
Vector Space Model
[Figure: documents d1-d5 as vectors in a space spanned by terms t1, t2, t3; the angle θ between two vectors measures how similar they are.]

Assumption: Documents that are “close together” in vector space “talk about” the same things.
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
Similarity Metric
¢ Use “angle” between the vectors:
cos(θ) = (d_j · d_k) / ( |d_j| |d_k| )

sim(d_j, d_k) = (d_j · d_k) / ( |d_j| |d_k| )
             = Σ_{i=1..n} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..n} w_{i,j}²) · sqrt(Σ_{i=1..n} w_{i,k}²) )

¢ Or, more generally, inner products:

sim(d_j, d_k) = d_j · d_k = Σ_{i=1..n} w_{i,j} w_{i,k}
Term Weighting
¢ Term weights consist of two components
l Local: how important is the term in this document?
l Global: how important is the term in the collection?
¢ Here’s the intuition:
l Terms that appear often in a document should get high weights
l Terms that appear in many documents should get low weights
¢ How do we capture this mathematically?
l Term frequency (local)
l Inverse document frequency (global)
TF.IDF Term Weighting

w_{i,j} = tf_{i,j} × log( N / n_i )

w_{i,j}   weight assigned to term i in document j
tf_{i,j}  number of occurrences of term i in document j
N         number of documents in the entire collection
n_i       number of documents with term i


Inverted Index: TF.IDF
Doc 1: “one fish, two fish”   Doc 2: “red fish, blue fish”   Doc 3: “cat in the hat”   Doc 4: “green eggs and ham”

         tf per doc
term    1  2  3  4   df    postings (docno, tf)
blue       1         1     blue → (2, 1)
cat           1      1     cat → (3, 1)
egg              1   1     egg → (4, 1)
fish    2  2         2     fish → (1, 2), (2, 2)
green            1   1     green → (4, 1)
ham              1   1     ham → (4, 1)
hat           1      1     hat → (3, 1)
one     1            1     one → (1, 1)
red        1         1     red → (2, 1)
two     1            1     two → (1, 1)
Positional Indexes
¢ Store term position in postings
¢ Supports richer queries (e.g., proximity)
¢ Naturally, leads to larger indexes…
Inverted Index: Positional Information
Doc 1: “one fish, two fish”   Doc 2: “red fish, blue fish”   Doc 3: “cat in the hat”   Doc 4: “green eggs and ham”

         tf per doc
term    1  2  3  4   df    postings (docno, tf, [positions])
blue       1         1     blue → (2, 1, [3])
cat           1      1     cat → (3, 1, [1])
egg              1   1     egg → (4, 1, [2])
fish    2  2         2     fish → (1, 2, [2,4]), (2, 2, [2,4])
green            1   1     green → (4, 1, [1])
ham              1   1     ham → (4, 1, [3])
hat           1      1     hat → (3, 1, [2])
one     1            1     one → (1, 1, [1])
red        1         1     red → (2, 1, [1])
two     1            1     two → (1, 1, [3])

Retrieval in a Nutshell
¢ Look up postings lists corresponding to query terms
¢ Traverse postings for each query term
¢ Store partial query-document scores in accumulators
¢ Select top k results to return
Retrieval: Document-at-a-Time
¢ Evaluate documents one at a time (score all query terms)
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

Accumulators (e.g., a priority queue): is the document score in the top k?
Yes: insert the document score, extract-min if the queue grows too large
No: do nothing

¢ Tradeoffs
l Small memory footprint (good)
l Must read through all postings (bad), but skipping possible
l More disk seeks (bad), but reading in blocks possible
Retrieval: Query-At-A-Time

¢ Evaluate documents one query term at a time


l Usually, starting from most rare term (often with tf-sorted postings)
blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

Score_{q=x}(doc n) = s, held in accumulators (e.g., a hash table)

¢ Tradeoffs
l Early termination heuristics (good)
l Large memory footprint (bad), but filtering heuristics possible
MapReduce it?
¢ The indexing problem
l Scalability is critical
l Must be relatively fast, but need not be real time
l Fundamentally a batch operation
l Incremental updates may or may not be important
l For the web, crawling is a challenge in itself
¢ The retrieval problem
l Must have sub-second response time
l For the web, only need relatively few results
Indexing: Performance Analysis
¢ Fundamentally, a large sorting problem
l Terms usually fit in memory
l Postings usually don’t
¢ How is it done on a single machine?
¢ How can it be done with MapReduce?
¢ First, let’s characterize the problem size:
l Size of vocabulary
l Size of postings
Vocabulary Size: Heaps’ Law

M = k T^b        M is vocabulary size
                 T is collection size (number of tokens)
                 k and b are constants

Typically, k is between 30 and 100, b is between 0.4 and 0.6

¢ Heaps’ Law: linear in log-log space


¢ Vocabulary size grows unbounded!
Heaps’ Law for RCV1

k = 44
b = 0.49

First 1,000,020 terms:


Predicted = 38,323
Actual = 38,365

Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996-August 19, 1997)

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)


Postings Size: Zipf’s Law

cf_i = c / i        cf_i is the collection frequency of the i-th most common term
                    c is a constant

¢ Zipf’s Law: (also) linear in log-log space


l Specific case of Power Law distributions
¢ In other words:
l A few elements occur very frequently
l Many elements occur very infrequently
Zipf’s Law for RCV1

Fit isn’t that good…


but good enough!

Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996-August 19, 1997)

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)


Figure from: Newman, M. E. J. (2005) “Power laws, Pareto
distributions and Zipf's law.” Contemporary Physics 46:323–351.
MapReduce: Index Construction
¢ Map over all documents
l Emit term as key, (docno, tf) as value
l Emit other information as necessary (e.g., term position)
¢ Sort/shuffle: group postings by term
¢ Reduce
l Gather and sort the postings (e.g., by docno or tf)
l Write postings to disk
¢ MapReduce does all the heavy lifting!
Inverted Indexing with MapReduce
Doc 1: “one fish, two fish”   Doc 2: “red fish, blue fish”   Doc 3: “cat in the hat”

Map:
one → (1, 1)     red → (2, 1)     cat → (3, 1)
two → (1, 1)     blue → (2, 1)    hat → (3, 1)
fish → (1, 2)    fish → (2, 2)

Shuffle and Sort: aggregate values by keys

Reduce:
blue → (2, 1)
cat → (3, 1)
fish → (1, 2), (2, 2)
hat → (3, 1)
one → (1, 1)
two → (1, 1)
red → (2, 1)
Inverted Indexing: Pseudo-Code
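The indexing pseudo-code on this slide is an image in the original deck; a compact Python-style sketch of the baseline algorithm (docno and term frequency per posting, positions omitted):

```python
from collections import Counter

def map_fn(docno, doc):
    # Emit one (term, (docno, tf)) posting per distinct term in the document.
    for term, tf in Counter(doc.split()).items():
        yield term, (docno, tf)

def reduce_fn(term, postings):
    # Buffer this term's postings, sort them by docno, and write the complete
    # postings list.  (Buffering everything in memory is the scalability
    # bottleneck discussed a few slides later.)
    yield term, sorted(postings)
```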
Positional Indexes
Doc 1: “one fish, two fish”   Doc 2: “red fish, blue fish”   Doc 3: “cat in the hat”

Map:
one → (1, 1, [1])      red → (2, 1, [1])      cat → (3, 1, [1])
two → (1, 1, [3])      blue → (2, 1, [3])     hat → (3, 1, [2])
fish → (1, 2, [2,4])   fish → (2, 2, [2,4])

Shuffle and Sort: aggregate values by keys

Reduce:
blue → (2, 1, [3])
cat → (3, 1, [1])
fish → (1, 2, [2,4]), (2, 2, [2,4])
hat → (3, 1, [2])
one → (1, 1, [1])
two → (1, 1, [3])
red → (2, 1, [1])
Inverted Indexing: Pseudo-Code
Scalability Bottleneck
¢ Initial implementation: terms as keys, postings as values
l Reducers must buffer all postings associated with key (to sort)
l What if we run out of memory to buffer postings?
¢ Uh oh!
Another Try…
Before (term as key, postings as values):
fish → (1, 2, [2,4]), (34, 1, [23]), (21, 3, [1,8,22]), (35, 2, [8,41]), (80, 3, [2,9,76]), (9, 1, [9]), …

After ((term, docno) as composite key, positions as value):
(fish, 1) → [2,4]
(fish, 9) → [9]
(fish, 21) → [1,8,22]
(fish, 34) → [23]
(fish, 35) → [8,41]
(fish, 80) → [2,9,76]

How is this different?


• Let the framework do the sorting
• Term frequency implicitly stored
• Directly write postings to disk!

Where have we seen this before?


Postings Encoding
Conceptually:

fish 1 2 9 1 21 3 34 1 35 2 80 3 …

In Practice:
• Don’t encode docnos, encode gaps (or d-gaps)
• But it’s not obvious that this will save space…

fish 1 2 8 1 12 3 13 1 1 2 45 3 …
Overview of Index Compression
¢ Byte-aligned vs. bit-aligned
¢ Non-parameterized bit-aligned
l Unary codes
l γ codes
l δ codes
¢ Parameterized bit-aligned
l Golomb codes (local Bernoulli model)

Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
Unary Codes
¢ x ≥ 1 is coded as x-1 one bits, followed by 1 zero bit
l 3 = 110
l 4 = 1110
¢ Great for small numbers… horrible for large numbers
l Overly-biased for very small gaps

Watch out! Slightly different definitions in different textbooks


γ codes
¢ x ≥ 1 is coded in two parts: length and offset
l Start with the binary encoding, remove the highest-order bit = offset
l Length is the number of binary digits, encoded in unary code
l Concatenate length + offset codes
¢ Example: 9 in binary is 1001
l Offset = 001
l Length = 4, in unary code = 1110
l γ code = 1110:001
¢ Analysis
l Offset = ⌊log x⌋ bits
l Length = ⌊log x⌋ + 1 bits
l Total = 2⌊log x⌋ + 1 bits
δ codes
¢ Similar to γ codes, except that the length is encoded in γ code
¢ Example: 9 in binary is 1001
l Offset = 001
l Length = 4, in γ code = 11000
l δ code = 11000:001
¢ γ codes = more compact for smaller numbers
δ codes = more compact for larger numbers
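As a small sketch following the slide's conventions (unary: x−1 one bits then a zero; γ: length in unary followed by the offset; δ: length in γ code followed by the offset; the ':' in the examples is only a visual separator):

```python
def unary(x):
    # x >= 1: (x - 1) one bits followed by a single zero bit.
    return "1" * (x - 1) + "0"

def gamma(x):
    # x >= 1: length of the binary representation in unary, then the offset
    # (the binary representation with its highest-order bit removed).
    bits = bin(x)[2:]                    # e.g. 9 -> "1001"
    return unary(len(bits)) + bits[1:]   # "1110" + "001"

def delta(x):
    # Like gamma, but the length is itself gamma-coded.
    bits = bin(x)[2:]
    return gamma(len(bits)) + bits[1:]   # 9 -> "11000" + "001"

assert gamma(9) == "1110001"   # slide example: 1110:001
assert delta(9) == "11000001"  # slide example: 11000:001
```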
Golomb Codes
¢ x ≥ 1, parameter b:
l q + 1 in unary, where q = ⌊(x − 1) / b⌋
l r in binary, where r = x − qb − 1, in ⌊log b⌋ or ⌈log b⌉ bits
¢ Example:
l b = 3, r = 0, 1, 2 (0, 10, 11)
l b = 6, r = 0, 1, 2, 3, 4, 5 (00, 01, 100, 101, 110, 111)
l x = 9, b = 3: q = 2, r = 2, code = 110:11
l x = 9, b = 6: q = 1, r = 2, code = 10:100
¢ Optimal b ≈ 0.69 (N/df)
l Different b for every term!
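And a sketch of the Golomb encoder that reproduces the b = 3 and b = 6 examples above, assuming b ≥ 2 and using truncated binary for the remainder (which is how the ⌊log b⌋-or-⌈log b⌉-bit rule is realized):

```python
import math

def golomb(x, b):
    # x >= 1, parameter b: quotient q + 1 in unary, remainder r in truncated binary.
    q, r = (x - 1) // b, (x - 1) % b
    code = "1" * q + "0"                    # q + 1 in unary (slide convention)
    k = math.ceil(math.log2(b))
    threshold = (1 << k) - b
    if r < threshold:
        code += format(r, "b").zfill(k - 1) if k > 1 else ""   # floor(log b) bits
    else:
        code += format(r + threshold, "b").zfill(k)            # ceil(log b) bits
    return code

assert golomb(9, 3) == "11011"   # slide example: 110:11
assert golomb(9, 6) == "10100"   # slide example: 10:100
```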
Comparison of Coding Schemes

x    Unary        γ          δ           Golomb b=3   Golomb b=6
1    0            0          0           0:0          0:00
2    10           10:0       100:0       0:10         0:01
3    110          10:1       100:1       0:11         0:100
4    1110         110:00     101:00      10:0         0:101
5    11110        110:01     101:01      10:10        0:110
6    111110       110:10     101:10      10:11        0:111
7    1111110      110:11     101:11      110:0        10:00
8    11111110     1110:000   11000:000   110:10       10:01
9    111111110    1110:001   11000:001   110:11       10:100
10   1111111110   1110:010   11000:010   1110:0       10:101

Witten, Moffat, Bell, Managing Gigabytes (1999)


Index Compression: Performance

Comparison of Index Size (bits per pointer)

          Bible    TREC
Unary     262      1918
Binary    15       20
γ         6.51     6.63
δ         6.23     6.38
Golomb    6.09     5.84    ← recommended best practice

Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2070 MB)

Witten, Moffat, Bell, Managing Gigabytes (1999)


Chicken and Egg?

(key) (value)

fish 1 [2,4]
But wait! How do we set the
fish 9 [9]
Golomb parameter b?
fish 21 [1,8,22]
Recall: optimal b ≈ 0.69 (N/df)
fish 34 [23]
We need the df to set b…
fish 35 [8,41] But we don’t know the df until we’ve
seen all postings!
fish 80 [2,9,76]

Write directly to disk

Sound familiar?
Getting the df
¢ In the mapper:
l Emit “special” key-value pairs to keep track of df
¢ In the reducer:
l Make sure “special” key-value pairs come first: process them to
determine df
¢ Remember: proper partitioning!
Getting the df: Modified Mapper
Doc 1
one fish, two fish Input document…

(key) (value)

fish 1 [2,4] Emit normal key-value pairs…

one 1 [1]

two 1 [3]

fish * [1] Emit “special” key-value pairs to keep track of df…

one * [1]

two * [1]
Getting the df: Modified Reducer
(key) (value)

fish * [63] [82] [27] …   First, compute the df by summing contributions
from all “special” key-value pairs…

Compute Golomb parameter b…


fish 1 [2,4]

fish 9 [9]

fish 21 [1,8,22] Important: properly define sort order to


make sure “special” key-value pairs come first!
fish 34 [23]

fish 35 [8,41]

fish 80 [2,9,76]

… Write postings directly to disk

Where have we seen this before?


MapReduce it?
¢ The indexing problem Just covered
l Scalability is paramount
l Must be relatively fast, but need not be real time
l Fundamentally a batch operation
l Incremental updates may or may not be important
l For the web, crawling is a challenge in itself
¢ The retrieval problem Now
l Must have sub-second response time
l For the web, only need relatively few results
Retrieval with MapReduce?
¢ MapReduce is fundamentally batch-oriented
l Optimized for throughput, not latency
l Startup of mappers and reducers is expensive
¢ MapReduce is not suitable for real-time queries!
l Use separate infrastructure for retrieval…
Important Ideas
¢ Partitioning (for scalability)
¢ Replication (for redundancy)
¢ Caching (for speed)
¢ Routing (for load balancing)

The rest is just details!


Term vs. Document Partitioning
[Figure: the term-document index (terms T × documents D) can be split two ways: term partitioning gives each server a slice of the terms (T1, T2, T3, …) over all documents, while document partitioning gives each server all terms over a slice of the documents (D1, D2, D3, …).]
Katta Architecture
(Distributed Lucene)

http://katta.sourceforge.net/
Graph Algorithms in MapReduce
Some Graph Problems
¢ Finding shortest paths
l Routing Internet traffic and UPS trucks
¢ Finding minimum spanning trees
l Telco laying down fiber
¢ Finding Max Flow
l Airline scheduling
¢ Identify “special” nodes and communities
l Breaking up terrorist cells, spread of avian flu
¢ Bipartite matching
l Monster.com, Match.com
¢ And of course... PageRank
Graphs and MapReduce
¢ Graph algorithms typically involve:
l Performing computations at each node: based on node
features, edge features, and local link structure
l Propagating computations: “traversing” the graph
¢ Key questions:
l How do you represent graph data in MapReduce?
l How do you traverse a graph in MapReduce?
Representing Graphs
¢ G = (V, E)
¢ Two common representations
l Adjacency matrix
l Adjacency list
Adjacency Matrices
Represent a graph as an n x n square matrix M
l n = |V|
l Mij = 1 means a link from node i to j

[Figure: a directed graph on nodes 1-4 and its adjacency matrix]

     1 2 3 4
 1   0 1 0 1
 2   1 0 1 1
 3   1 0 0 0
 4   1 0 1 0
Adjacency Matrices: Critique
¢ Advantages:
l Amenable to mathematical manipulation
l Iteration over rows and columns corresponds to
computations on outlinks and inlinks
¢ Disadvantages:
l Lots of zeros for sparse matrices
l Lots of wasted space
Adjacency Lists
Take adjacency matrices… and throw away all the zeros

1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
Adjacency Lists: Critique
¢ Advantages:
l Much more compact representation
l Easy to compute over outlinks
¢ Disadvantages:
l Much more difficult to compute over inlinks
Single Source Shortest Path
¢ Problem: find shortest path from a source node to one
or more target nodes
l Shortest might also mean lowest weight or cost
¢ First, a refresher: Dijkstra’s Algorithm
Dijkstra’s Algorithm Example

[Figure: six snapshots of Dijkstra's algorithm on a five-node weighted directed graph (example from CLR). The source starts with distance 0 and every other node with distance ∞; as nodes are settled, the tentative distances tighten from (10, ∞, 5, ∞) to (8, 14, 5, 7), then (8, 13, 5, 7), and finally (8, 9, 5, 7).]

Single Source Shortest Path
¢ Problem: find shortest path from a source node to one
or more target nodes
l Shortest might also mean lowest weight or cost
¢ First, a refresher: Dijkstra’s Algorithm
Finding the Shortest Path
¢ Consider simple case of equal edge weights
¢ Solution to the problem can be defined inductively
¢ Here’s the intuition:
l Define: b is reachable from a if b is on adjacency list of a
l DistanceTo(s) = 0
l For all nodes p reachable from s,
DistanceTo(p) = 1
l For all nodes n reachable from some other set of nodes M,
DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ M)

[Figure: node n is reached from nodes m1, m2, m3 ∈ M via edges d1, d2, d3, and the mi are in turn reachable from the source s.]
Visualizing Parallel BFS

[Figure: a ten-node graph (n0-n9) used to visualize how parallel BFS expands the search frontier outward from the source, one hop per iteration.]
From Intuition to Algorithm
¢ Data representation:
l Key: node n
l Value: d (distance from start), adjacency list (list of nodes
reachable from n)
l Initialization: for all nodes except for the start node, d = ∞
¢ Mapper:
l ∀m ∈ adjacency list: emit (m, d + 1)
¢ Sort/Shuffle
l Groups distances by reachable nodes
¢ Reducer:
l Selects minimum distance path for each reachable node
l Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
¢ Each MapReduce iteration advances the “known
frontier” by one hop
l Subsequent iterations include more and more reachable
nodes as frontier expands
l Multiple iterations are needed to explore entire graph
¢ Preserving graph structure:
l Problem: Where did the adjacency list go?
l Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
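The BFS pseudo-code on this slide is an image; below is a sketch of one iteration, assuming an external driver re-runs the job until no distance improves, and using a tag to distinguish graph structure from distance messages (names are illustrative):

```python
INF = float("inf")

def map_fn(node_id, node):
    # node = (distance_from_source, adjacency_list)
    d, adjacency = node
    # Pass the node itself along so the graph structure (and the node's
    # current distance) survives the iteration.
    yield node_id, ("NODE", (d, adjacency))
    if d != INF:
        for m in adjacency:
            yield m, ("DIST", d + 1)      # unit edge weights; use d + w if weighted

def reduce_fn(node_id, values):
    best, adjacency = INF, []
    for tag, payload in values:
        if tag == "NODE":
            d, adjacency = payload
            best = min(best, d)
        else:
            best = min(best, payload)
    # Emit the updated node; the driver iterates until no distance changes.
    yield node_id, (best, adjacency)
```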
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Comparison to Dijkstra
¢ Dijkstra’s algorithm is more efficient
l At any step it only pursues edges from the minimum-cost
path inside the frontier
¢ MapReduce explores all paths in parallel
l Lots of “waste”
l Useful work is only done at the “frontier”
¢ Why can’t we do better using MapReduce?
Weighted Edges
¢ Now add positive weights to the edges
l Why can’t edge weights be negative?
¢ Simple change: adjacency list now includes a weight
w for each edge
l In mapper, emit (m, d + wp) instead of (m, d + 1) for each node
m
¢ That’s it?
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Additional Complexities

[Figure: a weighted graph with nodes n1-n9 and source s; most edges have weight 1 and one has weight 10, and the shaded region marks the current search frontier. Because a cheaper many-hop path may run outside the frontier, discovering a node for the first time no longer guarantees the shortest path, so more iterations are needed.]
Stopping Criterion
¢ How many iterations are needed in parallel BFS (equal
edge weight case)?
¢ Convince yourself: when a node is first “discovered”,
we’ve found the shortest path
¢ Now answer the question...
l Six degrees of separation?
¢ Practicalities of implementation in MapReduce
Graphs and MapReduce
¢ Graph algorithms typically involve:
l Performing computations at each node: based on node
features, edge features, and local link structure
l Propagating computations: “traversing” the graph
¢ Key questions:
l How do you represent graph data in MapReduce?
l How do you traverse a graph in MapReduce?
Random Walks Over the Web
¢ Random surfer model:
l User starts at a random Web page
l User randomly clicks on links, surfing from page to page
¢ PageRank
l Characterizes the amount of time spent on any given page
l Mathematically, a probability distribution over pages
¢ PageRank captures notions of page importance
l Correspondence to human intuition?
l One of thousands of features used in web search
l Note: query-independent
PageRank: Defined
Given page x with inlinks t1…tn, where
l C(t) is the out-degree of t
l α is the probability of a random jump
l N is the total number of nodes in the graph

PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)

[Figure: pages t1, t2, …, tn each link to page x.]
Computing PageRank
¢ Properties of PageRank
l Can be computed iteratively
l Effects at each iteration are local
¢ Sketch of algorithm:
l Start with seed PRi values
l Each page distributes PRi “credit” to all pages it links to
l Each target page adds up “credit” from multiple in-bound
links to compute PRi+1
l Iterate until values converge
Simplified PageRank
¢ First, tackle the simple case:
l No random jump factor
l No dangling links
¢ Then, factor in these complexities…
l Why do we need the random jump?
l Where do dangling links come from?
Sample PageRank Iteration (1)

[Figure: Iteration 1: every node starts with PageRank 0.2 and distributes it evenly over its out-links; after the update, n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3.]
Sample PageRank Iteration (2)

[Figure: Iteration 2: repeating the update gives n1 = 0.1, n2 = 0.133, n3 = 0.183, n4 = 0.2, n5 = 0.383.]
PageRank in MapReduce

[Figure: with adjacency lists n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3], the map phase sends each node's PageRank mass to its out-neighbors, the shuffle groups contributions by destination node, and the reduce phase sums the contributions and re-emits each node with its adjacency list.]


PageRank Pseudo-Code
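The PageRank pseudo-code is likewise an image in the original deck; a sketch of one iteration of the simplified version (no random jump and no dangling-node handling, which the next slide patches up), again with an external driver iterating to convergence:

```python
def map_fn(node_id, node):
    # node = (pagerank, adjacency_list)
    p, adjacency = node
    yield node_id, ("NODE", adjacency)            # preserve graph structure
    if adjacency:
        share = p / len(adjacency)
        for m in adjacency:
            yield m, ("MASS", share)              # distribute PageRank credit to out-links

def reduce_fn(node_id, values):
    total, adjacency = 0.0, []
    for tag, payload in values:
        if tag == "NODE":
            adjacency = payload
        else:
            total += payload                      # sum incoming PageRank mass
    yield node_id, (total, adjacency)
```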
Complete PageRank
¢ Two additional complexities
l What is the proper treatment of dangling nodes?
l How do we factor in the random jump factor?
¢ Solution:
l Second pass to redistribute “missing PageRank mass” and
account for random jumps

p' = α (1/|G|) + (1 − α) ( m/|G| + p )
l p is PageRank value from before, p' is updated PageRank
value
l |G| is the number of nodes in the graph
l m is the missing PageRank mass
PageRank Convergence
¢ Alternative convergence criteria
l Iterate until PageRank values don’t change
l Iterate until PageRank rankings don’t change
l Fixed number of iterations
¢ Convergence for web graphs?
Beyond PageRank
¢ Link structure is important for web search
l PageRank is one of many link-based features: HITS, SALSA,
etc.
l One of many thousands of features used in ranking…
¢ Adversarial nature of web search
l Link spamming
l Spider traps
l Keyword stuffing
l …
Efficient Graph Algorithms
¢ Sparse vs. dense graphs
¢ Graph topologies
Local Aggregation
¢ Use combiners!
l In-mapper combining design pattern also applicable
¢ Maximize opportunities for local aggregation
l Simple tricks: sorting the dataset in specific ways
Relational Processing
on MapReduce

Content obtained from many sources, notably: Jerome Simeon (IBM) and Jimmy Lin's course on MapReduce.
•Our Plan Today
1. Recap:
– Key relational DBMS notes
– Key Hadoop notes
2. Relational Algorithms on MapReduce
– How to do a select, groupby, join etc
3. Queries on MapReduce: Hive and Pig
Big Data Analysis
¢Peta-scale datasets are everywhere:
l Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
l eBay has 6.5 PB of user data + 50 TB/day (5/2009)
l …
¢A lot of these datasets have some structure
l Query logs
l Point-of-sale records
l User data (e.g., demographics)
l …
¢How do we perform data analysis at scale?
l Relational databases and SQL
l MapReduce (Hadoop)
Relational Databases vs. MapReduce
¢Relational databases:
l Multipurpose: analysis and transactions; batch and interactive
l Data integrity via ACID transactions
l Lots of tools in software ecosystem (for ingesting, reporting, etc.)
l Supports SQL (and SQL integration, e.g., JDBC)
l Automatic SQL query optimization
¢MapReduce (Hadoop):
l Designed for large clusters, fault tolerant
l Data is accessed in “native format”
l Supports many query languages
l Programmers retain control over performance
l Open source

Source: O’Reilly Blog post by Joseph Hellerstein (11/19/2008)


Database Workloads
¢OLTP (online transaction processing)
l Typical applications: e-commerce, banking, airline reservations
l User facing: real-time, low latency, highly-concurrent
l Tasks: relatively small set of “standard” transactional queries
l Data access pattern: random reads, updates, writes (involving
relatively small amounts of data)
¢OLAP (online analytical processing)
l Typical applications: business intelligence, data mining
l Back-end processing: batch workloads, less concurrency
l Tasks: complex analytical queries, often ad hoc
l Data access pattern: table scans, large amounts of data involved
per query
One Database or Two?
¢Downsides of co-existing OLTP and OLAP workloads
l Poor memory management
l Conflicting data access patterns
l Variable latency
¢Solution: separate databases
l User-facing OLTP database for high-volume transactions
l Data warehouse for OLAP workloads
l How do we connect the two?
OLTP/OLAP Architecture

ETL
(Extract, Transform, and Load)

OLTP OLAP
OLTP/OLAP Integration
¢OLTP database for user-facing transactions
l Retain records of all activity
l Periodic ETL (e.g., nightly)
¢Extract-Transform-Load (ETL)
l Extract records from source
l Transform: clean data, check integrity, aggregate, etc.
l Load into OLAP database
¢OLAP database for data warehousing
l Business intelligence: reporting, ad hoc queries, data mining, etc.
l Feedback to improve OLTP services
Business Intelligence
¢Premise: more data leads to better business decisions
l Periodic reporting as well as ad hoc queries
l Analysts, not programmers (importance of tools and dashboards)
¢Examples:
l Slicing-and-dicing activity by different dimensions to better
understand the marketplace
l Analyzing log data to improve OLTP experience
l Analyzing log data to better optimize ad placement
l Analyzing purchasing trends for better supply-chain management
l Mining for correlations between otherwise unrelated activities
OLTP/OLAP Architecture: Hadoop?

What about here?


ETL
(Extract, Transform, and Load)

OLTP OLAP

Hadoop here?
OLTP/OLAP/Hadoop Architecture

ETL
(Extract, Transform, and Load)

OLTP Hadoop OLAP

Why does this make sense?


ETL Bottleneck
¢Reporting is often a nightly task:
l ETL is often slow: why?
l What happens if processing 24 hours of data takes longer than 24
hours?
¢Hadoop is perfect:
l Most likely, you already have some data warehousing solution
l Ingest is limited by speed of HDFS
l Scales out with more nodes
l Massively parallel
l Ability to use any processing tool
l Much cheaper than parallel databases
l ETL is a batch process anyway!
MapReduce: Recap
¢Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v’*) → <k’, v’>*
l All values with the same key are reduced together
¢Optionally, also:
partition (k’, number of partitions) → partition for k’
l Often a simple hash of the key, e.g., hash(k’) mod n
l Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
l Mini-reducers that run in memory after the map phase
l Used as an optimization to reduce network traffic
¢The execution framework handles everything else…
k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6

map map map map

a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8

combine combine combine combine

a 1 b 2 c 9 a 5 c 2 b 7 c 8

partition partition partition partition

Shuffle and Sort: aggregate values by keys


a 1 5 b 2 7 c 2 9 8

reduce reduce reduce

r1 s1 r2 s2 r3 s3
“Everything Else”
¢The execution framework handles everything else…
l Scheduling: assigns workers to map and reduce tasks
l “Data distribution”: moves processes to data
l Synchronization: gathers, sorts, and shuffles intermediate data
l Errors and faults: detects worker failures and restarts
¢Limited control over data and execution flow
l All algorithms must be expressed in terms of m, r, c, p
¢You don’t know:
l Where mappers and reducers run
l When a mapper or reducer begins or finishes
l Which input a particular mapper is processing
l Which intermediate key a particular reducer is processing
MapReduce algorithms
for processing relational data
Design Pattern: Secondary Sorting
¢MapReduce sorts input to reducers by key
l Values are arbitrarily ordered
¢What if we want to sort the value also?
l E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…
Secondary Sorting: Solutions
¢Solution 1:
l Buffer values in memory, then sort
l Why is this a bad idea?
¢Solution 2:
l “Value-to-key conversion” design pattern: form composite
intermediate key, (k, v1)
l Let execution framework do the sorting
l Preserve state across multiple key-value pairs to handle
processing
l Anything else we need to do?
Value-to-Key Conversion

Before
k → (v1, r), (v4, r), (v8, r), (v3, r)…
Values arrive in arbitrary order…

After
(k, v1) → (v1, r) Values arrive in sorted order…
(k, v3) → (v3, r) Process by preserving state across multiple keys
Remember to partition correctly!
(k, v4) → (v4, r)
(k, v8) → (v8, r)

Working Scenario
¢Two tables:
l User demographics (gender, age, income, etc.)
l User page visits (URL, time spent, etc.)
¢Analyses we might want to perform:
l Statistics on demographic characteristics
l Statistics on page visits
l Statistics on page visits by URL
l Statistics on page visits by demographic characteristic
l …
Relational Algebra
¢Primitives
l Projection (π)
l Selection (σ)
l Cartesian product (×)
l Set union (∪)
l Set difference (−)
l Rename (ρ)
¢Other operations
l Join (⋈)
l Group by… aggregation
l …
Projection

R1 R1

R2 R2

R3 R3

R4 R4

R5 R5
Projection in MapReduce
¢Easy!
l Map over tuples, emit new tuples with appropriate attributes
l No reducers, unless for regrouping or resorting tuples
l Alternatively: perform in reducer, after some other processing
¢Basically limited by HDFS streaming speeds
l Speed of encoding/decoding tuples becomes important
l Relational databases take advantage of compression
l Semistructured data? No problem!
Selection

R1

R2
R1
R3
R3
R4

R5
Selection in MapReduce
¢Easy!
l Map over tuples, emit only tuples that meet criteria
l No reducers, unless for regrouping or resorting tuples
l Alternatively: perform in reducer, after some other processing
¢Basically limited by HDFS streaming speeds
l Speed of encoding/decoding tuples becomes important
l Relational databases take advantage of compression
l Semistructured data? No problem!
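Both projection (previous slide) and selection fit in a map-only job; a sketch with made-up attribute positions and predicate:

```python
def project_map(key, tuple_):
    # Projection: emit only the attributes of interest
    # (positions 0 and 2 here are purely illustrative).
    yield key, (tuple_[0], tuple_[2])

def select_map(key, tuple_):
    # Selection: emit the tuple only if it satisfies the predicate
    # (the predicate below is made up for illustration).
    if tuple_[1] > 10:
        yield key, tuple_
```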
Group by… Aggregation
¢Example: What is the average time spent per URL?
¢In SQL:
l SELECT url, AVG(time) FROM visits GROUP BY url
¢In MapReduce:
l Map over tuples, emit time, keyed by url
l Framework automatically groups values by keys
l Compute average in reducer
l Optimize with combiners
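A sketch of this average-time-per-URL job, assuming each input tuple carries (url, time_spent, …); emitting (sum, count) pairs keeps the combiner correct, for the same reason as in the earlier computing-the-mean example:

```python
def map_fn(key, visit):
    # visit is assumed to look like (url, time_spent, ...).
    url, time_spent = visit[0], visit[1]
    yield url, (time_spent, 1)

def combine_fn(url, pairs):
    # Pre-aggregate (sum, count) locally so the reducer sees fewer pairs.
    total, n = 0, 0
    for t, c in pairs:
        total, n = total + t, n + c
    yield url, (total, n)

def reduce_fn(url, pairs):
    total, n = 0, 0
    for t, c in pairs:
        total, n = total + t, n + c
    yield url, total / n       # AVG(time) per url
```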
Relational Joins

Source: Microsoft Office Clip Art


Relational Joins
R1 S1

R2 S2

R3 S3

R4 S4

R1 S2

R2 S4

R3 S1

R4 S3
Types of Relationships

Many-to-Many One-to-Many One-to-One


Join Algorithms in MapReduce
¢Reduce-side join
¢Map-side join
¢In-memory join
l Striped variant
l Memcached variant
Reduce-side Join
¢Basic idea: group by join key
l Map over both sets of tuples
l Emit tuple as value with join key as the intermediate key
l Execution framework brings together tuples sharing the same key
l Perform actual join in reducer
l Similar to a “sort-merge join” in database terminology
¢Two variants
l 1-to-1 joins
l 1-to-many and many-to-many joins
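A sketch of a reduce-side join over the working-scenario tables, tagging each tuple with its relation of origin; this general version buffers both sides in the reducer, which is exactly the cost that value-to-key conversion (next slides) avoids:

```python
def map_users(key, user):
    # user is assumed to look like (user_id, demographics...).
    yield user[0], ("R", user)        # tag with the relation of origin

def map_visits(key, visit):
    # visit is assumed to look like (user_id, url, time_spent, ...).
    yield visit[0], ("S", visit)

def reduce_fn(join_key, tagged_tuples):
    # General version: buffer both sides, then emit the cross product.
    r_side, s_side = [], []
    for tag, t in tagged_tuples:
        (r_side if tag == "R" else s_side).append(t)
    for r in r_side:
        for s in s_side:
            yield join_key, (r, s)
```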
Reduce-side Join: 1-to-1
Map
keys values

R1 R1

R4 R4

S2 S2

S3 S3

Reduce
keys values
R1 S2

S3 R4

Note: no guarantee whether the R tuple or the S tuple comes first


Reduce-side Join: 1-to-many
Map
keys values

R1 R1

S2 S2

S3 S3

S9 S9

Reduce
keys values
R1 S2 S3 …

What’s the problem?


Reduce-side Join: V-to-K Conversion
In reducer…
keys values
R1 New key encountered: hold in memory
Cross with records from other set
S2

S3

S9

R4 New key encountered: hold in memory


Cross with records from other set
S3

S7
Reduce-side Join: many-to-many
In reducer…
keys values
R1

R5 Hold in memory

R8

S2 Cross with records from other set

S3

S9

What’s the problem?


Map-side Join: Basic Idea
Assume two datasets are sorted by the join key:

R1 S2

R2 S4

R4 S3

R3 S1

A sequential scan through both datasets to join


(called a “merge join” in database terminology)
Map-side Join: Parallel Scans
¢ If datasets are sorted by join key, join can be accomplished
by a scan over both datasets
¢ How can we accomplish this in parallel?
l Partition and sort both datasets in the same manner
¢ In MapReduce:
l Map over one dataset, read from other corresponding partition
l No reducers necessary (unless to repartition or resort)
¢ Consistently partitioned datasets: realistic to expect?
In-Memory Join
¢ Basic idea: load one dataset into memory, stream over
other dataset
l Works if R << S and R fits into memory
l Called a “hash join” in database terminology
¢ MapReduce implementation
l Distribute R to all nodes
l Map over S, each mapper loads R in memory, hashed by join key
l For every tuple in S, look up join key in R
l No reducers, unless for regrouping or resorting tuples
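A sketch of the in-memory (hash) join, assuming the join attribute is the first field of each tuple and that R has already been shipped to every node (e.g., via some side channel such as a distributed cache):

```python
def build_hash_table(r_tuples):
    # Executed once per mapper: load the small relation R into memory,
    # hashed by the join attribute (assumed to be the first field).
    table = {}
    for r in r_tuples:
        table.setdefault(r[0], []).append(r)
    return table

def map_fn(key, s_tuple, r_table):
    # Stream over the big relation S and probe the in-memory hash table.
    for r in r_table.get(s_tuple[0], []):
        yield s_tuple[0], (r, s_tuple)
```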
In-Memory Join: Variants
¢Striped variant:
l R too big to fit into memory?
l Divide R into R1, R2, R3, … s.t. each Rn fits into memory
l Perform in-memory join: ∀n, Rn ⋈ S
l Take the union of all join results
¢Memcached join:
l Load R into memcached
l Replace in-memory hash lookup with memcached lookup
Memcached

Caching servers: 15 million requests per second,


95% handled by memcache (15 TB of RAM)
Database layer: 800 eight-core Linux servers
running MySQL (40 TB user data)

Source: Technology Review (July/August, 2008)


Memcached Join
¢Memcached join:
l Load R into memcached
l Replace in-memory hash lookup with memcached lookup
¢Capacity and scalability?
l Memcached capacity >> RAM of individual node
l Memcached scales out with cluster
¢Latency?
l Memcached is fast (basically, speed of network)
l Batch requests to amortize latency costs

Source: See tech report by Lin et al. (2009)


Which join to use?
¢ In-memory join > map-side join > reduce-side join
l Why?
¢ Limitations of each?
l In-memory join: memory
l Map-side join: sort order and partitioning
l Reduce-side join: general purpose
Processing Relational Data: Summary
¢MapReduce algorithms for processing relational data:
l Group by, sorting, partitioning are handled automatically by
shuffle/sort in MapReduce
l Selection, projection, and other computations (e.g., aggregation),
are performed either in mapper or reducer
l Multiple strategies for relational joins
¢Complex operations require multiple MapReduce jobs
l Example: top ten URLs in terms of average time spent
l Opportunities for automatic optimization
Evolving roles for
relational database and MapReduce
OLTP/OLAP/Hadoop Architecture

ETL
(Extract, Transform, and Load)

OLTP Hadoop OLAP

Why does this make sense?


Need for High-Level Languages
¢Hadoop is great for large-data processing!
l But writing Java programs for everything is verbose and slow
l Analysts don’t want to (or can’t) write Java
¢Solution: develop higher-level data processing languages
l Hive: HQL is like SQL
l Pig: Pig Latin is a bit like Perl
Hive and Pig
¢Hive: data warehousing application in Hadoop
l Query language is HQL, variant of SQL
l Tables stored on HDFS as flat files
l Developed by Facebook, now open source
¢Pig: large-scale data processing system
l Scripts are written in Pig Latin, a dataflow language
l Developed by Yahoo!, now open source
l Roughly 1/3 of all Yahoo! internal jobs
¢Common idea:
l Provide higher-level language to facilitate large-data processing
l Higher-level language “compiles down” to Hadoop jobs
Hive: Example
¢Hive looks similar to an SQL database
¢Relational join on two tables:
l Table of word counts from Shakespeare collection
l Table of word counts from the bible

SELECT s.word, s.freq, k.freq FROM shakespeare s


JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

the 25848 62394


I 23031 8854
and 19671 38985
to 18038 13526
of 16700 34654
a 14170 8057
you 12702 2720
my 11297 4135
in 10797 12445
is 8882 6884
Source: Material drawn from Cloudera training VM
Hive: Behind the Scenes
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

(Abstract Syntax Tree)


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s)
word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k)
freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))

(one or more of MapReduce jobs)


Hive: Behind the Scenes
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

[The full operator plan is laid out in two columns on the original slide and does not survive flattening. In outline: Stage-1 is a MapReduce job in which each table alias (s and k) goes through TableScan → Filter (freq >= 1) → Reduce Output keyed and partitioned on word; the reducer performs the inner join, re-applies the frequency filter, selects word and the two freq columns, and writes a temporary file. Stage-2 is a second MapReduce job over that file that sorts on s.freq in descending order and applies the limit. Stage-0 is a Fetch operator that returns the top 10 rows.]
Questions?

Source: Wikipedia (Japanese rock garden)
