MapReduce and Hadoop
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
November 10, 2014
Let’s keep the intro short
Modern data mining: process immense amounts of data quickly
Exploit parallelism
Traditional parallelism
Bring data to compute
MapReduce
Bring compute to data
Pictures courtesy: Glenn K. Lockwood, glennklockwood.com
The MapReduce paradigm
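[Figure: MapReduce data flow. Stages: Split, Map, Shuffle and sort, Reduce, Final]
Split: the original input is divided into input chunks (may be already split in the filesystem)
Map: produces <Key,Value> pairs
Shuffle and sort: produces <Key,Value> pairs grouped by keys
Reduce: produces output chunks
Final: final output (the output chunks may not need to be combined)
The user needs to write the map() and the reduce()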
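The division of labour described here can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not how Hadoop is implemented: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, the shuffle is an in-memory dictionary, and the toy job (total word length per first letter) is made up for the example.

```python
from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Toy single-process MapReduce driver (illustrative only)."""
    # Map: apply the user's map function to every record of every chunk.
    pairs = []
    for chunk in chunks:
        for record in chunk:
            pairs.extend(map_fn(record))
    # Shuffle and sort: group the values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: call the user's reduce function once per key, in key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# The only two functions a user of the framework would write.
def map_fn(word):
    yield (word[0], len(word))

def reduce_fn(letter, lengths):
    yield (letter, sum(lengths))

chunks = [["ant", "bison"], ["bee", "alpaca"]]  # assumed input, already split
result = run_mapreduce(chunks, map_fn, reduce_fn)
print(result)
```

Everything outside `map_fn` and `reduce_fn` is what the framework would provide.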
An example: word frequency counting
Problem: Given a collection of documents, count the number of times each word occurs in the collection
[Figure: word counting data flow. Stages: Split, Map, Shuffle and sort, Reduce, Final]
Split: the collection of documents is divided into subcollections of documents
map: for each word w, emit pairs (w,1)
Shuffle and sort: the pairs (w,1) for the same words are grouped together
reduce: count the number (n) of pairs for each w, make it (w,n)
Final: output (w,n) for each w
An example: word frequency counting
[Figure: the word counting data flow traced on a small collection of fruit names (apple, orange, peach, plum, guava, cherry, fig)]
Map: every occurrence of a word yields one pair, e.g. (apple,1), (orange,1), (peach,1)
Shuffle and sort: pairs for the same word are grouped together, e.g. (orange,1), (orange,1), (orange,1)
Reduce and final output: (apple,2), (orange,3), (guava,1), (plum,2), (cherry,2), (fig,2), (peach,3)
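The word counting job can be traced stage by stage in Python. This is a minimal single-process sketch: the four input chunks are assumed (chosen so that the totals agree with the final output on the slide), and `map_words` and `reduce_counts` are hypothetical names.

```python
from collections import defaultdict

# Assumed input: a document collection already split into four chunks.
chunks = [
    ["apple orange peach", "orange plum"],
    ["orange apple", "guava plum"],
    ["cherry fig", "cherry fig"],
    ["peach", "peach"],
]

def map_words(document):
    """map: for each word w in the document, emit the pair (w, 1)."""
    return [(w, 1) for w in document.split()]

def reduce_counts(word, ones):
    """reduce: count the pairs received for word w and emit (w, n)."""
    return (word, len(ones))

# Map: chunks are independent, so a cluster would process them in parallel.
mapped = [pair for chunk in chunks for doc in chunk for pair in map_words(doc)]

# Shuffle and sort: group the pairs (w, 1) by word.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: one call per distinct word.
counts = dict(reduce_counts(w, ones) for w, ones in sorted(grouped.items()))
print(counts)
```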
Apache Hadoop
An open source MapReduce framework
HADOOP
Hadoop
Two main components
– Hadoop Distributed File System (HDFS): to store data
– MapReduce engine: to process data
Master – slave architecture using commodity servers
The HDFS
– Master: Namenode
– Slave: Datanode
MapReduce
– Master: JobTracker
– Slave: TaskTracker
HDFS: Blocks
[Figure: a big file divided into Blocks 1 to 6, with copies of each block stored on Datanodes 1 to 4]
Runs on top of existing filesystem
Blocks are 64MB (128MB recommended)
A single file can be larger than any single disk
POSIX based permissions
Fault tolerant
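The block arithmetic can be made concrete with a small sketch. The 64MB block size comes from the slide; the replication factor of 3 and the round-robin placement are assumptions made only for illustration (the actual HDFS placement policy is more involved), and `split_into_blocks` and `place_replicas` are hypothetical helper names.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks needed to hold a file."""
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * n_full
    if remainder:
        sizes.append(remainder)  # the last block may be only partly filled
    return sizes

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumed policy for illustration; requires replication <= len(datanodes).
    """
    placement = {}
    for b in range(num_blocks):
        placement[b + 1] = [datanodes[(b + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(350)  # a hypothetical 350MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement)
```

Because every block lives on several datanodes, losing one datanode does not lose any block, which is the sense in which the filesystem is fault tolerant.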
HDFS: Namenode and Datanode
Namenode
– Only one per Hadoop Cluster
– Manages the filesystem namespace
– The filesystem tree
– An edit log
– For each block i, the datanode(s) in which block i is saved
– All the blocks residing in each datanode
Secondary Namenode
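– Periodically merges the edit log into the saved namespace image (checkpointing); despite the name, it is not a hot standby for the Namenode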
Datanodes
– Many per Hadoop cluster
– Control block operations
– Physically store the blocks
– Perform the physical replication
HDFS: an example
MapReduce: JobTracker and TaskTracker
1. JobClient submits job to JobTracker; the job binary is copied into HDFS
2. JobTracker talks to Namenode
3. JobTracker creates execution plan
4. JobTracker submits work to TaskTrackers
5. TaskTrackers report progress via heartbeat
6. JobTracker updates status
Map, Shuffle and Reduce: internal steps
1. Splits the input data and sends the splits to the mappers
2. Transforms splits into key/value pairs
3. Key/value pairs with the same key are sent to the same reducer
4. Aggregates key/value pairs based on user-defined code
5. Determines how the results are saved
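Step 3 relies on a partitioning function that maps equal keys to the same reducer. A minimal sketch, assuming a hash-based partitioner (Hadoop's default partitioner is also hash-based, but this code is not its implementation); `partition` is a hypothetical name.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index; equal keys always get the same index."""
    # crc32 is used instead of the built-in hash() because string hashes
    # are randomised per Python process.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

NUM_REDUCERS = 3  # assumed cluster setting
pairs = [("apple", 1), ("plum", 1), ("apple", 1), ("fig", 1)]

# Route every key/value pair to the bucket of its reducer.
buckets = {r: [] for r in range(NUM_REDUCERS)}
for key, value in pairs:
    buckets[partition(key, NUM_REDUCERS)].append((key, value))
print(buckets)
```

Both `("apple", 1)` pairs end up in the same bucket, so a single reducer sees all values for that key.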
Fault Tolerance
If the master fails
– The MapReduce job fails and has to be restarted from the beginning
If a map worker node fails
– Master detects the failure (its periodic ping times out)
– All the map tasks assigned to this node have to be restarted
• Even completed map tasks must be rerun, because their output was stored on the failed node
If a reduce worker fails
– Master sets the status of its currently executing reduce tasks to idle
– These tasks are rescheduled on another reduce worker
Some algorithms using MapReduce
USING MAPREDUCE
Matrix – Vector Multiplication
Multiply $M = (m_{ij})$ (an $n \times n$ matrix) and $v = (v_i)$ (an $n$-vector)
If n = 1000, no need for MapReduce!
[Figure: the n × n matrix M multiplied by the n-vector v]
Case 1: Large n, M does not fit into main memory, but v does
Since v fits into main memory, v is available to every map task
Map: for each matrix element $m_{ij}$, emit the key-value pair $(i, m_{ij} v_j)$
Shuffle and sort: groups all $m_{ij} v_j$ values together for the same i
Reduce: sum $m_{ij} v_j$ over all j for the same i, which gives the i-th entry $\sum_{j} m_{ij} v_j$ of the product
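Case 1 can be sketched in Python as follows. This is a single-process illustration: the matrix is stored as (i, j, value) triples, the 2 × 2 data is made up, and `map_entry` is a hypothetical name.

```python
from collections import defaultdict

# Assumed toy data: M as triples (i, j, m_ij); v held entirely in memory.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def map_entry(i, j, m_ij):
    """map: emit (i, m_ij * v_j); v is available to every map task."""
    return (i, m_ij * v[j])

# Shuffle and sort: collect all products that share the row index i.
groups = defaultdict(list)
for i, j, m_ij in M:
    key, value = map_entry(i, j, m_ij)
    groups[key].append(value)

# Reduce: sum the products for each row i.
product = {i: sum(values) for i, values in groups.items()}
print(product)
```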
Matrix – Vector Multiplication
Multiply $M = (m_{ij})$ (an $n \times n$ matrix) and $v = (v_i)$ (an $n$-vector)
If n = 1000, no need for MapReduce!
[Figure: M and v cut into stripes; one stripe of v fits into main memory, the whole of v no longer does]
Case 2: Very large n, even v does not fit into main memory
For every map, many accesses to disk (for parts of v) required!
Solution:
– How much of v will fit in?
– Partition v into stripes and cut every row of M into matching pieces (vertical stripes of M), so that each stripe of v fits into memory
– Take the dot product of one stripe of v and the corresponding stripe of M
– Map and reduce same as before
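The striping idea of Case 2 can be sketched as below. The stripe width stands in for "how much of v fits in memory" and is an assumed value; the 2 × 4 data and the name `stripe_of` are made up for the example. Each simulated map task touches only one stripe of v.

```python
from collections import defaultdict

# Assumed toy data: M as triples (i, j, m_ij) and a 4-vector v.
M = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (0, 3, 4.0),
     (1, 0, 5.0), (1, 1, 6.0), (1, 2, 7.0), (1, 3, 8.0)]
v = [1.0, 10.0, 100.0, 1000.0]
STRIPE_WIDTH = 2  # assumption: only two entries of v fit in memory at once

def stripe_of(j):
    """Index of the stripe that column j (and entry v_j) belongs to."""
    return j // STRIPE_WIDTH

# Partition the entries of M by the stripe of their column index.
m_stripes = defaultdict(list)
for i, j, m_ij in M:
    m_stripes[stripe_of(j)].append((i, j, m_ij))

# One map task per stripe: it loads only the matching stripe of v.
groups = defaultdict(list)
for p, entries in m_stripes.items():
    start = p * STRIPE_WIDTH
    v_stripe = v[start:start + STRIPE_WIDTH]
    for i, j, m_ij in entries:
        groups[i].append(m_ij * v_stripe[j - start])  # emit (i, m_ij * v_j)

# Reduce: same as before, sum the values for each row i across all stripes.
product = {i: sum(values) for i, values in groups.items()}
print(product)
```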
Relational Algebra
Relation R(A1, A2, …, An) is a relation with attributes Ai
Schema: set of attributes
Selection on condition C: apply C on each tuple in R, output only those which satisfy C
Projection on a subset S of attributes: output the components for the attributes in S
Union, Intersection, Join…
Example relation:
Attr1  Attr2  Attr3  Attr4
xyz    abc    1      true
abc    xyz    1      true
xyz    def    1      false
bcd    def    2      true
Links between URLs:
URL1  URL2
url1  url2
url2  url1
url3  url5
url1  url3
Selection using MapReduce
Trivial
Map: For each tuple t in R, test if t satisfies C. If so, produce the key-value pair (t, t).
Reduce: The identity function. It simply passes each key-value pair to the output.
Example relation: Links between URLs (URL1, URL2), with tuples (url1, url2), (url2, url1), (url3, url5), (url1, url3)
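Selection on the Links relation can be sketched as follows. The condition C used here (URL1 equals "url1") is an assumed example, and `condition`, `map_select` and `reduce_identity` are hypothetical names.

```python
# The Links relation from the slide, as (URL1, URL2) tuples.
links = [("url1", "url2"), ("url2", "url1"), ("url3", "url5"), ("url1", "url3")]

def condition(t):
    """Assumed selection condition C: the link starts at url1."""
    return t[0] == "url1"

def map_select(t):
    """map: emit (t, t) only if t satisfies C."""
    if condition(t):
        yield (t, t)

def reduce_identity(key, values):
    """reduce: pass every key-value pair through unchanged."""
    for value in values:
        yield (key, value)

mapped = [kv for t in links for kv in map_select(t)]

selected = []
for key, value in mapped:
    for out_key, out_value in reduce_identity(key, [value]):
        selected.append(out_value)
print(selected)
```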
Union using MapReduce
Union of two relations R and S
Suppose R and S have the same schema
Map tasks are generated from chunks of both R and S
Map: For each tuple t, produce the key-value pair (t, t)
Reduce: Only need to remove duplicates
– For each key t, there will be either one or two values
– Output (t, t) in either case
Example relation: Links between URLs (URL1, URL2)
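A minimal sketch of the union, assuming two small made-up relations R and S over the (URL1, URL2) schema; `map_union` and `reduce_union` are hypothetical names.

```python
from collections import defaultdict

# Assumed toy relations with the same schema; one tuple occurs in both.
R = [("url1", "url2"), ("url2", "url1")]
S = [("url2", "url1"), ("url3", "url5")]

def map_union(t):
    """map: emit (t, t) for every tuple of R and of S."""
    return (t, t)

def reduce_union(key, values):
    """reduce: values holds one or two copies of t; output (t, t) once."""
    return (key, key)

# Shuffle and sort: identical tuples meet under the same key.
groups = defaultdict(list)
for t in R + S:
    key, value = map_union(t)
    groups[key].append(value)

union = sorted(reduce_union(key, values)[0] for key, values in groups.items())
print(union)
```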
Natural join using MapReduce
Join R(A,B) with S(B,C) on attribute B
Map:
– For each tuple t = (a,b) of R, emit key value pair (b,(R,a))
– For each tuple t = (b,c) of S, emit key value pair (b,(S,c))
Reduce:
– Each key b would be associated with a list of values that are of the form (R,a) or (S,c)
– Construct all pairs consisting of one with first component R and the other with first component S, say (R,a) and (S,c). The output from this key and value list is a sequence of key-value pairs
– The key is irrelevant. Each value is one of the triples (a, b, c) such that (R,a) and (S,c) are on the input list of values
R:
A  B
x  a
y  b
z  c
w  d
S:
B  C
a  1
c  3
d  4
g  7
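The join of the two example relations R and S can be sketched as below. This is a single-process illustration; `reduce_join` is a hypothetical name and the shuffle is an in-memory dictionary.

```python
from collections import defaultdict

# The example relations R(A,B) and S(B,C) from the slide.
R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]

# Map + shuffle: key every tuple by its B value and tag it with its relation.
groups = defaultdict(list)
for a, b in R:
    groups[b].append(("R", a))
for b, c in S:
    groups[b].append(("S", c))

def reduce_join(b, values):
    """reduce: pair every (R, a) with every (S, c) that shares the key b."""
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

joined = sorted(t for b, values in groups.items() for t in reduce_join(b, values))
print(joined)
```

Keys that appear in only one relation (b in R, g in S) produce no output, as expected for a natural join.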
Grouping and Aggregation using MapReduce
Group and aggregate on a relation R(A,B) using aggregation function γ(B), group by A
Map:
– For each tuple t = (a,b) of R, emit key value pair (a,b)
Reduce:
– For each group {(a,b1), …, (a,bm)} represented by a key a, apply γ to obtain $b_a$ (for γ = SUM, $b_a = b_1 + \dots + b_m$)
– Output $(a, b_a)$
R:
A  B
x  2
y  1
z  4
z  1
x  5
select A, sum(B) from R group by A;
A  SUM(B)
x  7
y  1
z  5
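The grouping example can be reproduced with a short sketch. The relation R is the one on the slide; `reduce_aggregate` is a hypothetical name, and the aggregation function γ is passed in as an ordinary Python function.

```python
from collections import defaultdict

# The example relation R(A,B) from the slide.
R = [("x", 2), ("y", 1), ("z", 4), ("z", 1), ("x", 5)]

# Map + shuffle: each tuple (a, b) is emitted as is and grouped by a.
groups = defaultdict(list)
for a, b in R:
    groups[a].append(b)

def reduce_aggregate(a, bs, gamma=sum):
    """reduce: apply the aggregation function gamma to the group of a."""
    return (a, gamma(bs))

sums = dict(reduce_aggregate(a, bs) for a, bs in groups.items())
maxima = dict(reduce_aggregate(a, bs, gamma=max) for a, bs in groups.items())
print(sums, maxima)
```

Swapping `gamma` changes the aggregate without touching the map or the shuffle.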
Matrix multiplication using MapReduce
[Figure: A (m × n) multiplied by B (n × l) gives C (m × l)]
Think of a matrix as a relation with three attributes
For example matrix A is represented by the relation A(I, J, V)
– For every non-zero entry $(i, j, a_{ij})$, the row number is the value of I, the column number is the value of J, the entry is the value in V
– Added advantage: most large matrices are sparse, so the relation has far fewer tuples than the matrix has entries
The product is essentially a natural join followed by a grouping with aggregation
Matrix multiplication using MapReduce
[Figure: A (m × n) multiplied by B (n × l) gives C (m × l); entries $(i, j, a_{ij})$ of A and $(j, k, b_{jk})$ of B]
Natural join of the (I,J,V) and (J,K,W) tuples gives tuples $(i, j, k, a_{ij}, b_{jk})$
Map:
– For every $(i, j, a_{ij})$, emit key value pair $(j, (A, i, a_{ij}))$
– For every $(j, k, b_{jk})$, emit key value pair $(j, (B, k, b_{jk}))$
Reduce:
for each key j
for each value $(A, i, a_{ij})$ and $(B, k, b_{jk})$
produce a key value pair $((i,k), a_{ij} b_{jk})$
Matrix multiplication using MapReduce
[Figure: A (m × n) multiplied by B (n × l) gives C (m × l); entries $(i, j, a_{ij})$ of A and $(j, k, b_{jk})$ of B]
First MapReduce process has produced key value pairs $((i,k), a_{ij} b_{jk})$
Another MapReduce process to group and aggregate
Map: identity, just emit the key value pair $((i,k), a_{ij} b_{jk})$
Reduce:
for each key (i,k)
produce the sum of all the values for the key: $c_{ik} = \sum_{j} a_{ij} b_{jk}$
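The two MapReduce processes can be chained in a short sketch. The matrices are made-up 2 × 2 examples stored as sparse triples (zero entries omitted), and both shuffles are in-memory dictionaries.

```python
from collections import defaultdict

# Assumed toy data: A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]],
# stored as triples (i, j, a_ij) and (j, k, b_jk) with zeros left out.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]

# First process, map + shuffle: key both relations by the shared index j.
by_j = defaultdict(list)
for i, j, a in A:
    by_j[j].append(("A", i, a))
for j, k, b in B:
    by_j[j].append(("B", k, b))

# First process, reduce: for each j, emit ((i, k), a_ij * b_jk).
products = []
for j, values in by_j.items():
    a_side = [(i, a) for tag, i, a in values if tag == "A"]
    b_side = [(k, b) for tag, k, b in values if tag == "B"]
    for i, a in a_side:
        for k, b in b_side:
            products.append(((i, k), a * b))

# Second process: identity map, then group by (i, k) and sum.
by_ik = defaultdict(list)
for key, value in products:
    by_ik[key].append(value)
C = {key: sum(values) for key, values in by_ik.items()}
print(C)
```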
Matrix multiplication using MapReduce: Method 2
[Figure: A (m × n) multiplied by B (n × l) gives C (m × l); entries $(i, j, a_{ij})$ of A and $(j, k, b_{jk})$ of B]
A method with one MapReduce step
Map:
– For every $(i, j, a_{ij})$, emit for all k = 1,…, l, the key value pair $((i,k), (A, j, a_{ij}))$
– For every $(j, k, b_{jk})$, emit for all i = 1,…, m, the key value pair $((i,k), (B, j, b_{jk}))$
Reduce:
for each key (i,k)
sort values $(A, j, a_{ij})$ and $(B, j, b_{jk})$ by j to group them by j
for each j multiply $a_{ij}$ and $b_{jk}$
sum the products for the key (i,k) to produce $c_{ik} = \sum_{j} a_{ij} b_{jk}$
Caveat: the values for one key may not fit in main memory. Expensive external sort!
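The one-step method can be sketched as below. The data is a made-up 2 × 2 example stored as sparse triples, indices are 0-based here (the slide counts from 1), and `reduce_cell` is a hypothetical name. The in-memory `sorted` call stands in for the sort that may have to be external on real data.

```python
from collections import defaultdict
from itertools import groupby

# Assumed toy data: A = [[1, 2], [0, 3]] (m x n), B = [[4, 0], [5, 6]] (n x l).
m, n, l = 2, 2, 2
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]

# Map + shuffle: replicate each entry to every output cell (i, k) it feeds.
groups = defaultdict(list)
for i, j, a in A:
    for k in range(l):
        groups[(i, k)].append(("A", j, a))
for j, k, b in B:
    for i in range(m):
        groups[(i, k)].append(("B", j, b))

def reduce_cell(values):
    """reduce: sort by j, multiply matching a_ij and b_jk, sum the products."""
    total = 0
    ordered = sorted(values, key=lambda value: value[1])
    for j, same_j in groupby(ordered, key=lambda value: value[1]):
        entries = {tag: x for tag, _, x in same_j}
        if "A" in entries and "B" in entries:  # a missing side is a zero entry
            total += entries["A"] * entries["B"]
    return total

C = {key: reduce_cell(values) for key, values in groups.items()}
print(C)
```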
References and acknowledgements
Mining of Massive Datasets, by Leskovec, Rajaraman and
Ullman, Chapter 2
Slides by Dwaipayan Roy