MapReduce Examples
CSE532: Theory of Database Systems
Fusheng Wang
Department of Biomedical Informatics
Department of Computer Science
Word Count Execution
Input
the quick brown fox
the fox ate the mouse
how now brown cow
Map
Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
Map 2: (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)
Shuffle & Sort
Reduce / Output
Reduce 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
Reduce 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
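The word count flow above can be traced with a small in-memory simulation. This is a minimal sketch in plain Python, not Hadoop code: the function names and the toy partitioner are assumptions made for illustration, so which reducer receives a given word may differ from the slide.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs, num_reducers):
    """Route each key to one reducer, group its values, and sort by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Toy partitioner (assumption): any deterministic function of the key works
        index = sum(key.encode()) % num_reducers
        partitions[index][key].append(value)
    return [sorted(p.items()) for p in partitions]

def reduce_counts(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
map_output = [pair for line in lines for pair in map_words(line)]
reducer_inputs = shuffle_and_sort(map_output, num_reducers=2)
reducer_outputs = [[reduce_counts(k, v) for k, v in part] for part in reducer_inputs]
word_counts = dict(pair for part in reducer_outputs for pair in part)
```

Each word lands in exactly one reducer, so the per-reducer outputs can be combined without further merging.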
Total Order Sorting by Mapper Output Keys
The output key-value pairs from the Mapper are sorted
by keys before they reach the reducers
The sort order for keys is controlled by a RawComparator, set with
mapreduce.job.output.key.comparator.class
Otherwise, keys must implement WritableComparable
A RawComparator can compare records read from a
stream without deserializing them into objects
Partitioner (a customizable hashing function) will decide
how the keys are split into reducers
Each reducer will merge the sorted keys from multiple mappers
and preserve the order
Across reducers, there is no total order
A single reducer will generate a total order of keys but will
be too slow
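The behavior described above can be checked with a toy simulation. This is a sketch under stated assumptions: keys are small integers, `hash_partition` is a stand-in for a hash partitioner, and `run_sort_job` is a hypothetical helper, not a Hadoop API.

```python
import heapq

def hash_partition(key, num_reducers):
    # Stand-in for a hash partitioner (assumption): key modulo reducer count
    return key % num_reducers

def run_sort_job(mapper_outputs, num_reducers, partition):
    """Mappers sort their output; each reducer merges the sorted runs it receives."""
    runs = [[[] for _ in mapper_outputs] for _ in range(num_reducers)]
    for m, output in enumerate(mapper_outputs):
        for key in sorted(output):
            runs[partition(key, num_reducers)][m].append(key)
    # heapq.merge preserves order when merging already-sorted runs
    return [list(heapq.merge(*reducer_runs)) for reducer_runs in runs]

mapper_outputs = [[7, 2, 9, 4], [1, 8, 3, 6]]
two_reducers = run_sort_job(mapper_outputs, 2, hash_partition)
one_reducer = run_sort_job(mapper_outputs, 1, hash_partition)
concatenated = [key for part in two_reducers for key in part]
```

With two reducers each output is sorted on its own, but the concatenation is not; with one reducer the output is totally ordered, at the cost of funneling every key through a single task.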
Sorting
Idea
Idea: produce a set of sorted files that, if
concatenated, would form a globally sorted file
The secret: use a partitioner that respects the total
order of the output
e.g.: sort the weather dataset by temperature
Reducer partitions:
< -10C
-10C to 0C
0C to 10C
>= 10C
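A range partitioner over the temperature ranges above can be sketched as follows. This is an illustrative simulation in plain Python, assuming integer temperatures and the three split points -10, 0 and 10; the function names are hypothetical.

```python
from bisect import bisect_right

BOUNDARIES = [-10, 0, 10]  # split points in degrees C, taken from the ranges above

def range_partition(temperature):
    """Return the index of the reducer whose range contains the temperature."""
    return bisect_right(BOUNDARIES, temperature)

def sort_by_temperature(readings):
    """Partition by range, then sort inside each partition (one per reducer)."""
    partitions = [[] for _ in range(len(BOUNDARIES) + 1)]
    for t in readings:
        partitions[range_partition(t)].append(t)
    return [sorted(p) for p in partitions]

readings = [12, -15, 3, 0, -10, 25, -2, 10, 7, -30]
parts = sort_by_temperature(readings)
globally_sorted = [t for p in parts for t in p]
```

Because every key in partition i is smaller than every key in partition i + 1, concatenating the reducer outputs in partition order gives a globally sorted result.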
Total Order Partitioner
HashPartitioner (default) hashes a record's key to
determine which partition/reducer the record belongs in
Goals of a total order partitioner:
The number of partitions equals the number of reducers
The size of each partition should be balanced
Sample the key space to estimate the key distribution
and generate the partition boundaries
The InputSampler runs on the client, so limit the number of splits it samples
The InputSampler writes a partition file to share with the tasks
running on the cluster via the Distributed Cache
Distributed Cache is a facility provided by the Map-Reduce
framework to cache files (text, archives, jars etc.) needed by
applications
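The sampling step can be illustrated without Hadoop. The sketch below is an assumption-laden stand-in for what a sampler and a total order partitioner do together: `sample_boundaries` and `partition_sizes` are hypothetical helpers, the keys are synthetic, and the boundaries stay in memory instead of being written to a partition file.

```python
import random
from bisect import bisect_right

def sample_boundaries(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    # Take evenly spaced quantiles of the sample as partition boundaries
    return [sample[int(step * i)] for i in range(1, num_reducers)]

def partition_sizes(keys, boundaries):
    """Count how many keys fall into each range defined by the boundaries."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[bisect_right(boundaries, key)] += 1
    return sizes

rng = random.Random(42)
keys = [rng.gauss(10, 8) for _ in range(10_000)]  # synthetic, unevenly spread keys
boundaries = sample_boundaries(keys, num_reducers=4, sample_size=500)
sizes = partition_sizes(keys, boundaries)
```

Choosing boundaries at sample quantiles, instead of at evenly spaced key values, is what keeps the partitions roughly balanced when the keys are not uniformly distributed.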
Overview of Total Order Sorting
Example Code
Joins
Repartition join: a reduce-side join for situations
where you are joining two or more large datasets
Replication join: a map-side join that works in
situations where one of the datasets is small enough
to cache
Semi-join: another map-side join where one dataset is
initially too large to fit into memory, but after some
filtering can be reduced down to a size that can fit
in memory
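The semi-join idea can be sketched in three steps. This is a minimal in-memory illustration, assuming hypothetical `logs` and `users` datasets joined on a user ID; in a MapReduce setting each step would typically run as its own job or phase.

```python
from collections import defaultdict

def semi_join(logs, users):
    """Join logs (user_id, action) with users (user_id, name) on user_id."""
    # Step 1: collect the distinct join keys that occur in the logs
    active_ids = {user_id for user_id, _ in logs}
    # Step 2: filter users down to the records that can actually match
    filtered_users = [(uid, name) for uid, name in users if uid in active_ids]
    # Step 3: map-side join: hold the filtered users in memory, stream the logs
    names = defaultdict(list)
    for uid, name in filtered_users:
        names[uid].append(name)
    joined = [(uid, action, name)
              for uid, action in logs
              for name in names.get(uid, [])]
    return joined, filtered_users

logs = [(1, "login"), (3, "search"), (1, "logout"), (9, "login")]
users = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee")]
joined, filtered_users = semi_join(logs, users)
```

Only the users that appear in the logs survive the filter, which is what lets the final join hold them in memory.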
Repartition Join
A repartition join is a reduce-side join implemented as
a single MapReduce job, and supports multi-way joins
The map phase reads the data from multiple datasets,
determining the join value for each record, and
emitting that join value as the output key
A (key, value), B (key, value)
-> (key, (value, tag)): the tag annotates the source table name
The output value contains the data needed for combining datasets in the
reducer to produce the job output
A reducer receives all of the values for a join key
emitted by the map function, and partitions them based
on their data sources
The reducer performs a Cartesian product across all
partitions and emits the results of each join
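The map and reduce steps of a two-way repartition join can be simulated in memory. This is a sketch under assumptions: tables are lists of (key, value) pairs, the tags "A" and "B" and all function names are illustrative, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

def map_record(tag, record):
    """Emit (join key, (tag, value)); the tag names the source table."""
    key, value = record
    return key, (tag, value)

def reduce_join(key, tagged_values):
    """Split values by source tag, then emit the cross product for this key."""
    from_a = [v for tag, v in tagged_values if tag == "A"]
    from_b = [v for tag, v in tagged_values if tag == "B"]
    return [(key, a, b) for a in from_a for b in from_b]

def repartition_join(table_a, table_b):
    groups = defaultdict(list)  # stands in for the shuffle: group by join key
    for tag, table in (("A", table_a), ("B", table_b)):
        for record in table:
            key, tagged = map_record(tag, record)
            groups[key].append(tagged)
    results = []
    for key in sorted(groups):
        results.extend(reduce_join(key, groups[key]))
    return results

table_a = [(1, "a1"), (2, "a2"), (3, "a3")]
table_b = [(1, "b1"), (1, "b2"), (3, "b3"), (4, "b4")]
joined = repartition_join(table_a, table_b)
```

Keys that appear in only one table produce an empty cross product, so this sketch behaves as an inner join.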
Repartition Join
Example
Join Customers (CID, Name, Phone) with Orders (CID,
OrderID, Price, Date):
Find orders for each customer
Mapper: same key (CID) for both inputs; value is customer
info for Customers, order info for Orders, plus a tag for
the data source
(CID, Name, Phone)
(CID, OrderID, Price, Date)
The Reducer Side of Repartition Join
For a given join key, the reduce task
performs a full cross-product of values
from different sources
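The reducer's cross-product for one join key can be sketched with `itertools.product`, which also extends naturally to more than two sources. The customer and order values below are made-up sample data, and `reduce_cross_product` is a hypothetical helper, not part of any framework.

```python
from itertools import product

def reduce_cross_product(cid, tagged_values, sources):
    """Cross product of the values for one join key, one factor per source."""
    by_source = {s: [] for s in sources}
    for tag, value in tagged_values:
        by_source[tag].append(value)
    # Each combo holds one value tuple per source; flatten it into one output row
    return [(cid,) + sum(combo, ())
            for combo in product(*(by_source[s] for s in sources))]

sources = ("Customers", "Orders")
values_for_cid_1 = [
    ("Orders", (101, 25.0, "2020-01-05")),
    ("Customers", ("Ann", "555-0100")),
    ("Orders", (102, 40.0, "2020-02-11")),
]
rows = reduce_cross_product(1, values_for_cid_1, sources)
no_orders = reduce_cross_product(2, [("Customers", ("Bob", "555-0101"))], sources)
```

A customer with no orders yields no rows, because a cross product with an empty factor is empty.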
Replicated Joins
A repartition join happens late, in the reduce phase, with major
overhead from moving data to the reducer nodes
Replicated join: join operation between one large and
many small data sets that can be performed on the map
side
Completely eliminates the need to shuffle any data to the
reduce phase
All the data sets except the very large one are essentially
read into memory during the setup phase of each map task
Join is done entirely in the map phase, with the very large
data set being the input for the MapReduce job
Restriction: a replicated join is really useful only for an
inner or a left outer join where the large data set is the
left data set
// Read cached table into Hashtable