MapReduce Sorting and Joins Guide

This document provides examples and explanations of MapReduce concepts including word count, sorting, partitioning, and different types of joins. It explains how word count works by mapping words to counts, shuffling and sorting, then reducing counts. It describes how total order sorting preserves key order across reducers using a partitioner that respects the order. Repartition and replicated joins are covered: a repartition join is performed on the reducer side, and a replicated join is performed on the mapper side when one dataset is small enough to cache.

Uploaded by

icecream-likey

MapReduce Examples

CSE532: Theory of Database Systems

Fusheng Wang
Department of Biomedical Informatics
Department of Computer Science

Word Count Execution

Input (one line per map task)
the quick brown fox
the fox ate the mouse
how now brown cow

Map (output pairs, grouped by destination reducer)
Map 1: brown, 1; fox, 1; the, 1 | quick, 1
Map 2: fox, 1; the, 1; the, 1 | ate, 1; mouse, 1
Map 3: brown, 1; how, 1; now, 1 | cow, 1

Shuffle & Sort

Reduce (output)
Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
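
The three phases of the word count can be imitated in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code; the function names (`map_words`, `shuffle_and_sort`, `reduce_counts`) are made up for illustration, and a single reducer is assumed instead of the two shown in the execution.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & sort: group values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
mapped = [pair for line in splits for pair in map_words(line)]
counts = dict(reduce_counts(k, v) for k, v in shuffle_and_sort(mapped))
print(counts)
```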

Total Order Sorting by Mapper Output Keys

The output key-value pairs from the Mapper are sorted by key before they reach the reducers
The sort order for keys is controlled by a RawComparator, which can be set with the property mapreduce.job.output.key.comparator.class
Keys implement WritableComparable, or a RawComparator is supplied that compares records read from a stream without deserializing them into objects

The Partitioner (a customizable hashing function) decides how the keys are split among reducers
Each reducer merges the keys from multiple mappers and preserves their order
Across reducers, there is no total order
A single reducer would generate a total order of keys but would be too slow
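
A small simulation shows why hash-style partitioning gives sorted output per reducer but no total order across reducers. This is a sketch under simplifying assumptions: integer keys, two reducers, and `key % num_reducers` standing in for a hash partitioner; `modulo_partition` and `run_sort` are illustrative names, not Hadoop classes.

```python
def modulo_partition(key, num_reducers):
    """Stand-in for a hash partitioner: reducer index = key mod reducers."""
    return key % num_reducers


def run_sort(keys, num_reducers, partition):
    """Route each key to a reducer, then sort the keys inside each reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return [sorted(bucket) for bucket in buckets]


keys = [7, 2, 9, 4, 1, 8]
outputs = run_sort(keys, 2, modulo_partition)
concatenated = [k for part in outputs for k in part]

print(outputs)                       # each reducer's output is sorted
print(concatenated == sorted(keys))  # but the concatenation is not
```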
Sorting

Idea
Produce a set of sorted files that, if concatenated, would form a globally sorted file
The secret: use a partitioner that respects the total order of the output

E.g., sort the weather dataset by temperature, with one temperature range per reducer

Reducer
< -10C
-10C to 0C
0C to 10C
>= 10C
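
The temperature example can be sketched with a range partitioner in plain Python. The split points -10, 0 and 10 come from the ranges listed above; the sample readings and the name `range_partition` are made up for illustration, and each lower bound is assumed to be inclusive.

```python
from bisect import bisect_right

# Split points in degrees C, taken from the four ranges in the example.
BOUNDARIES = [-10, 0, 10]


def range_partition(temperature):
    """Return the reducer index whose range contains this temperature."""
    return bisect_right(BOUNDARIES, temperature)


readings = [12, -15, 3, -10, 0, 25, -4, 9, 10, -30]  # hypothetical data

partitions = [[] for _ in range(len(BOUNDARIES) + 1)]
for t in readings:
    partitions[range_partition(t)].append(t)

# Each reducer sorts only its own partition.
sorted_parts = [sorted(p) for p in partitions]
# Concatenating the reducer outputs in order gives a globally sorted list.
concatenated = [t for part in sorted_parts for t in part]
print(sorted_parts)
```

Because every key in partition i is smaller than every key in partition i+1, no merge step across reducers is needed.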

Total Order Partitioner

HashPartitioner (the default) hashes a record's key to determine which partition/reducer the record belongs in
Goals of a total order partitioner:
The number of partitions equals the number of reducers
The size of each partition should be balanced

Sample the key space to estimate the key distribution and generate the boundaries used for partitioning
The InputSampler runs on the client, so the number of splits it samples is limited

The InputSampler writes a partition file that is shared with the tasks running on the cluster through the Distributed Cache
The Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications
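
The sampling step can be illustrated as follows. This is a rough plain-Python sketch of the idea (sample keys, sort the sample, take evenly spaced split points), not the behaviour of Hadoop's InputSampler; `sample_boundaries`, the sample size, the seeds and the synthetic key distribution are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def sample_boundaries(keys, num_reducers, sample_size, seed=0):
    """Estimate num_reducers - 1 split points from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, sample_size))
    step = sample_size / num_reducers
    # Evenly spaced positions in the sorted sample approximate the quantiles.
    return [sample[int(step * i)] for i in range(1, num_reducers)]


# Synthetic, skewed-looking key space: temperatures around 10C.
data_rng = random.Random(42)
keys = [data_rng.gauss(10, 15) for _ in range(10_000)]

boundaries = sample_boundaries(keys, num_reducers=4, sample_size=400)

# Count how many keys each reducer would receive with these boundaries.
sizes = [0] * 4
for key in keys:
    sizes[bisect_right(boundaries, key)] += 1

print(boundaries)
print(sizes)
```

With boundaries taken from the sample rather than fixed by hand, the partition sizes stay roughly balanced even though the keys are not uniformly distributed.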
Overview of Total Order Sorting

Example Code

Joins
Repartition join: a reduce-side join for situations where you are joining two or more large datasets
Replication join: a map-side join that works in situations where one of the datasets is small enough to cache
Semi-join: another map-side join where one dataset is initially too large to fit into memory, but after some filtering can be reduced down to a size that can fit in memory
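
The semi-join idea can be sketched in plain Python as three steps: collect the join keys of one dataset, use them to filter the other dataset down to something that fits in memory, then join on the map side. The function name `semi_join` and the sample records are hypothetical, and the set of distinct join keys is assumed to be small enough to hold in memory.

```python
def semi_join(streamed, too_large):
    """Join two (key, value) datasets after filtering the second by key."""
    # Step 1: distinct join keys that actually occur in the streamed dataset.
    wanted_keys = {key for key, _ in streamed}

    # Step 2: keep only records of the other dataset that can ever match.
    filtered = {}
    for key, value in too_large:
        if key in wanted_keys:
            filtered.setdefault(key, []).append(value)

    # Step 3: map-side join against the now small in-memory table.
    return [
        (key, left, right)
        for key, left in streamed
        for right in filtered.get(key, [])
    ]


logs = [(1, "login"), (3, "view"), (1, "logout")]
users = [(1, "ann"), (2, "bob"), (3, "cat"), (4, "dan")]
joined = semi_join(logs, users)
print(joined)
```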

Repartition Join
A repartition join is a reduce-side join implemented as a single MapReduce job; it supports multi-way joins

The map phase reads the data from multiple datasets, determines the join value for each record, and emits that join value as the output key
For inputs A (key, value) and B (key, value), the map output is (key, (value, tag)), where the tag names the source table
The output value contains the data needed for combining the datasets in the reducer to produce the job output

A reducer receives all of the values for a join key emitted by the map function and partitions them based on their data source
The reducer performs a Cartesian product across all partitions and emits the results of each join
Repartition Join

Example
Join Customers (CID, Name, Phone) with Orders (CID, OrderID, Price, Date): find the orders for each customer

Mapper: uses the same key (CID) for both inputs; the value is the customer info for Customers and the order info for Orders, plus a tag naming the data source
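
The map side of this example might look like the sketch below. It is plain Python rather than a Hadoop Mapper; the record values, the tag strings and the helper names `map_customer` and `map_order` are invented for illustration, and the shuffle is imitated with a dictionary keyed by CID.

```python
from collections import defaultdict


def map_customer(record):
    """Emit (CID, (tag, customer info)) for one Customers record."""
    cid, name, phone = record
    return cid, ("Customers", (name, phone))


def map_order(record):
    """Emit (CID, (tag, order info)) for one Orders record."""
    cid, order_id, price, date = record
    return cid, ("Orders", (order_id, price, date))


customers = [(1, "Ann", "555-0100"), (2, "Bob", "555-0101")]
orders = [
    (1, "A-17", 25.0, "2015-03-01"),
    (1, "A-18", 8.5, "2015-03-04"),
    (2, "B-02", 12.0, "2015-03-02"),
]

# Imitate the shuffle: all tagged values with the same CID end up together.
shuffled = defaultdict(list)
emitted = [map_customer(c) for c in customers] + [map_order(o) for o in orders]
for cid, tagged_value in emitted:
    shuffled[cid].append(tagged_value)

print(dict(shuffled))
```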

Inputs: Customers (CID, Name, Phone) and Orders (CID, OrderID, Price, Date)

The Reducer Side of Repartition Join

For a given join key, the reduce task performs a full cross-product of the values from the different sources
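
The reduce step can be sketched as a function that first splits the tagged values by source and then takes their cross-product. This is a minimal plain-Python illustration; `reduce_join`, the tag strings and the sample values are assumptions, and an inner join is assumed (a key missing from any source produces no output).

```python
from collections import defaultdict
from itertools import product


def reduce_join(key, tagged_values, sources):
    """Cross-product of one join key's values, one factor per data source."""
    by_source = defaultdict(list)
    for tag, value in tagged_values:
        by_source[tag].append(value)
    # product() yields nothing if any source has no values for this key.
    combos = product(*(by_source[source] for source in sources))
    return [(key,) + combo for combo in combos]


tagged = [
    ("Customers", ("Ann", "555-0100")),
    ("Orders", ("A-17", 25.0)),
    ("Orders", ("A-18", 8.5)),
]
rows = reduce_join(1, tagged, ["Customers", "Orders"])
unmatched = reduce_join(9, [("Customers", ("Zed", "555-0199"))], ["Customers", "Orders"])
print(rows)
```

Passing more than two names in `sources` extends the same cross-product to a multi-way join.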

Replicated Joins
A repartition join happens late, in the reduce phase, and incurs major overhead from moving data to the reducer nodes
Replicated join: a join between one large data set and many small data sets that can be performed on the map side
It completely eliminates the need to shuffle any data to the reduce phase
All the data sets except the very large one are read into memory during the setup phase of each map task
The join is done entirely in the map phase, with the very large data set being the input for the MapReduce job
Restriction: a replicated join is really useful only for an inner join or a left outer join where the large data set is the left data set
Replicated Joins

// Read cached table into Hashtable
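
As a rough plain-Python sketch of this step and the map-side join that follows it: the small data set is loaded into a dictionary (playing the role of the hash table), and each record of the large data set is joined by a lookup. The names `load_small_table` and `map_join` and the sample records are hypothetical; a real job would read the small table from the Distributed Cache in the map task's setup phase.

```python
def load_small_table(records):
    """Setup phase: read the cached small data set into an in-memory table."""
    table = {}
    for key, value in records:
        table.setdefault(key, []).append(value)
    return table


def map_join(record, table, left_outer=False):
    """Map phase: join one record of the large data set by table lookup."""
    key, value = record
    matches = table.get(key)
    if matches:
        return [(key, value, match) for match in matches]
    # No match: keep the large-side record only for a left outer join.
    return [(key, value, None)] if left_outer else []


customers = [(1, "Ann"), (2, "Bob")]                           # small, cached
orders = [(1, "A-17"), (3, "C-09"), (2, "B-02"), (1, "A-18")]  # large input

table = load_small_table(customers)
inner = [row for o in orders for row in map_join(o, table)]
left = [row for o in orders for row in map_join(o, table, left_outer=True)]
print(inner)
print(left)
```

The restriction to inner and left outer joins shows up here: each map task sees only its own slice of the large data set, so no single task can tell which small-table keys never matched anywhere, which is what a right outer join would need.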
