
“Map Reduce” Computing Paradigm

pm jat @ daiict
Make sure that you have understood!
• Map and Reduce functions
– Input
– Output
• Partition (shuffle) and the partition function (default and user-defined)
• Sort
• We can, and sometimes need to, define
– a customized “Partition Function”
– a customized “Compare” (sort) function for the key’s data type

11-Aug-25 map-reduce computing 2


MR data flow [2]

Figure Source: [2]

Refinements

• The following MR refinements are presented in the article:


– Combiner Function
– Partitioning Function
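
The combiner refinement can be illustrated with the classic word-count job from [1]. This Python sketch (the `map_words`/`combine` helpers are illustrative names, not a real MR API) shows the local pre-aggregation a combiner performs before the shuffle:

```python
from collections import Counter

def map_words(line):
    # Map: emit (word, 1) for every word in the line
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate the map output locally, so the shuffle
    # moves one (word, count) pair per distinct word instead of one
    # pair per occurrence
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())
```

The reducer then runs the same summation over the combined pairs; a combiner must be associative and commutative for this to be safe.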

Partitioning Function
• The default partitioning function is “hash(key) mod R”
– It produces fairly well-balanced partitions
• In some cases, this may not be enough
– For example, when the URL is the key and we want all URLs of one host to land on the
same reducer
– For this, we may use a partitioning function such as
“hash(Hostname(urlkey)) mod R”
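
As a sketch (plain Python functions, not the actual Hadoop Partitioner API), the two partitioners might look like:

```python
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    # Default: hash the whole key, mod R
    return hash(key) % num_reducers

def host_partition(url_key, num_reducers):
    # Custom: hash only the hostname, so all URLs of one host
    # land on the same reducer
    host = urlparse(url_key).netloc
    return hash(host) % num_reducers
```

With `host_partition`, "http://example.com/page1" and "http://example.com/page2" are guaranteed the same reducer; with `default_partition` they usually are not.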

Multi Stage “Map Reduce Job”?
• Two-step solution
• Stage 1: Compute department-wise average
– Map Output: dno, salary
– Reducer Output: dno, avg_salary
• Stage 2:
– Map output: recno, record that meets the criterion
“e.salary > avg_salary[e.dno]”
(Where does avg_salary come from? It is the Stage-1 output, made available to every mapper.)
– Reducer:
• Does nothing (identity)
• An example of multi-stage MR is available here:
https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
• Code: https://colab.research.google.com/drive/1d_oakD_VHu-V7TV6MY1t8w6xy9523xVD
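
The two stages can be simulated in plain Python (a sketch, not mrjob code; the record fields `dno` and `salary` are assumed):

```python
from collections import defaultdict

# Stage 1: department-wise average salary
def stage1(records):
    # Map: emit (dno, salary); Reduce: average per dno
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        sums[rec["dno"]] += rec["salary"]
        counts[rec["dno"]] += 1
    return {dno: sums[dno] / counts[dno] for dno in sums}

# Stage 2: keep employees above their department's average
def stage2(records, avg_salary):
    # avg_salary (the Stage-1 output) is distributed to every mapper,
    # e.g. via the distributed cache; the reducer is an identity
    return [rec for rec in records if rec["salary"] > avg_salary[rec["dno"]]]

employees = [
    {"name": "A", "dno": 1, "salary": 100},
    {"name": "B", "dno": 1, "salary": 200},
    {"name": "C", "dno": 2, "salary": 300},
]
result = stage2(employees, stage1(employees))   # only "B" qualifies here
```

In a real cluster, Stage 1 writes its averages to HDFS and Stage 2 reads them back in its INIT step; the key point is that the small Stage-1 output must reach every Stage-2 mapper.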

Exercise #: find map-reduce solution for
• List employees who have a salary greater than 1.5 times their department’s average
salary.
• That is equivalent to the following:
select e.* from employee e
  join (select dno, avg(salary) as avg_sal
        from employee group by dno) d
    on e.dno = d.dno
where e.salary > 1.5 * d.avg_sal

Some more examples

• Top-N
• SORT
• JOIN
• ?

Sorting
• For sorting, we can simply use the “sort” functionality of the map-reduce
infrastructure.
• For this purpose, we choose the sorting attribute as the output key of the Map function.
• Suppose we want to sort data records on attribute A; then the value of A is the key, and
the corresponding record is the value.
• Data records will then be sorted (ascending) on the key!
• PS: However, it needs to be ensured that
– (1) the data type of the key defines the comparison operators ( <, <=, ==, >=, > )
– (2) the “partition (hash) function” is order-preserving: hash(K1) < hash(K2) whenever K1 < K2
• Also, to sort in descending order, the hash function should reverse this order!
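
An order-preserving partitioner of this kind can be sketched in Python (a simulation of the shuffle; `range_partition` and the split boundaries are hypothetical names, chosen so that reducer i receives a contiguous key range):

```python
def range_partition(key, boundaries):
    # Order-preserving partitioner: reducer i receives keys below
    # boundaries[i]; the last reducer gets everything else. This
    # satisfies condition (2): smaller keys map to smaller reducers.
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

def mr_sort(records, sort_attr, boundaries):
    # Simulated job: Map emits (record[sort_attr], record), the shuffle
    # routes by range_partition, and each reducer's input arrives sorted
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[range_partition(rec[sort_attr], boundaries)].append(rec)
    out = []
    for part in partitions:                      # reducers in partition order
        out.extend(sorted(part, key=lambda r: r[sort_attr]))
    return out                                   # concatenation is globally sorted
```

Concatenating the reducer outputs 0..R-1 then yields one globally sorted file; Hadoop's TotalOrderPartitioner plays this role in practice.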
Exercise #: Top N

• Suppose you have the following data file:
CustNo, OrderAmountSum
and want to compute the top N customers in terms of OrderAmountSum

Top N: MR Strategy

Multiple Mappers are getting Added!

Compute Top-N: the approach

Exercise #: Top N
• Every map task maintains a MIN-HEAP of size N, and finally produces it as output
• The min-heap is maintained as follows:
– Initially empty
– Every record is ADDED to the min-heap, with the “Sort-Key” as “Key” and the Data-Record
as “Value”
– If an addition makes the size N+1, remove the minimum (the root)
• Reducer task
– Needs to be a single reducer
– Merges all min-heaps and produces the final one

Exercise #: Top N
• At the mapper
– We maintain a “min-heap of size N”
– …
– At the end, the map task produces the top N of its own data!

Mapper “Top-N”
• INITialization
– N = 10 (say)
– Construct empty Min-Heap
• Map Function
– Parse record
=> SortKey, Record
– Push to the Heap
– Remove smallest if size > N
• Finally (Destructor)
– Output Top-N locally
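
This mapper can be sketched with Python's heapq (the (cust_no, amount) record layout follows the earlier exercise; the class and method names are illustrative):

```python
import heapq

class TopNMapper:
    def __init__(self, n):
        # INIT: fix N and construct an empty min-heap
        self.n = n
        self.heap = []           # min-heap of (order_amount, cust_no)

    def map(self, cust_no, amount):
        # Push every record; evict the current minimum once size hits N+1
        heapq.heappush(self.heap, (amount, cust_no))
        if len(self.heap) > self.n:
            heapq.heappop(self.heap)

    def close(self):
        # Finally (the "destructor"): emit the local top N under one
        # constant key so all mapper outputs shuffle to a single reducer
        return [("top", item) for item in self.heap]
```

In Hadoop terms, `__init__` corresponds to `setup()` and `close()` to `cleanup()` of the Mapper class.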

Reducer “Top-N”
• INITialization
– N = 10 (say)
– Construct empty Min-Heap
• Reduce Function
– Iterate through the records (“values”)
– Keep pushing records into the
MIN-HEAP while maintaining its
size at N
– Finally, output the top N
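
The reduce side can be sketched the same way (illustrative Python; `values` is the flattened stream of local top-N pairs from all mappers):

```python
import heapq

def top_n_reducer(values, n):
    # Merge all mappers' local top-N lists with one more size-N min-heap
    heap = []
    for amount, cust_no in values:
        heapq.heappush(heap, (amount, cust_no))
        if len(heap) > n:
            heapq.heappop(heap)
    # Output the final top N, largest first
    return sorted(heap, reverse=True)
```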

Top N: MR Strategy
• Compute the top N at each mapper, and
• shuffle the output of all mappers to a “single reducer”.
• To achieve this, all map outputs share the same key.
• A single reducer is fine here since N is normally small; say N = 10: for 1000
mappers, the reducer receives only 10,000 records in total; not very large for a single
reducer!
• The top-N records from all mappers are aggregated, and the final
top N is computed!

Computing Join using Map-Reduce
• Here we discuss the JOIN approaches described in the article "A comparison of join
algorithms for log processing in MapReduce" by Spyros Blanas et al. [2], presented at
SIGMOD 2010.
• Let us say there are two data files, L and R.
– Say the join condition is L.A = R.B,
– where A in L refers to B in R.
• The approach here is slightly modified from the original, with the assumption that R has a
distinct record per value of B (that is, A is a FK in L and B is a PK in R, which
is a typical situation in most joins).

A simple algorithm for MR Join [2]
• Called the “Standard Repartition Join” in [2]
• Here is a slightly modified version of the original algorithm, with the assumption that R,
in the join of L and R, has one record per distinct value of the join attribute, whereas L
may have multiple!

A simple algorithm for MR Join [2]

Example: Computing Join using Map-Reduce
• Let us attempt a JOIN of the following two files:
customers.csv (cid,country,state)
orders.csv (ordrno,cid,amount)

• That is:
SELECT * FROM “customers.csv” AS c JOIN
“orders.csv” as o ON c.cid=o.cid;

Map-Reduce Join

A simple algorithm for MR Join [2]
• Here, we use two mapper functions, one for each input file.
• Each mapper produces (Key, Value), where the key is the joining attribute (for both
mappers), and the value contains the projected attributes from the respective file
that are to be included in the join result.
• The outputs of both mappers are combined and shuffled to the reducers.
• The reducers then actually perform the join.
• Since the join happens on the reduce side, it is also called a “reduce-side join”.
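
This reduce-side join can be sketched in Python, using the customers/orders layout from the example (the function names and "C"/"O" tags are illustrative):

```python
def map_customers(line):
    # Map for customers.csv: key = cid, value tagged with the source file
    cid, country, state = line.split(",")
    return cid, ("C", (country, state))

def map_orders(line):
    # Map for orders.csv: key = cid, value tagged with the source file
    ordno, cid, amount = line.split(",")
    return cid, ("O", (ordno, amount))

def reduce_join(key, tagged_values):
    # Reduce: buffer both sides for this key, then emit the cross product
    buf = {"C": [], "O": []}
    for tag, val in tagged_values:
        buf[tag].append(val)
    return [(key, c, o) for c in buf["C"] for o in buf["O"]]
```

Note the reducer must buffer all values for a key before emitting anything, which is exactly the memory limitation the next slides discuss.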

Here is a pseudo implementation

“pseudo code” – reduce function

Issues with Standard Repartition Join
• (1) Requires two scans of the values in the reducer
• (2) Requires caching all values, and is hence limited by the reducer’s memory
• An improved version, called the “Improved Repartition Join”,
is described in [2].
• The strategy goes as follows:
– Append a “file tag” to the “join key”. The tag is designed so that
the record(s) of the right file (with the PK) precede all records of the left
file (with the FK) in the sort order.
– This added tag, however, should not be used for partitioning. We therefore need
a customized partition function that uses only the “join key” for
partitioning and excludes the file tag!

Improved Repartition Join

Improved Repartition Join from [2]

Customize Partition Function
• In some cases we need to customize the partition function;
typically, when partitioning based on the key is not enough, or when the key is composite
and user-defined.
• In this case, the output key of the Map function is the JOIN key plus a tag (composite), that is, CNO
and Tag, where Tag is C or O, indicating the source of the map output record!
• Here we want to “SORT” on the composite key but “partition” (shuffle)
on CNO only.
• Therefore we need to customize
the “Partition” by defining a
partition function on CNO alone.
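
A sketch of such a partition function, and of the single-pass reducer it enables (illustrative Python; composite keys are (cno, tag) tuples with tags "C" and "O", as on this slide):

```python
def partition(composite_key, num_reducers):
    # Partition on the join key (CNO) only, ignoring the file tag,
    # so (cno, "C") and (cno, "O") reach the same reducer
    cno, _tag = composite_key
    return hash(cno) % num_reducers

def reduce_improved_join(sorted_pairs):
    # Because the sort orders (cno, "C") before (cno, "O"), only the
    # single customer record needs caching, never all the orders
    out, current = [], None
    for (cno, tag), value in sorted_pairs:
        if tag == "C":
            current = (cno, value)        # remember the PK-side record
        elif current is not None and current[0] == cno:
            out.append((cno, current[1], value))
    return out
```

This removes both issues of the standard repartition join: one scan, O(1) buffering per key.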

Modified MR Join algorithm from [2]

“Broadcast MR Join” Algorithm from [2]
• Suppose R and L are to be joined; read R as the Reference table and L as the Log table.
• Normally, R is much smaller than L, i.e. |R| ≪ |L|.
• Broadcast join runs as a map-only job.
• In this approach, the smaller table R is broadcast to the mappers.
– Each mapper node loads R as a hash table (HR), alongside its split (its chunk of L).
– This is done in the MR INIT function.
• The Map function does the entire join job:
– It takes data records from L, one at a time.
– For each data record l in L, HR (the R hash table) is probed. If a match is found, say r, the join result
(l, r) is computed and output!
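
The map-only job can be sketched as follows (illustrative Python; R and L are given as lists of tuples whose first element is the join attribute):

```python
def broadcast_join(small_r, big_l_split):
    # INIT: build a hash table HR on the broadcast (smaller) table R
    hr = {b: rest for b, *rest in small_r}
    out = []
    # Map: probe HR for each record of the L split; no reduce phase needed
    for a, *rest in big_l_split:
        if a in hr:
            out.append((a, rest, hr[a]))
    return out
```

Since no shuffle or reduce phase runs, this avoids moving the large table L across the network; Hadoop's distributed cache is the usual broadcast mechanism.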

“Broadcast MR Join” Algorithm from [2]

Higher Level Abstractions
• Database users are used to SQL.
• For database operations like SELECT, PROJECT, and JOIN, programming in Map
Reduce is quite a pain!
– Complex API, too much programming, etc.
• Higher-level abstractions have been available from the early days of Map Reduce!
– Hive and Pig are popular tools and make life much simpler!

Higher Level Abstractions
• Hive: an SQL-like interface for HDFS files. Initially developed at Facebook (now an Apache
project).

• Pig: a scripting language for various data transformations. Initially developed at
Yahoo; now also an Apache project.

Source: https://rxin.github.io/talks/2017-12-05_cs145-stanford.pdf
Issues with “Map Reduce”
• There are issues with Map-Reduce when performing many “queries” and analytical tasks.
• Some of the issues listed here are from a survey article [3].
• Requirement of a FULL SCAN of the file
– Very inefficient when “low selectivity” queries (selecting only a few records) are to be executed
– We cannot terminate the scan early; conditional termination of file processing is not
possible.
• Lack of iteration: if we need to iterate over a dataset multiple times, then every
time we read the data from disk files; this happens to be the case with many analytical
and most machine-learning tasks.
• Lack of caching: multiple MR jobs often process the same data at almost the same time.
Caching could make them run up to 100x faster.

Issues with “Map Reduce”
• The system cannot “reuse” the results of previously executed queries/jobs.
• No quick retrieval of approximate results (for example, if we want to process only 10%
of the data from the file).
• Lack of interactive or real-time processing: Map-Reduce runs in the background, and
there is no interaction until it finishes the job.

Sources/References
[1] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters."
(2004)
[2] Blanas, Spyros, et al. "A Comparison of Join Algorithms for Log Processing in MapReduce." Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data. 2010.
[3] Doulkeridis, Christos, and Kjetil Nørvåg. "A Survey of Large-Scale Analytical Query Processing in
MapReduce." The VLDB Journal 23.3 (2014): 355-380.
