
“Map Reduce” Computing Paradigm

pm jat @ daiict
Make sure that you have understood!
• Map and Reduce functions
– Input
– Output
• Partition (shuffle) and the partition function (default and user-defined)
• Sort
• We can, and sometimes need to, define
– a customized “Partition Function”
– a customized “Compare” (sort) function for the key’s data type

11-Aug-25 map-reduce computing 2


MR data flow [2]

Figure Source: [2]

Refinements

• The following MR refinements are presented in the article:


– Combiner Function
– Partitioning Function
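
The combiner refinement can be illustrated with the classic word-count job from [1]. This Python sketch (the `map_words`/`combine` helpers are illustrative names, not a real MR API) shows the local pre-aggregation a combiner performs before the shuffle:

```python
from collections import Counter

def map_words(line):
    # Map: emit (word, 1) for every word in the line
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate the map output locally, so the shuffle
    # moves one (word, count) pair per distinct word instead of one
    # pair per occurrence
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())
```

The reducer then runs the same summation over the combined pairs; a combiner must be associative and commutative for this to be safe.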

Partitioning Function
• The default partitioning function is “hash(key) mod R”
– It produces fairly well-balanced partitions
• In some cases, this may not be enough
– For example, when the URL is the key and we want all URLs of one host to land on the
same reducer
– For this, we may use a partitioning function such as
“hash(Hostname(urlkey)) mod R”
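
As a sketch (plain Python functions, not the actual Hadoop Partitioner API), the two partitioners might look like:

```python
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    # Default: hash the whole key, mod R
    return hash(key) % num_reducers

def host_partition(url_key, num_reducers):
    # Custom: hash only the hostname, so all URLs of one host
    # land on the same reducer
    host = urlparse(url_key).netloc
    return hash(host) % num_reducers
```

With `host_partition`, "http://example.com/page1" and "http://example.com/page2" are guaranteed the same reducer; with `default_partition` they usually are not.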

Multi Stage “Map Reduce Job”?
• Two-step solution
• Stage 1: Compute department-wise average
– Map Output: dno, salary
– Reducer Output: dno, avg_salary
• Stage 2:
– Map output: recno, record that meets the criterion
“e.salary > avg_salary[e.dno]”
(Where does avg_salary come from? It is the Stage-1 output, made available to every mapper.)
– Reducer:
• Does nothing (identity)
• An example of multi-stage MR is available here:
https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
• Code: https://colab.research.google.com/drive/1d_oakD_VHu-V7TV6MY1t8w6xy9523xVD
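
The two stages can be simulated in plain Python (a sketch, not mrjob code; the record fields `dno` and `salary` are assumed):

```python
from collections import defaultdict

# Stage 1: department-wise average salary
def stage1(records):
    # Map: emit (dno, salary); Reduce: average per dno
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        sums[rec["dno"]] += rec["salary"]
        counts[rec["dno"]] += 1
    return {dno: sums[dno] / counts[dno] for dno in sums}

# Stage 2: keep employees above their department's average
def stage2(records, avg_salary):
    # avg_salary (the Stage-1 output) is distributed to every mapper,
    # e.g. via the distributed cache; the reducer is an identity
    return [rec for rec in records if rec["salary"] > avg_salary[rec["dno"]]]

employees = [
    {"name": "A", "dno": 1, "salary": 100},
    {"name": "B", "dno": 1, "salary": 200},
    {"name": "C", "dno": 2, "salary": 300},
]
result = stage2(employees, stage1(employees))   # only "B" qualifies here
```

In a real cluster, Stage 1 writes its averages to HDFS and Stage 2 reads them back in its INIT step; the key point is that the small Stage-1 output must reach every Stage-2 mapper.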

Exercise #: find map-reduce solution for
• List employees who have a salary greater than 1.5 times their department’s average
salary.
• That is equivalent to the following:
select e.* from employee e
  join (select dno, avg(salary) as avg_sal
        from employee group by dno) d
    on e.dno = d.dno
where e.salary > 1.5 * d.avg_sal

Some more examples

• Top-N
• SORT
• JOIN
• ?

Sorting
• For sorting, we can simply use the “sort” functionality of the map-reduce
infrastructure.
• For this purpose, we choose the sorting attribute as the output key of the Map function.
• Suppose we want to sort data records on attribute A; then the value of A is the key, and
the corresponding record is the value.
• Data records will then be sorted (ascending) on the key!
• PS: However, it needs to be ensured that
– (1) the data type of the key defines the comparison operators ( <, <=, ==, >=, > )
– (2) the “partition (hash) function” is order-preserving: hash(K1) < hash(K2) whenever K1 < K2
• Also, to sort in descending order, the hash function should reverse this order!
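
An order-preserving partitioner of this kind can be sketched in Python (a simulation of the shuffle; `range_partition` and the split boundaries are hypothetical names, chosen so that reducer i receives a contiguous key range):

```python
def range_partition(key, boundaries):
    # Order-preserving partitioner: reducer i receives keys below
    # boundaries[i]; the last reducer gets everything else. This
    # satisfies condition (2): smaller keys map to smaller reducers.
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

def mr_sort(records, sort_attr, boundaries):
    # Simulated job: Map emits (record[sort_attr], record), the shuffle
    # routes by range_partition, and each reducer's input arrives sorted
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[range_partition(rec[sort_attr], boundaries)].append(rec)
    out = []
    for part in partitions:                      # reducers in partition order
        out.extend(sorted(part, key=lambda r: r[sort_attr]))
    return out                                   # concatenation is globally sorted
```

Concatenating the reducer outputs 0..R-1 then yields one globally sorted file; Hadoop's TotalOrderPartitioner plays this role in practice.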
Exercise #: Top N

• Suppose you have the following data file:
CustNo, OrderAmountSum
and want to compute the top N customers in terms of OrderAmountSum

Top N: MR Strategy

Multiple Mappers are getting Added!

Compute Top-N: the approach

Exercise #: Top N
• Every map task maintains a MIN-HEAP of size N, and finally produces it as output
• The min-heap is maintained as follows:
– Initially empty
– Every record is ADDED to the min-heap, with the “Sort-Key” as “Key” and the Data-Record
as “Value”
– If an addition makes the size N+1, remove the minimum (the root)
• Reducer task
– Needs to be a single reducer
– Merges all min-heaps and produces the final one

Exercise #: Top N
• At the mapper
– We maintain a “min-heap of size N”
– …
– At the end, the map task produces the top N of its own data!

Mapper “Top-N”
• INITialization
– N = 10 (say)
– Construct empty Min-Heap
• Map Function
– Parse record
=> SortKey, Record
– Push to the Heap
– Remove smallest if size > N
• Finally (Destructor)
– Output Top-N locally
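
This mapper can be sketched with Python's heapq (the (cust_no, amount) record layout follows the earlier exercise; the class and method names are illustrative):

```python
import heapq

class TopNMapper:
    def __init__(self, n):
        # INIT: fix N and construct an empty min-heap
        self.n = n
        self.heap = []           # min-heap of (order_amount, cust_no)

    def map(self, cust_no, amount):
        # Push every record; evict the current minimum once size hits N+1
        heapq.heappush(self.heap, (amount, cust_no))
        if len(self.heap) > self.n:
            heapq.heappop(self.heap)

    def close(self):
        # Finally (the "destructor"): emit the local top N under one
        # constant key so all mapper outputs shuffle to a single reducer
        return [("top", item) for item in self.heap]
```

In Hadoop terms, `__init__` corresponds to `setup()` and `close()` to `cleanup()` of the Mapper class.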

Reducer “Top-N”
• INITialization
– N = 10 (say)
– Construct empty Min-Heap
• Reduce Function
– Iterate through the records (“values”)
– Keep pushing records into the
MIN-HEAP while maintaining its
size at N
– Finally, output the top N
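
The reduce side can be sketched the same way (illustrative Python; `values` is the flattened stream of local top-N pairs from all mappers):

```python
import heapq

def top_n_reducer(values, n):
    # Merge all mappers' local top-N lists with one more size-N min-heap
    heap = []
    for amount, cust_no in values:
        heapq.heappush(heap, (amount, cust_no))
        if len(heap) > n:
            heapq.heappop(heap)
    # Output the final top N, largest first
    return sorted(heap, reverse=True)
```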

Top N: MR Strategy
• Compute the top N at each mapper, and
• shuffle the output of all mappers to a “single reducer”.
• To achieve this, all map outputs share the same key.
• A single reducer is fine here since N is normally small; say N = 10: for 1000
mappers, the reducer receives only 10,000 records in total; not very large for a single
reducer!
• The top-N records from all mappers are aggregated, and the final
top N is computed!

Computing Join using Map-Reduce
• Here we discuss the JOIN approaches described in the article "A comparison of join
algorithms for log processing in MapReduce" by Spyros Blanas et al. [2], presented at
SIGMOD 2010.
• Let us say there are two data files, L and R.
– Say the join condition is L.A = R.B,
– where A in L refers to B in R.
• The approach here is slightly modified from the original, with the assumption that R has a
distinct record per value of B (that is, A is a FK in L and B is a PK in R, which
is a typical situation in most joins).

A simple algorithm for MR Join [2]
• Called the “Standard Repartition Join” in [2]
• Here is a slightly modified version of the original algorithm, with the assumption that R,
in the join of L and R, has one record per distinct value of the join attribute, whereas L
may have multiple!

A simple algorithm for MR Join [2]

Example: Computing Join using Map-Reduce
• Let us attempt a JOIN of the following two files:
customers.csv (cid,country,state)
orders.csv (ordrno,cid,amount)

• That is:
SELECT * FROM “customers.csv” AS c JOIN
“orders.csv” as o ON c.cid=o.cid;

Map-Reduce Join

A simple algorithm for MR Join [2]
• Here, we use two mapper functions, one for each input file.
• Each mapper produces (Key, Value), where the key is the joining attribute (for both
mappers), and the value contains the projected attributes from the respective file
that are to be included in the join result.
• The outputs of both mappers are combined and shuffled to the reducers.
• The reducers then actually perform the join.
• Since the join happens on the reduce side, it is also called a “reduce-side join”.
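
This reduce-side join can be sketched in Python, using the customers/orders layout from the example (the function names and "C"/"O" tags are illustrative):

```python
def map_customers(line):
    # Map for customers.csv: key = cid, value tagged with the source file
    cid, country, state = line.split(",")
    return cid, ("C", (country, state))

def map_orders(line):
    # Map for orders.csv: key = cid, value tagged with the source file
    ordno, cid, amount = line.split(",")
    return cid, ("O", (ordno, amount))

def reduce_join(key, tagged_values):
    # Reduce: buffer both sides for this key, then emit the cross product
    buf = {"C": [], "O": []}
    for tag, val in tagged_values:
        buf[tag].append(val)
    return [(key, c, o) for c in buf["C"] for o in buf["O"]]
```

Note the reducer must buffer all values for a key before emitting anything, which is exactly the memory limitation the next slides discuss.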

Here is a pseudo implementation

“pseudo code” – reduce function

Issues with Standard Repartition Join
• (1) Requires two scans of the values in the reducer
• (2) Requires caching all values, and is hence limited by the reducer’s memory
• An improved version, called the “Improved Repartition Join”,
is described in [2].
• The strategy goes as follows:
– Append a “file tag” to the “join key”. The tag is designed so that
the record(s) of the right file (with the PK) precede all records of the left
file (with the FK) in the sort order.
– This added tag, however, should not be used for partitioning. We therefore need
a customized partition function that uses only the “join key” for
partitioning and excludes the file tag!

Improved Repartition Join

Improved Repartition Join from [2]

Customize Partition Function
• In some cases we need to customize the partition function;
typically, when partitioning based on the key is not enough, or when the key is composite
and user-defined.
• In this case, the output key of the Map function is the JOIN key plus a tag (composite), that is, CNO
and Tag, where Tag is C or O, indicating the source of the map output record!
• Here we want to “SORT” on the composite key but “partition” (shuffle)
on CNO only.
• Therefore we need to customize
the “Partition” by defining a
partition function on CNO alone.
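
A sketch of such a partition function, and of the single-pass reducer it enables (illustrative Python; composite keys are (cno, tag) tuples with tags "C" and "O", as on this slide):

```python
def partition(composite_key, num_reducers):
    # Partition on the join key (CNO) only, ignoring the file tag,
    # so (cno, "C") and (cno, "O") reach the same reducer
    cno, _tag = composite_key
    return hash(cno) % num_reducers

def reduce_improved_join(sorted_pairs):
    # Because the sort orders (cno, "C") before (cno, "O"), only the
    # single customer record needs caching, never all the orders
    out, current = [], None
    for (cno, tag), value in sorted_pairs:
        if tag == "C":
            current = (cno, value)        # remember the PK-side record
        elif current is not None and current[0] == cno:
            out.append((cno, current[1], value))
    return out
```

This removes both issues of the standard repartition join: one scan, O(1) buffering per key.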

Modified MR Join algorithm from [2]

“Broadcast MR Join” Algorithm from [2]
• Suppose R and L are to be joined; read R as the Reference table and L as the Log table.
• Normally, R is much smaller than L, i.e. |R| ≪ |L|.
• Broadcast join runs as a map-only job.
• In this approach, the smaller table R is broadcast to the mappers.
– Each mapper node loads R as a hash table (HR), alongside its split (its chunk of L).
– This is done in the MR INIT function.
• The Map function does the entire join job:
– It takes data records from L, one at a time.
– For each data record l in L, HR (the R hash table) is probed. If a match is found, say r, the join result
(l, r) is computed and output!
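
The map-only job can be sketched as follows (illustrative Python; R and L are given as lists of tuples whose first element is the join attribute):

```python
def broadcast_join(small_r, big_l_split):
    # INIT: build a hash table HR on the broadcast (smaller) table R
    hr = {b: rest for b, *rest in small_r}
    out = []
    # Map: probe HR for each record of the L split; no reduce phase needed
    for a, *rest in big_l_split:
        if a in hr:
            out.append((a, rest, hr[a]))
    return out
```

Since no shuffle or reduce phase runs, this avoids moving the large table L across the network; Hadoop's distributed cache is the usual broadcast mechanism.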

“Broadcast MR Join” Algorithm from [2]

Higher Level Abstractions
• Database users are used to SQL.
• For database operations like SELECT, PROJECT, and JOIN, programming in Map
Reduce is quite a pain!
– Complex API, too much programming, etc.
• Higher-level abstractions have been available from the early days of Map Reduce!
– Hive and Pig are popular tools and make life much simpler!

Higher Level Abstractions
• Hive: an SQL-like interface for HDFS files. Initially developed at Facebook (now an Apache
project).

• Pig: a scripting language for various data transformations. Initially developed at
Yahoo; now also an Apache project.

Source: https://rxin.github.io/talks/2017-12-05_cs145-stanford.pdf
Issues with “Map Reduce”
• There are issues with Map-Reduce when performing many “queries” and analytical tasks.
• Some of the issues listed here are from a survey article [3].
• Requirement of a FULL SCAN of the file
– Very inefficient when “low selectivity” queries (selecting only a few records) are to be executed
– We cannot terminate the scan early; conditional termination of file processing is not
possible.
• Lack of iteration: if we need to iterate over a dataset multiple times, then every
time we read the data from disk files; this happens to be the case with many analytical
and most machine-learning tasks.
• Lack of caching: multiple MR jobs often process the same data at almost the same time.
Caching could make them run up to 100x faster.

Issues with “Map Reduce”
• The system cannot “reuse” the results of previously executed queries/jobs.
• No quick retrieval of approximate results (for example, if we want to process only 10%
of the data from the file).
• Lack of interactive or real-time processing: Map-Reduce runs in the background, and
there is no interaction until it finishes the job.

Sources/References
[1] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters."
(2004)
[2] Blanas, Spyros, et al. "A Comparison of Join Algorithms for Log Processing in MapReduce." Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data. 2010.
[3] Doulkeridis, Christos, and Kjetil Nørvåg. "A Survey of Large-Scale Analytical Query Processing in
MapReduce." The VLDB Journal 23.3 (2014): 355-380.
