
MapReduce

Nabamita Majumder
A Given Problem
• Suppose you are the head of the West Bengal Census and you
have to calculate the population of West Bengal.

• You have to do this job within 4 months.

• So, how can you proceed?


A Given Problem
• Suppose next year the same task is assigned to you, but
now you have to do it within 2 months.

• So now, how can you proceed?


MapReduce

• So we can use the same model to do the census population
calculation within 1 month the following year; the only
difference is that to do the work within 1 month we have
to double the resources.

• The model here is called MapReduce.

• MapReduce is a programming model for distributed
computing. It is not a programming language but a
programming model which is used to process huge
datasets in a distributed environment.
Phases Involved in MapReduce

• Map Phase:- The phase in which each individual collects the population of
an assigned city, or part of a city, is called the Map Phase.
• Mappers:- Each individual person involved in the actual counting is called a
Mapper.
• Input Splits:- A city, or part of a city, is an Input Split.
• Key-Value Pairs:- The output from each mapper is a Key-Value Pair.
• Reduce Phase:- The phase that aggregates the intermediate results from
each city's mappers at headquarters is called the Reduce Phase.
• Reducer:- Each individual working at headquarters is called a Reducer, because
they reduce or consolidate the output from many different mappers.
• Result:- Each reducer produces a result set.
• Shuffle Phase:- The phase in which the values from the different
mappers are copied or transferred to the reducers is known as the Shuffle Phase.
It comes between the Map Phase and the Reduce Phase.
• The Map Phase, Shuffle Phase and Reduce Phase are the 3 phases of
MapReduce (a minimal sketch of the three phases follows this list).
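To make the census analogy concrete, below is a minimal pure-Python sketch of the
three phases. It is illustrative only and not tied to any framework: the function names,
the driver, and the idea of representing a split as a list of residents are assumptions
made for this example.

from collections import defaultdict

# Map Phase: each mapper (an enumerator) counts the people in its
# input split (a city, or part of a city) and emits a key-value pair.
def mapper(city, residents):
    yield city, len(residents)

# Shuffle Phase: values from the different mappers are grouped by key
# and transferred to the reducers.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for city, count in mapped_pairs:
        grouped[city].append(count)
    return grouped.items()

# Reduce Phase: each reducer (a worker at headquarters) consolidates
# the counts coming from many different mappers.
def reducer(city, counts):
    yield city, sum(counts)

# Tiny driver: two splits of the same city handled by two mappers.
splits = [("Kolkata", ["p1", "p2", "p3"]), ("Kolkata", ["p4", "p5"])]
mapped = [pair for city, people in splits for pair in mapper(city, people)]
for city, counts in shuffle(mapped):
    for result in reducer(city, counts):
        print(result)   # ('Kolkata', 5)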
What is MapReduce?
Sample Big Data Problem
Max Closing Price Algorithm

• There is no parallelism.
• If you have a huge dataset, you get extremely long
computation times (see the sequential sketch below).
• So how can you solve this problem?
• The answer is to use MapReduce.
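For contrast with MapReduce, a sequential version of the algorithm might look like the
sketch below. The file name stocks.csv and the symbol,date,closing_price record
layout are assumptions made for this illustration.

import csv

# Sequential scan: a single process reads every record one by one, so
# the running time grows with the size of the dataset -- no parallelism.
max_close = {}
with open("stocks.csv") as f:                  # hypothetical input file
    for symbol, date, close in csv.reader(f):  # assumed column layout
        price = float(close)
        if symbol not in max_close or price > max_close[symbol]:
            max_close[symbol] = price

for symbol, price in max_close.items():
    print(symbol, price)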
Sample Big Data Problem in Distributed Environment
Block vs Input Split
MapReduce Phase

 Map Phase:- First divide the dataset into chunks, with a
separate process working on each chunk of data. The
chunks are called Input Splits.
 Mapper:- The process working on a chunk is called a mapper.
Each mapper processes one record at a time and
executes the same set of code on every single record. The output
of the mapper is a key-value pair.
 Input Splits are not the same as blocks. A block is the hard
division of data at the block size. If the block size in a cluster is
128 MB, each block of the dataset will be 128 MB, except
for the last block, which can be smaller than the block size if
the file size is not exactly divisible by the block size.
 Since a block is a hard cut, a block can end even before a
record ends.
MapReduce Phase Contd....

 An input split is not a physical chunk of data. The mapper will
read the data, and it must know where to start and where to
end.
 An input split records the logical record boundary.
 During MapReduce job execution, Hadoop scans
through the blocks and creates input splits that follow
record boundaries.
 Mappers in Hadoop can be written in many different
programming languages like C++, Python, Java etc.
 The number of mappers is equal to the number of input splits.
 The output of the mapper is a key-value pair. In our example, the
stock name is the key and the closing price is the value (a mapper
sketch follows this list).
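A hedged sketch of such a mapper, written with mrjob (the library used later in this
document). The class name MRMaxClosingPrice and the symbol,date,closing_price
record layout are assumptions made for this illustration.

from mrjob.job import MRJob

class MRMaxClosingPrice(MRJob):
    # Called once per record (one line of the input split); emits one
    # key-value pair per record: key = stock name, value = closing price.
    def mapper(self, _, line):
        symbol, date, close = line.split(",")   # assumed record layout
        yield symbol, float(close)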
Map Phase
Reduce Phase
 How do you decide what should be the key and what should
be the value? --- The Reduce Phase will give you the answer.
 Reducers work on the output of the mappers. The outputs of the
individual mappers are grouped by key, in our case the
stock symbol, and passed to the reducers.
 A reducer receives a key and the list of values for that key as
each input.
 If there are 10 stocks and 100 records for each stock, then the
total number of records = 10 * 100 = 1000 records. So we will have
1000 key-value pairs from all the mappers.
 The reducer, however, receives only 10 records to process: 1
record per symbol, as we have information on 10 stocks.
 For each record, the reducer gets a symbol as the key
and a list of closing prices for that key.
Reduce Phase Contd...
 So the reducer reduces the list to calculate the maximum
closing price of each stock and outputs the results.
 What needs to be reduced is the Value.
 The number of reducers can be set by the user.
 Without reducers, the output of the job is simply the output of
all the mappers.
 But it is advisable to have more than 1 reducer.
 So, the outputs of the individual mappers are grouped by
symbol and reach the reducers (a reducer sketch follows this list).
 The magic happens in the Shuffle Phase.
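Continuing the hypothetical MRMaxClosingPrice sketch from the Map Phase section,
the matching reducer receives one symbol together with all of its closing prices and
reduces that list with max():

    # Runs once per key: receives a symbol and an iterator over all of
    # its closing prices, and reduces the list to its maximum value.
    def reducer(self, symbol, closing_prices):
        yield symbol, max(closing_prices)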
Shuffle Phase
 The Shuffle Phase is a phase that also has key
components.
 The process in which the mapper output is transferred to
the reducers is known as the Shuffle Phase.
 All key-value pairs for one stock have to go to one reducer.
 In the Map Phase, each key is assigned to a partition. So if
we have 3 reducers, we have 3 partitions.
 Each key is assigned to a partition by a class called the
Partitioner.
 If the Partitioner decides that a key-value pair of stock "xyz"
should go to partition 1, then all key-value pairs of that stock
will go to partition 1. Each partition is assigned to a reducer. For
example, partition 1 goes to reducer 1, partition 2 goes to
reducer 2, and so on.
Shuffle Phase Contd...
 This partitioning happens across all the mappers in the Map
Phase (a partitioning sketch follows this list).
 Key-value pairs within a partition are sorted by key.
 Once the keys are sorted, each partition is ready to be copied
to the appropriate reducer. This is known as the Copy
Phase.
 Data in a partition can come from many mappers.
 Each mapper processes all the records in its assigned
input splits and outputs a key-value pair for each record.
 At each reducer, the key-value pairs coming from different
mappers are merged in sorted order.
 In the example here, reducer 1 runs 3 times, once for each of its
symbols, and reducer 2 runs 2 times, once for each of its symbols.
That is the end-to-end process of MapReduce.
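The partition assignment itself is simple to sketch. The function below only mimics
the idea behind Hadoop's default hash partitioner; it is not an actual Hadoop or
mrjob API.

def partition(key, num_reducers):
    # Every key-value pair with the same key maps to the same partition,
    # so all records of one stock end up at one reducer. (Within a single
    # Python run, hash(key) is stable; Hadoop uses the key's hashCode.)
    return hash(key) % num_reducers

# With 3 reducers there are 3 partitions; "xyz" always lands in the
# same one during a run.
print(partition("xyz", 3))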
Shuffle Phase with Multiple Reducers
Combiner (Optional)
 A combiner can be used to reduce the data before it is sent to the
reducers.
 A combiner is like a mini reducer that runs at the end of the map
phase.
 It helps to reduce the load on the reducers, thereby
increasing performance.
 It is optional (a combiner sketch follows this list).
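For the max closing price example, the combiner can be the very same function as
the reducer, because max is associative and commutative: taking the max of local
maxima gives the global maximum. A hedged sketch, extending the hypothetical
MRMaxClosingPrice job from earlier:

    # Mini reducer: runs on each mapper's local output before the shuffle,
    # so far fewer key-value pairs have to travel across the network.
    def combiner(self, symbol, closing_prices):
        yield symbol, max(closing_prices)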
Find the word frequency from a text file using a MAP-REDUCE
program in Python.
Content of WordFrequency.py file:-
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        words = line.split()
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Content of f1.txt file:-
Hello mckvie
hello
Execution of WordFrequency.py from Anaconda Prompt or Anaconda Powershell:-
(base) C:\Users\yy>python WordFrequency.py f1.txt
Output:-
hello 2
mckvie 1
Explanation of Previous Program:
Content of WordFrequency.py file:-
from mrjob.job import MRJob
/* This line imports the MRJob class from the mrjob.job module. The MRJob class is
the base class that you'll extend to create MapReduce jobs in Python. */
class MRWordFrequencyCount(MRJob):
/* Here, a new class MRWordFrequencyCount is being defined, which inherits from
MRJob. This class will contain the logic for the Map and Reduce steps in the
MapReduce process. */
    def mapper(self, _, line):
        words = line.split()
        for word in words:
            yield word.lower(), 1
/* The mapper function processes each line of the input dataset: the line is split into
a list of words using the split() function. Each word is then converted to lowercase
using word.lower() (to make the count case-insensitive). The yield statement emits a
tuple (word, 1) for each word. This means the mapper is emitting each word along
with the number 1, representing one occurrence of that word in the line. */
Explanation of Previous Program Contd….
    def reducer(self, key, values):
        yield key, sum(values)
/* The reducer function processes the output of the mapper: the key is a word, and
the values are all the 1s associated with that word from the different mapper
outputs. sum(values) computes the total occurrences of that word, as all mappers
emit a 1 for each occurrence of a word. The yield statement emits a tuple
(key, total_count), where key is the word and total_count is the sum of occurrences. */
if __name__ == '__main__':
    MRWordFrequencyCount.run()
/* This is the entry point of the script. When the script is run directly, the
MRWordFrequencyCount.run() method is called. This starts the MapReduce job,
where the mapper function is applied to each line of the input data, and the reducer
function aggregates the results by summing up the occurrences of each word. */

Here the Python script implements a MapReduce job using the mrjob library, which
is a Python framework for writing MapReduce jobs that can run on Apache Hadoop
or locally.
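As a usage note: when invoked as shown above, mrjob runs the whole job in a single
local Python process, which is convenient for testing. The -r flag selects a different
runner; for example (the HDFS path here is hypothetical):

(base) C:\Users\yy>python WordFrequency.py -r hadoop hdfs:///user/yy/f1.txt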
What This Script Does:
This script is a word frequency counter that takes an input dataset (in the form of lines
of text), splits each line into words, and counts how many times each word appears
in total across the entire dataset.
Mapper: Processes the input line by line and yields each word along with the value 1.
Reducer: Takes the list of values for each word (all 1s) and sums them up to get the
total frequency of the word.
Meaning of def mapper(self, _, line):
1. self: This refers to the instance of the class MRWordFrequencyCount that is calling
the mapper method. It allows the method to access attributes and other methods
defined in the class. In object-oriented programming, self is a reference to the
current instance of the class, so it's a standard part of instance methods in Python.
2. _: The underscore (_) is a convention in Python indicating that the argument is
unused or irrelevant. In this case, the mapper function takes two arguments: a key
and a value, but the key is not being used in the logic. Typically, the key could
represent the position of the line (like the line number), but in this example, the key
is not needed, so it’s replaced with an underscore to indicate it's not being used.
Using _ is just a way to tell other developers that the value is intentionally ignored.
3. line: line represents the actual input data that the mapper processes. In the context
of this script, each line of text from the input dataset is passed into the mapper
method one at a time. The mapper function will then process this line by splitting it
into words and yielding word counts.
In summary:
The mapper method takes two arguments: _ (which is ignored and doesn't serve a
purpose in this function) and line (the actual line of text from the input). The method
then processes the line, splits it into words, and yields a pair consisting of each
word and the number 1, which the reducer later uses to count the occurrences of
each word.
Meaning of def reducer(self, key, values):
1. self: This refers to the instance of the class (MRWordFrequencyCount) that is calling
the reducer method. It’s a reference to the current object, allowing the method to
access instance attributes and other methods in the class. This is typical for instance
methods in Python.

2. key: In MapReduce, the key is the output from the mapper phase. It represents the
item you want to aggregate over. In this script, the key will be a word. For example,
if the input data contains the word "hello", then the key will be "hello". The key is
passed from the mapper to the reducer, and the reducer will operate on the values
associated with this key.

3. values: The values are a list (or iterator) of all the values that have been emitted by
the mapper for a given key. In this script, each value is 1, which is the result emitted
by the mapper for each occurrence of a word. For example, if the word "hello"
appeared 3 times in the input data, the values list would be [1, 1, 1] for the key
"hello".
The reducer will sum up all these 1s to get the total frequency of the word.
Meaning of def reducer(self, key, values) Contd…:
Example:
If the input to the mapper included the word "hello" three times, the mapper would
emit something like: ("hello", 1) ("hello", 1) ("hello", 1)
The reducer then receives: key = "hello", values = [1, 1, 1]
The reducer will compute: sum([1, 1, 1]) = 3
And the output from the reducer will be: hello 3
In summary:
key is the word, which was generated by the mapper.
values is the list of 1s associated with that word, representing its occurrences.
The reducer sums up the 1s for each word & yields the word along with its total count.
Thank You
