
MapReduce

Nabamita Majumder
A Given Problem
• Suppose you are the head of the West Bengal Census and you
have to calculate the population of West Bengal.

• You have to do this job within 4 months.

• So, how can you proceed?


A Given Problem
• Suppose next year the same task is assigned to you, but
now you have to do it within 2 months.

• So now, how can you proceed?


MapReduce

• So we can use the same model to do the census population
calculation within 1 month the following year; the only
difference is that to do the work within 1 month we have
to double the resources.

• The model here is called MapReduce.

• MapReduce is a programming model for distributed
computing. It is not a programming language but a
programming model which is used to process huge
datasets in a distributed environment.
Phases Involved in MapReduce

• Map Phase:- The phase in which each individual collects the population of
an assigned city, or part of a city, is called the Map Phase.
• Mappers:- Each individual person involved in the actual counting is called a
Mapper.
• Input Splits:- A city, or part of a city, is an Input Split.
• Key-Value Pairs:- The output from each mapper is a Key-Value Pair.
• Reduce Phase:- The phase that aggregates the intermediate results from
each city's mappers at headquarters is called the Reduce Phase.
• Reducer:- Each individual working at headquarters is called a Reducer, because
they reduce or consolidate the output from many different mappers.
• Result:- Each reducer produces a result set.
• Shuffle Phase:- The phase in which the values from the different
mappers are copied or transferred to the reducers is known as the Shuffle Phase.
It comes between the Map Phase and the Reduce Phase.
• The Map Phase, Shuffle Phase and Reduce Phase are the 3 phases of
MapReduce (a minimal sketch of the three phases follows this list).
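To make the census analogy concrete, below is a minimal pure-Python sketch of the
three phases. It is illustrative only and not tied to any framework: the function names,
the driver, and the idea of representing a split as a list of residents are assumptions
made for this example.

from collections import defaultdict

# Map Phase: each mapper (an enumerator) counts the people in its
# input split (a city, or part of a city) and emits a key-value pair.
def mapper(city, residents):
    yield city, len(residents)

# Shuffle Phase: values from the different mappers are grouped by key
# and transferred to the reducers.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for city, count in mapped_pairs:
        grouped[city].append(count)
    return grouped.items()

# Reduce Phase: each reducer (a worker at headquarters) consolidates
# the counts coming from many different mappers.
def reducer(city, counts):
    yield city, sum(counts)

# Tiny driver: two splits of the same city handled by two mappers.
splits = [("Kolkata", ["p1", "p2", "p3"]), ("Kolkata", ["p4", "p5"])]
mapped = [pair for city, people in splits for pair in mapper(city, people)]
for city, counts in shuffle(mapped):
    for result in reducer(city, counts):
        print(result)   # ('Kolkata', 5)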
What is MapReduce?
Sample Big Data Problem
Max Closing Price Algorithm

• There is no parallelism.
• If you have a huge dataset, you get extremely long
computation times (see the sequential sketch below).
• So how can you solve this problem?
• The answer is to use MapReduce.
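For contrast with MapReduce, a sequential version of the algorithm might look like the
sketch below. The file name stocks.csv and the symbol,date,closing_price record
layout are assumptions made for this illustration.

import csv

# Sequential scan: a single process reads every record one by one, so
# the running time grows with the size of the dataset -- no parallelism.
max_close = {}
with open("stocks.csv") as f:                  # hypothetical input file
    for symbol, date, close in csv.reader(f):  # assumed column layout
        price = float(close)
        if symbol not in max_close or price > max_close[symbol]:
            max_close[symbol] = price

for symbol, price in max_close.items():
    print(symbol, price)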
Sample Big Data Problem in Distributed Environment
Block vs Input Split
MapReduce Phase

 Map Phase:- First divide the dataset into chunks, with a
separate process working on each chunk of data. The
chunks are called Input Splits.
 Mapper:- The process working on a chunk is called a mapper.
Each mapper processes one record at a time and
executes the same set of code on every single record. The output
of the mapper is a key-value pair.
 Input Splits are not the same as blocks. A block is the hard
division of data at the block size. If the block size in a cluster is
128 MB, each block of the dataset will be 128 MB, except
for the last block, which can be smaller than the block size if
the file size is not exactly divisible by the block size.
 Since a block is a hard cut, a block can end even before a
record ends.
MapReduce Phase Contd....

 An input split is not a physical chunk of data. The mapper will
read the data, and it must know where to start and where to
end.
 An input split records the logical record boundary.
 During MapReduce job execution, Hadoop scans
through the blocks and creates input splits that follow
record boundaries.
 Mappers in Hadoop can be written in many different
programming languages like C++, Python, Java etc.
 The number of mappers is equal to the number of input splits.
 The output of the mapper is a key-value pair. In our example, the
stock name is the key and the closing price is the value (a mapper
sketch follows this list).
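A hedged sketch of such a mapper, written with mrjob (the library used later in this
document). The class name MRMaxClosingPrice and the symbol,date,closing_price
record layout are assumptions made for this illustration.

from mrjob.job import MRJob

class MRMaxClosingPrice(MRJob):
    # Called once per record (one line of the input split); emits one
    # key-value pair per record: key = stock name, value = closing price.
    def mapper(self, _, line):
        symbol, date, close = line.split(",")   # assumed record layout
        yield symbol, float(close)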
Map Phase
Reduce Phase
 How do you decide what should be the key and what should
be the value? --- The Reduce Phase will give you the answer.
 Reducers work on the output of the mappers. The outputs of the
individual mappers are grouped by key, in our case the
stock symbol, and passed to the reducers.
 A reducer receives a key and the list of values for that key as
each input.
 If there are 10 stocks and 100 records for each stock, then the
total number of records = 10 * 100 = 1000 records. So we will have
1000 key-value pairs from all the mappers.
 The reducer, however, receives only 10 records to process: 1
record per symbol, as we have information on 10 stocks.
 For each record, the reducer gets a symbol as the key
and a list of closing prices for that key.
Reduce Phase Contd...
 So the reducer reduces the list to calculate the maximum
closing price of each stock and outputs the results.
 What needs to be reduced is the Value.
 The number of reducers can be set by the user.
 Without reducers, the output of the job is simply the output of
all the mappers.
 But it is advisable to have more than 1 reducer.
 So, the outputs of the individual mappers are grouped by
symbol and reach the reducers (a reducer sketch follows this list).
 The magic happens in the Shuffle Phase.
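Continuing the hypothetical MRMaxClosingPrice sketch from the Map Phase section,
the matching reducer receives one symbol together with all of its closing prices and
reduces that list with max():

    # Runs once per key: receives a symbol and an iterator over all of
    # its closing prices, and reduces the list to its maximum value.
    def reducer(self, symbol, closing_prices):
        yield symbol, max(closing_prices)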
Shuffle Phase
 The Shuffle Phase is a phase that also has key
components.
 The process in which the mapper output is transferred to
the reducers is known as the Shuffle Phase.
 All key-value pairs for one stock have to go to one reducer.
 In the Map Phase, each key is assigned to a partition. So if
we have 3 reducers, we have 3 partitions.
 Each key is assigned to a partition by a class called the
Partitioner.
 If the Partitioner decides that a key-value pair of stock "xyz"
should go to partition 1, then all key-value pairs of that stock
will go to partition 1. Each partition is assigned to a reducer. For
example, partition 1 goes to reducer 1, partition 2 goes to
reducer 2, and so on.
Shuffle Phase Contd...
 This partitioning happens across all the mappers in the Map
Phase (a partitioning sketch follows this list).
 Key-value pairs within a partition are sorted by key.
 Once the keys are sorted, each partition is ready to be copied
to the appropriate reducer. This is known as the Copy
Phase.
 Data in a partition can come from many mappers.
 Each mapper processes all the records in its assigned
input splits and outputs a key-value pair for each record.
 At each reducer, the key-value pairs coming from different
mappers are merged in sorted order.
 In the example here, reducer 1 runs 3 times, once for each of its
symbols, and reducer 2 runs 2 times, once for each of its symbols.
That is the end-to-end process of MapReduce.
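The partition assignment itself is simple to sketch. The function below only mimics
the idea behind Hadoop's default hash partitioner; it is not an actual Hadoop or
mrjob API.

def partition(key, num_reducers):
    # Every key-value pair with the same key maps to the same partition,
    # so all records of one stock end up at one reducer. (Within a single
    # Python run, hash(key) is stable; Hadoop uses the key's hashCode.)
    return hash(key) % num_reducers

# With 3 reducers there are 3 partitions; "xyz" always lands in the
# same one during a run.
print(partition("xyz", 3))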
Shuffle Phase with Multiple Reducers
Combiner (Optional)
 A combiner can be used to reduce the data before it is sent to the
reducers.
 A combiner is like a mini reducer that runs at the end of the map
phase.
 It helps to reduce the load on the reducers, thereby
increasing performance.
 It is optional (a combiner sketch follows this list).
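For the max closing price example, the combiner can be the very same function as
the reducer, because max is associative and commutative: taking the max of local
maxima gives the global maximum. A hedged sketch, extending the hypothetical
MRMaxClosingPrice job from earlier:

    # Mini reducer: runs on each mapper's local output before the shuffle,
    # so far fewer key-value pairs have to travel across the network.
    def combiner(self, symbol, closing_prices):
        yield symbol, max(closing_prices)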
Find the word frequency from a text file using a MAP-REDUCE
program in Python.
Content of WordFrequency.py file:-
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        words = line.split()
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Content of f1.txt file:-
Hello mckvie
hello
Execution of WordFrequency.py from Anaconda Prompt or Anaconda Powershell:-
(base) C:\Users\yy>python WordFrequency.py f1.txt
Output:-
hello 2
mckvie 1
Explanation of Previous Program:
Content of WordFrequency.py file:-
from mrjob.job import MRJob
/* This line imports the MRJob class from the mrjob.job module. The MRJob class is
the base class that you'll extend to create MapReduce jobs in Python. */
class MRWordFrequencyCount(MRJob):
/* Here, a new class MRWordFrequencyCount is being defined, which inherits from
MRJob. This class will contain the logic for the Map and Reduce steps in the
MapReduce process. */
    def mapper(self, _, line):
        words = line.split()
        for word in words:
            yield word.lower(), 1
/* The mapper function processes each line of the input dataset: the line is split into
a list of words using the split() function. Each word is then converted to lowercase
using word.lower() (to make the count case-insensitive). The yield statement emits a
tuple (word, 1) for each word. This means the mapper is emitting each word along
with the number 1, representing one occurrence of that word in the line. */
Explanation of Previous Program Contd….
    def reducer(self, key, values):
        yield key, sum(values)
/* The reducer function processes the output of the mapper: the key is a word, and
the values are all the 1s associated with that word from the different mapper
outputs. sum(values) computes the total occurrences of that word, as all mappers
emit a 1 for each occurrence of a word. The yield statement emits a tuple
(key, total_count), where key is the word and total_count is the sum of occurrences. */
if __name__ == '__main__':
    MRWordFrequencyCount.run()
/* This is the entry point of the script. When the script is run directly, the
MRWordFrequencyCount.run() method is called. This starts the MapReduce job,
where the mapper function is applied to each line of the input data, and the reducer
function aggregates the results by summing up the occurrences of each word. */

Here the Python script implements a MapReduce job using the mrjob library, which
is a Python framework for writing MapReduce jobs that can run on Apache Hadoop
or locally.
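As a usage note: when invoked as shown above, mrjob runs the whole job in a single
local Python process, which is convenient for testing. The -r flag selects a different
runner; for example (the HDFS path here is hypothetical):

(base) C:\Users\yy>python WordFrequency.py -r hadoop hdfs:///user/yy/f1.txt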
What This Script Does:
This script is a word frequency counter that takes an input dataset (in the form of lines
of text), splits each line into words, and counts how many times each word appears
in total across the entire dataset.
Mapper: Processes the input line by line and yields each word along with the value 1.
Reducer: Takes the list of values for each word (all 1s) and sums them up to get the
total frequency of the word.
Meaning of def mapper(self, _, line):
1. self: This refers to the instance of the class MRWordFrequencyCount that is calling
the mapper method. It allows the method to access attributes and other methods
defined in the class. In object-oriented programming, self is a reference to the
current instance of the class, so it's a standard part of instance methods in Python.
2. _: The underscore (_) is a convention in Python indicating that the argument is
unused or irrelevant. In this case, the mapper function takes two arguments: a key
and a value, but the key is not being used in the logic. Typically, the key could
represent the position of the line (like the line number), but in this example, the key
is not needed, so it’s replaced with an underscore to indicate it's not being used.
Using _ is just a way to tell other developers that the value is intentionally ignored.
3. line: line represents the actual input data that the mapper processes. In the context
of this script, each line of text from the input dataset is passed into the mapper
method one at a time. The mapper function will then process this line by splitting it
into words and yielding word counts.
In summary:
The mapper method takes two arguments: _ (which is ignored and doesn't serve a
purpose in this function) and line (the actual line of text from the input). The method
then processes the line, splits it into words, and yields a pair consisting of each
word and the number 1, which the reducer later uses to count the occurrences of
each word.
Meaning of def reducer(self, key, values):
1. self: This refers to the instance of the class (MRWordFrequencyCount) that is calling
the reducer method. It’s a reference to the current object, allowing the method to
access instance attributes and other methods in the class. This is typical for instance
methods in Python.

2. key: In MapReduce, the key is the output from the mapper phase. It represents the
item you want to aggregate over. In this script, the key will be a word. For example,
if the input data contains the word "hello", then the key will be "hello". The key is
passed from the mapper to the reducer, and the reducer will operate on the values
associated with this key.

3. values: The values are a list (or iterator) of all the values that have been emitted by
the mapper for a given key. In this script, each value is 1, which is the result emitted
by the mapper for each occurrence of a word. For example, if the word "hello"
appeared 3 times in the input data, the values list would be [1, 1, 1] for the key
"hello".
The reducer will sum up all these 1s to get the total frequency of the word.
Meaning of def reducer(self, key, values) Contd…:
Example:
If the input to the mapper included the word "hello" three times, the mapper would
emit something like: ("hello", 1) ("hello", 1) ("hello", 1)
The reducer then receives: key = "hello", values = [1, 1, 1]
The reducer will compute: sum([1, 1, 1]) = 3
And the output from the reducer will be: hello 3
In summary:
key is the word, which was generated by the mapper.
values is the list of 1s associated with that word, representing its occurrences.
The reducer sums up the 1s for each word & yields the word along with its total count.
Thank You
