Hadoop Daemons
◼ HDFS Daemons
◼ NameNode (NN)
◼ Secondary NameNode (SNN)
◼ DataNode (DN)
◼ MapReduce Daemons
◼ JobTracker (JT)
◼ TaskTracker (TT)
Hadoop Architecture Description
◼ Clients (one or more) submit their work to the Hadoop system.
◼ When the Hadoop system receives a client request, it is first received by a Master
Node.
◼ The Master Node’s MapReduce component, the “Job Tracker”, is responsible for
receiving the client’s work, dividing it into manageable independent tasks, and
assigning them to Task Trackers.
◼ The Slave Node’s MapReduce component, the “Task Tracker”, receives those tasks
from the “Job Tracker” and performs them using the MapReduce components.
◼ Once all Task Trackers have finished their assigned tasks, the Job Tracker takes
their results and combines them into a final result.
◼ Finally, the Hadoop system sends the final result to the client.
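The flow above can be sketched as a toy simulation. The class names `JobTracker` and `TaskTracker` mirror the daemons, but the code is purely illustrative Python, not Hadoop's actual (Java) API:

```python
# Toy simulation of the JobTracker/TaskTracker flow described above.
# Class and method names are illustrative, not Hadoop APIs.

class TaskTracker:
    def run(self, task):
        # A "task" here is just a function plus its input split.
        func, split = task
        return func(split)

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, func, data, num_tasks):
        # Divide the client's work into independent tasks ...
        splits = [data[i::num_tasks] for i in range(num_tasks)]
        tasks = [(func, s) for s in splits]
        # ... assign each task to a TaskTracker (round-robin here) ...
        results = [self.trackers[i % len(self.trackers)].run(t)
                   for i, t in enumerate(tasks)]
        # ... and combine the partial results into a final result.
        return sum(results)

jt = JobTracker([TaskTracker(), TaskTracker()])
total = jt.submit(lambda split: sum(split), list(range(10)), num_tasks=4)
print(total)  # 45
```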
Advantages of MapReduce
◼ Parallel Processing
◼ Data locality: moving the processing to the data
instead of moving the data to the processing.
MapReduce Process Flow
Hadoop Processing Framework
MapReduce Example: Word Count Program
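A minimal in-memory sketch of the word count program, written in Python for readability (Hadoop's actual API is Java; the shuffle here is simulated with a dictionary):

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# shuffle groups them by word, and reduce sums each group.
from collections import defaultdict

def mapper(line):
    # (k1, v1) = (offset, line)  ->  (k2, v2) = (word, 1)
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # (k3, <v3>) = (word, [1, 1, ...])  ->  (k4, v4) = (word, total)
    return word, sum(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Shuffle/sort: group intermediate values by key.
groups = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        groups[word].append(one)

result = dict(reducer(w, c) for w, c in sorted(groups.items()))
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```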
What is MapReduce?
◼ MapReduce is a programming model and an associated implementation for
processing and generating large data sets.
◼ Users specify a map function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key.
◼ Many real-world tasks are expressible in this model.
◼ Programs written in this functional style are automatically parallelized and executed
on a large cluster of commodity machines.
◼ A typical MapReduce computation processes many terabytes of data on thousands
of machines.
MapReduce Runtime System
◼ The MapReduce runtime system takes care of the details of:
◼ partitioning the input data,
◼ scheduling the program’s execution across a set of machines,
◼ handling machine failures, and
◼ managing the required inter-machine communication.
◼ This allows programmers without any experience with parallel and distributed
systems to easily utilize the resources of a large distributed system.
MapReduce Programming Components
◼ TaskTracker (TT): RecordReader → (k1, v1)
◼ Mapper: (k1, v1) → (k2, v2)
◼ TT: Partition / Sort / Shuffle: (k2, v2) → (k3, <v3>)
◼ Reducer: (k3, <v3>) → (k4, v4)
◼ TT: RecordWriter: (k4, v4) → output
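The stage pipeline can be traced end to end with a small sketch, here using the classic "max temperature per year" task. Function names like `record_reader` and `shuffle` are illustrative stand-ins for the components named above, not Hadoop internals:

```python
# End-to-end sketch of the (k1,v1) -> (k2,v2) -> (k3,<v3>) -> (k4,v4)
# pipeline, for the "max temperature per year" task.

def record_reader(text):
    # TT: RecordReader -> (k1, v1) = (byte offset, line)
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

def mapper(offset, line):
    # Mapper: (k1, v1) -> (k2, v2) = (year, temperature)
    year, temp = line.split(",")
    yield year, int(temp)

def shuffle(pairs):
    # TT: Partition / Sort / Shuffle -> (k3, <v3>) = (year, [temps])
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

def reducer(year, temps):
    # Reducer: (k3, <v3>) -> (k4, v4) = (year, max temperature)
    return year, max(temps)

text = "1950,0\n1950,22\n1949,111\n1949,78\n"
intermediate = [kv for off, line in record_reader(text)
                for kv in mapper(off, line)]
output = [reducer(k, vs) for k, vs in shuffle(intermediate)]
print(output)  # [('1949', 111), ('1950', 22)]
```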
Combine
◼ Combine is an intermediate step between Map and Reduce.
◼ Combine is an optional process.
◼ The combiner is a reducer that runs individually on each mapper server.
◼ It reduces the data on each mapper further to a simplified form before passing it
downstream.
◼ This makes shuffling and sorting easier, as there is less data to work with.
◼ Often, the combiner class is set to the reducer class itself; this works when the
reduce function is commutative and associative. However, if needed, the combiner
can be a separate class as well.
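A sketch of what local combining buys: because summation is commutative and associative, each mapper can collapse its own (word, 1) pairs before the shuffle. This is illustrative Python, not Hadoop's `Combiner` API:

```python
# Combiner sketch: collapse one mapper's repeated (word, 1) pairs into
# (word, n) pairs locally, before anything crosses the network.
from collections import Counter

def combine(map_output):
    return list(Counter(word for word, _ in map_output).items())

# One mapper's raw output: 5 pairs ...
map_output = [("car", 1), ("car", 1), ("river", 1), ("car", 1), ("deer", 1)]
combined = combine(map_output)
# ... becomes 3 pairs, so less data reaches the shuffle/sort phase.
print(sorted(combined))  # [('car', 3), ('deer', 1), ('river', 1)]
```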
Partition
◼ Partition is an intermediate step between Map and Reduce.
◼ Partition is the process that takes the <key, value> pairs produced by the mappers
and determines which reducer each pair is sent to.
◼ It decides how the data is presented to the reducer and assigns each pair to a
particular reducer.
◼ The default partitioner determines the hash value for the key, resulting from the
mapper, and assigns a partition based on this hash value.
◼ There are as many partitions as there are reducers.
◼ Once the partitioning is complete, the data from each partition is sent to a specific
reducer.
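The default hash-partitioning behavior can be sketched as follows. Here `zlib.crc32` stands in for Java's `hashCode()` (Python's built-in `hash` is randomized per process), so this is an approximation of Hadoop's `HashPartitioner`, not the real thing:

```python
# Hash partitioning sketch: partition = hash(key) mod number-of-reducers.
import zlib

def get_partition(key, num_reducers):
    # Deterministic stand-in for Hadoop's key.hashCode() % numReduceTasks.
    return zlib.crc32(key.encode()) % num_reducers

# With 3 reducers there are exactly 3 possible partitions, and every
# occurrence of a given key lands in the same partition (same reducer).
keys = ["deer", "bear", "river", "car", "deer"]
parts = [get_partition(k, 3) for k in keys]
assert parts[0] == parts[4]            # same key -> same reducer
assert all(0 <= p < 3 for p in parts)  # partitions == number of reducers
```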
Fault Tolerance
◼ The master pings every worker periodically.
◼ If no response is received from a worker in a certain amount of time, the master marks the
worker as failed.
◼ Any map tasks completed by the worker are reset back to their initial idle state, and therefore
become eligible for scheduling on other workers.
◼ Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and
becomes eligible for rescheduling.
◼ Completed map tasks are re-executed on a failure because their output is stored on the local
disk(s) of the failed machine and is therefore inaccessible.
◼ Completed reduce tasks do not need to be re-executed since their output is stored in a global
file system.
◼ When a map task is executed first by worker A and then later executed by worker B (because A
failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that
has not already read the data from worker A will read the data from worker B.
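The reset rules above can be condensed into a toy routine: map tasks on a failed worker go back to idle even if completed (their output lived on that worker's local disk), while completed reduce tasks keep their output in the global file system. This is an illustrative sketch, not Hadoop's scheduler:

```python
# Toy sketch of the master's failure handling described above.

def handle_worker_failure(tasks, failed_worker):
    # Each task: {"type": "map"|"reduce", "worker": ..., "state": ...}
    for t in tasks:
        if t["worker"] != failed_worker:
            continue
        if t["type"] == "map":
            # Map output lived on the failed worker's local disk:
            # reset to idle whether in progress or completed.
            t["state"], t["worker"] = "idle", None
        elif t["state"] == "in-progress":
            # In-progress reduce tasks are rescheduled; completed reduce
            # output is already in the global file system, so it is kept.
            t["state"], t["worker"] = "idle", None
    return tasks

tasks = [
    {"type": "map", "worker": "A", "state": "completed"},
    {"type": "reduce", "worker": "A", "state": "completed"},
    {"type": "reduce", "worker": "A", "state": "in-progress"},
]
handle_worker_failure(tasks, "A")
print([t["state"] for t in tasks])  # ['idle', 'completed', 'idle']
```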
Locality
◼ Network bandwidth is a relatively scarce resource in our computing environment.
◼ HDFS divides each file into 64 MB blocks, and stores several copies of each block
(typically 3 copies) on different machines.
◼ The MapReduce master takes the location information of the input files into
account and attempts to schedule a map task on a machine that contains a replica
of the corresponding input data.
◼ Failing that, it attempts to schedule a map task near a replica of that task’s input
data (e.g., on a slave machine that is on the same rack).
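The scheduling preference above (data-local, then rack-local, then off-rack) can be sketched as a simple fallback chain. This is illustrative only; Hadoop's actual scheduler is considerably more involved:

```python
# Locality-aware placement sketch: prefer a free node holding a replica,
# else a free node on the same rack as a replica, else any free node.

def pick_node(replica_nodes, rack_of, free_nodes):
    # 1. Data-local: a free node that holds a replica of the input split.
    for n in free_nodes:
        if n in replica_nodes:
            return n
    # 2. Rack-local: a free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for n in free_nodes:
        if rack_of[n] in replica_racks:
            return n
    # 3. Off-rack: any free node (data must cross the network core).
    return free_nodes[0]

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
# Block replicas live on n1 and n3; n1 is busy.
choice = pick_node(["n1", "n3"], rack_of, free_nodes=["n2", "n4", "n3"])
print(choice)  # n3 (data-local beats rack-local n2)
```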
Limitations of Hadoop
◼ Issue with small files: HDFS is not suited to storing and processing large numbers
of small files, since the NameNode, which keeps all file metadata in memory, gets
overloaded.
◼ Slow processing speed: Hadoop's large code base makes it heavyweight.
◼ Latency: converting data to key-value format in the Mapper and then again in the
Reducer takes time.
◼ Security: Hadoop lacks built-in encryption. It supports Kerberos authentication,
which is hard to manage; complex applications are therefore challenging to secure,
and their data can be at risk.
◼ No real-time data processing; batch processing only.
◼ Not easy to use and program; it offers little abstraction to the developer.
◼ Hadoop is not efficient for iterative processing of a chain of stages in which the
output of each stage is the input to the next.
Acknowledgment
◼ Many of the figures in the slides are copied from Simplilearn and Edureka
Thank You