Hadoop Daemons
◼ HDFS Daemons
◼ NameNode (NN)
◼ Secondary NameNode (SNN)
◼ DataNode (DN)
◼ MapReduce Daemons
◼ JobTracker (JT)
◼ TaskTracker (TT)
Hadoop Architecture Description
◼ Clients (one or more) submit their work to the Hadoop system.
◼ When the Hadoop system receives a client request, it is first received by a Master
Node.
◼ The Master Node’s MapReduce component, the “Job Tracker”, is responsible for
receiving the client’s work, dividing it into manageable independent tasks, and
assigning them to Task Trackers.
◼ The Slave Node’s MapReduce component, the “Task Tracker”, receives those tasks
from the “Job Tracker” and performs them using the MapReduce components.
◼ Once all Task Trackers have finished their assigned tasks, the Job Tracker takes
their results and combines them into a final result.
◼ Finally, the Hadoop system sends the final result to the client.
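The flow above can be sketched as a toy simulation. The class names `JobTracker` and `TaskTracker` mirror the daemons, but the code is purely illustrative Python, not Hadoop's actual (Java) API:

```python
# Toy simulation of the JobTracker/TaskTracker flow described above.
# Class and method names are illustrative, not Hadoop APIs.

class TaskTracker:
    def run(self, task):
        # A "task" here is just a function plus its input split.
        func, split = task
        return func(split)

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, func, data, num_tasks):
        # Divide the client's work into independent tasks ...
        splits = [data[i::num_tasks] for i in range(num_tasks)]
        tasks = [(func, s) for s in splits]
        # ... assign each task to a TaskTracker (round-robin here) ...
        results = [self.trackers[i % len(self.trackers)].run(t)
                   for i, t in enumerate(tasks)]
        # ... and combine the partial results into a final result.
        return sum(results)

jt = JobTracker([TaskTracker(), TaskTracker()])
total = jt.submit(lambda split: sum(split), list(range(10)), num_tasks=4)
print(total)  # 45
```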
Advantages of MapReduce
◼ Parallel Processing
◼ Data locality: moving the processing to the data
instead of moving the data to the processing.
MapReduce Process Flow
Hadoop Processing Framework
MapReduce Example: Word Count Program
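A minimal in-memory sketch of the word count program, written in Python for readability (Hadoop's actual API is Java; the shuffle here is simulated with a dictionary):

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# shuffle groups them by word, and reduce sums each group.
from collections import defaultdict

def mapper(line):
    # (k1, v1) = (offset, line)  ->  (k2, v2) = (word, 1)
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # (k3, <v3>) = (word, [1, 1, ...])  ->  (k4, v4) = (word, total)
    return word, sum(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Shuffle/sort: group intermediate values by key.
groups = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        groups[word].append(one)

result = dict(reducer(w, c) for w, c in sorted(groups.items()))
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```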
What is MapReduce?
◼ MapReduce is a programming model and an associated implementation for
processing and generating large data sets.
◼ Users specify a map function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key.
◼ Many real-world tasks are expressible in this model.
◼ Programs written in this functional style are automatically parallelized and executed
on a large cluster of commodity machines.
◼ A typical MapReduce computation processes many terabytes of data on thousands
of machines.
MapReduce Runtime System
◼ The MapReduce runtime system takes care of the details of:
◼ partitioning the input data,
◼ scheduling the program’s execution across a set of machines,
◼ handling machine failures, and
◼ managing the required inter-machine communication.
◼ This allows programmers without any experience with parallel and distributed
systems to easily utilize the resources of a large distributed system.
MapReduce Programming Components
◼ TaskTracker (TT): RecordReader → (k1, v1)
◼ Mapper: (k1, v1) → (k2, v2)
◼ TT: Partition / Sort / Shuffle: (k2, v2) → (k3, <v3>)
◼ Reducer: (k3, <v3>) → (k4, v4)
◼ TT: RecordWriter: (k4, v4) → output
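The stage pipeline can be traced end to end with a small sketch, here using the classic "max temperature per year" task. Function names like `record_reader` and `shuffle` are illustrative stand-ins for the components named above, not Hadoop internals:

```python
# End-to-end sketch of the (k1,v1) -> (k2,v2) -> (k3,<v3>) -> (k4,v4)
# pipeline, for the "max temperature per year" task.

def record_reader(text):
    # TT: RecordReader -> (k1, v1) = (byte offset, line)
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

def mapper(offset, line):
    # Mapper: (k1, v1) -> (k2, v2) = (year, temperature)
    year, temp = line.split(",")
    yield year, int(temp)

def shuffle(pairs):
    # TT: Partition / Sort / Shuffle -> (k3, <v3>) = (year, [temps])
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return sorted(groups.items())

def reducer(year, temps):
    # Reducer: (k3, <v3>) -> (k4, v4) = (year, max temperature)
    return year, max(temps)

text = "1950,0\n1950,22\n1949,111\n1949,78\n"
intermediate = [kv for off, line in record_reader(text)
                for kv in mapper(off, line)]
output = [reducer(k, vs) for k, vs in shuffle(intermediate)]
print(output)  # [('1949', 111), ('1950', 22)]
```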
Combine
◼ Combine is an intermediate step between Map and Reduce.
◼ Combine is an optional process.
◼ The combiner is a reducer that runs individually on each mapper server.
◼ It reduces the data on each mapper further to a simplified form before passing it
downstream.
◼ This makes shuffling and sorting easier, as there is less data to work with.
◼ Often, the combiner class is set to the reducer class itself; this works when the
reduce function is commutative and associative. However, if needed, the combiner
can be a separate class as well.
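A sketch of what local combining buys: because summation is commutative and associative, each mapper can collapse its own (word, 1) pairs before the shuffle. This is illustrative Python, not Hadoop's `Combiner` API:

```python
# Combiner sketch: collapse one mapper's repeated (word, 1) pairs into
# (word, n) pairs locally, before anything crosses the network.
from collections import Counter

def combine(map_output):
    return list(Counter(word for word, _ in map_output).items())

# One mapper's raw output: 5 pairs ...
map_output = [("car", 1), ("car", 1), ("river", 1), ("car", 1), ("deer", 1)]
combined = combine(map_output)
# ... becomes 3 pairs, so less data reaches the shuffle/sort phase.
print(sorted(combined))  # [('car', 3), ('deer', 1), ('river', 1)]
```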
Partition
◼ Partition is an intermediate step between Map and Reduce.
◼ Partition is the process that takes the <key, value> pairs produced by the mappers
and determines which reducer each pair is sent to.
◼ It decides how the data is presented to the reducer and assigns each pair to a
particular reducer.
◼ The default partitioner determines the hash value for the key, resulting from the
mapper, and assigns a partition based on this hash value.
◼ There are as many partitions as there are reducers.
◼ Once the partitioning is complete, the data from each partition is sent to a specific
reducer.
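The default hash-partitioning behavior can be sketched as follows. Here `zlib.crc32` stands in for Java's `hashCode()` (Python's built-in `hash` is randomized per process), so this is an approximation of Hadoop's `HashPartitioner`, not the real thing:

```python
# Hash partitioning sketch: partition = hash(key) mod number-of-reducers.
import zlib

def get_partition(key, num_reducers):
    # Deterministic stand-in for Hadoop's key.hashCode() % numReduceTasks.
    return zlib.crc32(key.encode()) % num_reducers

# With 3 reducers there are exactly 3 possible partitions, and every
# occurrence of a given key lands in the same partition (same reducer).
keys = ["deer", "bear", "river", "car", "deer"]
parts = [get_partition(k, 3) for k in keys]
assert parts[0] == parts[4]            # same key -> same reducer
assert all(0 <= p < 3 for p in parts)  # partitions == number of reducers
```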
Fault Tolerance
◼ The master pings every worker periodically.
◼ If no response is received from a worker in a certain amount of time, the master marks the
worker as failed.
◼ Any map tasks completed by the worker are reset back to their initial idle state, and therefore
become eligible for scheduling on other workers.
◼ Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and
becomes eligible for rescheduling.
◼ Completed map tasks are re-executed on a failure because their output is stored on the local
disk(s) of the failed machine and is therefore inaccessible.
◼ Completed reduce tasks do not need to be re-executed since their output is stored in a global
file system.
◼ When a map task is executed first by worker A and then later executed by worker B (because A
failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that
has not already read the data from worker A will read the data from worker B.
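The reset rules above can be condensed into a toy routine: map tasks on a failed worker go back to idle even if completed (their output lived on that worker's local disk), while completed reduce tasks keep their output in the global file system. This is an illustrative sketch, not Hadoop's scheduler:

```python
# Toy sketch of the master's failure handling described above.

def handle_worker_failure(tasks, failed_worker):
    # Each task: {"type": "map"|"reduce", "worker": ..., "state": ...}
    for t in tasks:
        if t["worker"] != failed_worker:
            continue
        if t["type"] == "map":
            # Map output lived on the failed worker's local disk:
            # reset to idle whether in progress or completed.
            t["state"], t["worker"] = "idle", None
        elif t["state"] == "in-progress":
            # In-progress reduce tasks are rescheduled; completed reduce
            # output is already in the global file system, so it is kept.
            t["state"], t["worker"] = "idle", None
    return tasks

tasks = [
    {"type": "map", "worker": "A", "state": "completed"},
    {"type": "reduce", "worker": "A", "state": "completed"},
    {"type": "reduce", "worker": "A", "state": "in-progress"},
]
handle_worker_failure(tasks, "A")
print([t["state"] for t in tasks])  # ['idle', 'completed', 'idle']
```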
Locality
◼ Network bandwidth is a relatively scarce resource in our computing environment.
◼ HDFS divides each file into 64 MB blocks, and stores several copies of each block
(typically 3 copies) on different machines.
◼ The MapReduce master takes the location information of the input files into
account and attempts to schedule a map task on a machine that contains a replica
of the corresponding input data.
◼ Failing that, it attempts to schedule a map task near a replica of that task’s input
data (e.g., on a slave machine that is on the same rack).
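The scheduling preference above (data-local, then rack-local, then off-rack) can be sketched as a simple fallback chain. This is illustrative only; Hadoop's actual scheduler is considerably more involved:

```python
# Locality-aware placement sketch: prefer a free node holding a replica,
# else a free node on the same rack as a replica, else any free node.

def pick_node(replica_nodes, rack_of, free_nodes):
    # 1. Data-local: a free node that holds a replica of the input split.
    for n in free_nodes:
        if n in replica_nodes:
            return n
    # 2. Rack-local: a free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for n in free_nodes:
        if rack_of[n] in replica_racks:
            return n
    # 3. Off-rack: any free node (data must cross the network core).
    return free_nodes[0]

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
# Block replicas live on n1 and n3; n1 is busy.
choice = pick_node(["n1", "n3"], rack_of, free_nodes=["n2", "n4", "n3"])
print(choice)  # n3 (data-local beats rack-local n2)
```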
Limitations of Hadoop
◼ Issue with small files: HDFS is not suited to storing and processing large numbers
of small files, since the NameNode, which keeps all file metadata in memory, gets
overloaded.
◼ Slow processing speed: Hadoop's large code base makes it heavyweight.
◼ Latency: converting data to key-value format in the Mapper and then again in the
Reducer takes time.
◼ Security: Hadoop lacks built-in encryption. It supports Kerberos authentication,
which is hard to manage; complex applications are therefore challenging to secure,
and their data can be at risk.
◼ No real-time data processing; batch processing only.
◼ Not easy to use and program; it offers little abstraction to the developer.
◼ Hadoop is not efficient for iterative processing of a chain of stages in which the
output of each stage is the input to the next.
Acknowledgment
◼ Many of the figures in the slides are copied from Simplilearn and Edureka
Thank You