
UNIT 3

MapReduce
 Anatomy of a MapReduce Job Run

There are five independent entities:

 The client, which submits the MapReduce job.
 The YARN resource manager, which coordinates the allocation of compute
resources on the cluster.
 The YARN node managers, which launch and monitor the compute
containers on machines in the cluster.
 The MapReduce application master, which coordinates the tasks running
the MapReduce job. The application master and the MapReduce tasks run in
containers that are scheduled by the resource manager and managed by the
node managers.
 The distributed filesystem, which is used for sharing job files between
the other entities.

Job Submission:

 The submit() method on Job creates an internal JobSubmitter
instance and calls submitJobInternal() on it.
 Having submitted the job, waitForCompletion() polls the job's
progress once per second and reports the progress to the console if it
has changed since the last report.
 When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the
console.

The job submission process implemented by JobSubmitter does the
following (a client-side sketch follows this list):
 Asks the resource manager for a new application ID, used for the
MapReduce job ID.
 Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be
computed (because the input paths don't exist, for example), the job
is not submitted and an error is thrown to the MapReduce program.
 Copies the resources needed to run the job, including the job
JAR file, the configuration file, and the computed input splits, to
the shared filesystem in a directory named after the job ID.
 Submits the job by calling submitApplication() on the resource
manager.
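From the client's point of view, these steps are triggered by an ordinary
driver program. The following is a minimal sketch only; the WordCount,
MyMapper, and MyReducer class names and the command-line paths are
hypothetical, not part of the text above.

// Minimal driver sketch illustrating the submission path described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(MyMapper.class);      // hypothetical mapper class
    job.setReducerClass(MyReducer.class);    // hypothetical reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input splits are computed from this path at submission time.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Submission fails if this output directory already exists.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // waitForCompletion() submits the job (via JobSubmitter) and then
    // polls its progress once per second, printing it to the console.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}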
Job Initialization:
 When the resource manager receives a call to
its submitApplication() method, it hands off the request to the YARN
scheduler.
 The scheduler allocates a container, and the resource manager then
launches the application master’s process there, under the node
manager’s management.
 The application master for MapReduce jobs is a Java application
whose main class is MRAppMaster.
 It initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and
completion reports from the tasks.
 It retrieves the input splits computed in the client from the shared
filesystem.
 It then creates a map task object for each split, as well as a number of
reduce task objects determined by the mapreduce.job.reduces property
(set by the setNumReduceTasks() method on Job), as in the snippet below.
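Continuing the earlier driver sketch (where job is the Job instance), a
minimal way to control that number is:

// Two equivalent ways to fix the number of reduce tasks (and hence the
// number of reduce task objects the application master creates).
// The value 4 is arbitrary.
job.setNumReduceTasks(4);
// or, equivalently, set the property on the job's configuration
// before submission:
job.getConfiguration().setInt("mapreduce.job.reduces", 4);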

Task Assignment:
 If the job does not qualify for running as an uber task (a small job
whose tasks the application master runs in its own JVM), then the
application master requests containers for all the map and reduce tasks
in the job from the resource manager.
 Requests for map tasks are made first and with a higher priority than
those for reduce tasks, since all the map tasks must complete before
the sort phase of the reduce can start.
 Requests for reduce tasks are not made until 5% of map tasks have
completed.
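The 5% threshold is configurable per job. The property name used below,
mapreduce.job.reduce.slowstart.completedmaps, is not stated in the text
above and should be treated as an assumption; 0.05 corresponds to the 5%
mentioned. A minimal sketch, continuing the earlier driver:

// Request reduce containers only after this fraction of map tasks has
// completed (property name assumed; value shown is the usual default).
job.getConfiguration()
   .setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);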

Task Execution:
 Once a task has been assigned resources for a container on a
particular node by the resource manager's scheduler, the application
master starts the container by contacting the node manager.
 The task is executed by a Java application whose main class is
YarnChild. Before it can run the task, it localizes the resources that
the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
 Finally, it runs the map or reduce task.
Streaming:

 Streaming runs special map and reduce tasks for the purpose of
launching the user-supplied executable and communicating with it.
 The Streaming task communicates with the process (which may be
written in any language) using standard input and output streams.
 During execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined
map or reduce function and passes the output key-value pairs back to
the Java process.
 From the node manager’s point of view, it is as if the child process
ran the map or reduce code itself.
Progress and status updates:
 MapReduce jobs are long-running batch jobs, taking anything from
tens of seconds to hours to run.
 A job and each of its tasks have a status, which includes such things
as the state of the job or task (e.g., running, successfully completed,
failed), the progress of maps and reduces, the values of the job's
counters, and a status message or description (which may be set by
user code; see the mapper sketch after this list).
 When a task is running, it keeps track of its progress (i.e., the
proportion of the task that is completed).
 For map tasks, this is the proportion of the input that has been
processed.
 For reduce tasks, it’s a little more complex, but the system can still
estimate the proportion of the reduce input processed.
It does this by dividing the total progress into three parts,
corresponding to the three phases of the shuffle.
 As the map or reduce task runs, the child process communicates
with its parent application master through the umbilical interface.
 The task reports its progress and status (including counters) back to
its application master, which has an aggregate view of the job, every
three seconds over the umbilical interface.
 The resource manager web UI displays all the running applications
with links to the web UIs of their respective application masters,
each of which displays further details on the MapReduce job,
including its progress.
 During the course of the job, the client receives the latest status
by polling the application master every second (the interval is set
via mapreduce.client.progressmonitor.pollinterval).
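From the task side, user code can contribute to this status information
through the task context. The sketch below is illustrative only; the
class name, counter group, and record handling are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReportingMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.getLength() == 0) {
      // Counters are part of the task status sent over the umbilical.
      context.getCounter("Quality", "EMPTY_RECORDS").increment(1);
      return;
    }
    // A free-form status message, visible in the web UI.
    context.setStatus("processing offset " + key.get());
    context.write(new Text(value.toString()), new LongWritable(1));
  }
}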

Job Completion:
 When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to
Successful.
 Then, when the Job polls for status, it learns that the job has
completed successfully, so it prints a message to tell the user and
then returns from waitForCompletion() (see the sketch after this list).
 Finally, on job completion, the application master and the task
containers clean up their working state, and the OutputCommitter's
commitJob() method is called.
 Job information is archived by the job history server to enable later
interrogation by users if desired.
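On the client side, the outcome and the aggregated counters can be read
once waitForCompletion() returns. A minimal sketch, continuing the
earlier driver; the use of the TaskCounter enum assumes the Hadoop 2
counter names.

// Returns true only if the job reached the SUCCEEDED state.
boolean success = job.waitForCompletion(true);
if (success) {
  // Counters are aggregated by the application master and archived by
  // the job history server; here we read one built-in task counter.
  long mapInputRecords = job.getCounters()
      .findCounter(org.apache.hadoop.mapreduce.TaskCounter.MAP_INPUT_RECORDS)
      .getValue();
  System.out.println("Map input records: " + mapInputRecords);
}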

 Failures in Classic MapReduce

Failures
In the real world, user code is buggy, processes crash, and
machines fail. One of the major benefits of using Hadoop is its
ability to handle such failures and allow your job to complete.
Failures in Classic MapReduce
In the MapReduce 1 runtime there are three failure modes to
consider: failure of the running task, failure of the tasktracker, and
failure of the jobtracker. Let's look at each in turn.

 Task Failure
Consider first the case of the child task failing. The most common
way that this happens is when user code in the map or reduce task
throws a runtime exception. If this happens, the child JVM reports
the error back to its parent tasktracker, before it exits. The error
ultimately makes it into the user logs. The tasktracker marks the
task attempt as failed, freeing up a slot to run another task.

For Streaming tasks, if the Streaming process exits with a nonzero
exit code, it is marked as failed. This behavior is governed by
the stream.non.zero.exit.is.failure property (the default is true).
Another failure mode is the sudden exit of the child JVM—perhaps
there is a JVM bug that causes the JVM to exit for a particular set of
circumstances exposed by the MapReduce user code. In this case,
the tasktracker notices that the process has exited and marks the
attempt as failed.

Hanging tasks are dealt with differently. The tasktracker notices
that it hasn't received a progress update for a while and proceeds
to mark the task as failed. The child JVM process will be
automatically killed after this period. The timeout period after which
tasks are considered failed is normally 10 minutes and can be
configured on a per-job basis (or a cluster basis) by setting
the mapred.task.timeout property to a value in milliseconds.
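As a sketch, the timeout could be adjusted per job like this, continuing
the earlier Configuration object (in MapReduce 2 the equivalent property
is usually named mapreduce.task.timeout):

// Mark a task as failed if it reports no progress for 20 minutes
// instead of the default 10; the value is in milliseconds.
job.getConfiguration().setLong("mapred.task.timeout", 20 * 60 * 1000L);
// Setting the value to 0 would disable the timeout (not recommended).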

If a Streaming or Pipes process hangs, the tasktracker will kill it
(along with the JVM that launched it) only in one of the following
circumstances: either mapred.task.tracker.task-controller is set
to org.apache.hadoop.mapred.LinuxTaskController, or the default
task controller is being used
(org.apache.hadoop.mapred.DefaultTaskController) and
the setsid command is available on the system (so that the child
JVM and any processes it launches are in the same process group).
In any other case orphaned Streaming or Pipes processes will
accumulate on the system, which will impact utilization over time.

Setting the timeout to a value of zero disables the timeout, so
long-running tasks are never marked as failed. In this case, a hanging
task will never free up its slot, and over time there may be cluster
slowdown as a result. This approach should therefore be avoided,
and making sure that a task is reporting progress periodically will
suffice.
When the jobtracker is notified of a task attempt that has failed (by
the tasktracker’s heartbeat call), it will reschedule execution of the
task. The jobtracker will try to avoid rescheduling the task on a
tasktracker where it has previously failed. Furthermore, if a task fails
four times (or more), it will not be retried further. This value is
configurable: the maximum number of attempts to run a task is
controlled by the mapred.map.max.attempts property for map tasks
and mapred.reduce.max.attempts for reduce tasks. By default, if
any task fails four times (or whatever the maximum number of
attempts is configured to), the whole job fails.

For some applications, it is undesirable to abort the job if a few
tasks fail, as it may be possible to use the results of the job despite
some failures. In this case, the maximum percentage of tasks that
are allowed to fail without triggering job failure can be set for the
job. Map tasks and reduce tasks are controlled independently, using
the mapred.max.map.failures.percent and
mapred.max.reduce.failures.percent properties.
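A hedged sketch of these per-job settings, using the classic property
names from the text and the Configuration object from the earlier driver:

// Retry each task up to four times before giving up on it.
job.getConfiguration().setInt("mapred.map.max.attempts", 4);
job.getConfiguration().setInt("mapred.reduce.max.attempts", 4);
// Tolerate up to 5% failed map or reduce tasks without failing the job.
job.getConfiguration().setInt("mapred.max.map.failures.percent", 5);
job.getConfiguration().setInt("mapred.max.reduce.failures.percent", 5);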

A task attempt may also be killed, which is different from it failing. A
task attempt may be killed because it is a speculative duplicate (see
"Speculative Execution"), or because the
tasktracker it was running on failed, and the jobtracker marked all
the task attempts running on it as killed. Killed task attempts do not
count against the number of attempts to run the task (as set
by mapred.map.max.attempts and mapred.reduce.max.attempts),
since it wasn’t the task’s fault that an attempt was killed.

Users may also kill or fail task attempts using the web UI or the
command line (type hadoop job to see the options). Jobs may also
be killed by the same mechanisms.
 Tasktracker Failure
Failure of a tasktracker is another failure mode. If a tasktracker fails
by crashing, or running very slowly, it will stop sending heartbeats
to the jobtracker (or send them very infrequently). The jobtracker
will notice a tasktracker that has stopped sending heartbeats (if it
hasn't received one for 10 minutes, configured via the
mapred.tasktracker.expiry.interval property, in milliseconds) and remove
it from its pool of tasktrackers to schedule tasks on. The jobtracker
arranges for map tasks that were run and completed successfully
on that tasktracker to be rerun if they belong to incomplete jobs,
since their intermediate output residing on the failed tasktracker’s
local filesystem may not be accessible to the reduce task. Any tasks
in progress are also rescheduled.
A tasktracker can also be blacklisted by the jobtracker, even if the
tasktracker has not failed. If more than four tasks from the same job
fail on a particular tasktracker (set by mapred.max.tracker.failures),
then the jobtracker records this as a fault. A tasktracker is
blacklisted if the number of faults is over some minimum threshold
(four, set by mapred.max.tracker.blacklists) and is significantly
higher than the average number of faults for tasktrackers in the
cluster.
Blacklisted tasktrackers are not assigned tasks, but they continue to
communicate with the jobtracker. Faults expire over time (at the
rate of one per day), so tasktrackers get the chance to run jobs
again simply by leaving them running. Alternatively, if there is an
underlying fault that can be fixed (by replacing hardware, for
example), the tasktracker will be removed from the jobtracker’s
blacklist after it restarts and rejoins the cluster.
 Jobtracker Failure
Failure of the jobtracker is the most serious failure mode. Hadoop
has no mechanism for dealing with failure of the jobtracker—it is a
single point of failure—so in this case the job fails. However, this
failure mode has a low chance of occurring, since the chance of a
particular machine failing is low. The good news is that the situation
is improved in YARN, since one of its design goals is to eliminate
single points of failure in MapReduce.
After restarting a jobtracker, any jobs that were running at the time
it was stopped will need to be re-submitted. There is a configuration
option that attempts to recover any running jobs
(mapred.jobtracker.restart.recover, turned off by default), but it is
known not to work reliably, so it should not be used.

 Job Scheduling in MapReduce (Job Scheduling in Hadoop)

What are Hadoop Schedulers?

Hadoop is a general-purpose system that enables high-performance
processing of data over a set of distributed nodes. It is a
multitasking system that processes multiple data sets for multiple
jobs and multiple users simultaneously.
Earlier versions of Hadoop supported a single scheduler that was
intermixed with the JobTracker logic. This implementation was well
suited to Hadoop's traditional batch jobs (such as log mining and web
indexing), but it was inflexible and impossible to tailor.
Previous versions of Hadoop scheduled users' jobs in a very simple
way: with the Hadoop FIFO scheduler, jobs ran in order of submission.
A job's priority can be set using the mapred.job.priority property or
the setJobPriority() method on JobClient. When choosing the next job
to run, the job scheduler selects the one with the highest priority.
However, priorities do not support preemption with the FIFO scheduler
in Hadoop, so a high-priority job can still be blocked by a
long-running, low-priority job that started before the high-priority
job was scheduled.
Additionally, MapReduce in Hadoop comes with a choice of schedulers:
the Hadoop FIFO scheduler, and multiuser schedulers such as the Fair
Scheduler and the Hadoop Capacity Scheduler.
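A minimal sketch of setting a job's priority with the classic
(org.apache.hadoop.mapred) API; with the FIFO scheduler this changes
ordering but, as noted above, does not preempt a running job. The class
and method structure here are illustrative only.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PriorityExample {
  public static JobConf highPriorityConf() {
    JobConf jobConf = new JobConf();
    // Equivalent to setting the mapred.job.priority property to "HIGH";
    // accepted values include VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
    jobConf.setJobPriority(JobPriority.HIGH);
    return jobConf;
  }
}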
Types of Hadoop Schedulers
There are several types of schedulers which we use in Hadoop,
such as:

a. Hadoop FIFO scheduler

The original Hadoop job scheduling algorithm, integrated within the
JobTracker, is FIFO. The JobTracker pulled jobs from a work queue,
oldest job first; this is Hadoop FIFO scheduling. It is a simple and
efficient approach, but it has no concept of the priority or size of
the job.
b. Hadoop Fair Scheduler
The Fair Scheduler in Hadoop aims to give every user a fair share of
the cluster capacity over time. If a single job is running, it gets
the whole cluster. As more jobs are submitted, free task slots are
given to the jobs in such a way as to give each user a fair share of
the cluster.
The Hadoop Fair Scheduler also supports preemption: if a pool has not
received its fair share for a certain period of time, the scheduler
will kill tasks in pools running over capacity to give the slots to
the pool running under capacity.
The Fair Scheduler is a "contrib" module. To enable it, place its JAR
file on Hadoop's classpath by copying it from Hadoop's
contrib/fairscheduler directory to the lib directory, and then set
the mapred.jobtracker.taskScheduler property to:
org.apache.hadoop.mapred.FairScheduler
c. Hadoop Capacity Scheduler
The Capacity Scheduler takes a slightly different approach to
multiuser scheduling. It is like the Fair Scheduler, except that
within each queue, jobs are scheduled using FIFO scheduling (with
priorities). In effect, it permits each user or organization to
simulate a separate MapReduce cluster with FIFO scheduling.
d. Hadoop Scheduler — Other Approaches
In addition to these schedulers, Hadoop also offers the concept of
provisioning virtual clusters from within larger physical clusters,
known as Hadoop On Demand (HOD). It uses the Torque resource manager
for node allocation, based on the requirements of the virtual cluster.
With the allocated nodes, the HOD system automatically prepares
configuration files and then initializes the system based on the nodes
within the virtual cluster. After initialization, the HOD virtual
cluster can be used in a relatively independent way.
In other words, HOD is an interesting model for deploying Hadoop
clusters within a cloud infrastructure. As an advantage, it offers
greater security because the nodes are shared less.
 Shuffling and Sorting in Hadoop MapReduce
What is Shuffling and Sorting in Hadoop MapReduce?
Before we start with shuffle and sort in MapReduce, let us recall the
other parts of a MapReduce job: the mapper, reducer, combiner,
partitioner, and input format.
Shuffle phase in Hadoop transfers the map output from Mapper
to a Reducer in MapReduce. Sort phase in MapReduce covers
the merging and sorting of map outputs. Data from the mapper
are grouped by the key, split among reducers and sorted by the
key. Every reducer obtains all values associated with the same
key. Shuffle and sort phase in Hadoop occur simultaneously and
are done by the MapReduce framework.
Let us now understand both these processes in details below:
Shuffling in MapReduce
The process of transferring data from the mappers to the reducers is
known as shuffling, i.e., the process by which the system performs
the sort and transfers the map output to the reducer as input. The
shuffle phase is necessary for the reducers; without it, they would
not have any input (or would have input from every mapper). Because
shuffling can start even before the map phase has finished, it saves
some time and completes the job sooner. A sketch of how map output is
routed to reducers is shown below.
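The sketch below shows a custom partitioner that decides which reducer
each map output key goes to; the class name and the first-letter scheme
are hypothetical. By default Hadoop hashes the key in a similar way, so
that all values for a given key reach the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys starting with the same letter are sent
// to the same reducer, so each reducer receives all values for its keys.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0;  // map-only job: nothing to partition
    }
    String k = key.toString();
    char first = k.isEmpty() ? '#' : Character.toLowerCase(k.charAt(0));
    return (first & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would be registered on the driver with
job.setPartitionerClass(FirstLetterPartitioner.class).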
Sorting in MapReduce
The keys generated by the mapper are automatically sorted by the
MapReduce framework, i.e., before the reducer starts, all
intermediate key-value pairs generated by the mapper get sorted by
key and not by value. Values passed to each reducer are not sorted;
they can be in any order.
Sorting in Hadoop helps the reducer easily distinguish when a new
reduce task should start, which saves time for the reducer. The
reducer starts a new reduce task when the next key in the sorted
input data is different from the previous one. Each reduce task takes
key-value pairs as input and generates key-value pairs as output.
Note that shuffling and sorting in Hadoop MapReduce are not
performed at all if you specify zero reducers
(setNumReduceTasks(0)). In that case, the MapReduce job stops at the
map phase, and the map phase does not include any kind of sorting
(so even the map phase is faster).
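Continuing the earlier driver sketch, a map-only job can be requested
like this (the mapper class and output path are hypothetical):

// Zero reducers: no shuffle, no sort; each mapper's output is written
// directly to the output directory, one file per map task.
job.setMapperClass(MyMapper.class);   // hypothetical mapper
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, new Path("/tmp/map-only-output"));  // hypothetical path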

 Key Features of MapReduce


1. Highly scalable
Apache Hadoop MapReduce is a framework with excellent scalability,
because of its capacity for distributing and storing large amounts of
data across numerous servers. These servers can all run
simultaneously and are all reasonably priced.
By adding servers to the cluster, we can simply grow the amount of
storage and computing power. We may improve the capacity of
nodes or add any number of nodes (horizontal scalability) to attain
high computing power. Organizations may execute applications
from massive sets of nodes, potentially using thousands of terabytes
of data, thanks to Hadoop MapReduce programming.
2. Versatile
Businesses can use MapReduce programming to access new data
sources. It makes it possible for companies to work with many forms
of data. Enterprises can access both organized and unstructured
data with this method and acquire valuable insights from the various
data sources.
Since Hadoop is an open-source project, its source code is freely
accessible for review, alterations, and analyses. This enables
businesses to alter the code to meet their specific needs. The
MapReduce framework supports data from sources including email,
social media, and clickstreams in different languages.
3. Secure
The MapReduce programming model uses the HBase and HDFS
security approaches, and only authenticated users are permitted to
view and manipulate the data. HDFS uses a replication technique in
Hadoop 2 to provide fault tolerance. Depending on the replication
factor, it makes a clone of each block on the various machines. One
can therefore access data from the other devices that house a
replica of the same data if any machine in a cluster goes down.
Erasure coding has taken the place of this replication technique in
Hadoop 3. Erasure coding delivers the same level of fault tolerance
with less storage space; the storage overhead with erasure coding is
less than 50%.
4. Affordability
With the help of the MapReduce programming framework and
Hadoop’s scalable design, big data volumes may be stored and
processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business models
that must store data that is constantly expanding to meet the
demands of the present.
In terms of scalability, processing data with older, conventional
relational database management systems was not as simple as it is
with the Hadoop system. In these situations, the company had to
minimize the data and execute classification based on presumptions
about how specific data could be relevant to the organization, hence
deleting the raw data. The MapReduce programming model in the
Hadoop scale-out architecture helps in this situation.
5. Fast-paced
The Hadoop Distributed File System, the distributed storage layer
used by MapReduce, acts as a mapping system for locating data in the
cluster. Data processing technologies, such as MapReduce programs,
are typically placed on the same servers that hold the data, which
enables quicker data processing.

6. Based on a simple programming model

Hadoop MapReduce is built on a straightforward programming model,
which is one of the technology's many noteworthy features.
This enables programmers to create MapReduce applications that
can handle tasks quickly and effectively. Java is a very well-liked and
simple-to-learn programming language used to develop the
MapReduce programming model.
7. Parallel processing-compatible
The parallel processing involved in MapReduce programming is one
of its key components. The tasks are divided in the programming
paradigm to enable the simultaneous execution of independent
activities. As a result, the program runs faster because of the parallel
processing, which makes it simpler for the processes to handle each
job. Multiple processors can carry out these broken-down tasks
thanks to parallel processing. Consequently, the entire software runs
faster.
8. Reliable
The same set of data is transferred to some other nodes in a cluster
each time a collection of information is sent to a single node.
Therefore, even if one node fails, backup copies are always available
on other nodes that may still be retrieved whenever necessary. This
ensures high data availability.
 Hadoop Cluster

Apache Hadoop is an open source, Java-based software framework
and parallel data processing engine. It enables big data analytics
processing tasks to be broken down into smaller tasks that can be
performed in parallel by using an algorithm (like
the MapReduce algorithm), and distributing them across a Hadoop
cluster. A Hadoop cluster is a collection of computers, known as
nodes, that are networked together to perform these kinds of
parallel computations on big data sets. Unlike other computer
clusters, Hadoop clusters are designed specifically to store and
analyze mass amounts of structured and unstructured data in a
distributed computing environment. Further distinguishing Hadoop
ecosystems from other computer clusters are their unique structure
and architecture. Hadoop clusters consist of a network of connected
master and slave nodes that utilize high availability, low-cost
commodity hardware. The ability to linearly scale and quickly add or
subtract nodes as volume demands makes them well-suited to big
data analytics jobs with data sets highly variable in size.
Hadoop Cluster Architecture
Hadoop clusters are composed of a network of master and worker
nodes that orchestrate and execute the various jobs across the
Hadoop distributed file system. The master nodes typically utilize
higher quality hardware and include a NameNode, Secondary
NameNode, and JobTracker, with each running on a separate
machine. The workers consist of virtual machines, running both
DataNode and TaskTracker services on commodity hardware, and
do the actual work of storing and processing the jobs as directed by
the master nodes. The final part of the system is the client nodes,
which are responsible for loading the data and fetching the results.

 Master nodes are responsible for storing data in HDFS and
overseeing key operations, such as running parallel
computations on the data using MapReduce.
 The worker nodes comprise most of the virtual machines in a
Hadoop cluster, and perform the job of storing the data and
running computations. Each worker node runs the DataNode
and TaskTracker services, which are used to receive the
instructions from the master nodes.
 Client nodes are in charge of loading the data into the cluster.
Client nodes first submit MapReduce jobs describing how data
needs to be processed and then fetch the results once the
processing is finished.
What is cluster size in Hadoop?
A Hadoop cluster size is a set of metrics that defines the storage and
compute capabilities available to run Hadoop workloads, namely:
 Number of nodes: the number of master nodes, edge nodes, and
worker nodes.
 Configuration of each type of node: the number of cores per node,
RAM, and disk volume.
What are the advantages of a Hadoop Cluster?
 Hadoop clusters can boost the processing speed of many big
data analytics jobs, given their ability to break down large
computational tasks into smaller tasks that can be run in a
parallel, distributed fashion.
 Hadoop clusters are easily scalable and can quickly add nodes
to increase throughput, and maintain processing speed, when
faced with increasing data blocks.
 The use of low cost, high availability commodity hardware
makes Hadoop clusters relatively easy and inexpensive to set
up and maintain.
 Hadoop clusters replicate a data set across the distributed file
system, making them resilient to data loss and cluster failure.
 Hadoop clusters make it possible to integrate and leverage
data from multiple different source systems and data formats.
 It is possible to deploy Hadoop using a single-node installation,
for evaluation purposes.
What are the challenges of a Hadoop Cluster?
 Issue with small files - Hadoop struggles with large volumes of
small files - files smaller than the Hadoop block size of 128MB or
256MB by default. It wasn't designed to handle vast numbers of
small files in a scalable way; instead, Hadoop works well when
there is a small number of large files. Ultimately, as you increase
the volume of small files, the NameNode becomes overloaded, since it
stores the namespace for the entire system.
 High processing overhead - reading and writing operations in
Hadoop can get very expensive quickly, especially when
processing large amounts of data. This all comes down to
Hadoop's inability to do in-memory processing; instead,
data is read from and written to disk.
 Only batch processing is supported - Hadoop is built to process
small numbers of large files in batches. This goes back to the way
data is collected and stored, which all has to be done before
processing starts. What this ultimately means is that streaming
data is not supported and Hadoop cannot do real-time processing
with low latency.
 Iterative processing - Hadoop's data flow is structured as a
sequence of stages, which makes it impossible to do iterative
processing or to use it for machine learning (ML).
