Communication between client, NameNode and DataNodes
Example: client/NameNode/DataNode interaction
During write operation
Step 1: When a client writes data, it first communicates with the NameNode and requests to create a file.
Step 2: The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data.
Step 3: As part of the storage process, the data blocks are replicated after they are written to the assigned node.
Step 4: Depending on how many nodes are in the cluster, the NameNode will attempt to write replicas of the data blocks on nodes in other, separate racks (if possible).
• If there is only one rack, then the replicated blocks are written to other servers in the same rack.
Step 5: After the DataNode acknowledges that the file block replication is complete, the client closes the file and informs the NameNode that the operation is complete.
Note
• The NameNode does not write any data directly to the DataNodes.
• It does, however, give the client a limited amount of time to complete the operation.
• If the operation does not complete within that time period, it is cancelled.
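The write path above can be sketched in a few lines of Python. This is a simplified model, not the real Hadoop API: the class names, block size, and round-robin node selection are all illustrative assumptions.

```python
# Simplified sketch of the HDFS write path: the client asks the NameNode
# for a block plan (metadata only), then writes blocks and their replicas
# directly to the assigned DataNodes. Illustrative stand-ins, not Hadoop's API.

BLOCK_SIZE = 4   # bytes per block (tiny for illustration; HDFS uses 128 MB)
REPLICATION = 3

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data   # store the block, then acknowledge
        return True

class NameNode:
    """Holds metadata only; never touches block data itself."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}  # filename -> list of (block_id, [target DataNodes])

    def create_file(self, filename, size):
        n_blocks = -(-size // BLOCK_SIZE)   # ceiling division
        plan = []
        for i in range(n_blocks):
            # Step 2: choose DataNodes for each block
            # (round-robin here; real HDFS placement is rack-aware)
            targets = [self.datanodes[(i + r) % len(self.datanodes)]
                       for r in range(min(REPLICATION, len(self.datanodes)))]
            plan.append((f"{filename}#blk{i}", targets))
        self.files[filename] = plan
        return plan

def client_write(namenode, filename, data):
    # Step 1: ask the NameNode to create the file and get the block plan
    plan = namenode.create_file(filename, len(data))
    # Step 3: write each block to its assigned nodes (replication)
    for i, (block_id, targets) in enumerate(plan):
        chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        for dn in targets:
            assert dn.write_block(block_id, chunk)   # wait for each ack
    # Step 5: close the file / inform the NameNode (a no-op in this sketch)
    return plan

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(nodes)
plan = client_write(nn, "demo.txt", b"hello hdfs")
```

The key point the sketch preserves is that the NameNode only plans block placement; the block bytes flow from the client to the DataNodes directly.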
During Read Operation
1. Reading data happens in a similar fashion.
2. The client requests a file from the NameNode, which returns the best DataNodes from which to read the data.
3. The client then accesses the data directly from the DataNodes.
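The read steps can be sketched the same way; the metadata layout and replica choice below are simplified assumptions, not the real HDFS protocol.

```python
# Simplified sketch of the HDFS read path: metadata lookup goes through
# the NameNode; the block data flows directly from DataNodes to the client.

def lookup(metadata, filename):
    """NameNode role: return, per block, the DataNodes holding a replica."""
    return metadata[filename]  # [(block_id, [replica_store, ...]), ...]

def client_read(metadata, filename):
    data = b""
    for block_id, replicas in lookup(metadata, filename):
        # Pick the "best" replica (real HDFS prefers the closest node;
        # here we simply take the first) and read directly from it.
        data += replicas[0][block_id]
    return data

# Two DataNodes, modelled as plain dicts of block_id -> bytes
dn1 = {"f#blk0": b"hello ", "f#blk1": b"hdfs"}
dn2 = {"f#blk0": b"hello ", "f#blk1": b"hdfs"}
metadata = {"f": [("f#blk0", [dn1, dn2]), ("f#blk1", [dn2, dn1])]}

print(client_read(metadata, "f"))  # b'hello hdfs'
```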
What is MapReduce?
• MapReduce is a software framework and programming model used for processing huge amounts of data in a distributed manner.
• MapReduce programs work in two phases, namely, Map and Reduce.
• Map tasks deal with splitting and mapping of data.
• Reduce tasks shuffle and reduce the data.
Important terminologies
• Mapper
• Reducer
• Aggregation function
• Querying function
• Daemon
• Mapper:
– Software that performs the assigned task after organising the data blocks imported using keys.
– A key is specified in the command line of the mapper.
– The command maps the key to the data, which an application uses.
• Reducer:
– Software that reduces the mapped data by applying an aggregation, query, or user-specified function.
– It provides a concise, cohesive response for the application.
• Aggregation function:
– A function that groups the values of multiple rows together to produce a single value of more significant meaning or measurement.
– Examples: functions such as count, sum, maximum, minimum, deviation, etc.
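The aggregation functions listed above can be tried directly in Python; the score values below are made up for illustration.

```python
# Aggregation functions collapse many values into one summary value.
# A hypothetical list of exam scores:
scores = [72, 85, 91, 60, 85]

count = len(scores)       # count
total = sum(scores)       # sum
maximum = max(scores)     # maximum
minimum = min(scores)     # minimum
mean = total / count

# Population standard deviation, computed from the definition:
deviation = (sum((s - mean) ** 2 for s in scores) / count) ** 0.5

print(count, total, maximum, minimum)  # 5 393 91 60
```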
• Querying function:
– A function that finds the desired values.
– Example: a function for finding the best-performing student in a class.
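The "best-performing student" example above might look like this; the names and marks are hypothetical.

```python
# A querying function picks out desired values rather than summarizing them.
# Hypothetical (name, average mark) records for a class:
students = [("Asha", 88), ("Ravi", 92), ("Mina", 79)]

def best_student(records):
    """Return the name of the student with the highest mark."""
    return max(records, key=lambda r: r[1])[0]

print(best_student(students))  # Ravi
```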
• Daemon:
– A highly dedicated program that runs in the background in a system.
– The user does not control or interact with it directly.
– Example: the MapReduce daemons in Hadoop.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
• MapReduce programs are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in the cluster.
• The cluster size, as such, does not limit the ability to process in parallel.
• The input to each phase is key-value pairs.
• In addition, every programmer needs to specify two functions:
1. a map function, and
2. a reduce function.
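The two user-supplied functions can be illustrated with the classic word-count example. The tiny `run_mapreduce` driver below only simulates what the framework does (splitting, shuffling, grouping); it is a sketch, not Hadoop itself.

```python
# Word count expressed as the two user-supplied functions: map and reduce.
# The surrounding framework (simulated here) handles shuffling and grouping.
from collections import defaultdict

def map_fn(_, line):
    """Map: emit a (word, 1) key-value pair for every word in the line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum the counts gathered for one word."""
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group all mapped values by key
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: one reduce call per distinct key
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

lines = [(0, "deer bear river"), (1, "car car river"), (2, "deer car bear")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

Note that the input to each phase is key-value pairs, exactly as stated above: the map input key is the line number (ignored here), and the reduce input is a word with its list of counts.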
Working
1. The processing tasks are submitted to Hadoop.
2. The Hadoop framework in turn manages the work of issuing jobs, tracking job completion, and copying data around the cluster between the DataNodes, with the help of the JobTracker.
3. MapReduce runs as per the Job assigned by the JobTracker, which keeps track of the jobs submitted for execution and runs a TaskTracker for tracking the tasks.
4. Finally, the cluster collects and reduces the data to obtain the result, which is sent back to the Hadoop server after completion of the given tasks.
MapReduce programming enables job scheduling and task execution as follows:
1. A client node submits an application request to the JobTracker.
2. MapReduce then performs the following steps on the request:
i. Estimate the resources needed to process the request.
ii. Analyze the states of the slave nodes.
iii. Place the mapping tasks in a queue.
iv. Monitor the progress of the tasks and, on failure, restart a task in the available time slots.
Two types of process for controlling Job Execution
• JobTracker:
– A single master process.
– Coordinates all jobs running on the cluster and assigns map and reduce tasks to run on the TaskTrackers.
– Schedules jobs submitted by clients, keeps track of TaskTrackers, and maintains the available Map and Reduce slots.
– Also monitors the execution of jobs and tasks on the cluster.
• TaskTracker:
– A number of subordinate processes.
– Execute the assigned Map and Reduce tasks and periodically report progress to the JobTracker.
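The master/subordinate split above can be sketched as a toy model. The class and method names, slot counts, and round-robin assignment are all illustrative assumptions; real Hadoop scheduling also considers data locality and free slots.

```python
# Toy model of the JobTracker/TaskTracker relationship described above:
# one master assigns tasks, subordinates run them and report back.
# Illustrative stand-ins, not the Hadoop API.

class TaskTracker:
    def __init__(self, name, slots=2):
        self.name = name
        self.slots = slots            # available map/reduce slots
        self.completed = []

    def run(self, task):
        self.completed.append(task)   # run the task, then report progress
        return f"{task} done on {self.name}"

class JobTracker:
    """Single master: schedules jobs and keeps track of TaskTrackers."""
    def __init__(self, trackers):
        self.trackers = trackers

    def submit_job(self, tasks):
        reports = []
        for i, task in enumerate(tasks):
            # Assign each task to a tracker (round-robin in this sketch)
            tracker = self.trackers[i % len(self.trackers)]
            reports.append(tracker.run(task))
        return reports

tts = [TaskTracker("tt1"), TaskTracker("tt2")]
jt = JobTracker(tts)
reports = jt.submit_job(["map-0", "map-1", "reduce-0"])
```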