CS19P16-DATA ANALYTICS
UNIT II - HDFS & MapReduce
Contents
•HDFS
•Processing data with Hadoop
•Map Reduce
•Managing Resources and Applications with Hadoop YARN
•Interacting with Hadoop Ecosystems
What is HDFS?
•HDFS is a distributed file system that provides access
to data across Hadoop clusters.
• A cluster is a group of computers that work together.
•HDFS is a key tool that manages and supports the analysis of very large volumes of data, ranging from petabytes up to zettabytes.
Why HDFS?
•Before 2011, storing and retrieving
petabytes or zettabytes of data had the following
three major challenges: Cost, Speed, Reliability.
• Also, if search components were saved on different
servers, fetching data was difficult.
•Cost
•HDFS is open-source software, so it can be used with zero licensing and support costs. It is designed to run on regular commodity computers.
•Speed
•Large Hadoop clusters can read or write more than a
terabyte of data per second. A cluster comprises multiple
systems logically interconnected in the same network.
•HDFS can easily deliver more than two gigabytes of data
per second, per computer to MapReduce, which is a data
processing framework of Hadoop.
Why HDFS?
•Reliability
• HDFS copies the data multiple times and distributes the copies to individual nodes. A node is a commodity server interconnected through a network device.
• HDFS places at least one copy of the data on a different server, so if data is deleted from any node, it can still be found elsewhere within the cluster.
•A regular file system, such as a Linux file system, differs from HDFS in the size of its data blocks. In a regular file system, each block of data is small, typically a few kilobytes. In HDFS, however, each block is 128 Megabytes by default.
•A regular file system provides access to large data but may suffer from disk input/output problems, mainly due to multiple seek operations.
•On the other hand, HDFS can read large quantities of data sequentially, with far fewer seek operations.
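•As a rough illustration (the numbers are our own, not from these slides): a 1 GB file stored with the default 128 MB HDFS block size occupies only 8 blocks, whereas the same file on a regular file system with 4 KB blocks spans roughly 262,000 blocks, which is why sequential reads in HDFS need far fewer seeks.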
Characteristics of HDFS
•HDFS has high fault-tolerance
•HDFS may consist of thousands of server machines. Each machine stores a part of the file system data. HDFS detects faults that can occur on any of the machines and recovers from them quickly and automatically.
•HDFS has high throughput
•HDFS is designed to store and scan millions of rows of data
and to count or add some subsets of the data. The
time required in this process is dependent on the
complexities involved.
•It has been designed to support large datasets in batch-
style jobs. However, the emphasis is on high throughput of
data access rather than low latency.
Characteristics of HDFS
•HDFS is economical
•HDFS is designed so that it can be built on commodity hardware and heterogeneous platforms, which are low-priced and easily available.
•HDFS stores files as a number of blocks. Each block is replicated to a few separate computers, and the replication count can be modified by the administrator. Data is divided into 128 Megabyte blocks and replicated across the local disks of cluster nodes. Metadata controls the physical location of a block and its replication within the cluster; this metadata is stored in the NameNode. HDFS is the storage system for both the input and output of MapReduce jobs. Let's understand how HDFS stores files with an example.
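The 128 MB block size and the replication count above are cluster-wide defaults, but a client may also set them per file. Below is a minimal, illustrative sketch using the Hadoop Java FileSystem API; the NameNode address and the file path are assumptions, not values from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize)
        // Here the replication factor (3) and block size (128 MB) are set per file.
        try (FSDataOutputStream out = fs.create(
                new Path("/data/sample.txt"),   // illustrative path
                true,                           // overwrite if it exists
                4096,                           // client buffer size in bytes
                (short) 3,                      // number of replicas per block
                128L * 1024 * 1024)) {          // block size in bytes (128 MB)
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}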
How Does HDFS Work?
•Example - A patron gifted a collection of popular
books to a college library. The librarian decided to
arrange the books on a small rack and then distribute
multiple copies of each book on other racks. This
way the students could easily pick up a book from
any of the racks.
•Similarly, HDFS creates multiple copies of a data
block and keeps them in separate systems for easy
access.
HDFS Architecture and Components
•There is also a Secondary NameNode, which performs tasks on behalf of the NameNode and is also considered a master node. Prior to Hadoop 2.0.0, the NameNode was a Single Point of Failure, or SPOF, in an HDFS cluster.
•Each cluster had a single NameNode. In case of an
unplanned event, such as a system failure, the
cluster would be unavailable until an operator
restarted the NameNode.
•Also, planned maintenance events, such as software or
hardware upgrades on the NameNode system, would
result in cluster downtime.
•The HDFS High Availability, or HA, feature addresses
these problems by providing the option of running
two redundant NameNodes in the same cluster in
an Active/Passive configuration with a hot standby.
•This allows a fast failover to a new NameNode in case a
system crashes or an administrator initiates a
failover for the purpose of a planned
maintenance.
•In an HA cluster, two separate systems are
configured as NameNodes. At any instance, one of
the NameNodes is in an Active state, and the other is
in a Standby state.
•The Active NameNode is responsible for all client
operations in the cluster, while the Standby
simply acts as a slave, maintaining enough state
to provide a fast failover if necessary.
HDFS Components
•The main components of HDFS
are:
•Namenode
•Secondary Namenode
•File system
•Metadata
•Datanode
Namenode
•The NameNode server is the core component of an HDFS cluster, and there can be only one NameNode server in an entire cluster. The NameNode maintains and executes file system namespace operations, such as opening, closing, and renaming files and directories, which are present in HDFS.
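These namespace operations map directly onto calls of the Hadoop Java FileSystem API. A brief, illustrative sketch follows; the paths are made up, and the snippet assumes fs.defaultFS points at the cluster (for example via core-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceOpsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));                  // create a directory
        fs.rename(new Path("/user/demo/reports"),                   // rename (move) it
                  new Path("/user/demo/archive"));
        boolean deleted = fs.delete(new Path("/user/demo/archive"), // delete it recursively
                                    true);
        System.out.println("Deleted: " + deleted);

        // Each call above is a namespace operation carried out by the NameNode.
        fs.close();
    }
}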
•The namespace image and the edit log store information about the data and the metadata. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure.
•The DataNode is a multiple-instance server: there can be many DataNode servers, and the number depends on the type of network and the storage system.
•The DataNode servers store and maintain the data blocks. The NameNode server provisions the data blocks on the basis of the type of job submitted by the client.
•The DataNode also stores and retrieves blocks when asked by clients or the NameNode. Furthermore, it serves read/write requests and performs block creation, deletion, and replication on instruction from the NameNode. Note that there can be only one Secondary NameNode server in a cluster.
Secondary Namenode
•The Secondary NameNode server maintains the edit
log and namespace image information in sync
with the NameNode server. At times, the
namespace images from the NameNode server are
not updated; therefore, you cannot totally rely on the
Secondary NameNode server for the recovery process.
File System
•HDFS exposes a file system namespace and allows user
data to be stored in files. HDFS has a hierarchical file
system with directories and files. The NameNode
manages the file system namespace, allowing clients to
work with files and directories.
•A file system supports operations like create,
remove, move, and rename. The NameNode,
apart from maintaining the file system
namespace, records any change to metadata
information.
•Now that we have learned about HDFS components,
let us see how NameNode works along with other
components.
Namenode: Operation
•The NameNode maintains two persistent files: a transaction log called the Edit Log and a namespace image called the FsImage. The Edit Log records every change that occurs in the file system metadata, such as creating a new file.
Namenode: Operation
The Edit Log is stored in the NameNode's local file system. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage, which is also kept in the NameNode's local file system.
Metadata
•When a DataNode joins a cluster, it reports the blocks it holds, and the NameNode loads this block metadata into its memory at startup. The metadata is then refreshed periodically, at user-defined or default intervals.
•When the NameNode starts up, it retrieves the Edit
Log and FsImage from its local file system. It then
updates the FsImage with Edit Log information and
stores a copy of the FsImage on the file system as a
checkpoint.
•The metadata size is limited to the RAM available
on the NameNode. A large number of small files
would require more metadata than a small number
of large files. Hence, the in-memory metadata
management issue explains why HDFS favors a small
number of large files.
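•As a rough, illustrative estimate (the per-object figure is a commonly quoted rule of thumb, not from these slides): if each file, directory, and block takes on the order of 150 bytes of NameNode memory, then 10 million single-block files need roughly 10,000,000 × 2 × 150 bytes ≈ 3 GB of NameNode RAM, while the same data packed into a few thousand large files needs only a few megabytes of metadata.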
•If a NameNode runs out of RAM, it will crash,
and the applications will not be able to use HDFS until
the NameNode is operational again.
•Data block split is an important process of HDFS
architecture. As discussed earlier, each file is split
into one or more blocks stored and replicated in
DataNodes.
Datanode
•DataNodes store the actual file blocks, while the NameNode keeps track of their names and locations. By default, each file block is 128 Megabytes. A larger block size means fewer blocks per file, which potentially reduces the amount of parallelism that can be achieved.
•The data block approach provides:
•Simplified replication
•Fault-tolerance
•Reliability.
•It also helps by shielding users from storage sub-
system details.
Block Replication Architecture
•Block replication refers to creating copies of a block on multiple DataNodes. Usually, the data file is split into parts (for example, part 0 and part 1), and each part is replicated across the DataNodes.
Replication Method
•In the replication method, each file is split into a
sequence of blocks. All blocks except the last one in
the file are of the same size. Blocks are replicated for
fault tolerance.
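•For instance (an illustrative calculation, not from these slides): a 500 MB file with the default 128 MB block size is split into four blocks of 128 MB, 128 MB, 128 MB, and 116 MB; with a replication factor of 3, the cluster stores 4 × 3 = 12 block replicas, about 1.5 GB of raw storage for 500 MB of data.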
Data Replication Topology
How Are Files Stored?
Example
•Suppose there are 2 log files to save from a local file system to the HDFS cluster.
•The cluster has 5 data nodes: node A, node B, node C, node D, and node E.
•The first log file is divided into three blocks: b1, b2, and b3, and the other log file is divided into two blocks: b4 and b5.
•The blocks b1, b2, b3, b4, and b5 are then distributed across node A, node B, node C, and node D, as shown in the diagram.
READ OPERATION
1. A client initiates the read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This process of the read() operation continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.
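As an illustration of the read path described above, here is a minimal client-side sketch using the Hadoop Java FileSystem API; the NameNode URI and file path are assumptions, not values from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);              // DistributedFileSystem under the hood

        // open() returns an FSDataInputStream backed by DFSInputStream, which asks
        // the NameNode for block locations and streams data from the DataNodes
        // block by block.
        try (FSDataInputStream in = fs.open(new Path("/logs/app.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // close() releases the connections to the DataNodes
        fs.close();
    }
}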
WRITE OPERATION
1. A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - Step no. 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not exist already and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once the new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object, which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue called the DataQueue.
5. There is one more component called the DataStreamer, which consumes this DataQueue. The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
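A minimal client-side sketch of the write path described above, again using the Hadoop Java FileSystem API; the NameNode URI and file path are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record the new file (no blocks yet) and
        // returns an FSDataOutputStream backed by DFSOutputStream, which packages
        // the written bytes into packets and streams them down the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.log"), true)) {
            out.writeBytes("first event\n");
            out.writeBytes("second event\n");
            out.hflush();   // push buffered packets to the DataNodes
        } // close() flushes the remaining packets and completes the file at the NameNode
        fs.close();
    }
}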
Replication
1. Now, the process of replication starts by creating a pipeline of DataNodes. In our case, we have chosen a replication level of 3, and hence there are 3 DataNodes in the pipeline.
2. The DataStreamer pours packets into the first DataNode in the pipeline.
3. Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
4. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from the DataNodes.
5. Once acknowledgment for a packet is received from all DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
6. After the client is done writing data, it calls the close() method (Step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgment.
7. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
MapReduce
• MAPREDUCE is a software framework and programming model used
for processing huge amounts of data.
• MapReduce programs work in two phases, namely, Map and Reduce.
• Map tasks deal with the splitting and mapping of data.
• Reduce tasks shuffle and reduce the data.
MapReduce
•Hadoop is capable of running MapReduce programs written in various
languages: Java, Ruby, Python, and C++.
•MapReduce programs are parallel in nature, thus are very useful for
performing large-scale data analysis using multiple machines in
the cluster.
•The input to each phase is key-value pairs.
•In addition, every programmer needs to specify two functions:
map function and reduce function.
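As a concrete example of these two user-supplied functions, here is a minimal sketch of the classic word-count Mapper and Reducer written against the Hadoop Java MapReduce API. The class names are illustrative; the code is not taken from these slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map function: for every line of input, emit a <word, 1> pair.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // key-value output: <word, 1>
            }
        }
    }

    // Reduce function: sum the counts grouped together by the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);          // final output: <word, frequency>
        }
    }
}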
How MapReduce Works?
•Four phases of execution:
•Phase 1: Splitting
•Phase 2: Mapping
•Phase 3: Shuffling
•Phase 4: Reducing
Working of Map Reduce
Final output of the MapReduce task
The data goes through the following phases:
• Input Splits:
•The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
• Mapping:
•This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values.
•For example, the job of the mapping phase is to count the number of occurrences of each word from the input splits and prepare a list in the form of <word, frequency>.
The data goes through the following phases:
• Shuffling
•This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output.
•In the example, the same words are clubbed together along with their respective frequencies.
• Reducing
•In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
•In the example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word.
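•A small walk-through with assumed sample data (not from these slides): for the input lines "Deer Bear River" and "Car Car River", mapping emits <Deer,1>, <Bear,1>, <River,1>, <Car,1>, <Car,1>, <River,1>; shuffling groups these into <Bear,[1]>, <Car,[1,1]>, <Deer,[1]>, <River,[1,1]>; and reducing produces the final output <Bear,1>, <Car,2>, <Deer,1>, <River,2>.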
How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
•The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing the job
•For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on Datanodes.
How MapReduce Organizes Work?
How MapReduce Organizes Work?
•A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
•It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
•Execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
•The task tracker's responsibility is to send the progress report to the job tracker.
•In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
•Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
Managing Resources and Applications with Hadoop YARN
•YARN stands for “Yet Another Resource Negotiator”.
•It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in Hadoop 1.0.
•YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
Hadoop 1.0
Hadoop 2.0
YARN
•YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as
batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system much
more efficient.
• Through its various components, it can dynamically allocate
various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features
• Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
• Compatibility: YARN supports the existing map-reduce applications without disruptions, thus making it compatible with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engines to access the cluster, thus giving organizations the benefit of multi-tenancy.
Hadoop YARN
Components of Yarn Architecture
• Client
• Resource Manager
• Scheduler
• Application manager
• Node Manager
• Application Master
• Container
• Client: It submits map-reduce jobs.
Components of Yarn Architecture
• Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:
• Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
• Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if it fails.
Components of Yarn Architecture
• Node Manager: It takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep itself up to date with the Resource Manager. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Components of Yarn Architecture
• Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. The Application Master asks the Node Manager to launch the container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
Components of Yarn Architecture
• Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are launched by means of a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:
Application workflow
1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
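The driver below is a minimal sketch of the client code that triggers this workflow for the word-count example sketched earlier; the input and output paths are illustrative. Submitting the job hands it to the Resource Manager, which launches an Application Master that then negotiates containers for the map and reduce tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");        // step 1: the client creates a job

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // map function from the earlier sketch
        job.setReducerClass(WordCount.IntSumReducer.class);   // reduce function from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input/logs"));         // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        // Submitting the job sends it to the Resource Manager, which starts an
        // Application Master; waitForCompletion() then polls the job status until
        // the Application Master reports that all map and reduce tasks are done.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}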
HDFS_READ FILE
HDFS_WRITE FILE
References
•https://www.simplilearn.com/tutorials/hadoop-tutorial/hdfs
•HDFS-READ
• https://www.youtube.com/watch?v=Ax7EhEsVVzE
•HDFS-WRITE
• https://www.youtube.com/watch?v=0QJKx4A4L7Y
•https://www.youtube.com/watch?v=nWqdePeOh9M
•https://techvidvan.com/tutorials/how-hadoop-works-internally/
•https://www.guru99.com/learn-hdfs-a-beginners-guide.html
•https://www.guru99.com/introduction-to-mapreduce.html
•https://data-flair.training/blogs/hdfs-data-write-operation/
THANK YOU