Bda Unit 4-1
1. Hadoop Common
Purpose: Provides shared libraries and utilities that support the other Hadoop modules.
Functions:
o Contains the common utilities and infrastructure used by Hadoop components
like HDFS, YARN, and MapReduce.
o Includes Java libraries, scripts, and configuration files required for startup and
operation.
o Provides essential services like I/O, RPC, serialization, and file-based
operations.
2. HDFS (Hadoop Distributed File System)
Purpose: A distributed file system that stores data across multiple machines in a Hadoop cluster, ensuring redundancy and fault tolerance.
Functions:
o Distributed Storage: Breaks files into blocks and distributes them across
different nodes in the cluster.
o Replication: Each block is replicated (default is 3 copies) to ensure fault
tolerance in case of node failure.
o High Throughput: Optimized for handling large files with high throughput
rather than low-latency access to small files.
o Components:
NameNode: Manages metadata, like the directory structure and
locations of blocks.
DataNodes: Store the actual data blocks.
In Hadoop, both the hadoop fs and hdfs dfs commands are used to interact with the Hadoop Distributed File System (HDFS).
hadoop fs Command
Overview: hadoop fs is a generic file system command that can operate on
different types of file systems supported by Hadoop, not just HDFS.
Purpose: It is designed to work with any file system Hadoop supports (e.g., HDFS,
Local File System, Amazon S3, Azure Blob Storage, etc.). This command is flexible
and used to interact with both HDFS and non-HDFS storage systems.
hadoop fs <args>
fs is used for a generic file system and can point to any file system, such as the local file system, HDFS, WebHDFS, S3, etc.
hadoop dfs <args>
hdfs dfs <args>
dfs points to the Distributed File System and is specific to HDFS. You can use it to execute operations on HDFS. hadoop dfs is now deprecated, and you have to use hdfs dfs instead.
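For illustration, a few equivalent invocations (the paths used here are only examples):
hadoop fs -ls /
hadoop fs -put sample.txt /data/
hdfs dfs -ls /data
hdfs dfs -cat /data/sample.txt
When the default filesystem is HDFS, hadoop fs and hdfs dfs behave the same way; hadoop fs additionally works against other filesystems such as the local filesystem or S3.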
3. YARN (Yet Another Resource Negotiator)
Purpose: Manages cluster resources and schedules jobs, through the ResourceManager and NodeManager daemons described later in this unit.
4. MapReduce
Purpose: A programming model for parallel, distributed batch processing of large data sets stored in HDFS (covered in detail later in this unit).
Supporting and Ecosystem Components (Not Core Modules but Part of the
Hadoop Ecosystem):
In addition to these core modules, the Hadoop ecosystem includes various projects and tools
that extend the capabilities of Hadoop for different use cases:
1. Hive: A data warehousing system built on top of Hadoop. Provides SQL-like query
language (HiveQL) for querying large datasets stored in HDFS.
2. HBase: A NoSQL database that provides real-time read/write access to large datasets
stored in HDFS.
3. Pig: A high-level platform for creating MapReduce programs using a data flow
scripting language (Pig Latin).
4. Flume: A tool for efficiently collecting, aggregating, and moving large amounts of
log data into Hadoop.
5. Sqoop: A tool designed for transferring bulk data between Hadoop and relational
databases.
6. Oozie: A workflow scheduler that manages Hadoop jobs, including MapReduce, Pig,
and Hive jobs.
7. Zookeeper: A distributed coordination service used to manage synchronization,
configuration, and group services in large Hadoop clusters.
8. Mahout: A machine learning library designed for building scalable machine learning
algorithms on top of Hadoop.
Each of these modules plays a crucial role in the overall functioning of the Apache Hadoop
framework, making it a robust system for handling and processing big data across distributed
clusters.
RDBMS vs HADOOP
The primary difference between RDBMS (Relational Database Management System) and
Hadoop lies in how they handle and process data. Both are used for managing data, but their
purposes, data structures, scalability, and processing models are quite different.
1. Data Structure
RDBMS:
Structured data: RDBMS is designed to handle structured data (i.e., data organized
in tables with defined schema).
Fixed schema: RDBMS uses a well-defined schema (table structure) with columns
that have predefined data types. This schema must be defined before inserting data.
ACID properties: RDBMS follows ACID (Atomicity, Consistency, Isolation,
Durability) properties, ensuring strict transaction control, consistency, and data
integrity.
Hadoop:
Any data type: Hadoop can store and process structured, semi-structured, and unstructured data (e.g., log files, social media data, images, videos).
Flexible schema: No fixed schema has to be defined before data is loaded; structure can be applied when the data is read.
2. Data Storage
RDBMS:
Centralized, table-based storage: Data is stored in tables on a single server (or a small, tightly coupled set of servers), typically on specialized hardware.
Hadoop:
Distributed storage (HDFS): Hadoop uses HDFS (Hadoop Distributed File System)
to store data across multiple machines in a cluster. Data is broken into blocks and
distributed across nodes.
Scalable storage: Hadoop can handle massive amounts of data (petabytes or more)
and scale horizontally by adding more machines (commodity hardware).
Block-based storage: Data is stored in large blocks (default 128MB or 256MB),
which helps in efficiently managing large datasets.
3. Scalability
RDBMS:
Vertical scaling: Scales by adding more CPU, RAM, or storage to a single, more powerful server, which becomes expensive at large data volumes.
Hadoop:
Horizontal scaling: Scales by adding more commodity machines to the cluster, allowing it to grow to petabytes of data.
Data Processing
RDBMS:
Real-time transaction processing (OLTP): Optimized for low-latency reads and writes and interactive queries on structured data.
Hadoop:
Batch processing: Hadoop is designed for batch processing, where large amounts of
data are processed in bulk over time. It is not optimized for real-time querying or
small transactions.
MapReduce model: Hadoop uses the MapReduce programming model to distribute
processing tasks across multiple machines, enabling parallel processing of large
datasets.
OLAP (Online Analytical Processing): Hadoop is more suited for OLAP workloads,
which involve analyzing large datasets for complex computations, data mining, and
reporting.
6. Performance
RDBMS:
Optimized for small data: RDBMS performs well with small to medium-sized
datasets and when dealing with complex queries on structured data.
Real-time performance: Supports real-time transaction processing and is optimized
for low-latency reads and writes.
Hadoop:
Optimized for large-scale data: Hadoop performs better when dealing with large,
distributed data sets. However, it's optimized for batch processing and not for real-
time queries.
Higher latency: Hadoop processing (e.g., MapReduce jobs) can take longer
compared to real-time operations in RDBMS due to the nature of batch processing.
7. Cost
RDBMS:
Expensive scaling: As RDBMS scales vertically (more powerful servers), the cost
increases significantly due to the requirement of specialized hardware.
Commercial licenses: Many RDBMS systems (e.g., Oracle, SQL Server) come with
high licensing fees, although open-source options like MySQL or PostgreSQL are
available.
Hadoop:
Low-cost scaling: Runs on inexpensive commodity hardware and is open source, so clusters can be scaled out at relatively low cost.
8. Use Cases
RDBMS:
Traditional databases: Used for managing small to medium-sized data sets with
structured data.
Transactional systems: Ideal for banking, retail, healthcare, or any system requiring
real-time processing and ACID compliance.
Relational data: Best suited for applications with relational data and strict
consistency requirements.
Hadoop:
Big data processing: Ideal for applications involving large-scale data analysis, ETL
(Extract, Transform, Load), and batch processing of vast amounts of data.
Unstructured data: Suitable for handling unstructured or semi-structured data (e.g.,
log files, social media data, images, videos).
Data lakes: Used in scenarios where massive amounts of raw data need to be ingested
and stored for later analysis (e.g., in data lakes).
Conclusion:
RDBMS is best suited for structured, transactional workloads on modest volumes of data where ACID guarantees and low-latency access matter, while Hadoop is designed for large-scale, batch-oriented storage and processing of structured and unstructured data on clusters of commodity hardware.
Hadoop Installation Modes
Hadoop can be installed and run in three different modes, each designed for different use
cases and stages of development or deployment. These modes determine how Hadoop is
configured, where services run, and how the system processes data. The three primary
Hadoop installation modes are:
1. Standalone (Local) Mode
Overview: Hadoop runs as a single Java process on the local machine, using the local filesystem instead of HDFS.
Features:
No HDFS: All files are stored in the local filesystem, not in HDFS.
Single JVM: Both the Map and Reduce tasks are executed on a single Java Virtual
Machine (JVM).
No daemons: None of the Hadoop daemons (like NameNode, DataNode,
ResourceManager, etc.) are started.
Fast setup: The easiest to set up and configure.
Use case: Used mainly for debugging, learning, and developing small applications
without involving multiple nodes or complex configurations.
Limitations:
Since there is no HDFS and no daemons, distributed storage, replication, and cluster behaviour cannot be tested in this mode.
2. Pseudo-Distributed Mode
Overview: All Hadoop daemons run on a single machine, each as a separate process, simulating a small cluster.
Features:
HDFS is enabled: Data is stored in HDFS, allowing users to test Hadoop’s storage
features (like data replication and fault tolerance).
All daemons run locally: The daemons for HDFS and YARN run as separate
processes on the same machine.
Testing distributed environment: Though everything runs on a single machine, you
can test job scheduling, fault tolerance, and parallel execution.
Higher resource usage: Since all daemons run on one machine, it requires more CPU
and memory compared to standalone mode.
Limitations:
While it simulates a real cluster, all processes still run on a single machine, so it lacks
the true benefits of distributed computing (scalability, performance).
Resource-intensive compared to standalone mode.
3. Fully Distributed Mode
Overview: Hadoop runs on a real cluster of machines, with the daemons distributed across dedicated master and slave nodes.
Features:
HDFS is fully distributed: Data is broken into blocks, distributed across multiple
nodes, and replicated for fault tolerance.
Multiple nodes: Hadoop daemons (NameNode, DataNode, ResourceManager, etc.)
run on different machines.
Fault tolerance: Hadoop’s fault-tolerant features (like block replication and task re-
execution) are fully operational.
Scalability: Can scale horizontally by adding more nodes to the cluster, allowing
Hadoop to handle petabytes of data.
High availability: High availability can be configured for critical components (like
the NameNode).
Used for –
o Big data analytics for large enterprises where data is stored and processed
across many nodes.
o Parallel processing of large datasets using MapReduce or other engines like
Spark, Hive, or Pig.
Comparison Table:
Standalone mode: no HDFS (local filesystem only), no daemons, single JVM; used for learning, debugging, and small development jobs.
Pseudo-distributed mode: HDFS on a single node, all daemons running as separate processes on one machine; used for testing and development.
Fully distributed mode: HDFS across multiple nodes, daemons spread over master and slave machines; used for production and large-scale processing.
Each mode is tailored to specific needs, from development and testing to full-scale
production in distributed computing environments.
Hadoop Distributors
Overview of Hadoop Distributions
Cloudera CDH (e.g., CDH 5.0): completely open source; a newer distribution.
Hortonworks (www.hortonworks.com), Hortonworks Data Platform (HDP 1.0, HDP 2.0): tracks Apache Hadoop closely; comes with tools to manage and administer a cluster.
MapR (www.mapr.com), M3 / M5 / M8 (free / premium model): has its own file system (an alternative to HDFS); boasts higher performance; a nice set of tools to manage and administer a cluster; does not suffer from a Single Point of Failure; offers some cool features like mirroring, snapshots, etc.
HADOOP DAEMONS
1. Name Node
2. Data Node
3. Secondary Name Node
4. Resource Manager
5. Node Manager
Name node, Secondary Name Node, and Resource Manager work on the Master System, while the Node Manager and Data Node work on the Slave machines.
1. Name Node: The Name Node works on the Master System. The primary purpose of the Name Node is to manage all the metadata.
Features: It never stores the data that is present in the file. As the Namenode works on the Master System, the Master System should have good processing power and more RAM than the Slaves. It stores the information of the DataNodes, such as their Block IDs and Number of Blocks.
2. DataNode: The DataNode works on the Slave system. The NameNode always instructs the DataNode about storing the data. The DataNode is a program that runs on the slave system and serves read/write requests from the client. As the data is stored on the DataNode, it should possess large storage to hold more data.
3. Secondary NameNode: The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster fails or crashes, the Secondary Namenode takes the hourly backup (checkpoint) of that metadata and stores it in a file named fsimage. This file is then transferred to a new system; the new metadata is assigned to that new system, a new Master is created with this metadata, and the cluster is made to run correctly again.
4. Resource Manager: The Resource Manager is also known as the Global Master Daemon that works on the Master System. The Resource Manager manages the resources for the applications that are running in a Hadoop Cluster. The Resource Manager mainly consists of two things:
1. ApplicationsManager
2. Scheduler
An ApplicationsManager is responsible for accepting the request from a client and also allocates a memory resource on the Slaves in a Hadoop cluster to host the Application Master. The Scheduler is utilized for providing resources for applications in a Hadoop cluster and for monitoring these applications.
5. Node Manager: The Node Manager works on the Slave system and manages the memory and disk resources within the node. Each Slave node in a Hadoop cluster has a single NodeManager daemon running on it. It also sends this monitoring information to the Resource Manager.
Blocks
Now, as we know, the data in HDFS is scattered across the DataNodes as blocks. Let's have a look at what a block is and how it is formed.
Blocks are nothing but the smallest continuous location on your hard drive where data is stored. In general, in any file system, you store the data as a collection of blocks. Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
Data Replication
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable. So, as you can see in the figure below, each block is replicated three times and stored on different DataNodes (considering the default replication factor):
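As a sketch, the cluster-wide replication factor is set with the dfs.replication property in hdfs-site.xml (the value shown simply restates the default):
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
The replication of an existing file can also be changed from the shell, e.g. hadoop fs -setrep -w 2 /path/to/file (the path is illustrative).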
core-site.xml – This configuration file contains Hadoop core configuration settings, for
example, I/O settings, very common for MapReduce and HDFS.
mapred-site.xml – This configuration file specifies a framework name for MapReduce
by setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It
also specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and
NodeManager.
HADOOP DAEMONS
1. Name Node
2. Data Node
3. Secondary Name Node
4. Resource Manager
5. Node Manager
Name node, Secondary Name Node, and Resource Manager work on a Master
System while the Node Manager and Data Node work on the Slave machine.
HDFS Daemons:
(i) NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to perform
I/O tasks.
Blocks: HDFS breaks large file into smaller pieces called blocks.
rackID: The NameNode uses the rackID to identify DataNodes in a rack (a rack is a collection of DataNodes within the cluster). The NameNode keeps track of the blocks of a file.
File System Namespace: The NameNode is the bookkeeper of HDFS. It keeps track of how files are broken down into blocks and which DataNode stores these blocks. It is a collection of files in the cluster.
FsImage: file system namespace includes mapping of blocks of a file, file properties
and is stored in a file called FsImage.
EditLog: namenode uses an EditLog (transaction log) to record every transaction
that happens to the file system metadata.
The NameNode is a single point of failure of the Hadoop cluster.
(ii) DataNode
There are multiple DataNodes per cluster. Each slave machine in the cluster has a DataNode daemon for reading and writing HDFS blocks of the actual file on the local file system.
During pipeline reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends a "heartbeat" message to the NameNode to ensure the connectivity between the NameNode and the DataNode.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed, and its blocks are re-replicated.
Basic terminology:
Rack: A rack is a collection of nodes (typically 30 to 40) that are physically close together and connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks.
Cluster: A Hadoop Cluster (or just cluster from now on) is a collection of racks.
File Blocks:
Blocks are nothing but the smallest continuous location on your hard drive where
data is stored. In general, in any of the File System, you store the data as a
collection of blocks. Similarly, HDFS stores each file as blocks which are scattered
throughout the Apache Hadoop cluster. The default size of each block is 128 MB
in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can configure as
per your requirement. All blocks of the file are the same size except the last block, which can be either the same size or smaller. The files are split into 128 MB blocks and then stored in the Hadoop file system.
Components of HDFS:
HDFS is a block-structured file system where each file is divided into
blocks of a pre-determined size. These blocks are stored across a
cluster of one or several machines. The Apache Hadoop HDFS architecture follows a Master/Slave architecture, where a cluster comprises a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
NameNode:
NameNode is the master node in the Apache Hadoop HDFS
Architecture that maintains and manages the blocks present on the
DataNodes (slave nodes). NameNode is a very highly available server
that manages the File System Namespace and controls access to files
by clients. The HDFS architecture is built in such a way that the user
data never resides on the NameNode. Name node contains metadata
and the data resides on DataNodes only.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave
nodes)
It manages the file system namespace.
It records the metadata of all the files stored in the cluster, e.g.
the location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the
metadata:
o FsImage: It contains the complete state of the file
system namespace since the start of the NameNode.
o EditLogs: It contains all the recent modifications made
to the file system with respect to the most recent
FsImage.
It records each change that takes place to the file system
metadata. For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are
live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks
are located.
The NameNode is also responsible for taking care of the replication factor of all the blocks.
In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, a non-expensive system which is not of high quality or high availability. The DataNode is a block server that stores the data in the local file system (e.g., ext3 or ext4).
Functions of DataNode:
o Stores the actual data blocks of HDFS files on the local file system.
o Serves read and write requests from the clients.
o Sends Heartbeats and block reports to the NameNode at regular intervals.
Secondary NameNode:
It is a separate physical machine which acts as a helper of name
node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
In HDFS, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack; it follows the Rack Awareness Algorithm to reduce latency as well as to improve fault tolerance.
As we know, the default Replication Factor is 3. When a client wants to place a file in HDFS, Hadoop places the replicas as follows:
1) The first replica is written to the data node creating the
file, to improve the write performance because of the
write affinity.
2) The second replica is written to another data
node within the same rack, to minimize the cross-
rack network traffic.
3) The third replica is written to a data node in a different
rack, ensuring that even if a switch or rack fails, the data is
not lost (Rack awareness).
This configuration is maintained to make sure that the File is
never lost in case of a Node Failure or even an entire Rack Failure.
https://www.npntraining.com/blog/anatomy-of-file-read-and-write/
All the metadata information is with namenode and the original data is stored on the
datanodes.
The figure below gives an idea of how data flows between the Client interacting with HDFS, i.e. the Namenode and the Datanodes.
The following steps are involved in reading the file from HDFS:
Let’s suppose a Client (a HDFS Client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem then calls the NameNode (using RPC) to determine the locations of the blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which it shows to the DataNode for authentication.
Step 3: The client then calls read() on the stream. DFSInputStream, which has
stored the DataNode addresses for the first few blocks in the file, then connects to
the first closest DataNode for the first block in the file.
Step 4: Data is streamed from the DataNode back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the DataNode , then find the best DataNode for the next block. This
happens transparently to the client, which from its point of view is just reading a
continuous stream.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections
to datanodes as the client reads through the stream. It will also call the namenode to
retrieve the datanode locations for the next batch of blocks as needed. When the
client has finished reading, it calls close() on the FSDataInputStream.
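As a minimal sketch of the same read path using the Hadoop Java API (the HDFS URI below is an assumption for illustration), the client just opens the file through FileSystem and reads from the returned stream; the block-by-block DataNode handling described in the steps above happens inside DFSInputStream:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/sample.txt"; // illustrative HDFS path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf); // obtain the DistributedFileSystem
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                    // Steps 1-2: open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // Steps 3-5: read() is called repeatedly
        } finally {
            IOUtils.closeStream(in);                        // Step 6: close() on the stream
        }
    }
}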
The following steps are involved in writing a file to HDFS:
Step 1: The client creates the file by calling create() on the DistributedFileSystem object.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.
The namenode performs various checks to make sure the file doesn’t already exist
and that the client has the right permissions to create the file.
If these checks pass, the namenode makes a record of the new file; otherwise, file
creation fails and the client is thrown an IOException.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it
writes to an internal queue, called the data queue.
The data queue is consumed by the DataStreamer, which is responsible for asking
the namenode to allocate new blocks by picking a list of suitable datanodes to store
the replicas.
The list of datanodes forms a pipeline, and here we’ll assume the replication level is
three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which
stores the packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the third
(and last) datanode in the pipeline.
A packet is removed from the ack queue only when it has been acknowledged by all
the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete
The namenode already knows which blocks the file is made up of , so it only has to
wait for blocks to be minimally replicated before returning successfully.
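A corresponding minimal sketch of the write path with the Java API (the local and HDFS paths are assumptions for illustration); the packet and pipeline mechanics described above are handled internally by DFSOutputStream:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileWrite {
    public static void main(String[] args) throws Exception {
        String localSrc = "/tmp/sample.txt";                 // illustrative local source file
        String dst = "hdfs://namenode:8020/data/sample.txt"; // illustrative HDFS destination
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));         // Steps 1-2: create() asks the namenode to create the file
        // Steps 3-5: data is split into packets and pipelined to the datanodes;
        // close=true flushes the remaining packets and signals completion (Step 6).
        IOUtils.copyBytes(in, out, 4096, true);
    }
}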
Subsequent to this, the DistributedFileSystem returns an FSDataInputStream to the client to read from the file.
3. Client then calls read() on the stream DFSInputStream, which has addresses of
DataNodes for the first few block of the file.
4. Client calls read() repeatedly to stream the data from the DataNode.
5. When the end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and
subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
4. Data streamer streams the packets to the first DataNode in the pipeline. It stores
packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an "ack queue" of the packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.
1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp
2. Remove a file in specified path:
Syntax: hdfs dfs -rm <path>
Eg. hdfs dfs -rm /chp/file1.txt
Hadoop Configuration
Cluster Specification
Hadoop is designed to run on commodity hardware.
How large should your cluster be?
There isn’t an exact answer to this question, but the beauty of Hadoop is that you
can start with a small cluster (say, 10 nodes) and grow it as your storage and
computational needs grow.
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the
namenode and the jobtracker on a single master machine (as long as at least one
copy of the namenode’s metadata is stored on a remote filesystem).
As the cluster and the number of files stored in HDFS grow, the namenode needs
more memory, so the namenode and jobtracker should be moved onto separate
machines.
The secondary namenode can be run on the same machine as the namenode, but
again for reasons of memory usage (the secondary has the same memory
requirements as the primary), it is best to run it on a separate piece of hardware,
especially for larger clusters.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology,
as illustrated in Figure 1. Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack (only three are shown in the diagram) and an uplink to a core switch or router (which is normally 1 Gb or better).
The salient point is that the aggregate bandwidth between nodes on the same rack is
much greater than that between nodes on different racks.
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop
so that it knows the topology of your network. For multirack clusters, you need
to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes.
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed in Table-1.
These files are all found in the etc/hadoop directory of the Hadoop
distribution.
The configuration directory can be relocated to another part of the
filesystem (outside the Hadoop installation, which makes upgrades marginally
easier) as long as daemons are started with the --config option (or, equivalently,
with the HADOOP_CONF_DIR environment variable set) specifying the
location of this directory on the local filesystem.
Configuration Management
Hadoop does not have a single, global location for configuration information.
Instead, each Hadoop node in the cluster has its own set of configuration files,
and it is up to administrators to ensure that they are kept in sync across the system.
Hadoop provides a rudimentary facility for synchronizing configuration using rsync;
alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Hadoop is designed so that it is possible to have a single set of configuration files
that are used for all master and worker machines.
For a cluster of any size, it can be a challenge to keep all of the machines in
sync: consider what happens if the machine is unavailable when you push out
an update—who ensures it gets the update when it becomes available? This is
a big problem and can lead to divergent installations, so even if you use the
Hadoop control scripts for managing Hadoop, it may be a good idea to use
configuration management tools for maintaining the cluster. These tools are
also excellent for doing regular maintenance, such as patching security holes and
updating system packages.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping
daemons across the whole cluster. To use these scripts (which can be found in the
bin directory),you need to tell Hadoop which machines are in the cluster. There are
two files for this purpose, called masters and slaves, each of which contains a
list of the machine hostnames or IP addresses, one per line. Both masters and
slaves files reside in the configuration directory, although the slaves file may be
placed elsewhere (and given another name) by changing the HADOOP_SLAVES
setting in hadoop-env.sh. Also, these files do not need to be distributed to worker
nodes, since they are used only by the control scripts running on the
namenode or jobtracker.
For example, the start-dfs.sh script, which starts all the HDFS daemons in the
cluster, runs the namenode on the machine the script is run on.
In slightly more detail, it:
1. Starts a namenode on the local machine (the machine that the script is run on)
2. Starts a datanode on each machine listed in the slaves file
3. Starts a secondary namenode on each machine listed in the masters file
There is a similar script called start-mapred.sh, which starts all the MapReduce
daemons in the cluster.
More specifically, it:
1. Starts a jobtracker on the local machine
2. Starts a tasktracker on each machine listed in the slaves file
Also there are stop-dfs.sh and stop-mapred.sh scripts to stop the daemons
started by the corresponding start script.
Whether the master daemons run on one or more nodes, the following
instructions apply:
• Run the HDFS control scripts from the namenode machine. The masters file should
contain the address of the secondary namenode.
• Run the MapReduce control scripts from the jobtracker machine.
When the namenode and jobtracker are on separate nodes, their slaves files need to
be kept in sync, since each node in the cluster should run a datanode and a
tasktracker.
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it
runs. This is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh.
In addition, the task tracker launches separate child JVMs to run map and reduce
tasks in, so we need to factor these into the total memory footprint of a worker
machine.
The maximum number of map tasks that can run on a tasktracker at one time
is controlled by the mapred.tasktracker.map.tasks.maximum property, which
defaults to two tasks.
There is a corresponding property for reduce tasks, mapred.task
tracker.reduce.tasks.maximum, which also defaults to two tasks.
The tasktracker is said to have two map slots and two reduce slots.
Hadoop also provides settings to control how much memory is used for MapReduce
operations. These can be set on a per-job basis also.
For the master nodes, each of the namenode, secondary namenode, and
jobtracker daemons uses 1,000 MB by default, a total of 3,000 MB.
Java
The location of the Java implementation to use is determined by the
JAVA_HOME setting in hadoop-env.sh or from the JAVA_HOME shell
environment variable, if not set in hadoopenv. sh.
It’s a good idea to set the value in hadoop-env.sh, so that it is clearly defined in one
place and to ensure that the whole cluster is using the same version of Java.
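Both of these settings live in hadoop-env.sh. As a small sketch (the Java path and heap size below are assumptions for illustration, not recommendations):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk   # illustrative Java installation path
export HADOOP_HEAPSIZE=2000                    # daemon heap size in MB (default is 1,000 MB)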
System logfiles
By default, Hadoop writes its system logfiles to the logs directory under the Hadoop installation; this location can be changed with the HADOOP_LOG_DIR setting in hadoop-env.sh.
For example, a mapred-site.xml might contain the following properties:
<configuration>
<property>
<name>mapred.local.dir</name>
<value>/disk1/mapred/local,/disk2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>
HDFS
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the
namenode’s hostname or IP address, and port is the port that the namenode
will listen on for RPCs. If no port is specified, the default of 8020 is used.
The fs.default.name property also doubles as specifying the default filesystem. The
default filesystem is used to resolve relative paths, which are handy to use since they save typing (and they avoid hardcoding knowledge of a particular namenode's address).
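A minimal core-site.xml sketch for this (the namenode hostname is an assumption for illustration):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
</property>
</configuration>
With no port given, the namenode listens on the default port 8020.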
MapReduce
To run MapReduce, you need to designate one machine as a jobtracker, which on
small clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the
jobtracker will listen on. Note that this property is not a URI, but a host-port pair,
separated by a colon. The port number 8021 is a common choice.
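For instance, mapred-site.xml might contain (the hostname is an assumption for illustration):
<property>
<name>mapred.job.tracker</name>
<value>jobtracker-host:8021</value>
</property>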
During a MapReduce job, intermediate data and working files are written to
temporary local files. Since this data includes the potentially very large output of map
tasks, you need to ensure that the mapred.local.dir property, which controls
the location of local temporary storage, is configured to use disk partitions
that are large enough. The mapred.local.dir property takes a comma-separated
list of directory names, and you should use all available local disks to spread
disk I/O.
MapReduce uses a distributed filesystem to share files (such as the job JAR file)
with the tasktrackers that run the MapReduce tasks. The mapred.system.dir
property is used to specify a directory where these files can be stored. This
directory is resolved relative to the default filesystem (configured in fs.default.name),
which is usually HDFS.
The machines that are allowed to connect to the namenode as datanodes and to the jobtracker as tasktrackers can be listed in a file. The file is specified using the dfs.hosts (for datanodes) and mapred.hosts (for tasktrackers) properties, and the corresponding dfs.hosts.exclude and mapred.hosts.exclude files are used for decommissioning.
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a
conservative setting, and with modern hardware and operating systems, you will
likely see performance benefits by increasing it; 128 KB (131,072 bytes) is a
common choice. Set this using the io.file.buffer.size property in core-site.xml.
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB
(134,217,728 bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure
on the namenode and to give mappers more data to work on. Set this using the
dfs.block.size property in hdfs-site.xml.
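A sketch of both settings using the example values quoted above (the buffer size goes in core-site.xml, the block size in hdfs-site.xml):
In core-site.xml:
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
In hdfs-site.xml:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>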
Reserved storage space
By default, datanodes will try to use all of the space available in their storage
directories.If you want to reserve some space on the storage volumes for non-HDFS
use, then you can set dfs.datanode.du.reserved to the amount, in bytes, of space to
reserve.
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually
deleted,but rather are moved to a trash folder, where they remain for a minimum
period before being permanently deleted by the system. The minimum period in
minutes that a file will remain in the trash is set using the fs.trash.interval
configuration property in core-site.xml. By default, the trash interval is zero, which
disables trash.
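For example, to keep deleted files in the trash for 24 hours (1,440 minutes), core-site.xml could contain:
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>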
HDFS will automatically delete files in trash folders, but other filesystems will not, so
you have to arrange for this to be done periodically. You can expunge the trash,
which will delete files that have been in the trash longer than their minimum period,
using the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
Job scheduler
The default MapReduce job scheduler is a simple FIFO queue; for multi-user clusters, the Fair Scheduler or the Capacity Scheduler can be configured instead.
YARN Configuration
YARN also has a job history server daemon that provides users with details of past
job runs, and a web app proxy server for providing a secure way for users to access
the UI provided by YARN applications.
YARN has its own set of configuration files, listed in Table-4; these are used in addition to those in Table-1.
Table-4. YARN configuration files
1. core-site.xml
o Contains configuration settings for the Hadoop core, such as the default
filesystem and I/O settings.
o Key configurations:
fs.defaultFS: Sets the default file system URI (e.g., HDFS, local file
system).
hadoop.tmp.dir: The default directory for Hadoop to store temporary
files.
io.file.buffer.size: Buffer size for reading and writing data to/from
HDFS.
2. hdfs-site.xml
o Configures settings for HDFS, including replication, storage, and
namenode/datanode settings.
o Key configurations:
dfs.replication: Sets the default replication factor for HDFS blocks.
dfs.blocksize: Specifies the block size for HDFS files.
dfs.namenode.name.dir: Local filesystem paths for storing
NameNode metadata.
dfs.datanode.data.dir: Paths where DataNodes store data.
3. mapred-site.xml (used to configure MapReduce; in Hadoop 2.x and later, MapReduce runs on YARN)
o Configures settings for MapReduce, such as job execution parameters.
o Key configurations:
mapreduce.framework.name: Defines the execution framework for
MapReduce (e.g., YARN).
mapred.job.tracker: Specifies the JobTracker (used in older MapReduce 1 versions; replaced by the YARN ResourceManager in newer versions).
mapreduce.task.io.sort.mb: Buffer size for sorting map output before
writing to disk.
mapreduce.map.memory.mb / mapreduce.reduce.memory.mb:
Memory allocation for map and reduce tasks.
4. yarn-site.xml (Hadoop 2.x and later)
o Configures settings for YARN (Yet Another Resource Negotiator), which
manages resources and job scheduling.
o Key configurations:
yarn.resourcemanager.hostname: Specifies the hostname of the
YARN ResourceManager.
yarn.nodemanager.resource.memory-mb: Maximum memory
available for NodeManager containers.
yarn.scheduler.maximum-allocation-mb: Maximum memory that
can be allocated to a single container.
yarn.nodemanager.vmem-pmem-ratio: Virtual memory to physical
memory ratio.
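As a sketch combining some of the settings listed above (the hostname and memory value are assumptions for illustration, not recommendations):
In mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
In yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager-host</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>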
Configuration Modes
1. Standalone Mode
o Hadoop runs on a single JVM, mainly for testing and debugging.
o Only uses the local filesystem, so no HDFS configuration is required.
2. Pseudo-Distributed Mode
o Runs on a single machine but simulates a cluster by using multiple JVMs.
o HDFS and YARN are set up as if it were a cluster, but everything runs on one
machine.
3. Fully Distributed Mode
o Runs Hadoop on a cluster of multiple nodes, distributing storage and
processing across the nodes.
o This requires setting up and configuring all the above-mentioned configuration
files and tuning them according to cluster size and workload.
MapReduce Framework
MapReduce is a programming framework that allows us to perform parallel and
distributed processing on huge data sets in distributed environment.
MapReduce is implemented by means of the following components:
1. Architectural components
a) Job Tracker
b) Task Trackers
2. Functional components
a) Mapper (Map)
b) Combiner (local Reducer)
c) Reducer (Reduce)
Architectural Components:
The complete execution process (execution of Map and Reduce
tasks, both) is controlled by two types of entities called:
1. Job Tracker: Acts like a master (responsible for the complete execution of the submitted job). The JobTracker is a master daemon responsible for executing the overall MapReduce job. It provides connectivity between Hadoop and your application.
2. Multiple Task Trackers: Act like slaves, each of them performing a part of the job. This daemon is responsible for executing the individual tasks that are assigned by the Job Tracker.
For every job submitted for execution in the system, there is
one Job tracker that resides on Name node and there are
multiple task trackers which reside on Data node.
Functional Components:
A job (the complete work) is submitted to the master; Hadoop divides the job into two phases, the map phase and the reduce phase. In between Map and Reduce there is a small phase called Shuffle & Sort.
1. Map tasks
2. Reduce tasks
Now, each Reducer counts the values which are present in its list of values. As shown in the figure, the reducer gets a list of values which is [1, 1] for the key Bear. Then, it counts the number of ones in the list and gives the final output as (Bear, 2).
Finally, all the output key/value pairs are then collected and written in the
output file.
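A minimal sketch of this word-count logic with the Hadoop MapReduce Java API (class and field names are illustrative): the Mapper emits (word, 1) pairs and the Reducer sums the list of values it receives for each key, e.g. [1, 1] for Bear giving (Bear, 2).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: for every word in the input line, emit (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reduce phase: sum the list of ones for each word, e.g. Bear -> [1, 1] -> 2
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}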
Heartbeat Signal:
HDFS follows a master slave architecture. Namenode (master)
stores metadata about the data and Datanodes store/process the
actual data (and its replications).
Now the namenode should know if any datanode in a cluster is
down (power failure/network failure) otherwise it will continue
assigning tasks or sending data/replications to that dead datanode.
Heartbeat is a mechanism for detecting datanode failure and
ensuring that the link between datanodes and namenode is intact.
In Hadoop, the Name node and the data nodes communicate using Heartbeats. The Heartbeat is the signal that is sent by the datanode to the namenode at regular intervals of time to indicate its presence, i.e. to indicate that it is alive and available.
The default heartbeat interval is 3 seconds. If the DataNode in
HDFS does not send heartbeat to NameNode in ten minutes, then
NameNode considers the DataNode to be out of service and the
Blocks replicas hosted by that DataNode to be unavailable.
Hence, once the heartbeats stop arriving at the NameNode, the NameNode performs certain tasks, such as replicating the blocks present on that DataNode to other DataNodes, to keep the data highly available and to ensure data reliability.
The NameNode that receives the Heartbeats from a DataNode also receives, along with them, information such as the DataNode's total storage capacity and the storage in use.
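The 3-second interval corresponds to the dfs.heartbeat.interval HDFS property (value in seconds); a small hdfs-site.xml sketch that simply makes the default explicit:
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
</property>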
Speculative Execution
In Hadoop, MapReduce breaks jobs into tasks and these tasks run
parallel rather than sequential, thus reduces overall execution
time. This model of execution is sensitive to slow tasks (even if
they are few in numbers) as they slow down the overall execution
of a job.
There may be various reasons for the slowdown of tasks, including
hardware degradation or software misconfiguration, but it may be
difficult to detect causes since the tasks still complete successfully,
although more time is taken than the expected time.
The Hadoop framework does not try to diagnose or fix the slow-
running tasks. The framework tries to detect the task which is
running slower than the expected speed and launches another task,
which is an equivalent task as a backup. The backup task is known
as the speculative task, and this process is known as speculative
execution in Hadoop.
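Speculative execution can be switched on or off per task type through configuration; a sketch for mapred-site.xml or a per-job configuration (the values shown are illustrative, not recommendations):
<property>
<name>mapreduce.map.speculative</name>
<value>true</value>
</property>
<property>
<name>mapreduce.reduce.speculative</name>
<value>false</value>
</property>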
In MapReduce programming, Jobs(applications) are split into a set of map tasks and
reduce tasks.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data that is produced by
map tasks to generate final output.
Each map task is broken down into the following phases:
1. Record Reader 2. Mapper
3. Combiner 4. Partitioner.
The output produced by the map task is known as intermediate <keys, value> pairs.
These intermediate <keys, value> pairs are sent to reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed
resides. This way, Hadoop ensures data locality. Data locality means that data is not
moved over network; only computational code moved to process data which saves
network bandwidth.
Mapper Phases:
Mapper maps the input <keys, value> pairs into a set of intermediate <keys, value>
pairs.
Each map task is broken into following phases:
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper tasks, i.e. it presents the tasks with keys and values.
i) InputFormat: It reads the given input file and splits using the method getsplits().
ii) Then it defines RecordReader using createRecordReader() which is responsible for
generating <keys, value> pairs.
2. Mapper: Map function works on the <keys, value> pairs produced by RecordReader
and generates intermediate (key, value) pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the map() method.
3. Combiner: It takes the intermediate <key, value> pairs provided by the mapper and applies a user-specific aggregate function to the output of only one mapper. It is also known as a local Reducer.
4. Partitioner: It decides which Reducer each intermediate <key, value> pair is sent to. The default behavior is to hash the key to determine the reducer. The user can control this by using the method:
int getPartition(KEY key, VALUE value, int numPartitions )
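A minimal sketch of a custom Partitioner that overrides getPartition() (the routing rule is purely illustrative): keys starting with the letters a to m go to partition 0, and all other keys are hashed over the remaining partitions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions == 1 || k.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(k.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;                                          // keys a..m all go to the first reducer
        }
        // hash the remaining keys over partitions 1..numPartitions-1
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

The partitioner would then be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).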
Reducer Phases:
1. Shuffle & Sort:
Downloads the grouped key-value pairs onto the local machine, where the Reducer is
running.
The individual <keys, value> pairs are sorted by key into a larger data list.
The data list groups the equivalent keys together so that their values can be iterated
easily in the Reducer task.
2. Reducer:
The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them.
Here, the data can be aggregated, filtered, and combined in a number of ways, and it
requires a wide range of processing.
Once the execution is over, it gives zero or more key-value pairs to the final step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called once for each key with its iterable collection of values.
- void run(Context context): the user can override this method for complete control over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform the activities required to initialize the reduce() method.
3. Output format:
In the output phase, we have an output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a file using a record writer.
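To tie the mapper, combiner, reducer, and output format together, a driver program configures and submits the job. A minimal sketch (it reuses the illustrative WordCount mapper and reducer shown earlier; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // illustrative job name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);    // map phase
        job.setCombinerClass(WordCount.IntSumReducer.class);    // optional local reducer
        job.setReducerClass(WordCount.IntSumReducer.class);     // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // InputFormat reads and splits the input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // OutputFormat writes the final key-value pairs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}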