
HDFS & Map Reduce

Key Modules of Apache Hadoop Framework:

1. Hadoop Common (Hadoop Core)

 Purpose: Provides shared libraries and utilities that support the other Hadoop
modules.
 Functions:
o Contains the common utilities and infrastructure used by Hadoop components
like HDFS, YARN, and MapReduce.
o Includes Java libraries, scripts, and configuration files required for startup and
operation.
o Provides essential services like I/O, RPC, serialization, and file-based
operations.

2. HDFS (Hadoop Distributed File System)

 Purpose: A distributed file system that stores data across multiple machines in a
Hadoop cluster, ensuring redundancy and fault tolerance.
 Functions:
o Distributed Storage: Breaks files into blocks and distributes them across
different nodes in the cluster.
o Replication: Each block is replicated (default is 3 copies) to ensure fault
tolerance in case of node failure.
o High Throughput: Optimized for handling large files with high throughput
rather than low-latency access to small files.
o Components:
 NameNode: Manages metadata, like the directory structure and
locations of blocks.
 DataNodes: Store the actual data blocks.

In Hadoop, both the hadoop fs and hdfs dfs commands are used to interact with the Hadoop
Distributed File System (HDFS).

hadoop fs Command
 Overview: hadoop fs is a generic file system command that can operate on
different types of file systems supported by Hadoop, not just HDFS.
 Purpose: It is designed to work with any file system Hadoop supports (e.g., HDFS,
Local File System, Amazon S3, Azure Blob Storage, etc.). This command is flexible
and used to interact with both HDFS and non-HDFS storage systems.

hdfs dfs Command


 Overview: hdfs dfs is a command specifically for interacting with the Hadoop
Distributed File System (HDFS).

 Purpose: It is explicitly designed for managing and interacting with HDFS. It works directly with HDFS and is a more specific command for HDFS operations, as opposed to the general-purpose nature of hadoop fs.

 hadoop fs <args>
 fs refers to a generic file system and can point to any file system such as the local file system, HDFS, WebHDFS, S3 FS, etc.
 hadoop dfs <args>
 hdfs dfs <args>
 dfs refers to the Distributed File System and is specific to HDFS. You can use it to execute operations on HDFS. hadoop dfs is now deprecated, and you have to use hdfs dfs instead.
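For instance, the two commands behave identically when the target is HDFS; the difference only shows up when a non-HDFS URI is given. A short illustration (the bucket name and paths below are hypothetical, and the S3 example assumes the S3A connector is configured):

hdfs dfs -ls /user/hadoop            (lists an HDFS directory)
hadoop fs -ls /user/hadoop           (same result through the generic entry point)
hadoop fs -ls file:///tmp            (local file system through the same command)
hadoop fs -ls s3a://my-bucket/logs   (object storage, only via the generic hadoop fs form)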

3. YARN (Yet Another Resource Negotiator)

 Purpose: A resource management layer responsible for managing and scheduling computing resources in a Hadoop cluster.
 Functions:
o Resource Allocation: Allocates CPU, memory, and storage resources to
various applications running on the cluster.
o Job Scheduling: Schedules and monitors tasks, ensuring efficient utilization
of cluster resources.
o Components:
 ResourceManager: Manages the overall cluster resources and assigns
them to applications.
 NodeManager: Manages resources on individual nodes and executes
tasks.
 ApplicationMaster: Manages the execution of individual applications
(e.g., MapReduce jobs).

4. MapReduce

 Purpose: A processing model for performing distributed computation on large data sets.
 Functions:
o Map Phase: Processes input data and transforms it into intermediate key-
value pairs.
o Reduce Phase: Aggregates the key-value pairs generated by the Map phase to
produce the final output.
o Distributed Processing: Executes the Map and Reduce tasks across multiple
nodes in the cluster.
o Fault Tolerance: Automatically handles failures by re-executing tasks on
other nodes when necessary.
o Components:

o JobTracker (in Hadoop 1.x) or ResourceManager (in YARN): Manages job


scheduling.
o TaskTracker (in Hadoop 1.x) or NodeManager (in YARN): Executes tasks
on nodes.
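Below is a minimal word-count sketch of the Map and Reduce phases described in this module, using the Hadoop Java MapReduce API. It is illustrative only: class names are arbitrary, and input/output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}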

Supporting and Ecosystem Components (Not Core Modules but Part of the
Hadoop Ecosystem):

In addition to these core modules, the Hadoop ecosystem includes various projects and tools
that extend the capabilities of Hadoop for different use cases:

1. Hive: A data warehousing system built on top of Hadoop. Provides SQL-like query
language (HiveQL) for querying large datasets stored in HDFS.
2. HBase: A NoSQL database that provides real-time read/write access to large datasets
stored in HDFS.
3. Pig: A high-level platform for creating MapReduce programs using a data flow
scripting language (Pig Latin).
4. Flume: A tool for efficiently collecting, aggregating, and moving large amounts of
log data into Hadoop.
5. Sqoop: A tool designed for transferring bulk data between Hadoop and relational
databases.
6. Oozie: A workflow scheduler that manages Hadoop jobs, including MapReduce, Pig,
and Hive jobs.
7. Zookeeper: A distributed coordination service used to manage synchronization,
configuration, and group services in large Hadoop clusters.
8. Mahout: A machine learning library designed for building scalable machine learning
algorithms on top of Hadoop.

Summary of Hadoop Modules:

1. Hadoop Common: Provides shared libraries and utilities.


2. HDFS: Distributed file system for storing large data across multiple nodes.
3. YARN: Resource management and job scheduling framework.
4. MapReduce: Distributed processing model for parallel computation of large data
sets.

Each of these modules plays a crucial role in the overall functioning of the Apache Hadoop
framework, making it a robust system for handling and processing big data across distributed
clusters.

RDBMS vs HADOOP
The primary difference between RDBMS (Relational Database Management System) and
Hadoop lies in how they handle and process data. Both are used for managing data, but their
purposes, data structures, scalability, and processing models are quite different.

Here’s a detailed comparison:

1. Data Structure and Schema

RDBMS:

 Structured data: RDBMS is designed to handle structured data (i.e., data organized
in tables with defined schema).
 Fixed schema: RDBMS uses a well-defined schema (table structure) with columns
that have predefined data types. This schema must be defined before inserting data.
 ACID properties: RDBMS follows ACID (Atomicity, Consistency, Isolation,
Durability) properties, ensuring strict transaction control, consistency, and data
integrity.

Hadoop:

 Unstructured and semi-structured data: Hadoop is capable of handling structured, semi-structured, and unstructured data (e.g., text, images, logs, videos).
 Flexible schema: Data in Hadoop does not require a predefined schema. It can store
any type of data (e.g., CSV, JSON, images) and process it later.
 No strict ACID: Hadoop does not follow ACID properties as strictly as RDBMS.
Instead, it focuses on availability and partition tolerance (CAP theorem).

2. Data Storage

RDBMS:

 Row-oriented storage: Data is typically stored in a row-based format where all columns of a record (row) are stored together.
 Limited storage: RDBMS is designed for relatively smaller datasets and can be
expensive when scaling up (more hardware required).
 Centralized storage: Data is stored on a single server or limited number of nodes.

Hadoop:

 Distributed storage (HDFS): Hadoop uses HDFS (Hadoop Distributed File System)
to store data across multiple machines in a cluster. Data is broken into blocks and
distributed across nodes.
 Scalable storage: Hadoop can handle massive amounts of data (petabytes or more)
and scale horizontally by adding more machines (commodity hardware).
 Block-based storage: Data is stored in large blocks (default 128MB or 256MB),
which helps in efficiently managing large datasets.

3. Scalability
RDBMS:

 Vertical scalability: RDBMS primarily scales by adding more powerful resources (e.g., more CPU, RAM) to a single machine, which can become expensive and complex.
 Limited scalability: Most traditional RDBMS systems are not designed to handle
extremely large datasets (big data) or scale across many machines easily.

Hadoop:

 Horizontal scalability: Hadoop is designed for horizontal scalability, meaning it can easily scale by adding more inexpensive machines (nodes) to the cluster.
 Designed for big data: Hadoop is built for handling and processing vast amounts of
data (petabytes or exabytes) efficiently.

4. Data Processing Model

RDBMS:

 OLTP (Online Transaction Processing): RDBMS is primarily designed for OLTP workloads, which involve many small, short, read/write operations (e.g., transactions).
 Real-time querying: RDBMS supports real-time querying, and SQL is used to
perform fast lookups, updates, and queries on small to moderately large datasets.
 Transactional processing: Strong transactional processing ensures consistency in
small, frequent operations (e.g., banking transactions).

Hadoop:

 Batch processing: Hadoop is designed for batch processing, where large amounts of
data are processed in bulk over time. It is not optimized for real-time querying or
small transactions.
 MapReduce model: Hadoop uses the MapReduce programming model to distribute
processing tasks across multiple machines, enabling parallel processing of large
datasets.
 OLAP (Online Analytical Processing): Hadoop is more suited for OLAP workloads,
which involve analyzing large datasets for complex computations, data mining, and
reporting.

5. Data Integrity and Transactions

RDBMS:

 Strong consistency and ACID compliance: RDBMS ensures strong data consistency and integrity, and supports ACID transactions (Atomicity, Consistency, Isolation, Durability).
 Transactional: Every operation is transactional, meaning it can be rolled back if an
error occurs, ensuring data reliability.

Hadoop:

 Eventual consistency: Hadoop is designed to handle large distributed systems where eventual consistency is more important than strict transaction management.
 No ACID: Hadoop does not strictly follow ACID properties. Instead, it focuses on
fault tolerance and scalability.

6. Performance

RDBMS:

 Optimized for small data: RDBMS performs well with small to medium-sized
datasets and when dealing with complex queries on structured data.
 Real-time performance: Supports real-time transaction processing and is optimized
for low-latency reads and writes.

Hadoop:

 Optimized for large-scale data: Hadoop performs better when dealing with large,
distributed data sets. However, it's optimized for batch processing and not for real-
time queries.
 Higher latency: Hadoop processing (e.g., MapReduce jobs) can take longer
compared to real-time operations in RDBMS due to the nature of batch processing.

7. Cost and Hardware

RDBMS:

 Expensive scaling: As RDBMS scales vertically (more powerful servers), the cost
increases significantly due to the requirement of specialized hardware.
 Commercial licenses: Many RDBMS systems (e.g., Oracle, SQL Server) come with
high licensing fees, although open-source options like MySQL or PostgreSQL are
available.

Hadoop:

 Cost-effective scaling: Hadoop is designed to scale horizontally using commodity hardware (inexpensive servers), making it more cost-effective for handling big data.
 Open-source: Hadoop is an open-source project under the Apache license, making it
free to use, though enterprise support may have costs (e.g., Cloudera, Hortonworks).

8. Use Cases
RDBMS:

 Traditional databases: Used for managing small to medium-sized data sets with
structured data.
 Transactional systems: Ideal for banking, retail, healthcare, or any system requiring
real-time processing and ACID compliance.
 Relational data: Best suited for applications with relational data and strict
consistency requirements.

Hadoop:

 Big data processing: Ideal for applications involving large-scale data analysis, ETL
(Extract, Transform, Load), and batch processing of vast amounts of data.
 Unstructured data: Suitable for handling unstructured or semi-structured data (e.g.,
log files, social media data, images, videos).
 Data lakes: Used in scenarios where massive amounts of raw data need to be ingested
and stored for later analysis (e.g., in data lakes).

Summary Table: RDBMS vs Hadoop

Feature | RDBMS | Hadoop
Data Structure | Structured (tables, schema) | Structured, semi-structured, unstructured
Schema | Fixed, pre-defined | Flexible, schema-on-read
Scalability | Vertical (scales by adding resources) | Horizontal (scales by adding nodes)
Processing Model | OLTP, real-time | Batch processing (MapReduce), OLAP
Consistency | Strong consistency (ACID) | Eventual consistency
Storage | Centralized, row-oriented | Distributed (HDFS), block-oriented
Performance | Low-latency, transactional | High-latency, batch processing
Cost | Expensive (scaling & licenses) | Cost-effective (open-source, commodity hardware)
Best Use Cases | Transactional systems, small to medium datasets | Big data processing, data lakes, analytics

Conclusion:

 RDBMS is ideal for traditional database applications requiring structured data, ACID transactions, and real-time performance.
 Hadoop is designed for big data analytics and batch processing of large datasets,
and it excels in handling unstructured data at a low cost through horizontal
scalability.
Hadoop Installation Modes

Hadoop can be installed and run in three different modes, each designed for different use
cases and stages of development or deployment. These modes determine how Hadoop is
configured, where services run, and how the system processes data. The three primary
Hadoop installation modes are:

1. Standalone Mode (Local Mode)

Overview:

 In standalone mode, Hadoop runs on a single machine without any distributed processing.
 This is the default mode of Hadoop and is typically used for testing and
development purposes. It does not require Hadoop Distributed File System (HDFS),
and all components (MapReduce jobs, for example) run in the local filesystem.

Features:

 No HDFS: All files are stored in the local filesystem, not in HDFS.
 Single JVM: Both the Map and Reduce tasks are executed on a single Java Virtual
Machine (JVM).
 No daemons: None of the Hadoop daemons (like NameNode, DataNode,
ResourceManager, etc.) are started.
 Fast setup: The easiest to set up and configure.
 Use case: Used mainly for debugging, learning, and developing small applications
without involving multiple nodes or complex configurations.

Limitations:

 Not suitable for handling large datasets or distributed processing.


 Lacks fault tolerance, scalability, and parallelism.

2. Pseudo-Distributed Mode (Single-Node Cluster)

Overview:

 In pseudo-distributed mode, Hadoop runs on a single machine, but all Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.) run as separate Java processes.
 It mimics a real Hadoop cluster but still runs on a single machine. This mode allows
you to simulate the behavior of a multi-node cluster while running on just one node.

Features:

 HDFS is enabled: Data is stored in HDFS, allowing users to test Hadoop’s storage
features (like data replication and fault tolerance).

 All daemons run locally: The daemons for HDFS and YARN run as separate
processes on the same machine.
 Testing distributed environment: Though everything runs on a single machine, you
can test job scheduling, fault tolerance, and parallel execution.
 Higher resource usage: Since all daemons run on one machine, it requires more CPU
and memory compared to standalone mode.

Limitations:

 While it simulates a real cluster, all processes still run on a single machine, so it lacks
the true benefits of distributed computing (scalability, performance).
 Resource-intensive compared to standalone mode.

3. Fully Distributed Mode (Multi-Node Cluster)

Overview:

 Fully distributed mode is Hadoop’s production mode, where Hadoop runs on a cluster of machines.
 The data is distributed across multiple nodes using HDFS, and processing is done in
parallel using MapReduce (or other processing engines like Spark).

Features:

 HDFS is fully distributed: Data is broken into blocks, distributed across multiple
nodes, and replicated for fault tolerance.
 Multiple nodes: Hadoop daemons (NameNode, DataNode, ResourceManager, etc.)
run on different machines.
 Fault tolerance: Hadoop’s fault-tolerant features (like block replication and task re-
execution) are fully operational.
 Scalability: Can scale horizontally by adding more nodes to the cluster, allowing
Hadoop to handle petabytes of data.
 High availability: High availability can be configured for critical components (like
the NameNode).

 Used for –
o Big data analytics for large enterprises where data is stored and processed
across many nodes.
o Parallel processing of large datasets using MapReduce or other engines like
Spark, Hive, or Pig.

Components in Fully Distributed Mode:

 NameNode: Manages metadata and the file system namespace.


 DataNodes: Store data blocks and perform read/write operations.
 ResourceManager: Manages cluster resources and schedules applications.
 NodeManagers: Manage individual node resources and run tasks.


Comparison Table:

Feature | Standalone Mode (Local) | Pseudo-Distributed Mode | Fully Distributed Mode
Number of Nodes | 1 | 1 | Multiple
HDFS | No (local filesystem) | Yes | Yes
Daemons | None | All daemons run locally | All daemons run across nodes
Processing | Local, no parallelism | Simulates distributed processing | Fully parallel and distributed
Use Case | Development, debugging | Testing, simulating a cluster | Production, big data processing
Fault Tolerance | No | Limited | Yes
Scalability | None | Limited | High

Summary of Installation Modes:

1. Standalone Mode: For small-scale development, debugging, and testing on a single machine. No HDFS or distributed processing.
2. Pseudo-Distributed Mode: Simulates a cluster environment by running all Hadoop
services on a single machine, allowing you to test HDFS and distributed job
processing.
3. Fully Distributed Mode: Hadoop operates on multiple nodes in a cluster, enabling
distributed storage (HDFS) and parallel processing (MapReduce), designed for large-
scale production environments.

Each mode is tailored to specific needs, from development and testing to full-scale
production in distributed computing environments.

Hadoop Distributors
Overview of Hadoop Distributions

Table 11.1. Hadoop Distributions

Distro | Remarks | Free / Premium
Apache (hadoop.apache.org); Hadoop 1.0, Hadoop 2.0 | The Hadoop source; no packaging except TAR balls; no extra tools | Completely free and open source
Cloudera (www.cloudera.com); Cloudera Distribution for Hadoop: CDH 4.0, CDH 5.0 | Oldest distro; very polished; comes with good tools to install and manage a Hadoop cluster | Free / Premium model (depending on cluster size)
HortonWorks (www.hortonworks.com); Hortonworks Data Platform: HDP 1.0, HDP 2.0 | Newer distro; tracks Apache Hadoop closely; comes with tools to manage and administer a cluster | Completely open source
MapR (www.mapr.com); M3, M5, M8 | MapR has their own file system (alternative to HDFS); boasts higher performance; nice set of tools to manage and administer a cluster; does not suffer from Single Point of Failure; offers some cool features like mirroring, snapshots, etc. | Free / Premium model

HADOOP DAEMONS

Apache Hadoop consists of the following Daemons:

Name Node

Data Node

Secondary Name Node

Resource Manager

Node Manager

Name node, Secondary Name Node, and Resource Manager work on a Master System while
the Node Manager and Data Node work on the Slave machine.

1. Name Node: The Name Node works on the Master System. The primary purpose of the Name Node is to manage all the Meta Data. Features: It never stores the data that is present in the file. As the Namenode works on the Master System, the Master system should have good processing power and more RAM than the Slaves. It stores the information of the DataNodes, such as their Block IDs and Number of Blocks.

2. DataNode: The DataNode works on the Slave system. The NameNode always instructs the DataNode to store the Data. The DataNode is a program that runs on the slave system and serves the read/write requests from the client. As the data is stored on the DataNodes, they should possess high memory to store more Data.

3. Secondary NameNode: The Secondary NameNode is used for taking the hourly backup of the data. In case the Hadoop cluster fails or crashes, the secondary Namenode will take the hourly backup or checkpoints of that data and store this data into a file named FsImage. This file then gets transferred to a new system. New MetaData is assigned to that new system and a new Master is created with this MetaData, and the cluster is made to run again correctly.

As the secondary NameNode keeps track of checkpoints in a Hadoop Distributed File System, it is also known as the Checkpoint Node.

4. Resource Manager: The Resource Manager works on the Master System. The Resource Manager manages the resources for the applications that are running in a Hadoop Cluster. The Resource Manager mainly consists of 2 things:

1. ApplicationsManager 2. Scheduler

The Applications Manager is responsible for accepting requests from a client and also allocates a memory resource on the Slaves in a Hadoop cluster to host the Application Master. The Scheduler is utilized for providing resources for applications in a Hadoop cluster and for monitoring these applications.

5. Node Manager: The Node Manager works on the Slave system and manages the resources within the Node, such as Memory and Disk. Each Slave Node in a Hadoop cluster has a single NodeManager Daemon running in it. It also sends this monitoring information to the Resource Manager.


Blocks

Now, as we know, the data in HDFS is scattered across the DataNodes as blocks. Let’s have a look at what a block is and how it is formed.

Blocks are nothing but the smallest continuous location on your hard drive where data is stored. In general, in any of the File Systems, you store the data as a collection of blocks. Similarly, HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.

Data Replication

Replication Management:

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable. So, as you can see in the figure below, each block is replicated three times and stored on different DataNodes (considering the default replication factor):


XML File Configurations in Hadoop.

core-site.xml – This configuration file contains Hadoop core configuration settings, for
example, I/O settings, very common for MapReduce and HDFS.
mapred-site.xml – This configuration file specifies a framework name for MapReduce
by setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It
also specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and
NodeManager.
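As an illustration, a minimal single-node configuration of these files might look like the following. The host name, port, and values are examples rather than mandatory settings; the property names shown are the ones used by Hadoop 2.x.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>   <!-- URI of the NameNode -->
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>                       <!-- one copy is enough on a single node -->
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>                    <!-- run MapReduce on YARN -->
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>       <!-- shuffle service needed by MapReduce -->
  </property>
</configuration>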

HADOOP DAEMONS

Apache Hadoop consists of the following Daemons:

1. Name Node
2. Data Node
3. Secondary Name Node
4. Resource Manager
5. Node Manager

Name node, Secondary Name Node, and Resource Manager work on a Master
System while the Node Manager and Data Node work on the Slave machine.

HDFS Daemons:

(i) NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to perform
I/O tasks.
Blocks: HDFS breaks large file into smaller pieces called blocks.
rackID: The NameNode uses a rackID to identify data nodes in the rack (a rack is a collection of datanodes within the cluster). The NameNode keeps track of the blocks of a file.
File System Namespace: The NameNode is the bookkeeper of HDFS. It keeps track of how files are broken down into blocks and which DataNode stores these blocks. The namespace is a collection of files in the cluster.
FsImage: file system namespace includes mapping of blocks of a file, file properties
and is stored in a file called FsImage.
EditLog: namenode uses an EditLog (transaction log) to record every transaction
that happens to the file system metadata.
NameNode is single point of failure of Hadoop cluster.


(ii) DataNode
There are multiple data nodes per cluster. Each slave machine in the cluster has a DataNode daemon for reading and writing HDFS blocks of the actual file on the local file system.
During pipeline reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends a “heartbeat” message to the NameNode to ensure the connectivity between the Name node and the data node.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed and its blocks are re-replicated.


Fig. Interaction between NameNode and DataNode.

(iii) Secondary name node

It takes snapshots of the HDFS metadata at intervals specified in the Hadoop configuration.
The memory requirement is the same for the secondary node as for the NameNode, but the secondary node will run on a different machine.
In case of name node failure, the secondary name node can be configured manually to bring up the cluster, i.e., we make the secondary namenode the name node.

Special features of HDFS:


1. Data Replication: There is absolutely no need for a client application to track all blocks. HDFS directs the client to the nearest replica to ensure high performance.
2. Data Pipeline: A client application writes a block to the first DataNode in the pipeline.
Then this DataNode takes over and forwards the data to the next node in the pipeline.
This process continues for all the data blocks, and subsequently all the replicas are
written to the disk.

Fig. File Replacement Strategy


Basic terminology:

Node: A node is simply a computer. This is typically non-enterprise, commodity hardware for nodes that contain data.

Rack: A collection of nodes is called a rack. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch.

Network bandwidth between any two nodes in the same rack is greater than
bandwidth between two nodes on different racks.

Cluster: A Hadoop Cluster (or just cluster from now on) is a collection of racks.


File Blocks:
Blocks are nothing but the smallest continuous location on your hard drive where
data is stored. In general, in any of the File System, you store the data as a
collection of blocks. Similarly, HDFS stores each file as blocks which are scattered
throughout the Apache Hadoop cluster. The default size of each block is 128 MB
in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can configure as
per your requirement. All blocks of the file are the same size except the last
block, which can be either the same size or smaller. The files are split into 128 MB blocks and then stored into the Hadoop file system. The Hadoop application is responsible for distributing the data blocks across multiple nodes.


Let’s take an example where we have a file “example.txt” of size 514 MB, as shown in the above figure. Suppose that we are using the default configuration of block size, which is 128 MB. Then, 5 blocks will be created. The first four blocks will be of 128 MB. But the last block will be of 2 MB size only.

Components of HDFS:
HDFS is a block-structured file system where each file is divided into
blocks of a pre-determined size. These blocks are stored across a
cluster of one or several machines. Apache Hadoop HDFS Architecture
follows a Master/Slave Architecture, where a cluster comprises of a
single NameNode (Master node) and all the other nodes are
DataNodes (Slave nodes). HDFS can be deployed on a broad spectrum
of machines that support Java. Though one can run several DataNodes
on a single machine, but in the practical world, these DataNodes are
spread across various machines.

NameNode:
NameNode is the master node in the Apache Hadoop HDFS
Architecture that maintains and manages the blocks present on the
DataNodes (slave nodes). NameNode is a very highly available server
that manages the File System Namespace and controls access to files
by clients. The HDFS architecture is built in such a way that the user
data never resides on the NameNode. Name node contains metadata
and the data resides on DataNodes only.

Functions of NameNode:

 It is the master daemon that maintains and manages the DataNodes (slave nodes).
 Manages the file system namespace.

 It records the metadata of all the files stored in the cluster, e.g.
the location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the
metadata:
o FsImage: It contains the complete state of the file
system namespace since the start of the NameNode.
o EditLogs: It contains all the recent modifications made
to the file system with respect to the most recent
FsImage.
 It records each change that takes place to the file system
metadata. For example, if a file is deleted in HDFS, the
NameNode will immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are
live.
 It keeps a record of all the blocks in HDFS and in which nodes these blocks
are located.
 The NameNode is also responsible for maintaining the replication factor of all the blocks.
 In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.


DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system which is not of high quality or high availability. The DataNode is a block server that stores the data in the local file system (such as ext3 or ext4).

Functions of DataNode:

 The actual data is stored on DataNodes.

 Datanodes perform read-write operations on the file systems, as per client request.
 They also perform operations such as block creation, deletion,
and replication according to the instructions of the namenode.
 They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.


Secondary NameNode:
It is a separate physical machine which acts as a helper of the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of metadata, which helps minimize downtime and loss of data. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon.


Functions of Secondary NameNode:

 The Secondary NameNode is one which constantly reads all the file systems and metadata from the RAM of the NameNode and writes it into the hard disk or the file system.
 It is responsible for combining the EditLogs with FsImage from the
NameNode.

 It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode, and it is used whenever the NameNode is started the next time.
 Hence, Secondary NameNode performs regular checkpoints in
HDFS. Therefore, it is also called CheckpointNode.

HDFS Architecture: Apache Hadoop HDFS Architecture follows a Master/Slave Architecture, where a cluster comprises a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.

Storing data into HDFS:


HDFS stores data in a reliable fashion using replication and
distribution. Here is the series of steps that happen when a client
writes a file in hdfs:

1. Client requests the Namenode to create the file. It passes the size of the file as a parameter.
2. Namenode responds with the locations of nodes where the client can store data. By default there will be 3 locations per block. If the file size is 200 MB, there will be 2 blocks, the first of 128 MB and the second of 72 MB. Similarly, depending on the size, you will have n number of blocks.
3. Client directly starts writing data to the first datanode out of
three given by namenode. Please note that if there are 2 blocks to be written client can start writing them in parallel.


4. When the first datanode has stored the block, it replies to
the client with success and now it passes on the same block
to 2nd datanode. 2nd datanode will write this block and
pass it on to 3rd datanode.
5. So basically writing of blocks from client to datanodes
happens in parallel but replication happens in series.
Blocks of the same file can go to different nodes; at the very least, the replicated blocks will always be on different nodes. The first block is always on the datanode which is nearest to the client; the 2nd and 3rd copies are stored based on the free capacity of the datanodes and/or rack awareness.

What is Replication Management?


 HDFS performs replication to provide fault tolerance and to improve data reliability.
 There could be situations where the data is lost in many ways: a node is down, a node has lost network connectivity, a node is physically damaged, or a node is intentionally made unavailable for horizontal scaling.
 For any of the above-mentioned reasons, data will not be available
if the replication is not made. HDFS usually maintains 3 copies
of each Data Block in different nodes and different Racks. By
doing this, data is made available even if one of the systems is
down.
 Downtime is reduced by replicating data. This improves the reliability and makes HDFS fault tolerant.
 Block replication provides fault tolerance. If one copy is not accessible or is corrupted, we can read the data from another copy.

 The number of copies or replicas of each block of a file in the HDFS architecture is the replication factor. The default replication factor is 3, which is again configurable. So, each block is replicated three times and stored on different DataNodes.
 So, as you can see in the figure below where each block is
replicated three times and stored on different DataNodes
(considering the default replication factor): If we are storing a file
of 128 MB in HDFS using the default configuration, we will end up
occupying a space of 384 MB (3*128 MB).

Rack Awareness in HDFS Architecture:

 Rack - It is a collection of around 30-40 machines. All these machines are connected using the same network switch, and if that network goes down then all machines in that rack will be out of service. Thus we say the rack is down.
 Rack Awareness was introduced by Apache Hadoop to overcome this issue. Rack awareness is the knowledge of how the data nodes are distributed across the racks of the Hadoop cluster.
 In a large Hadoop cluster, in order to reduce the network traffic while reading/writing HDFS files, the NameNode chooses a DataNode in the same rack or a nearby rack to serve the read/write request. The NameNode obtains rack information by maintaining the rack IDs of each DataNode. This concept of choosing DataNodes based on rack information is called Rack Awareness in Hadoop.

 In HDFS, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack; it follows the Rack Awareness Algorithm to reduce latency as well as to improve fault tolerance.
 As we know the default Replication Factor is 3; when a client wants to place a file in HDFS, Hadoop places the replicas as follows:
1) The first replica is written to the data node creating the
file, to improve the write performance because of the
write affinity.
2) The second replica is written to another data
node within the same rack, to minimize the cross-
rack network traffic.
3) The third replica is written to a data node in a different
rack, ensuring that even if a switch or rack fails, the data is
not lost (Rack awareness).
 This configuration is maintained to make sure that the File is
never lost in case of a Node Failure or even an entire Rack Failure.

Advantages of Rack Awareness:

 Minimize the writing cost and maximize read speed – Rack awareness places read/write requests to replicas on the same or a nearby rack, thus minimizing writing cost and maximizing reading speed.
 Provide maximum network bandwidth and low latency – Rack awareness maximizes network bandwidth by transferring blocks within a rack. This is particularly beneficial in cases where tasks cannot be assigned to nodes where their data is stored locally.
 Data protection against rack failure – By default, the namenode
assigns 2nd & 3rd replicas of a block to nodes in a rack different
from the first replica. This provides data protection even
against rack failure

https://www.npntraining.com/blog/anatomy-of-file-read-and-write/

Anatomy of File Read and Write


HDFS has a master and slave kind of architecture.

Namenode acts as master and Datanodes as worker.

All the metadata information is with namenode and the original data is stored on the
datanodes.

Keeping all these in mind, the below figure gives an idea of how data flow happens between the Client interacting with HDFS, i.e. the Namenode and the Datanodes.

Anatomy of File Read

The following steps are involved in reading the file from HDFS:
Let’s suppose a Client (a HDFS Client) wants to read a file from HDFS.

Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.

Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to determine the locations of the blocks for the first few blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which it shows to the data node for authentication.

The DistributedFileSystem returns an object of FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has
stored the DataNode addresses for the first few blocks in the file, then connects to
the first closest DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read()
repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the DataNode , then find the best DataNode for the next block. This
happens transparently to the client, which from its point of view is just reading a
continuous stream.

Step 6: Blocks are read in order, with the DFSInputStream opening new connections
to datanodes as the client reads through the stream. It will also call the namenode to
retrieve the datanode locations for the next batch of blocks as needed. When the
client has finished reading, it calls close() on the FSDataInputStream.
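A minimal sketch of this read path using the Hadoop Java API is shown below. The file path and NameNode URI are hypothetical; in practice the Configuration object also picks up core-site.xml from the classpath.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://localhost:9000/chp/abc1.txt";        // hypothetical file in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);    // DistributedFileSystem for hdfs:// URIs
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                            // open() returns an FSDataInputStream (Step 1-2)
      IOUtils.copyBytes(in, System.out, 4096, false);         // repeated read() calls stream the blocks (Steps 3-6)
    } finally {
      IOUtils.closeStream(in);                                // close() on the stream
    }
  }
}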

Anatomy of File Write

Step 1: The client creates the file by calling create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it.

The namenode performs various checks to make sure the file doesn’t already exist
and that the client has the right permissions to create the file.

If these checks pass, the namenode makes a record of the new file; otherwise, file
creation fails and the client is thrown an IOException.

The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it
writes to an internal queue, called the data queue.

The data queue is consumed by the DataStreamer, which is responsible for asking
the namenode to allocate new blocks by picking a list of suitable datanodes to store
the replicas.

The list of datanodes forms a pipeline, and here we’ll assume the replication level is
three, so there are three nodes in the pipeline.

The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.

Step 4: Similarly, the second datanode stores the packet and forwards it to the third
(and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue.

A packet is removed from the ack queue only when it has been acknowledged by all
the datanodes in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream.

This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for blocks to be minimally replicated before returning successfully.
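Correspondingly, a minimal write sketch using the same Java API is given below. The local and HDFS paths are hypothetical; create() returns the FSDataOutputStream described above, and closing it waits for the pipeline acknowledgments.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    String localSrc = "/home/hadoop/sample.txt";              // hypothetical local file
    String dst = "hdfs://localhost:9000/chp/sample.txt";      // hypothetical HDFS destination
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    FSDataOutputStream out = fs.create(new Path(dst));        // RPC to the namenode creates the file (Step 2)
    try {
      IOUtils.copyBytes(in, out, 4096, false);                // packets flow through the datanode pipeline (Steps 3-5)
    } finally {
      out.close();                                            // flush remaining packets and wait for acks (Step 6)
      in.close();
    }
  }
}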

File Read operation:


The steps involved in the File Read are as follows:
1. The client opens the file that it wishes to read from by calling open() on the DFS.
2. The DFS communicates with the NameNode to get the location of data blocks.
NameNode returns with the addresses of the DataNodes that the data blocks are stored
on.

Subsequent to this, the DFS returns an FSDataInputStream to the client to read from the file.
3. Client then calls read() on the stream DFSInputStream, which has addresses of
DataNodes for the first few block of the file.
4. Client calls read() repeatedly to stream the data from the DataNode.

5. When the end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and
subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.

Fig. File Read Anatomy


File Write operation:
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which then writes them to an internal queue, called the data queue. The DataStreamer consumes the data queue.


4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores the packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an “ack queue” of the packets that are waiting to be acknowledged by DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.

Fig. File Write Anatomy

Explain basic HDFS File operations with an example.

1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp
2. Remove a file in a specified path:
Syntax: hdfs dfs -rm <src>
Eg. hdfs dfs -rm /chp/abc.txt
3. Copy a file from the local file system to HDFS:
Syntax: hdfs dfs -copyFromLocal <src> <dst>
Eg. hdfs dfs -copyFromLocal /home/hadoop/sample.txt /chp/abc1.txt
4. To display the list of contents in a directory:
Syntax: hdfs dfs -ls <path>
Eg. hdfs dfs -ls /chp
5. To display the contents of a file:
Syntax: hdfs dfs -cat <path>
Eg. hdfs dfs -cat /chp/abc1.txt
6. Copy a file from HDFS to the local file system:
Syntax: hdfs dfs -copyToLocal <src> <dst>
Eg. hdfs dfs -copyToLocal /chp/abc1.txt /home/hadoop/Desktop/sample.txt
7. To display the last few lines of a file:
Syntax: hdfs dfs -tail <path>
Eg. hdfs dfs -tail /chp/abc1.txt
8. Display the aggregate length of a file in bytes:
Syntax: hdfs dfs -du <path>
Eg. hdfs dfs -du /chp
9. To count the number of directories, files, and bytes under a given path:
Syntax: hdfs dfs -count <path>
Eg. hdfs dfs -count /chp
o/p: 1 1 60
10. Remove a directory from HDFS:
Syntax: hdfs dfs -rmr <path>
Eg. hdfs dfs -rmr /chp
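Putting these together, a typical session using the same hypothetical paths might be:

hdfs dfs -mkdir /chp
hdfs dfs -copyFromLocal /home/hadoop/sample.txt /chp/abc1.txt
hdfs dfs -ls /chp
hdfs dfs -cat /chp/abc1.txt
hdfs dfs -du /chp
hdfs dfs -rm /chp/abc1.txt
hdfs dfs -rm -r /chp     (-rmr is deprecated; newer releases use -rm -r)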

Hadoop Configuration
Cluster Specification
Hadoop is designed to run on commodity hardware.
How large should your cluster be?
There isn’t an exact answer to this question, but the beauty of Hadoop is that you
can start with a small cluster (say, 10 nodes) and grow it as your storage and
computational needs grow.
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the
namenode and the jobtracker on a single master machine (as long as at least one
copy of the namenode’s metadata is stored on a remote filesystem).


As the cluster and the number of files stored in HDFS grow, the namenode needs
more memory, so the namenode and jobtracker should be moved onto separate
machines.
The secondary namenode can be run on the same machine as the namenode, but
again for reasons of memory usage (the secondary has the same memory
requirements as the primary), it is best to run it on a separate piece of hardware,
especially for larger clusters.

Network Topology
A common Hadoop cluster architecture consists of a two-level network topology,
as illustrated in Figure 1. Typically there are 30 to 40 servers per rack, with a 1 GB
switch for the rack (only three are shown in the diagram), and an uplink to a core
switch or router (which is normally 1 GB or better).

Fig. 1: Network Topology of a common Hadoop Cluster

The salient point is that the aggregate bandwidth between nodes on the same rack is
much greater than that between nodes on different racks.

Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop
so that it knows the topology of your network. For multirack clusters, you need
to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers

(where there is more bandwidth available) to off-rack transfers when placing


MapReduce tasks on nodes.
The namenode uses the network location when determining where to place block
replicas. The MapReduce scheduler uses network location to determine where the
closest replica is as input to a map task.
For the network in Figure 1, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in
this cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network
locations. The map is described by a Java interface, DNSToSwitchMapping
For the network in our example, we would map node1, node2, and node3 to
/rack1,and node4, node5, and node6 to /rack2.
The default behavior is to map all nodes to a single network location, called
/default-rack.
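A common alternative to implementing DNSToSwitchMapping in Java is to point the configuration at a topology script (the property is net.topology.script.file.name in Hadoop 2.x, topology.script.file.name in older releases). A trivial script for the example mapping might look like this; the script location and host names are illustrative:

#!/bin/bash
# Prints one network location per host name or IP address passed as an argument
for host in "$@"; do
  case "$host" in
    node1|node2|node3) echo "/rack1" ;;
    node4|node5|node6) echo "/rack2" ;;
    *)                 echo "/default-rack" ;;
  esac
done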

Cluster Setup and Installation


Installing Java
Java 6 or later is required to run Hadoop
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page
(http://hadoop.apache.org/core/releases.html), and unpack the contents of the
distribution in a sensible location,such as /usr/local (/opt is another standard choice).
Note that Hadoop is not installed in the hadoop user’s home directory, as that may
be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and
group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
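It is also common, though not required, to point an environment variable at the unpacked directory and put the Hadoop scripts on the PATH, for example in the hadoop user's shell profile (the version placeholder matches the tarball name above):

% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin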

Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed in Table-1.

Table-1 Hadoop Configuration Files

These files are all found in the etc/hadoop directory of the Hadoop
distribution.
The configuration directory can be relocated to another part of the
filesystem (outside the Hadoop installation, which makes upgrades marginally
easier) as long as daemons are started with the --config option (or, equivalently,
with the HADOOP_CONF_DIR environment variable set) specifying the
location of this directory on the local filesystem.

Configuration Management
Hadoop does not have a single, global location for configuration information.
Instead, each Hadoop node in the cluster has its own set of configuration files,
and it is up to administrators to ensure that they are kept in sync across the system.
Hadoop provides a rudimentary facility for synchronizing configuration using rsync;

alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Hadoop is designed so that it is possible to have a single set of configuration files
that are used for all master and worker machines.
For a cluster of any size, it can be a challenge to keep all of the machines in
sync: consider what happens if the machine is unavailable when you push out
an update—who ensures it gets the update when it becomes available? This is
a big problem and can lead to divergent installations, so even if you use the
Hadoop control scripts for managing Hadoop, it may be a good idea to use
configuration management tools for maintaining the cluster. These tools are
also excellent for doing regular maintenance, such as patching security holes and
updating system packages.

Control scripts
Hadoop comes with scripts for running commands, and starting and stopping
daemons across the whole cluster. To use these scripts (which can be found in the bin directory), you need to tell Hadoop which machines are in the cluster. There are
two files for this purpose, called masters and slaves, each of which contains a
list of the machine hostnames or IP addresses, one per line. Both masters and
slaves files reside in the configuration directory, although the slaves file may be
placed elsewhere (and given another name) by changing the HADOOP_SLAVES
setting in hadoop-env.sh. Also, these files do not need to be distributed to worker
nodes, since they are used only by the control scripts running on the
namenode or jobtracker.
For example, the start-dfs.sh script, which starts all the HDFS daemons in the
cluster, runs the namenode on the machine the script is run on.
In slightly more detail, it:
1. Starts a namenode on the local machine (the machine that the script is run on)
2. Starts a datanode on each machine listed in the slaves file
3. Starts a secondary namenode on each machine listed in the masters file
There is a similar script called start-mapred.sh, which starts all the MapReduce
daemons in the cluster.
More specifically, it:
1. Starts a jobtracker on the local machine

2. Starts a tasktracker on each machine listed in the slaves file

Note that masters is not used by the MapReduce control scripts.

Also there are stop-dfs.sh and stop-mapred.sh scripts to stop the daemons
started by the corresponding start script.
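For example, on a small cluster the two files might simply contain the following (the host names are illustrative):

masters (the machine that will run the secondary namenode):
secondarynn01

slaves (one worker per line):
worker01
worker02
worker03

The daemons are then started from the namenode and jobtracker machines respectively:
% start-dfs.sh
% start-mapred.sh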

Master node scenarios


Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker.
 On a small cluster (a few tens of nodes), it is convenient to put them on a
single machine; however,as the cluster gets larger, there are good reasons to
separate them.
 The namenode has high memory requirements, as it holds file and block
metadata for the entire namespace in memory.
 The secondary namenode, while idle most of the time,has a comparable
memory footprint to the primary when it creates a checkpoint.
 For filesystems with a large number of files, there may not be enough physical
memory on one machine to run both the primary and secondary namenode.
 The secondary namenode keeps a copy of the latest checkpoint of the
filesystem metadata that it creates.
 Keeping this (stale) backup on a different node from the namenode allows
recovery in the event of loss (or corruption) of all the namenode's metadata
files.
 On a busy cluster running lots of MapReduce jobs, the jobtracker uses
considerable memory and CPU resources, so it should run on a dedicated
node.
Whether the master daemons run on one or more nodes, the following
instructions apply:
• Run the HDFS control scripts from the namenode machine. The masters file should
contain the address of the secondary namenode.
• Run the MapReduce control scripts from the jobtracker machine.
When the namenode and jobtracker are on separate nodes, their slaves files need to
be kept in sync, since each node in the cluster should run a datanode and a
tasktracker.
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it
runs. This is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh.
In addition, the task tracker launches separate child JVMs to run map and reduce
tasks in, so we need to factor these into the total memory footprint of a worker
machine.
The maximum number of map tasks that can run on a tasktracker at one time
is controlled by the mapred.tasktracker.map.tasks.maximum property, which
defaults to two tasks.
There is a corresponding property for reduce tasks, mapred.tasktracker.reduce.tasks.maximum, which also defaults to two tasks.
The tasktracker is said to have two map slots and two reduce slots.
Hadoop also provides settings to control how much memory is used for MapReduce
operations. These can be set on a per-job basis also.
For the master nodes, each of the namenode, secondary namenode, and
jobtracker daemons uses 1,000 MB by default, a total of 3,000 MB.
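As a rough worked example of the worker-node footprint: with the defaults above (a 1,000 MB datanode daemon, a 1,000 MB tasktracker daemon, two map slots and two reduce slots) and child JVMs given 400 MB each (the -Xmx400m setting used later in Example 9-3), a worker needs about 1,000 + 1,000 + (2 + 2) × 400 = 3,600 MB; with seven slots of each kind, as in Example 9-3, the figure rises to about 7,600 MB.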
Java
The location of the Java implementation to use is determined by the
JAVA_HOME setting in hadoop-env.sh, or by the JAVA_HOME shell
environment variable if it is not set in hadoop-env.sh.
It’s a good idea to set the value in hadoop-env.sh, so that it is clearly defined in one
place and to ensure that the whole cluster is using the same version of Java.
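For example, the relevant lines in hadoop-env.sh might look like the following (the Java installation path is hypothetical and site-specific):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export HADOOP_HEAPSIZE=1000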
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by
default. Each Hadoop daemon running on a machine produces two logfiles. The
first is the log output written via log4j. This file, which ends in .log, should be the
first port of call when diagnosing problems, since most application log messages
are written here.
Old logfiles are never deleted, so you should arrange for them to be periodically
deleted or archived, so as to not run out of disk space on the local node.
The second logfile is the combined standard output and standard error log.
This logfile, which ends in .out, usually contains little or no output, since
Hadoop uses log4j for logging. It is only rotated when the daemon is restarted, and
only the last five logs are retained. Old log files are suffixed with a number between 1
and 5, with 5 being the oldest file.
Logfile names (of both types) are a combination of the name of the user running the
daemon, the daemon name, and the machine hostname.
Important Hadoop Daemon Properties
Hadoop has a large number of configuration properties. For any real-world working
cluster, we need to set at least the following properties in the Hadoop site files:
core-site.xml, hdfs-site.xml, and mapred-site.xml.
Typical examples of these files are shown in Example 9-1, Example 9-2, and
Example 9-3.
Notice that most properties are marked as final, in order to prevent them from being
overridden by job configurations.
Example 9-1. A typical core-site.xml configuration file
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
<final>true</final>
</property>
</configuration>
Example 9-2. A typical hdfs-site.xml configuration file
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/disk1/hdfs/name,/remote/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/disk1/hdfs/data,/disk2/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
<final>true</final>
</property>
</configuration>
Example 9-3. A typical mapred-site.xml configuration file
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>jobtracker:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/disk1/mapred/local,/disk2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>
HDFS
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the
namenode’s hostname or IP address, and port is the port that the namenode
will listen on for RPCs. If no port is specified, the default of 8020 is used.
The fs.default.name property also doubles as specifying the default filesystem. The
default filesystem is used to resolve relative paths, which are handy to use since
they save typing (and avoid hardcoding knowledge of a particular namenode's address).
There are a few other configuration properties you should set for HDFS: those
that set the storage directories for the namenode and for datanodes. The property
dfs.name.dir specifies a list of directories where the namenode stores
persistent filesystem metadata (the edit log and the filesystem image). A copy
of each of the metadata files is stored in each directory for redundancy. It’s common
to configure dfs.name.dir so that the namenode metadata is written to one or two
local disks, and a remote disk, such as an NFS-mounted directory. Such a setup
guards against failure of a local disk and failure of the entire namenode, since in both
cases the files can be recovered and used to start a new namenode. (The secondary
namenode takes only periodic checkpoints of the namenode, so it does not provide
an up-to-date backup of
the namenode.)
You should also set the dfs.data.dir property, which specifies a list of
directories for a datanode to store its blocks. Unlike the namenode, which uses
multiple directories for redundancy, a datanode round-robins writes between its
storage directories, so for performance you should specify a storage directory for
each local disk. Read performance also benefits from having multiple disks for
storage, because blocks will be spread across them, and concurrent reads for
distinct blocks will be correspondingly spread across disks.
Finally, you should configure where the secondary namenode stores its
checkpoints of the filesystem. The fs.checkpoint.dir property specifies a list of
directories where the checkpoints are kept.
Like the storage directories for the namenode, which keep redundant copies of the
namenode metadata, the checkpointed filesystem image is stored in each
checkpoint directory for redundancy.
Table 2 summarizes the important configuration properties for HDFS.
Table 2. Important HDFS daemon properties
MapReduce
To run MapReduce, you need to designate one machine as a jobtracker, which on
small clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the
jobtracker will listen on. Note that this property is not a URI, but a host-port pair,
separated by a colon. The port number 8021 is a common choice.
During a MapReduce job, intermediate data and working files are written to
temporary local files. Since this data includes the potentially very large output of map
tasks, you need to ensure that the mapred.local.dir property, which controls
the location of local temporary storage, is configured to use disk partitions
that are large enough. The mapred.local.dir property takes a comma-separated
list of directory names, and you should use all available local disks to spread
disk I/O.
MapReduce uses a distributed filesystem to share files (such as the job JAR file)
with the tasktrackers that run the MapReduce tasks. The mapred.system.dir
property is used to specify a directory where these files can be stored. This
directory is resolved relative to the default filesystem (configured in fs.default.name),
which is usually HDFS.
Finally, you should set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties to reflect the number of
available cores on the tasktracker machines and mapred.child.java.opts to
reflect the amount of memory available for the tasktracker child JVMs.
Table 3 summarizes the important configuration properties for MapReduce.
Table 3. Important MapReduce daemon properties
Other Hadoop Properties
This section discusses some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file
containing a list of authorized machines that may join the cluster as datanodes
or tasktrackers.
The file is specified using the dfs.hosts (for datanodes) and mapred.hosts (for
tasktrackers) properties, as well as the corresponding dfs.hosts.exclude and
mapred.hosts.exclude files used for decommissioning.
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a
conservative setting, and with modern hardware and operating systems, you will
likely see performance benefits by increasing it; 128 KB (131,072 bytes) is a
common choice. Set this using the io.file.buffer.size property in core-site.xml.
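For example, in core-site.xml (a sketch following the style of Example 9-1):
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>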
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB
(134,217,728 bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure
on the namenode and to give mappers more data to work on. Set this using the
dfs.block.size property in hdfs-site.xml.
Reserved storage space
By default, datanodes will try to use all of the space available in their storage
directories. If you want to reserve some space on the storage volumes for non-HDFS
use, then you can set dfs.datanode.du.reserved to the amount, in bytes, of space to
reserve.
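The two hdfs-site.xml settings just described might look like this (a sketch; the 10 GB reservation is an illustrative figure):
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>10737418240</value>
</property>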
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually
deleted, but rather are moved to a trash folder, where they remain for a minimum
period before being permanently deleted by the system. The minimum period in
minutes that a file will remain in the trash is set using the fs.trash.interval
configuration property in core-site.xml. By default, the trash interval is zero, which
disables trash.
HDFS will automatically delete files in trash folders, but other filesystems will not, so
you have to arrange for this to be done periodically. You can expunge the trash,
which will delete files that have been in the trash longer than their minimum period,
using the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
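For example, to keep deleted files in the trash for one day (1,440 minutes, an illustrative value), core-site.xml might include:
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>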
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO job scheduler to one of the more fully featured alternatives.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems
with cluster utilization, since they take up reduce slots while waiting for the map
tasks to complete.
Setting mapred.reduce.slowstart.completed.maps to a higher value, such as 0.80
(80%), can help improve throughput.
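For example, in mapred-site.xml (a sketch using the value suggested above):
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>0.80</value>
</property>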
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to
it. This involves creating a home directory for each user and setting ownership
permissions on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB
limit on the given user directory:
% hadoop dfsadmin -setSpaceQuota 1t /user/username
YARN Configuration
YARN is the next-generation architecture for running MapReduce. It has a different
set of daemons and configuration options from classic MapReduce (also called
MapReduce 1).
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a
single resource manager running on the same machine as the HDFS namenode (for
small clusters) or on a dedicated machine, and node managers running on each
worker node in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run
on), and a node manager on each machine listed in the slaves file.
YARN also has a job history server daemon that provides users with details of past
job runs, and a web app proxy server for providing a secure way for users to access
the UI provided by YARN applications.
YARN has its own set of configuration files, listed in Table 4; these are used in
addition to those in Table 1.
Table 4. YARN configuration files
Important YARN Daemon Properties
When running MapReduce on YARN, the mapred-site.xml file is still used for general
MapReduce properties, although the jobtracker- and tasktracker-related properties are
not used.
Table 5. Important YARN daemon properties
Hadoop's performance, functionality, and environment can be fine-tuned through various configuration files and settings. Here are the main Hadoop configuration files and some of the key properties commonly configured:
1. Core Hadoop Configuration Files
1. core-site.xml
o Contains configuration settings for the Hadoop core, such as the default
filesystem and I/O settings.
o Key configurations:
 fs.defaultFS: Sets the default file system URI (e.g., HDFS, local file
system).
 hadoop.tmp.dir: The default directory for Hadoop to store temporary
files.
 io.file.buffer.size: Buffer size for reading and writing data to/from
HDFS.
2. hdfs-site.xml
o Configures settings for HDFS, including replication, storage, and
namenode/datanode settings.
o Key configurations:
 dfs.replication: Sets the default replication factor for HDFS blocks.
 dfs.blocksize: Specifies the block size for HDFS files.
 dfs.namenode.name.dir: Local filesystem paths for storing
NameNode metadata.
 dfs.datanode.data.dir: Paths where DataNodes store data.
3. mapred-site.xml (in Hadoop 2.x and later, MapReduce runs on YARN, so the jobtracker-related settings here are superseded by YARN properties; see the sketch after this list)
o Configures settings for MapReduce, such as job execution parameters.
o Key configurations:
 mapreduce.framework.name: Defines the execution framework for
MapReduce (e.g., YARN).
 mapreduce.job.tracker: Specifies the job tracker (used in older
versions; replaced by YARN ResourceManager in newer versions).
 mapreduce.task.io.sort.mb: Buffer size for sorting map output before
writing to disk.
 mapreduce.map.memory.mb / mapreduce.reduce.memory.mb:
Memory allocation for map and reduce tasks.
4. yarn-site.xml (Hadoop 2.x and later)
o Configures settings for YARN (Yet Another Resource Negotiator), which
manages resources and job scheduling.
o Key configurations:
 yarn.resourcemanager.hostname: Specifies the hostname of the
YARN ResourceManager.
 yarn.nodemanager.resource.memory-mb: Maximum memory
available for NodeManager containers.
 yarn.scheduler.maximum-allocation-mb: Maximum memory that
can be allocated to a single container.
 yarn.nodemanager.vmem-pmem-ratio: Virtual memory to physical
memory ratio.
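A minimal sketch combining the settings above (the hostname and memory size are illustrative): in mapred-site.xml, MapReduce is pointed at YARN, and in yarn-site.xml the ResourceManager and NodeManager resources are configured:
<!-- mapred-site.xml -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- yarn-site.xml -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>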
Additional Hadoop Configuration Files
5. slaves (or workers)
o Lists the hostnames or IP addresses of all worker nodes.
o Used to manage which nodes are designated as DataNodes and NodeManagers
in the Hadoop cluster.
6. masters
o Lists the master nodes responsible for NameNode and ResourceManager functions.
o Typically, this file contains the hostname of the NameNode and/or Secondary
NameNode.
Configuration Modes
1. Standalone Mode
o Hadoop runs on a single JVM, mainly for testing and debugging.
o Only uses the local filesystem, so no HDFS configuration is required.
2. Pseudo-Distributed Mode
o Runs on a single machine but simulates a cluster by using multiple JVMs.
o HDFS and YARN are set up as if it were a cluster, but everything runs on one
machine.
3. Fully Distributed Mode
o Runs Hadoop on a cluster of multiple nodes, distributing storage and
processing across the nodes.
o This requires setting up and configuring all the above-mentioned configuration
files and tuning them according to cluster size and workload.
MapReduce Framework
MapReduce is a programming framework that allows us to perform parallel and
distributed processing on huge data sets in distributed environment.
MapReduce is implemented through the following components:
1. Architectural components
a) Job Tracker
b) Task Trackers
2. Functional components
a) Mapper (map)
b) Combiner (Shuffler)
c) Reducer (Reduce)
Architectural Components:
 The complete execution process (execution of Map and Reduce
tasks, both) is controlled by two types of entities called:
1. A Job Tracker: acts as the master, responsible for the complete execution of a
submitted job. The JobTracker is a master daemon responsible for overseeing the
MapReduce job. It provides connectivity between Hadoop and the application.
2. Multiple Task Trackers: act as slaves, each performing part of the job. This daemon
is responsible for executing the individual tasks assigned to it by the Job Tracker.
 For every job submitted for execution in the system, there is
one Job Tracker, which resides on the Name node, and there are
multiple Task Trackers, which reside on the Data nodes.
 A job is divided into multiple tasks, which are then run
on multiple data nodes in a cluster.
 It is the responsibility of job tracker to coordinate the activity
by scheduling tasks to run on different data nodes.
 Execution of an individual task is then looked after by the task
tracker, which resides on every data node executing part
of the job.
 Task tracker's responsibility is to send the progress report to the job
tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal
to the Job Tracker so as to notify it of the current state of the
system.
 Thus job tracker keeps track of overall progress of each job.
In the event of task failure, the job tracker can reschedule it
on a different task tracker.
Functional Components:
A job (the complete work) is submitted to the master, and Hadoop divides the
job into two phases, a map phase and a reduce phase. In between Map and
Reduce, there is a small phase called Shuffle & Sort in MapReduce.
1. Map tasks
2. Reduce tasks
Map Phase: This is the very first phase in the execution of a map-reduce
program. In this phase, the data in each split is passed to a mapping
function to produce output values. The map takes a key/value pair as
input: the key is a reference to the input, and the value is the data set
on which to operate. The map function applies the business logic to every
value in the input. The output that map produces is called intermediate
output. The output of map is stored on the local disk, from where it is
shuffled to reduce nodes.
Reduce Phase: In MapReduce, Reduce takes intermediate key/value
pairs as input and processes the output of the mapper. The key/value pairs
provided to the reducer are sorted by key. Usually, in the reducer, we do
aggregation or summation-style computation. A function defined by the
user supplies the values for a given key to the Reduce function.
Reduce produces a final output as a list of key/value pairs. This final
output is stored in HDFS, and replication is done as usual.
Shuffling: This phase consumes the output of the mapping phase. Its task is to
consolidate the relevant records from the mapping phase output.
A Word Count Example of MapReduce:
Let us understand how MapReduce works by taking an example where we have a
text file called example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we need to perform a word count on example.txt
using MapReduce. So, we will be finding the unique words and the
number of occurrences of those unique words.
 First, we divide the input in three splits as shown in the figure.


This will distribute the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words. The
rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
 Now, a list of key-value pair will be created where the key is the
individual word and value is one. So, for the first line (Dear Bear
River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The
mapping process remains the same on all the nodes.
 After mapper phase, a partition process takes place where
sorting and shuffling happens so that all the tuples with the
same key are sent to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have
a unique key and a list of values corresponding to that very
key. For example, Bear, [1,1]; Car, [1,1,1], etc.
 Now, each Reducer counts the values which are present in that
list of values. As shown in the figure, reducer gets a list of
values which is [1,1] for the key Bear. Then, it counts the
number of ones in the very list and gives the final output as –
Bear, 2.
 Finally, all the output key/value pairs are then collected and written in the
output file.
Heartbeat Signal:
 HDFS follows a master slave architecture. Namenode (master)
stores metadata about the data and Datanodes store/process the
actual data (and its replications).
 Now the namenode should know if any datanode in a cluster is
down (power failure/network failure); otherwise it will continue
assigning tasks or sending data/replications to that dead datanode.
 Heartbeat is a mechanism for detecting datanode failure and
ensuring that the link between datanodes and namenode is intact.
In Hadoop, the Name node and Data nodes communicate using
heartbeats. Therefore, a heartbeat is the signal that is sent by the
datanode to the namenode at regular intervals of time to
indicate its presence, i.e. to indicate that it is available.
 The default heartbeat interval is 3 seconds. If a DataNode in
HDFS does not send a heartbeat to the NameNode within ten minutes, then
the NameNode considers the DataNode to be out of service and the
block replicas hosted by that DataNode to be unavailable.
 Hence, once the heartbeat signal stops reaching the NameNode,
the NameNode performs certain tasks, such as replicating the
blocks present on that DataNode to other DataNodes, to keep the data
highly available and to ensure data reliability.
 The Heartbeat that the NameNode receives from a DataNode also
carries information like the total storage capacity, the fraction of
storage in use, and the number of data transfers currently in
progress. The NameNode uses these statistics for its block allocation
and load-balancing decisions. The heartbeat interval itself is configurable,
as sketched below.
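A sketch, assuming the standard hdfs-site.xml property name and its default value of 3 seconds:
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
</property>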
Speculative Execution
 In Hadoop, MapReduce breaks jobs into tasks, and these tasks run
in parallel rather than sequentially, which reduces the overall execution
time. This model of execution is sensitive to slow tasks (even if
they are few in number), as they slow down the overall execution
of a job.
 There may be various reasons for the slowdown of tasks, including
hardware degradation or software misconfiguration, but it may be
difficult to detect causes since the tasks still complete successfully,
although more time is taken than the expected time.
 The Hadoop framework does not try to diagnose or fix the slow-
running tasks. The framework tries to detect the task which is
running slower than the expected speed and launches another task,
which is an equivalent task as a backup. The backup task is known
as the speculative task, and this process is known as speculative
execution in Hadoop.
 As the name suggests, Hadoop tries to speculate about the slow-running
tasks and runs the same tasks on other nodes in parallel.
Whichever copy of the task completes first, that output is considered
for proceeding further, and the slower-running copies are killed.
 Firstly, all the tasks for the job are launched in Hadoop MapReduce.
The speculative tasks are launched for those tasks that have been
running for some time (at least one minute) and have not made much
progress, on average, as compared with other tasks from the job. The
speculative task is killed if the original task completes before the
speculative task; on the other hand, the original task is killed if the
speculative task finishes before it.
 In conclusion, we can say that speculative execution is a key
feature of Hadoop that improves job efficiency, since it reduces
the job execution time. The relevant configuration properties are sketched below.
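Speculative execution can be switched on or off per task type; a sketch, assuming the classic (MapReduce 1) property names, both of which default to true:
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>true</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>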
A MapReduce program written in Java requires three classes:
1. Driver class: specifies the job configuration details.
2. Mapper class: overrides the map function based on the problem statement.
3. Reducer class: overrides the reduce function based on the problem statement.
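Putting the three classes together, a compact word-count sketch using the new (org.apache.hadoop.mapreduce) API is shown below; class names and paths are illustrative, not part of any standard distribution:

// WordCount.java -- a minimal word-count sketch for the example above.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split on spaces, tabs and commas so "Dear, Bear, River" tokenizes cleanly.
      StringTokenizer itr = new StringTokenizer(value.toString(), " \t,");
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Using the reducer as a combiner works here because addition is associative and commutative. Once packaged into a JAR, the job could be submitted with something like (paths are illustrative):
% hadoop jar wordcount.jar WordCount /user/username/input /user/username/output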
Map Reduce Framework
Phases:
Map: converts input into key-value pairs.
Reduce: combines the output of mappers and produces a reduced result set.
Daemons:
JobTracker: master; schedules tasks.
TaskTracker: slave; executes tasks.
What is MapReduce? Explain in detail the different phases in MapReduce. (or) Explain
the MapReduce anatomy.
MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel, so very large-scale data analysis can be
done quickly.
In MapReduce programming, jobs (applications) are split into a set of map tasks and
reduce tasks.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data that is produced by
map tasks to generate final output.
Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner

The output produced by the map task is known as intermediate <keys, value> pairs.
These intermediate <keys, value> pairs are sent to reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle
2. Sort
3. Reducer
4. Output format
Hadoop assigns map tasks to the DataNode where the actual data to be processed
resides. This way, Hadoop ensures data locality. Data locality means that data is not
moved over the network; only the computational code is moved to process the data,
which saves network bandwidth.
Mapper Phases:
Mapper maps the input <keys, value> pairs into a set of intermediate <keys, value>
pairs.
Each map task is broken into following phases:
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and
presents it to the Mapper tasks, as keys and values.
i) InputFormat: reads the given input file and splits it using the getSplits() method.
ii) It then defines a RecordReader using createRecordReader(), which is responsible for
generating <keys, value> pairs.
2. Mapper: Map function works on the <keys, value> pairs produced by RecordReader
and generates intermediate (key, value) pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called once for
each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control over
execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform
the activities required to initialize the map() method.
3. Combiner: takes the intermediate <keys, value> pairs provided by the mapper and applies
a user-specific aggregate function to the output of a single mapper. It is also known as a local Reducer.
We can optionally specify a combiner using Job.setCombinerClass(ReducerClass) to
perform local aggregation on intermediate outputs.
Fig. MapReduce without Combiner class
Fig. MapReduce with Combiner class
4. Partitioner: takes the intermediate <keys, value> pairs produced by the mapper and splits
them into partitions using a user-defined condition.
The default behavior is to hash the key to determine the reducer. The user can control
this by overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
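As an illustration, a custom Partitioner might route words beginning with a-m to one reducer and all other words to another; this is a hypothetical sketch, not a standard Hadoop class:

// FirstLetterPartitioner.java -- illustrative two-way partitioning by first letter.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    if (numPartitions < 2 || word.isEmpty()) {
      return 0; // only one reducer, or nothing to inspect
    }
    char first = Character.toLowerCase(word.charAt(0));
    return (first >= 'a' && first <= 'm') ? 0 : 1; // crude two-way split
  }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).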
Reducer Phases:
1. Shuffle & Sort:
Downloads the grouped key-value pairs onto the local machine, where the Reducer is
running.
The individual <keys, value> pairs are sorted by key into a larger data list.
The data list groups the equivalent keys together so that their values can be iterated
easily in the Reducer task.
2. Reducer:
The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them.
Here, the data can be aggregated, filtered, and combined in a number of ways, and it
requires a wide range of processing.
Once the execution is over, it gives zero or more key-value pairs to the final step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called once
for each key, with the list of values for that key.
- void run(Context context): the user can override this method for complete control over
execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform
the activities required to initialize the reduce() method.
3. Output format:
In the output phase, we have an output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a file using a record writer.
Compression: In MapReduce programming we can compress the output file.
Compression provides two benefits as follows:
Reduces the space to store files.
Speeds up data transfer across the network.
We can specify the compression format in the Driver program as below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, a codec is the implementation of a compression and decompression algorithm; GzipCodec is
the codec for gzip compression.
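With the newer org.apache.hadoop.mapreduce API, the equivalent effect can be obtained in the Driver through the FileOutputFormat helper methods; a sketch, assuming a Job object named job as in the word-count driver above:

// Requires org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
// and org.apache.hadoop.io.compress.GzipCodec.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);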