Introduction to the
Hadoop Distributed File System (HDFS)
Course Road Map
Module 1: Big Data Management System
Module 2: Data Acquisition and Storage
    Lesson 5: Introduction to the Hadoop Distributed File System (HDFS)
    Lesson 6: Acquire Data Using CLI, Fuse-DFS, and Flume
    Lesson 7: Acquire and Access Data Using Oracle NoSQL Database
    Lesson 8: Primary Administrative Tasks for Oracle NoSQL Database
Module 3: Data Access and Processing
Module 4: Data Unification and Analysis
Module 5: Using and Managing Oracle Big Data Appliance
Objectives
After completing this lesson, you should be able to:
• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS
Agenda
• Understand the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS
HDFS: Characteristics
HDFS is designed for batch processing rather than interactive use, and it uses a scale-out model based on inexpensive commodity servers with internal disks (rather than RAID) to achieve large-scale storage.
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
HDFS Deployments:
High Availability (HA) and Non-HA
• Non-HA Deployment:
– Uses the NameNode/Secondary NameNode architecture
– The Secondary NameNode is not a failover for the
NameNode.
– The NameNode was the Single Point of Failure (SPOF) of
the cluster before Hadoop 2.0 and CDH 4.0.
• HA Deployment:
– Active NameNode
– Standby NameNode
HDFS Key Definitions
Term                 Description
Cluster              A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop               A batch processing infrastructure that stores files and distributes work across a group of servers (nodes).
Hadoop Cluster       A collection of racks containing master and slave nodes
Blocks               HDFS breaks a data file down into blocks, or "chunks," and stores the blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor   HDFS makes three copies (by default) of each data block and stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN)        A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the HDFS cluster
Secondary NameNode   Performs internal NameNode transaction log checkpointing
DataNode (DN)        Stores the blocks ("chunks") of data for a set of files
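The replication factor is 3 by default and can be changed per file with the FS shell setrep command. A minimal sketch (the path is illustrative, and the command requires a running HDFS cluster):

```shell
# Set the replication factor of one file to 2;
# -w waits until re-replication actually completes
hadoop fs -setrep -w 2 /user/oracle/curriculum/lab_05_01.txt
```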
NameNode (NN)
Manages the file system namespace (metadata) and controls access to files by client applications
[Diagram: the NameNode's metadata for the file movieplex1.log. Blocks: A, B, C; Data Nodes: 1, 2, 3; Replication Factor: 3. Each of blocks A, B, and C is recorded as stored on DN 1, DN 2, and DN 3.]
Functions of the NameNode
• Acts as the repository for all HDFS metadata
• Maintains the file system namespace
• Executes the directives for opening, closing, and renaming
files and directories
• Stores the HDFS state in an image file (fsimage)
• Stores file system modifications in an edit log file (edits)
• On startup, merges the fsimage and edits files, and
then empties edits
• Places replicas of blocks on multiple racks for fault
tolerance
• Records the number of replicas (replication factor) of a file
specified by an application
Secondary NameNode (Non-HA)
Checkpoints the NameNode metadata (it is not a failover backup of the NameNode)

[Diagram: the NameNode and the Secondary NameNode each hold the same metadata for movieplex1.log. Blocks: A, B, C; Data Nodes: 1, 2, 3; Replication Factor: 3; each block recorded on DN 1, DN 2, and DN 3.]
DataNodes (DN)
DataNode is responsible for storing the actual data in HDFS.
[Diagram: movieplex1.log is 350 MB in size and the block size is 128 MB, so the client chunks the file into three blocks: A (128 MB), B (128 MB), and C (94 MB). The NameNode (master) records the metadata (Blocks: A, B, C; Data Nodes: 1, 2, 3; Replication Factor: 3), and the blocks and their replicas are spread across the slave DataNodes, e.g. Data Node 1 and Data Node 2.]
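The chunking arithmetic in the diagram can be checked directly: a 350 MB file with a 128 MB block size needs ceil(350 / 128) = 3 blocks, and the last block holds only the 94 MB remainder.

```shell
FILE_MB=350
BLOCK_MB=128
# Number of blocks: ceiling division
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
# Size of the final, partially filled block
LAST_MB=$(( FILE_MB - (BLOCKS - 1) * BLOCK_MB ))
echo "$BLOCKS blocks; last block is $LAST_MB MB"   # prints "3 blocks; last block is 94 MB"
```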
Functions of DataNodes
DataNodes perform the following functions:
• Serving read and write requests from the file system
clients
• Performing block creation, deletion, and replication based
on instructions from the NameNode
• Providing simultaneous send/receive operations to
DataNodes during replication (“replication pipelining”)
[Diagram: a slave node running a DataNode daemon that stores blocks A, B, and C]
NameNode and Secondary NameNode

[Diagram: movieplex1.log is 350 MB in size and the block size is 128 MB, so the client chunks the file into three blocks: A (128 MB), B (128 MB), and C (94 MB). The NameNode and the Secondary NameNode (masters) hold the same metadata (Blocks: A, B, C; Data Nodes: 1, 2, 3; Replication Factor: 3; each block on DN 1, DN 2, and DN 3). The replicas are spread across DataNode 1, DataNode 2, and DataNode 3 (slaves).]
Storing and Accessing Data Files in HDFS
[Diagram: a client stores movieplex1.log in HDFS. The NameNode and the Secondary NameNode (masters) record the metadata (Blocks: A, B, C; Data Nodes: 1, 2, 3; each block on DN1, DN2, and DN3). Blocks A, B, and C are written to the slave DataNodes 1, 2, and 3 through a replication pipeline, and acknowledgment messages from the pipeline are sent back to the client once the blocks are copied.]
HDFS Architecture: HA
Component                  Description
NameNode (Active) Daemon   Responsible for all client operations in the cluster
NameNode (Standby) Daemon  Acts as a slave, or "hot" backup, to the Active NameNode, maintaining enough state to provide a fast failover if necessary
DataNode Daemon            Stores the data (HDFS) and processes it (MapReduce); this is a slave node
Available in Hadoop 2.0 and later, and in CDH 4.0 and later.
[Diagram: two master nodes (the Active and Standby NameNodes) and a slave node (DataNode)]
Data Replication and Rack Awareness in HDFS
[Diagram: blocks A, B, and C each have three replicas spread across Rack 1, Rack 2, and Rack 3, so that no single rack holds all the replicas of a block.]
Accessing HDFS
Agenda
• Understand the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS
HDFS Commands
The File System Namespace:
The HDFS FS (File System) Shell Interface
• HDFS supports a traditional hierarchical file organization.
• You can use the FS shell command-line interface to
interact with the data in HDFS. The syntax of this command set is similar to that of other shells (e.g., bash, csh).
– You can create, remove, rename, and move directories/files.
• You can invoke the FS shell as follows:
hadoop fs <args>
• The general command-line syntax is as follows:
hadoop command [genericOptions] [commandOptions]
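For example (the NameNode URI below is a placeholder for your cluster's, and the commands require a Hadoop installation):

```shell
# List the built-in FS shell commands and their usage
hadoop fs -help
# Generic options come before the command options, e.g. overriding
# the default file system with the -fs generic option:
hadoop fs -fs hdfs://namenode:8020 -ls /
```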
FS Shell Commands
Basic File System Operations: Examples
hadoop fs -ls
• For a file, it returns the file's stats in the following format:
– permissions number_of_replicas userid groupid
filesize modification_date modification_time
filename
• For a directory, it returns the list of its direct children, as in
UNIX. A directory is listed as:
– permissions userid groupid modification_date
modification_time dirname
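A sketch of a listing in that format (the user, sizes, and dates are illustrative only, and the command requires a running cluster):

```shell
hadoop fs -ls /user/oracle/curriculum
# Found 1 items
# -rw-r--r--   3 oracle oracle      2048 2013-06-01 10:15 /user/oracle/curriculum/lab_05_01.txt
# ^permissions ^replicas ^user ^group ^size ^date ^time    ^filename
```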
Basic File System Operations: Examples
Create an HDFS directory named curriculum by using the mkdir command:
Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
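Assuming lab_05_01.txt is in the current local directory, the two commands might look like this (relative HDFS paths resolve under your HDFS home directory):

```shell
# Create the HDFS directory
hadoop fs -mkdir curriculum
# Copy the local file into it
hadoop fs -copyFromLocal lab_05_01.txt curriculum
# Verify the copy
hadoop fs -ls curriculum
```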
Basic File System Operations: Examples
Delete the curriculum HDFS directory by using the rm command. Use the -r option
to delete the directory and any content under it recursively:
Display the contents of the part-r-00000 HDFS file by using the cat command:
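The two operations might look like this (the full path to part-r-00000 is an example):

```shell
# Delete the directory and everything under it recursively
hadoop fs -rm -r curriculum
# Print an HDFS file to stdout
hadoop fs -cat /user/oracle/output/part-r-00000
```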
Using the hdfs fsck Command: Example
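A typical fsck invocation checks the health of the files under a path and reports block-level details (the path is illustrative):

```shell
# Report overall health plus per-file block and location information
hdfs fsck /user -files -blocks -locations
```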
HDFS Features and Benefits
HDFS provides the following features and benefits:
• A Rebalancer to evenly distribute data across the
DataNodes
• A file system checking utility (fsck) to perform health
checks on the file system
• Procedures for upgrade and rollback
• A secondary NameNode to enable recovery and keep the
edits log file size within a limit
• A Backup Node to keep an in-memory copy of the
NameNode contents
Summary
In this lesson, you should have learned how to:
• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS