CS6CRT19 Big Data Analytics Module 3
The Hadoop Distributed File System (HDFS)
Big Data analytics applications are software applications that make use of
large-scale data. These applications analyze Big Data using massively parallel
processing frameworks. HDFS is a core component of Hadoop. It is designed to
run on clusters of commodity computers and servers, including those of
cloud-based utility services. HDFS stores Big Data that may range from
gigabytes to petabytes, and it stores the data in a distributed manner so that
it can be processed in parallel. The distributed data store in HDFS holds data
in any format, regardless of schema, and provides high-throughput access to
data-centric applications that require large-scale data processing workloads.
Figure 3-1 shows the HDFS architecture.
Figure 3-1: HDFS architecture
Namenode
The namenode is a commodity machine that runs the GNU/Linux operating
system and the namenode software; the software itself can run on ordinary
commodity hardware. The system hosting the namenode acts as the master server,
and it performs the following tasks:
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as renaming, closing, and opening
files and directories (a client-side sketch of these operations follows
this list).
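
Clients reach these namespace operations through Hadoop's Java FileSystem
API. The following is a minimal sketch, assuming a namenode reachable at
hdfs://namenode:9000 (a hypothetical address that depends on the cluster
configuration); every call below is answered from the namenode's metadata,
without touching file content on the datanodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster; the namenode address is an assumption
        // made for this sketch.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Namespace operations, all served by the namenode:
        fs.mkdirs(new Path("/user/demo"));                 // create a directory
        fs.rename(new Path("/user/demo/old.txt"),
                  new Path("/user/demo/new.txt"));         // rename a file

        // Listing a directory also reads only namenode metadata.
        for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        }
        fs.close();
    }
}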
Data node
A datanode is a commodity machine that runs the GNU/Linux operating
system and the datanode software. Every node (commodity system) in the cluster
runs a datanode, and these nodes manage the data storage of their system.
● Datanodes perform read-write operations on the file system, as per client
requests (see the read-write sketch after this list).
● They also perform operations such as block creation, deletion, and
replication, according to the instructions of the namenode.
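
The read-write path can be sketched with the same Java API, again assuming
the hypothetical namenode address used above. The client asks the namenode
which datanodes hold (or should hold) the blocks, then streams the bytes
directly to and from those datanodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWriteOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/demo/sample.txt");

        // Write: block data is streamed to datanodes chosen by the namenode.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read: the client fetches block locations from the namenode,
        // then reads the bytes directly from the datanodes.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}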
Block
Generally, user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments, which are stored in individual
datanodes. These file segments are called blocks. In other words, a block is
the minimum amount of data that HDFS can read or write. The default block size
is 64 MB in Hadoop 1.x (128 MB in Hadoop 2 and later), and it can be increased
as needed by changing the HDFS configuration, as the sketch below shows.
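
The block size can be set cluster-wide through the dfs.blocksize property
(dfs.block.size in older releases) or per file at creation time. A minimal
sketch, reusing the hypothetical namenode address from the sketches above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Create a file with a 128 MB block size and a replication factor
        // of 3; a file larger than 128 MB is then split across several blocks.
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),
                true,          // overwrite if the file exists
                4096,          // I/O buffer size in bytes
                (short) 3,     // replication factor
                blockSize);    // block size for this file
        out.writeBytes("data...\n");
        out.close();
        fs.close();
    }
}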
HDFS Data Storage
The Hadoop data store concept implies storing data across a number of
clusters. Each cluster has a number of data stores, called racks, and each
rack holds a number of DataNodes. Each DataNode stores a large number of data
blocks, and the racks are distributed across the cluster. Because the nodes
have both processing and storage capabilities, application tasks run on the
nodes that hold the required data blocks. Each data block is replicated, by
default, on at least three DataNodes, placed on the same or on remote racks.
The data at these stores enables running distributed applications, including
analytics, data mining, and OLAP, using the clusters. One way to observe this
block placement from a client is shown below.
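
The following minimal sketch, again assuming the hypothetical namenode
address used above, asks the namenode for a file's block locations; each block
is reported together with the hosts (datanodes) that hold its replicas.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/demo/big.dat");

        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation lists the datanodes holding one block's replicas.
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " -> replicas on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}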
Hadoop HDFS features are as follows:
● Create, append, delete, rename, and attribute-modification functions.
● The content of an individual file cannot be modified or replaced; it can
only be appended with new data at the end of the file (see the append
sketch after this list).
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with HDFS.
● The built-in servers of the namenode and datanode help users easily check
the status of the cluster.
● Streaming access to file system data.
● HDFS provides file permissions and authentication.
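
The append-only semantics in the list above can be seen through the same
Java API; note that on some older clusters append support must be enabled in
the configuration. A minimal sketch, with the same hypothetical namenode
address as before:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/demo/sample.txt");

        // Existing bytes cannot be modified in place; new data can
        // only be appended at the end of the file.
        FSDataOutputStream out = fs.append(file);
        out.writeBytes("appended line\n");
        out.close();
        fs.close();
    }
}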