CS6CRT19 Big Data Analytics: Module 3

The Hadoop Distributed File System (HDFS)


Big Data analytics applications are software applications that work with
large-scale data, analyzing it using massively parallel processing
frameworks. HDFS is a core component of Hadoop. It is designed to run on
clusters of computers and servers, including cloud-based utility services,
and stores Big Data ranging from gigabytes to petabytes. HDFS stores data in
a distributed manner in order to enable fast computation, and its distributed
data store accepts data in any format, regardless of schema. HDFS provides
high-throughput access for data-centric applications with large-scale data
processing workloads. Figure 3-1 shows the HDFS architecture.

Figure 3-1: HDFS architecture


Namenode
The namenode is commodity hardware that runs the GNU/Linux operating
system and the namenode software; the software itself can run on any
commodity machine. The system hosting the namenode acts as the master server
and performs the following tasks (a minimal client-side sketch follows the list):
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as renaming, closing, and opening
files and directories.
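As an illustration, the following is a minimal Java sketch using the Hadoop
FileSystem API; the namenode address and file paths here are assumptions for
the example, not part of the original notes. A rename touches only the file
system namespace, so the request is served entirely by the namenode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamenodeOpsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative namenode address; a real cluster normally
            // sets fs.defaultFS in core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);

            // Rename is a pure namespace (metadata) operation
            // handled by the namenode.
            boolean renamed = fs.rename(new Path("/user/demo/raw.log"),
                                        new Path("/user/demo/archived.log"));
            System.out.println("Renamed: " + renamed);
            fs.close();
        }
    }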
Datanode
The datanode is commodity hardware running the GNU/Linux operating
system and the datanode software. Every node (commodity hardware/system) in a
cluster runs a datanode, and these nodes manage the data storage of their
system.
● Datanodes perform read-write operations on the file system, as per client
requests (see the sketch after this list).
● They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
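To make this division of labour concrete, here is a minimal Java sketch of a
write followed by a read; the file path is an assumption. The namenode only
supplies block locations, while the bytes themselves flow to and from the
datanodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DatanodeIoSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/demo/sample.txt");

            // Write: the namenode picks target datanodes; the client
            // then streams the data directly to those datanodes.
            try (FSDataOutputStream out = fs.create(p)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the namenode returns block locations; the bytes
            // are read from the datanodes holding the blocks.
            try (FSDataInputStream in = fs.open(p)) {
                System.out.println(in.readUTF());
            }
        }
    }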
Block
Generally, user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments, which are stored in individual
datanodes. These file segments are called blocks. In other words, a block is
the minimum amount of data that HDFS can read or write. The default block
size is 64 MB, but it can be increased as needed by changing the HDFS
configuration.
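The block size of an existing file, and the default for new files, can be
inspected through the FileSystem API. The sketch below is illustrative: the
path is assumed, and dfs.blocksize is the standard configuration key for the
block size of newly created files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request a larger block size for new files
            // (illustrative value: 134217728 bytes = 128 MB).
            conf.set("dfs.blocksize", "134217728");
            FileSystem fs = FileSystem.get(conf);

            Path p = new Path("/user/demo/big.dat");  // assumed to exist
            FileStatus status = fs.getFileStatus(p);
            System.out.println("Block size of file: " + status.getBlockSize());
            System.out.println("Default block size: " + fs.getDefaultBlockSize(p));
        }
    }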
HDFS Data Storage
The Hadoop data store concept implies storing the data across a number of
clusters. Each cluster is organized into a number of data stores, called
racks, and each rack holds a number of DataNodes. Each DataNode has a large
number of data blocks, and the racks are distributed across the cluster. The
nodes have both processing and storage capabilities, so application tasks run
on the nodes that hold the relevant data blocks. By default, each data block
is replicated on at least three DataNodes, on the same or remote racks. The
data at the stores enables running distributed applications, including
analytics, data mining, and OLAP, using the clusters.
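Replication is visible from a client: each block of a file reports the
datanodes holding a replica. A minimal Java sketch, with an assumed file
path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat"));
            System.out.println("Replication factor: " + status.getReplication());

            // List, block by block, the datanodes holding each replica.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + loc.getOffset()
                        + " on: " + String.join(", ", loc.getHosts()));
            }
        }
    }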
Hadoop HDFS features are as follows (a short sketch of two of them follows the list):
● Create, append, delete, rename, and attribute-modification operations on files.
● The content of an individual file cannot be modified or replaced in place;
new data can only be appended at the end of the file.
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with HDFS.
● The built-in servers of the namenode and datanodes help users easily check
the status of the cluster.
● Streaming access to file system data.
● HDFS provides file permissions and authentication.
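Two of these features, append-only writes and file permissions, can be
exercised as in the Java sketch below. This assumes the target file already
exists and that the cluster permits appends; the path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FeatureSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/demo/log.txt");  // assumed to exist

            // Existing content is never modified in place; new data can
            // only be appended at the end of the file.
            try (FSDataOutputStream out = fs.append(p)) {
                out.writeBytes("new record\n");
            }

            // POSIX-style file permissions: rw-r--r--.
            fs.setPermission(p, new FsPermission((short) 0644));
        }
    }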
