Big Data & Big Data
Analytics
Dr. Iman Ahmed ElSayed
Spring 24-25 / fourth level
Lecture 2- Evolution of Hadoop File
Systems
Big Data & Big Data Analytics
Lecture Contents:
1. The Beginning and the Need for a Distributed File System
2. Google influence (GFS)
3. Nutch Distributed File System (NDFS)
4. Birth of Hadoop 1.0
5. Hadoop’s Rise
6. Evolution of HDFS
7. Current HDFS System & Ecosystem Integration
8. Key Features of HDFS
2
Big Data
The Beginning and the Need
for a Distributed File System
Limitations of a traditional file system
Single point of failure
Capacity (storage) limitations
Performance Bottlenecks
Lack of Parallelism
3
Big Data
The Beginning and the Need
for a Distributed File System
"Imagine searching
through a library. One
person searching a huge
library takes a long time.
But if many people each
search a small section,
it's much faster."
4
Big Data
The Beginning and the Need
for a Distributed File System
Distributed Storage
• Data is spread across multiple machines (nodes) in a cluster.
• Eliminates the single point of failure.
• Allows for increased storage capacity.
Parallel Processing
• Multiple machines can work on different parts of the data simultaneously, speeding up analysis (see the sketch below).
Data Locality
• Processing happens on the same machines where the data resides, minimizing data transfer and improving performance.
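The library analogy carries over directly to parallel processing. The following is a minimal sketch (hypothetical names, not Hadoop code): the data is split into chunks, each worker searches only its own chunk, and the partial results are combined at the end.

```python
# Minimal sketch of parallel processing over chunks (hypothetical, not Hadoop code).
from concurrent.futures import ProcessPoolExecutor

def count_occurrences(chunk, term):
    # Work done locally on one chunk -- the "one person, one shelf" idea.
    return sum(1 for line in chunk if term in line)

def parallel_search(lines, term, num_workers=4):
    # Split the "library" into roughly equal chunks, one per worker.
    size = max(1, len(lines) // num_workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Each worker searches a different chunk at the same time.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = pool.map(count_occurrences, chunks, [term] * len(chunks))
    return sum(partial_counts)

if __name__ == "__main__":
    library = ["big data is everywhere"] * 1000 + ["hadoop stores big files"] * 1000
    print(parallel_search(library, "big"))  # 2000
```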
5
Big Data
The Beginning and the Need
for a Distributed File System
Principles of a Distributed File System
01 Fault Tolerance
02 Scalability
03 High Throughput
6
Big Data The Origins
Google File System (2003)
Apache Nutch (2002)
7
Big Data The Origins - GFS
Google File System (2003)
• It's a scalable, distributed, and fault-tolerant file system.
• Tailored for data-intensive applications.
• Runs on inexpensive commodity hardware.
• Delivers high aggregate performance.
Apache Nutch (2002)
• Doug Cutting and Mike Cafarella started working on a web search engine project.
• Had a significant impact on handling big data.
• Showed the necessity for a distributed file system to manage vast datasets.
• Paved the way for the development of HDFS.
8
Big Data Google File System (GFS)
Google's Need for a Scalable File System:
Explosive Data Growth
Commodity Hardware
Web Crawling and Indexing
The Problem: existing file systems couldn't meet these
demands, leading Google to develop GFS.
9
Big Data Google File System (GFS)
10
Big Data Key Concepts of GFS
Chunk Servers:
files are divided into fixed-size chunks (typically 64MB).
These chunks are stored on multiple chunk servers, which are
the worker nodes in the GFS cluster.
P.S.: "The data is broken into pieces, and those pieces are
stored on many machines.“
Master Node: It is the central coordinator of the GFS cluster.
It stores metadata about the file system, including the location of
chunks, file namespaces, and access control information.
It does not store the actual data.
P.S.: "The master node keeps track of where all the pieces are."
11
Big Data Key Concepts of GFS
Large File Sizes:
GFS was designed to handle very large files, which are
common in Big Data applications.
The large chunk size helps to reduce metadata overhead and
improve performance.
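Rough arithmetic shows why the large chunk size reduces metadata overhead: the fewer chunks a file is split into, the fewer entries the master must keep in memory. The numbers below are illustrative only.

```python
# Illustrative arithmetic: larger chunks mean far fewer metadata entries.
file_size = 1 * 1024**4              # a hypothetical 1 TB file
for chunk_size_mb in (4, 64):
    chunks = file_size // (chunk_size_mb * 1024**2)
    print(f"{chunk_size_mb} MB chunks -> {chunks:,} chunks to track")
# 4 MB chunks  -> 262,144 chunks to track
# 64 MB chunks -> 16,384 chunks to track
```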
Data Replication:
GFS achieves fault tolerance through data replication.
Each chunk is replicated multiple times (typically three) and
stored on different chunk servers.
P.S.: "Each piece of data is copied multiple times, so if one
machine fails, the data is still safe."
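A minimal sketch of the replication idea (hypothetical names, not GFS code): each chunk is placed on three different chunk servers, so the failure of any single server still leaves readable copies of every chunk.

```python
# Hypothetical sketch of 3-way replication (not GFS code).
import random

servers = ["cs-A", "cs-B", "cs-C", "cs-D", "cs-E"]
chunks = ["chunk-001", "chunk-002", "chunk-003", "chunk-004"]

# Place each chunk on three distinct servers.
placement = {c: random.sample(servers, 3) for c in chunks}

# Simulate the failure of one server: every chunk still has replicas left.
failed = "cs-B"
for chunk, replicas in placement.items():
    survivors = [s for s in replicas if s != failed]
    print(chunk, "still readable from", survivors)
```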
This influenced the architecture of Hadoop 1.0, which was the next in line.
12
Big Data Hadoop v1.0 (HDFS)
HDFS Inspiration from GFS
Open-Source Implementation: HDFS is an open-source
implementation of the concepts pioneered by Google's GFS.
Core Principles Adopted: HDFS adopted the core principles of
GFS, including:
• Distributed storage.
• Data replication for fault tolerance.
• Handling large files.
• Using commodity hardware.
Adaptation: HDFS was designed to be more general-purpose
than GFS, catering to a broader range of Big Data applications.
13
Big Data Hadoop v1.0 (HDFS) success
14
Big Data Hadoop v1.0 (HDFS) architecture
NameNode (Single Point of Failure): the NameNode is the
central master server that manages the file system namespace
and metadata.
In HDFS v1, there's only one NameNode, making it a single
point of failure. If it goes down, the entire file system becomes
inaccessible.
P.S.: "The NameNode is like the librarian, it knows where all the
books are, but there is only one librarian.“
DataNodes: the worker nodes that store the actual data blocks.
DataNodes report to the NameNode and perform read/write
operations on the data blocks.
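The division of labour between the NameNode and the DataNodes can be sketched as follows (a hypothetical simplification, not the real Hadoop client code): the client asks the NameNode only for block locations, then fetches the block contents directly from the DataNodes.

```python
# Hypothetical sketch of the HDFS v1 read path (not real Hadoop code).
class NameNode:
    def __init__(self):
        # file path -> ordered list of (block_id, [DataNodes holding a replica])
        self.namespace = {
            "/data/clicks.log": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn3"])],
        }

    def get_block_locations(self, path):
        return self.namespace[path]        # metadata only, never file data

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks               # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

namenode = NameNode()
datanodes = {
    "dn1": DataNode({"blk_1": b"first block..."}),
    "dn2": DataNode({"blk_2": b"second block..."}),
    "dn3": DataNode({"blk_1": b"first block...", "blk_2": b"second block..."}),
}

def read_file(path):
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        data += datanodes[locations[0]].read_block(block_id)  # read from a replica
    return data

print(read_file("/data/clicks.log"))
```

If the single NameNode is lost, the mapping from files to blocks is lost with it, which is exactly the single point of failure described above.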
15
Big Data Hadoop v1.0 (HDFS) architecture
Blocks: files are divided into fixed-size blocks (default 64MB or
128MB).
These blocks are distributed across multiple DataNodes.
Replication Factor: HDFS achieves fault tolerance through
data replication.
The replication factor determines the number of copies of each
block (default is 3).
P.S.: "Each block is copied 3 times, and those 3 copies are on
different DataNodes."
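A quick worked example of what the block size and replication factor mean in practice (assuming a hypothetical 1 GB file, 128 MB blocks, and the default replication factor of 3):

```python
# Worked example: number of blocks and raw storage for one file.
import math

file_size_mb = 1024          # a hypothetical 1 GB file
block_size_mb = 128          # HDFS block size
replication_factor = 3       # default number of copies per block

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(num_blocks)            # 8 blocks
print(raw_storage_mb)        # 3072 MB of cluster storage for 1024 MB of data
```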
16
Big Data Hadoop v1.0 (HDFS)
Advantages
• Scalability: the ability to scale horizontally by adding more DataNodes.
• Fault Tolerance: data replication keeps data available even when individual nodes fail.
• High Throughput: the ability to handle large volumes of data and provide high throughput for read/write operations.
Limitations
• Single NameNode (Scalability Bottleneck): the single NameNode is a major limitation, as it can become a bottleneck for large clusters and a single point of failure.
• Limited Namespace: the NameNode's memory limits the number of files and blocks that can be managed (see the estimate below).
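The limited-namespace point can be made concrete with a back-of-the-envelope estimate. A commonly cited rule of thumb (an assumption here, not an exact figure) is that each file, directory, and block object costs roughly 150 bytes of NameNode heap:

```python
# Back-of-the-envelope estimate of the NameNode namespace ceiling.
# The ~150 bytes/object figure is a rough rule of thumb, not an exact value.
bytes_per_object = 150
namenode_heap_gb = 32

objects_supported = (namenode_heap_gb * 1024**3) // bytes_per_object
print(f"~{objects_supported:,} file/directory/block objects")   # roughly 229 million
```

Because every block of every file must fit in this single in-memory table, adding DataNodes grows storage capacity but does not grow the namespace, which is why the single NameNode becomes the scalability bottleneck.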
17