Big Data & Big Data
Analytics
Dr. Iman Ahmed ElSayed
Spring 24-25 / fourth level
Lecture 2- Evolution of Hadoop File
Systems
Big Data & Big Data Analytics
Lecture Contents:
1. The Beginning and the Need for a Distributed File System
2. Google influence (GFS)
3. Nutch Distributed File System (NDFS)
4. Birth of Hadoop 1.0
5. Hadoop’s Rise
6. Evolution of HDFS
7. Current HDFS System & Ecosystem Integration
8. Key Features of HDFS
2
Big Data
The Beginning and the Need
for a Distributed File System
Limitations of a traditional file system
Single point of failure
Capacity (storage) limitations
Performance Bottlenecks
Lack of Parallelism
3
Big Data
The Beginning and the Need
for a Distributed File System
"Imagine searching
through a library. One
person searching a huge
library takes a long time.
But if many people each
search a small section,
it's much faster."
4
Big Data
The Beginning and the Need
for a Distributed File System
Distributed Storage
• Data is spread across multiple machines (nodes) in a cluster.
• Eliminates the single point of failure.
• Allows for increased storage capacity.
Parallel Processing
• Multiple machines can work on different parts of the data simultaneously, speeding up analysis (see the sketch below).
Data Locality
• Processing happens on the same machines where the data resides, minimizing data transfer and improving performance.
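The library analogy carries over directly to parallel processing. The following is a minimal sketch (hypothetical names, not Hadoop code): the data is split into chunks, each worker searches only its own chunk, and the partial results are combined at the end.

```python
# Minimal sketch of parallel processing over chunks (hypothetical, not Hadoop code).
from concurrent.futures import ProcessPoolExecutor

def count_occurrences(chunk, term):
    # Work done locally on one chunk -- the "one person, one shelf" idea.
    return sum(1 for line in chunk if term in line)

def parallel_search(lines, term, num_workers=4):
    # Split the "library" into roughly equal chunks, one per worker.
    size = max(1, len(lines) // num_workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Each worker searches a different chunk at the same time.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = pool.map(count_occurrences, chunks, [term] * len(chunks))
    return sum(partial_counts)

if __name__ == "__main__":
    library = ["big data is everywhere"] * 1000 + ["hadoop stores big files"] * 1000
    print(parallel_search(library, "big"))  # 2000
```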
5
Big Data
The Beginning and the Need
for a Distributed File System
Principles of a Distributed File System
01 Fault Tolerance
02 Scalability
03 High Throughput
6
Big Data The Origins
Google File System (2003)
Apache Nutch (2002)
7
Big Data The Origins - GFS
Google File System (2003)
• It's a scalable, distributed, and fault-tolerant file system.
• Tailored for data-intensive applications.
• Runs on inexpensive commodity hardware.
• Delivers high aggregate performance.
Apache Nutch (2002)
• Doug Cutting and Mike Cafarella started working on a web search engine project.
• Had a significant impact on handling big data.
• Showed the necessity for a distributed file system to manage vast datasets.
• Paved the way for the development of HDFS.
8
Big Data Google File System (GFS)
Google's Need for a Scalable File System:
Explosive Data Growth
Commodity Hardware
Web Crawling and Indexing
The Problem: existing file systems couldn't meet these
demands, leading Google to develop GFS.
9
Big Data Google File System (GFS)
10
Big Data Key Concepts of GFS
Chunk Servers:
files are divided into fixed-size chunks (typically 64MB).
These chunks are stored on multiple chunk servers, which are
the worker nodes in the GFS cluster.
P.S.: "The data is broken into pieces, and those pieces are
stored on many machines.“
Master Node: It is the central coordinator of the GFS cluster.
It stores metadata about the file system, including the location of
chunks, file namespaces, and access control information.
It does not store the actual data.
P.S.: "The master node keeps track of where all the pieces are."
11
Big Data Key Concepts of GFS
Large File Sizes:
GFS was designed to handle very large files, which are
common in Big Data applications.
The large chunk size helps to reduce metadata overhead and
improve performance.
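Rough arithmetic shows why the large chunk size reduces metadata overhead: the fewer chunks a file is split into, the fewer entries the master must keep in memory. The numbers below are illustrative only.

```python
# Illustrative arithmetic: larger chunks mean far fewer metadata entries.
file_size = 1 * 1024**4              # a hypothetical 1 TB file
for chunk_size_mb in (4, 64):
    chunks = file_size // (chunk_size_mb * 1024**2)
    print(f"{chunk_size_mb} MB chunks -> {chunks:,} chunks to track")
# 4 MB chunks  -> 262,144 chunks to track
# 64 MB chunks -> 16,384 chunks to track
```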
Data Replication:
GFS achieves fault tolerance through data replication.
Each chunk is replicated multiple times (typically three) and
stored on different chunk servers.
P.S.: "Each piece of data is copied multiple times, so if one
machine fails, the data is still safe."
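A minimal sketch of the replication idea (hypothetical names, not GFS code): each chunk is placed on three different chunk servers, so the failure of any single server still leaves readable copies of every chunk.

```python
# Hypothetical sketch of 3-way replication (not GFS code).
import random

servers = ["cs-A", "cs-B", "cs-C", "cs-D", "cs-E"]
chunks = ["chunk-001", "chunk-002", "chunk-003", "chunk-004"]

# Place each chunk on three distinct servers.
placement = {c: random.sample(servers, 3) for c in chunks}

# Simulate the failure of one server: every chunk still has replicas left.
failed = "cs-B"
for chunk, replicas in placement.items():
    survivors = [s for s in replicas if s != failed]
    print(chunk, "still readable from", survivors)
```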
This influenced the architecture of Hadoop 1.0, which was the next in line.
12
Big Data Hadoop v1.0 (HDFS)
HDFS Inspiration from GFS
Open-Source Implementation: HDFS is an open-source
implementation of the concepts pioneered by Google's GFS.
Core Principles Adopted: HDFS adopted the core principles of
GFS, including:
• Distributed storage.
• Data replication for fault tolerance.
• Handling large files.
• Using commodity hardware.
Adaptation: HDFS was designed to be more general-purpose
than GFS, catering to a broader range of Big Data applications.
13
Big Data Hadoop v1.0 (HDFS) success
14
Big Data Hadoop v1.0 (HDFS) architecture
NameNode (Single Point of Failure): the NameNode is the
central master server that manages the file system namespace
and metadata.
In HDFS v1, there's only one NameNode, making it a single
point of failure. If it goes down, the entire file system becomes
inaccessible.
P.S.: "The NameNode is like the librarian, it knows where all the
books are, but there is only one librarian.“
DataNodes: the worker nodes that store the actual data blocks.
DataNodes report to the NameNode and perform read/write
operations on the data blocks.
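The division of labour between the NameNode and the DataNodes can be sketched as follows (a hypothetical simplification, not the real Hadoop client code): the client asks the NameNode only for block locations, then fetches the block contents directly from the DataNodes.

```python
# Hypothetical sketch of the HDFS v1 read path (not real Hadoop code).
class NameNode:
    def __init__(self):
        # file path -> ordered list of (block_id, [DataNodes holding a replica])
        self.namespace = {
            "/data/clicks.log": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn3"])],
        }

    def get_block_locations(self, path):
        return self.namespace[path]        # metadata only, never file data

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks               # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

namenode = NameNode()
datanodes = {
    "dn1": DataNode({"blk_1": b"first block..."}),
    "dn2": DataNode({"blk_2": b"second block..."}),
    "dn3": DataNode({"blk_1": b"first block...", "blk_2": b"second block..."}),
}

def read_file(path):
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        data += datanodes[locations[0]].read_block(block_id)  # read from a replica
    return data

print(read_file("/data/clicks.log"))
```

If the single NameNode is lost, the mapping from files to blocks is lost with it, which is exactly the single point of failure described above.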
15
Big Data Hadoop v1.0 (HDFS) architecture
Blocks: files are divided into fixed-size blocks (default 64MB or
128MB).
These blocks are distributed across multiple DataNodes.
Replication Factor: HDFS achieves fault tolerance through
data replication.
The replication factor determines the number of copies of each
block (default is 3).
P.S.: "Each block is copied 3 times, and those 3 copies are on
different DataNodes."
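A quick worked example of what the block size and replication factor mean in practice (assuming a hypothetical 1 GB file, 128 MB blocks, and the default replication factor of 3):

```python
# Worked example: number of blocks and raw storage for one file.
import math

file_size_mb = 1024          # a hypothetical 1 GB file
block_size_mb = 128          # HDFS block size
replication_factor = 3       # default number of copies per block

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(num_blocks)            # 8 blocks
print(raw_storage_mb)        # 3072 MB of cluster storage for 1024 MB of data
```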
16
Big Data Hadoop v1.0 (HDFS)
Advantages
• Scalability: the ability to scale horizontally by adding more DataNodes.
• Fault Tolerance: data replication keeps data available even when individual nodes fail.
• High Throughput: the ability to handle large volumes of data and provide high throughput for read/write operations.
Limitations
• Single NameNode (Scalability Bottleneck): the single NameNode is a major limitation, as it can become a bottleneck for large clusters and a single point of failure.
• Limited Namespace: the NameNode's memory limits the number of files and blocks that can be managed (see the estimate below).
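The limited-namespace point can be made concrete with a back-of-the-envelope estimate. A commonly cited rule of thumb (an assumption here, not an exact figure) is that each file, directory, and block object costs roughly 150 bytes of NameNode heap:

```python
# Back-of-the-envelope estimate of the NameNode namespace ceiling.
# The ~150 bytes/object figure is a rough rule of thumb, not an exact value.
bytes_per_object = 150
namenode_heap_gb = 32

objects_supported = (namenode_heap_gb * 1024**3) // bytes_per_object
print(f"~{objects_supported:,} file/directory/block objects")   # roughly 229 million
```

Because every block of every file must fit in this single in-memory table, adding DataNodes grows storage capacity but does not grow the namespace, which is why the single NameNode becomes the scalability bottleneck.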
17