Assignment III – Paper Review

1. Purpose

This reading assignment is intended to give you experience with the very first step of research
activity, i.e., reading research papers. Unlike reading textbooks, you are not required to acquire
well-known facts; instead, you are expected to summarize the key idea of each paper you read and to
discuss the contributions and drawbacks of the research presented in it. Submit by 2nd April 2024.

Each student is expected to review one or more related papers, and to present his/her
understanding of the research paper he/she has chosen.

2. Reading Assignment

The following is a list of possible topics you may search/read (choose any one).

a. Process Migration
b. Communication
c. Synchronization
d. Distributed File Systems
e. Grid Computing
f. Replication and Fault Tolerance
Or any related topic of your choice

Decide on one research project you are interested in, and review one or more readings related to it.
Some of these may be research papers published through IEEE or ACM; others may come from a textbook section.
Review the papers in a timely manner and be prepared for your presentation.

3. Peer Evaluation
Peer evaluation will be carried out for this assignment.

An evaluation sheet will be provided for each student. Student presentations that fall under the same
research topic are evaluated together by that peer group.

The peer group is expected to evaluate each student presentation according to the evaluation sheet. This sheet
includes the following criteria:

Item 1 Did he/she understand the paper he/she reviewed well?
Item 2 Did he/she summarize the main idea of the papers well?
Item 3 Did he/she give clear answers to questions asked by the group/peer?
Item 4 Did he/she properly point out the contributions of the papers?
Item 5 Did he/she mention any drawbacks of the ideas introduced in the papers?
Item 6 Did he/she express his/her own opinions on how to improve the quality of the papers, research, and
projects he/she reviewed?

Title: The Google File System

Introduction:
Distributed file systems change how large-scale data is handled by spreading it across
many machines while keeping it available, scalable, and reliable. The Google File System
(GFS) is a prime example, built to serve Google's large-scale data needs. It departs from
older file systems by expecting some components to fail, handling very large files, mainly
appending new data rather than overwriting it, and working closely with applications to stay
flexible. GFS is deployed across multiple clusters and powers many of Google's data-driven workloads.

Paper Selection:
I have selected the research paper titled ‘The Google File System’, published at the ACM
Symposium on Operating Systems Principles (SOSP) in 2003. The paper presents the design of the
Google File System and explains how it maintains a high level of availability in an environment
where component failures are expected. The authors discuss various design issues and the
reasoning behind the decisions they made.

Paper Summary:

Firstly, GFS operates in an environment where component failures are common. It utilizes
hundreds or thousands of storage machines made from inexpensive parts, with constant
access from client machines. This high quantity and varied quality of components mean
failures are expected, necessitating continuous monitoring, error detection, fault tolerance,
and automatic recovery.

Secondly, GFS handles large files, often multi-gigabytes in size, containing numerous
application objects such as web documents. Managing billions of small files is impractical,
prompting a reevaluation of design assumptions and parameters like I/O operations and
block sizes.

Thirdly, most file operations involve appending new data rather than overwriting existing
data. Random writes are rare, with files typically being read sequentially. This access pattern
prioritizes performance optimization for appending and atomicity guarantees, reducing the
emphasis on client-side data block caching.

Lastly, GFS benefits from co-designing applications and the file system API, allowing for
increased flexibility. For instance, GFS's consistency model is relaxed to simplify the file
system without imposing heavy burdens on applications. An atomic append operation
enables multiple clients to append to a file concurrently without additional synchronization.
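
To make this concrete, below is a minimal sketch of how several producers might share one output
file through an atomic record append. It is illustrative only: the client class and its
record_append method are hypothetical stand-ins for the GFS client library, not its real API, and a
single process-local lock stands in for the server-side atomicity that GFS itself provides.

    import threading

    # Hypothetical GFS-like client: record_append(path, data) appends data
    # atomically at an offset the file system chooses and returns that offset.
    # This is an illustrative stand-in, not the real GFS client API.
    class FakeGFSClient:
        def __init__(self):
            self._files = {}
            self._lock = threading.Lock()  # stands in for GFS's server-side atomicity

        def record_append(self, path, data):
            with self._lock:
                buf = self._files.setdefault(path, bytearray())
                offset = len(buf)
                buf += data
                return offset  # each producer learns where its record landed

    gfs = FakeGFSClient()

    def producer(worker_id):
        # Many producers append to the same file with no client-side coordination;
        # the append call itself is the only synchronization point.
        for i in range(3):
            record = ("worker=%d event=%d\n" % (worker_id, i)).encode()
            gfs.record_append("/logs/crawl-results", record)

    threads = [threading.Thread(target=producer, args=(w,)) for w in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

This mirrors the producer-consumer use described in the paper: writers never agree on offsets among
themselves; they simply append and let the file system serialize the records.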

Multiple GFS clusters are deployed for various purposes, with the largest clusters comprising
over 1000 storage nodes and 300 TB of disk storage. These clusters experience heavy
access from hundreds of clients on distinct machines continuously.
Assumptions made by the authors while designing the system:

1. Use of Inexpensive Components: The system is composed of low-cost, standard components prone
to failure. It must continuously monitor itself and promptly recover from these failures.
2. Storage of Large Files: The system is optimized for storing a modest number of large files,
typically a hundred megabytes or larger, with multi-gigabyte files being common. While support
for small files exists, optimizing for them is not a priority.
3. Workload Characteristics: Workloads consist primarily of two kinds of reads: large streaming
reads and small random reads. Large reads typically span hundreds of kilobytes to multiple
megabytes, while small reads access a few kilobytes at arbitrary offsets. Performance-conscious
applications often batch and sort their small reads for efficiency (a small sketch of this
appears after this list).
4. Sequential Write Operations: Many sequential writes append data to files, with typical
operation sizes matching those for reads. Once written, files are rarely modified. While
support for small writes exists, efficiency is not a primary concern.
5. Concurrent File Access: The system must support multiple clients concurrently appending
data to the same file, often used for producer-consumer queues or merging. Atomicity and
minimal synchronization overhead are crucial, as files may be read while being appended
to.
6. Emphasis on Bandwidth over Latency: The system prioritizes high sustained bandwidth
over low latency. Most applications require bulk data processing at a high rate, with fewer
applications having strict response time requirements for individual read or write
operations.
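
As a small illustration of the batching point in assumption 3, the sketch below sorts a set of
small read requests by offset before issuing them, so the application moves through the file
steadily instead of seeking back and forth. The read_at callable is a hypothetical positional
read, not part of any real GFS interface.

    def batched_small_reads(read_at, requests):
        """Issue many small reads in sorted offset order.

        read_at(offset, length) -> bytes is a hypothetical positional read.
        requests is a list of (offset, length) pairs in arbitrary order.
        """
        results = {}
        for offset, length in sorted(requests):  # sort by offset so access is near-sequential
            results[(offset, length)] = read_at(offset, length)
        return results
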
Architecture design:

1. Cluster Structure: A GFS cluster consists of a single master and multiple chunkservers,
accessible by multiple clients. Each component typically runs on commodity Linux machines,
allowing for flexibility in deployment.
2. File Organization: Files are divided into fixed-size chunks, each identified by a unique
64-bit handle assigned by the master. Chunkservers store these chunks as local files
and handle read/write operations based on chunk handles and byte ranges. Data
replication is employed for reliability, with default settings of three replicas per chunk.
3. Master Responsibilities: The master maintains all file system metadata, including
namespace, access control, file-to-chunk mapping, and chunk locations. It manages
system-wide activities such as chunk lease management, garbage collection, and
chunk migration. Regular communication with chunkservers is maintained to ensure
coordination.
4. Client Interaction: Clients communicate with both the master and chunkservers. Metadata
operations involve the master, while data-bearing communication goes directly to the
chunkservers (a sketch of this read path follows this list). Client caching of file data is
omitted, simplifying the system and avoiding cache coherence issues. Chunkservers similarly
avoid caching file data because chunks are stored as local files.
5. Chunk Size and Metadata Handling: A large chunk size (64 MB) is chosen to optimize
performance, reducing the need for frequent client-master interaction and minimizing
metadata size. Metadata is stored in the master's memory, allowing for fast
operations and periodic background tasks like garbage collection and load balancing.
Chunk location information is polled from chunkservers rather than persistently stored
to simplify management and ensure consistency.
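
The read path described in items 2 to 5 can be sketched as follows. The fixed 64 MB chunk size and
the (file name, chunk index) to (chunk handle, replica locations) lookup come from the paper; every
name in the code (master.lookup, the chunkserver read call, and so on) is a hypothetical placeholder
rather than the real interface.

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

    def gfs_read(master, file_name, offset, length):
        """Illustrative client-side read path (all interfaces hypothetical).

        1. Translate the byte offset into a chunk index using the fixed chunk size.
        2. Ask the master for the chunk handle and replica locations (metadata only).
        3. Fetch the byte range directly from a chunkserver; no file data ever
           flows through the master.
        """
        data = bytearray()
        while length > 0:
            chunk_index = offset // CHUNK_SIZE
            chunk_offset = offset % CHUNK_SIZE
            # Metadata request: the master returns the 64-bit chunk handle plus
            # the chunkservers holding replicas of that chunk.
            handle, replicas = master.lookup(file_name, chunk_index)
            # Read at most to the end of this chunk, then continue with the next.
            to_read = min(length, CHUNK_SIZE - chunk_offset)
            data += replicas[0].read(handle, chunk_offset, to_read)
            offset += to_read
            length -= to_read
        return bytes(data)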

The Google File System (GFS) employs a lazy garbage collection mechanism for
storage reclamation after file deletion. Deleted files are renamed with a hidden
timestamped name, allowing them to be potentially undeleted within a configurable
time frame. Regular scans by the master remove hidden files and orphaned chunks,
simplifying storage management and ensuring reliability in large-scale distributed
systems. Stale replica detection, achieved through chunk versioning and regular
garbage collection, further enhances fault tolerance by maintaining up-to-date data
and safeguarding against corrupted replicas.
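
The lazy reclamation idea can be shown with a toy sketch. The hidden, timestamped rename on delete
and the periodic master scan are from the paper; the class, its fields, and the three-day grace
period used here are illustrative choices for the configurable interval, not the actual GFS data
structures.

    import time

    GRACE_PERIOD = 3 * 24 * 3600  # example value for the configurable interval

    class ToyNamespace:
        """Toy master-side namespace; not the real GFS data structures."""

        def __init__(self):
            self.files = {}  # path -> metadata placeholder

        def delete(self, path):
            # Deletion only renames the file to a hidden, timestamped name; the
            # data stays reachable and can be undeleted until the scan removes it.
            meta = self.files.pop(path)
            hidden = ".deleted/%s.%d" % (path, int(time.time()))
            self.files[hidden] = meta

        def garbage_collect(self, now=None):
            # Periodic background scan: drop hidden files older than the grace
            # period. Orphaned chunks would then be reclaimed lazily when
            # chunkservers report chunks the master no longer knows about.
            now = time.time() if now is None else now
            for path in list(self.files):
                if path.startswith(".deleted/"):
                    deleted_at = int(path.rsplit(".", 1)[1])
                    if now - deleted_at > GRACE_PERIOD:
                        del self.files[path]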

1. Simplified Management: GFS's approach to garbage collection simplifies management by merging
storage reclamation with the master's regular background tasks, such as namespace scans and
communication with chunkservers. This integration reduces the complexity of storage management
operations.
2. Reliability Enhancement: By deferring storage reclamation until regular garbage
collection cycles, GFS ensures a uniform and dependable process even in the face of
common component failures. This reliability is crucial for maintaining system
availability and data integrity.
3. Cost Amortization: The batched nature of storage reclamation during regular garbage
collection allows for cost amortization. By performing these operations when the
master is relatively free, GFS optimizes resource utilization and can promptly respond
to client requests.
4. Safety Net: The delay in reclaiming storage serves as a safety net against accidental,
irreversible deletion. This provides users with a window of opportunity to recover
deleted files before they are permanently removed from the system, mitigating the risk
of data loss due to human error.

Overall, GFS's garbage collection mechanism not only simplifies storage management and
enhances reliability but also optimizes resource utilization and provides a safety net against
data loss, contributing to the system's robustness in the face of component failures.
Conclusion:

1. Reimagining File System Assumptions: GFS challenges traditional file system assumptions by
treating component failures as the norm, optimizing for huge files that are mostly appended to
and read sequentially, and extending the standard file system interface to better suit the
workload and technological environment.
2. Robust Fault Tolerance Mechanisms: The system ensures fault tolerance through constant
monitoring, data replication, and fast, automatic recovery. Chunk replication and online repair
mechanisms address the frequent component failures, while checksumming detects data corruption
at the disk level, ensuring data integrity (a small sketch of block checksumming appears after
this list).
3. High Throughput and Scalability: GFS achieves high aggregate throughput for
multiple concurrent readers and writers by separating file system control from data
transfer. This separation, along with chunk leasing and a large chunk size,
minimizes master involvement and prevents it from becoming a bottleneck,
ensuring scalability.
4. Usage and Importance: GFS serves as Google's storage platform for research,
development, and production data processing, enabling innovation and
problem-solving at the scale of the entire web. Its success underscores its
importance as a tool for supporting large-scale data processing workloads on
commodity hardware.
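
The checksumming mentioned in point 2 can be sketched as below. The 64 KB block granularity is from
the paper; using zlib.crc32 here is an illustrative choice of 32-bit checksum, and the function
names are hypothetical.

    import zlib

    BLOCK_SIZE = 64 * 1024  # GFS keeps one 32-bit checksum per 64 KB block of a chunk

    def block_checksums(chunk_data):
        """Compute one checksum per 64 KB block of a chunk (illustrative)."""
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verify_read(chunk_data, stored_checksums):
        """A chunkserver recomputes checksums for the blocks it read and compares
        them with the stored ones before returning data; a mismatch indicates
        on-disk corruption, the read is failed, and the chunk would be
        re-replicated from a good replica."""
        return block_checksums(chunk_data) == stored_checksums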
