Assignment III – Paper Review

1. Purpose

This reading assignment is intended to give you experience with the very first step of research
activity, i.e., reading research papers. Unlike reading textbooks, you are not required to acquire
well-known facts; instead, you are expected to summarize the key idea of each paper you read and to
discuss the contributions and drawbacks of the research presented in it. Submit by 2nd April 2024.

Each student is expected to review one or more related papers, and to present his/her
understanding of the research paper he/she has chosen.

2. Reading Assignment

The following is a list of possible topics you may search/read (choose any one).

a. Process Migration
b. Communication
c. Synchronization
d. Distributed File Systems
e. Grid Computing
f. Replication and Fault Tolerance
Or any related topic of your choice

Decide on one research project you are interested in, and review one or more readings related to it.
Some of these may be research papers published through IEEE or ACM; others may come from a textbook section.
Review the papers in a timely manner and be prepared for your presentation.

3. Peer Evaluation
Peer evaluation will be carried out for this assignment.

An evaluation sheet will be provided for each student. Student presentations that fall under the same
research topic are evaluated together by that peer group.

The peer group is expected to evaluate each student presentation according to the evaluation sheet. This sheet
includes the following criteria:

Item 1 Did he/she understand the paper he/she reviewed well?
Item 2 Did he/she summarize the main idea of the papers well?
Item 3 Did he/she give clear answers to questions asked by the group/peer?
Item 4 Did he/she properly point out the contributions of the papers?
Item 5 Did he/she mention any drawbacks of the ideas introduced in the papers?
Item 6 Did he/she express his/her own opinions on how to improve the quality of the papers, research, and
projects he/she reviewed?

Title: The Google File System

Introduction:
Distributed file systems change how large-scale data is handled by spreading it across
many machines while keeping it available, scalable, and reliable. The Google File System
(GFS) is a prime example, built to serve Google's large-scale data needs. It departs from
older file systems by expecting some components to fail, handling very large files, mainly
appending new data rather than overwriting it, and working closely with applications to stay
flexible. GFS is deployed across multiple clusters and powers many of Google's data-driven workloads.

Paper Selection:
I have selected the research paper titled ‘The Google File System’, published at the ACM
Symposium on Operating Systems Principles (SOSP) in 2003. The paper presents the design of the
Google File System and explains how it maintains a high level of availability in an environment
where component failures are expected. The authors discuss various design issues and the
reasoning behind the decisions they made.

Paper Summary:

Firstly, GFS operates in an environment where component failures are common. It utilizes
hundreds or thousands of storage machines made from inexpensive parts, with constant
access from client machines. This high quantity and varied quality of components mean
failures are expected, necessitating continuous monitoring, error detection, fault tolerance,
and automatic recovery.

Secondly, GFS handles large files, often multi-gigabytes in size, containing numerous
application objects such as web documents. Managing billions of small files is impractical,
prompting a reevaluation of design assumptions and parameters like I/O operations and
block sizes.

Thirdly, most file operations involve appending new data rather than overwriting existing
data. Random writes are rare, with files typically being read sequentially. This access pattern
prioritizes performance optimization for appending and atomicity guarantees, reducing the
emphasis on client-side data block caching.

Lastly, GFS benefits from co-designing applications and the file system API, allowing for
increased flexibility. For instance, GFS's consistency model is relaxed to simplify the file
system without imposing heavy burdens on applications. An atomic append operation
enables multiple clients to append to a file concurrently without additional synchronization.
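
To make this concrete, below is a minimal sketch of how several producers might share one output
file through an atomic record append. It is illustrative only: the client class and its
record_append method are hypothetical stand-ins for the GFS client library, not its real API, and a
single process-local lock stands in for the server-side atomicity that GFS itself provides.

    import threading

    # Hypothetical GFS-like client: record_append(path, data) appends data
    # atomically at an offset the file system chooses and returns that offset.
    # This is an illustrative stand-in, not the real GFS client API.
    class FakeGFSClient:
        def __init__(self):
            self._files = {}
            self._lock = threading.Lock()  # stands in for GFS's server-side atomicity

        def record_append(self, path, data):
            with self._lock:
                buf = self._files.setdefault(path, bytearray())
                offset = len(buf)
                buf += data
                return offset  # each producer learns where its record landed

    gfs = FakeGFSClient()

    def producer(worker_id):
        # Many producers append to the same file with no client-side coordination;
        # the append call itself is the only synchronization point.
        for i in range(3):
            record = ("worker=%d event=%d\n" % (worker_id, i)).encode()
            gfs.record_append("/logs/crawl-results", record)

    threads = [threading.Thread(target=producer, args=(w,)) for w in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

This mirrors the producer-consumer use described in the paper: writers never agree on offsets among
themselves; they simply append and let the file system serialize the records.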

Multiple GFS clusters are deployed for various purposes, with the largest clusters comprising
over 1000 storage nodes and 300 TB of disk storage. These clusters experience heavy
access from hundreds of clients on distinct machines continuously.
Assumptions made by the authors while designing the system:

1. Use of Inexpensive Components: The system is composed of low-cost, standard components prone
to failure. It must continuously monitor itself and promptly recover from these failures.
2. Storage of Large Files: The system is optimized for storing a modest number of large files,
typically a hundred megabytes or larger, with multi-gigabyte files being common. While support
for small files exists, optimizing for them is not a priority.
3. Workload Characteristics: Workloads consist primarily of two kinds of reads: large streaming
reads and small random reads. Large reads typically span hundreds of kilobytes to multiple
megabytes, while small reads access a few kilobytes at arbitrary offsets. Performance-conscious
applications often batch and sort their small reads for efficiency (a small sketch of this
appears after this list).
4. Sequential Write Operations: Many sequential writes append data to files, with typical
operation sizes matching those for reads. Once written, files are rarely modified. While
support for small writes exists, efficiency is not a primary concern.
5. Concurrent File Access: The system must support multiple clients concurrently appending
data to the same file, often used for producer-consumer queues or merging. Atomicity and
minimal synchronization overhead are crucial, as files may be read while being appended
to.
6. Emphasis on Bandwidth over Latency: The system prioritizes high sustained bandwidth
over low latency. Most applications require bulk data processing at a high rate, with fewer
applications having strict response time requirements for individual read or write
operations.
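
As a small illustration of the batching point in assumption 3, the sketch below sorts a set of
small read requests by offset before issuing them, so the application moves through the file
steadily instead of seeking back and forth. The read_at callable is a hypothetical positional
read, not part of any real GFS interface.

    def batched_small_reads(read_at, requests):
        """Issue many small reads in sorted offset order.

        read_at(offset, length) -> bytes is a hypothetical positional read.
        requests is a list of (offset, length) pairs in arbitrary order.
        """
        results = {}
        for offset, length in sorted(requests):  # sort by offset so access is near-sequential
            results[(offset, length)] = read_at(offset, length)
        return results
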
Architecture design:

1. Cluster Structure: A GFS cluster consists of a single master and multiple chunkservers,
accessible by multiple clients. Each component typically runs on commodity Linux machines,
allowing for flexibility in deployment.
2. File Organization: Files are divided into fixed-size chunks, each identified by a unique
64-bit handle assigned by the master. Chunkservers store these chunks as local files
and handle read/write operations based on chunk handles and byte ranges. Data
replication is employed for reliability, with default settings of three replicas per chunk.
3. Master Responsibilities: The master maintains all file system metadata, including
namespace, access control, file-to-chunk mapping, and chunk locations. It manages
system-wide activities such as chunk lease management, garbage collection, and
chunk migration. Regular communication with chunkservers is maintained to ensure
coordination.
4. Client Interaction: Clients communicate with both the master and chunkservers. Metadata
operations involve the master, while data-bearing communication goes directly to the
chunkservers (a sketch of this read path follows this list). Client caching of file data is
omitted, simplifying the system and avoiding cache coherence issues. Chunkservers similarly
avoid caching file data because chunks are stored as local files.
5. Chunk Size and Metadata Handling: A large chunk size (64 MB) is chosen to optimize
performance, reducing the need for frequent client-master interaction and minimizing
metadata size. Metadata is stored in the master's memory, allowing for fast
operations and periodic background tasks like garbage collection and load balancing.
Chunk location information is polled from chunkservers rather than persistently stored
to simplify management and ensure consistency.
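
The read path described in items 2 to 5 can be sketched as follows. The fixed 64 MB chunk size and
the (file name, chunk index) to (chunk handle, replica locations) lookup come from the paper; every
name in the code (master.lookup, the chunkserver read call, and so on) is a hypothetical placeholder
rather than the real interface.

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

    def gfs_read(master, file_name, offset, length):
        """Illustrative client-side read path (all interfaces hypothetical).

        1. Translate the byte offset into a chunk index using the fixed chunk size.
        2. Ask the master for the chunk handle and replica locations (metadata only).
        3. Fetch the byte range directly from a chunkserver; no file data ever
           flows through the master.
        """
        data = bytearray()
        while length > 0:
            chunk_index = offset // CHUNK_SIZE
            chunk_offset = offset % CHUNK_SIZE
            # Metadata request: the master returns the 64-bit chunk handle plus
            # the chunkservers holding replicas of that chunk.
            handle, replicas = master.lookup(file_name, chunk_index)
            # Read at most to the end of this chunk, then continue with the next.
            to_read = min(length, CHUNK_SIZE - chunk_offset)
            data += replicas[0].read(handle, chunk_offset, to_read)
            offset += to_read
            length -= to_read
        return bytes(data)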

The Google File System (GFS) employs a lazy garbage collection mechanism for
storage reclamation after file deletion. Deleted files are renamed with a hidden
timestamped name, allowing them to be potentially undeleted within a configurable
time frame. Regular scans by the master remove hidden files and orphaned chunks,
simplifying storage management and ensuring reliability in large-scale distributed
systems. Stale replica detection, achieved through chunk versioning and regular
garbage collection, further enhances fault tolerance by maintaining up-to-date data
and safeguarding against corrupted replicas.
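
The lazy reclamation idea can be shown with a toy sketch. The hidden, timestamped rename on delete
and the periodic master scan are from the paper; the class, its fields, and the three-day grace
period used here are illustrative choices for the configurable interval, not the actual GFS data
structures.

    import time

    GRACE_PERIOD = 3 * 24 * 3600  # example value for the configurable interval

    class ToyNamespace:
        """Toy master-side namespace; not the real GFS data structures."""

        def __init__(self):
            self.files = {}  # path -> metadata placeholder

        def delete(self, path):
            # Deletion only renames the file to a hidden, timestamped name; the
            # data stays reachable and can be undeleted until the scan removes it.
            meta = self.files.pop(path)
            hidden = ".deleted/%s.%d" % (path, int(time.time()))
            self.files[hidden] = meta

        def garbage_collect(self, now=None):
            # Periodic background scan: drop hidden files older than the grace
            # period. Orphaned chunks would then be reclaimed lazily when
            # chunkservers report chunks the master no longer knows about.
            now = time.time() if now is None else now
            for path in list(self.files):
                if path.startswith(".deleted/"):
                    deleted_at = int(path.rsplit(".", 1)[1])
                    if now - deleted_at > GRACE_PERIOD:
                        del self.files[path]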

1. Simplified Management: GFS's approach to garbage collection simplifies management by merging
storage reclamation with the master's regular background tasks, such as namespace scans and
communication with chunkservers. This integration reduces the complexity of storage management
operations.
2. Reliability Enhancement: By deferring storage reclamation until regular garbage
collection cycles, GFS ensures a uniform and dependable process even in the face of
common component failures. This reliability is crucial for maintaining system
availability and data integrity.
3. Cost Amortization: The batched nature of storage reclamation during regular garbage
collection allows for cost amortization. By performing these operations when the
master is relatively free, GFS optimizes resource utilization and can promptly respond
to client requests.
4. Safety Net: The delay in reclaiming storage serves as a safety net against accidental,
irreversible deletion. This provides users with a window of opportunity to recover
deleted files before they are permanently removed from the system, mitigating the risk
of data loss due to human error.

Overall, GFS's garbage collection mechanism not only simplifies storage management and
enhances reliability but also optimizes resource utilization and provides a safety net against
data loss, contributing to the system's robustness in the face of component failures.
Conclusion:

1. Reimagining File System Assumptions: GFS challenges traditional file system assumptions by
treating component failures as the norm, optimizing for huge files that are mostly appended to
and read sequentially, and extending the standard file system interface to better suit the
workload and technological environment.
2. Robust Fault Tolerance Mechanisms: The system ensures fault tolerance through constant
monitoring, data replication, and fast, automatic recovery. Chunk replication and online repair
mechanisms address the frequent component failures, while checksumming detects data corruption
at the disk level, ensuring data integrity (a small sketch of block checksumming appears after
this list).
3. High Throughput and Scalability: GFS achieves high aggregate throughput for
multiple concurrent readers and writers by separating file system control from data
transfer. This separation, along with chunk leasing and a large chunk size,
minimizes master involvement and prevents it from becoming a bottleneck,
ensuring scalability.
4. Usage and Importance: GFS serves as Google's storage platform for research,
development, and production data processing, enabling innovation and
problem-solving at the scale of the entire web. Its success underscores its
importance as a tool for supporting large-scale data processing workloads on
commodity hardware.
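
The checksumming mentioned in point 2 can be sketched as below. The 64 KB block granularity is from
the paper; using zlib.crc32 here is an illustrative choice of 32-bit checksum, and the function
names are hypothetical.

    import zlib

    BLOCK_SIZE = 64 * 1024  # GFS keeps one 32-bit checksum per 64 KB block of a chunk

    def block_checksums(chunk_data):
        """Compute one checksum per 64 KB block of a chunk (illustrative)."""
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verify_read(chunk_data, stored_checksums):
        """A chunkserver recomputes checksums for the blocks it read and compares
        them with the stored ones before returning data; a mismatch indicates
        on-disk corruption, the read is failed, and the chunk would be
        re-replicated from a good replica."""
        return block_checksums(chunk_data) == stored_checksums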
