
The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung


Google


Google Inc. developed the Google File System (GFS), a scalable distributed
file system (DFS), to meet the company's growing data-processing needs.
GFS offers fault tolerance, reliability, scalability, availability, and high
performance to large networks of connected nodes. GFS is made up of many
storage machines assembled from inexpensive commodity hardware parts. It is
tailored to Google's varied data use and storage requirements; the search
engine, which generates enormous volumes of data that must be stored, is only
one example.
GFS masks the frequent hardware failures of commodity machines while
retaining the cost benefits of commercially available servers.
GFS is also known as GoogleFS. It manages two types of data: file metadata
and file data.
A GFS node cluster consists of a single master and several chunk servers
that many client systems access regularly. Chunk servers store data as Linux
files on local disks. Stored data is divided into large (64 MB) chunks, each
of which is replicated at least three times across the network. The large
chunk size reduces network overhead.
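Because the chunk size is fixed, a client can translate any byte offset in a
file into a chunk index arithmetically, before ever contacting the master. A
minimal sketch of that translation, with hypothetical function names (this is
not a real GFS API):

```python
# Hypothetical sketch of how a byte offset in a file maps to a GFS chunk.
# The 64 MB chunk size follows the description above; locate() is an
# illustrative helper, not part of any real GFS interface.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file falls inside the third chunk (index 2).
chunk_index, chunk_offset = locate(200_000_000)
print(chunk_index, chunk_offset)  # -> 2 65782272
```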
GFS is designed to meet Google's huge cluster requirements without burdening
applications. Files are stored in hierarchical directories identified by path
names. The master is in charge of metadata, including the namespace, access
control information, and the mapping of files to chunks. It communicates with
each chunk server through periodic heartbeat messages and keeps track of its
status.
The largest GFS clusters comprise more than 1,000 nodes with 300 TB of disk
storage capacity, available for constant access by hundreds of clients.
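To illustrate the heartbeat mechanism, here is a minimal sketch of the
bookkeeping a master might do, assuming a hypothetical in-memory table and
timeout value; real GFS internals are more involved:

```python
import time

# Minimal sketch of heartbeat tracking on the master, assuming a simple
# in-memory table. The timeout value is illustrative, not from GFS.

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a server is presumed dead

last_seen: dict[str, float] = {}  # chunk server address -> last heartbeat time

def on_heartbeat(server: str) -> None:
    """Record a heartbeat message received from a chunk server."""
    last_seen[server] = time.time()

def dead_servers() -> list[str]:
    """Chunk servers whose heartbeats have lapsed; their chunks need re-replication."""
    now = time.time()
    return [s for s, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
```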
Components of GFS
GFS is made up of clusters. A cluster is simply a network of connected
computers, and each cluster may contain hundreds or even thousands of
machines. Every GFS cluster contains three basic entities:
• GFS Clients: Computer programs or applications that request files.
Requests may access or modify existing files or add new files to the system.
• GFS Master Server: The cluster's coordinator. It preserves a record of the
cluster's actions in an operation log, and it keeps track of metadata, the
data that describes chunks. The metadata tells the master server which file
each chunk belongs to and where it fits within the overall file.
• GFS Chunk Servers: The workhorses of GFS. They store file chunks of 64 MB
each. Chunks are never routed through the master server; the chunk servers
deliver the requested chunks directly to the client. To ensure reliability,
GFS stores multiple copies of each chunk on different chunk servers; the
default is three copies. Each copy is referred to as a replica.

Features of GFS
• Namespace management and locking.
• Fault tolerance.
• Reduced client–master interaction because of the large chunk size.
• High availability.
• Replication of critical data.
• Automatic and efficient data recovery.
• High aggregate throughput.
Advantages of GFS
1. High accessibility: thanks to replication, data remains available even if
a few nodes fail. As the GFS designers put it, component failures are the
norm rather than the exception.
2. High throughput: many nodes operate concurrently.
3. Reliable storage: corrupted data can be detected and re-replicated.
Disadvantages of GFS
1. Not a good fit for small files.
2. The single master can become a bottleneck.
3. Suited only to data that is written once and thereafter only read (or
appended).


Comparing Hadoop and Spark

Spark is an enhancement of Hadoop's MapReduce. The primary difference between Spark and
MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas
MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing
speeds can be up to 100x faster than MapReduce.
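In PySpark, this in-memory retention is visible in the cache() call: once a
dataset is cached, later actions reuse it from RAM instead of re-reading it
from disk. A small sketch, assuming a local Spark installation and an
illustrative input file named events.txt:

```python
# Sketch of Spark's in-memory reuse. The file name and filter predicates
# are made up; only the cache()/count() pattern is the point.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
lines = spark.sparkContext.textFile("events.txt")

errors = lines.filter(lambda line: "ERROR" in line).cache()  # pin in memory

print(errors.count())                              # first action: reads disk, fills cache
print(errors.filter(lambda l: "db" in l).count())  # served from the in-memory cache

spark.stop()
```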

Furthermore, as opposed to the two-stage execution process in MapReduce, Spark creates a
Directed Acyclic Graph (DAG) to schedule tasks and orchestrate nodes across the
Hadoop cluster. This task-tracking process enables fault tolerance: recorded
operations can be reapplied to data from a previous state.
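The DAG is built lazily: transformations only record lineage, and nothing
executes until an action is called. If a partition is lost, Spark replays the
recorded lineage to rebuild it. A sketch with made-up data:

```python
# Sketch of Spark's lazy DAG construction. The numbers are arbitrary;
# the point is that map/filter only record lineage until count() runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1_000))          # a node in the DAG; nothing computed yet
squares = nums.map(lambda x: x * x)          # another node; still nothing runs
evens = squares.filter(lambda x: x % 2 == 0) # lineage recorded, not executed

print(evens.count())  # the action: Spark now schedules and runs the whole DAG

spark.stop()
```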

Let’s take a closer look at the key differences between Hadoop and Spark in six critical contexts:

1. Performance: Spark is faster because it uses random access memory (RAM) instead of
reading and writing intermediate data to disks. Hadoop stores data on multiple sources
and processes it in batches via MapReduce.
2. Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data
processing. Spark runs at a higher cost because its in-memory computations for
real-time data processing require large quantities of RAM to spin up
nodes.
3. Processing: Though both platforms process data in a distributed environment, Hadoop is
ideal for batch processing and linear data processing. Spark is ideal for real-time
processing and processing live unstructured data streams.
4. Scalability: When data volume rapidly grows, Hadoop quickly scales to accommodate
the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the
fault-tolerant HDFS for large volumes of data.
5. Security: Spark enhances security with authentication via shared secret or event logging,
whereas Hadoop uses multiple authentication and access control methods. Although
Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher
security level.
6. Machine learning (ML): Spark is the superior platform in this category because it
includes MLlib, which performs iterative in-memory ML computations. It also includes
tools for regression, classification, persistence, pipeline construction,
evaluation, and more; see the MLlib sketch after this list.
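As an illustration of MLlib's classification tools mentioned in item 6, here
is a tiny logistic-regression example; the four training rows are made-up toy
data, not from any real dataset:

```python
# Minimal MLlib sketch: fit a logistic-regression classifier on toy data.
# The feature vectors and labels are invented for the example.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"],
)

model = LogisticRegression(maxIter=10).fit(train)  # iterative, in-memory fit
model.transform(train).select("label", "prediction").show()

spark.stop()
```

Because the fit is iterative, keeping the training data in memory across
iterations is exactly where Spark's in-memory design pays off.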
