
The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung


Google


Google Inc. developed the Google File System (GFS), a scalable distributed
file system (DFS), to meet the company's growing data-processing needs.
GFS offers fault tolerance, reliability, scalability, availability, and high
performance to large networks of connected nodes. GFS is made up of many
storage machines assembled from inexpensive commodity hardware parts. It is
tailored to Google's varied data use and storage requirements; the search
engine, which generates enormous volumes of data that must be stored, is only
one example.
GFS masks the frequent hardware failures of commodity machines while
retaining the cost benefits of commercially available servers.
GFS is also known as GoogleFS. It manages two types of data: file metadata
and file data.
A GFS node cluster consists of a single master and several chunk servers
that many client systems access regularly. Chunk servers store data as Linux
files on local disks. Stored data is divided into large (64 MB) chunks, each
of which is replicated at least three times across the network. The large
chunk size reduces network overhead.
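Because the chunk size is fixed, a client can translate any byte offset in a
file into a chunk index arithmetically, before ever contacting the master. A
minimal sketch of that translation, with hypothetical function names (this is
not a real GFS API):

```python
# Hypothetical sketch of how a byte offset in a file maps to a GFS chunk.
# The 64 MB chunk size follows the description above; locate() is an
# illustrative helper, not part of any real GFS interface.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file falls inside the third chunk (index 2).
chunk_index, chunk_offset = locate(200_000_000)
print(chunk_index, chunk_offset)  # -> 2 65782272
```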
GFS is designed to meet Google's huge cluster requirements without burdening
applications. Files are stored in hierarchical directories identified by path
names. The master is in charge of metadata, including the namespace, access
control information, and the mapping of files to chunks. It communicates with
each chunk server through periodic heartbeat messages and keeps track of its
status.
The largest GFS clusters comprise more than 1,000 nodes with 300 TB of disk
storage capacity, available for constant access by hundreds of clients.
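To illustrate the heartbeat mechanism, here is a minimal sketch of the
bookkeeping a master might do, assuming a hypothetical in-memory table and
timeout value; real GFS internals are more involved:

```python
import time

# Minimal sketch of heartbeat tracking on the master, assuming a simple
# in-memory table. The timeout value is illustrative, not from GFS.

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a server is presumed dead

last_seen: dict[str, float] = {}  # chunk server address -> last heartbeat time

def on_heartbeat(server: str) -> None:
    """Record a heartbeat message received from a chunk server."""
    last_seen[server] = time.time()

def dead_servers() -> list[str]:
    """Chunk servers whose heartbeats have lapsed; their chunks need re-replication."""
    now = time.time()
    return [s for s, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
```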
Components of GFS
GFS is made up of clusters. A cluster is simply a network of connected
computers, and each cluster may contain hundreds or even thousands of
machines. Every GFS cluster contains three basic entities:
• GFS Clients: Computer programs or applications that request files.
Requests may access or modify existing files or add new files to the system.
• GFS Master Server: The cluster's coordinator. It preserves a record of the
cluster's actions in an operation log, and it keeps track of metadata, the
data that describes chunks. The metadata tells the master server which file
each chunk belongs to and where it fits within the overall file.
• GFS Chunk Servers: The workhorses of GFS. They store file chunks of 64 MB
each. Chunks are never routed through the master server; the chunk servers
deliver the requested chunks directly to the client. To ensure reliability,
GFS stores multiple copies of each chunk on different chunk servers; the
default is three copies. Each copy is referred to as a replica.

Features of GFS
• Namespace management and locking.
• Fault tolerance.
• Reduced client–master interaction because of the large chunk size.
• High availability.
• Replication of critical data.
• Automatic and efficient data recovery.
• High aggregate throughput.
Advantages of GFS
1. High accessibility: thanks to replication, data remains available even if
a few nodes fail. As the GFS designers put it, component failures are the
norm rather than the exception.
2. High throughput: many nodes operate concurrently.
3. Reliable storage: corrupted data can be detected and re-replicated.
Disadvantages of GFS
1. Not a good fit for small files.
2. The single master can become a bottleneck.
3. Suited only to data that is written once and thereafter only read (or
appended).


Comparing Hadoop and Spark

Spark is an enhancement of Hadoop's MapReduce. The primary difference between Spark and
MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas
MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing
speeds can be up to 100x faster than MapReduce.
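In PySpark, this in-memory retention is visible in the cache() call: once a
dataset is cached, later actions reuse it from RAM instead of re-reading it
from disk. A small sketch, assuming a local Spark installation and an
illustrative input file named events.txt:

```python
# Sketch of Spark's in-memory reuse. The file name and filter predicates
# are made up; only the cache()/count() pattern is the point.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
lines = spark.sparkContext.textFile("events.txt")

errors = lines.filter(lambda line: "ERROR" in line).cache()  # pin in memory

print(errors.count())                              # first action: reads disk, fills cache
print(errors.filter(lambda l: "db" in l).count())  # served from the in-memory cache

spark.stop()
```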

Furthermore, as opposed to the two-stage execution process in MapReduce, Spark creates a
Directed Acyclic Graph (DAG) to schedule tasks and orchestrate nodes across the
Hadoop cluster. This task-tracking process enables fault tolerance: recorded
operations can be reapplied to data from a previous state.
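The DAG is built lazily: transformations only record lineage, and nothing
executes until an action is called. If a partition is lost, Spark replays the
recorded lineage to rebuild it. A sketch with made-up data:

```python
# Sketch of Spark's lazy DAG construction. The numbers are arbitrary;
# the point is that map/filter only record lineage until count() runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1_000))          # a node in the DAG; nothing computed yet
squares = nums.map(lambda x: x * x)          # another node; still nothing runs
evens = squares.filter(lambda x: x % 2 == 0) # lineage recorded, not executed

print(evens.count())  # the action: Spark now schedules and runs the whole DAG

spark.stop()
```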

Let’s take a closer look at the key differences between Hadoop and Spark in six critical contexts:

1. Performance: Spark is faster because it uses random access memory (RAM) instead of
reading and writing intermediate data to disks. Hadoop stores data on multiple sources
and processes it in batches via MapReduce.
2. Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data
processing. Spark runs at a higher cost because its in-memory computations for
real-time data processing require large quantities of RAM to spin up
nodes.
3. Processing: Though both platforms process data in a distributed environment, Hadoop is
ideal for batch processing and linear data processing. Spark is ideal for real-time
processing and processing live unstructured data streams.
4. Scalability: When data volume rapidly grows, Hadoop quickly scales to accommodate
the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the
fault-tolerant HDFS for large volumes of data.
5. Security: Spark enhances security with authentication via shared secret or event logging,
whereas Hadoop uses multiple authentication and access control methods. Although
Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher
security level.
6. Machine learning (ML): Spark is the superior platform in this category because it
includes MLlib, which performs iterative in-memory ML computations. It also includes
tools for regression, classification, persistence, pipeline construction,
evaluation, and more; see the MLlib sketch after this list.
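As an illustration of MLlib's classification tools mentioned in item 6, here
is a tiny logistic-regression example; the four training rows are made-up toy
data, not from any real dataset:

```python
# Minimal MLlib sketch: fit a logistic-regression classifier on toy data.
# The feature vectors and labels are invented for the example.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"],
)

model = LogisticRegression(maxIter=10).fit(train)  # iterative, in-memory fit
model.transform(train).select("label", "prediction").show()

spark.stop()
```

Because the fit is iterative, keeping the training data in memory across
iterations is exactly where Spark's in-memory design pays off.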
