Evaluating Fault Tolerance and Scalability in Distributed File Systems: A Case Study of GFS, HDFS, and Minio
Abstract—Distributed File Systems (DFS) are essential for managing vast datasets across multiple servers, offering benefits in scalability, fault tolerance, and data accessibility. This paper presents a comprehensive evaluation of three prominent DFSs—Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO—focusing on their fault tolerance mechanisms and scalability under varying data loads and client demands. Through detailed analysis, the paper assesses how these systems handle data redundancy, server failures, and client access protocols to ensure reliability in dynamic, large-scale environments. In addition, the impact of system design on performance, particularly in distributed cloud and computing architectures, is assessed. By comparing the strengths and limitations of each DFS, the paper provides practical insights for selecting the most appropriate system for different enterprise needs, from high-availability storage to big data analytics.

I. INTRODUCTION

A distributed file system (DFS) is a file system implemented on a client/server architecture, where one or more central servers store files that can be accessed by multiple remote clients, based on the access protocol defined by the server and the level of access granted to each client. The fundamental feature of a distributed file system is that a remote client accesses files through the same interface as if they were stored on its local machine. Files can therefore be cached, accessed, and managed on the local client machine, while the process is coordinated by a single centralized server or a network of servers.

DFSs constitute the primary support for data management. They provide an interface through which information is stored in the form of files and later accessed for read and write operations. Among the many implementations of file systems, only a few specifically address the management of huge quantities of data on a large number of nodes. Mostly, these file systems constitute the data storage support for large computing clusters, supercomputers, massively parallel architectures, and, lately, storage/computing clouds [1].

Distributed file systems allow for efficient, easily manageable, and extensible data storage and sharing, given a network and the relevant network protocols. As mentioned above, one of the key features of a DFS is the ability of the server to set up a protocol that restricts client access based on access level. This allows multiple clients to access multiple files hosted on a server with varying degrees of access. DFSs depend on three main notions: transparency, fault tolerance, and scalability.

A. Transparency

Transparency [2] implies that the user should be able to access the distributed file system regardless of their login node or client machine and, based on their access level, should be able to perform the same operations without caring about faults in the system, because the fault-tolerant mechanisms of the distributed system handle them. The client should therefore be able to access the requisite files without worrying about consistency, faults, or the complexity of the underlying file system.

B. Fault Tolerance

A fault-tolerant system [3] should not stop in the case of transient or partial failures. Faults may take the form of network or server failures, which result in data and service unavailability and compromise data integrity. Fault tolerance plays an integral role in consistency when several users access the data concurrently, typically when the data is stored in multiple server locations. The cost and complexity of designing such systems increase significantly with the severity of the failures and the relative importance of the data to the server host(s).

C. Scalability

Scalability [4] is the ability to efficiently leverage large numbers of servers, which can be added to the system either dynamically or continuously. A distributed file system should be scalable so that replicas can be maintained and fault tolerance increased as the number of files, the size of files, or the number of clients grows. Scalability implies both storage space and distributed compute. Some systems that adopt a centralized architecture provide tools such as multi-threading for scaling client access to files in DFSs.
D. Terms and Jargon used in Distributed File System Literature

The literature on DFSs builds on a combination of literature on distributed systems as well as on local file systems and computer system architecture. Therefore, a high density of jargon has evolved that is specific to this field. A few of these terms are detailed here.

1) High Availability: High availability is the characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. This implies that a highly available system will:
• Add redundancy to remove single points of failure. One component failing should not bring the system down.
• Reliably cross over. Given that the system is redundant, the crossover point is prevented from becoming a single point of failure.
• Limit the chances of failure. Given the above two principles, failures should not occur; if and when they do, they should be detected as soon as possible, and regular maintenance must be done.

2) Shards: A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance to spread load.
The advantages of having shards include:
• Total rows per table are reduced
• Index size is reduced, improving search performance
• The database is distributed over more hardware
• Some data is naturally distributed like this (e.g., country-specific data stored in a shard of the database that is physically present in that country for lower network latency)
However, there are some disadvantages as well:
• Increased complexity of SQL
• Sharding introduces complexity
• Single point of failure
• Failover servers are more complex
• Backups are more complex
• Added operational complexity
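To make the sharding idea concrete, the following minimal Java sketch routes a record key to one of a fixed number of shards using a stable hash. It is purely illustrative and not the scheme used by any of the systems surveyed here; the class name and shard count are hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch of hash-based shard routing (illustrative only).
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Map a record key (e.g., a user ID or file path) to a shard index.
    // A stable hash keeps the same key on the same shard across lookups.
    public int shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8);
        System.out.println("users/alice -> shard " + router.shardFor("users/alice"));
        System.out.println("users/bob   -> shard " + router.shardFor("users/bob"));
    }
}

Note that with this simple modulo scheme, changing the shard count forces most keys onto different shards, which is one concrete source of the operational complexity listed above.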
3) Volumes: Volumes serve as a management construct that logically organizes a cluster's data. Since a container is always associated with one volume, all replicas of that container are also tied to the same volume. Volumes do not have a predefined size and consume disk space only when the MapR file system stores data within a container assigned to the volume. A large volume may consist of anywhere from 50 to 100 million containers. Common use cases include creating volumes for specific users, projects, or different stages, such as development and production environments. For instance, when an administrator needs to arrange data for a particular initiative, a unique volume can be created for that purpose. The MapR file system will then group all containers that hold the data related to the initiative within the designated volume. A cluster can host multiple volumes.

Mirror volumes are non-writable replicas of a primary volume. These mirror volumes can be promoted to writable volumes. This functionality is particularly useful in disaster-recovery situations, where a read-only mirror can be promoted to a writable volume to serve as the primary data storage. Furthermore, writable volumes that have been mirrored can also be converted into mirrors (to establish a reverse mirroring relationship). Writable volumes can additionally be reverted to read-only mirrors.

4) Snapshots: Snapshots allow you to revert to a previously saved and stable data set. A snapshot is a non-writable replica of a volume at a specific point in time, offering recovery to that exact moment. Snapshots only track the changes made to the volume's data, which results in an efficient use of disk resources within the cluster. They help preserve access to past data and safeguard the cluster against user or application mistakes. Snapshots can be created manually, or the process can be automated with scheduled tasks.

5) MapReduce: MapReduce is a computational framework that enables applications to process extensive amounts of data. It operates by executing applications concurrently across a group of inexpensive machines in a dependable and fault-tolerant manner. The framework consists of several map and reduce tasks, with each task processing a portion of the data. This enables the workload to be distributed throughout the cluster. The map tasks are responsible for loading, parsing, transforming, and filtering the data, while the reduce tasks aggregate and group the intermediate results produced by the map tasks.

The input file for a MapReduce job resides on HDFS. The input format determines how the file is divided into smaller chunks, known as input splits. These input splits provide a byte-level representation of a segment of the input file. The map task then processes the input split locally on the node where the data is stored. This eliminates the need for data transfer over the network, ensuring that the data is processed locally.
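As a concrete illustration of the map and reduce roles described above, the following is a minimal word-count sketch against the standard Hadoop MapReduce API. The job wiring (driver class, input and output paths) is omitted, and the class names are illustrative rather than taken from any of the surveyed papers.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: map tasks parse and filter their input split,
// reduce tasks aggregate the intermediate counts per word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each map() call sees one line of the input split, ideally
            // processed on the node that stores the underlying block.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}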
E. Types of Distributed File Systems

By architectural design, DFSs can be classified as client-server architectures and cluster-based distributed file systems.
• A client-server architecture has several servers that manage the data store and caching functions, access management, and data transfer between different clients. The data transfer and the metadata are not decoupled, meaning that data access is based on a global namespace shared by multiple clients. In a client-server architecture, the addition of new servers increases processing capacity as well as storage.
• On the contrary, a cluster-based distributed file system decouples the metadata and the data transfer by having dedicated servers to manage storage and others to manage metadata. Increasing the number of storage servers increases the overall capacity of the distributed system, but might negatively affect query time if the metadata server(s) are not increased in capacity. Systems having a single metadata server are referred to as centralized cluster-based DFSs and often have a single point of failure. Distributed metadata servers occur in totally distributed cluster-based DFSs.

In terms of cache consistency, distributed systems can be analyzed as systems that implement Write Once Read Many (WORM), Transactional Locking, and Leasing [5].
• Write Once Read Many (WORM) is an approach that ensures mutual consistency across the servers and clients based on a single write to a file segment, which can then be accessed multiple times for reading. Once a file is created, it cannot be modified, rewritten, or appended. Cached files are in read-only mode, therefore each read reflects the latest version of the data. Consistency here mirrors guaranteed eventual consistency in distributed systems, as the final version of a file is the one eventually seen by all readers.

II. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

The Hadoop Distributed File System (HDFS) [6] is designed to run on commodity hardware, making it extensible and easily accessible for consumer-grade networks and low-performance servers. While similar to other distributed file systems in its WORM model and persistent failure handling, HDFS stands out for its high fault tolerance and cost-effectiveness, as it is optimized for deployment on inexpensive hardware.

HDFS follows a centralized architecture where metadata is managed by a single server. This server maintains a persistent copy of the metadata, enabling quick restart and migration of metadata from the secondary to the primary node when necessary.

HDFS provides high throughput for large data sets and is ideal for applications requiring streaming access. It was originally developed for the Apache Nutch web search engine project and is part of the Apache Hadoop Core initiative. HDFS divides files into large blocks, distributes them across cluster nodes, and runs packaged code on these nodes to process data in parallel. This method benefits from data locality, improving processing speed and efficiency compared to traditional supercomputing models that rely on high-speed networking to distribute computation and data.

The Hadoop architecture is open-source and available at http://hadoop.apache.org/.

A. Goals of HDFS

HDFS is designed for deployment across hundreds of server nodes with attached data stores. Its core goals include:
• Efficient Fault Detection and Recovery: Given that hardware failure is common in large systems, HDFS is built to detect faults quickly and recover automatically. With potentially thousands of nodes, fault tolerance is a primary design consideration.
• Optimized Data Access: HDFS is tailored for streaming data access, supporting batch processing rather than interactive use. High throughput is prioritized over low latency, with certain POSIX requirements relaxed to boost performance.
• Handling Large Datasets: HDFS is optimized for large-scale applications, capable of managing petabytes of data across hundreds of nodes. It supports millions of files within a single instance while maintaining high aggregate bandwidth.
• Simplified Coherency Model: HDFS adopts a write-once-read-many (WORM) model, where files are not modified once closed. This simplifies data consistency issues, enabling efficient data access and making HDFS ideal for batch-processing applications like MapReduce.
• Cross-Platform Portability: HDFS is designed to be easily portable across different platforms, ensuring broad compatibility and adoption for various applications.

Fig. 1. The Hadoop Distributed File System Architecture

B. Architecture Description

As seen in Figure 1, an HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode is therefore the master metadata and processing node.

C. HDFS Components and Architecture

HDFS consists of two primary components: the NameNode and the DataNodes. The NameNode manages the file system namespace, handling operations like file creation, deletion, and renaming, while also mapping file blocks to DataNodes. The DataNodes manage the storage of these blocks and execute tasks like block creation, deletion, and replication as directed by the NameNode. These components run on commodity machines, usually using a GNU/Linux OS. The system is implemented in Java, ensuring cross-platform compatibility.
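To illustrate the NameNode/DataNode split described above, the short sketch below asks the NameNode, through the standard Hadoop FileSystem API, which DataNodes hold each block of a file. The file path is a hypothetical example, and the cluster configuration is assumed to be picked up from the usual core-site.xml/hdfs-site.xml files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the NameNode which DataNodes hold each block of a file.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/logs/2024-01-01.log"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d replicas=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }
}

The block-to-host mapping printed here is exactly the metadata the NameNode maintains; the block contents themselves are only ever served by the DataNodes.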
1) Data Replication and Reliability: HDFS ensures high fault tolerance by replicating file blocks across multiple data nodes. The replication factor is configurable, allowing fine-tuned fault tolerance depending on the file's importance. For efficient data replication, the NameNode tracks the state of the DataNodes and manages block replication decisions. HDFS supports a WORM model where files, once written, cannot be modified except for appending or truncating.

a) Replica Placement: HDFS optimizes replica placement to balance reliability, availability, and network bandwidth. Data replicas are placed with rack-awareness, ensuring that replicas are distributed across different racks to minimize data loss in case of a rack failure. The default policy places replicas in a way that minimizes inter-rack traffic, improving write performance without compromising data reliability.

b) Replica Selection: To optimize read performance, HDFS prioritizes retrieving data from the closest replica. When the cluster spans multiple data centers, local replicas are preferred to minimize latency.

2) Metadata Persistence and Communication Protocols: The NameNode stores metadata using a transaction log, called the EditLog, and an image of the file system namespace, the FsImage. The EditLog records changes to the file system, while the FsImage represents a snapshot of the file system's state. Periodic checkpointing ensures the file system's consistency by merging the EditLog with the FsImage. The DataNode stores blocks of data on its local file system and sends periodic reports, known as Block reports, to the NameNode, listing the blocks it manages.

HDFS communication uses TCP/IP protocols. Clients interact with the NameNode via the ClientProtocol, and DataNodes communicate with the NameNode using the DataNode Protocol. Both protocols are built upon Remote Procedure Calls (RPCs), where the NameNode only responds to incoming RPC requests from clients or DataNodes, never initiating them.

D. Client Operations

Data read and write requests are processed by HDFS through interactions between the NameNode and DataNodes. The requesting application, referred to as a client, follows these steps:

1) Read Operations in HDFS:
1) The client initiates a read request by calling open() on the FileSystem object, which is of type DistributedFileSystem.
2) The DistributedFileSystem connects to the NameNode using RPC and retrieves metadata such as block locations.
3) The NameNode returns addresses of DataNodes storing the first few blocks.
4) The client receives an object of type FSDataInputStream, containing a DFSInputStream that facilitates interactions with DataNodes and the NameNode.
5) The client repeatedly calls read(), which fetches data in streams until the end of a block.
6) Upon reaching the block's end, DFSInputStream closes the connection and locates the next DataNode for the subsequent block.
7) After reading, the close() method is invoked to finish the operation.

2) Write Operations in HDFS:
1) The client initiates the write operation by calling create() on the DistributedFileSystem to create a new file.
2) The NameNode verifies file existence and client permissions. If valid, a new file record is created; otherwise, an IOException is thrown.
3) Upon success, the client receives an FSDataOutputStream object for writing data.
4) FSDataOutputStream contains DFSOutputStream, which manages communication with DataNodes and the NameNode. Data is written in packets that are queued in the DataQueue.
5) The DataStreamer consumes the packets from the DataQueue and requests block allocation from the NameNode, choosing DataNodes for replication.
6) A pipeline of DataNodes is established for replication. With a replication factor of 3, data is streamed from the first DataNode to the second and then to the third.
7) DFSOutputStream maintains an Ack Queue to store packets awaiting acknowledgment from DataNodes.
8) Once acknowledgments are received, the packet is removed from the Ack Queue. If any DataNode fails, the operation is retried using the remaining packets.
9) Upon completion, the client calls close(), flushing remaining data and waiting for final acknowledgment.
10) The NameNode is notified once the write operation is finished.
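The read and write sequences above are driven through this same FileSystem API; the hedged sketch below exercises both paths end to end. The file path and contents are placeholders, and error handling is trimmed for brevity.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Exercise the write path (create() streams packets through the DataNode
// pipeline) and the read path (open()/read() pulls blocks from replicas).
public class ReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt"); // hypothetical path

            // Write: the NameNode allocates blocks, data flows through the pipeline.
            try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
                out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: block locations come from the NameNode, bytes from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, /* close streams */ false);
            }
        }
    }
}

Behind fs.create(), the DataStreamer and acknowledgment queue described in the steps above run transparently; client code only ever sees the output stream.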
E. Properties of HDFS

1) Robustness: HDFS ensures data reliability despite failures, including NameNode, DataNode, and network partition failures.

a) Data Disk Failure: Periodically, each DataNode sends a heartbeat to the NameNode. In the case of a network partition, a lack of heartbeats indicates failure, causing the NameNode to mark the DataNode as dead. Blocks registered to the dead node become inaccessible, and replication is triggered to restore the desired replication factor.

b) Cluster Rebalancing: HDFS supports dynamic data movement to balance storage. Although automatic rebalancing schemes are not yet implemented, data can be reallocated to maintain optimal space usage.

c) Metadata Disk Failure: The FsImage and EditLog are critical for HDFS operation. The NameNode can maintain multiple copies of these files to prevent data loss. Synchronous updates of these copies ensure consistency but may reduce metadata transaction throughput.

d) Snapshots: Snapshots capture a point-in-time copy of data, enabling rollback to a prior state in case of corruption.

2) Accessibility: HDFS supports diverse access methods, including the native Java FileSystem API, a C wrapper, and a REST API. A web interface and an NFS gateway allow browsing and mounting of HDFS filesystems.

3) Data Organization: HDFS is optimized for large files and is suitable for applications requiring high-throughput, read-heavy operations. It supports the write-once-read-many model, with files typically split into 128 MB blocks distributed across DataNodes.

a) Replication Pipeline: With a replication factor of 3, data is written to a pipeline of DataNodes. Each DataNode in the pipeline stores the data and forwards it to the next. This pipelined approach ensures parallel processing and data redundancy.

4) Space Reclamation: When enabled, HDFS moves deleted files to a trash directory rather than immediately removing them. Files remain in the trash for a configurable period, and old checkpoints are deleted after expiry. After the retention period, the NameNode permanently deletes the file and frees the associated blocks, though this may result in some delay in reclaiming space.

III. GOOGLE FILE SYSTEM (GFS)

The Google File System (GFS) is a scalable, distributed file system designed to handle large, data-intensive applications. It ensures fault tolerance while operating on low-cost commodity hardware and delivers high aggregate performance to numerous clients. Large-scale GFS deployments can offer hundreds of terabytes of storage spread across thousands of disks on over a thousand machines, with concurrent access from hundreds of clients.

While GFS provides a familiar file system interface, it does not strictly adhere to POSIX semantics. Files are organized hierarchically in directories and identified by pathnames. Key operations such as create, delete, open, close, read, write, and especially append to files are supported.

A. Objectives of GFS

The primary goal of GFS in large-scale environments is to offer a highly transparent, consistent, and fault-resistant framework for distributed file reading, writing, accessing, and appending. This objective is achieved through a combination of principles, as detailed below.

Firstly, component failures are considered typical rather than exceptional. The system comprises hundreds or even thousands of storage machines made from affordable commodity hardware, with a similar number of client machines accessing them. Given the sheer volume and variability of these components, it is practically inevitable that some will be non-functional at any given time, and certain failures will be irreparable. The system has encountered issues stemming from application bugs, operating system failures, human mistakes, as well as failures in disks, memory, connectors, networking, and power supplies. As a result, the system must integrate continuous monitoring, error detection, fault tolerance, and automatic recovery mechanisms.

Fig. 2. GFS architecture

Secondly, the size of files far exceeds traditional standards. Multi-gigabyte files are commonplace. Each file often consists of numerous application objects, such as web documents. Managing billions of roughly kilobyte-sized files within fast-growing datasets of several terabytes containing billions of objects becomes unmanageable. Even if the file system could theoretically support it, this scale of data requires rethinking certain design assumptions and parameters, such as I/O operations and block sizes.

Thirdly, the majority of files are modified by appending data rather than overwriting existing content. Random writes within a file are virtually nonexistent. After a file is written, it is typically only read, and often sequentially. This access pattern is common for a variety of data types. For example, large repositories of data are often scanned by analysis programs, while other types may represent continuously generated data streams or archival information. Additionally, intermediate results may be produced on one machine and later processed on another. Due to these usage patterns with large files, the focus is placed on optimizing the performance of appending operations and ensuring atomicity, while caching data blocks on the client side becomes less relevant.

Fourthly, the co-design of applications and the file system API enhances overall system flexibility. For example, the consistency model of GFS has been relaxed significantly, simplifying the file system while reducing the burden on applications. Furthermore, an atomic append operation has been introduced, allowing multiple clients to append data to a file concurrently without the need for additional synchronization.

B. Architecture Description

Even in the case of severe failures or network partitions, Minio ensures that the data remains safe and accessible by employing a highly resilient architecture with redundancy across multiple locations. By using distributed erasure coding and continuous replication, Minio provides excellent data durability and availability across different failure scenarios.
Fig. 3. MinIO Architecture Diagram

• Erasure Coding: A method for providing redundancy that minimizes the overhead compared to traditional replication. This allows Minio to store data more efficiently and with fewer resources while maintaining durability.
• Bitrot Protection: Minio includes mechanisms to detect and repair bitrot, ensuring that data corruption due to disk errors or other hardware failures does not go unnoticed.
• Encryption and WORM (Write Once Read Many): Sensitive data can be encrypted at rest and during transit to ensure security, and the WORM feature can enforce data immutability, preventing unauthorized changes.
• Identity Management: Minio integrates with identity providers for secure access control and authentication, providing robust mechanisms for managing who can access the data and how.
• Global Federation: Multiple Minio instances can be federated across different regions or data centers, creating a unified global object storage system for applications.
• Multi-Cloud Support: With Minio's gateway mode, users can seamlessly integrate Minio with existing public cloud providers, offering hybrid cloud storage solutions.

Minio's scalability makes it a competitive alternative for large-scale storage solutions, and it integrates easily with applications, including those built using cloud-native architectures. Its simplicity in deployment and operation makes it appealing for both small and large organizations seeking to manage their storage infrastructure independently.

In summary, Minio provides a robust, scalable, and cost-effective solution for distributed object storage, with the flexibility to handle demanding workloads and large data sets while supporting advanced features like encryption, federation, and erasure coding for data protection and availability.

C. Architecture Description

By design, MinIO is cloud native and can be run using lightweight containers managed by external orchestration services (e.g., Kubernetes). A competitive advantage of Minio is that it fits the entire server into a single 40 MB static binary. Though distributed as a prebuilt binary rather than compiled from source, it is highly efficient in its use of CPU and memory resources and thus allows the co-hosting of a large number of tenants on shared hardware.

MinIO functions well on commodity servers (though it is often benchmarked with state-of-the-art hardware) with locally attached drives (JBOD/JBOF). Keeping to the norms of efficient scalability, all the servers in a cluster are equal in capability (a fully symmetrical architecture); in fact, there are no name nodes or metadata servers.

An important implementation detail within MinIO is that it writes data and metadata together as objects, thus eliminating the requirement for a separate metadata database. MinIO's resiliency can also be attributed to the fact that all the functions it uses are inline and strictly consistent.

While a MinIO cluster is a collection of distributed MinIO servers with one process per node, each server runs in user space as a single process and uses lightweight co-routines for high concurrency. A deterministic hashing algorithm is utilized to place objects within erasure sets, which have 16 drives per set by default.

Architecturally, MinIO is designed to operate at scale across multi-datacenter cloud services. Each tenant runs their own fully isolated MinIO cluster, protecting them from disruptions such as upgrades, updates, and security breaches. Additionally, each tenant can scale independently by federating clusters across geographical locations.

Minio's object storage architecture is also notable. Most existing storage solutions follow a multi-layer storage architecture comprising a durable block layer at the bottom, a filesystem as "middleware", and APIs on top implementing protocols for various operations on files, blocks, and objects. While public cloud architectures provide separate object, file, and block storage, Minio follows a fundamentally different architecture: a single layer that achieves everything. As a result, the Minio object server is high-performance and lightweight.

1) MinIO Design Decisions:

a) Lambda* Function Support: Enterprise-standard messaging platforms can be used to deliver notifications of events via Amazon-compatible lambda event notifications. This allows notifications for object-level events/actions like access, creation, and deletion to be delivered to the application layer.

b) Linear Scaling: Taking inspiration from hyperscalers, Minio clusters can be deployed at a wide range of sizes, from 4 to 32 nodes. Through its federation feature, multiple clusters can easily be joined together under the same "global" namespace, which is essentially a single entity to the outside world. As a result of federation:
• All nodes are considered equal
• Any node can serve requests concurrently
• A DLM (Distributed Locking Manager) helps a cluster manage updates and deletions
• There is no performance degradation of an individual cluster due to the addition of more clusters under the global namespace
• Cluster-level failure domain
c) Erasure Code: The integrity of object data is maintained by erasure coding and bitrot-protection checksums. As background, an erasure code is a mathematical algorithm that can be used to reconstruct missing or corrupted data. To achieve this, the Reed-Solomon code is used to shard objects into data and parity blocks, and hashing algorithms are used to help protect individual shards. As a result, Minio is resilient to silent data corruption (which happens quite often at a petabyte scale of data) and other hardware failures. RAID configurations and data replicas suffer from high storage overheads. Erasure code solves this problem while allowing for the loss of up to 50 percent of drives and 50 percent of servers. RAID-6, as a point of comparison, is only able to withstand two drive failures. By applying erasure code to individual objects, Minio allows the healing of one object at a time. RAID-protected storage solutions, on the other hand, perform healing at the RAID-volume level, which impacts performance for the files within that volume for the duration of the healing.
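To make the overhead argument concrete, the sketch below compares three-way replication with an N-data plus M-parity stripe. It assumes a 16-drive erasure set split evenly into 8 data and 8 parity shards purely for illustration; the split an actual MinIO deployment chooses can differ.

// Back-of-the-envelope comparison of 3x replication against an
// erasure-coded stripe (assumed split: 8 data + 8 parity shards).
public class ErasureOverhead {
    public static void main(String[] args) {
        long objectBytes = 1L << 30; // a 1 GiB object

        // Replication factor 3: three full copies, tolerates 2 lost copies.
        long replicated = 3 * objectBytes;

        // Reed-Solomon style N data + M parity shards:
        // raw usage = object * (N + M) / N, and up to M shards may be lost.
        int dataShards = 8, parityShards = 8;
        long erasureCoded = objectBytes * (dataShards + parityShards) / dataShards;

        System.out.printf("replication x3 : %d GiB stored%n", replicated >> 30);
        System.out.printf("erasure %d+%d   : %d GiB stored, tolerates %d lost shards%n",
            dataShards, parityShards, erasureCoded >> 30, parityShards);
    }
}

With this assumed split, the stored footprint is 2x the object size instead of 3x, while up to half of the shards in the set can be lost without losing the object, which is the trade-off the paragraph above describes.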
D. Properties of MinIO Distributed File System

• Performance: One platform supports a variety of use cases. Industrial use cases have shown that a user can run multiple Spark*, Presto*, and Hive* queries, or even deploy large AI workloads and algorithms, without encountering any issues related to storage (read or write). Minio object storage allows for high throughput and low latency for cloud-native applications. Coupled with the latest hardware and network infrastructure, Minio outperforms all traditional object storage.
• Scalability: Single clusters can be federated with other clusters to create namespaces that can span multiple data centers. Incremental expansion of physical servers is what leads to the gradual expansion of the global namespace. Though a simple design, it leverages the cutting-edge knowledge of hyperscalers.
• Ease of Use: Start-up configuration takes only a few minutes, given the existence of a single 40 MB binary for the server. Default configurations work very well, thus allowing use in smaller projects as well, without full knowledge of system administration.
• Encryption and WORM: Unique keys are used to encrypt each object (per-object keys). By supporting integration with external key management solutions, state-of-the-art cryptography solutions can be used to secure and manage encryption keys without any coupling with Minio. WORM (write once, read many) mode prevents data tampering.
• Identity and Access Management: Support for OpenID*-compatible identity management servers allows for secure and well-tested access control. Temporary rotating credentials within Minio prevent the need to embed long-term credentials within an application.
• High Availability: Even if a Minio cluster loses up to 50% of its drives and servers, Minio will continue to serve objects. Additionally, total rack failures are also mitigated (if the cluster is deployed across racks). These features are achieved by using Minio's distributed erasure code, which uses multiple redundant parity blocks to protect data. In case of data-center-level outages, there is also support for continuous mirroring to remote sites (disaster recovery).
• Metadata Architecture: As a result of not having a separate metadata store, all failures are contained within an object and do not spill over to the rest of the system. Additionally, all operations within Minio are performed atomically at an object level of granularity. Other data integrity requirements are satisfied by the per-object erasure code and bitrot hash.
• Strict Consistency: Strictly consistent operations allow the system to survive crashes even in the middle of other workloads without any loss of pre-existing data. This is highly useful in use cases related to machine learning and other big data workloads.
• Geographic Namespace: Using Minio's federation feature, users can opt to scale in an incremental fashion rather than having to deploy across data centers via hyperscalers from the start. Minio can be deployed in units that have a failure domain restricted to the size at which it is currently scaled.
• Cloud-Native Design: Kubernetes and other orchestration platforms can easily be used along with Minio's multi-instance and multi-tenant design. Containerized deployment of Minio leads to the use of these orchestration services for reliable scaling. Each instance of Minio can be provisioned on demand through self-service registration. While traditional monolithic storage systems (which have their own disadvantages) do compete with Kubernetes resource management, they lack its easily scalable nature and the ability to pack many tenants simultaneously on the same shared infrastructure.
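Since MinIO is accessed through the AWS S3 API (see Table I below), applications typically talk to it through an S3-compatible client. The following is a minimal put/get sketch using the MinIO Java SDK; the endpoint, credentials, bucket, and object names are placeholders, and the builder-style calls follow the 8.x SDK, so other versions may differ.

import io.minio.BucketExistsArgs;
import io.minio.GetObjectArgs;
import io.minio.MakeBucketArgs;
import io.minio.MinioClient;
import io.minio.PutObjectArgs;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Minimal object put/get against a MinIO endpoint via its S3-compatible API.
public class MinioPutGet {
    public static void main(String[] args) throws Exception {
        MinioClient client = MinioClient.builder()
            .endpoint("http://localhost:9000")          // placeholder endpoint
            .credentials("ACCESS_KEY", "SECRET_KEY")    // placeholder credentials
            .build();

        String bucket = "reports";
        if (!client.bucketExists(BucketExistsArgs.builder().bucket(bucket).build())) {
            client.makeBucket(MakeBucketArgs.builder().bucket(bucket).build());
        }

        byte[] payload = "quarterly totals".getBytes(StandardCharsets.UTF_8);
        client.putObject(PutObjectArgs.builder()
            .bucket(bucket).object("q1.txt")
            .stream(new ByteArrayInputStream(payload), payload.length, -1)
            .build());

        try (var in = client.getObject(
                GetObjectArgs.builder().bucket(bucket).object("q1.txt").build())) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}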
IV. COMPARING THE DFSs

TABLE I
COMPARISON OF DFSs (PART 1)

DFS   | Written In | License     | Access API
Minio | Go         | Apache v2   | AWS S3 API
HDFS  | Java       | Apache v2   | Java and C client, HTTP
GFS   | Unknown    | Proprietary | Native file system API

TABLE II
COMPARISON OF DFSs (PART 2)

DFS   | High Availability           | Shards           | Release
Minio | Yes                         | Yes              | 2014
HDFS  | Transparent master failover | No               | 2005
GFS   | Yes                         | Yes (by Spanner) | 2003

A. HDFS vs MinIO

While HDFS has been a long-standing player in the distributed filesystem market, Minio has been shown to outperform it in many of its seminal tasks. Their core difference lies in the philosophy of storage. HDFS achieves its high throughput values by colocating compute and data on the same nodes. As a result, it gets to exploit fewer network calls and overcome the limitations of slow network access. However, as storage requirements tend to grow much faster than compute requirements, node scaling within HDFS tends to lead to wastage of compute resources.

An example of an issue caused by this: if HDFS were to store 10 petabytes of data, then due to its replication factor of 3 it would need to store 30 petabytes in total. At a maximum storage of 100 TB per node, this would need 300 nodes, which would clearly overprovision compute facilities, thus causing other overheads.

Minio overcomes these issues with the natural solution of separating storage and compute resources. Along with its cloud-native infrastructure, it is able to use orchestration frameworks such as Kubernetes in its software stack.

1) Performance: To compare the performance of the two filesystems, an experiment is set up. First, the infrastructure is benchmarked to establish the baseline limitations of the setup.
• Hard Drive Performance
– Write: 137 MB/s
– Read: 205 MB/s
– Multi-drive performance with 32 threads and 32 KB blocks:
∗ Write: 655 MB/s
∗ Read: 1.26 GB/s
• Network Performance
– The Ethernet links support 3.125 GB/s, but with multiple connections the sustained throughput was around 1.26 GB/s.

After the benchmarking of the basic infrastructure, some degree of tuning is done for each of the filesystems. This, on the whole, is to ensure that each of the filesystems uses all the resources allocated to it to the maximum extent possible, as would be done in a normal stress test.
• Minio
– 1.2 TB aggregate memory across 12 nodes. Tuning was done such that MapReduce jobs could use the whole allocated CPU and memory provided by the compute nodes.
– The entire 144 GB of RAM of each node was used.
– The S3A connector (API) was tuned.
• HDFS
– Tuned to 2.4 TB aggregate memory across 12 nodes.
– Tuned until the entirety of the 256 GB of RAM on all compute nodes was being used (higher RAM due to shared compute and storage nodes).
– Ensured that computations do not go into swap space (as that causes lower performance).
– Configured to replicate data with a factor of 3.

The following tasks, which are considered to be Hadoop's most proven benchmarks, are evaluated:
• Terasort
• Sort
• Wordcount

B. HDFS vs GFS

Sources [7] and data published in the papers on HDFS and GFS are used to draw comparisons.

GFS was built for the unique needs of Google as a company: the use of off-the-shelf hardware to run production servers, their scale of batch data processing, and the generally append-only nature of their high-throughput, latency-insensitive workload. On the other hand, HDFS was implemented for the purpose of running Hadoop's MapReduce applications. It was created as an open-source framework for the usage of different clients with different needs. This makes Hadoop far less opinionated; it is therefore also far less optimized for the workloads that GFS is tuned for.

In terms of data storage, recall that GFS chunks data into 64 MB chunks that are uniquely identified. These chunks are divided into 64 KB blocks, each of which is checksummed. This permits fast chunked reads while allowing for error detection using the per-block checksum. On the other hand, HDFS divides data into 128 MB blocks. An HDFS DataNode holds each block replica as two files: one with the data, the other with the checksum and generation stamp.

In terms of architecture, GFS is more involved due to its focus on concurrent, atomic appends and snapshotting support. In particular, GFS requires leases: the client is told where to write by the master. In HDFS, the client decides where to write.

Data about reads and writes are collated from the data published by Google on GFS performance as it runs on real-world clusters (Cluster B in [8]).

TABLE III
PERFORMANCE COMPARISON: HDFS VS GFS ON PRODUCTION WORKLOADS. HDFS DATA GATHERED USING 3000 NODES ON THE DFSIO TEST SUITE. GFS NUMBERS FROM GOOGLE'S REPORTED PERFORMANCE IN PRODUCTION.

Measurement  | GFS      | HDFS
Read (busy)  | 380 MB/s | 1.02 MB/s per node
Write (busy) | 117 MB/s | 1.09 MB/s per node

V. CONCLUSION

In this paper, a detailed evaluation of three prominent distributed file systems (DFSs)—Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO—is conducted, focusing on their scalability, fault tolerance, and overall performance in large-scale, dynamic environments. Each of these systems brings unique advantages and challenges to managing vast datasets across distributed networks, with varying approaches to data redundancy, server failures, and client access protocols.

The analysis has demonstrated that fault tolerance is a critical factor for ensuring consistent data availability and integrity in the face of network or server failures. While GFS and HDFS employ replication and redundancy techniques to maintain data availability, MinIO offers a more lightweight
and flexible approach, catering to cloud-native environments. Scalability, as explored, is another defining feature of these systems, with all three DFSs leveraging mechanisms such as sharding and multi-node architectures to scale efficiently in response to increasing data loads and client demands.

Additionally, the impact of system design on performance was assessed, revealing that the choice of DFS is highly dependent on the specific enterprise requirements—whether prioritizing high availability, fault tolerance, or ease of integration with cloud computing infrastructures. For organizations involved in big data analytics, both HDFS and GFS offer robust, well-established solutions, while MinIO provides an appealing alternative for cloud-centric, object-based storage.

In conclusion, selecting the most appropriate DFS depends on the scale of operations, fault tolerance requirements, and integration needs with other cloud-based or distributed computing services. This paper serves as a comprehensive guide for understanding the strengths, limitations, and suitability of these DFSs, providing valuable insights for enterprises seeking to optimize their data management strategies in large-scale distributed environments.
REFERENCES
[1] R. Buyya, C. Vecchiola, and S. T. Selvi, Mastering Cloud Computing:
Foundations and Applications Programming. Cambridge, MA, USA:
Morgan Kaufmann, 2013.
[2] R. W. Floyd, “Transparency in distributed systems,” Communications of
the ACM, vol. 32, no. 6, pp. 678–685, 1989.
[3] H. Becker, “Application of fault tolerance in distributed systems,” Journal
of Distributed Computing, vol. 7, no. 3, pp. 201–215, 1994.
[4] R. G. McCreadie, N. Zeldovich, and M. F. Kaashoek, “Mapreduce and
the evolution of large-scale data processing,” IEEE Internet Computing,
vol. 16, no. 3, pp. 15–23, 2012.
[5] D. P. Reed, “An analysis of cache consistency mechanisms,” ACM
Computing Surveys (CSUR), vol. 28, no. 1, pp. 38–64, 1996.
[6] D. Borthakur, “Hadoop distributed file system (hdfs),” Apache Hadoop
Project, 2007, https://hadoop.apache.org/.
[7] Y. Carmel, “Hdfs vs gfs,” Advanced Topics in Storage Systems, Lecture
Slides.
[8] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,”
in Proceedings of the nineteenth ACM symposium on Operating systems
principles, 2003, pp. 29–43.
[9] R. Depardon et al., “Analysis of visual data and its implications for
multimodal perception,” Journal of Cognitive Science, vol. 15, no. 2,
pp. 45–60, 2013.