Filesystem Notes
http://www.upgrade-cepis.org
Vol. II, No. 6, December 2001
Open Source / Free Software: Towards Maturity
First of all, there is no clear winner: XFS is better in some aspects or cases, ReiserFS in others, and both are better than Ext2 in the sense that they are comparable in performance (again, sometimes faster, sometimes slightly slower) but they are journaling file systems, and you already know what their advantages are… And perhaps the most important moral is that the Linux buffer/cache is really impressive, and positively affected all the figures of my compilations, copies and random reads and writes. So, I would say: buy memory and go journaled ASAP…
Keywords: Linux, Operating Systems, Page Cache, Buffer Cache, Journal File Systems, ext3, xfs, reiserfs, jfs.

1 Introduction
The paragraph above was the first one in an article published on the Balearic Islands Linux User Group Web site [1], which described a second run of benchmarks of the Linux journaled file systems in comparison with traditional Unix file systems. Although somewhat informal, both benchmarks covered FAT32, Ext2, ReiserFS, XFS, and JFS for several cases: Hans Reiser's Mongo, file copying, kernel compilation, and a small test program that tried to simulate the access pattern of database systems.
Both articles were among the first published, and our server was overwhelmed by the "slashdot effect" derived from their publication on slashdot.org. They mainly served to de-mystify the common belief that journaling file systems are significantly slower than traditional Unix file systems (UFS) and their derivatives, namely ext2, which has been the standard file system in Linux. Since those days, more benchmarks have been published, but the truth is still the same: there is no clear winner. Some systems perform better than others in some cases; for example, ReiserFS is really good at reading small to medium size files, while XFS behaves better for large files, and JFS is said to facilitate the migration of existing systems running on OS/2 Warp and AIX.
This article presents all the journaling file systems available for Linux: Ext3, ReiserFS, XFS and JFS. We also introduce the basic concepts of file systems, buffer cache and page cache as implemented in the Linux kernel. The performance of the different file systems is strongly affected by those optimization techniques. Indeed, not only the performance is affected, but also the implementation and porting of the different file systems: SGI introduced a new module, pagebuf, that serves as the interface between its own XFS buffering techniques and the Linux page cache.

Dr. Ricardo Galli, born in 1965, is an associate professor at the University of the Balearic Islands. He holds a Computer Engineering Degree (Argentina) and a Ph.D. in Computer Science (University of the Balearic Islands). His PhD was entitled "Data Consistency Methods for Collaborative 3D Editing". He has worked in the area of Computer Graphics since 1991 and in CSCW (Computer Supported Collaborative Work) since 1995. He has published more than 40 papers in national and international conferences and magazines, including IEEE Computer Graphics and Applications and IEEE Multimedia, as well as book chapters. He has also participated in and led several European R&D projects, and he is a coordinator of the Socrates/Erasmus European Program. He was a co-founder and shareholder of Atlas-IAP S.L., a local ISP. He has directed or coordinated the development of the digital (Internet) version of six local newspapers: www.diaridebalears.com, www.ultimahora.es, www.majorcadailybulletin.es, www.mallorcamagazin.net, etc., and several e-commerce sites. He collaborates in some Open Source projects, such as Alternative PHP Cache, and is a member of the Balearic Islands Linux User Group. Currently, he is coordinating the European R&D Project "e-Content: MNM – Minority News Paper to Multimedia", and he is also the CTO of ISA3D, a Europe-wide enterprise (Portugal, The Netherlands, Spain) spun off from the M3D European Project.
<[email protected]>, <m3d.uib.es/gallir/>

1. http://bulmalug.net/body.phtml?nIdNoticia=642

2 The Linux Virtual File System
A file is a very important abstraction in the programming field. Files serve to store data permanently, and they offer a few simple but powerful primitives to programmers. Files are normally organised in a tree-like hierarchy where intermediate nodes are directories, which in turn are capable of grouping files and sub-directories. The file system is the way the operating system organises, manages and maintains the file hierarchy on mass-storage devices, normally hard disks.
Every modern operating system supports several different, disparate file systems. In order to keep the operating system modular, and to provide applications with a uniform programming interface (API), a higher layer that implements the common functionality of those underlying file systems is implemented in the kernel: the Virtual File System (VFS).
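The uniform-interface idea behind the VFS can be sketched in miniature. The class and method names below are illustrative inventions, not the kernel's actual operation tables; the point is only that applications call one API while each concrete file system supplies its own implementation behind it.

```python
from abc import ABC, abstractmethod

# Toy model of the VFS layering: one uniform interface, several
# per-file-system implementations mounted behind it.
class FileSystem(ABC):
    @abstractmethod
    def read(self, path):
        ...

class Ext2Like(FileSystem):
    def __init__(self):
        self.blocks = {"/etc/motd": b"welcome"}
    def read(self, path):
        return self.blocks[path]

class IsoLike(FileSystem):
    def __init__(self):
        self.records = {"/README;1": b"cdrom data"}
    def read(self, path):
        return self.records[path]

class VFS:
    def __init__(self):
        self.mounts = {}              # mount point -> file system
    def mount(self, point, fs):
        self.mounts[point] = fs
    def read(self, point, path):
        # The application issues the same call regardless of the
        # underlying file system type.
        return self.mounts[point].read(path)

vfs = VFS()
vfs.mount("/", Ext2Like())
vfs.mount("/cdrom", IsoLike())
assert vfs.read("/", "/etc/motd") == b"welcome"
assert vfs.read("/cdrom", "/README;1") == b"cdrom data"
```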
The file systems supported by the Linux VFS fall into three categories:
1. Disk based, covering hard disks, floppy disks and CD-ROMs, and including ext2fs, ReiserFS, XFS, ext3fs, UFS, iso9660, etc.
2. Network based, including NFS, Coda, and SMB.
3. Special file systems, including /proc, ramfs, and devfs.
The common file model can be viewed as object-oriented, with objects being software constructs (data structures and associated methods/functions) of the following types:
• Super block: stores information relating to a mounted file system. It is represented by a file system control block stored on disk (for disk-based file systems).
• i-node: stores information relating to a single file. It corresponds to a file system control block stored on disk. Each i-node holds the meta-information of the file: owner, group, creation time, access time and a set of pointers to the disk blocks that store the file data.
• File: stores information relating to the interaction of an open file and a process. This object only exists while a process is interacting with a file.
• Dentry: links a directory entry (pathname) with its corresponding file. Recently used dentry objects are held in a dentry cache to speed up the translation from a pathname to the i-node of the corresponding file.
All modern Unix systems allow file system data to be accessed using two mechanisms (Figure 1):
1. Memory mapping with mmap: The mmap() system call gives the application direct memory-mapped access to the kernel's page cache data. The purpose of mmap is to map file data into a process address space, so data in the file can be treated as a standard in-memory array or structure. File data is read into the page cache lazily, as processes attempt to access the mappings created with mmap() and generate page faults.
2. Direct block I/O system calls such as read and write: The read() system call reads data from block devices into the kernel cache (avoided for CD and DVD reading by means of the O_DIRECT ioctl parameter), then copies the data from the kernel cached copy into the application address space. The write() system call copies data in the opposite direction, from the application address space into the kernel cache, eventually, some time later, writing the data from the cache to disk. These interfaces are implemented using either the buffer cache or the page cache to store the data in the kernel.

The Buffer Cache
Each buffer refers to a single arbitrary block on the hard disk, and it consists of a header and an area of memory equal to the block size of the associated device.
To minimise management overhead, all buffers are held in one of several linked lists. Each linked list contains buffers in the same state: unused, free, clean, dirty, locked, etc. Every time a read occurs, the buffer cache sub-system must find out whether the target block is already in the cache; to find it quickly, a hash table of all the buffers present in the cache is maintained. The buffer cache is also used to improve writing performance. Instead of carrying out all writes immediately, the kernel stores data temporarily in the buffer cache, waiting to see if it is possible to group several writes together. A buffer that contains data waiting to be written to disk is termed dirty.

The Page Cache
The page cache, instead, holds full virtual memory pages (4 KB on the x86 Linux platform). The pages come from files in the file system, and, in fact, page cache entries are partially indexed by the file's i-node number and the offset within the file. A page is almost invariably larger than a single disk logical block, and the blocks that make up a single page cache entry may not be contiguous on disk.
The page cache is largely used to interface the requirements of the virtual memory subsystem, which uses fixed 4 KB pages, to the VFS subsystem, which uses variable size blocks or other types of techniques, such as extents in XFS and JFS.

2.2 Integration of page and buffer cache
The above two mechanisms operated semi-independently of each other. The operating system has to take special care to synchronise the two caches in order to prevent applications from receiving invalid data.
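The two access mechanisms can be compared with a small sketch, using Python's mmap module as a stand-in for the C system calls: read() copies data out of the kernel cache into an application buffer, while mmap() lets the same file bytes be addressed like an in-memory array.

```python
import mmap
import os
import tempfile

# Create a small file to read back through both interfaces.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
with open(path, "wb") as f:
    f.write(b"hello page cache")

# Mechanism 1: direct block I/O. read() copies data from the kernel's
# cached copy into a buffer in the application address space.
fd = os.open(path, os.O_RDONLY)
copied = os.read(fd, 16)
os.close(fd)

# Mechanism 2: memory mapping. mmap() exposes the file contents
# directly, so they can be indexed like a byte array.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        mapped = bytes(m[0:16])

assert copied == mapped == b"hello page cache"
```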
Furthermore, if the system becomes short on memory, it has to take hard decisions on whether to reclaim memory from the page cache or from the buffer cache. The page cache tends to be easier to deal with, since it more directly represents the concepts used in higher levels of kernel code. The buffer cache also has the limitation that cached data must always be mapped into kernel virtual space, which puts an additional, artificial limit on the amount of data that can be cached, since modern hardware can easily have more RAM than kernel virtual address space.
Thus, over time, parts of the kernel have shifted from using the buffer cache to using the page cache. The individual blocks of a page cache entry are still managed through the buffer cache, but accessing the buffer cache directly can create confusion between the two levels of caching.
This lack of integration led to inefficient overall system performance and a lack of flexibility. To achieve good performance it is important for the virtual memory and I/O subsystems to be highly integrated.
The approach taken by Linux to reduce the inefficiencies of double copies is to store the file data only in the page cache (Figure 2). Temporary mappings of page cache pages to support read() and write() are normally not needed, since Linux permanently maps all of physical memory into the kernel virtual address space. One interesting twist that Linux adds is that the device block numbers where a page is stored on disk are cached with the page, in the form of a list of buffer_head structures. When a modified page is to be written back to disk, the I/O requests can be sent to the device driver right away, without needing to read any indirect blocks to determine where the data must be written.

Fig. 2: Data is shared by page cache and buffer cache

Page-cache Unification
Following the Unified I/O and Memory Caching Subsystem of NetBSD, Linus Torvalds wanted to change the page/buffer cache behaviour of Linux. On May 4th, 2001, in a message to the Linux kernel developers' list, he wrote the following:

I do want to re-write block_read/write to use the page cache, but not because it would impact anything in this discussion. I want to do it early in 2.5.x, because:
• it will speed up accesses
• it will re-use existing code better and conceptualize things more cleanly (i.e. it would turn a disk into a _really_ simple filesystem with just one big file ;).
• it will make MM handling much better for things like fsck - the memory pressure is designed to work on page cache things.
• it will be one less thing that uses the buffer cache as a cache (I want people to think of, and use, the buffer cache as an _IO_ entity, not a cache).
It will not make the cache at bootup thing change at all (because even in the page cache, there is no commonality between a virtual mapping of a _file_ (or metadata) and a virtual mapping of a _disk_).

Although these desirable changes were not expected until 2.5.x, Linus finally decided to integrate a number of Andrea Arcangeli's patches, plus changes of his own, and released 2.4.10, which finally unified the page and buffer cache (Figure 3). Thus, an important improvement in I/O operations is expected, together with better VM tuning, especially in memory shortage situations.

Fig. 3: Unified page cache

3 Journaling File Systems
The standard file system for Linux was, for a long time, ext2fs. Ext2 was designed by Rémy Card, with the collaboration of Stephen Tweedie and Theodore Ts'o, as an improvement of the previous ext file system. Ext2fs is an i-node based file system: the i-node maintains the metadata of the file and the pointers to the actual data blocks.
To speed up I/O operations, data is temporarily held in RAM by means of the buffer cache and page cache subsystems. The problem appears if there is a system crash or electrical outage before the modified data in the cache (the dirty buffers) has been written to disk. This would leave the whole file system inconsistent: for example, a new file that was never created on disk, or files that were removed but whose i-nodes and data blocks still remain on disk.
fsck (file system check) was the common recovery tool for resolving such inconsistencies. But fsck has to scan the whole disk partition and check the interdependencies among i-nodes, data blocks and directory contents. With growing disk capacities, restoring the consistency of a file system has become a very time consuming task, which creates serious availability problems for large servers. This is the main reason for file systems to inherit database transaction and recovery technologies, and thus the appearance of Journaling (or Journal) File Systems.
A journaling file system is a fault-resilient file system in which data integrity is ensured because updates to files' metadata are written to a serial log on disk before the original disk blocks are updated. In the event of a system failure, a full journaling file system ensures that file system consistency is restored. The most common approach is to journal, or log, only the metadata of files. With logging, whenever something changes in the metadata of a file, the new attribute information is logged into a reserved area of the file system, and the file system writes the actual data to disk only after the write of the metadata to the log is complete. When a system crash occurs, the recovery code analyses the metadata log and cleans up only the inconsistent files by replaying the log.
The earliest journaling file systems, created in the mid-1980s, included Veritas (VxFS), Tolerant, and IBM's JFS. With increasing demands being placed on file systems to support terabytes of data, thousands upon thousands of files per directory and 64-bit capability, interest in journaling file systems for Linux has grown over the last years.
Linux has gained several contenders in journaling file systems in the past few months: ReiserFS from Namesys [2], XFS from SGI [3], JFS from IBM [4], and Ext3, developed by Stephen Tweedie, co-creator of ext2 [5].

2. http://www.namesys.com
3. http://oss.sgi.com/projects/xfs/
4. http://oss.software.ibm.com/developerworks/opensource/jfs/
5. http://e2fsprogs.sourceforge.net/ext2.html

While ReiserFS is a completely new file system written from scratch, XFS, JFS and Ext3 are derived from commercial products or existing file systems. XFS is based on, and partially shares code with, the file system developed by SGI for its workstations and servers. JFS was designed and developed by IBM for OS/2 Warp, and is itself derived from the AIX file system. ReiserFS is the only one included in the standard Linux kernel tree; the others are planned for inclusion in version 2.5, although XFS and JFS are fully functional, officially released as kernel patches, and of production quality.
Ext3, written by Stephen C. Tweedie, is an extension to ext2. It adds two independent modules, a transaction module and a logging module. Ext3 is close to its final version: Red Hat 7.2 already includes it as an option, and it will be the official file system of Red Hat distributions.

B-Trees
The basic tool for improving performance over traditional UNIX file systems is to avoid the use of linked lists or bitmaps for free blocks, directory entries and data block addressing, which have inherent scalability problems (typical search complexity is O(n)) and are not adequate for the new, very large capacity disks. All the new systems use balanced trees (B-Trees) or a variation of them (B+Trees).
The balanced tree is a well studied structure; it is more robust in its performance, but at the same time its management and balancing algorithms are more complex. The B+Tree structure has been used in database indexing for a long time, providing databases with a scalable and fast way to access their records. The "+" sign means that the B-Tree is a modified version of the original that:
• places all keys at the leaves,
• allows leaf nodes to be linked together,
• allows internal nodes and leaves to be of different sizes,
• never needs to change a parent when a key in a leaf is deleted, and
• makes sequential operations easier and cheaper.

3.1 ReiserFS
ReiserFS is based on fast balanced trees (B+Trees) to organise file system objects. File system objects are the structures used to maintain file information (access time, file permissions, etc.): in other words, the information contained within an i-node, directories and the files' data. ReiserFS calls those objects stat data items, directory items and direct/indirect items, respectively. ReiserFS only provides metadata journaling: in the case of an unplanned reboot, data in blocks that were being used at the time of the crash could have been corrupted, so ReiserFS does not guarantee that the file contents themselves are uncorrupted.
Unformatted nodes are logical blocks with no given format, used to store file data, while direct items consist of file data itself. Direct items are of variable size and are stored within the leaf nodes of the tree, sometimes alongside others when there is enough space within the node. File information is stored close to file data, since the file system always tries to put the stat data items and the direct/indirect items of the same file together. As opposed to direct items, the file data pointed to by indirect items is not stored within the tree. This special management of direct items exists to support small files: tail packing.
Tail packing is a special ReiserFS feature. Tails are files that are smaller than a file system block, or the trailing portions of files that do not fill up a complete file system block. To save disk space, ReiserFS uses tail packing to hold tails in as small a space as possible.
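The B+Tree properties listed above can be illustrated with a minimal sketch: keys live only in sorted leaves, the leaves are linked for cheap sequential scans, and a flat list of leaf separators stands in for the internal nodes. This is a simplification for illustration, not ReiserFS's actual on-disk format.

```python
from bisect import bisect_right

# Minimal B+Tree-flavoured index: all keys at the leaves, leaves
# linked in order, lookups done with two binary searches.
class LeafIndex:
    def __init__(self, items, leaf_size=4):
        pairs = sorted(items.items())
        self.leaves = [pairs[i:i + leaf_size]
                       for i in range(0, len(pairs), leaf_size)]
        # First key of each leaf plays the role of the internal nodes.
        self.separators = [leaf[0][0] for leaf in self.leaves]

    def search(self, key):
        # O(log n): pick the leaf, then probe inside it.
        i = max(bisect_right(self.separators, key) - 1, 0)
        leaf = self.leaves[i]
        j = bisect_right([k for k, _ in leaf], key) - 1
        if j >= 0 and leaf[j][0] == key:
            return leaf[j][1]
        return None

    def scan(self):
        # Linked leaves make in-order traversal sequential and cheap.
        for leaf in self.leaves:
            yield from leaf

idx = LeafIndex({n: f"object-{n}" for n in range(100)})
assert idx.search(37) == "object-37"
assert idx.search(999) is None
assert [k for k, _ in idx.scan()] == list(range(100))
```

A linked list or bitmap forces an O(n) scan for the same lookup, which is the scalability problem the text describes.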
3.2 XFS
On May 1, 2001, SGI made available Release 1.0 of its journaling XFS file system for Linux. XFS is recognised for its support for large disk farms and very high I/O throughput (tested up to 7 GB/sec). XFS was developed for the IRIX 5.3 SGI Unix operating system; its first version was introduced in December 1994. The target of the file system was to support very large files and high throughput for real-time video recording and playback.
To increase the scalability of the file system, XFS uses B+Trees extensively. They are used for tracking free extents, indexing directories, and keeping track of dynamically allocated i-nodes scattered throughout the file system. In addition, XFS uses an asynchronous write-ahead logging scheme for protecting metadata updates and allowing fast file system recovery.
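The asynchronous write-ahead logging idea can be sketched as follows. The record format and class names are invented for illustration and are not XFS's actual log format; the sketch only shows the two gains the text goes on to describe: many updates batched into one log write, and recovery by replaying committed records.

```python
# Toy asynchronous metadata journal: updates accumulate in memory and
# are committed to the on-disk log in one batched write.
class Journal:
    def __init__(self):
        self.buffer = []       # in-memory log records (not yet durable)
        self.on_disk_log = []  # committed records
        self.log_writes = 0    # each commit counts as one physical I/O

    def record(self, update):
        # Returns immediately: the caller does not wait for the disk.
        self.buffer.append(update)

    def commit(self):
        if self.buffer:
            self.on_disk_log.extend(self.buffer)  # one batched write
            self.buffer.clear()
            self.log_writes += 1

    def recover(self, metadata):
        # After a crash, replay only the committed log records.
        for target, value in self.on_disk_log:
            metadata[target] = value
        return metadata

j = Journal()
for n in range(8):
    j.record((f"inode-{n}", "updated"))
j.commit()
assert j.log_writes == 1              # eight updates, one log write
assert j.recover({})["inode-3"] == "updated"
```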
With the asynchronous log, no permanent metadata changes reach the disk until the data is committed to the on-disk log. XFS gains two things by writing the log asynchronously:
1. Multiple updates can be batched into a single log write. This increases the efficiency of the log writes with respect to the underlying disk array.
2. The performance of metadata updates is normally made independent of the speed of the underlying drives. This independence is limited by the amount of buffering dedicated to the log, but it is far better than the synchronous updates of older file systems.
XFS also has a fairly extensive set of userspace tools: for dumping, restoring, repairing, growing and snapshotting file systems, tools for using ACLs and disk quotas, etc.

3.3 JFS
IBM introduced its UNIX file system as the Journaled File System (JFS) with the initial release of AIX Version 3.1. It has now introduced a second file system that runs on AIX systems, called Enhanced Journaled File System (JFS2), which is available in AIX Version 5.0 and later versions. The JFS Open Source code originated from the version currently shipping with the OS/2 Warp Server for e-business.
JFS is tailored primarily for the high throughput and reliability requirements of servers. JFS uses extent-based addressing structures (Figure 5), along with clustered block allocation policies (4), to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. An extent is a sequence of contiguous blocks allocated to a file as a unit, and is described by a triple consisting of <logical offset, length, physical>. The addressing structure is a B+Tree populated with extent descriptors, rooted in the i-node and keyed by logical offset within the file.

Fig. 5: Extent based allocation

JFS logs are maintained in each file system and used to record information about operations on metadata. The log has a format that is also set by the file system creation utility. JFS logging semantics are such that, when a file system operation involving metadata changes returns a successful return code, the effects of the operation have been committed to the file system and will be seen even if the system crashes.
The old logging style introduced a synchronous write to the log disk into each i-node or VFS operation that modifies metadata. In terms of performance, this is a disadvantage when compared to other journaling file systems, such as Veritas VxFS and XFS, which use different logging styles and lazily write log data to disk. When concurrent operations are performed, this performance cost is reduced by group commit, which combines multiple synchronous write operations into a single write operation. The JFS logging style has since been improved: it currently provides asynchronous logging, which increases the performance of the file system.
JFS supports block sizes of 512, 1024, 2048, and 4096 bytes on a per-file-system basis. Smaller block sizes reduce the amount of internal fragmentation. However, small blocks can increase path length, since block allocation activities may occur more often than if a large block size were used. The default block size is 4096 bytes.
JFS dynamically allocates space for disk i-nodes as required, freeing the space when it is no longer needed. Two different directory organizations are provided:
1. The first organization is used for small directories and stores the directory contents within the directory's i-node. This eliminates the need for separate directory block I/O as well as the need to allocate separate storage.
2. The second organization is used for larger directories and represents each directory as a B+Tree keyed on name. It provides faster directory lookup, insertion, and deletion capabilities.
JFS supports both sparse and dense files, on a per-file-system basis. Sparse files allow data to be written to random locations within a file without instantiating the other, unwritten file blocks. The file size reported is the highest byte that has been written to, but the actual allocation of any given block in the file does not occur until a write operation is performed on that block.

3.4 Ext3
Ext3 is a journaling file system developed by Stephen Tweedie. It is compatible with ext2; actually, it is an ext2fs with a journal file. Ext3 is only half of a journaling file system: it is a layer atop the traditional ext2 file system that keeps a journal file of disk activity, so that recovery from an improper shutdown is much quicker than with ext2 alone. But, because it is tied to ext2, it suffers some of the limitations of the older ext2 system and therefore does not exploit all the potential of pure journaling file systems; for example, it is still block based (Figure 4) and uses sequential search of file names in directories. Its major advantages are:
• Ext3 journals and maintains ordered consistency in both the data and the metadata. Differently from the above journaling file systems, consistency is assured for the content of the file as well.
• Ext3 partitions do not have a file structure different from ext2, so porting, or backing out to the old system (by choice, or in the event the journal file were to become corrupted), is straightforward.
Ext3 reserves one of the special ext2 i-nodes for storing the journal log, but the journal can be on any i-node in any file system, or it can be on any arbitrary sub-range (set of contiguous blocks) on any block device. It is possible to have multiple file systems sharing the same journal.
The journal file's job is to record the new contents of file system metadata blocks while the file system is in the process of committing transactions. The only other requirement is that the system must ensure it can commit the transactions atomically. Three types of data blocks are written to the journal:
1. Metadata,
2. Descriptor blocks, and
3. Header blocks.
A journal metadata block contains an entire single block of file system metadata as updated by a transaction. Whenever a small change is made to the file system, an entire journal block has to be written.
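The whole-block metadata journaling just described can be sketched in miniature. The class and record layout are invented for illustration and are not Ext3's actual journal format; the sketch shows that even a tiny metadata change journals a full block image, and that several changes to the same block collapse into one record at commit time.

```python
# Toy whole-block metadata journal: every update records the entire
# updated block image, and a commit flushes the batch in one cluster.
BLOCK_SIZE = 4096

class MetadataJournal:
    def __init__(self):
        self.pending = {}   # block number -> full updated block image
        self.log = []       # committed (block number, image) records

    def update(self, block_no, offset, payload):
        # Copy-modify the full block: the journal never records a
        # sub-block delta, always the entire updated block.
        image = bytearray(self.pending.get(block_no, b"\0" * BLOCK_SIZE))
        image[offset:offset + len(payload)] = payload
        self.pending[block_no] = bytes(image)

    def commit(self):
        records = sorted(self.pending.items())
        self.log.extend(records)  # batched into one large log write
        self.pending.clear()
        return len(records)

j = MetadataJournal()
j.update(12, 0, b"inode")     # two small changes to the same block
j.update(12, 100, b"mtime")
j.update(30, 0, b"dirent")
assert j.commit() == 2                  # only two block images logged
assert len(j.log[0][1]) == BLOCK_SIZE   # each record is a full block
```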
However, this is relatively cheap, because journal I/O operations can be batched into large clusters, and the blocks can be written directly from the page cache by exploiting the buffer_head structure.
Descriptor blocks describe other metadata journal blocks, so that the recovery mechanism can copy the metadata back to the main file system. They are written before any change to the journal metadata is made.
Finally, the header blocks describe the head and tail of the journal, plus a sequence number to guarantee write ordering during recovery.

4 Performance and Conclusions
Different benchmarks (see Resources below) have shown that XFS and ReiserFS have very good performance compared to the well-tested and optimised Ext2fs. Ext3 has shown that it is slower, but getting closer to Ext2's performance, and we expect it to improve considerably over the following months. On the other hand, JFS got the worst results in all benchmarks: not only in performance, but it also has some stability problems in the Linux port.
XFS, ReiserFS and Ext3 have demonstrated that they are excellent and reliable file systems. There is an important area where XFS has higher performance: I/O operations on large files, especially compared to its closest competitor, ReiserFS. This is understandable, and subject to change over time: ReiserFS uses the generic Linux 2.4 read and write code, while XFS has ported its sophisticated IRIX I/O operations to Linux, the most important being extent-based allocation and direct I/O operations. Furthermore, the current version of ReiserFS does a complete tree traversal for every 4 KB block it writes, and then inserts one pointer at a time, which introduces an important overhead of balancing the tree while it copies data around.
For operations on small files, normally between 100 and 10,000 bytes, ReiserFS has shown the best results, provided the affected files are not in the cache yet (as occurs during booting). In case you are reading files that are already cached in RAM, the difference is almost negligible for Ext2, Ext3, ReiserFS and XFS.
Among all the journaling file systems, ReiserFS is the only one included in the standard Linux kernel tree (since 2.4.1), and SuSE has supported it for more than two years now. However, Ext3 is going to be the standard file system for Red Hat, and XFS is being used in large servers, especially in the Hollywood industry, due mainly to the influence of SGI in that market. IBM has to put a lot of effort into JFS if they want to see it in the mainstream, although it is a valid alternative for migrating AIX and OS/2 installations to Linux.

Resources
• Ext3 architecture: ftp://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/
• Introduction to Linux Journal File Systems: http://www.linuxgazette.com/issue55/florido.html
• XFS Home Page: http://oss.sgi.com/projects/xfs/
• JFS Home Page: http://oss.software.ibm.com/developerworks/opensource/jfs/
• ReiserFS Home Page: http://www.namesys.com/
• Storage Foundry: http://sourceforge.net/foundry/storage/
• OS News: http://www.osnews.com/story.php?news_id=69

Benchmarks
• Ext2, Ext3, ReiserFS, XFS and JFS benchmarks: http://www.mandrakeforum.org/article.php?sid=1212
• Ext2, ReiserFS and XFS benchmarks: http://bulmalug.net/body.phtml?nIdNoticia=642
• Pruebas con XFS, ReiserFS, Ext2FS, y FAT32 (Tests with XFS, ReiserFS, Ext2FS, and FAT32): http://bulmalug.net/body.phtml?nIdNoticia=626
• Namesys benchmarks: http://www.namesys.com/benchmarks/benchmark-results.html
• Ext2, XFS, ReiserFS and JFS Mongo benchmarks: http://bulmalug.net/body.phtml?nIdNoticia=648