Filesystem Notes

UPGRADE is a European online magazine for IT professionals, focusing on Open Source and Free Software, with contributions from various guest editors. The December 2001 issue discusses the evolving landscape of Free Software in business, the implications of software patents, and the advantages of Open Source in financial services. It also covers technical topics related to Linux file systems and the performance of different journaling file systems.


The European Online Magazine for the IT Professional

http://www.upgrade-cepis.org
Vol. II, No. 6, December 2001

Open Source / Free Software: Towards Maturity
Guest Editors: Joe Ammann, Jesús M. González-Barahona, Pedro de las Heras Quirós
Joint issue with NOVÁTICA and INFORMATIK/INFORMATIQUE

2 Presentation – Joe Ammann, Jesús M. González-Barahona, Pedro de las Heras Quirós, Guest Editors

4 Free Software Today – Pedro de las Heras Quirós and Jesús M. González-Barahona
The position of many major companies with regard to Free Software is changing. New companies are becoming giants. It is vital for the data on which we base this idea to be right and up to date. Any impression based on data from a few months ago will very possibly be wrong.

12 Should Business Adopt Free Software? – Gilbert Robert and Frédéric Schütz
We explain what Free Software is, and what its advantages are for users, and provide an overview of its status in business, in particular by looking at the obstacles which still stand in the way of its use.

20 Harm from The Hague – Richard Stallman
The proposed Hague Treaty threatens to subject software developers in Europe to U.S. software patents. The consequence is that you could be sued about information you distributed under the laws of any country, and the judgement would be enforced by your country.

23 Software Patentability with Compensatory Regulation: a Cost Evaluation – Jean Paul Smets and Hartmut Pilch
The European Patent Office has proposed to remove limitations on patentability, such as the exclusion of computer programs. The French Academy of Technologies suggests additional regulation measures in order to reduce potential abuses of software patents.

33 Open Source in a Major Swiss Bank – Klaus Bucka-Lassen and Jan Sorensen
This article highlights which advantages and disadvantages of Open Source Software are of significance for a financial services provider. It describes the problems that arose, and what convinced management to use Struts for Web application developments.

36 European Initiatives Concerning the Use of Free Software in the Public Sector – Juan Jesús Muñoz Esteban
The European Commission is beginning to make use of Free Software for some of their strategic initiatives. A study of the use of Free Software in several administrations of different countries analyses the reasons for adopting it.

41 GNU Enterprise Application Software – Neil Tiffin and Reinhard Müller
GNUe is a set of integrated business applications and tools to support accounting, supply chain, human resources, sales, manufacturing, and other business processes. We describe the project, the idea and motivation for developers and users behind it.

45 The Debian GNU/Linux Project – Javier Fernández-Sanguino Peña
The Debian GNU/Linux project is one of the most ambitious Free Software projects, involving a large number of developers creating a totally free operating system.

50 Journal File Systems in Linux – Ricardo Galli
Linux buffer/cache is really impressive and affected, positively, all the figures of my compilations, copies and random reads and writes.

57 The Crisis of Free Scientific Software – David Santo Orcero
The scientific world was among the pioneers in creating Free Software. In the 1990s Free Software started to spread into other areas. In certain fields this reached a point where there are either no free tools available, or no more free tools are being actively developed.

60 Counting Potatoes: the Size of Debian 2.2 – Jesús M. González-Barahona, Miguel A. Ortuño Pérez, Pedro de las Heras Quirós, José Centeno González and Vicente Matellán Olivera
Debian is the largest Free Software distribution, with more than 4,000 source packages in the release currently in preparation. We show that the Debian development model is at least as capable as other development methods to manage distributions of this size.

UPGRADE is the European Online Magazine for the Information Technology Professional, published bimonthly at http://www.upgrade-cepis.org/

Publisher
UPGRADE is published on behalf of CEPIS (Council of European Professional Informatics Societies, http://www.cepis.org/) by Novática (http://www.ati.es/novatica/) and Informatik/Informatique (http://www.svifsi.ch/revue/)

Chief Editors
François Louis Nicolet, Zurich <[email protected]>
Rafael Fernández Calvo, Madrid <[email protected]>

Editorial Board
Peter Morrogh, CEPIS President
Prof. Wolffried Stucky, CEPIS President-Elect
Fernando Sanjuán de la Rocha
Rafael Fernández Calvo, ATI
Prof. Carl August Zehnder and François Louis Nicolet, SVI/FSI

English Editors: Mike Andersson, Richard Butchart, David Cash, Arthur Cook, Tracey Darch, Laura Davies, Nick Dunn, Rodney Fennemore, Hilary Green, Roger Harris, Michael Hird, Jim Holder, Alasdair MacLeod, Pat Moody, Adam David Moss, Phil Parkin, Brian Robson
Cover page designed by Antonio Crespo Foix, © ATI 2001
Layout: Pascale Schürmann
E-mail addresses for editorial correspondence: <[email protected]> and <[email protected]>
E-mail address for advertising correspondence: <[email protected]>

Copyright
© Novática and Informatik/Informatique. All rights reserved. Abstracting is permitted with credit to the source. For copying, reprint, or republication permission, write to the editors.
The opinions expressed by the authors are their exclusive responsibility.

Coming issue: “Knowledge Management”

Open Source / Free Software: Towards Maturity

Journal File Systems in Linux


Ricardo Galli

First of all, there is no clear winner: XFS is better in some aspects or cases, ReiserFS in others, and both are better than Ext2 in the sense that they are comparable in performance (again, sometimes faster, sometimes slightly slower) but they are journaling file systems, and you already know what their advantages are… And perhaps the most important moral is that the Linux buffer/cache is really impressive and affected, positively, all the figures of my compilations, copies and random reads and writes. So, I would say, buy memory and go journaled ASAP…

Keywords: Linux, Operating Systems, Page Cache, Buffer Cache, Journal File Systems, ext3, xfs, reiserfs, jfs.

1 Introduction
The paragraph above was the first one in an article published on the Balearic Islands Linux User Group Web¹ which described a second run of benchmarks of the Linux journaled file systems in comparison with traditional Unix file systems. Although somewhat informal, both benchmarks covered FAT32, Ext2, ReiserFS, XFS, and JFS for several cases: Hans Reiser’s Mongo, file copying, kernel compilation and a small test program that tried to simulate the access pattern of database systems.
Both articles were among the first published, and our server was overwhelmed by the slashdot effect that followed their publication on slashdot.org. They mainly served to demystify the common belief that journaling file systems are significantly slower than traditional Unix file systems (UFS) and their derivatives, namely ext2, which has been the standard file system in Linux. Since those days, more benchmarks have been published, but the truth is still the same: there is no clear winner. Some systems perform better than others in some cases; for example, ReiserFS is really good for reading small to medium size files, while XFS behaves better for large files, and JFS is said to facilitate the migration of existing systems running on OS/2 Warp and AIX.
This article presents all the journal file systems available for Linux: Ext3, ReiserFS, XFS and JFS. We also introduce the basic concepts of file systems, buffer cache, and page cache implemented in the Linux kernel. The performance of the different file systems is strongly affected by those optimization techniques. Indeed, not only the performance is affected, but also the implementation and porting of the different file systems. SGI introduced a new module, pagebuf, that serves as the interface between their own XFS buffering techniques and the Linux page cache.

1. http://bulmalug.net/body.phtml?nIdNoticia=642

Dr. Ricardo Galli, born in 1965, is an associate professor at the University of the Balearic Islands. He holds a Computer Engineering Degree (Argentina) and a Ph.D. in Computer Science (University of the Balearic Islands). His PhD thesis was entitled “Data Consistency Methods for Collaborative 3D Editing”. He has worked in the area of Computer Graphics since 1991 and CSCW (Computer Supported Collaborative Work) since 1995. He has published more than 40 papers in national and international conferences and magazines, including IEEE Computer Graphics and Applications and IEEE Multimedia, and book chapters. He has also participated in and led several European R&D projects, and he is also a coordinator of the Socrates/Erasmus European Program. He was a co-founder and shareholder of Atlas-IAP S.L., a local ISP. He has directed or coordinated the development of the digital (Internet) version of six local newspapers: www.diaridebalears.com, www.ultimahora.es, www.majorcadailybulletin.es, www.mallorcamagazin.net, etc., and several e-commerce sites. He is collaborating in some Open Source projects, such as Alternative PHP Cache, and he is a member of the Balearic Islands Linux User Group. Currently, he is coordinating the European R&D Project “e-Content: MNM – Minority News Paper to Multimedia” and is also the CTO of ISA3D, a Europe-wide enterprise (Portugal, The Netherlands, Spain), a spin-off of the M3D European Project. <[email protected]>, <m3d.uib.es/gallir/>

2 The Linux Virtual File System
A file is a very important abstraction in programming. Files store data permanently and offer a few simple but powerful primitives to programmers. Files are normally organised in a tree-like hierarchy where intermediate nodes are directories, which in turn are capable of grouping files and sub-directories. The file system is the way the operating system organises, manages and maintains the file hierarchy on mass-storage devices, normally hard disks.
Every modern operating system supports several different, disparate file systems. In order to keep the operating system modular, and to provide applications with a uniform programming interface (API), a higher layer that implements the common functionality of those underlying file systems is implemented in the kernel: the Virtual File System.
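The uniform programming interface that the Virtual File System gives to applications can be illustrated with a minimal sketch. It is not taken from the article: the helper names `read_first_bytes` and `demo` are hypothetical, and the use of `/proc/version` assumes a Linux machine (the entry is skipped if it does not exist). The point is only that the same open/read/close calls work whether the path lives on a regular file system or on a special one such as /proc.

```python
import os
import tempfile


def read_first_bytes(path, count=64):
    """Read up to `count` bytes with the same open/read/close calls,
    whatever file system the path happens to live on."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, count)
    finally:
        os.close(fd)


def demo():
    """Read from a regular file and, if present, from a /proc entry."""
    results = {}

    # A regular file: stored on whatever file system backs the temp directory.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from a regular file\n")
        regular_path = tmp.name
    try:
        results["regular"] = read_first_bytes(regular_path)
    finally:
        os.unlink(regular_path)

    # A special file system entry (assumption: Linux with /proc mounted).
    proc_path = "/proc/version"
    if os.path.exists(proc_path):
        results["proc"] = read_first_bytes(proc_path)

    return results


if __name__ == "__main__":
    for kind, data in demo().items():
        print(kind, data)
```

The application code never names the underlying file system; choosing the right implementation for each path is left to the kernel layer described above.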

50 UPGRADE Vol. II, No. 6, December 2001 © Novática and Informatik/Informatique



The file systems supported by the Linux VFS fall into three categories:
1. Disk based, including hard disk, floppy disk and CD-ROM file systems, such as ext2fs, ReiserFS, XFS, ext3fs, UFS, iso9660, etc.
2. Network based, including NFS, Coda, and SMB.
3. Special file systems, including /proc, ramfs, and devfs.
The common file model can be viewed as object-oriented, with objects being software constructs (data structures and associated methods/functions) of the following types:
• Super block: stores information relating to a mounted file system. It is represented by a file system control block stored on disk (for disk based file systems).
• i-node: stores information relating to a single file. It corresponds to a file system control block stored on disk. Each i-node holds the meta-information of the file: owner, group, creation time, access time and a set of pointers to the disk blocks that store the file data.
• File: stores information relating to the interaction of an open file and a process. This object only exists while a process is interacting with a file.
• Dentry: links a directory entry (pathname) with its corresponding file. Recently used dentry objects are held in a dentry cache to speed up the translation from a pathname to the i-node of the corresponding file.
All modern Unix systems allow file system data to be accessed using two mechanisms (Figure 1).
1. Memory mapping with mmap: The mmap() system call gives the application direct memory-mapped access to the kernel’s page cache data. The purpose of mmap is to map file data into a virtual memory address space, so data in the file can be treated as a standard in-memory array or structure. File data is read into the page cache lazily as processes attempt to access the mappings created with mmap() and generate page faults.
2. Direct block I/O system calls such as read and write: The read() system call reads data from block devices into the kernel cache (which can be bypassed, for example for CD and DVD reading, by means of the O_DIRECT open() flag), then copies the data from the kernel cached copy into the application address space. The write() system call copies data in the opposite direction, from the application address space into the kernel cache; eventually, in the near future, the data is written from the cache to disk. These interfaces are implemented using either the buffer cache or the page cache to store the data in the kernel.

Fig. 1: Buffer cache and page cache

2.1 Linux Page-cache and Buffer-cache
In older Linux versions (and UNIX-like operating systems in general), memory-mapping requests were handled by the virtual memory management subsystem (VM or MM), while I/O calls were handled independently by the I/O subsystem. For example, in Linux up to version 2.2.x, the VM subsystem and the I/O subsystem each have their own data caching mechanisms to improve performance: buffer cache and page cache.

The Buffer Cache
The buffer cache holds copies of individual disk blocks. Cache entries are indexed by device and block number. Each buffer refers to a single arbitrary block on the hard disk, and it consists of a header and an area of memory equal to the block size of the associated device.
To minimise management overhead, all buffers are held in one of several linked lists. Each linked list contains buffers in the same state: unused, free, clean, dirty, locked, etc. Every time a read occurs, the buffer cache subsystem must find out whether the target block is already in the cache. To find it quickly, a hash table is maintained of all the buffers present in the cache. The buffer cache is also used to improve writing performance. Instead of carrying out all writes immediately, the kernel stores data temporarily in the buffer cache, waiting to see if it is possible to group several writes together. A buffer that contains data that is waiting to be written to disk is termed dirty.

The Page Cache
The page cache, instead, holds full virtual memory pages (4 KB on the x86 Linux platform). The pages come from files in the file system, and, in fact, page cache entries are partially indexed by the file i-node number and the offset within the file. A page is almost invariably larger than a single disk logical block, and the blocks that make up a single page cache entry may not be contiguous on the disk.
The page cache is largely used to interface the requirements of the virtual memory subsystem, which uses fixed 4 KB size pages, to the VFS subsystem, which uses variable size blocks or other types of techniques, such as extents in XFS and JFS.

2.2 Integration of page and buffer cache
The above two mechanisms operated semi-independently of each other. The operating system has to take special care to synchronise the two caches in order to prevent applications from receiving invalid data. Furthermore, if the system becomes


short on memory, the system has to take hard decisions on whether to reclaim memory from the page cache or the buffer cache.
The page cache tends to be easier to deal with, since it more directly represents the concepts used in higher levels of the kernel code. The buffer cache also has the limitation that cached data must always be mapped into kernel virtual space, which puts an additional artificial limit on the amount of data which can be cached, since modern hardware can easily have more RAM than kernel virtual address space.
Thus, over time, parts of the kernel have shifted over from using the buffer cache to using the page cache. The individual blocks of a page cache entry are still managed through the buffer cache. But accessing the buffer cache directly can create confusion between the two levels of caching.
This lack of integration led to inefficient overall system performance and a lack of flexibility. To achieve good performance it is important for the virtual memory and I/O subsystems to be highly integrated.
The approach taken by Linux to reduce the inefficiencies of double copies is to store the file data only in the page cache (Figure 2).

Fig. 2: Data is shared by page cache and buffer cache

Temporary mappings of page cache pages to support read() and write() are normally not needed, since Linux permanently maps all of physical memory into the kernel virtual address space. One interesting twist that Linux adds is that the device block numbers where a page is stored on disk are cached with the page in the form of a list of buffer_head structures. When a modified page is to be written back to disk, the I/O requests can be sent to the device driver right away, without needing to read any indirect blocks to determine where the data must be written.

Page-cache Unification
Following the Unified I/O and Memory Caching Subsystem for NetBSD, Linus Torvalds wanted to change the page-buffer cache behaviour for Linux. On May 4th, 2001, in a message to the Linux-kernel developers’ list, he wrote the following:
    I do want to re-write block_read/write to use the page cache, but not because it would impact anything in this discussion. I want to do it early in 2.5.x, because:
    • it will speed up accesses
    • it will re-use existing code better and conceptualize things more cleanly (i.e. it would turn a disk into a _really_ simple filesystem with just one big file;).
    • it will make MM handling much better for things like fsck - the memory pressure is designed to work on page cache things.
    • it will be one less thing that uses the buffer cache as a cache (I want people to think of, and use, the buffer cache as an _IO_ entity, not a cache).
    It will not make the cache at bootup thing change at all (because even in the page cache, there is no commonality between a virtual mapping of a _file_ (or metadata) and a virtual mapping of a _disk_).
Although these desirable changes were not expected until 2.5.x, Linus finally decided to integrate a bunch of Andrea Arcangeli patches, plus changes of his own, and released 2.4.10, which finally unified the page and buffer cache (Figure 3). Thus, an important improvement in I/O operations is expected, together with better VM tuning, especially for memory shortage situations.

Fig. 3: Unified to page cache

3 Journaling File Systems
The standard file system for Linux for a long time was ext2fs. Ext2 was designed by Wayne Davidson with the collaboration of Stephen Tweedie and Theodore Ts’o. It is an improvement of the previous ext file system designed by Rémy Card. The ext2fs is an i-node based file system; the i-node


maintains the metadata of the file and the pointers to the actual data blocks.
To speed up I/O operations, data is temporarily held in RAM by means of the buffer-cache and page-cache subsystems. The problem appears if there is a system crash or power outage before the modified data in the cache (dirty buffers) has been written to disk. This would cause an inconsistency of the whole file system, for example a new file that was not created on the disk, or files that were removed but whose i-nodes and data blocks still remain on the disk.
The fsck (file system check) was the common recovery tool to resolve the inconsistencies. But fsck has to scan the whole disk partition and check the interdependencies among i-nodes, data blocks and directory contents. With the growth of disk capacity, restoring the consistency of the file system has become a very time consuming task, which creates serious availability problems for large servers. This is the main reason for file systems to inherit database transaction and recovery technologies, and thus for the appearance of Journaling File Systems or Journal File Systems.
A journaling file system is a fault-resilient file system in which data integrity is ensured because updates to files’ metadata are written to a serial log on disk before the original disk blocks are updated. In the event of a system failure, a full journaling file system ensures that the file system consistency is restored. The most common approach is a method of journaling or logging the metadata of files. With logging, whenever something changes in the metadata of a file, this new attribute information will be logged into a reserved area of the file system. The file system will write the actual data to the disk only after the write of the metadata to the log is complete. When a system crash occurs, the system recovery code will analyse the metadata log and try to clean up only those inconsistent files by replaying the log file.
The earliest journaling file systems, created in the mid-1980s, included Veritas (VxFS), Tolerant, and IBM’s JFS. With increasing demands being placed on file systems to support terabytes of data, thousands upon thousands of files per directory and 64-bit capability, the interest in journaling file systems for Linux has grown over the last few years.
Linux has gained four new contenders among the journaling file systems in the past few months: ReiserFS from Namesys², XFS from SGI³, JFS from IBM⁴ and Ext3 developed by Stephen Tweedie, co-creator of ext2⁵.
While ReiserFS is a completely new file system written from scratch, XFS, JFS and Ext3 are derived from commercial products or existing file systems. XFS is based on, and partially shares code with, the system developed by SGI for its workstations and servers. JFS was designed and developed by IBM for its OS/2 Warp, which is itself derived from the AIX file system.

2. http://www.namesys.com
3. http://oss.sgi.com/projects/xfs/
4. http://oss.software.ibm.com/developerworks/opensource/jfs/
5. http://e2fsprogs.sourceforge.net/ext2.html

ReiserFS is the only one included in the standard Linux kernel tree; the others are planned to be included in version 2.5, although XFS and JFS are fully functional, officially released as kernel patches and of production quality.
Ext3, written by Stephen C. Tweedie, is an extension to ext2. It adds two independent modules, a transaction module and a logging module. Ext3 is close to its final version; Red Hat 7.2 already includes it as an option, and it will be the official file system of Red Hat distributions.

B-Trees
The basic tool for improving performance compared to traditional UNIX file systems is to avoid the use of linked lists or bitmaps for free blocks, directory entries and data block addressing, which have inherent scalability problems (typical complexity for search is O(n)) and are not adequate for new, very large capacity disks. All the new systems use Balanced Trees (B-Trees) or variations of them (B+Trees).
The balanced tree is a well studied structure. Balanced trees are more robust in their performance, but at the same time their management and balancing algorithms are more complex. The B+Tree structure has been used for database indexing for a long time. This structure provided databases with a scalable and fast manner to access their records. The + sign means that the B-Tree is a modified version of the original that:
• Places all keys at the leaves.
• Allows leaf nodes to be linked together.
• Allows internal nodes and leaves to be of different sizes.
• Never needs to change a parent if a key in a leaf is deleted.
• Makes sequential operations easier and cheaper.

3.1 ReiserFS
ReiserFS is based on fast balanced trees (B+Tree) to organise file system objects. File system objects are the structures used to maintain file information: access time, file permissions, etc. In other words, the information contained within an i-node, directories and the files’ data. ReiserFS calls those objects stat data items, directory items and direct/indirect items, respectively. ReiserFS only provides metadata journaling. In the case of an unplanned reboot, data in blocks that were being used at the time of the crash could have been corrupted; thus ReiserFS does not guarantee that the file contents themselves are uncorrupted.
Unformatted nodes are logical blocks with no given format, used to store file data, and the direct items consist of file data itself. Also, those items are of variable size and stored within the leaf nodes of the tree, sometimes with others if there is enough space within the node. File information is stored close to file data, since the file system always tries to put the stat data items and the direct/indirect items of the same file together. Unlike direct items, the file data pointed to by indirect items is not stored within the tree. This special management of direct items is due to small file support: tail packing.
Tail packing is a special ReiserFS feature. Tails are files that are smaller than a file system block or the trailing portions of files that do not fill up a complete file system block. To save disk space, ReiserFS uses tail packing to hold tails into as small


a space as possible. Generally, this allows a ReiserFS file system to hold around 5% more than an equivalent ext2 file system. The direct items are intended to keep small file data and even the tails of the files. Therefore, several tails could be kept within the same leaf node.
ReiserFS has excellent small-file performance because it is able to incorporate these tails into its B-Tree so that they are really close to the stat data. Since tails do not fill up a complete block, they can waste disk space.
The problem is that using this technique of keeping the files’ tails together would increase external fragmentation, since the file data is now further from the file tail. Moreover, the task of packing tails is time-consuming and leads to performance penalties. This is a consequence of the memory shifts needed when someone appends data to a file. Namesys realised this problem and allows the system administrator to disable tail packing by specifying the notail option at the time the file system is mounted or even remounted.
ReiserFS uses fixed size block (4 KB) oriented allocation (Figure 4), which negatively affects the performance of I/O operations on large files. The other weakness of ReiserFS is its significantly worse performance with sparse files compared to ext2, although Namesys is working on optimising this case.

Fig. 4: Block based allocation

3.2 XFS
On May 1, 2001, SGI made available Release 1.0 of its journaling XFS file system for Linux. XFS is recognised for its support for large disk farms and very high I/O throughput (tested up to 7 GB/sec). XFS was developed for the IRIX 5.3 SGI Unix operating system; its first version was introduced in December 1994. The target of the file system was to support very large files and high throughput for real time video recording and playing.
To increase the scalability of the file system, XFS makes extensive use of B+Trees. They are used for tracking free extents, for indexing directories and to keep track of dynamically allocated i-nodes scattered throughout the file system. In addition, XFS uses an asynchronous write ahead logging scheme for protecting metadata updates and allowing fast file system recovery.
XFS uses an extent based space allocation (Figure 5), and it has features like delayed allocation, space pre-allocation and space coalescing on deletion, and goes to great lengths in attempting to lay out files using the largest extents possible. To make the management of large amounts of contiguous space in a file efficient, XFS uses very large extent descriptors in the file extent map. Each descriptor can describe up to two million file system blocks. Describing large numbers of blocks with a single extent descriptor eliminates the CPU overhead of scanning entries in the extent map to determine whether blocks in the file are contiguous: it can simply read the length of the extent rather than looking at each entry to see if it is contiguous with the previous entry.

Fig. 5: Extent based allocation

XFS allows variable sized blocks, from 512 bytes to 64 kilobytes, on a per file system basis. Changing the file system block size can vary fragmentation. File systems with large numbers of small files typically use smaller block sizes in order to avoid wasting space via internal fragmentation. File systems with large files tend to make the opposite choice and use large block sizes in order to reduce external fragmentation of the file system and their files’ extents.
The buffering layer of XFS is a complex chunk of code on IRIX and very IRIX-centric, so in porting to Linux this interface was redesigned and rewritten from scratch. The result is the Linux pagebuf module, which provides the interface between XFS and the virtual memory subsystem and also between XFS and the Linux block device layer.
XFS supports ACLs (integrated with the Samba server) and transactional quotas. On Linux it supports per-group quotas instead of the IRIX per-project quotas, as this is the way quotas are implemented in Linux file systems (and Linux has no equivalent concept to “projects”). More esoteric features of XFS on IRIX that provide file system services customized for specific demanding applications (e.g. real-time video serving) have not been ported to Linux yet.
The normal mode of operation for XFS is to use an asynchronously written log. It still ensures that the write ahead logging protocol is followed in that modified data cannot be flushed to


disk until the data is committed to the on-disk log. XFS gains two things
by writing the log asynchronously:
1. Multiple updates can be batched into a single log write. This increases
   the efficiency of the log writes with respect to the underlying disk
   array.
2. The performance of metadata updates is normally made independent of the
   speed of the underlying drives. This independence is limited by the
   amount of buffering dedicated to the log, but it is far better than the
   synchronous updates of older file systems.
XFS also has a fairly extensive set of userspace tools for dumping,
restoring, repairing, growing and snapshotting file systems, as well as
tools for using ACLs and disk quotas.

3.3 JFS
IBM introduced its UNIX file system as the Journaled File System (JFS) with
the initial release of AIX Version 3.1. It has now introduced a second file
system for AIX systems, called Enhanced Journaled File System (JFS2), which
is available in AIX Version 5.0 and later versions. The JFS open source
code originated from the code currently shipping with OS/2 Warp Server for
e-business.
JFS is tailored primarily for the high throughput and reliability
requirements of servers. JFS uses extent-based addressing structures
(Figure 5), along with clustered block allocation policies (4), to produce
compact, efficient, and scalable structures for mapping logical offsets
within files to physical addresses on disk. An extent is a sequence of
contiguous blocks allocated to a file as a unit and is described by a
triple consisting of <logical offset, length, physical>. The addressing
structure is a B+Tree populated with extent descriptors, rooted in the
i-node and keyed by logical offset within the file.
JFS logs are maintained in each file system and used to record information
about operations on metadata. The log has a format that is also set by the
file system creation utility.
JFS logging semantics are such that, when a file system operation involving
metadata changes returns a successful return code, the effects of the
operation have been committed to the file system and will be seen even if
the system crashes.
The old logging style introduced a synchronous write to the log disk into
each i-node or VFS operation that modifies metadata. This is a performance
disadvantage when compared to other journaling file systems, such as
Veritas VxFS and XFS, which use different logging styles and lazily write
log data to disk. When concurrent operations are performed, this
performance cost is reduced by group commit, which combines multiple
synchronous write operations into a single write operation. The JFS logging
style has since been improved and it currently provides asynchronous
logging, which increases the performance of the file system.
JFS supports block sizes of 512, 1024, 2048, and 4096 bytes on a per-file
system basis. Smaller block sizes reduce the amount of internal
fragmentation. However, small blocks can increase path length since block
allocation activities may occur more often than if a large block size were
used. The default block size is 4096 bytes.
JFS dynamically allocates space for disk i-nodes as required, freeing the
space when it is no longer needed. Two different directory organizations
are provided.
1. The first organization is used for small directories and stores the
   directory contents within the directory’s i-node. This eliminates the
   need for separate directory block I/O as well as the need to allocate
   separate storage.
2. The second organization is used for larger directories and represents
   each directory as a B+Tree keyed on name. It provides faster directory
   lookup, insertion, and deletion capabilities.
JFS supports both sparse and dense files, on a per-file system basis.
Sparse files allow data to be written to random locations within a file
without instantiating other, previously unwritten file blocks. The file
size reported is the highest byte that has been written to, but the actual
allocation of any given block in the file does not occur until a write
operation is performed on that block.

3.4 Ext3
Ext3 is a journaling file system developed by Stephen Tweedie. It is
compatible with ext2; in fact, it is an ext2fs with a journal file. Ext3 is
the half of a journaling file system mentioned earlier: it is a layer atop
the traditional ext2 file system that keeps a journal file of disk activity
so that recovery from an improper shutdown is much quicker than that of
ext2 alone. But, because it is tied to ext2, it suffers from some of the
limitations of the older ext2 system and therefore does not exploit all the
potential of the pure journaling file systems. For example, it is still
block based (Figure 4) and uses sequential search of file names in
directories. Its major advantages are:
• Ext3 journals and maintains order consistency in both the data and the
  metadata. Unlike the journaling file systems above, consistency is
  assured for the content of the file as well.
• Ext3 partitions do not have a file structure different from ext2, so
  porting or backing out to the old system, by choice or in the event the
  journal file were to become corrupted, is straightforward.
Ext3 reserves one of the special ext2 i-nodes for storing the journal log,
but the journal can be on any i-node in any file system, or it can be on an
arbitrary sub-range (a set of contiguous blocks) of any block device. It is
possible to have multiple file systems sharing the same journal.
The job of the journal file is to record the new contents of file system
metadata blocks while the file system is in the process of committing
transactions. The only other requirement is that the system must ensure
that it can commit the transactions atomically.
Three types of blocks are written to the journal:
1. Metadata,
2. Descriptor blocks, and
3. Header blocks.
A journal metadata block contains the entire single block of file system
metadata as updated by a transaction. Whenever even a small change is made
to the file system, an entire journal block has to be written. However,
this is relatively cheap because journal I/O operations can be batched into
large clusters and the
blocks can be written directly from the page cache system by exploiting the
buffer_head structure.
Descriptor blocks describe other metadata journal blocks so the recovery
mechanism can copy the metadata back to the main file system. They are
written before any change to the journal metadata is made.
Finally, the header blocks describe the head and tail of the journal plus a
sequence number to guarantee write ordering during recovery.

4 Performance and Conclusions
Different benchmarks (see Resources below) have shown that XFS and ReiserFS
perform very well compared to the well-tested and optimised Ext2fs. Ext3
proved slower, but it is getting closer to Ext2 performance, and we expect
its performance to improve considerably over the following months. On the
other hand, JFS gets the worst results in all benchmarks, and besides its
poor performance it also has some stability problems in the Linux port.
XFS, ReiserFS and Ext3 have demonstrated that they are excellent and
reliable file systems. There is an important area where XFS has higher
performance: I/O operations on large files, especially compared to its
closest competitor, ReiserFS. This is understandable and subject to change
over time: ReiserFS uses the generic read and write code of Linux 2.4,
while XFS has ported sophisticated IRIX I/O features to Linux, the most
important being extent-based allocation and direct I/O. Furthermore, the
current version of ReiserFS does a complete tree traversal for every 4 KB
block it writes, and then inserts one pointer at a time, which introduces a
significant overhead of balancing the tree while it copies data around.
For operations on small files, normally between 100 and 10,000 bytes,
ReiserFS has shown the best results, if the affected files are not in the
cache yet (as occurs during booting). If you are reading files that are
already cached in RAM, the difference is almost negligible for Ext2, Ext3,
ReiserFS and XFS.
Among all journaling file systems, ReiserFS is the only one that has been
included in the standard Linus tree since 2.4.1, and SuSE has supported it
for more than two years now. However, Ext3 is going to be the standard file
system for Red Hat, and XFS is being used in large servers, especially in
the Hollywood industry, due mainly to the influence of SGI in that market.
IBM will have to put a lot of effort into JFS if they want to see it in the
mainstream, although it is a valid alternative for migrating AIX and OS/2
installations to Linux.

Resources
• Ext3 architecture:
  ftp://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/
• Introduction to Linux Journal File Systems:
  http://www.linuxgazette.com/issue55/florido.html
• XFS Home Page: http://oss.sgi.com/projects/xfs/
• JFS Home Page:
  http://oss.software.ibm.com/developerworks/opensource/jfs/
• ReiserFS Home Page: http://www.namesys.com/
• Storage Foundry: http://sourceforge.net/foundry/storage/
• OS News: http://www.osnews.com/story.php?news_id=69

Benchmarks
• Ext2, Ext3, ReiserFS, XFS and JFS benchmarks:
  http://www.mandrakeforum.org/article.php?sid=1212
• Ext2, ReiserFS and XFS Benchmarks:
  http://bulmalug.net/body.phtml?nIdNoticia=642
• Pruebas con XFS, ReiserFS, Ext2FS, y FAT32 (Tests with XFS, ReiserFS,
  Ext2FS and FAT32):
  http://bulmalug.net/body.phtml?nIdNoticia=626
• Namesys benchmarks:
  http://www.namesys.com/benchmarks/benchmark-results.html
• Ext2, XFS, ReiserFS and JFS Mongo Benchmarks:
  http://bulmalug.net/body.phtml?nIdNoticia=648
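
As a rough illustration of the extent triples described in Section 3.3, the
sketch below maps a logical block of a file to a physical block on disk. It
is not JFS code: the names Extent and ExtentMap are hypothetical, all
offsets are assumed to be counted in blocks, and a sorted list searched
with bisect stands in for the B+Tree keyed by logical offset.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    """One <logical offset, length, physical> triple, all counted in blocks."""
    logical: int   # first logical block of the file covered by this extent
    length: int    # number of contiguous blocks in the extent
    physical: int  # first physical block of the extent on disk


class ExtentMap:
    """Hypothetical per-file extent index (a sorted list, not a real B+Tree)."""

    def __init__(self, extents: List[Extent]) -> None:
        self._extents = sorted(extents, key=lambda e: e.logical)
        self._keys = [e.logical for e in self._extents]

    def lookup(self, logical_block: int) -> Optional[int]:
        """Return the physical block for a logical block, or None for a hole."""
        # Find the last extent whose starting logical block is <= the target.
        i = bisect_right(self._keys, logical_block) - 1
        if i < 0:
            return None
        ext = self._extents[i]
        if logical_block >= ext.logical + ext.length:
            return None  # the block lies in a gap between extents
        return ext.physical + (logical_block - ext.logical)


file_map = ExtentMap([Extent(0, 8, 1000), Extent(8, 4, 5000), Extent(20, 2, 7000)])
print(file_map.lookup(3), file_map.lookup(9), file_map.lookup(15))
```

With the three example extents, blocks 3 and 9 resolve to physical blocks
1003 and 5001, while block 15 falls between two extents and yields None.
Three descriptors cover fourteen blocks here, which is the compactness that
extent-based addressing aims for compared with one pointer per block.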

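
The batching idea behind the asynchronously written XFS log (Section 3.2)
and JFS group commit (Section 3.3) can be sketched in a few lines. This is
a toy model under stated assumptions, not code from either file system:
BatchedLog and its methods are hypothetical names, a Python list stands in
for the on-disk log, and buffer_limit models the amount of buffering
dedicated to the log.

```python
from typing import List


class BatchedLog:
    """Toy write-ahead log that buffers records and writes them in batches."""

    def __init__(self, buffer_limit: int) -> None:
        self.buffer_limit = buffer_limit         # records held before a forced flush
        self.pending: List[str] = []             # records not yet written to the log
        self.disk_writes: List[List[str]] = []   # each entry models one log write

    def log(self, record: str) -> None:
        """Queue a metadata update; write the log only when the buffer is full."""
        self.pending.append(record)
        if len(self.pending) >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        """Write every pending record with a single log write."""
        if self.pending:
            self.disk_writes.append(self.pending)
            self.pending = []

    def is_committed(self, record: str) -> bool:
        """Write-ahead rule: a change may go to disk only once this is True."""
        return any(record in batch for batch in self.disk_writes)


log = BatchedLog(buffer_limit=4)
for i in range(10):
    log.log(f"update-{i}")
log.flush()
print(len(log.disk_writes))
```

In this run ten metadata updates reach the log in three writes instead of
ten. A larger buffer_limit would batch more aggressively, at the cost of
more updates waiting in memory before they are committed.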
