Fds Notes
Structure
2.0 Objectives
2.1 Introduction
2.2 File Organisation
2.2.1 Sequential File Organisation
2.2.2 Indexed Sequential File
2.3 Random Access File Organisation
2.4 Multi-key File Organisation
2.4.1 Multilist File Organisation
2.4.2 Inverted File Organisation
2.5 Summary
2.6 Answers to Self Check Exercises
2.7 Keywords
2.8 References and Further Reading
2.0 OBJECTIVES
In the preceding Unit of this block you learnt about database concepts, the
various types of databases, and database models. You have seen that files are
the main ingredient of databases. In this Unit you will learn the concepts of
files in the computer environment and how such files are organised.
After the completion of this Unit, you will be able to:
define file organisation and discuss its different types;
understand various file organisation techniques; and
discuss various types of indexes used in file organisation.
2.1 INTRODUCTION
Generally speaking, a file consists of a collection of records. A key element in
file management is the way in which the records themselves are organised
inside the file, since this heavily affects system performance as far as finding
and accessing records are concerned. Here, by “organisation”, we refer to the
logical arrangement of the records in a file (their ordering or, more generally,
the presence of “closeness” relations between them based on their content),
and not to the physical layout of the file as stored on a storage medium.
However, the method of accessing records in a file depends upon the physical
medium on which the files and records are stored. Magnetic tape is sequential
by its very nature. To read a record you must start at the beginning of the tape
and read each record one after another until you get to the one you want, just
as when you listen to a song recorded on an audio tape. With disks, of course,
random access of records is possible. It is the same as the difference between
an audio cassette and an audio compact disc. With an audio tape, you have to
start at the beginning and run the tape forward until you get to the song you
want to hear. With a compact disc you can play the songs in random order or
go directly to the track you want to hear.
In this Unit, we will discuss the ways data are represented in files on external
storage devices so that the required functions (e.g., retrieval, update) may be
carried out efficiently. The organisation method most suitable for any
application will depend upon such factors as the kind of external storage
available, types of queries allowed, number of keys, mode of retrieval and
mode of update.
[Figure: hierarchy of data in a file: Characters, Fields, Records, File]
[Figure: records of a file in sequence: ..., Record 2, ..., Record n-1, Record N, End of file]
[Figure: an index file with columns Primary Key Value (K) and Block pointer (P); entries shown: M 105, M 109, M 113]
Fig. 2.3: A Primary Index File and Data File
A primary index is an ordered file with two fields. The first field is of the same
data type as the primary key field and the second is a pointer to a disk block,
that is, a block address (Fig. 2.3). The index file contains one entry (record) for
each block in the data file. To find a record using the primary index, the primary
key value is first located in the index and the indicated address is then used to
find the record. Thus, whenever a record is accessed, the system first looks at
the index, identifies the block, and then searches sequentially within the block.
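The two-step lookup described above can be sketched roughly as follows. This is a minimal illustration, not a description of any particular system: the sample blocks, the names `data_blocks`, `primary_index` and `find_record`, and the choice to store the first key of each block in the index are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (primary key, record) pairs,
# stored in ascending primary key order.
data_blocks = [
    [(101, "rec-101"), (103, "rec-103"), (104, "rec-104")],
    [(105, "rec-105"), (107, "rec-107"), (108, "rec-108")],
    [(109, "rec-109"), (111, "rec-111"), (112, "rec-112")],
]

# Primary index: one entry per block, holding (first key in block, block number).
# Using the first key of the block as the index key is an assumption.
primary_index = [(block[0][0], number) for number, block in enumerate(data_blocks)]
index_keys = [key for key, _ in primary_index]


def find_record(key):
    """Return the record stored under key, or None if it is absent."""
    # Step 1: search the ordered index for the last entry whose key <= search key.
    position = bisect_right(index_keys, key) - 1
    if position < 0:
        return None  # the key is smaller than every key in the file
    block_number = primary_index[position][1]
    # Step 2: search sequentially within the identified block.
    for stored_key, record in data_blocks[block_number]:
        if stored_key == key:
            return record
    return None
```

Because the index has only one entry per block, it is much smaller than the data file, which is why searching it first is cheaper than scanning the records themselves.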
A clustering index stores data in a manner similar to a phone directory, where
all people with the same last name are grouped together. A clustering index is
specified on a field that does not have a distinct value for each record. The
records are then stored in ascending or descending order according to the data
values in this field. A table can have only one clustering index. For example, if
a clustering index is built on the “state” column of the “Authors” table in the
Publisher database, the data will be ordered based on the values of “state”,
in either ascending or descending order.
A third type of index, called a secondary index, can be specified on any field
other than the primary key of the file. A secondary key is any field (or
combination of fields) other than the primary key that is used to identify
records in a table; unlike the primary key, it need not be unique.
Use of indexed files
Indexed files are used mainly in areas where timeliness of information is highly
critical. Examples are found in airline reservation systems, job banks, military
data systems, and other inventory-type applications. Here data are rarely
processed sequentially, except for the occasional stock taking.
When some information is obtained, e.g., a free seat on a certain flight, the
data should be correct at that point of time, and if the information is updated,
e.g., a seat is sold on that flight, this fact should be immediately known
throughout the system.
Indexed files are also desirable at places where data are highly variable and
dynamic.
Self Check Exercise
1) What is an index? Describe various types of indexes.
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of this Unit.
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
Structure of Index Sequential Files
In indexed-sequential organisation the records are grouped into fixed-length
data blocks and each record is identified by a primary key. The physical sequence
of the file is in ascending order according to the primary key. An index is
built, separate from the data records. This index contains key values together
with pointers to the data records themselves. Thus the indexed sequential file
organisation allows both sequential and random processing. The index permits
accessing individual records at random without accessing other records. The
entire file can also be accessed sequentially in an indexed sequential
organisation. For example, the white pages of an ordinary telephone directory
represent an indexed sequential organisation. In the upper left-hand corner of
each page is the name of the first person listed on that page. By using this
block index, one can easily locate the page that contains a particular name.
One can then scan the names on that page until the desired name is located.
An index-sequential file is not a single file; it consists of the data plus one
or more levels of indexes. The data file contains the actual data items. When
inserting a record, we have to maintain the sequence of records, and this may
necessitate shifting subsequent records. For a large file this is a costly and
inefficient process. Instead, an overflow area is provided so that records
that overflow their logical area are shifted into a designated overflow area, and
a pointer to the overflow location is kept. This is illustrated below
(Fig. 2.4). Record 615 is inserted in the original logical block, causing a record
to be moved to an overflow block.
[Figure: an original logical block and an overflow block; record keys shown: 611, 612, 614, 615, 618, 624]
Fig. 2.4: Overflow of Record
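The overflow mechanism can be sketched as below. The block capacity of five, the class name `LogicalBlock`, and the rule that the record with the highest key is the one pushed out are assumptions for illustration; the text does not say which record is moved. The keys are those shown in Fig. 2.4.

```python
from bisect import insort

BLOCK_CAPACITY = 5  # assumed number of records a logical block can hold


class LogicalBlock:
    """A data block kept in key order, with a pointer to an overflow block."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.overflow = []  # stands in for the overflow block the pointer leads to

    def insert(self, key):
        # Place the new key in sequence within the block.
        insort(self.keys, key)
        # If the block is now too full, move its last record to the overflow
        # area instead of shifting records in all following blocks.
        if len(self.keys) > BLOCK_CAPACITY:
            insort(self.overflow, self.keys.pop())

    def in_sequence(self):
        # Sequential processing reads the block, then follows the pointer.
        return self.keys + self.overflow


block = LogicalBlock([611, 612, 614, 618, 624])
block.insert(615)
```

After the insertion, `block.keys` holds 611, 612, 614, 615 and 618, while 624 sits in the overflow area, so a sequential read still returns the records in key order.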
An index-sequential file is made up of the following components:
a) A primary data storage area. This contains the data records of the file.
b) Overflow area(s). This is used for records that are added to the file but
will not fit in the prime area.
c) Indexes. These indexes enable access to any given record.
There are two access methods for indexed sequential files:
i) Index Sequential Access Method (ISAM)
ii) Virtual Storage Access Method (VSAM)
These two also represent the two basic implementation techniques of the
indexed sequential organisation. ISAM is a hardware-dependent technique and
VSAM is a hardware-independent technique.
Index Sequential Access Method (ISAM)
The ISAM file is a special type of indexed file. In the indexed-sequential file
the records are physically stored on the disk in groups. Within each group the
records are stored sequentially by primary key. When each group corresponds
to a physical subdivision of the disk e.g., a track, the file type is called Index
Sequential Access Method (ISAM).
A record field or a combination of fields is called a key. A key may be unique
or non-unique. It may be primary or secondary. Usually a file has only
one primary key, giving unique names to the records of the file. Any key
other than the primary key is a secondary key. There may be one or more
secondary keys defined for a file. The unique primary key is the basis for the
Direct and ISAM files, while secondary keys are the basis for any other type
of index.
Records in an ISAM file may be accessed in two different ways:
1) the entire file is read/searched sequentially in primary key order; and
2) a record is accessed via the index: using the key value provided, the
address is located within the index, and the record is retrieved.
A combination of the two types of search may also be used, as
with CDS-ISIS and WINISIS database searches.
Indexes may be unique or non-unique depending upon the nature of the field
or fields used for indexing. There may be several associated indexes in a file
to meet the needs of the user. An index on a particular key has the following
main functions: to provide random access to the records of the file, and to
retrieve all the records in the file in the sequence based on the key.
An ISAM file combines some of the good features of indexed and sequential files.
It requires relatively little disk storage space, as it maintains records in
sequence on each track of the file, as mentioned above. A primary key is the
basis of constructing an ISAM file. There may be several secondary indexes
defined on an ISAM file.
The virtual storage access method (VSAM) is IBM’s advanced version of the
index-sequential organisation that avoids the disadvantages of ISAM noted
above, such as its hardware dependence. It is more powerful and flexible than
ISAM.
The VSAM files are made up of two components: the index and data. However,
overflows are handled in a different manner. In a VSAM file, the basic indexed
data block is called a control interval (or a virtual track). Each time a data
block overflows it is divided into two blocks. Appropriate changes are made
to the indexes to reflect this division.
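The splitting of an overflowing control interval can be sketched as follows. This is a simplified model under stated assumptions, not IBM's implementation: the capacity of four records, the class name `ControlIntervalFile`, the split at the midpoint, and an index that holds the lowest key of each interval are all hypothetical choices for the example.

```python
from bisect import bisect_right, insort

CI_CAPACITY = 4  # assumed number of records per control interval


class ControlIntervalFile:
    """Keys held in sorted control intervals, with an index of lowest keys."""

    def __init__(self):
        self.intervals = [[]]  # each inner list models one control interval
        self.index = []        # index[i] is the lowest key in intervals[i]

    def _locate(self, key):
        # Choose the last interval whose lowest key is <= key (or the first one).
        return max(bisect_right(self.index, key) - 1, 0)

    def insert(self, key):
        i = self._locate(key)
        insort(self.intervals[i], key)
        if len(self.intervals[i]) > CI_CAPACITY:
            # The interval has overflowed: divide it into two intervals.
            full = self.intervals[i]
            middle = len(full) // 2
            self.intervals[i:i + 1] = [full[:middle], full[middle:]]
        # Change the index so that it reflects the current intervals.
        self.index = [interval[0] for interval in self.intervals]


vsam_like = ControlIntervalFile()
for key in (10, 20, 30, 40, 25):
    vsam_like.insert(key)
```

Inserting 25 into the full interval 10, 20, 30, 40 divides it into two intervals, and the index gains an entry for the new one. No separate overflow chain is needed, which is the contrast with the overflow area used by ISAM.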
Indexed-sequential files of the basic type discussed above are in common use
in modern commercial processing. They are used especially in on-line or
terminal-oriented access, where the files have to be updated within a very short
time frame. An indexed-sequential file can, for instance, be used to produce
an inventory listing on a daily basis. Indexed-sequential files are also commonly
used to handle inquiries, like billing inquiries based on account numbers.
Thus, hashing is a method for converting primary key values to disk addresses.
This is different from the direct addressing approach because here the record
addresses are not linearly related to the key values; in fact, the address is a
random function of the key.
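A minimal sketch of hashing is given below. It assumes the simple division-remainder method (key modulo the number of buckets) and keeps colliding records together in the same bucket; the bucket count of seven and the names `bucket_address` and `HashedFile` are hypothetical, and real systems may use other hash functions and collision schemes.

```python
NUM_BUCKETS = 7  # assumed number of addressable buckets on the disk


def bucket_address(key):
    # Division-remainder hashing: the address is the remainder of key / buckets.
    return key % NUM_BUCKETS


class HashedFile:
    """A random access file modelled as a list of buckets."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key, record):
        # The key alone decides where the record is stored; no index is kept.
        self.buckets[bucket_address(key)].append((key, record))

    def find(self, key):
        # Go directly to the computed bucket and search only that bucket.
        for stored_key, record in self.buckets[bucket_address(key)]:
            if stored_key == key:
                return record
        return None


hashed = HashedFile()
for employee_key in (510, 620, 750, 800, 950):
    hashed.insert(employee_key, f"record-{employee_key}")
```

With these sample keys the computed addresses are 6, 4, 1, 2 and 5, so records with ascending keys do not end up at ascending addresses, and retrieving one record does not require reading any other bucket.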
The random-access method is fast, since it avoids intermediate file operations,
but the method forces the data to be located according to a single key attribute.
In a random access file, insertions and deletions are more easily handled. It
provides rapid access to individual records since it is not necessary to search
indexes. The records are not stored in primary key sequence. However, one
disadvantage is that disk space is not as efficiently utilised and periodic
reorganisation is required.
Random access files are stored in main memory, or on direct access storage
devices, such as magnetic disks.
Occupation index
analyst       B, C
programmer    A, D, E

Salary index
9,000     E
10,000    A
12,000    C, D
15,000    B

Sex index
female    B, C, D
male      A, E

Eff index (key values only): 510, 620, 750, 800, 950

Fig. 2.7: Indexes for Fully Inverted File
Inverted files may also result in space savings compared with other file
structures when record retrieval does not require retrieval of key fields. In
this case, the key fields may be deleted from the records.
Both inverted files and multilist files have:
An index for each secondary key.
An index entry for each distinct value of the secondary key.
The index may be tabular or tree-structured.
The entries in an index may or may not be sorted.
The pointers to data records may be direct or indirect.
The indexes differ in that:
An entry in an inverted index has a pointer to each data record with that
value.
An entry in a multilist index has a pointer to the first data record with that
value.
Thus an inverted index may have variable-length entries whereas a multilist
index has fixed-length entries.
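The difference between the two index structures can be sketched as follows, using the occupation values of records A to E from Fig. 2.7. The function names and the use of dictionaries to stand in for index files and record pointers are assumptions made for illustration.

```python
# Occupation of each record, taken from the occupation index in Fig. 2.7.
records = {
    "A": {"occupation": "programmer"},
    "B": {"occupation": "analyst"},
    "C": {"occupation": "analyst"},
    "D": {"occupation": "programmer"},
    "E": {"occupation": "programmer"},
}


def build_inverted(records, field):
    # Inverted index: each entry points to EVERY record with that value,
    # so entries vary in length.
    index = {}
    for record_id in sorted(records):
        index.setdefault(records[record_id][field], []).append(record_id)
    return index


def build_multilist(records, field):
    # Multilist index: each entry points only to the FIRST record with that
    # value; each record carries a pointer to the next record in the chain.
    index, next_pointer, last_seen = {}, {}, {}
    for record_id in sorted(records):
        value = records[record_id][field]
        next_pointer[record_id] = None
        if value in last_seen:
            next_pointer[last_seen[value]] = record_id
        else:
            index[value] = record_id
        last_seen[value] = record_id
    return index, next_pointer


def multilist_lookup(index, next_pointer, value):
    # Follow the chain through the data records themselves.
    found, record_id = [], index.get(value)
    while record_id is not None:
        found.append(record_id)
        record_id = next_pointer[record_id]
    return found


inverted = build_inverted(records, "occupation")
multilist_index, chain = build_multilist(records, "occupation")
```

Here the inverted entry for "programmer" lists A, D and E directly, whereas the multilist entry holds only A, and D and E are reached by following the pointers stored with the records.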
Some of the implications of these differences are the following:
Index management is easier in the multilist approach because entries are
fixed in length.
The inverted file approach tends to exhibit better inquiry performance.
Many types of queries can be answered by accessing inversion indexes
without necessitating access to data records, thereby reducing I/O-access
requirements.
Inversion of a file can be transparent to a programmer who accesses that
file but does not use the inversion indexes, while a multilist structure
affects the file’s record layout. The multilist pointers can be made
transparent to a programmer if the data manager does not make them
available for programmer use and stores them at the end of each record.
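As a rough illustration of answering queries from the inversion indexes alone, the sketch below encodes the occupation, salary and sex indexes of Fig. 2.7 as sets of record labels. The function names are hypothetical, and no data record is read at any point.

```python
# Inversion indexes from Fig. 2.7: each value maps to the set of record labels.
occupation_index = {"analyst": {"B", "C"}, "programmer": {"A", "D", "E"}}
salary_index = {9000: {"E"}, 10000: {"A"}, 12000: {"C", "D"}, 15000: {"B"}}
sex_index = {"female": {"B", "C", "D"}, "male": {"A", "E"}}


def female_programmers():
    # A conjunctive query is an intersection of two index entries.
    return occupation_index["programmer"] & sex_index["female"]


def records_earning_at_least(minimum):
    # A range query is a union over the qualifying index entries.
    matching = set()
    for salary, record_ids in salary_index.items():
        if salary >= minimum:
            matching |= record_ids
    return matching
```

A question such as "how many female programmers are there?" is answered by the size of the intersection, so the data file is consulted only if the full records are actually needed.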
Self Check Exercises
3) Define Inverted File.
4) Describe the similarities of inverted files and multilist files.
Note: i) Write your answer in the space given below.
ii) Check your answer with the answers given at the end of this Unit.
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
2.5 SUMMARY
In this Unit you have learnt some basic concepts of file organisation. Not only
must the medium allow for random access to records, but the file itself must
support going directly to the record you want to retrieve. This characteristic of
the file is called “file organisation”. We discussed four fundamental file
organisation techniques: sequential, indexed sequential, relative (or random
access) and multi-key file organisation. The selection of the appropriate
organisation for a file in an information system is important to the performance
of that system. The criteria to be considered when choosing a file organisation,
so as to achieve good performance for the most likely usage of the file, are:
fast access to a single record or a collection of related records; easy adding,
updating and removal of records; storage efficiency; and redundancy as a
safeguard against data corruption. You also studied various types of indexes,
namely the primary index, secondary index, and clustering index.