File Organization and Indexing
Data on External Storage
Disks: Can retrieve random page at fixed cost
But reading several consecutive pages is much cheaper than
reading them in random order
Tapes: Can only read pages in sequence
Cheaper than disks; used for archival storage
File organization: Method of arranging a file of records on
external storage.
Record id (rid) is sufficient to physically locate record
Indexes are data structures that allow us to find the record
ids of records with given values in index search key fields
Architecture: Buffer manager stages pages from external
storage to main memory buffer pool. File and index layers
make calls to the buffer manager.
Alternative File Organizations
Many alternatives exist, each ideal for some situations, and not
so good in others:
Heap (random order) files: Suitable when typical access is
a file scan retrieving all records.
Sorted Files: Best if records must be retrieved in some
order, or only a ‘range’ of records is needed.
Indexes: Data structures to organize records via trees or
hashing.
Like sorted files, they speed up searches for a subset
of records, based on values in certain (“search key”)
fields
Updates are much faster than in sorted files.
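Internal Schema Design
[Figure: the DBMS requests a stored record from the File Manager and gets the stored record returned; the File Manager requests a stored block from the Disk Manager and gets the stored block returned; the Disk Manager issues disk I/O operations against the Stored Database and receives the data read from disk.]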
Unordered Files
Also called a heap or a pile file.
New records are inserted at the end of the file.
A linear search through the file records is necessary
to search for a record.
This requires reading and searching half the file blocks on
the average, and is hence quite expensive.
Record insertion is quite efficient.
Reading the records in order of a particular field
requires sorting the file records.
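The heap-file behaviour described above can be sketched as follows. This is a minimal illustration, not a real storage engine: the block capacity of 2, the sample records, and the names `insert` and `linear_search` are all assumptions made for the example. It shows that insertion touches only the last block, while a successful search for a unique key reads about half the blocks on average.

```python
def insert(blocks, record, capacity=2):
    """Append a record at the end of the file; only the last block is touched."""
    if not blocks or len(blocks[-1]) >= capacity:
        blocks.append([])          # allocate a fresh block at the end
    blocks[-1].append(record)


def linear_search(blocks, key):
    """Scan blocks in file order; return (record, blocks_read)."""
    for blocks_read, block in enumerate(blocks, start=1):
        for record in block:
            if record[0] == key:
                return record, blocks_read
    return None, len(blocks)       # an unsuccessful search reads every block


# Hypothetical records (id, name), stored in arrival order.
heap_file = []
for rec in [(17, "a"), (3, "b"), (42, "c"), (8, "d"),
            (25, "e"), (11, "f"), (30, "g"), (5, "h")]:
    insert(heap_file, rec)

keys = [rec[0] for block in heap_file for rec in block]
average_reads = sum(linear_search(heap_file, k)[1] for k in keys) / len(keys)
```

With 4 blocks, `average_reads` comes out at 2.5, which is (4 + 1) / 2, in line with the "half the file blocks" estimate.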
Ordered Files
Also called a sequential file.
File records are kept sorted by the values of an ordering field.
Insertion is expensive: records must be inserted in the correct
order.
It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion
efficiency; this is periodically merged with the main ordered
file.
A binary search can be used to search for a record on its
ordering field value.
This requires reading and searching about log2 of the number of file
blocks on the average, an improvement over linear search.
Reading the records in order of the ordering field is quite
efficient.
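A block-level binary search on the ordering field might look like the sketch below. It assumes the file is a list of non-empty blocks, each block a sorted list of keys, with blocks in global key order; `binary_search_blocks` and `sorted_file` are hypothetical names. Each loop iteration stands for one block read.

```python
def binary_search_blocks(blocks, key):
    """Return (found, blocks_read) for a file of sorted, non-empty blocks."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]        # one block read
        blocks_read += 1
        if key < block[0]:         # key lies before this block
            hi = mid - 1
        elif key > block[-1]:      # key lies after this block
            lo = mid + 1
        else:                      # key falls in this block's key range
            return key in block, blocks_read
    return False, blocks_read


# Hypothetical ordered file: 1024 blocks holding keys 0..2047, two per block.
sorted_file = [[k, k + 1] for k in range(0, 2048, 2)]
found, reads = binary_search_blocks(sorted_file, 777)
```

For 1024 blocks the search never reads more than 11 blocks, compared with roughly 512 for a linear scan of an unordered file of the same size.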
Ordered Files
Average Access Times
The following table shows the average access time to
access a specific record for a given type of file
Sequential File Organization
Suitable for applications that require sequential
processing of the entire file
The records in the file are ordered by a search-key
Sequential File Organization (Cont.)
Deletion – use pointer chains
Insertion – locate the position where the record is to be
inserted
if there is free space insert there
if no free space, insert the record in an overflow block
In either case, pointer chain must be updated
Need to reorganize the file from time to time to restore
sequential order
Multitable Clustering File Organization
Store several relations in one file using a multitable clustering
file organization
[Figure: the department relation, the instructor relation, and a multitable clustering of department and instructor]
Multitable Clustering File Organization (cont.)
good for queries involving department ⋈ instructor, and
for queries involving one single department and its
instructors
bad for queries involving only department
results in variable size records
Can add pointer chains to link records of a particular
relation
Data Dictionary Storage
The Data dictionary (also called system catalog)
stores metadata; that is, data about data, such as
Information about relations
names of relations
names, types and lengths of attributes of each relation
names and definitions of views
integrity constraints
User and accounting information, including passwords
Statistical and descriptive data
number of tuples in each relation
Physical file organization information
How relation is stored (sequential/hash/…)
Physical location of relation
Information about indices
Index structures/files
Dense, Sparse, Primary,
Secondary,
Clustered, Un-clustered files
I/O Cost based Analysis model
Introduction
Issue
How to get required records efficiently
Example
SELECT * from R;
SELECT * from R where A=10;
Index is a data structure that lets us find
quickly records with given ‘search key’ value
without having to look at more than a fraction
of all records
An index takes a value for search key and
finds records with the matching value
Indexing
Can we do anything else to improve query performance other
than selecting a good file organization?
Yes, the answer lies in indexing
Index - a data structure that allows the DBMS to locate
particular records in a file more quickly
Very similar to the index at the end of a book to locate various
topics covered in the book
Types of Index
Primary index – one primary index per file
Clustering index – one clustering index per file – data file is ordered
on a non-key field and the index file is built on that non-key field
Secondary index – many secondary indexes per file
Sparse index – has only some of the search key values in the
file
Dense index – has an index entry for every search key
value in the file
An index file takes much less space than the
corresponding data file
An index is especially advantageous if it can
fit in memory
A record can be found with only one disk I/O
An index itself can be too large to fit in the
memory
Multi-level indexes
Only part of index in memory
Purposes of Data Indexing
What is Data Indexing?
Why is it important?
How DBMS Accesses Data?
The operations read, modify, update, and
delete are used to access data from
database.
DBMS must first transfer the data temporarily
to a buffer in main memory.
Data is transferred between disk and
main memory in units called blocks.
Time Factors
Transferring blocks between disk and main memory is a
very slow operation.
Access time is determined by the
physical storage device being used.
More Time Factors
Querying data out of a database
requires more time.
DBMS must search among the blocks of
the database file to look for matching
tuples.
Purpose of Data Indexing
It is a data structure that is added to a
file to provide faster access to the data.
It reduces the number of blocks that
the DBMS has to check.
Properties of Data Index
It contains a search key and a pointer.
Search key - an attribute or set of attributes
that is used to look up the records in a file.
Pointer - contains the address of where the
data is stored on disk.
It can be compared to the card catalog
system used in public libraries of the past.
Two Types of Indices
Ordered index – entries are kept sorted on the search
key value; used to access data in order of values.
Hash index – entries are distributed uniformly across a
range of buckets by a hash function on the search key.
Either kind can serve as a primary (clustering) or a
secondary (non-clustering) index.
Index
Mechanism for efficiently locating row(s)
without having to scan entire table
Based on a search key: rows having a
particular value for the search key attributes
can be quickly located
Don’t confuse candidate key with search key:
Candidate key: set of attributes; guarantees
uniqueness
Search key: sequence of attributes; does not
guarantee uniqueness – just used for search
Indexes
Sometimes need to retrieve records by the values in
one or more fields, e.g.,
Find all students in the “IS” department
Find all students with a gpa > 3
An index on a file is a:
Disk-based data structure
Speeds up selections on the search key fields for the index.
Any subset of the fields of a relation can be index search key
Search key is not the same as (candidate) key
(e.g. doesn’t have to be unique).
An index
Contains a collection of index and data entries
Supports efficient retrieval of all records with a given search
key value k.
Basic Concepts
Indexing is used to speed up access to desired data.
E.g. author catalog in library
A search key is an attribute or set of attributes used to look up
records in a file. Unrelated to keys in the db schema.
An index file consists of records called index entries.
An index entry for key k may consist of
An actual data record (with search key value k)
A pair (k, rid) where rid is a pointer to the actual data record
A pair (k, bid) where bid is a pointer to a bucket of record pointers
Index files are typically much smaller than the original file if the
actual data records are in a separate file.
If the index contains the data records, there is a single file with
a special organization.
Indexing and Hashing 27
Types of index structures
Simple indexes on sorted files
Usually, created on primary key
Secondary indexes on unsorted files
Clustered indexes
B-trees, a commonly used structure
Hash table
Types of Indices
The records in a file may be unordered or ordered
sequentially by some search key.
A file whose records are unordered is called a heap file.
If an index contains the actual data records or the records
are sorted by search key in a separate file, the index is
called clustering (otherwise non-clustering).
In an ordered index, index entries are sorted on the
search key value. Other index structures include trees and
hash tables.
Primary Indexes (on sorted files)
The simplest structure
The data file is a sequential file
The data file is sorted on a key, usually
primary key
The index file consists of <key,pointer> pairs
Types of indexes
Dense: every record has an entry in the index
Sparse: only some of the data records have
entries in the index
Types of Single-Level Indexes
Primary Index
Defined on an ordered data file
The data file is ordered on a key field
Includes one index entry for each block in the data file;
the index entry has the key field value for the first
record in the block, which is called the block anchor
A similar scheme can use the last record in a block.
A primary index is a nondense (sparse) index, since it
includes an entry for each disk block of the data file and
the keys of its anchor record rather than for every
search value.
[Figure: primary index on the ordering key field of the file]
Index Structure
Contains:
Index entries
Can contain the data tuple itself (index and table are
integrated in this case); or
Search key value and a pointer to a row having that value;
table stored separately in this case – unintegrated index
Location mechanism
Algorithm + data structure for locating an index entry with a
given search key value
Index entries are stored in accordance with the
search key value
Entries with the same search key value are stored together
(hash, B-tree)
Entries may be sorted on search key value (B-tree)
Index Structure
[Figure: a search key value S is passed to the location mechanism, which facilitates finding the index entry for S among the index entries; once the index entry is found, the row (S, …) can be directly accessed.]
Dense indexes
Every key from the data file is represented
Entries are in the same order as that of the file
Binary search can be used to find the required
<key, pointer>
Number of blocks searched is about log2 n instead of n/2 on
average, where n is the number of blocks
Example: 1,000,000 tuples, 10 tuples/4096 byte
block, key field 30 bytes, pointer 8 bytes
Data file takes 400MB space
Index file will take 10,000 blocks with 100 entries/block
Binary search will involve about log2 10,000 ≈ 13 index
blocks, which may be held in main memory (MM)
Memory can also be optimized by keeping only
most searched blocks in memory
Hence a record can be retrieved with at most about 14
disk I/Os
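The arithmetic behind this example can be checked with a few lines. The figures are the ones stated on the slide; the variable names are made up for illustration, and rounding the entries per block down to 100 follows the slide rather than the exact capacity.

```python
import math

TUPLES = 1_000_000
TUPLES_PER_BLOCK = 10
BLOCK_BYTES = 4096
KEY_BYTES, POINTER_BYTES = 30, 8

data_blocks = TUPLES // TUPLES_PER_BLOCK                 # blocks in the data file
data_bytes = data_blocks * BLOCK_BYTES                   # roughly 400 MB

# Exact capacity of one index block, then the rounded figure used on the slide.
max_entries_per_block = BLOCK_BYTES // (KEY_BYTES + POINTER_BYTES)
ENTRIES_PER_BLOCK = 100

dense_index_blocks = math.ceil(TUPLES / ENTRIES_PER_BLOCK)   # one entry per tuple
binary_search_probes = math.log2(dense_index_blocks)         # index blocks examined
```

Here `data_blocks` is 100,000, `dense_index_blocks` is 10,000, and `binary_search_probes` is a little over 13, so a lookup costs roughly 13 to 14 index block reads plus one data block read.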
Sparse indexes
Useful if dense index is too large
Uses less space at the cost of possibly more time
to search
Generally a record, usually the first, per block is
represented
Sparse index for previous example would take only
1000 blocks, 4MB
But, it cannot give a quick answer to the query ‘does
there exist a record with key value K?’
Answering it requires one disk I/O to fetch the data
block, plus a search within the block
Search K: find entry with largest key ≤ K
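A sparse-index lookup along these lines could be sketched as below. It assumes one index entry per data block holding that block's first (smallest) key; `sparse_lookup` and the sample data are hypothetical. The standard-library `bisect_right` finds the entry with the largest key that is less than or equal to K.

```python
from bisect import bisect_right


def sparse_lookup(index_keys, data_blocks, k):
    """Return (block_number, found) for key k.

    index_keys[i] is the first key stored in data_blocks[i].
    """
    i = bisect_right(index_keys, k) - 1   # entry with the largest key <= k
    if i < 0:
        return None, False                # k is smaller than every key in the file
    block = data_blocks[i]                # the single data-block read
    return i, k in block


# Hypothetical sorted data file and its sparse index.
data_blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
index_keys = [block[0] for block in data_blocks]

hit = sparse_lookup(index_keys, data_blocks, 50)
miss = sparse_lookup(index_keys, data_blocks, 55)
```

Both calls land on block 1, but only the first finds its key. The existence question for 55 cannot be settled from the index alone, which is why the data block has to be read.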
Sparse Vs Dense Index
Dense index: index entry for each data
record
Unclustered index must be dense
Clustered index need not be dense
Sparse index: index entry for each block
of data file
Sparse Vs. Dense Index
[Figure: data file with columns Id, Name, Dept, sorted on Id; a sparse, clustered index sorted on Id; a dense, unclustered index sorted on Name]
Clustered vs. Unclustered Index
Clustered (main) index: index entries and rows
are ordered in the same way
An integrated storage structure is always clustered
There can be at most one clustered index on a table
Unclustered (secondary) index: index entries and
rows are not sorted on the same search key
An index file might be clustered or unclustered with
respect to the storage structure it references
There can be many secondary indices on a table
Clustering and Non-clustering
Non-clustering indices have to be dense.
Indices offer substantial benefits when searching for
records.
When a file is modified, every index on the file must
be updated. Updating indices imposes overhead on
database modification.
Sequential scan using clustering index is efficient, but
a sequential scan using a non-clustering index is
expensive – each record access may fetch a new
block from disk.
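The difference between the two kinds of scan can be made concrete by counting block fetches. The sketch below assumes a buffer that holds only one block at a time, 4 records per block, and a made-up scattered order for the non-clustering index; `block_fetches` is a hypothetical helper.

```python
def block_fetches(rids):
    """Count block fetches when following (block, slot) rids in index order.

    Assumes a one-block buffer: a fetch happens whenever the block changes.
    """
    fetches = 0
    current = None
    for block_no, _slot in rids:
        if block_no != current:
            fetches += 1
            current = block_no
    return fetches


RECORDS_PER_BLOCK = 4
N = 16

# Clustering index: index order equals file order.
clustered = [divmod(i, RECORDS_PER_BLOCK) for i in range(N)]

# Non-clustering index: index order is unrelated to file order
# (a fixed stride permutation stands in for "scattered" here).
unclustered = [divmod((i * 5) % N, RECORDS_PER_BLOCK) for i in range(N)]

clustered_cost = block_fetches(clustered)
unclustered_cost = block_fetches(unclustered)
```

In this setup the clustered scan fetches 4 blocks, one per data block, while the unclustered scan fetches 16, one per record.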
Clustered Index
Good for range searches
Use location mechanism to locate index
entry at start of range
This locates first data record.
Subsequent data records are contiguous if
index is clustered (not so if unclustered)
Minimizes page transfers and maximizes
likelihood of cache hits
Sparse Index Files
A clustering index may be sparse.
Index records for only some search-key values.
To locate a record with search-key value k we:
Find index record with largest search-key value ≤ k
Search file sequentially starting at the record to which
the index record points
Less space and less maintenance overhead for insertions
and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every
block in file, corresponding to least search-key value in the
block.
Types of Single-Level Indexes
Secondary Index
A secondary index provides a secondary means of
accessing a file for which some primary access already
exists.
The secondary index may be on a field which is a
candidate key and has a unique value in every record, or
a nonkey with duplicate values.
The index is an ordered file with two fields.
The first field is of the same data type as some
nonordering field of the data file that is an indexing
field.
The second field is either a block pointer or a record
pointer. There can be many secondary indexes (and
hence, indexing fields) for the same file.
Includes one entry for each record in the data file;
hence, it is a dense index
[Figure: a dense secondary index (with block pointers) on a nonordering key field of a file.]
Secondary indexes
SELECT name, address
FROM MovieStar
WHERE birthdate = DATE '1952-01-01';
CREATE INDEX BDIndex ON MovieStar(birthdate);
Secondary indexes are always ‘dense’
Second level index could be ‘sparse’
Secondary indexes are usually with duplicates
Secondary Indices Example
Secondary index on balance field of account
Index record points to a bucket that contains
pointers to all the actual records with that particular
search-key value.
Multi-level indexes
Needed when an index is so large that even binary
search takes too many disk I/Os
Define second level index: index on index
This can continue to multi-level index structure
Second and higher level indexes must be sparse
Second level index on the sparse index in the previous
example would take only 10 blocks, 40KB
Search involves 2 disk I/Os and searching in the
block
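The size of each level follows from dividing by the number of entries per block. The sketch below assumes 100 entries per block and 4096-byte blocks, as in the earlier example, and keeps adding sparse levels until the top level fits in one block; `index_levels` is a hypothetical helper, and stopping at one block is an assumption of this sketch rather than something the slide requires.

```python
import math

BLOCK_BYTES = 4096


def index_levels(first_level_blocks, entries_per_block):
    """Blocks per index level, first level first.

    Each higher level is a sparse index with one entry per block of the
    level below; levels are added until the top level is a single block.
    """
    levels = [first_level_blocks]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / entries_per_block))
    return levels


# First level: the 1,000-block sparse index from the earlier example.
levels = index_levels(1000, 100)
second_level_bytes = levels[1] * BLOCK_BYTES
```

`levels` comes out as `[1000, 10, 1]`, and the 10-block second level occupies 40,960 bytes, the 40KB quoted above. If the upper levels stay in memory, a lookup needs one read of a first-level index block and one read of the data block.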
Multilevel Index
If an index does not fit in memory, access becomes
expensive.
To reduce number of disk accesses to index records,
treat the index kept on disk as a sequential file and
construct a sparse index on it.
outer index – a sparse index on main index
inner index – the main index file
If even outer index is too large to fit in main
memory, yet another level of index can be created,
and so on.
Indices at all levels must be updated on insertion or
deletion from the file.
Multilevel Index (Cont.)
[Figure: the outer index points to index blocks 0, 1, … of the inner index, which in turn point to data blocks 0, 1, …]
CIS552 Indexing and Hashing 50
Secondary indexes
Secondary index does not determine the
location of the record
[Figure: a secondary index with duplicate key values (10, 20, 30, 40, 50, 60) whose pointers lead to records scattered across the data blocks]
Pointers in one index block may refer to
multiple data blocks
This results in more disk I/Os
Unavoidable problem
Using ‘bucket file’ between index file and data
file
Single entry <k,p> for each value ‘k’ where p
points to location in bucket file containing all
other pointers of records with value ‘k’
Avoids wastage of space due to multiple storage
of same value ‘k’
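One way to picture the bucket-file arrangement is the in-memory sketch below. It is an illustration only: record positions in a list stand in for record pointers, and `build_bucket_index` and the sample account records are hypothetical.

```python
def build_bucket_index(records, field):
    """Return (index, buckets).

    index maps each distinct key value to a bucket number (one entry per key);
    buckets[b] lists the record ids of all records with that key value.
    """
    index = {}
    buckets = []
    for rid, record in enumerate(records):
        k = record[field]
        if k not in index:             # first time this key value is seen
            index[k] = len(buckets)
            buckets.append([])
        buckets[index[k]].append(rid)
    return index, buckets


# Hypothetical data file with duplicate values in the indexed field.
accounts = [
    {"id": "A-1", "balance": 500},
    {"id": "A-2", "balance": 700},
    {"id": "A-3", "balance": 400},
    {"id": "A-4", "balance": 500},
    {"id": "A-5", "balance": 700},
    {"id": "A-6", "balance": 700},
]

index, buckets = build_bucket_index(accounts, "balance")
rids_for_700 = buckets[index[700]]
```

The index holds three entries for six records, and `rids_for_700` is `[1, 4, 5]`: the key value 700 is stored once, with all of its record pointers collected in one bucket.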
Definition of Bucket
Bucket - another form of a storage unit
that can store one or more records of
information.
Buckets are used if the search key value
cannot form a candidate key, or if the
file is not stored in search key order.
[Figure: an index file with one entry per key value, pointing into a bucket file of record pointers, which in turn point into the data file]
Application of ‘bucket file’
It can help answer queries efficiently using
intersection of pointer sets
Example
SELECT title
FROM Movie
WHERE StudioName = 'Disney' AND year = 1995;
This reduces number of Disk I/Os
[Figure: a studio index and a year index; the buckets for studio Disney and the buckets for year 1995 both point into the Movie tuples]
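The intersection of pointer sets can be sketched as follows. The movie tuples and titles are invented placeholders, list positions stand in for record pointers, and `build_buckets` is a hypothetical helper. Only the records whose pointers appear in both buckets are fetched.

```python
from collections import defaultdict

# Hypothetical Movie tuples: (title, studioName, year).
movies = [
    ("T0", "Disney", 1995),
    ("T1", "Disney", 1994),
    ("T2", "Fox", 1995),
    ("T3", "Disney", 1995),
    ("T4", "Fox", 1990),
]


def build_buckets(rows, column):
    """Map each value in the given column to the set of record ids having it."""
    buckets = defaultdict(set)
    for rid, row in enumerate(rows):
        buckets[row[column]].add(rid)
    return buckets


studio_buckets = build_buckets(movies, 1)
year_buckets = build_buckets(movies, 2)

# Intersect the two pointer sets before touching the data file.
candidates = studio_buckets["Disney"] & year_buckets[1995]
titles = [movies[rid][0] for rid in sorted(candidates)]
```

Here each bucket on its own holds three pointers, but the intersection holds two, so only two data records are read to produce `["T0", "T3"]`.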
Estimating Costs
For simplicity we estimate the cost of an operation by
counting the number of blocks that are read or
written to disk.
We ignore the possibility of blocked access which
could significantly lower the cost of I/O.
We assume that each relation is stored in a separate
file with B blocks and R records per block.
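Under this model, the rough estimates quoted earlier for heap and sorted files can be written down directly. The sketch below is only a restatement of those estimates in terms of B and R; the function name and the dictionary labels are made up for the example, and the equality-search figures assume a unique key and a successful search.

```python
import math


def estimated_block_ios(B, R):
    """Rough block-I/O estimates for a file of B blocks with R records per block."""
    return {
        "records in file": B * R,
        "full scan": B,                                  # read every block once
        "heap file, equality search (average)": B / 2,   # half the blocks on average
        "sorted file, equality search": math.log2(B),    # binary search on blocks
    }


costs = estimated_block_ios(B=1024, R=10)
```

For B = 1024 this gives 1024 block reads for a scan, 512 on average for an equality search in a heap file, and 10 for a binary search in a sorted file.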
Choosing Indexing Technique
Five Factors involved when choosing the
indexing technique:
access type
access time
insertion time
deletion time
space overhead
Indexing Definitions
Access type - the kind of access supported efficiently, such as
equality search or range search.
Access time - time required to locate the
data.
Insertion time - time required to insert the
new data.
Deletion time - time required to delete the
data.
Space overhead - the additional space
occupied by the added data structure.
Index Evaluation Metrics
Access time for:
Equality searches – records with a specified
value in an attribute
Range searches – records with an attribute
value falling within a specified range.
Insertion time
Deletion time
Space overhead
Primary and Secondary Indices
o Indices offer substantial benefits when searching for
records.
o BUT: Updating indices imposes overhead on database
modification: when a file is modified, every index on
the file must be updated.
o Sequential scan using primary index is efficient, but a
sequential scan using a secondary index is expensive
o Each record access may fetch a new block from
disk
o Block fetch requires about 5 to 10 milliseconds,
versus about 100 nanoseconds for memory access