Indexing
Dr. K. M. Azharul Hasan
Dept. of CSE, KUET
Indexing
Indexes are used to quickly locate data
without having to search every row in a
database every time a database table is
accessed.
Indexes can be created using one or more
columns of a database table, providing the
basis for both rapid random lookups and
efficient access of ordered records.
Storage and Indexing?
DB design using logical models (ER/Relational).
Appropriate level for designers to begin with
Provide independence from implementation details
Performance: another major factor in user
satisfaction
Depends on
Efficient data structures for data representation
Efficiency of system operation on those structures
Disks contain data files and system files, including
dictionary and index files
Disk access: one of the most critical factors in
performance.
Storage Hierarchy
DBMS stores information on some storage medium
Primary storage: can be operated directly by CPU.
Secondary storage:
larger capacity, lower cost, slower access
cannot be operated directly by CPU – must be copied
to primary storage
Secondary storage has major implications for DBMS
design
READ: transfer data to main memory
WRITE: transfer data from main memory.
Both transfers are high-cost operations, relative to in-
memory operations, so must be planned carefully
Why Not Store Everything in Main
Memory?
Cost and size
Main memory is volatile: its contents are lost on
power failure or a system crash.
Typical storage hierarchy:
Factors: access speed, cost per unit, reliability
Cache and main memory (RAM) for currently used
data: fast but costly
Flash memory: limited number of writes (and
slow), non-volatile, disk-substitute in embedded
systems
Disk for the main database (secondary storage).
Tapes for archiving older versions of the data
(tertiary storage).
Disks
Secondary storage device of choice.
Data is stored and retrieved in units
called disk blocks or pages.
Unlike RAM, time to retrieve a disk page
varies depending upon location on disk.
Therefore, relative placement of pages on disk
has major impact on DBMS performance!
Components of a Disk
[Figure: disk components, labeled spindle, platters, tracks, sector, disk head, arm assembly, arm movement]
The platters spin.
The arm assembly is moved in or out to position a head on a
desired track. Tracks under heads make a cylinder
(imaginary!).
Only one head reads/writes at any one time.
Block size is a multiple of sector size (which is fixed).
Accessing a Disk Page
Time to access (read/write) a disk block:
seek time (moving arms to position disk head on track)
rotational delay (waiting for block to rotate under
head)
transfer time (actually moving data to/from disk
surface)
Seek time and rotational delay dominate.
Key to lower I/O cost: reduce seek/rotation
delays
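The three components above can be added up in a short Python sketch. The figures used (9 ms average seek, 7200 rpm, 100 MB/s transfer rate, 4 KB block) are illustrative assumptions, not values from these slides.

```python
def access_time_ms(seek_ms, rpm, transfer_mb_per_s, block_kb):
    """Estimate the time to read one disk block, in milliseconds."""
    # Average rotational delay is taken as half a revolution.
    rotational_ms = 0.5 * 60_000 / rpm
    # Transfer time: block size divided by the sustained transfer rate.
    transfer_ms = (block_kb / 1024) / transfer_mb_per_s * 1000
    total_ms = seek_ms + rotational_ms + transfer_ms
    return total_ms, rotational_ms, transfer_ms


# Assumed, illustrative drive parameters.
total, rot, xfer = access_time_ms(seek_ms=9.0, rpm=7200,
                                  transfer_mb_per_s=100.0, block_kb=4)
print(f"rotational: {rot:.2f} ms, transfer: {xfer:.3f} ms, total: {total:.2f} ms")
```

Under these assumptions the transfer time is a small fraction of the total, which is why placing related pages close together (fewer seeks and rotations) pays off.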
Basic Concepts
Indexing mechanisms used to speed up access to desired
data.
E.g., author catalog in library
Search Key - attribute or set of attributes used to look up
records in a file.
An index file consists of records (called index entries) of the
form (search-key, pointer)
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly
across “buckets” using a “hash function”.
06/05/2025
Index Evaluation Metrics
Access types supported efficiently. E.g.,
records with a specified value in the
attribute
or records with an attribute value falling in
a specified range of values.
Access time
Insertion time
Deletion time
Space overhead
Ordered Indices
In an ordered index, index entries are stored sorted on
the search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the file.
Also called clustering index
The search key of a primary index is usually but not
necessarily the primary key.
Secondary index: an index whose search key specifies an
order different from the sequential order of the file. Also
called non-clustering index.
Index-sequential file: ordered sequential file with a
primary index.
Dense Index Files
Dense index — Index record appears for every search-key value
in the file.
Sparse Index Files
Sparse Index: contains index records for only some search-
key values.
Applicable when records are sequentially ordered on
search-key
To locate a record with search-key value K we:
Find index record with largest search-key value ≤ K
Search file sequentially starting at the record to which
the index record points
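The two lookup steps can be sketched in Python as follows. The file contents, the block layout, and the names `blocks`, `index_keys` and `lookup` are made up for illustration; the sketch assumes one index entry per block, holding the least search key in that block.

```python
from bisect import bisect_right

# Hypothetical sorted file, split into blocks of (search_key, payload) records.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: one entry per block, the least search key in that block.
index_keys = [block[0][0] for block in blocks]  # [10, 40, 70]


def lookup(k):
    """Return the payload stored under search key k, or None if absent."""
    # Step 1: find the index entry with the largest search key <= k.
    pos = bisect_right(index_keys, k) - 1
    if pos < 0:
        return None  # k is smaller than every key in the file
    # Step 2: scan sequentially from the record that entry points to.
    for key, payload in blocks[pos]:
        if key == k:
            return payload
        if key > k:
            break  # the file is sorted, so k cannot appear later
    return None
```

Because there is one entry per block here, the sequential scan never has to leave the block the index entry points to.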
Sparse Index Files (Cont.)
Compared to dense indices:
Less space and less maintenance overhead for
insertions and deletions.
Generally slower than dense index for locating
records.
Good tradeoff: sparse index with an index entry for
every block in file, corresponding to least search-key
value in the block.
Multilevel Index
If primary index does not fit in memory, access
becomes expensive.
Solution: treat primary index kept on disk as a
sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main
memory, yet another level of index can be
created, and so on.
Indices at all levels must be updated on insertion
or deletion from the file.
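A rough Python sketch of how many index levels result, assuming one sparse entry per block of the level below and stopping once a level fits in a single block (a simplification of "fits in main memory"). The function name and the block counts in the usage line are made-up figures.

```python
def index_levels(data_blocks, entries_per_index_block):
    """Blocks needed at each index level, from the inner index upward."""
    if data_blocks < 1 or entries_per_index_block < 2:
        raise ValueError("need at least 1 data block and 2 entries per index block")
    levels = []
    blocks_below = data_blocks
    while True:
        # One sparse index entry per block of the level below (ceiling division).
        blocks_here = -(-blocks_below // entries_per_index_block)
        levels.append(blocks_here)
        if blocks_here == 1:
            return levels
        blocks_below = blocks_here


# Assumed sizes: 100,000 data blocks, 100 index entries per index block.
print(index_levels(100_000, 100))
```

Each extra level costs one more block read per lookup, but shrinks the part of the index that has to stay in memory.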
Index Classification
Summary
Primary vs. secondary: whether the search key specifies the
sequential order of the file or not.
Clustered vs. unclustered: whether the order of data records
is the same as the order of data entries or not.
Dense vs. sparse: whether there is an entry in the index
for each search-key value or not.
Single level vs. multi level: whether the index is itself
indexed by further (sparse) index levels or not.
Hash-Based Indexes
Good for equality selections.
Index is a collection of buckets. Bucket = primary
page plus zero or more overflow pages.
Hashing function h: h(r) = bucket in which
record r belongs. h looks at the search key fields
of r.
Buckets may contain the data records or just
the rids.
Hash-based indexes are best for equality
selections. They cannot support range searches.
So what is the difference between hashing and
indexing?
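A toy Python sketch of a static hash index, to show why equality selections are cheap and range searches are not. The class `HashIndex`, the bucket count and the choice of `hash(key) % num_buckets` as h are assumptions for illustration; overflow pages are not modeled, and each bucket is just a list of (search key, rid) pairs.

```python
class HashIndex:
    """Toy static hash index: each bucket is a list of (search_key, rid) pairs."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _h(self, key):
        # Hash function h: maps a search key to a bucket number.
        return hash(key) % len(self.buckets)

    def insert(self, key, rid):
        self.buckets[self._h(key)].append((key, rid))

    def search(self, key):
        # Equality selection: only one bucket has to be examined.
        return [rid for k, rid in self.buckets[self._h(key)] if k == key]

    def range_search(self, low, high):
        # Buckets are not ordered by key, so every bucket must be scanned.
        return sorted(rid for bucket in self.buckets
                      for k, rid in bucket if low <= k <= high)


idx = HashIndex()
for rid, key in enumerate([500, 700, 350, 700]):
    idx.insert(key, rid)
print(idx.search(700), idx.range_search(400, 700))
```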
Index Update: Deletion
If deleted record was the only record in the file with its
particular search-key value, the search-key is deleted from the
index also.
Single-level index deletion:
Dense indices – deletion of search-key: similar to file record
deletion.
Sparse indices –
if an entry for the search key exists in the index, it is
deleted by replacing the entry in the index with the next
search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the
entry is deleted instead of being replaced.
Index Update: Insertion
Single-level index insertion:
Perform a lookup using the search-key value
appearing in the record to be inserted.
Dense indices – if the search-key value does not
appear in the index, insert it.
Sparse indices – if index stores an entry for each
block of the file, no change needs to be made to
the index unless a new block is created.
If a new block is created, the first search-key value
appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms
are simple extensions of the single-level algorithms
Secondary Indices
Frequently, one wants to find all the records
whose values in a certain field (which is not the
search-key of the primary index) satisfy some
condition.
Example 1: In the account relation stored
sequentially by account number, we may want
to find all accounts in a particular branch
Example 2: as above, but where we want to
find all accounts with a specified balance or
range of balances
We can have a secondary index with an index
record for each search-key value
Secondary Indices Example
Secondary index on balance field of account
Index record points to a bucket that contains pointers
to all the actual records with that particular search-key
value.
Secondary indices have to be dense
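A minimal Python sketch of such a secondary index, assuming a made-up account file and using list positions as stand-ins for record pointers. The names `accounts`, `balance_index` and the two query functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical account file, stored sequentially by account number.
accounts = [
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-222", "Redwood", 700),
]

# Dense secondary index on balance: one entry per search-key value,
# each pointing to a bucket of record positions.
balance_index = defaultdict(list)
for position, (_, _, balance) in enumerate(accounts):
    balance_index[balance].append(position)


def accounts_with_balance(balance):
    """Follow the bucket of pointers for one search-key value."""
    return [accounts[p][0] for p in balance_index.get(balance, [])]


def accounts_in_balance_range(low, high):
    """Visit index keys in sorted order, as an ordered index would."""
    return [accounts[p][0]
            for b in sorted(balance_index) if low <= b <= high
            for p in balance_index[b]]
```

The file is not sorted on balance, so the records for one balance can be scattered across the file; the index has to mention every search-key value, which is why it must be dense.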
Primary and Secondary Indices
Indices offer substantial benefits when searching
for records.
Updating indices imposes overhead on database
modification: when a file is modified, every index
on the file must be updated.
Sequential scan using primary index is efficient,
but a sequential scan using a secondary index is
expensive
Each record access may fetch a new block from
disk
Block fetch requires about 5 to 10 milliseconds,
versus about 100 nanoseconds for
memory access
B+-Tree Index Files
Disadvantage of indexed-sequential files
performance degrades as file grows, since many
overflow blocks get created.
Periodic reorganization of entire file is required.
Advantage of B+-tree index files:
automatically reorganizes itself with small, local,
changes, in the face of insertions and deletions.
Reorganization of entire file is not required to maintain
performance.
(Minor) disadvantage of B+-trees:
extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages
B+-trees are used extensively
B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
B+-Tree Index Files (Cont.)
B+-tree is a rooted tree satisfying the following properties
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between ⌈n/2⌉ and
n children.
A leaf node has between ⌈(n–1)/2⌉ and n–1 values
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in
the tree), it can have between 0 and (n–1) values.
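The occupancy rules can be tabulated for a given n with a small Python helper. The function name and dictionary keys are made up for illustration; n is taken to be the maximum number of pointers in a node, and the lower bounds are rounded up.

```python
from math import ceil


def occupancy_bounds(n):
    """(min, max) occupancy for a B+-tree whose nodes hold up to n pointers."""
    return {
        "internal_children": (ceil(n / 2), n),
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }


for n in (4, 5):
    print(n, occupancy_bounds(n))
```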
B+ Tree Example
[Figure: example B+-tree; the leaf-level pointers lead to records]
B+-Tree Node Structure
Typical node: P1 | K1 | P2 | . . . | Pn–1 | Kn–1 | Pn
Ki are the search-key values
Pi are pointers to children (for non-leaf nodes) or pointers
to records or buckets of records (for leaf nodes).
The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn–1
Leaf Nodes in B+-Trees
Properties of a leaf node:
For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with
search-key value Ki, or to a bucket of pointers to file records, each
record having search-key value Ki.
If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than
Lj’s search-key values
Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
Non leaf nodes form a multi-level sparse index on the leaf
nodes. For a non-leaf node with m pointers:
All the search-keys in the subtree to which P1 points are
less than K1
For 2 ≤ i ≤ m – 1, all the search-keys in the subtree to
which Pi points have values greater than or equal to Ki–1
and less than Ki
All the search-keys in the subtree to which Pm points have
values greater than or equal to Km–1
Sample non-leaf
[Figure: non-leaf node with keys 120, 150, 180; its four pointers lead
to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180]
Sample leaf node
[Figure: leaf node with keys 120 and 130, reached from a non-leaf node;
one pointer leads to the record with key 120, one to the record with
key 130, and the last pointer leads to the next leaf in sequence]
B+ Tree Example
[Figure: B+-tree with root key 100; non-leaf nodes (30) and
(120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110),
(120, 130), (150, 156, 179), (180, 200); leaf pointers lead to records]
B+ Tree
Suppose a key value is 9 bytes, page size is
512 bytes and a pointer (both page pointer
and record pointer) is 7 bytes. How many key
values can you enter in a leaf and a non-leaf
node of a B+ tree?
(Home task)
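One way to set up the arithmetic, as a hedged Python sketch: it assumes the node layout from the node-structure slide (n pointers and n–1 keys, the same for leaf and non-leaf nodes) and no per-page header. If your layout differs, the inequality in the comment changes accordingly. The function name is made up.

```python
def max_pointers(page_bytes, key_bytes, pointer_bytes):
    """Largest n such that n pointers and n-1 keys fit in one page."""
    # n * pointer_bytes + (n - 1) * key_bytes <= page_bytes
    # => n <= (page_bytes + key_bytes) / (key_bytes + pointer_bytes)
    return (page_bytes + key_bytes) // (key_bytes + pointer_bytes)


n = max_pointers(page_bytes=512, key_bytes=9, pointer_bytes=7)
print(n, "pointers and", n - 1, "keys per node")
```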
Insert into B+ tree
First look up the proper leaf
(a) simple case
leaf not full: just insert (key, pointer-to-record)
(b) leaf overflow
(c) non-leaf overflow
(d) new root
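The four cases can be sketched in Python as below. This is a simplified illustration, not a full implementation: it stores keys only (no record pointers), takes n to be the maximum number of keys per node, and the class and method names are made up. The split points are one reasonable choice; other conventions exist.

```python
from bisect import bisect_right, insort


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # child nodes (non-leaf nodes only)
        self.next = None    # next leaf in key order (leaf nodes only)


class BPlusTree:
    def __init__(self, n=3):
        self.n = n  # assumed convention: at most n keys per node
        self.root = Node(leaf=True)

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:  # case (d): the root was split, so the tree grows a level
            separator, right = split
            new_root = Node(leaf=False)
            new_root.keys = [separator]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key):
        """Insert below node; return (separator, new right node) on a split."""
        if node.leaf:
            insort(node.keys, key)
            if len(node.keys) <= self.n:
                return None  # case (a): the leaf had room
            # case (b): leaf overflow, split and copy the first right key up
            mid = (len(node.keys) + 1) // 2
            right = Node(leaf=True)
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right
        i = bisect_right(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        separator, right = split
        node.keys.insert(i, separator)
        node.children.insert(i + 1, right)
        if len(node.keys) <= self.n:
            return None
        # case (c): non-leaf overflow, split and move the middle key up
        mid = len(node.keys) // 2
        up = node.keys[mid]
        new = Node(leaf=False)
        new.keys, new.children = node.keys[mid + 1:], node.children[mid + 1:]
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return up, new

    def leaf_keys(self):
        """All keys in order, read by following the leaf chain."""
        node = self.root
        while not node.leaf:
            node = node.children[0]
        keys = []
        while node:
            keys.extend(node.keys)
            node = node.next
        return keys


tree = BPlusTree(n=3)
for k in range(1, 11):
    tree.insert(k)
print(tree.root.keys, tree.leaf_keys())
```

Note the difference between the two split cases: a leaf split copies the separator up (it stays in the leaf), while a non-leaf split moves the middle key up.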
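(a) Insert key = 32 (n=3)
[Figure: root 100 over non-leaf node 30; the leaf (30, 31) has room, so 32 is
inserted there to give (30, 31, 32); the neighboring leaf (3, 5, 11) is unchanged]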
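(b) Insert key = 7 (n=3)
[Figure: the leaf (3, 5, 11) is full, so it splits into (3, 5) and (7, 11);
key 7 is added to the parent, which becomes (7, 30) under root 100]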
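(c) Insert key = 160 (n=3)
[Figure: the leaf (150, 156, 179) is full, so it splits into (150, 156) and
(160, 179); the parent (120, 150, 180) then overflows and splits into
(120, 150) and (180), and key 160 moves up into the root, giving (100, 160)]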
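(d) New root, insert 45 (n=3)
Height grows at root => balance maintained
[Figure: root (10, 20, 30) over leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40);
inserting 45 splits the last leaf into (30, 32) and (40, 45); the root
overflows and splits into (10, 20) and (40), with 30 moving up into a new root]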
Deletion from B+ tree
Again, first look up the proper leaf;
(a): Simple case: no underflow;
(b): Borrow keys from an adjacent sibling
(if it doesn't become too empty);
(c): Underflow
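(b) Delete 50 (n=4)
=> min # of keys in a leaf = ⌊5/2⌋ = 2
[Figure: parent (10, 40, 100) over leaves (10, 20, 30, 35) and (40, 50); deleting 50
leaves (40) below the minimum, so 35 is borrowed from the left sibling, giving
leaves (10, 20, 30) and (35, 40), and the parent key 40 becomes 35]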
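(c) Leaf Underflow Delete 50 (n=4)
[Figure: leaves (20, 30) and (40, 50) under a parent whose keys include 40 and 100;
deleting 50 leaves (40) below the minimum and the sibling has no key to spare,
so 40 is moved into the sibling, giving (20, 30, 40), and the separator key 40
is removed from the parent]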
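(d) Non-leaf underflow Delete 37 (n=4)
=> min # of keys in a non-leaf = ⌈(n+1)/2⌉ - 1 = 3 - 1 = 2
[Figure: root (25) over non-leaf nodes (10, 20) and (30, 40), with leaves (1, 3),
(10, 14), (20, 22), (25, 26), (30, 37), (40, 45); deleting 37 makes the leaf
(30, 37) underflow, so it is merged with a sibling; the non-leaf node (30, 40)
then drops below the minimum and merges with its sibling (10, 20) and the old
root key 25 to form the new root]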
Home task
• Construct a B+ tree with n = 4 or 5, up to
level 3, by inserting random keys, covering
each of the insertion cases.
• How can you perform a range query on the keys of a
B+ tree?
Queries on B+-Trees (Cont.)
If there are K search-key values in the file, the
height of the tree is no more than ⌈log_⌈n/2⌉(K)⌉
A node is generally the same size as a disk block,
typically 4 kilobytes
and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a
lookup.
Contrast this with a balanced binary tree with 1
million search key values: around 20 nodes are
accessed in a lookup
above difference is significant since every node access
may need a disk I/O, costing around 20 milliseconds
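The two node counts can be checked with a few lines of Python. The helper `ceil_log` is a made-up name; it uses integer arithmetic to avoid floating-point rounding in the logarithm. The 20 ms per node access is the figure quoted on this slide, and treating every access as a disk I/O is the worst case.

```python
def ceil_log(base, value):
    """Smallest h such that base**h >= value (integer arithmetic only)."""
    h, reach = 0, 1
    while reach < value:
        reach *= base
        h += 1
    return h


K = 1_000_000          # number of search-key values
n = 100                # pointers per B+-tree node
min_fanout = -(-n // 2)  # ceil(n/2) = 50

bplus_nodes = ceil_log(min_fanout, K)
binary_nodes = ceil_log(2, K)
io_ms = 20             # cost of one disk I/O, as quoted above

print(bplus_nodes, "node accesses, up to", bplus_nodes * io_ms, "ms")
print(binary_nodes, "node accesses, up to", binary_nodes * io_ms, "ms")
```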
B-Tree Index Files
Similar to B+-tree, but B-tree allows search-key values to appear only
once; eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the B-tree; an
additional pointer field for each search key in a nonleaf node must be
included.
Generalized B-tree leaf node vs B+ tree
Nonleaf node – pointers Bi are the bucket or file
record pointers.
B-Tree Index File Example
B-tree (above) and B+-tree (below) on
same data
B-Tree Index Files (Cont.)
Advantages of B-Tree indices:
May use fewer tree nodes than a corresponding B+-Tree.
Sometimes possible to find search-key value before
reaching leaf node.
Disadvantages of B-Tree indices:
Only a small fraction of all search-key values are found early
Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees
typically have greater depth than the corresponding B+-Tree
Insertion and deletion more complicated than in B+-Trees
Implementation is harder than B+-Trees.
Range key search is difficult.
Typically, advantages of B-Trees do not outweigh
disadvantages.
Index Definition in SQL
Create an index
create index <index-name> on <relation-name>
(<attribute-list>)
E.g.: create index b_index on branch(branch_name)
Use create unique index to indirectly specify and
enforce the condition that the search key is a
candidate key.
Not really required if SQL unique integrity constraint is
supported
To drop an index
drop index <index-name>
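The statements above can be tried out from Python with the standard-library `sqlite3` module. This is a sketch against SQLite specifically; other systems differ in details (for example, some require `drop index ... on <relation-name>`). The table contents and the index names `b_index` and `b_unique` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table branch (branch_name text, branch_city text, assets integer)")
conn.executemany(
    "insert into branch values (?, ?, ?)",
    [("Downtown", "Brooklyn", 9000000), ("Mianus", "Horseneck", 400000)],
)

# create index <index-name> on <relation-name> (<attribute-list>)
conn.execute("create index b_index on branch (branch_name)")

# SQLite reports which index (if any) it plans to use for a query.
plan = conn.execute(
    "explain query plan select * from branch where branch_name = 'Mianus'"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)

# A unique index enforces that the search key is a candidate key.
conn.execute("create unique index b_unique on branch (branch_name)")
try:
    conn.execute("insert into branch values ('Mianus', 'Rye', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# drop index <index-name>
conn.execute("drop index b_index")
remaining = [row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'index'")]
print(duplicate_rejected, remaining)
```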
Index Selection Guidelines
Attributes in WHERE clause are candidates for
index keys.
Exact match condition suggests cluster/sparse/hash
index.
Range query suggests tree index.
Clustering is especially useful for range queries;
can also help on equality queries if there are
many duplicates.
Multi-attribute search keys should be considered
when a WHERE clause contains several conditions.
Try to choose indexes that benefit as many queries
as possible.
If only one index can be clustered per relation,
choose it based on important queries that would
benefit the most from clustering.
Index Selection Guidelines (Cont.)
SELECT E.dno
FROM Emp E
WHERE E.age>40
B+ tree index on E.age can be used to get
qualifying tuples.
Things to consider
How selective is the condition?
If 99% are over 40, index is less useful
If 10%, an index is useful
Index Selection Guidelines (Cont.)
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>20
GROUP BY E.dno
Consider the GROUP BY query: is an index on
age effective?
If many tuples have E.age > 20, using the E.age index and
sorting the retrieved tuples may be costly.
Especially bad if this index is not clustered
Clustered E.dno index may be better!
Indexes with Composite Search
Keys
Composite Search Keys: Search on a combination
of fields.
Equality query: Every field value is equal to a constant
value. E.g. wrt <sal,age> index:
age=12 and sal=75
Range query: Some field value is not a constant. E.g.:
age=12; or age=12 and sal > 10
Data entries in index sorted by search key to
support range queries.
Examples of composite key indexes using lexicographic order:
Data records sorted by name:
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75
Data entries in index, sorted by search key:
<age,sal>  <age>  <sal,age>  <sal>
11,80      11     10,12      10
12,10      12     20,12      20
12,20      12     75,13      75
13,75      13     80,11      80
Composite Search Keys
To retrieve Emp records with age=30 AND
sal=4000, an index on <age,sal> would be
better than an index on age or an index on sal.
If condition is: 20<age<30 AND
3000<sal<5000:
Clustered index on <age,sal> or <sal,age> is best.
If condition is: age=30 AND 3000<sal<5000:
Clustered <age,sal> index much better than <sal,age>
index!
Composite indexes are larger, updated more
often.
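A small Python sketch of why field order matters for a query with one equality and one range condition. It reuses the four example records (bob, cal, joe, sue) and models each index as a sorted list of composite keys searched with `bisect`; the function names and the "entries examined" count are illustrative assumptions, not a cost model from the slides.

```python
from bisect import bisect_right

# Data records (name, age, sal) from the composite-key example.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of two composite indexes, kept in lexicographic order.
age_sal = sorted(((age, sal), name) for name, age, sal in records)
sal_age = sorted(((sal, age), name) for name, age, sal in records)
age_sal_keys = [key for key, _ in age_sal]
sal_age_keys = [key for key, _ in sal_age]


def query_age_sal(age, sal_low):
    """age = <age> AND sal > <sal_low>, using the <age,sal> index."""
    # Matches share the prefix `age`, so they form one contiguous run.
    lo = bisect_right(age_sal_keys, (age, sal_low))
    hi = bisect_right(age_sal_keys, (age, float("inf")))
    return [name for _, name in age_sal[lo:hi]], hi - lo


def query_sal_age(age, sal_low):
    """The same query, using the <sal,age> index."""
    # Only sal > sal_low narrows the scan; age is checked entry by entry.
    lo = bisect_right(sal_age_keys, (sal_low, float("inf")))
    scanned = sal_age[lo:]
    return [name for (_, a), name in scanned if a == age], len(scanned)


print(query_age_sal(12, 10))  # (matching names, index entries examined)
print(query_sal_age(12, 10))
```

Both indexes return the same answer, but with <age,sal> only the qualifying entries are examined, while <sal,age> examines every entry with a large enough sal.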
Exercise to solve
Emp (eid: int, salary:int, age: real, did: int)
eid is the key, and there’s a clustered
index on eid and an unclustered index on
age
1. Give an example of a query that can be
sped up because of the available
indexes.
2. Give an example that is neither sped up
nor slowed down by the indexes.
3. Can there be an update that can be slowed
down because of the indexes?
Thank You