
Aditya College of Engineering & Technology

Aditya Nagar, ADB Road, Surampalem - 533437

DATABASE MANAGEMENT SYSTEMS

UNIT V: Transaction Concept & Indexing Techniques

Syllabus:
Transaction Concept: Transaction State, Implementation of Atomicity and Durability,
Concurrent Executions, Serializability, Recoverability, Implementation of Isolation, Testing for
Serializability, Failure Classification, Storage, Recovery and Atomicity, Recovery algorithm.
Indexing Techniques: B+ Trees: Search, Insert, Delete algorithms, File Organization and
Indexing, Cluster Indexes, Primary and Secondary Indexes, Index Data Structures, Hash Based
Indexing, Tree Based Indexing, Comparison of File Organizations, Indexes and Performance
Tuning.

Objectives:
After studying this unit, you will be able to:
 Discuss the ACID properties of transactions and their implementation.
 Describe concurrent execution, serializability and recoverability.
 Understand the physical design of a database system by discussing database indexing and
storage techniques.
 Examine issues in data storage and query processing and formulate appropriate solutions.

5.1. Introduction
• A transaction is a unit of program execution that accesses and possibly updates
various data items.
• The transaction consists of all operations executed between the statements begin
and end of the transaction
• Transaction operations: Access to the database is accomplished in a transaction by
the following two operations:
 read (X): Performs the reading operation of data item X from the database
 write (X): Performs the writing operation of data item X to the database
• A transaction must see a consistent database
• During transaction execution the database may be inconsistent
• When the transaction is committed, the database must be consistent
• Two main issues to deal with:
 Failures, e.g. hardware failures and system crashes
 Concurrency, for simultaneous execution of multiple transactions
5.2 ACID Properties
 To preserve integrity of data, the database system must ensure:
• Atomicity: Either all operations of the transaction are properly reflected in the
database or none are
• Consistency: Execution of a transaction in isolation preserves the consistency of the
database
• Isolation: Although multiple transactions may execute concurrently, each transaction
must be unaware of other concurrently executing transactions; intermediate
transaction results must be hidden from other concurrently executed transactions
• Durability: After a transaction completes successfully, the changes it has made to
the database persist, even if there are system failures
Example of Fund Transfer: Let Ti be a transaction that transfers $50 from account A to
account B. This transaction can be illustrated as the following sequence of steps:

Ti : 1. read(A)
     2. A := A – 50
     3. write(A)
     4. read(B)
     5. B := B + 50
     6. write(B)

• Consistency: the sum of A and B is unchanged by the execution of the transaction.


• Atomicity: if the transaction fails after step 3 and before step 6, the system should
ensure that its updates are not reflected in the database, else an inconsistency will
result.
• Durability: once the user has been notified that the transaction has completed, the
updates to the database by the transaction must persist despite failures.
• Isolation: between steps 3 and 6, no other transaction should access the partially
updated database, or else it will see an inconsistent state (the sum A + B will be less
than it should be).
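To make the example concrete, the following sketch (plain Python acting on an in-memory dictionary; the rollback logic is an illustrative assumption, not how a real DBMS implements atomicity) runs the six steps of Ti and checks that consistency is preserved:

db = {"A": 500, "B": 300}            # consistent state: A + B = 800

def transfer(db, amount=50):
    before = dict(db)                # remembered only so we can undo (atomicity)
    try:
        a = db["A"]                  # 1. read(A)
        a = a - amount               # 2. A := A - 50
        db["A"] = a                  # 3. write(A)
        b = db["B"]                  # 4. read(B)
        b = b + amount               # 5. B := B + 50
        db["B"] = b                  # 6. write(B)
    except Exception:
        db.clear(); db.update(before)    # on failure, none of the updates remain
        raise

transfer(db)
assert db["A"] + db["B"] == 800      # consistency: the sum of A and B is unchanged
print(db)                            # {'A': 450, 'B': 350}

If the transaction failed between steps 3 and 6, the except branch would restore the original values, which is exactly the behaviour the atomicity property demands.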

5.3 Transaction and Schedules

• A transaction is seen by the DBMS as a series, or list, of actions. We therefore establish
a simple transaction model, described in terms of transaction states.

Transaction State: A transaction must be in one of the following states:

• Active: the initial state; the transaction stays in this state while it is executing.
• Partially committed: after the final statement has been executed.
• Committed: after successful completion.
• Failed: after the discovery that normal execution can no longer proceed.
• Aborted: after the transaction has been rolled back and the database restored to its
state prior to the start of the transaction.

5.4 Concurrent Execution and Schedules

Concurrent execution: executing transactions simultaneously has the following advantages:
■ increased processor and disk utilization, leading to better throughput
■ one transaction can be using the CPU while another is reading from or writing to the disk
■ reduced average response time for transactions: short transactions need not wait behind
long ones

Concurrency control schemes: these are mechanisms to achieve isolation
■ to control the interaction among the concurrent transactions in order to prevent them
from destroying the consistency of the database

Schedules: sequences that indicate the chronological order in which instructions of
concurrent transactions are executed
 a schedule for a set of transactions must consist of all instructions of those transactions
 must preserve the order in which the instructions appear in each individual transaction


Example Schedules
• Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The
following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.

Schedule 1

• Let T1 and T2 be the transactions defined previously. The following schedule is not a
serial schedule, but it is equivalent to the schedule above (Schedule 1).

Schedule 2

• The following concurrent schedule does not preserve the value of the sum A + B

Schedule 3


Serializable Schedule
 A serializable schedule over a set S of committed transactions is a schedule whose
effect on any consistent database is guaranteed to be identical to that of some
complete serial schedule over S. Note that, even though the actions of the transactions
are interleaved, executing the transactions serially in different orders may produce
different results; the schedule need only be equivalent to one such serial order.
 Example: The schedule shown in the following figure is serializable.

T1              T2
R(A)
W(A)
                R(A)
                W(A)
R(B)
W(B)
                R(B)
                W(B)
Commit
                Commit

Even though the actions of T1 and T2 are interleaved, the result of this schedule is
equivalent to first running T1 entirely and then running T2 entirely. T1's read and write of
B are not influenced by T2's actions on B, and the net effect is the same as that of the
serial schedule: first T1, then T2. A schedule equivalent to the serial order first T2, then T1
would likewise be serializable. Therefore, if T1 and T2 are submitted concurrently to a
DBMS, either of these two serial orders could be reflected in the schedule the DBMS chooses.
 A DBMS might sometimes execute transactions in a way that is not equivalent to any
serial execution, i.e., in a way that is not serializable.
 This can happen for two reasons:
 First, the DBMS might use a concurrency control method that does not guarantee
serializability of the executed schedule.
 Second, SQL gives programmers the authority to instruct the DBMS to choose a
non-serializable schedule (by selecting a weaker isolation level).
Anomalies due to Interleaved execution
 There are three main situations when the actions of two transactions T1 and T2
conflict with each other in the interleaved execution on the same data object.
■ Write-Read (WR) Conflict: Reading Uncommitted data.
■ Read-Write (RW) Conflict: Unrepeatable Reads
■ Write-Write (WW) Conflict: Overwriting Uncommitted Data.


 Reading Uncommitted Data (WR Conflicts)


■ Dirty Read: The first source of anomalies is that a transaction T2 could read a
database object A that has just been modified by another transaction T1, which
has not yet committed; such a read is called a dirty read.
■ Example: Consider two transactions T1 and T2, where T1 transfers $100 from A to B
and T2 adds 6% interest to both A and B. Suppose that their actions are interleaved
as follows:
(i) T1 deducts $100 from account A; then immediately
(ii) T2 reads accounts A and B and adds 6% interest to each; and then
(iii) T1 adds $100 to account B.
The corresponding schedule is illustrated as follows:

T1                      T2
R(A)
A := A - 100
W(A)
                        R(A)
                        A := A + 0.06 × A
                        W(A)
                        R(B)
                        B := B + 0.06 × B
                        W(B)
                        Commit
R(B)
B := B + 100
W(B)
Commit

The problem here is that T2 has added the 6% interest to inconsistent values of A and B:
it read A after the $100 had been deducted (but before T1 committed), and it read B
before the $100 had been credited. Thus, the result of this schedule is different from the
result of the serializable schedule in which T1 runs first and then T2.

 Unrepeatable Reads (RW Conflicts)


■ The second source of anomalies is that a transaction T2 could change the value
of an object A that has been read by a transaction T1, while T1 is still in progress.
This causes a problem: if T1 tries to read the value of A again, it will get a different
result, even though it has not modified A in the meantime. This situation could not
arise in a serial execution of two transactions; such a read is called an unrepeatable read.
■ Example: Suppose that both T1 and T2 read the same value of A, say 5. T1 then
increments A to 6, but before T1 commits, T2, still working from the old value 5,
decrements A and writes 4. The final value of A is 4 instead of 5 (the result of applying
one increment and one decrement serially), which is incorrect.


 Overwriting Uncommitted Data (WW Conflicts)


■ The third source of anomalies is that a transaction T2 could overwrite the value
of an object A, which has already been modified by a transaction T1, while
T1 is still in progress.
■ Example: Suppose that A and B are two employees, and their salaries must be
kept equal. Transaction T1 sets their salaries to $1000 and transaction T2 sets
their salaries to $2000.

The following interleaving of the actions of T1 and T2 occurs:

i) T1 sets A's salary to $1000; at the same time, T2 sets B's salary to $2000.
ii) T1 then sets B's salary to $1000; at the same time, T2 sets A's salary to $2000.
As a result, A's salary is set to $2000 and B's salary is set to $1000, i.e., the two salaries
are not equal, and the result is not identical to that of any serial execution.

■ Blind Write: Neither transaction reads a value before writing it; such a write is
called a blind write. The example above is a good illustration of blind writes,
because T1 and T2 only write the salaries and never read them.

Schedules involving aborted Transactions


 All actions of aborted transactions are to be undone, and we can therefore
imagine that they were never carried out to begin with.
 Example: Suppose that transaction T1 deducts $100 from account A. Before T1 commits
its new value of A, transaction T2 reads the current values of accounts A and B, adds 6%
interest to each, and commits; T1 is then aborted. T2 has therefore computed its result
from a value of A written by a transaction that was later aborted, and since T2 has already
committed, its incorrect updates cannot be undone. We say that such a schedule is an
Unrecoverable Schedule. The corresponding schedule is shown as follows:

T1                      T2
R(A)
A := A - 100
W(A)
                        R(A)
                        A := A + 0.06 × A
                        W(A)
                        R(B)
                        B := B + 0.06 × B
                        W(B)
                        Commit
Abort


 In contrast, a recoverable schedule is one in which a transaction commits only after
every transaction whose changes it has read has committed.

5.5 Serializability
 Basic Assumption – Each transaction, on its own, preserves database consistency
• i.e. serial execution of transactions preserves database consistency
 A (possibly concurrent) schedule is serializable if it is equivalent to a serial
schedule
 Different forms of schedule equivalence give rise to the notions of conflict
serializability and view serializability
 Simplifying assumptions:
• ignore operations other than read and write instructions
• assume that transactions may perform arbitrary computations on data in local buffers
between reads and writes
• simplified schedules consist only of reads and writes

 Conflict Serializability
 Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there
exists some item Q accessed by both li and lj, and at least one of these instructions
wrote Q.
1. li = read(Q), lj = read(Q). li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict
4. li = write(Q), lj = write(Q). They conflict
 Intuitively, a conflict between li and lj forces a (logical) temporal order between
them
 If li and lj are consecutive in a schedule and they do not conflict, their results
would remain the same even if they had been interchanged in the ordering
 If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
 We say that a schedule S is conflict serializable if it is conflict equivalent to a
serial schedule
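To make the four cases concrete, the short sketch below (Python; the (transaction, action, item) encoding of an instruction is an assumption made for illustration) tests whether two instructions conflict:

def conflicts(op1, op2):
    # Two instructions conflict iff they belong to different transactions,
    # access the same item Q, and at least one of them is a write.
    t1, a1, q1 = op1                 # e.g. ("T1", "W", "Q")
    t2, a2, q2 = op2
    return t1 != t2 and q1 == q2 and "W" in (a1, a2)

print(conflicts(("T1", "R", "Q"), ("T2", "R", "Q")))   # False: read-read does not conflict
print(conflicts(("T1", "R", "Q"), ("T2", "W", "Q")))   # True
print(conflicts(("T1", "W", "Q"), ("T2", "W", "Q")))   # True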

• Example of a schedule that is not conflict serializable:

We are unable to swap instructions in the above schedule to obtain either the serial
schedule < T3, T4 >, or the serial schedule < T4, T3 >.
• Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2
follows T1, by a series of swaps of non-conflicting instructions.
Therefore Schedule 3 is conflict serializable.
 View Serializability
• Let S and S´ be two schedules with the same set of transactions. S and S´ are view
equivalent if the following three conditions are met, where Q is a data item and Ti is a
transaction:
1. If Ti reads the initial value of Q in schedule S, then Ti must, in schedule
S´, also read the initial value of Q
2. If Ti executes read(Q) in schedule S, and that value was produced by
transaction Tj (if any), then transaction Ti must in schedule S´ also read the
value of Q that was produced by transaction Tj
3. The transaction (if any) that performs the final write(Q) operation in schedule S
(for any data item Q) must perform the final write(Q) operation in schedule S´
NB: View equivalence is also based purely on reads and writes
• A schedule S is view serializable if it is view equivalent to a serial schedule
• Every conflict serializable schedule is also view serializable
• Schedule 9 (from book) — a schedule which is view-serializable but not conflict
serializable

Every view serializable schedule that is not conflict serializable has blind writes
Other Notions of Serializability
• This schedule produces the same outcome as the serial schedule < T1, T5 >
• However it is not conflict equivalent or view equivalent to it
• Determining such equivalence requires analysis of operations other than read and
write


 Testing for Serializability


• Consider some schedule of a set of transactions T1, T2, ..., Tn
• Precedence graph: a directed graph where the vertices are transaction names
• We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data
item before Tj
• We may label the arc by the item that was accessed
• Example:

• Example Schedule and Precedence Graph

• A schedule is conflict serializable if and only if its precedence graph is acyclic


■ Cycle-detection algorithms exist which take order n² time, where n is the number of
vertices in the graph
■ If the precedence graph is acyclic, the serializability order can be obtained by a topological
sorting of the graph. This is a linear order consistent with the partial order of the
graph. For example, a serializability order for this graph is T2 → T1 → T3 → T4 → T5.
A small sketch of this test is given below.
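The sketch below (Python; the schedule encoding and the function names are illustrative assumptions) builds a precedence graph from a list of instructions, and then either reports a serializability order obtained by topological sorting or reports that the graph is cyclic:

from collections import defaultdict

def precedence_graph(schedule):
    # schedule: list of (transaction, action, item) triples in execution order.
    txns = {t for t, _, _ in schedule}
    edges = defaultdict(set)
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            # arc Ti -> Tj if a later instruction of Tj conflicts with this one of Ti
            if ti != tj and qi == qj and "W" in (ai, aj):
                edges[ti].add(tj)
    return txns, edges

def serial_order(txns, edges):
    # Topological sort; returns None if the graph is cyclic (not conflict serializable).
    indeg = {t: 0 for t in txns}
    for t in edges:
        for u in edges[t]:
            indeg[u] += 1
    ready = [t for t in txns if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for u in edges[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order if len(order) == len(txns) else None

# The interleaved schedule from the serializability example above.
s = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
     ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B")]
print(serial_order(*precedence_graph(s)))    # ['T1', 'T2']: conflict serializable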

• The precedence graph test for conflict serializability must be modified to apply to a
test for view serializability
■ The problem of checking if a schedule is view serializable is NP-complete. Thus
existence of an efficient algorithm is unlikely. However practical algorithms that just
check some sufficient conditions for view serializability can still be used


Example of an acyclic precedence graph

 Concurrency Control vs. Serializability Tests


• Goal – to develop concurrency control protocols that will ensure serializability
• These protocols will impose a discipline that avoids nonserializable schedules
• A common concurrency control protocol uses locks
■ while one transaction is accessing a data item, no other transaction can modify it
■ require a transaction to lock the item before accessing it
■ two standard lock modes are “shared” (read-only) and “exclusive” (read-write)

5.6 Recoverability
 Need to address the effect of transaction failures on concurrently running
transactions.
 Recoverable schedule: if a transaction Tj reads a data item previously written by a
transaction Ti , the commit operation of Ti appears before the commit operation of
Tj
 The following schedule (Schedule 11) is not recoverable if T9 commits immediately
after the read

• If T8 should abort, T9 would have read (and possibly shown to the user) an
inconsistent database state. Hence database must ensure that schedules are
recoverable
• Cascading rollback – a single transaction failure leads to a series of transaction rollbacks
• Consider the following schedule where none of the transactions has yet committed
(so the schedule is recoverable)
• If T10 fails, T11 and T12 must also be rolled back
• Can lead to the undoing of a significant amount of work

• Cascadeless schedules — cascading rollbacks cannot occur; for each pair of transactions
Ti and Tj such that Tj reads a data item previously written by Ti, the commit
operation of Ti appears before the read operation of Tj


• Every cascadeless schedule is also recoverable


• It is desirable to restrict the schedules to those that are cascadeless

5.7 Implementation of Atomicity and Durability


 The recovery-management component of a database system implements the
support for atomicity and durability.

E.g. the shadow-database scheme:


• all updates are made on a shadow copy of the database
• db_pointer is made to point to the updated shadow copy after the transaction reaches
partial commit and all updated pages have been flushed to disk
• db_pointer always points to the current consistent copy of the database
• in case the transaction fails, the old consistent copy pointed to by db_pointer can be
used, and the shadow copy can be deleted
The shadow-database scheme:
• assumes that only one transaction is active at a time
• assumes that disks do not fail
• does not handle concurrent transactions
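A minimal sketch of this scheme is shown below (Python; the file names, the JSON representation of the database, and the use of a small pointer file are illustrative assumptions). Updates go to a fresh shadow copy, the copy is flushed to disk, and only then is db_pointer atomically switched, which is the commit point:

import json, os, shutil

def run_transaction(updates, pointer="db_pointer"):
    with open(pointer) as p:
        current = p.read().strip()            # db_pointer names the current consistent copy
    shadow = current + ".new"
    shutil.copyfile(current, shadow)          # all updates are made on a shadow copy
    with open(shadow) as f:
        db = json.load(f)
    db.update(updates)
    with open(shadow, "w") as f:
        json.dump(db, f)
        f.flush(); os.fsync(f.fileno())       # flush the updated copy to disk first
    with open(pointer + ".tmp", "w") as f:
        f.write(shadow)
    os.replace(pointer + ".tmp", pointer)     # atomically repoint db_pointer (commit)

# Create an initial consistent copy and the pointer, then transfer $50 from A to B.
with open("db_v1", "w") as f:
    json.dump({"A": 500, "B": 300}, f)
with open("db_pointer", "w") as f:
    f.write("db_v1")
run_transaction({"A": 450, "B": 350})
with open("db_pointer") as p, open(p.read().strip()) as f:
    print(json.load(f))                       # {'A': 450, 'B': 350}

If the transaction fails before the final os.replace, db_pointer still names the old copy, so the database remains in its previous consistent state and the shadow copy can simply be deleted.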

5.8 Storage Structure


Stable-Storage Implementation

5.9 File Organization


A file is organized logically as a sequence of records. These records are mapped onto disk
blocks. Files are provided as a basic construct in operating systems, so we shall assume the
existence of an underlying file system.
Each file is also logically partitioned into fixed-length storage units called blocks, which
are the units of both storage allocation and data transfer.
Most databases use block sizes of 4 to 8 kilobytes by default, but many databases allow
the block size to be specified when a database instance is created.
Two possible approaches to store records: • Record size is fixed • Record size is variable

Fixed-Length Records
As an example, let us consider a file of instructor records for our university database.
Each record of this file is defined (in pseudocode) as:

type instructor = record
      ID varchar (5);
      name varchar (20);
      dept_name varchar (20);
      salary numeric (8,2);
end

We allocate the maximum number of bytes that each attribute can hold. Then, the
instructor record is 53 bytes long (5 + 20 + 20 + 8).


A simple approach is to use the first 53 bytes for the first record, the next 53 bytes for the
second record, and so on as shown below figure.

There are two problems with this simple approach:


1. Unless the block size happens to be a multiple of 53 (which is unlikely), some records will
cross block boundaries. That is, part of the record will be stored in one block and part in
another. It would thus require two block accesses to read or write such a record.
2. It is difficult to delete a record from this structure. The space occupied by the record to be
deleted must be filled with some other record of the file, or we must have a way of marking
deleted records so that they can be ignored.

To avoid the first problem, we allocate only as many records to a block as would fit
entirely in the block (this number can be computed easily by dividing the block size by the
record size, and discarding the fractional part). Any remaining bytes of each block are left
unused.

To avoid the second problem, When a record is deleted, we could move the record that
came after it into the space formerly occupied by the deleted record, and so on, until every
record following the deleted record has been moved ahead shown in below figure.


Such an approach requires moving a large number of records. It might be easier simply
to move the final record of the file into the space occupied by the deleted record shown in
below figure.

It is undesirable to move records to occupy the space freed by a deleted record, since
doing so requires additional block accesses. Since insertions tend to be more frequent than
deletions, it is acceptable to leave open the space occupied by the deleted record, and to wait
for a subsequent insertion before reusing the space.
A simple marker on a deleted record is not sufficient, since it is hard to find this
available space when an insertion is being done. Thus, we need to introduce an additional
structure.
At the beginning of the file, we allocate a certain number of bytes as a file header. The
header will contain a variety of information about the file. For now, all we need to store there
is the address of the first record whose contents are deleted.
We use this first record to store the address of the second available record, and so on.
Intuitively, we can think of these stored addresses as pointers, since they point to the location of
a record. The deleted records thus form a linked list, which is often referred to as a free list.
Below figure shows the file with the free list, after records 1, 4, and 6 have been deleted. On
insertion of a new record, we use the record pointed to by the header; we change the header
pointer to point to the next available record. If no space is available, we add the new record
to the end of the file.
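The following sketch (Python; slot numbers stand in for record addresses, which is an assumption made for illustration) implements the free-list idea just described: the file header holds the first deleted slot, each deleted slot holds the next one, and insertions reuse freed slots before extending the file:

class FixedLengthFile:
    def __init__(self):
        self.slots = []        # each slot holds either a record or a free-list pointer
        self.free_head = None  # file header: address of the first deleted record

    def insert(self, record):
        if self.free_head is not None:         # reuse a slot from the free list
            slot = self.free_head
            self.free_head = self.slots[slot]  # unlink it from the list
            self.slots[slot] = record
        else:                                  # no free slot: append at the end of the file
            self.slots.append(record)
            slot = len(self.slots) - 1
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head      # deleted slot points to the old head
        self.free_head = slot                  # header now points to this slot

f = FixedLengthFile()
for name in ["Srinivasan", "Wu", "Mozart", "Einstein", "El Said", "Gold", "Katz"]:
    f.insert(name)
for s in (1, 4, 6):                            # delete records 1, 4 and 6, as in the figure
    f.delete(s)
print(f.insert("Verdi"))                       # 6: the most recently freed slot is reused first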


Variable-Length Records
Variable-length records arise in database systems in several ways:
• Storage of multiple record types in a file.
• Record types that allow variable lengths for one or more fields.
• Record types that allow repeating fields, such as arrays or multisets.
Different techniques for implementing variable-length records exist. Two different problems
must be solved by any such technique:
• How to represent a single record in such a way that individual attributes can be extracted
easily.
• How to store variable-length records within a block, such that records in a block can be
extracted easily.

Representation of variable-length record


The representation of a record with variable-length attributes typically has two parts: an
initial part with fixed length attributes, followed by data for variable length attributes. Fixed-
length attributes, such as numeric values, dates, or fixed length character strings are allocated
as many bytes as required to store their value. Variable-length attributes, such as varchar
types, are represented in the initial part of the record by a pair (offset, length), where offset
denotes where the data for that attribute begins within the record, and length is the length in
bytes of the variable-sized attribute. The values for these attributes are stored consecutively,
after the initial fixed-length part of the record. Thus, the initial part of the record stores a fixed
size of information about each attribute, whether it is fixed-length or variable-length.
An example of such a record representation is shown in below figure. The figure shows
an instructor record, whose first three attributes ID, name, and dept name are variable-length
strings, and whose fourth attribute salary is a fixed-sized number. We assume that the offset
and length values are stored in two bytes each, for a total of 4 bytes per attribute. The salary
attribute is assumed to be stored in 8 bytes, and each string takes as many bytes as it has
characters.

The figure also illustrates the use of a null bitmap, which indicates which attributes of the
record have a null value. In this particular record, if the salary were null, the fourth bit of the
bitmap would be set to 1, and the salary value stored in bytes 12 through 19 would be ignored.
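The sketch below (Python; the two-byte offsets and lengths, the eight-byte salary, and the single null-bitmap byte are assumptions chosen to mirror the figure's layout) packs one instructor record in the (offset, length) format described above:

import struct

def pack_instructor(ID, name, dept_name, salary):
    strings = [s.encode() for s in (ID, name, dept_name)]
    fixed_size = 3 * 4 + 8 + 1                 # three (offset, length) pairs + salary + bitmap
    header, offset = b"", fixed_size
    for s in strings:                          # two bytes for the offset, two for the length
        header += struct.pack("<HH", offset, len(s))
        offset += len(s)
    null_bitmap = 0                            # no attribute is null in this record
    return header + struct.pack("<d", salary) + bytes([null_bitmap]) + b"".join(strings)

def unpack_field(record, i):
    # Read the i-th (offset, length) pair and slice the corresponding string out.
    off, length = struct.unpack_from("<HH", record, i * 4)
    return record[off:off + length].decode()

rec = pack_instructor("22222", "Einstein", "Physics", 95000.0)
print(len(rec), unpack_field(rec, 1))          # 41 Einstein

As in the figure, the fixed-length initial part occupies bytes 0 through 20 (the three pairs, the salary in bytes 12 through 19, and the null bitmap), and the variable-length string values follow from byte 21 onward.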

Storing variable-length records in a block


The slotted-page structure is commonly used for organizing records within a block.
There is a header at the beginning of each block, containing the following information:
1. The number of record entries in the header.
2. The end of free space in the block.
3. An array whose entries contain the location and size of each record.
The actual records are allocated contiguously in the block, starting from the end of the
block. The free space in the block is contiguous, between the final entry in the header array,
and the first record. If a record is inserted, space is allocated for it at the end of free space, and
an entry containing its size and location is added to the header.
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its
size is set to −1, for example). Further, the records in the block before the deleted record are
moved, so that the free space created by the deletion gets occupied, and all free space is again
between the final entry in the header array and the first record. The end-of-free-space pointer
in the header is appropriately updated as well. Records can be grown or shrunk by similar
techniques, as long as there is space in the block. The cost of moving the records is not too
high, since the size of a block is limited: typical values are around 4 to 8 kilobytes.
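Below is a small sketch of the slotted-page organization (Python; the 4 KB block size and the use of -1 as the deleted marker follow the description above, while the rest of the encoding is an illustrative assumption):

BLOCK_SIZE = 4096

class SlottedPage:
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.slots = []                  # header array: one (location, size) entry per record
        self.free_end = BLOCK_SIZE       # end of free space; records grow towards the header

    def insert(self, record: bytes):
        self.free_end -= len(record)              # allocate at the end of free space
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1                # the slot number identifies the record

    def delete(self, slot):
        loc, size = self.slots[slot]
        self.slots[slot] = (loc, -1)              # mark the header entry as deleted
        # Move the records stored before the hole so that free space stays contiguous.
        moved = self.data[self.free_end:loc]
        self.data[self.free_end + size:loc + size] = moved
        self.free_end += size
        self.slots = [(l + size, s) if s != -1 and l < loc else (l, s)
                      for (l, s) in self.slots]

    def read(self, slot):
        loc, size = self.slots[slot]
        return None if size == -1 else bytes(self.data[loc:loc + size])

page = SlottedPage()
r1 = page.insert(b"record-A"); r2 = page.insert(b"record-B"); r3 = page.insert(b"record-C")
page.delete(r2)
print(page.read(r1), page.read(r3))       # both still readable after the compaction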

5.10 Organization of Records in Files


A relation is a set of records. Given a set of records, the next question is how to organize
them in a file. Several of the possible ways of organizing records in files are:
• Heap file organization. Any record can be placed anywhere in the file where there is space
for the record. There is no ordering of records. Typically, there is a single file for each
relation.
• Sequential file organization. Records are stored in sequential order, according to the value
of a “search key” of each record.
• Hashing file organization. A hash function is computed on some attribute of each record.
The result of the hash function specifies in which block of the file the record should be placed.

Sequential File Organization

A sequential file is designed for efficient processing of records in sorted order based on
some search key. A search key is any attribute or set of attributes; it need not be the primary
key, or even a superkey. To permit fast retrieval of records in search-key order, we chain
together records by pointers. The pointer in each record points to the next record in search-key
order. Furthermore, to minimize the number of block accesses in sequential file processing, we
store records physically in search-key order, or as close to search-key order as possible.
Below figure shows a sequential file of instructor records taken from our university example.
In that example, the records are stored in search-key order, using ID as the search key.


It is difficult, however, to maintain physical sequential order as records are inserted and
deleted, since it is costly to move many records as a result of a single insertion or deletion. We
can manage deletion by using pointer chains, as we saw previously. For insertion, we apply
the following rules:
1. Locate the record in the file that comes before the record to be inserted in search-key order.
2. If there is a free record (that is, space left after a deletion) within the same block as this
record, insert the new record there. Otherwise, insert the new record in an overflow block. In
either case, adjust the pointers so as to chain together the records in search-key order.
Below figure shows the record after the insertion of the record (32222, Verdi, Music, 48000).
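The following sketch (Python; an in-memory list stands in for the disk file and the overflow-block handling is omitted, which are simplifying assumptions) shows the pointer-chain idea behind this insertion rule: records are stored wherever there is room, and the chain of pointers keeps them linked in search-key order:

class SequentialFile:
    def __init__(self):
        self.slots = []      # physical storage; a record never moves once stored
        self.head = None     # slot of the record with the smallest search-key value

    def insert(self, key, data):
        self.slots.append({"key": key, "data": data, "next": None})
        new = len(self.slots) - 1
        # 1. Locate the record that comes before the new record in search-key order.
        prev, cur = None, self.head
        while cur is not None and self.slots[cur]["key"] < key:
            prev, cur = cur, self.slots[cur]["next"]
        # 2. Adjust the pointers so the records stay chained in search-key order.
        if prev is None:
            self.slots[new]["next"], self.head = self.head, new
        else:
            self.slots[new]["next"] = self.slots[prev]["next"]
            self.slots[prev]["next"] = new

    def scan(self):
        cur = self.head
        while cur is not None:
            yield self.slots[cur]["key"], self.slots[cur]["data"]
            cur = self.slots[cur]["next"]

f = SequentialFile()
for rec in [(10101, "Srinivasan"), (45565, "Katz"), (32222, "Verdi")]:
    f.insert(*rec)
print(list(f.scan()))    # records come back in ID order regardless of insertion order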


5.11 Introduction to Indexing

An index for a file in a database system works in much the same way as the index in this
textbook. If we want to learn about a particular topic (specified by a word or a phrase) in this
textbook, we can search for the topic in the index at the back of the book, find the pages where
it occurs, and then read the pages to find the information for which we are looking.
Database-system indices play the same role as book indices in libraries. For example, to
retrieve a student record given an ID, the database system would look up an index to find on
which disk block the corresponding record resides, and then fetch the disk block, to get the
appropriate student record.
There are two basic kinds of indices:
• Ordered indices. Based on a sorted ordering of the values.
• Hash indices. Based on a uniform distribution of values across a range of buckets. The
bucket to which a value is assigned is determined by a function, called a hash function.

Index Evaluation Metrics


• Access types: The types of access that are supported efficiently. Access types can include
finding records with a specified attribute value and finding records whose attribute values
fall in a specified range.
• Access time: The time it takes to find a particular data item, or set of items, using the
technique in question.
• Insertion time: The time it takes to insert a new data item. This value includes the time it
takes to find the correct place to insert the new data item, as well as the time it takes to
update the index structure.
• Deletion time: The time it takes to delete a data item. This value includes the time it takes
to find the item to be deleted, as well as the time it takes to update the index structure.
• Space overhead: The additional space occupied by an index structure. Provided that the
amount of additional space is moderate, it is usually worthwhile to sacrifice the space to
achieve improved performance.

An attribute or set of attributes used to look up records in a file is called a search key.

5.12 Ordered Indices


To gain fast random access to records in a file, we can use an index structure. Each
index structure is associated with a particular search key. Just like the index of a book or a
library catalog, an ordered index stores the values of the search keys in sorted order, and
associates with each search key the records that contain it.
A file may have several indices, on different search keys. If the file containing the
records is sequentially ordered, a clustering index is an index whose search key also defines
the sequential order of the file.
Clustering indices are also called primary indices; the term primary index may appear
to denote an index on a primary key, but such indices can in fact be built on any search key.
Indices whose search key specifies an order different from the sequential order of the
file are called nonclustering indices, or secondary indices.


Dense and Sparse Indices


An index entry, or index record, consists of a search-key value and pointers to one or more
records with that value as their search-key value. The pointer to a record consists of the
identifier of a disk block and an offset within the disk block to identify the record within the
block.
There are two types of ordered indices that we can use:
• Dense index: In a dense index, an index entry appears for every search-key value in the file.
In a dense clustering index, the index record contains the search-key value and a pointer to the
first data record with that search-key value. The rest of the records with the same search-key
value would be stored sequentially after the first record, since, because the index is a
clustering one, records are sorted on the same search key. In a dense nonclustering index, the
index must store a list of pointers to all records with the same search-key value.
• Sparse index: In a sparse index, an index entry appears for only some of the search-key
values. Sparse indices can be used only if the relation is stored in sorted order of the search
key, that is, if the index is a clustering index. As is true in dense indices, each index entry
contains a search-key value and a pointer to the first data record with that search-key value. To
locate a record, we find the index entry with the largest search-key value that is less than or
equal to the search-key value for which we are looking. We start at the record pointed to by
that index entry, and follow the pointers in the file until we find the desired record.

Dense index

Sparse index
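A short sketch of the sparse-index lookup described above is given below (Python; the file is a list of blocks and the index a list of pairs, which are assumptions made for illustration). We find the largest index entry whose key is not greater than the search key, follow its pointer, and scan the file from that block:

import bisect

blocks = [                                     # sequentially ordered file, two records per block
    [(10101, "Srinivasan"), (12121, "Wu")],
    [(15151, "Mozart"),     (22222, "Einstein")],
    [(32343, "El Said"),    (45565, "Katz")],
]
# Sparse index: one entry per block, holding the first search-key value in that block.
sparse_index = [(blk[0][0], i) for i, blk in enumerate(blocks)]
index_keys = [k for k, _ in sparse_index]

def lookup(key):
    # Largest index entry with search-key value <= key.
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None
    # Follow the pointer and scan sequentially until the record is found or passed.
    for blk in blocks[sparse_index[pos][1]:]:
        for k, rec in blk:
            if k == key:
                return rec
            if k > key:
                return None
    return None

print(lookup(22222))     # 'Einstein'
print(lookup(99999))     # None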


Multilevel Indices
If an index is small enough to be kept entirely in main memory, the search time to find
an entry is low. However, if the index is so large that not all of it can be kept in memory, index
blocks must be fetched from disk when required. (Even if an index is smaller than the main
memory of a computer, main memory is also required for a number of other tasks, so it may
not be possible to keep the entire index in memory.) The search for an entry in the index then
requires several disk-block reads.
We treat the index just as we would treat any other sequential file, and construct a
sparse outer index on the original index, which we now call the inner index, as shown in
below figure. Note that the index entries are always in sorted order, allowing the outer index
to be sparse. To locate a record, we first use binary search on the outer index to find the record
for the largest search-key value less than or equal to the one that we desire. The pointer points
to a block of the inner index. We scan this block until we find the record that has the largest
search-key value less than or equal to the one that we desire. The pointer in this record points
to the block of the file that contains the record for which we are looking.

Two level sparse index


Index Update
Regardless of what form of index is used, every index must be updated whenever a record is
either inserted into or deleted from the file.
We first describe algorithms for updating single-level indices.
• Insertion. First, the system performs a lookup using the search-key value that appears in the
record to be inserted. The actions the system takes next depend on whether the index is dense
or sparse:
◦ Dense indices:
1. If the search-key value does not appear in the index, the system inserts an index entry
with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
a. If the index entry stores pointers to all records with the same search key value,
the system adds a pointer to the new record in the index entry.
b. Otherwise, the index entry stores a pointer to only the first record with the
search-key value. The system then places the record being inserted after the other
records with the same search-key values.
◦ Sparse indices: We assume that the index stores an entry for each block. If the system
creates a new block, it inserts the first search-key value (in search-key order) appearing
in the new block into the index. On the other hand, if the new record has the least
search-key value in its block, the system updates the index entry pointing to the block;
if not, the system makes no change to the index.
• Deletion. To delete a record, the system first looks up the record to be deleted. The actions
the system takes next depend on whether the index is dense or sparse:
◦ Dense indices:
1. If the deleted record was the only record with its particular search-key value, then the
system deletes the corresponding index entry from the index.
2. Otherwise the following actions are taken:
a. If the index entry stores pointers to all records with the same search key value,
the system deletes the pointer to the deleted record from the index entry.
b. Otherwise, the index entry stores a pointer to only the first record with the
search-key value. In this case, if the deleted record was the first record with the
search-key value, the system updates the index entry to point to the next record.
◦ Sparse indices:
1. If the index does not contain an index entry with the search-key value of the deleted
record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
a. If the deleted record was the only record with its search key, the system
replaces the corresponding index record with an index record for the next search-
key value (in search-key order). If the next search-key value already has an index
entry, the entry is deleted instead of being replaced.
b. Otherwise, if the index entry for the search-key value points to the record
being deleted, the system updates the index entry to point to the next record with
the same search-key value.


Secondary Indices
Secondary indices must be dense, with an index entry for every search-key value, and a
pointer to every record in the file. A clustering index may be sparse, storing only some of the
search-key values, since it is always possible to find records with intermediate search-key
values by a sequential access to a part of the file, as described earlier. If a secondary index
stores only some of the search-key values, records with intermediate search-key values may be
anywhere in the file and, in general, we cannot find them without searching the entire file.
We can use an extra level of indirection to implement secondary indices on search keys
that are not candidate keys. The pointers in such a secondary index do not point directly to the
file. Instead, each points to a bucket that contains pointers to the file.
Below figure shows the structure of a secondary index that uses an extra level of
indirection on the instructor file, on the search key salary.

5.13 B+ Tree Index Files


The main disadvantage of the index-sequential file organization is that performance
degrades as the file grows, both for index lookups and for sequential scans through the data.
The B+ tree index structure is the most widely used of several index structures that
maintain their efficiency despite insertion and deletion of data.
A B+ tree index takes the form of a balanced tree in which every path from the root of the
tree to a leaf of the tree is of the same length. Each nonleaf node in the tree has between
⌈n/2⌉ and n children, where n is fixed for a particular tree.

Structure of a B+ Tree

A typical node of a B+ tree contains up to n − 1 search-key values K1, K2, ..., Kn−1 and n
pointers P1, P2, ..., Pn; the search-key values within a node are kept in sorted order. In a leaf
node, pointer Pi (for i = 1, 2, ..., n − 1) points to a file record with search-key value Ki, and
pointer Pn chains the leaf to the next leaf node in search-key order, which allows efficient
sequential processing of the file. Nonleaf nodes form a multilevel (sparse) index on the leaf
nodes: pointer Pi points to the subtree containing search-key values less than Ki and greater
than or equal to Ki−1.


Queries on B+ Trees

Let us consider how we process queries on a B+-tree. Suppose that we wish to find
records with a search-key value of V.
Intuitively, the function starts at the root of the tree, and traverses the tree down until it
reaches a leaf node that would contain the specified value if it exists in the tree. Specifically,
starting with the root as the current node, the function repeats the following steps until a leaf
node is reached. First, the current node is examined, looking for the smallest i such that search-
key value Ki is greater than or equal to V. Suppose such a value is found; then, if Ki is equal to
V, the current node is set to the node pointed to by Pi+1, otherwise Ki > V, and the current node
is set to the node pointed to by Pi. If no such value Ki is found, then clearly V > Km−1, where Pm
is the last non null pointer in the node. In this case the current node is set to that pointed to by
Pm. The above procedure is repeated, traversing down the tree until a leaf node is reached.
At the leaf node, if there is a search-key value equal to V, let Ki be the first such value;
pointer Pi directs us to a record with search-key value Ki. The function then returns the leaf
node L and the index i. If no search-key with value V is found in the leaf node, no record with
key value V exists in the relation, and function find returns null, to indicate failure.
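The function described above can be sketched as follows (Python; the node layout and the small two-level tree at the end are illustrative assumptions, not the textbook's data structures):

class Node:
    def __init__(self, keys, children=None, records=None, is_leaf=False):
        self.keys = keys             # sorted search-key values K1, K2, ...
        self.children = children     # pointers P1, P2, ... (nonleaf nodes)
        self.records = records       # one record pointer per key (leaf nodes)
        self.is_leaf = is_leaf

def find(root, v):
    node = root
    while not node.is_leaf:
        # smallest i such that keys[i] >= v
        i = 0
        while i < len(node.keys) and node.keys[i] < v:
            i += 1
        if i == len(node.keys):          # v is greater than every key: follow the last pointer
            node = node.children[-1]
        elif node.keys[i] == v:          # Ki == v: follow P(i+1)
            node = node.children[i + 1]
        else:                            # Ki > v: follow Pi
            node = node.children[i]
    # At a leaf: return the record if v is present, otherwise report failure with None.
    for k, rec in zip(node.keys, node.records):
        if k == v:
            return rec
    return None

leaf1 = Node(keys=[10101, 12121], records=["rec-10101", "rec-12121"], is_leaf=True)
leaf2 = Node(keys=[32343, 45565], records=["rec-32343", "rec-45565"], is_leaf=True)
root = Node(keys=[32343], children=[leaf1, leaf2])
print(find(root, 45565))    # 'rec-45565'
print(find(root, 99999))    # None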


Updates on B+ Trees
When a record is inserted into, or deleted from a relation, indices on the relation must
be updated correspondingly.

Insertion

To insert a record with search-key value K, we first traverse the tree (as in a lookup) down to
the leaf node in which K belongs. If the leaf has room, the new (key, pointer) entry is placed
in it in sorted order. If the leaf is full, it is split into two nodes, the entries are divided between
them, and the smallest search-key value of the new right node is inserted into the parent; a
split may propagate upward, and if the root itself is split the tree grows one level taller.


Deletion

To delete a record, we locate its entry in the appropriate leaf node and remove it. If the leaf is
then left with fewer than the minimum number of entries, its entries are either merged with a
sibling node or redistributed between the leaf and a sibling, and the parent is updated
accordingly; merging may propagate upward, and if the root is left with only one child the
tree shrinks by one level.


5.14 Hash Organization


For a huge database, it can be nearly impossible to search through all the index values at
every level and then reach the destination data block to retrieve the desired data. Hashing is
an effective technique to calculate the direct location of a data record on the disk without
using an index structure.
Hashing uses hash functions with search keys as parameters to generate the address of a data
record.
Hash Organization
• Bucket − A hash file stores data in bucket format. Bucket is considered a unit of
storage. A bucket typically stores one complete disk block, which in turn can store one
or more records.
• Hash Function − A hash function, h, is a mapping function that maps all the set of
search-keys K to the address where actual records are placed. It is a function from
search keys to bucket addresses.
Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes
the same address. For example, if a mod-4 hash function is used, it generates only four
bucket addresses, and the output address is always the same for a given key. The number of
buckets provided remains unchanged at all times.

Operation
• Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be
stored.
Bucket address = h(K)
• Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
• Delete − This is simply a search followed by a deletion operation.
Bucket Overflow
The condition of bucket-overflow is known as collision. This is a fatal state for any static hash
function. In this case, overflow chaining can be used.
• Overflow Chaining − When buckets are full, a new bucket is allocated for the same
hash result and is linked after the previous one. This mechanism is called Closed
Hashing.


• Linear Probing − When a hash function generates an address at which data is already
stored, the next free bucket is allocated to it. This mechanism is called Open Hashing.
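The sketch below (Python; the mod-4 hash function, the bucket capacity of two records, and the list-of-blocks representation are illustrative assumptions) shows static hashing with overflow chaining:

NUM_BUCKETS = 4              # fixed when the file is created (static hashing)
BUCKET_CAPACITY = 2          # records per block; a block is the unit of storage

def h(key):
    return key % NUM_BUCKETS                  # hash function: search key -> bucket address

buckets = [[[]] for _ in range(NUM_BUCKETS)]  # each bucket: a chain starting with one block

def insert(key, record):
    chain = buckets[h(key)]
    if len(chain[-1]) == BUCKET_CAPACITY:     # last block is full: overflow chaining
        chain.append([])                      # allocate a new block, linked after the previous one
    chain[-1].append((key, record))

def search(key):
    # The same hash function locates the bucket; its chain of blocks is then scanned.
    return [rec for block in buckets[h(key)] for k, rec in block if k == key]

for k in (12, 16, 20, 24, 7):                 # 12, 16, 20 and 24 all hash to bucket 0
    insert(k, "record-" + str(k))
print(search(20))                             # ['record-20'], found through the overflow chain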

Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. Dynamic hashing provides a mechanism in which data
buckets are added and removed dynamically and on demand. Dynamic hashing is also
known as extendible hashing.
Hash function, in dynamic hashing, is made to produce a large number of values and only a
few are used initially.


5.15 COMPARISON OF THREE FILE ORGANIZATIONS


Refer to the textbook: Database Management Systems, 3/e, Raghurama Krishnan, Johannes
Gehrke, TMH, Chapter 8 (File Organizations & Indexes), pages 232 to 236.

Review Questions

1. Explain about the measures that are to be considered for comparing the performance of
various file organization techniques.
2. Explain in detail B+ tree file organization.
3. Write short notes on: i) Primary index ii) Clustered index iii) Secondary index.
4. Explain various anomalies that arise due to interleaved execution of transactions with
suitable examples.
5. What is static hashing? What rules are followed for index selection?
6. Define transaction and explain desirable properties of transactions.
7. What is database Recovery? Explain Shadow paging in detail.
8. Explain about Conflict Serializability and view serializability.
9. Explain the following a) Concurrent executions, b) Transaction states.

References:

• Raghurama Krishnan, Johannes Gehrke, Database Management Systems, 3rd Edition, Tata
McGraw Hill.
• C.J. Date, Introduction to Database Systems, Pearson Education.
• Elmasri, Navathe, Fundamentals of Database Systems, Pearson Education.
