DBMS Unit 4
1. Objective
To impart the knowledge of locking-based protocols.
To impart the knowledge of concurrency control techniques.
To impart the knowledge of timestamp-based concurrency control.
To impart the knowledge of locking protocols and the optimistic concurrency control technique.
To impart the knowledge of transaction failures and recovery concepts.
To impart the knowledge of ARIES recovery and its algorithms.
To impart the knowledge of indexing: single-level and multi-level indexing.
To impart the knowledge of structured, semi-structured and unstructured data.
3. 2D Mapping of ILOs with Knowledge Dimension and Cognitive Learning Levels of RBT
Conceptual: A, B
Procedural: C, D, E
Meta Cognitive: -
4. Teaching Methodology
5. Evocation
6. Deliverables
LOCK MANAGEMENT:
It relies on the process of message passing, where transactions and the lock manager exchange messages to handle the locking and unlocking of data items.
It ensures that database transactions are performed concurrently and accurately to produce correct results without violating the data integrity of the database.
To enforce Isolation.
To resolve read-write and write-write conflict issues.
To preserve database consistency.
Concurrency control helps to ensure serializability.
Lock-based protocols help to eliminate concurrency problems in a DBMS for simultaneous transactions by locking or isolating a particular transaction to a single user.
All lock requests are made to the concurrency-control manager. Transactions proceed only once the lock request is granted.
Note: All the data items must be accessed in a mutually exclusive manner. There are two types of locks:
1. Shared lock
2. Exclusive lock
Shared lock:
A shared lock is also called a read-only lock. With a shared lock, the data item can be shared between transactions. It is denoted by S.
For example, if a transaction T1 has obtained a shared lock on item Q, then T1 can read Q but cannot write it. If any other transaction T2 wants to read the data item Q, the database will let it do so by placing another shared lock.
More than one transaction can hold shared locks on the same data item simultaneously.
Exclusive lock:
An exclusive lock is also called a write lock. With an exclusive lock, the data item cannot be shared between transactions.
It is denoted by X.
The holder of an exclusive lock can both read and update the data item.
For example, if a transaction T1 has obtained an exclusive lock on item Q, then T1 can both read and write Q. If any other transaction T2 wants to read or write item Q, the exclusive lock prevents this operation.
An exclusive lock cannot be held concurrently with any other lock on the same data item. Transactions may unlock the data item after finishing the write operation.
T1 (exclusive lock)        T2 (shared lock)
Lock-X(B)
Read(B)
Balance := Balance - 50
Write(B)
Unlock(B)
                           Lock-S(B)
                           Read(B)
                           Unlock(B)
Here transactions T1 and T2 want to access the same data item B. Transaction T1 has applied an exclusive lock, so transaction T2 cannot access the data item B until transaction T1 releases the lock.
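The interaction between shared and exclusive locks can be summed up in a small sketch. The following Python snippet only illustrates the compatibility rule (S is compatible with S; X conflicts with everything); the class and method names are invented for this example and are not taken from any real DBMS.

# Minimal sketch of a lock table enforcing shared/exclusive compatibility.
class LockTable:
    def __init__(self):
        self.locks = {}                       # data item -> (mode, set of holders)

    def request(self, txn, item, mode):
        """Grant the lock if compatible with current holders, else ask the txn to wait."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return "granted"
        held_mode, holders = held
        if mode == "S" and held_mode == "S":  # only S + S is compatible
            holders.add(txn)
            return "granted"
        return "wait"                         # any combination involving X conflicts

    def release(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lt = LockTable()
print(lt.request("T1", "B", "X"))   # granted
print(lt.request("T2", "B", "S"))   # wait: T1 holds an exclusive lock on B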
Problem with simple locking
Consider the above execution phase. Now, T1 holds an exclusive lock over data item B, and T2 holds a shared lock over data item A. If T2 requests a lock on B while T1 requests a lock on A, each waits for the other: this leads to deadlock, and neither can proceed with further execution.
When a transaction needs to wait for an indefinite period to acquire a lock, it leads to starvation. This is mainly because the waiting scheme for locks is not managed properly.
Summary:
If each transaction ends up waiting for the other transaction to release its lock, neither can continue, and this leads to deadlock.
Two-phase locking guarantees conflict-serializable schedules, but, just like the two sides of a coin, it has a few cons too: it can reduce the amount of concurrency in a schedule, it raises transaction processing costs, and it may have unintended consequences. The likelihood of deadlocks is one such consequence.
Growing Phase: In the growing phase, the transaction only obtains locks; it cannot release any lock. Once the transaction releases its first lock, it leaves the growing phase and enters the shrinking phase.
Shrinking Phase: In the shrinking phase, the transaction only releases locks; it cannot obtain any new lock. In the stricter variants described below, the locks are released only when the data changes are committed.
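A two-phase-locked transaction can be pictured as two loops: one that only acquires locks (growing phase) and one that only releases them (shrinking phase). A minimal sketch follows, reusing the illustrative LockTable from the earlier example; the function and variable names are made up for this sketch.

# Sketch of a transaction that obeys two-phase locking.
def transfer_with_2pl(lock_table, txn, src, dst, amount, balances):
    items = [src, dst]
    # Growing phase: acquire every lock that will be needed; nothing is released here.
    for item in items:
        if lock_table.request(txn, item, "X") != "granted":
            raise RuntimeError("would block here until the lock is granted")
    # Lock point: the transaction now holds all of its locks.
    balances[src] -= amount
    balances[dst] += amount
    # Shrinking phase: release locks; no new lock may be acquired from now on.
    for item in items:
        lock_table.release(txn, item)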
Two-Phase Locking Types (2PL types)
Two-phase locking is further classified into three types:
Strict two-phase locking:
The transaction can release a shared lock after the lock point.
The transaction cannot release any exclusive lock until the transaction commits.
Because no other transaction can read uncommitted data, strict 2PL avoids cascading rollbacks. In basic 2PL, by contrast, if one transaction rolls back, other transactions that read its uncommitted data must also roll back; such transactions are dependent on each other, and the result is called a cascading schedule.
Rigorous two-phase locking:
The transaction cannot release either kind of lock before commit, i.e., neither shared locks nor exclusive locks.
Serializability is guaranteed in the rigorous two-phase locking protocol.
Freedom from deadlock is not guaranteed in the rigorous two-phase locking protocol.
Conservative two-phase locking:
The transaction must lock all the data items it requires before the transaction begins.
If any of the data items is not available for locking, then no data items are locked and the transaction waits.
The data items to be read and written must be known before the transaction begins, which is normally not possible.
The conservative two-phase locking protocol is deadlock-free.
The conservative two-phase locking protocol does not ensure a strict schedule.
Cascading Roll Back in 2PL
Deadlock in 2PL
T1 T2
Lock-X(R1) Lock-X(R2)
Read(R1) Read(R2)
Lock-X(R2) Lock-X(R1)
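Deadlocks such as the one above are commonly detected with a waits-for graph: an edge Ti -> Tj means Ti is waiting for a lock held by Tj, and a cycle in the graph means deadlock. A small illustrative sketch follows; the data structures and names are assumptions made for this example.

# Sketch: detect a deadlock as a cycle in the waits-for graph.
def has_cycle(waits_for):
    """waits_for: dict mapping a transaction to the set of transactions it waits on."""
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for nxt in waits_for.get(t, set()):
            if nxt in on_stack:                   # back edge -> cycle -> deadlock
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in waits_for if t not in visited)

# T1 waits for R2 held by T2, while T2 waits for R1 held by T1 -> deadlock
print(has_cycle({"T1": {"T2"}, "T2": {"T1"}}))    # True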
Lecture 39: Concurrency Control Techniques
Several problems that arise when numerous transactions execute simultaneously in a random manner are
referred to as Concurrency Control Problems.
The dirty read problem in DBMS occurs when a transaction reads the data that has been updated by
another transaction that is still uncommitted. It arises due to multiple uncommitted transactions executing
simultaneously.
Example: Consider two transactions A and B performing read/write operations on a data DT in the
database DB. The current value of DT is 1000: The following table shows the read/write operations in A
and B transactions.
Time A B
T1 READ(DT) ------
T2 DT=DT+500 ------
T3 WRITE(DT) ------
T4 ------ READ(DT)
T5 ------ COMMIT
T6 ROLLBACK ------
Transaction A reads the value of data DT as 1000 and modifies it to 1500, which gets stored in the temporary buffer. Transaction B reads DT as 1500 and commits, so the value of DT permanently changes to 1500 in the database DB. Then some server error occurs in transaction A and it has to roll back to its initial value, i.e., 1000. The value read by B was uncommitted ("dirty"), and this is the dirty read problem.
The unrepeatable read problem occurs when two or more different values of the same data are read during
the read operations in the same transaction.
Example: Consider two transactions A and B performing read/write operations on a data DT in the
database DB. The current value of DT is 1000: The following table shows the read/write operations in A
and B transactions.
Time A B
T1 READ(DT) ------
T2 ------ READ(DT)
T3 DT=DT+500 ------
T4 WRITE(DT) ------
T5 ------ READ(DT)
Transaction A and B initially read the value of DT as 1000. Transaction A modifies the value of DT from
1000 to 1500 and then again transaction B reads the value and finds it to be 1500. Transaction B finds two
different values of DT in its two different read operations.
Phantom Read Problem
In the phantom read problem, data is read through two different read operations in the same transaction. In
the first read operation, a value of the data is obtained but in the second operation, an error is obtained
saying the data does not exist.
Example: Consider two transactions A and B performing read/write operations on a data DT in the
database DB. The current value of DT is 1000: The following table shows the read/write operations in A
and B transactions.
Time A B
T1 READ(DT) ------
T2 ------ READ(DT)
T3 DELETE(DT) ------
T4 ------ READ(DT)
Transaction B initially reads the value of DT as 1000. Transaction A deletes the data DT from the database
DB and then again transaction B reads the value and finds an error saying the data DT does not exist in the
database DB.
The lost update problem arises when one transaction's update to a data item is overwritten by another transaction's update.
Example: Consider two transactions A and B performing read/write operations on a data DT in the
database DB. The current value of DT is 1000: The following table shows the read/write operations in A
and B transactions.
Time A B
T1 READ(DT) ------
T2 DT=DT+500 ------
T3 WRITE(DT) ------
T4 ------ DT=DT+300
T5 ------ WRITE(DT)
T6 READ(DT) ------
Transaction A initially reads the value of DT as 1000 and modifies it to 1500; transaction B then modifies the value to 1800 and writes it. When transaction A reads DT again, it finds 1800 rather than the 1500 it wrote, so the update made by transaction A has effectively been lost.
The incorrect summary problem occurs when the sum computed over data items is wrong. This happens when a transaction tries to sum data items using an aggregate function while the value of one of those items gets changed by another transaction.
Example: Consider two transactions A and B performing read/write operations on two data DT1 and DT2
in the database DB. The current value of DT1 is 1000 and DT2 is 2000: The following table shows the
read/write operations in A and B transactions.
Time A B
T1 READ(DT1) ------
T2 add=0 ------
T3 add=add+DT1 ------
T4 ------ READ(DT2)
T5 ------ DT2=DT2+500
T6 READ(DT2) ------
T7 add=add+DT2 ------
Transaction A reads the value of DT1 as 1000. It uses an aggregate function SUM, which calculates the sum of the two data items DT1 and DT2 in the variable add, but in between, the value of DT2 gets changed from 2000 to 2500 by transaction B. The variable add uses the modified value of DT2 and gives the resultant sum as 3500 instead of 3000.
To avoid concurrency control problems and to maintain consistency and serializability during the execution of concurrent transactions, some rules are made. These rules are known as Concurrency Control Protocols.
Lock-Based Protocols
Timestamp-Based Protocols
According to this protocol, every transaction has a timestamp attached to it. The timestamp is based on the time at which the transaction entered the system. In addition, read and write timestamps are associated with every data item; they record the time at which the latest read and write operations on that item were performed, respectively.
The timestamp ordering protocol uses the timestamp values of the transactions to resolve conflicting pairs of operations, thus ensuring serializability among transactions. The following are the notations used to define the protocol for a transaction A on the data item DT:
TS(A): timestamp of transaction A
R-timestamp(DT): read timestamp of data item DT
W-timestamp(DT): write timestamp of data item DT
When transaction A issues a read(DT) operation:
TS(A) < W-timestamp(DT): the transaction will roll back. If the timestamp at which transaction A entered the system is less than the write timestamp of DT, i.e. the latest time at which DT was updated, then A would be reading an overwritten value, so it rolls back. Otherwise the read is executed and R-timestamp(DT) is set to the maximum of R-timestamp(DT) and TS(A).
When transaction A issues a write(DT) operation:
TS(A) < R-timestamp(DT): the transaction will roll back. If the timestamp at which transaction A entered the system is less than the read timestamp of DT, i.e. the latest time at which DT was read, then a younger transaction has already read the old value, so A rolls back.
TS(A) < W-timestamp(DT): the transaction will roll back. If the timestamp at which transaction A entered the system is less than the write timestamp of DT, i.e. the latest time at which DT was updated, then A would be writing an obsolete value, so it rolls back. Otherwise the write is executed and W-timestamp(DT) is set to TS(A).
Thomas' Write Rule: This rule alters the timestamp-ordering protocol to make the schedule view serializable. For the case TS(A) < W-timestamp(DT) on a write operation, where the basic timestamp-ordering protocol would roll the transaction back, Thomas' write rule simply ignores the obsolete write operation.
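The read and write rules above, together with Thomas' write rule, reduce to a few timestamp comparisons per data item. A minimal sketch, assuming integer timestamps; the class and function names are invented for this illustration.

# Sketch of basic timestamp ordering, with Thomas' write rule as an option on writes.
class Item:
    def __init__(self):
        self.r_ts = 0        # latest timestamp that read the item (R-timestamp)
        self.w_ts = 0        # latest timestamp that wrote the item (W-timestamp)
        self.value = None

def read(ts, item):
    if ts < item.w_ts:               # item was already overwritten by a younger txn
        return "rollback"
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(ts, item, value, thomas=True):
    if ts < item.r_ts:               # a younger txn has already read the old value
        return "rollback"
    if ts < item.w_ts:               # obsolete write
        return "ignored" if thomas else "rollback"
    item.w_ts, item.value = ts, value
    return "ok"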
Read phase: In this phase, the transaction reads the data items and keeps the values produced by its operations in a local buffer. No modification is made to the database itself.
Validation phase: In this phase, validation tests are performed that check whether the values in the local buffer can replace the original values in the database without violating serializability.
Validation Test: The validation test is performed for a transaction A executing concurrently with a transaction B such that TS(A) < TS(B). The transactions must satisfy one of the following conditions:
Finish(A) < Start(B): Transaction A finishes its execution before transaction B starts, so the serializability order is maintained.
Start(B) < Finish(A) < Validate(B): The set of data items written by transaction A must not intersect with the set of data items read by transaction B.
Write phase: If the transaction passes the tests of the validation phase, then the values get copied to the
database, otherwise the transaction rolls back.
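The two validation conditions can be checked directly from each transaction's start/finish/validate times and its read and write sets. A minimal sketch follows; the field names (start, finish, validate, read_set, write_set) are assumptions made for this illustration.

# Sketch of the validation test for a younger transaction B against an older A (TS(A) < TS(B)).
def validate(a, b):
    # Condition 1: A finished before B started, so the schedule is effectively serial.
    if a["finish"] < b["start"]:
        return True
    # Condition 2: A finishes before B validates, and A's write set does not
    # intersect B's read set (B never read anything that A was writing).
    if a["finish"] < b["validate"] and not (a["write_set"] & b["read_set"]):
        return True
    return False

A = {"start": 1, "finish": 5, "validate": 4, "read_set": {"DT1"}, "write_set": {"DT1"}}
B = {"start": 3, "finish": 9, "validate": 8, "read_set": {"DT2"}, "write_set": {"DT2"}}
print(validate(A, B))   # True: A's writes do not overlap B's reads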
Example: Consider two transactions A and B performing read/write operations on two data DT1 and DT2
in the database DB. The current value of DT1 is 1000 and DT2 is 2000: The following table shows the
read/write operations in A and B transactions.
Time  A                B
T1    READ(DT1)        ------
T2    ------           READ(DT1)
T3    ------           DT1=DT1-100
T4    ------           READ(DT2)
T5    ------           DT2=DT2+500
T6    READ(DT2)        ------
T7    PRINT(DT2-DT1)   ------
T8    ------           ------
T9    ------           ------
T10   ------           WRITE(DT1)
T11   ------           WRITE(DT2)
The schedule passes the validation test of the validation phase because the timestamp of transaction B is less than that of transaction A. It should be observed that the write operations are performed only after the validation of both transactions; all the operations before the final writes are performed in the local buffer.
Lecture 40: Optimistic Concurrency Control Technique
It is a concurrency control method applied to transactional systems such as relational database
management systems and software transactional memory. Optimistic concurrency control transactions
involve these phases:
Transaction processing, concurrency control and recovery issues have played a major role in conventional databases. Generally, optimistic concurrency control demonstrates a few improvements over pessimistic concurrency controls such as two-phase locking or the timestamp-based protocol.
Phases
Optimistic concurrency control has three phases, which are explained below:
Read Phase
Various data items are read and stored in temporary variables (local copies). All operations are performed on these variables without updating the database.
Validation Phase
All accessed data items are checked to ensure that serializability will not be violated if the transaction's updates are applied to the database. Any conflicting change in a value causes the transaction to roll back. Transaction timestamps are used, and the write-sets and read-sets are maintained.
To check that transaction A does not interfere with transaction B, the following must hold:
TransB completes its write phase before TransA starts its read phase.
Lecture 41: Failures Occurring in Transactions and Recovery
Failure, in terms of a database, can be defined as its inability to execute the specified transaction or as the loss of data from the database. A DBMS is vulnerable to several kinds of failures, and each of these failures needs to be managed differently. There are many reasons that can cause database failures, such as network failure, system crash, natural disasters, carelessness, and sabotage (corrupting the data intentionally).
Transaction Failure:
If a transaction is not able to execute, or it comes to a point from where it cannot go any further, then it fails. This is called a transaction failure, and it can happen for the following reasons:
Logical error: A logical error occurs if a transaction is unable to execute because of some mistake in the program code or some internal error condition.
System error: Here the termination of an active transaction is done by the database system itself due to some system issue, or because the database management system is unable to proceed with the transaction. For example, the system ends an active transaction if it reaches a deadlock condition or if resources are unavailable.
System Crash:
A system crash usually occurs when there is some sort of hardware or software breakdown. Other problems that are external to the system and cause it to stop abruptly or eventually crash include failure of the transaction, operating system errors, power cuts, main-memory crashes, etc.
These types of failures are often termed soft failures and are responsible for the loss of data in volatile memory. It is assumed that a system crash does not have any effect on the data stored in non-volatile storage; this is known as the fail-stop assumption.
Data-transfer Failure:
When a disk failure occurs amid a data-transfer operation, resulting in loss of content from disk storage, such failures are categorized as data-transfer failures. Other reasons for disk failure include disk head crash, disk unreachability, formation of bad sectors, read-write errors on the disk, etc.
In order to recover quickly from a disk failure caused amid a data-transfer operation, the backup copy of the data stored on other tapes or disks can be used. Thus it is a good practice to back up data frequently.
Example:
Social media - WhatsApp.
- WhatsApp is a good example of database recovery. It backs up to Google Drive, which helps in keeping track of the current status of media, texts, documents, etc.
- If WhatsApp has been uninstalled, it can recover all the previous data from the last backup on Google Drive whenever it is re-installed.
If the database is inconsistent (changes need to be undone and redone by using before- and after-images):
- If a transaction crashes, then the recovery manager may undo transactions, i.e., reverse the operations of a transaction.
Deferred update:
Advantages:
1) Any changes made to the data by a transaction are first recorded in a log file and applied to the database only on commit.
Disadvantages:
1) Whenever any transaction is executed, the updates are not made immediately to the database.
2) Increased time taken to recover in case of a system failure.
Immediate update:
Advantages:
1) Whenever any transaction is executed, the updates are made directly to the database, and a log file is also maintained which contains both old and new values.
Disadvantages:
1) Frequent I/O operations while the transaction is active.
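The undo/redo idea behind the before- and after-images mentioned above can be sketched directly: every log record keeps the old and the new value, undo restores the old value while scanning the log backwards, and redo re-applies the new value while scanning forwards. The record layout and names below are assumptions made for this sketch.

# Sketch of undo/redo from a log of (txn, item, before_image, after_image) records.
db  = {"B": 1000}
log = [("T1", "B", 1000, 950)]          # T1 changed B from 1000 to 950

def undo(transaction):
    for txn, item, before, _after in reversed(log):    # newest record first
        if txn == transaction:
            db[item] = before                           # restore the before-image

def redo(transaction):
    for txn, item, _before, after in log:               # oldest record first
        if txn == transaction:
            db[item] = after                            # re-apply the after-image

undo("T1"); print(db)   # {'B': 1000}
redo("T1"); print(db)   # {'B': 950}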
Shadow paging: maintain two page tables during the life of a transaction:
- Current page table
- Shadow page table
The shadow page table is created when the transaction starts by copying the current page table. After this, the shadow page table is saved on disk and is not modified during the transaction.
The current page table is used by the transaction and is updated as the transaction changes pages. At the start of the transaction, both tables are identical.
When a transaction begins, all the entries of the current page table are copied to the shadow page table. In simple words, the ith entry of the current page table and of the shadow page table point to the same address or data.
Advantages:
1) Whenever any failure occurs, the recovery is faster.
2) No undo/redo operations need to be performed.
Disadvantages:
1) It is difficult to extend the algorithm to allow transactions to run concurrently.
2) Data gets fragmented (broken into scattered parts on disk).
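Shadow paging can be sketched with two dictionaries standing in for the page tables: the shadow table is a copy saved when the transaction starts, the current table is updated copy-on-write during the transaction, and commit makes the current table the new authoritative one. The names and "disk addresses" below are purely illustrative.

# Sketch of shadow paging: page number -> disk address of the page holding the data.
import copy

current_page_table = {0: "addr_0", 1: "addr_1"}
shadow_page_table  = copy.deepcopy(current_page_table)    # saved to disk at txn start

def write_page(page_no, new_data, disk):
    # Copy-on-write: the updated page goes to a fresh disk address; only the
    # current page table changes, the shadow table still points at the old version.
    new_addr = "addr_new_%d" % page_no
    disk[new_addr] = new_data
    current_page_table[page_no] = new_addr

def commit():
    global shadow_page_table
    shadow_page_table = copy.deepcopy(current_page_table)  # switch to the new state

def abort():
    global current_page_table
    current_page_table = copy.deepcopy(shadow_page_table)  # old state: no undo/redo needed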
Lecture 43: Checkpointing
A checkpoint is used to declare a point before which the DBMS was in a consistent state and all transactions were committed. During transaction execution, such checkpoints are traced. After execution, transaction log files are created. Upon reaching the savepoint/checkpoint, the old log file is discarded by saving its updates to the database. Then a new log is created for the upcoming operations of the transactions; it is updated until the next checkpoint, and the process continues.
Whenever transaction logs are created in a real-time environment, they eat up a lot of storage space. Also, keeping track of every update and its maintenance may increase the physical space used by the system. Eventually, the transaction log file may become unmanageable as its size keeps growing. This can be addressed with checkpoints: the mechanism of removing all previous transaction logs and storing them permanently on the storage disk is called checkpointing.
The behavior when the system crashes and recovers when concurrent transactions are executed is
shown below:
Transactions and operations of the above diagram:
T1: START before the checkpoint; COMMIT after the checkpoint.
T2: START and COMMIT, both after the checkpoint.
T3: START and COMMIT, both after the checkpoint.
T4: START after the checkpoint; still active when the FAILURE occurs.
The recovery system reads the logs backward from the end to the last checkpoint, i.e. from T4 to T1.
Whenever there is a log with the instructions <Tn, Start> and <Tn, Commit>, or only <Tn, Commit>, that transaction is put in the redo list. T2 and T3 contain <Tn, Start> and <Tn, Commit>, whereas T1 has only <Tn, Commit>; here, T1, T2, and T3 are in the redo list.
Whenever a log record with no commit or abort instruction is found, that transaction is put in the undo list. Here, T4 has <Tn, Start> but no <Tn, Commit>, as it is an ongoing transaction.
All the transactions in the redo list are redone from their log records; all the transactions in the undo list are undone and their log records are discarded.
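The backward scan described above amounts to classifying the transactions recorded after the last checkpoint into a redo list and an undo list. A minimal sketch follows; the record format and the names are assumptions made for this example.

# Sketch: build redo and undo lists from the log records written after the last checkpoint.
# Each record is (txn, action) with action in {"start", "commit"}.
def classify(log_after_checkpoint, active_at_checkpoint):
    committed, started = set(), set(active_at_checkpoint)
    for txn, action in log_after_checkpoint:
        if action == "start":
            started.add(txn)
        elif action == "commit":
            committed.add(txn)
    redo_list = committed                  # committed after the checkpoint
    undo_list = started - committed        # still running when the failure happened
    return redo_list, undo_list

log = [("T2", "start"), ("T2", "commit"), ("T3", "start"),
       ("T1", "commit"), ("T3", "commit"), ("T4", "start")]
print(classify(log, active_at_checkpoint={"T1"}))
# redo list: T1, T2, T3   undo list: T4   (matches the example above)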
Types of Checkpoints
1. Automatic Checkpoint
2. Manual Checkpoint
1. Automatic Checkpoint: These checkpoints occur at regular intervals, such as every hour or every day. The intervals are set by the database administrator. They are generally used by heavily used databases, as these are frequently updated, and they let us recover the data easily in case of failure.
2. Manual Checkpoint: These are checkpoints that are manually set by the database administrator. Manual checkpoints are generally used for smaller databases. They are taken much less frequently, only when the database administrator issues them.
Relevance of Checkpoints
Checkpoints are used for recovery if there is an unexpected shutdown of the database. Checkpoints run at certain intervals and write all dirty pages (modified pages) from the buffer to the physical data files on disk; this is also known as the hardening of dirty pages. In SQL Server, for example, it is a dedicated process that runs automatically at specific intervals. The checkpoint acts as the synchronization point between the database and the transaction log file.
Advantages of Checkpoints
Checkpoints help us in recovering the transactions of the database in case of a random shutdown of the database.
They enhance the consistency of the database when multiple transactions are executing simultaneously.
Checkpoints work as a synchronization point between the database and the transaction log file of the database.
Checkpoint records in the log file are used to prevent unnecessary redo operations.
Since dirty pages are flushed out continuously in the background, checkpointing has a very low overhead and can be done frequently.
Checkpoints provide the baseline information needed for the restoration of the lost state in the event of a system failure.
backup.
A database storage checkpoint can be mounted, allowing regular file system operations to be
performed.
Database checkpoints can be used for application solutions which include backup, recovery or
database modifications.
Disadvantages of Checkpoints
1. Database storage checkpoints can only be used to restore from logical errors (e.g. a human error).
2. Because all the data blocks are on the same physical device, database storage checkpoints cannot be used to restore from physical media failures.
Uses of Checkpoints
1. Recovery
2. Performance Optimization
3. Auditing
Recovery: A checkpoint is one of the key tools which helps in the recovery process of the database. In case of a system failure, the DBMS can use the information stored at the checkpoint to recover the database to its last consistent state.
The speed of recovery in case of a system failure depends on the checkpoint interval set by the database administrator. For example, if the checkpoint interval is set to a shorter duration, it helps in faster recovery, and vice versa. However, if checkpoints have to be written to disk more frequently, this can also impact performance.
Performance Optimization: The checkpoint plays an essential role in the recovery of the database, but it also plays a vital role in improving the performance of the DBMS. This is done by reducing the amount of work that must be done during recovery: unnecessary information can be discarded, which helps to keep the log file small.
Another way in which checkpoints improve the performance of the database is by reducing the amount of data that has to be read from disk in case of recovery. Analyzing the checkpoints clearly helps in minimizing the data that is to be read from the disk, which improves the recovery time.
Auditing: Besides performance optimization, checkpoints can also be used for auditing purposes. Checkpoints help in viewing the database's history and identifying any problem that occurred earlier. In case of any type of failure, database administrators can use the checkpoints to determine when the problem occurred and up to which point the database was in a consistent state.
Lecture 44: ARIES Recovery
Algorithm for Recovery and Isolation Exploiting Semantics (ARIES)
Algorithm for Recovery and Isolation Exploiting Semantics (ARIES) is based on the Write
Ahead Log (WAL) protocol. Every update operation writes a log record which is one of the
following:
1. Undo-only log record: Only the before image is logged. Thus, an undo operation can be
done to retrieve the old data.
2. Redo-only log record: Only the after image is logged. Thus, a redo operation can be
attempted.
3. Undo-redo log record: Both before images and after images are logged.
In it, every log record is assigned a unique and monotonically increasing log sequence number
(LSN). Every data page has a page LSN field that is set to the LSN of the log record
corresponding to the last update on the page. WAL requires that the log record corresponding
to an update make it to stable storage before the data page corresponding to that update is
written to disk. For performance reasons, each log write is not immediately forced to disk. A
log tail is maintained in main memory to buffer log writes. The log tail is flushed to disk when
it gets full. A transaction cannot be declared committed until the commit log record makes it to
disk.
Once in a while the recovery subsystem writes a checkpoint record to the log. The checkpoint
record contains the transaction table and the dirty page table. A master log record is maintained
separately, in stable storage, to store the LSN of the latest checkpoint record that made it to
disk. On restart, the recovery subsystem reads the master log record to find the checkpoint’s
LSN, reads the checkpoint record, and starts recovery from there on.
The recovery process actually consists of 3 phases:
1. Analysis:
The recovery subsystem determines the earliest log record from which the next pass must
start. It also scans the log forward from the checkpoint record to construct a snapshot of
what the system looked like at the instant of the crash.
2. Redo:
Starting at the earliest LSN, the log is read forward and each update redone.
3. Undo:
The log is scanned backward and updates corresponding to loser transactions are undone.
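A compressed sketch of the analysis pass: starting from the checkpoint's copies of the transaction table and dirty page table, the log is scanned forward to rebuild both tables, and the redo pass then starts from the smallest recLSN found. The record layout and names below are simplifications assumed for this sketch, not the exact ARIES structures.

# Sketch of an ARIES-style analysis pass.
# Each log record: (lsn, txn, kind, page) with kind in {"update", "commit", "end"}.
def analysis(log, ckpt_txn_table, ckpt_dirty_pages):
    txn_table   = dict(ckpt_txn_table)       # txn  -> lastLSN
    dirty_pages = dict(ckpt_dirty_pages)     # page -> recLSN (first LSN that dirtied it)
    for lsn, txn, kind, page in log:
        if kind == "end":
            txn_table.pop(txn, None)          # transaction completely finished
        else:
            txn_table[txn] = lsn
            if kind == "update" and page not in dirty_pages:
                dirty_pages[page] = lsn
    redo_start = min(dirty_pages.values()) if dirty_pages else None
    losers = set(txn_table)                   # transactions to be undone in the undo pass
    return redo_start, losers, dirty_pages

log = [(10, "T1", "update", "P5"), (11, "T2", "update", "P7"),
       (12, "T2", "commit", None), (13, "T2", "end", None)]
print(analysis(log, {}, {}))   # (10, {'T1'}, {'P5': 10, 'P7': 11})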
Most crash recovery systems are built around a STEAL/NO-FORCE approach, accepting the risk of writing possibly uncommitted data to disk in order to gain the performance improvement of not having to force all commits to disk. The STEAL policy imposes the need to UNDO transactions and the NO-
FORCE policy imposes the need to REDO transactions. Databases rely on a log that stores transaction
data and system state with enough information to make it possible to undo or redo transactions to
ensure atomicity and durability of the database.
A database log is a file on disk that stores a sequential list of operations on the database. Each entry in
the log is called a log record. During normal database operation, one or more log entries are written to
the log file for each update to the database performed by a transaction. Each entry has a unique
sequence number called the Log Sequence Number, or LSN that uniquely identifies each record in the
log. A second type of log record is a checkpoint. Periodically, these records are written to describe the
overall state of the database at a certain point in time and may contain information about the contents
of the buffer pool, active transactions, or other details depending on the implementation of the
recovery system. The log file contains enough information that we are able to undo the effects of any
incomplete or aborted transactions, and that we are able to redo any effects of committed transactions
that haven’t been flushed to disk.
Log records are only valuable if they can ensure recovery from failure. Write Ahead Logging (WAL) is
a protocol stating that the database must write to disk the log file records corresponding to database
changes before those changes to the database can be written to the main database files. The WAL
protocol ensures that:
1. All log records updating a page are written to non-volatile storage before the page itself is overwritten in non-volatile storage.
2. A transaction is not considered committed until all of its log records have been written to non-volatile storage.
WAL ensures that all log records for an updated page are written to non-volatile storage before the
page itself is allowed to be over-written in non-volatile storage. This ensures that UNDO information
required by a STEAL policy will be present in the log in the event of a crash. It also ensures that a
transaction is not considered committed until all of its log records (including its commit record) have
been written to non-volatile storage. This ensures that REDO information required by NO-FORCE
policy will be present in the log.
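These two guarantees translate into two checks: flush the log at least up to a page's pageLSN before that page is written to disk, and flush the log up to a transaction's commit record before reporting the commit. A minimal sketch with invented names follows.

# Sketch of the write-ahead rule.
flushed_upto = 0                  # highest LSN known to be on stable storage

def flush_log(up_to_lsn):
    global flushed_upto
    # In a real system this would force the in-memory log tail to disk.
    flushed_upto = max(flushed_upto, up_to_lsn)

def write_page_to_disk(page):
    # Rule 1: the log records describing this page must reach disk first.
    if page["page_lsn"] > flushed_upto:
        flush_log(page["page_lsn"])
    # ... only now is it safe to overwrite the page in non-volatile storage ...

def commit(commit_lsn):
    # Rule 2: a transaction is committed only once its commit record is durable.
    flush_log(commit_lsn)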
To highlight the performance advantages achieved with write-ahead logging, imagine that a transaction updates 1000 objects in the database and that each of those objects resides on a different page on disk. Without write-ahead logging, the DBMS would need to write 1000 pages to
disk to successfully complete the transaction. With WAL and a STEAL/NO-FORCE policy, we can
update the data in-memory and write a single log record to disk that includes all information necessary
to REDO the 1000 object update. Writing the remaining 1000 pages to disk can be done
asynchronously without impacting running user-facing transactions.
Lecture 46: Single-Level Indexing
Indexing is used to quickly retrieve particular data from the database. Formally, we can define indexing as a technique that uses data structures to optimize the searching time of a database query in a DBMS.
Indexing reduces the number of disk accesses required to reach a particular data item by internally creating an index table.
An index usually consists of two columns which form a key-value pair. The two columns of the index table (i.e., the key-value pair) contain copies of selected columns of the tabular data of the database.
Here, the Search Key contains a copy of the Primary Key or a Candidate Key of the database table. Generally, we store the selected primary or candidate keys in sorted order so that we can reduce the overall query or search time (from linear to binary).
The Data Reference contains a set of pointers that hold the address of the disk block. The pointed-to disk block contains the actual data referred to by the Search Key. The Data Reference is also called a Block Pointer because it uses block-based addressing.
Indexing Attributes
Indexing is a data structure technique to efficiently retrieve records from the database files based on
some attributes on which the indexing has been done. Indexing in database systems is similar to what
we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
Primary Index − Primary index is defined on an ordered data file. The data file is ordered on
a key field. The key field is generally the primary key of the relation.
Secondary Index − Secondary index may be generated from a field which is a candidate key
and has a unique value in every record, or a non-key with duplicate values.
Clustering Index − Clustering index is defined on an ordered data file. The data file is ordered
on a non-key field.
Ordered Indexing is of two types −
Dense Index
Sparse Index
Dense Index
In a dense index, there is an index record for every search-key value in the database. This makes searching faster but requires more space to store the index records themselves. Each index record contains the search-key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains a search key and a pointer to the data on the disk. To search for a record, we first follow the index record to reach an actual location of the data. If the data we are looking for is not at the location we reach directly by following the index, the system performs a sequential search from there until the desired data is found.
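The difference between the two shows up in how a lookup proceeds: a dense index jumps straight to the record, while a sparse index finds the right block and then scans it sequentially. A small sketch in which plain Python lists and dictionaries stand in for index and data blocks (all names are illustrative):

# Sketch: dense index has an entry per search-key value; sparse index has one entry
# per data block and finishes with a short sequential scan inside the block.
import bisect

data_blocks = [[(10, "rec10"), (20, "rec20")],
               [(30, "rec30"), (40, "rec40")]]

dense_index  = {10: (0, 0), 20: (0, 1), 30: (1, 0), 40: (1, 1)}   # key -> (block, slot)
sparse_index = [(10, 0), (30, 1)]                                  # first key of each block

def dense_lookup(key):
    block, slot = dense_index[key]
    return data_blocks[block][slot][1]

def sparse_lookup(key):
    keys = [k for k, _ in sparse_index]
    block = sparse_index[bisect.bisect_right(keys, key) - 1][1]    # last index entry <= key
    for k, rec in data_blocks[block]:                              # sequential scan
        if k == key:
            return rec
    return None

print(dense_lookup(30), sparse_lookup(30))   # rec30 rec30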
Lecture 47: Multilevel Indexes
Index records comprise search-key values and data pointers. Multilevel index is stored on the disk
along with the actual database files. As the size of the database grows, so does the size of the indices.
There is an immense need to keep the index records in the main memory so as to speed up the search
operations. If single-level index is used, then a large size index cannot be kept in memory which leads
to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order to make the
outermost level so small that it can be saved in a single disk block, which can easily be accommodated
anywhere in the main memory.
B+ Tree
A B+ tree is a balanced multi-way search tree that follows a multi-level index format. The leaf nodes of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height, and is thus balanced. Additionally, the leaf nodes are linked through a linked list; therefore, a B+ tree can support random access as well as sequential access.
Structure of B+ Tree
Every leaf node is at equal distance from the root node. A B+ tree is of the order n where n is fixed for
every B+ tree.
Internal nodes –
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
At most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node can contain n record pointers and n key values.
Every leaf node contains one block pointer P to point to next leaf node and forms a linked list.
B+ Tree Insertion
B+ trees are filled from bottom and each entry is done at the leaf node.
If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o The i-th key is duplicated at the parent of the leaf.
If a non-leaf node overflows −
o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.
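The leaf-split rule above (partition at i = ⌊(m+1)/2⌋, keep the first i entries, move the rest, duplicate the i-th key in the parent) can be sketched in a few lines. This is only the split step, not a full B+ tree; the names are illustrative.

# Sketch of splitting an overflowing B+ tree leaf of order m.
def split_leaf(keys, m):
    """keys: sorted list of keys in a leaf that has overflowed (m+1 keys)."""
    i = (m + 1) // 2                  # partition point i = floor((m+1)/2)
    left  = keys[:i]                  # the first i entries stay in the old leaf
    right = keys[i:]                  # entries from i+1 onwards move to a new leaf
    separator = left[-1]              # the i-th key is duplicated at the parent
    return left, right, separator

print(split_leaf([5, 10, 15, 20, 25], m=4))
# ([5, 10], [15, 20, 25], 10): the key 10 is copied up into the parent node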
B+ Tree Deletion
Lecture 48: Structured, Semi-Structured and Unstructured Data
Types of data:
i. Structured data
ii. Semi-Structured data
iii. Unstructured data
Structured data
Structured data is data whose elements are addressable for effective analysis.
It has been organized into a formatted repository that is typically a database.
It concerns all data which can be stored in an SQL database in a table with rows and columns.
They have relational keys and can easily be mapped into pre-designed fields.
Structured data depends on the existence of a data model - a model of how data can be stored, processed and accessed.
Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields.
This makes structured data extremely powerful (i.e., it is possible to quickly aggregate data from various locations in the database).
Structured data is considered the most 'traditional' form of data storage.
Example: Relational data, Excel files or SQL databases.
Semi-Structured data
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
With some processing, it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to ease storage.
It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
This reduces the complexity of analysis compared to unstructured data.
It is also known as a self-describing structure.
Example: JSON and XML data.
Unstructured data
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model.
Thus it is not a good fit for a mainstream relational database.
There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
Example: Word, PDF, text, media logs, audio, video.
Difference between structured, semi-structured and unstructured data

Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transaction management adapted from the DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows and tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema-dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Very flexible; absence of schema
Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Data organization | Organized by means of a relational database | Partially organized by means of XML/RDF | Based on simple character and binary data
7. Keywords
Transaction
Locking
2PL, Strict 2PL
Concurrency Control
WR,WW,RW conflicts
ARIES
WAL Protocol
Indexing
Structures
Log Record
8. Sample Questions
Remember:
1. Define Locking
2. List out the Locking protocols.
3. Define 2PL.
4. What is Time Stamp?
5. Define RTS.
6. Define ARIES
7. List the types of Indexing
8. Define the purpose of the LSN in a log record.
9. What is a sparse index?
10. Define dense index.
11. List out the types of data
Understand:
1. Explain Lock based protocols.
2. Explain briefly Concurrency Control Techniques with example.
3. Explain types of Conflicts in Concurrency Scheduling.
4. Explain Optimistic concurrency control technique.
5. Explain ARIES algorithm in detail.
6. Explain different types of indexing with examples.
7. Explain checkpoints in detail.
8. Explain the insertion and deletion operations in a B+ tree with an example.
9. Stimulating Question (s)
-----
At the end of this session, the facilitator (teacher) shall randomly pick a few students to summarize the deliverables.
-------------------------