Distributed Database Systems-Chhanda Ray
Copyright
Preface
1. Overview of Relational DBMS
1.1. Concepts of Relational Databases
1.2. Integrity Constraints
1.3. Normalization
1.3.1. Functional Dependencies
1.3.2. Normal Forms
1.4. Relational Algebra
1.4.1. Selection Operation
1.4.2. Projection Operation
1.4.3. Union Operation
1.4.4. Set Difference Operation
1.4.5. Cartesian Product Operation
1.4.6. Intersection Operation
1.4.7. Join Operation
1.4.8. Division Operation
1.5. Relational Database Management System
Chapter Summary
2. Review of Database Systems
2.1. Evolution of Distributed Database System
2.2. Overview of Parallel Processing System
2.2.1. Parallel Databases
2.2.2. Benefits of Parallel Databases
2.2.3. Parallel Database Architectures
Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture
Hierarchical architecture
2.3. Parallel Database Design
2.3.1. Data Partitioning
Round-robin
Hash partitioning
Range partitioning
Chapter Summary
3. Distributed Database Concepts
3.1. Fundamentals of Distributed Databases
3.2. Features of a Distributed DBMS
3.3. Advantages and Disadvantages of Distributed DBMS
3.4. An Example of Distributed DBMS
3.5. Homogeneous and Heterogeneous Distributed DBMSs
3.6. Functions of Distributed DBMS
3.7. Components of a Distributed DBMS
3.8. Date’s 12 Objectives for Distributed Database Systems
Chapter Summary
4. Overview of Computer Networking
4.1. Introduction to Networking
4.2. Types of Computer Networks
4.3. Communication Schemes
4.4. Network Topologies
4.5. The OSI Model
4.6. Network Protocols
4.6.1. TCP/IP (Transmission Control Protocol/Internet Protocol)
4.6.2. SPX/IPX (Sequence Packet Exchange/Internetwork Packet Exchange)
4.6.3. NetBIOS (Network Basic Input/Output System)
4.6.4. APPC (Advanced Program-to-Program Communications)
4.6.5. DECnet
4.6.6. AppleTalk
4.6.7. WAP (Wireless Application Protocol)
4.7. The Internet and the World-Wide Web (WWW)
Chapter Summary
5. Distributed Database Design
5.1. Distributed Database Design Concepts
5.1.1. Alternative Approaches for Distributed Database Design
5.2. Objectives of Data Distribution
5.2.1. Alternative Strategies for Data Allocation
5.3. Data Fragmentation
5.3.1. Benefits of Data Fragmentation
5.3.2. Correctness Rules for Data Fragmentation
5.3.3. Different Types of Fragmentation
Horizontal fragmentation
Vertical fragmentation
Mixed fragmentation
Derived fragmentation
No fragmentation
5.4. The Allocation of Fragments
5.4.1. Measure of Costs and Benefits for Fragment Allocation
Horizontal fragments
Vertical fragments
5.5. Transparencies in Distributed Database Design
5.5.1. Data Distribution Transparency
5.5.2. Transaction Transparency
5.5.3. Performance Transparency
5.5.4. DBMS Transparency
Chapter Summary
6. Distributed DBMS Architecture
6.1. Introduction
6.2. Client/Server System
6.2.1. Advantages and Disadvantages of Client/Server System
6.2.2. Architecture of Client/Server Distributed Systems
6.2.3. Architectural Alternatives for Client/Server Systems
6.3. Peer-to-Peer Distributed System
6.3.1. Reference Architecture of Distributed DBMSs
6.3.2. Component Architecture of Distributed DBMSs
6.3.3. Distributed Data Independence
6.4. Multi-Database System (MDBS)
6.4.1. Five-Level Schema Architecture of Federated MDBS
Reference architecture of tightly coupled federated MDBS
Reference architecture of loosely coupled federated MDBS
Chapter Summary
7. Distributed Transaction Management
7.1. Basic Concepts of Transaction Management
7.2. ACID Properties of Transactions
7.3. Objectives of Distributed Transaction Management
7.4. A Model for Transaction Management in a Distributed System
7.5. Classification of Transactions
Chapter Summary
8. Distributed Concurrency Control
8.1. Objectives of Distributed Concurrency Control
8.2. Concurrency Control Anomalies
8.3. Distributed Serializability
8.4. Classification of Concurrency Control Techniques
8.5. Locking-based Concurrency Control Protocols
8.5.1. Centralized 2PL
8.5.2. Primary Copy 2PL
8.5.3. Distributed 2PL
8.5.4. Majority Locking Protocol
8.5.5. Biased Protocol
8.5.6. Quorum Consensus Protocol
8.6. Timestamp-Based Concurrency Control Protocols
8.6.1. Basic Timestamp Ordering (TO) Algorithm
8.6.2. Conservative TO Algorithm
8.6.3. Multi-version TO Algorithm
8.7. Optimistic Concurrency Control Technique
Chapter Summary
9. Distributed Deadlock Management
9.1. Introduction to Deadlock
9.2. Distributed Deadlock Prevention
9.3. Distributed Deadlock Avoidance
9.4. Distributed Deadlock Detection and Recovery
9.4.1. Centralized Deadlock Detection
9.4.2. Hierarchical Deadlock Detection
9.4.3. Distributed Deadlock Detection
9.4.4. False Deadlocks
Chapter Summary
10. Distributed Recovery Management
10.1. Introduction to Recovery Management
10.2. Failures in a Distributed Database System
10.3. Steps Followed after a Failure
10.4. Local Recovery Protocols
10.4.1. Immediate Modification Technique
10.4.2. Shadow Paging
10.4.3. Checkpointing and Cold Restart
10.5. Distributed Recovery Protocols
10.5.1. Two-Phase Commit Protocol (2PC)
Termination protocols for 2PC
Coordinator
Participant
Recovery protocols for 2PC
Coordinator failure
Participant failure
Communication schemes for 2PC
10.5.2. Three-Phase Commit Protocol
Termination protocols for 3PC
Coordinator
Participant
Recovery protocols for 3PC
Election protocol
10.6. Network Partition
10.6.1. Pessimistic Protocols
10.6.2. Optimistic Protocols
Chapter Summary
11. Distributed Query Processing
11.1. Concepts of Query Processing
11.2. Objectives of Distributed Query Processing
11.3. Phases in Distributed Query Processing
11.3.1. Query Decomposition
Normalization
Analysis
Simplification
Query restructuring
11.3.2. Query Fragmentation
Reduction for horizontal fragmentation
Reduction for vertical fragmentation
Reduction for derived fragmentation
Reduction for mixed fragmentation
11.3.3. Global Query Optimization
Search space
Optimization strategy
Distributed cost model
Cost functions
Database statistics
Cardinalities of intermediate results
11.3.4. Local Query Optimization
INGRES algorithm
11.4. Join Strategies in Fragmented Relations
11.4.1. Simple Join Strategy
11.4.2. Semijoin Strategy
11.5. Global Query Optimization Algorithms
11.5.1. Distributed INGRES Algorithm
11.5.2. Distributed R* Algorithm
11.5.3. SDD-1 Algorithm
Chapter Summary
12. Distributed Database Security and Catalog Management
12.1. Distributed Database Security
12.2. View Management
12.2.1. View Updatability
12.2.2. Views in Distributed DBMS
12.3. Authorization and Protection
12.3.1. Centralized Authorization Control
12.3.2. Distributed Authorization Control
12.4. Semantic Integrity Constraints
12.5. Global System Catalog
12.5.1. Contents of Global System Catalog
12.5.2. Catalog Management in Distributed Systems
Chapter Summary
13. Mobile Databases and Object-Oriented DBMS
13.1. Mobile Databases
13.1.1. Mobile DBMS
13.2. Introduction to Object-Oriented Databases
13.3. Object-Oriented Database Management Systems
13.3.1. Features of OODBMS
13.3.2. Benefits of OODBMS
13.3.3. Disadvantages of OODBMS
Chapter Summary
14. Distributed Database Systems
14.1. SDD-1 Distributed Database System
14.2. General Architecture of SDD-1 Database System
14.2.1. Distributed Concurrency Control in SDD-1
Conflict graph analysis
Timestamp-based protocols
14.2.2. Distributed Query Processing in SDD-1
Access planning
Distributed execution
14.2.3. Distributed Reliability and Transaction Commitment in SDD-1
Guaranteed delivery
Transaction control
The write rule
14.2.4. Catalog Management in SDD-1
14.3. R* Distributed Database System
14.4. Architecture of R*
14.5. Query Processing in R*
14.6. Transaction Management in R*
14.6.1. The Presumed Abort Protocol
14.6.2. The Presumed Commit Protocol
Chapter Summary
15. Data Warehousing and Data Mining
15.1. Concepts of Data Warehousing
15.1.1. Benefits of Data Warehousing
15.1.2. Problems in Data Warehousing
15.1.3. Data Warehouses and OLTP Systems
15.2. Data Warehousing Architecture
15.2.1. Operational Data Source
15.2.2. Load Manager
15.2.3. Query Manager
15.2.4. Warehouse Manager
15.2.5. Detailed Data
15.2.6. Summarized Data
15.2.7. Archive/Backup Data
15.2.8. Metadata
15.2.9. End-User Access Tools
15.2.10. Data Warehouse Background Processes
15.3. Data Warehouse Schema
15.3.1. Star Schema
15.3.2. Snowflake Schema
15.3.3. Fact Constellation Schema
15.4. Data Marts
15.5. Online Analytical Processing
15.5.1. OLAP Tools
15.6. Introduction to Data Mining
15.6.1. Knowledge Discovery in Database (KDD) Vs. Data Mining
15.7. Data Mining Techniques
15.7.1. Predictive Modeling
15.7.2. Clustering
15.7.3. Link Analysis
15.7.4. Deviation Detection
Chapter Summary
Bibliography
Copyright
This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired
out, or otherwise circulated without the publisher’s prior written consent in any form of binding or cover other
than that in which it is published and without a similar condition including this condition being imposed on
the subsequent purchaser. Without limiting the rights under copyright reserved above, no part of this
publication may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or
by any means (electronic, mechanical, photocopying, recording or otherwise), without the prior written
permission of both the copyright owner and the above-mentioned publisher of this book.
ISBN 978-81-317-2718-8
First Impression
Published by Dorling Kindersley (India) Pvt. Ltd., licensees of Pearson Education in South Asia
Head Office: 7th Floor, Knowledge Boulevard, A-8 (A), Sector 62, NOIDA, 201 309, UP, India.
Registered Office: 14 Local Shopping Centre, Panchsheel Park, New Delhi 110 017, India
Typeset by Integra Software Services Pvt. Ltd., Pondicherry, India.
Printed in India at Kumar Offset Printers.
Preface
The need for managing large volumes of data has led to the evolution of database technology. In recent years,
the amount of data handled by a database management system (DBMS) has been increasing continuously, and
it is no longer unusual for a DBMS to manage data sizes of several hundred gigabytes to terabytes. This is
owing to the growing need for DBMSs to exhibit more sophisticated functionality such as the support of
object-oriented, deductive, and multimedia-based applications. Also, the evolution of the Internet and the
World Wide Web has increased the number of DBMS users tremendously. The motivation behind the
development of centralized database systems was the need to integrate the operational data of an organization,
and to provide controlled access to the data centrally. As the size of the user base and of the data increases, traditional DBMSs that run on a single powerful mainframe system face difficulty in meeting the I/O and CPU performance requirements, as they cannot scale up well to such changes.
To achieve the performance requirements, database systems are increasingly required to make use of
parallelism, which results in one of the two DBMS architectures: parallel DBMS or distributed DBMS. A
parallel DBMS can be defined as a DBMS implemented on a tightly coupled multiprocessor system. The goals
of parallelism are two-fold: speedup and scaleup. A distributed DBMS is a software system that provides the
functionality for managing distributed databases and is implemented on a loosely coupled multiprocessor
system. A distributed database is a logically interrelated collection of shared data that is physically distributed
over a computer network. In the 1980s, a series of critical social and technological changes had taken place
that affected the design and development of database technology. During recent times, the rapid developments
in network and data communication technologies have changed the mode of working from a centralized to a
decentralized manner. Currently, distributed DBMS is gaining acceptance as a more sophisticated modern
computing environment.
This book focuses on the key concepts of the distributed database technology. In this book, all concepts related
to distributed DBMS such as distributed database fundamentals, distributed database design, distributed
DBMS architectures, distributed transaction management, distributed concurrency control, deadlock handling
in distributed system, distributed recovery management, distributed query processing, data security and
catalog management in distributed system and related concepts are briefly presented. Two popular distributed
database systems, R* and SDD-1, are discussed as case studies. Moreover, the basic concepts of mobile
databases and object-oriented DBMS are introduced. A full chapter is devoted to data warehousing, data
mining and online analytical processing (OLAP) technology. In this book, each topic is illustrated with
suitable examples. This book is intended for those who are professionally interested in distributed data
processing including students and teachers of computer science and information technology, researchers,
application developers, analysts and programmers.
Chapter 1 introduces the fundamentals of relational databases. The relational data model terminology,
relational algebra, normalization and other related concepts are discussed in this chapter. Chapter 2 introduces the evolution of distributed database technology and some key concepts of parallel database systems. The benefits of parallel databases and the alternative architectural models for parallel databases, namely, shared-memory, shared-disk, shared-nothing and hierarchical, are described. In this chapter, parallel database design
has also been discussed. The fundamentals of distributed database systems are presented in Chapter 3. This
chapter introduces the benefits and limitations of distributed systems over centralized systems, objectives of
a distributed system, components of a distributed system, types of distributed DBMS, and the functionality
provided by a distributed database system.
The architecture of a system reflects the structure of the underlying system. Owing to the versatility of
distributed database systems it is very difficult to describe a general architecture for distributed DBMS.
In Chapter 6, distributed data independence and the reference architectures of distributed database systems
such as client/server, peer-to-peer and multi-database systems (MDBS) are briefly presented. This chapter
also highlights the differences between federated MDBS and non-federated MDBS and the corresponding
reference architectures. Chapter 7 focuses on distributed transaction management. In a distributed DBMS, it
is necessary to maintain the ACID properties of local transactions as well as global transactions to ensure data
integrity. A framework for transaction management in distributed system is illustrated in Chapter 7. Moreover,
this chapter also introduces the different categories of transactions.
In a distributed environment, concurrent accesses to data are allowed in order to increase performance, but the
database may become inconsistent owing to concurrent accesses. To ensure consistency of data, distributed
concurrency control techniques are described in Chapter 8. In this chapter, degrees of consistency and
distributed serializability are also discussed. Different locking protocols and timestamp-based protocols used
for achieving serializability are clearly explained in Chapter 8. Chapter 9 focuses on deadlock handling in a
distributed environment. In this chapter, different deadlock detection algorithms and recovery from deadlock
are presented in detail.
In a distributed environment, different kinds of failures may occur such as node failure, link failure and
network partition, in addition to the other possible failures in a centralized environment. Two popular
distributed recovery protocols, namely, two-phase commit protocol (2PC) and three-phase commit protocol
(3PC), which ensure the consistency of data in spite of failures, are introduced in Chapter 10. The checkpointing
mechanism and cold restart are also described in Chapter 10. Chapter 11 presents distributed query processing
techniques. This chapter briefly introduces distributed query transformation, distributed query optimization,
and the different join strategies and distributed query optimization algorithms used for distributed query
processing.
Chapter 12 focuses on distributed security control mechanisms used for ensuring security of distributed data.
In this chapter, various security control techniques such as authorization and protection, and view management,
are described in detail. In addition, semantic data control and global system catalog management are also
discussed in Chapter 12. Chapter 13 focuses on the preliminary concepts of mobile databases and object-
oriented database management systems. Chapter 14 introduces the R* and SDD-1 distributed database systems, which are presented as case studies. The concluding
chapter, Chapter 15, is devoted to the fundamentals of data warehousing, data mining and OLAP technologies.
Chapter 1. Overview of Relational DBMS
This chapter presents an overview of relational databases. Normalization is a very important
concept in relational databases. In this chapter, the process of normalization is explained
with suitable examples. Integrity constraints, another important concept in relational data
model, are also briefly discussed here. Relational algebra and the overall concept of
relational database management system have also been illustrated in this chapter with
appropriate examples.
The organization of this chapter is as follows: Section 1.1 presents the basic concepts of
relational databases. In Section 1.2, integrity constraints are described with examples. The
process of normalization is illustrated with examples in Section 1.3. Section 1.4 introduces
the formal database language relational algebra, and overall concept of relational DBMS is
discussed in Section 1.5.
The relational data model was first introduced by E.F. Codd of IBM Research in 1970. The
relational model is based on the concept of mathematical relation, and it uses set theory and
first-order predicate logic as its theoretical basis. In relational data model, the database is
expressed as a collection of relations. A relation is physically represented as a table in
which each row corresponds to an individual record, and each column corresponds to an
individual attribute, where attributes can appear in any order within the table. In formal
relational data model terminology, a row is called a tuple, a column header is called an
attribute, and the table is called a relation. The degree of a relation is represented by the
number of attributes it contains, while the cardinality of a relation is represented by the
number of tuples it contains. For each attribute in the table there is a set of permitted values,
known as the domain of the attribute. Domains may be distinct for each attribute or two or
more attributes may be defined in the same domain. A named relation defined by a set of
attributes and domain name pairs is called relational schema. Formally, a relation can be
defined as follows:
A relation R defined over a set of n attributes A1, A2, ..., An with domains D1, D2, ..., Dn is a set of n-tuples denoted by <d1, d2, ..., dn> such that d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn. Hence, the set {A1:D1, A2:D2, ..., An:Dn} represents the relational schema.
By definition, all tuples within a relation should be distinct. To maintain the uniqueness of
tuples in a relation, the concept of different relational keys is introduced into the relational
data model terminology. The key of a relation is the non-empty subset of its attributes that is
used to identify each tuple uniquely in a relation. The attributes of a relation that make up the
key are called prime attributes or key attributes. In contrast, the attributes of a relation that
do not participate in key formation are called non-prime attributes or non-key attributes.
Any superset of a key is known as a superkey. A superkey uniquely identifies each
tuple within a relation, but a superkey may contain additional attributes. A candidate key is
a superkey such that no proper subset of it is a superkey within the relation. Thus, a minimal
superkey is called candidate key. In general, a relational schema can have more than one
candidate key among which one is chosen as primary key during design time. The candidate
keys that are not selected as primary key are called alternate keys.
The information stored in a table may vary over time, thereby producing many instances of
the same relation. Thus, the term relation refers to a relation instance. The value of an
attribute in a relation may be undefined. This special value of the attribute is generally
expressed as null value. The concept of relational database is illustrated in the following
example.
Example 1.1.
Let us consider an organization that has a number of departments where each department has
a number of employees. The entities that are considered here are employee and department.
The information that is to be stored in the database for each employee are employee code
(unique for each employee), employee name, designation of the employee, salary,
department number of the employee and voter identification number (unique for each
employee). Similarly, department number (unique for each department), department name
and department location for each department are to be stored in the database. The relational
schemas for this database can be defined as follows:
In Employee relation there are six attributes, and in Department relation there are three
attributes. Thus, the degree of the relations Employee and Department are 6 and 3,
respectively. For each attribute in the above relations, there is a corresponding domain,
which need not necessarily be distinct. If D1 is the domain of the attribute emp-id in
Employee relation, then all values of emp-id in employee records must come from the
domain D1. The primary keys for the relations Employee and Department are emp-id and
deptno, respectively. In Employee relation, there are two candidate keys: emp-id and voter-
id. In this case, voter-id is called alternate key. These relations are physically represented in
tabular format in figure 1.1.
Integrity Constraints
The term data integrity, which is expressed by a set of integrity rules or constraints, refers to
the correctness of data. A constraint specifies a condition or a proposition that must be maintained as true. For instance, a domain constraint imposes a restriction on the set of values of an attribute, indicating that the values must be drawn only from the
associated domain. In relational data model, there are two principal integrity constraints
known respectively as entity integrity constraint and referential integrity constraint.
Entity integrity constraint specifies that no attribute of a primary key in a relation can be null. This integrity rule ensures that no subset of the primary
key is sufficient to provide uniqueness of tuples. To define referential integrity constraint it
is necessary to define the foreign key first. A foreign key is an attribute or a set of attributes
within one relation that matches the candidate key of some other relation (possibly the same
relation). In relational databases, sometimes it requires to ensure that a value that appears in
one relation for a given attribute (or a given set of attributes) also appears for a certain
attribute (or a certain set of attributes) in another relation. This is known as referential
integrity constraint, which is expressed in terms of foreign key. Referential integrity
constraint specifies that if a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in another relation (possibly the same relation), or
the foreign key value must be null. The relation that contains the foreign key is
called referencing relation, and the relation that contains the candidate key is
called referenced relation.
Database modifications may violate the referential integrity constraint. Insertions into the referencing relation, deletions from the referenced relation, and updates to either relation can all lead to such violations, as the following example illustrates.
Example 1.2.
Let us consider the same relations Employee and Department, which were described
in example 1.1. In this case, the attribute deptno of the Employee relation is a foreign key, and it references the attribute deptno, the candidate key of the Department relation. Hence, the Employee relation is called the referencing relation and the Department relation is
called referenced relation.
To preserve referential integrity between these two relations, during insertion of new records
into the Employee relation it must be ensured that the department numbers mentioned in
these new employee records are valid and they match with the existing department numbers
of the Department relation. While deleting records from the Department relation, either all
records in the Employee relation whose department numbers match with department
numbers of the deleted records of Department relation should be deleted, or department
numbers of those records in Employee relation should be updated. While updating records in
Employee relation, it must be ensured that the department numbers specified in employee
records are valid. Similarly, no violations of referential integrity constraint must be made
while updating records in Department relation.
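The referential integrity check described above can be made concrete with a small Python sketch. The tuple values below are invented for illustration and are not taken from the book's figures; the sketch simply verifies that every non-null deptno value in the referencing relation appears as a deptno value in the referenced relation.

def referential_integrity_ok(referencing, referenced, fk, ck):
    # Every non-null foreign key value must match some candidate key value.
    valid = {row[ck] for row in referenced}
    return all(row[fk] is None or row[fk] in valid for row in referencing)

department = [{"deptno": 10}, {"deptno": 20}, {"deptno": 40}]
employee = [
    {"emp-id": "E01", "deptno": 10},
    {"emp-id": "E02", "deptno": None},   # a null foreign key is permitted
    {"emp-id": "E03", "deptno": 30},     # violation: department 30 does not exist
]
print(referential_integrity_ok(employee, department, "deptno", "deptno"))  # prints False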
Normalization
The process of normalization was first developed by E.F. Codd in 1972. In relational
databases, normalization is a step-by-step process of replacing a given relation with a collection of simpler relations to achieve a better data representation. The main
objective of normalization is to eliminate different anomalies that may occur owing to
referential integrity constraint [described in Section 1.2] and to identify a suitable set of
relations in database design. In normalization, a series of tests are performed to determine
whether a relation satisfies the requirements of a given normal form or not. Initially, three
normal forms were introduced, namely, 1NF (First normal form), 2NF (Second normal form)
and 3NF (Third normal form). In 1974, a stronger version of 3NF called Boyce–Codd
normal form (BCNF) was developed. All these normal forms are based on functional
dependencies among the attributes in a relation. Later, higher normal forms such as 4NF
(Fourth normal form) and 5NF (Fifth normal form) were introduced. 4NF is based on multi-
valued dependencies, while 5NF is based on join dependencies.
Functional Dependencies
Functional dependencies are used to decompose relations in the normalization process.
Formally, a functional dependency can be defined as follows: for a relation R and attribute sets X and Y of R, Y is said to be functionally dependent on X, written X → Y, if and only if each value of X in R is associated with exactly one value of Y; that is, whenever two tuples of R agree on X, they also agree on Y.
Example 1.3.
In this example, the Employee relation, which was defined in example 1.1, is considered.
Assume that (emp-id, voter-id) is a composite key of the Employee relation. Hence, all non-
prime attributes of the Employee relation are functionally dependent on the key attribute
(emp-id, voter-id).
The above functional dependency is not a full functional dependency, because even if either the voter-id or the emp-id attribute is removed from the composite key, the dependency still holds. In this case, the functional dependency is partial. Further, the attributes on the right-hand side of the arrow are not a subset of the attributes on the left-hand side of the arrow; hence, the above functional dependency is non-trivial.
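The definition of a functional dependency can also be tested mechanically over a set of tuples: X → Y holds exactly when no two tuples agree on X but differ on Y. The following Python sketch illustrates this; the tuple values are invented and are not the book's data.

def fd_holds(rows, X, Y):
    # X -> Y holds if each X-value is associated with exactly one Y-value.
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False   # two tuples agree on X but differ on Y
        seen[x_val] = y_val
    return True

employee = [
    {"emp-id": "E01", "deptno": 10, "designation": "Manager"},
    {"emp-id": "E02", "deptno": 10, "designation": "Clerk"},
    {"emp-id": "E03", "deptno": 20, "designation": "Manager"},
]
print(fd_holds(employee, ["emp-id"], ["deptno"]))   # True: emp-id -> deptno
print(fd_holds(employee, ["deptno"], ["emp-id"]))   # False: deptno does not determine emp-id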
Example 1.4.
Let us consider the following relational schema, which represents the course information
offered by an institution:
In the above Course relation, course-id is the key attribute. In this relation, course fees are
dependent on course durations and course durations are dependent on course names.
Hence, the attribute fees is functionally dependent on the attribute course-name via rules of
transitivity. In this case, course-name → fees is a transitive functional dependency.
Example 1.5.
In this example, the relational schema Emp-proj is considered, which is defined as follows:
Assume that in the above Emp-proj relation, one employee can work in multiple projects,
and one employee can have more than one hobby. Hence, for each value of the attribute
emp-id, there exists a set of values for the attributes project-no and hobby in the Emp-proj
relation, but the set of values for these multi-valued attributes are not dependent on each
other. In this case, the attributes project-no and hobby are multi-valued dependent on the
attribute emp-id. These multi-valued dependencies are written as emp-id →→ project-no and emp-id →→ hobby.
Normal Forms
In each normal form, there are some known requirements, and a hierarchical relationship
exists among the normal forms. Therefore, if a relation satisfies the requirements of 2NF,
then it also satisfies the requirements of 1NF. A relation is said to be in 1NF, if domains of
the attributes of the relation are atomic. In other words, a relation that is in 1NF should be
flat with no repeating groups. A relation in 1NF must not contain any composite attribute or
multi-valued attribute. To convert a relation to 1NF, composite attributes are to be
represented by their component attributes, and multi-valued attributes are to be represented
by using separate tables.
A relation is in 2NF, if it is in 1NF, and all non-prime attributes of the relation are fully functionally dependent on the key attribute. To convert a relation from 1NF to 2NF, all partial functional dependencies are to be removed.
A relation is in 3NF, if it is in 2NF, and no non-prime attribute of the relation is transitively dependent on the key attribute. To convert a relation from 2NF to 3NF, all transitive functional dependencies are to be removed.
A stronger version of 3NF is BCNF. Before defining BCNF, it is necessary to introduce the
concept of determinant. Determinant refers to the attribute or group of attributes on the left-
hand side of the arrow of a functional dependency. For example, in the functional
dependency X → YZ, X is called the determinant. A relation is in BCNF, if it is in 3NF, and
every determinant is a candidate key.
A relation is in 4NF, if it is in BCNF, and it does not contain any non-trivial multi-valued dependency. Relations that satisfy the requirements of 5NF are very rare in real-life applications, so 5NF is not discussed further here.
Example 1.6.
Let us consider the following relational schema, which represents the student information:
In this relation, sreg-no is the key attribute and address is a composite attribute that consists
of the simple attributes street and city. Further, each course has a unique course identification
number and a number of students can enroll for each course. The set of functional
dependencies that exist among the attributes of the Student relation are listed below:
The above relation does not satisfy the requirements of 1NF, because it contains a composite
attribute address. To transform this relation into 1NF, the composite attribute address is to be
represented by its components, street and city. The student table depicted in figure 1.3 is in
1NF.
The above relation is also in 2NF because all non-prime attributes are fully functionally
dependent on the key attribute sreg-no. Now, the Student relation does not satisfy the
requirements of 3NF, because it contains transitive dependencies. To convert this relation
into 3NF, it is decomposed into two relations STU1 and STU2, respectively as shown
in figure 1.4
The course-id attribute of STU2 relation is a foreign key of the relation STU1. After
decomposition, the above relations are also in BCNF, because the determinants sreg-no and
course-id are candidate keys.
Example 1.7.
In this example, the Emp-proj relation of Example 1.5 is considered. This relation contains
two multi-valued dependencies: emp-id →→ project-no and emp-id →→ hobby.
By definition of 4NF, the Emp-proj relation is not in 4NF because it contains non-trivial
multi-valued dependencies. To convert this relation into 4NF, it is decomposed into two
relations PROJ1 and PROJ2 as illustrated in figure 1.5.
PROJ1
emp-id project-no
E01 P01
E02 P02
E02 P04
PROJ2
emp-id hobby
E01 Reading
E01 Singing
E02 Swimming
E02 Reading
In this case, the attribute emp-id of PROJ2 relation is a foreign key of the relation PROJ1.
Relational Algebra
Relational algebra is a high-level procedural language that is derived from mathematical set
theory. It consists of a set of operations that are used to produce a new relation from one or
more existing relations in a database. Each operation takes one or more relations as input and
produces a new relation as output, which can be used further as input to another operation.
This property, known as closure, allows expressions to be nested in the relational algebra.
In 1972, Codd originally proposed eight operations of relational algebra, but several others
have also been developed subsequently. The five fundamental operations of relational
algebra are selection, projection, Cartesian product, union and set difference. The first two
among these are unary operations and the last three are binary operations. In addition, there
are three derived operations of relational algebra, namely, join, intersection and division,
which can be represented in terms of the five fundamental operations.
Selection Operation
The selection operation operates on a single relation and selects those tuples from the relation that satisfy a given predicate. It is denoted by the symbol σ (sigma), and the predicate appears as a subscript of it. The predicate may be simple or complex. Complex predicates can be represented by using the logical connectives ∧ (AND), ∨ (OR) and ¬ (NOT).
The comparison operators that are used in selection operation are <, >, =, ≠, ≤ and ≥.
Example 1.8.
Consider the query “Retrieve all those students whose city is Kolkata and enrolled for the
courses B.Tech or M.Tech”, which involves the Student relation of Example 1.6. Using
selection operation the above query can be represented as follows:
σ city = ‘Kolkata’ ∧ (course-name = ‘B.Tech’ ∨ course-name = ‘M.Tech’) (Student)
Projection Operation
The projection operation works on a single relation and defines a new relation that contains a
vertical subset of the attributes of the input relation. This operation extracts the values of
specified attributes and eliminates duplicates. It is denoted by the symbol ∏.
Example 1.9.
By using projection operation, the query “Retrieve registration number, name and course
name for all students” can be expressed as:
∏ sreg-no, name, course-name (Student)
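The two example queries above can be mimicked in a few lines of Python by modelling a relation as a list of dictionaries. This is only an illustrative sketch; the Student tuples and attribute values shown are assumptions, not data from the book.

def select(relation, predicate):
    # Selection: keep the tuples that satisfy the predicate.
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    # Projection: keep only the named attributes and eliminate duplicate tuples.
    seen, result = set(), []
    for t in relation:
        reduced = tuple((a, t[a]) for a in attributes)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result

student = [
    {"sreg-no": "S1", "name": "Asha", "city": "Kolkata", "course-name": "B.Tech"},
    {"sreg-no": "S2", "name": "Ravi", "city": "Delhi", "course-name": "M.Tech"},
    {"sreg-no": "S3", "name": "Mita", "city": "Kolkata", "course-name": "M.Tech"},
]
kolkata = select(student, lambda t: t["city"] == "Kolkata"
                 and t["course-name"] in ("B.Tech", "M.Tech"))
print(project(student, ["sreg-no", "name", "course-name"]))
print(project(kolkata, ["course-name"]))   # duplicates eliminated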
Union Operation
The union operation is a binary operation, and it is denoted by the symbol ∪. The union
operation of two relations R and S selects all tuples that are in R, or in S, or in both,
eliminating duplicate tuples. Hence, R and S must be union-compatible. Two relations are
said to be union-compatible, if they have the same number of attributes and each pair of
corresponding attributes have the same domain.
Intersection Operation
The intersection operation of two relations R and S, denoted by R ∩ S, defines a new relation
consisting of the set of tuples that are in both R and S. Hence, R and S must be union-
compatible. In terms of the set difference operation, the intersection operation, R ∩ S, can be
represented as R − (R − S).
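Treating two union-compatible relations as Python sets of tuples, the identity R ∩ S = R − (R − S) can be verified directly; the values below are invented purely for the check.

R = {("E01", 10), ("E02", 20), ("E03", 30)}
S = {("E02", 20), ("E04", 40)}
assert (R & S) == (R - (R - S)) == {("E02", 20)}   # both give the intersection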
Join Operation
The join operation is derived from the Cartesian product operation of relational algebra. We
can say join operation is a combination of the operations selection, projection and Cartesian
product. There are various forms of join operation such as theta join, equi join, natural join,
outer join and semijoin. The outer join operation is further classified into full outer join, left
outer join and right outer join.
1. Theta join (ϑ) –. The most common join operation is theta join. The theta join
operation of two relations R and S defines a relation that contains tuples satisfying a
join predicate from the Cartesian product of R and S. The join predicate is defined
as R.A1 ϑ S.B1, where ϑ may be one of the comparison operators <, >, =, ≠, ≤ and ≥. In terms of selection and Cartesian product operations, the theta join operation with join predicate F can be represented as follows:
R ⋈F S = σF (R × S).
2. Equi join –. Equi join operation is a theta join operation where the comparison
operator equality (=) is used in the join predicate.
3. Natural join –. The natural join operation is an equi join operation performed over all
the attributes in two relations that have the same name. One occurrence of each
common attribute is eliminated from the resultant relation.
4. Outer join –. Generally, a join operation selects tuples from two relations that
satisfies the join predicate. In outer join operation, those tuples that do not satisfy the
join predicate also appear in the resultant relation. The left outer join operation on
two relations R and S selects unmatched tuples from the relation R along with the
matched tuples from both the relations R and S that satisfy the join predicate. Missing
values in the relation S are set to null. The right outer join operation on two
relations R and S selects unmatched tuples from the relation S along with the matched
tuples from both the relations R and S that satisfy the join predicate. In the case of
right outer join, the missing values in the relation R are set to null. In the case of full
outer join, unmatched tuples from both the relations R and S appear in the resultant
relation along with the matched tuples.
5. Semijoin –. The semijoin operation on two relations R and S defines a new relation
that contains the tuples of R that participate in the join of R with S. Hence, only the
attributes of relation R are projected in the resultant relation. The advantage of
semijoin operation is that it decreases the number of tuples that are required to be
handled to form the join. The semijoin operation is very useful for performing join
operations in distributed databases. The semijoin operation can be rewritten as follows
using the projection and join operations of relational algebra:
R ⋉F S = ∏A (R ⋈F S), where A is the set of attributes of R. (A small illustrative sketch of the semijoin follows this list.)
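The following Python sketch (with invented tuples) computes the semijoin of two relations on a common attribute. Note that only the join column of the second relation is needed, which is precisely why semijoins reduce the amount of data that must be shipped between sites when joins are evaluated in a distributed database.

def semijoin(R, S, attribute):
    # Tuples of R that would participate in the join of R with S.
    s_values = {s[attribute] for s in S}   # only the join column of S is required
    return [r for r in R if r[attribute] in s_values]

employee = [
    {"emp-id": "E01", "deptno": 10},
    {"emp-id": "E02", "deptno": 20},
    {"emp-id": "E03", "deptno": 50},   # has no matching department
]
department = [{"deptno": 10}, {"deptno": 20}, {"deptno": 40}]
print(semijoin(employee, department, "deptno"))
# [{'emp-id': 'E01', 'deptno': 10}, {'emp-id': 'E02', 'deptno': 20}]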
Example 1.10.
The result of natural join operation that involves the Employee and Department relations
of Example 1.1 is depicted in figure 1.8.
Example 1.11.
The result of semijoin operation that involves the Employee and Department relations is
illustrated in figure 1.9.
Employee ⋉deptno Department
Division Operation
Consider two relations R and S defined over the attribute sets A and B respectively such
that B ⊆ A and C = A − B. The division operation of the two relations R and S defines a
relation over the attribute set C that consists of those tuples of R, projected on C, that appear in combination with every tuple of S. In terms of the fundamental operations of relational algebra, the division operation can be represented as follows:
R ÷ S = ∏C(R) − ∏C((∏C(R) × S) − R).
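A direct way to read this definition is: keep those C-values of R that occur together with every tuple of S. The Python sketch below follows that reading; the relation and attribute names are assumed for illustration and only loosely mirror Example 1.12.

def divide(R, S, B):
    # R ÷ S, where B is the list of attributes of S (B is a subset of R's attributes).
    C = [a for a in R[0] if a not in B]                 # C = A − B
    s_tuples = {tuple(s[a] for a in B) for s in S}
    groups = {}                                         # B-values seen with each C-value
    for r in R:
        c_val = tuple(r[a] for a in C)
        groups.setdefault(c_val, set()).add(tuple(r[a] for a in B))
    return [dict(zip(C, c)) for c, seen in groups.items() if s_tuples <= seen]

studentinfo = [
    {"sname": "Asha", "course-id": "C1"},
    {"sname": "Asha", "course-id": "C2"},
    {"sname": "Ravi", "course-id": "C1"},
]
courseinfo = [{"course-id": "C1"}, {"course-id": "C2"}]
print(divide(studentinfo, courseinfo, ["course-id"]))   # [{'sname': 'Asha'}]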
Example 1.12.
Let us consider the two relations Studentinfo and Courseinfo [in figure 1.10] where one
student can take admission in multiple courses and more than one student can enrol in each
course. Using division operation the query “Retrieve all those student names who have taken
admission in all courses where the course duration is greater than 2 years” will produce the
result as shown in figure 1.10.
Chapter Summary
• A database is a collection of interrelated data and a database management system
(DBMS) is a set of programs that are used to store, retrieve and manipulate data in the
database.
• A relational database is a collection of relations where each relation is represented by
a two-dimensional table. Each row of the table is known as a tuple, and each column
of the table is called an attribute.
• There are two principal integrity constraints in relational data model: entity integrity
constraint and referential integrity constraint. Entity integrity constraint requires that
the primary key of a relation must not be null. Referential integrity constraint is
expressed in terms of foreign key.
• Normalization is a step-by-step process to achieve better data representation in
relational database design. There are different normal forms such as 1NF, 2NF, 3NF,
BCNF, 4NF and 5NF, and each normal form has some requirements. Functional
dependency is a very important concept in normalization.
• The relational algebra is a high-level procedural language that is based on
mathematical set theory. The five fundamental operations of relational algebra are
selection, projection, Cartesian product, union and set difference. In addition, there are
three derived operations, namely, join, intersection and division, which can be
represented in terms of the five fundamental operations.
• An RDBMS is a software package that supports relational data model and relational
languages.
Chapter 2. Review of Database Systems
This chapter introduces the evolution of distributed database technology and some key
concepts of parallel database systems. This chapter mainly focuses on the benefits of parallel
databases and different types of architecture for parallel database systems. Data partitioning
techniques for parallel database design are also discussed with examples.
The outline of this chapter is as follows. The evolution of distributed database systems is presented in Section 2.1. In Section 2.2, an overview of parallel processing systems is
presented, and different architectures for parallel database systems are described in Section
2.2.3. The data partitioning techniques for parallel database design are discussed in Section
2.3.
A major limitation of centralized database system is that all information must be stored in a
single central location, usually in a mainframe computer or a minicomputer. The
performance may degrade if the central site becomes a bottleneck. Moreover, the centralized approach often fails to provide fast response times and quick access to information. In the
1980s, a series of critical social and technological changes had occurred that affected the
design and development of database technology. Business environment became
decentralized geographically and required a dynamic environment in which organizations
had to respond quickly under competitive and technological pressures. Over the past two
decades, advancements in microelectronic technology have resulted in the availability of
fast, inexpensive processors. During recent times, rapid development in network and data
communication technology, characterized by the internet, mobile and wireless computing,
has resulted in the availability of cost-effective and highly efficient computer networks. The
merging of computer and network technologies changed the mode of working from a
centralized to a decentralized manner. This decentralized approach mirrors the
organizational structure of many companies that are logically distributed into divisions,
departments, projects and so on and physically distributed into offices, plants, depots and
factories, where each unit maintains its own operational data [Date, 2000]. Database systems
had addressed the challenges of distributed computing from the very early days. In this
context, distributed DBMS has been gaining acceptance as a more sophisticated modern
computing environment. Starting from the late 1970s, a significant amount of research work
has been carried out both in universities and in industries in the area of distributed systems.
These research activities provide us with the basic concepts for designing distributed
systems.
Computer system architectures consisting of interconnected multiple processors are basically
classified into two different categories: loosely coupled systems and tightly coupled systems.
These are described in the following:
1. Tightly coupled systems –. In these systems, multiple processors share a common primary memory (and usually a common clock and system bus), and all communication between the processors takes place through the shared memory, so a value written by one processor to a memory location is immediately visible to the other processors.
2. Loosely coupled systems –. In these systems, processors do not share memory (nor clocks and system buses), and each processor has its own local memory. In this case, if a
processor writes the value 200 to a memory location y, this write operation only
changes the content of its own local memory and does not affect the content of the
memory of any other processor. In such systems, all physical communications
between the processors are established by passing messages across the network that
interconnects the processors of the system. The loosely coupled system is depicted
in figure 2.2.
Usually, tightly coupled systems are referred to as parallel processing systems, and loosely
coupled systems are referred to as distributed computing systems, or simply distributed
systems.
Parallel processing divides a large task into smaller subtasks and executes many of them concurrently. It involves the following:
• Structuring of tasks so that several tasks can be executed at the same time “in parallel”.
• Preserving the required sequencing among those tasks that must be executed serially.
The parallel processing technique increases the system performance in terms of two
important properties. These are speedup and scaleup as described below.
1. Speedup –. Speedup is the extent to which more hardware can perform the same task
in less time than the original system. With added hardware, speedup holds the task
constant and measures time saving. With good speedup, additional processors reduce
system response time. For example, if one processor requires n units of time to complete a given task, then n processors ideally require only 1 unit of time to complete the same task, provided the task can be fully executed in parallel. Speedup can be calculated using the following formula:
speedup = time_original / time_parallel,
where time_original is the elapsed time spent by the original smaller system and time_parallel is the elapsed time spent by the larger parallel system on the same task.
2. Scaleup –. Scaleup is the factor that represents how much more work can be done in
the same time period by a larger system. With added hardware, scaleup holds time as a
constant and measures the increased size of the job that can be done within that
constant period of time. If transaction volumes grow and the system has good scaleup,
the response time can be kept constant by adding more hardware resources (such as
CPUs). Scaleup can be computed using the following formula:
scaleup = volume_parallel / volume_original,
where volume_original is the volume of work processed by the original smaller system in a given amount of time and volume_parallel is the volume of work processed by the larger parallel system in the same amount of time. (A small numeric illustration of speedup and scaleup follows.)
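As a small numeric illustration of the two formulas (all values invented):

# Speedup: the same task, more hardware, less elapsed time.
time_original = 100.0     # seconds on the original system (assumed)
time_parallel = 12.5      # seconds on an 8-processor system (assumed)
print(time_original / time_parallel)   # 8.0 -> linear speedup

# Scaleup: a larger job, more hardware, the same elapsed time.
volume_original = 1000    # transactions handled by the original system in one hour (assumed)
volume_parallel = 8000    # transactions handled by the larger system in the same hour (assumed)
print(volume_parallel / volume_original)   # 8.0 -> linear scaleup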
Parallel Databases
In the case of database applications, some tasks can be divided into several subtasks
effectively, and the parallel execution of these subtasks reduces the total processing time by
surprisingly large amounts, thereby improving system performance. Features such as online
backup, data replication, portability, interoperability and support for a wide variety of client
tools can enable a parallel server to support application integration, distributed operations
and mixed-application workloads. A variety of computer system architectures allow sharing
of resources such as data, software and peripheral devices among multiple processors.
Parallel databases are designed to take advantage of such architectures.
SHARED-MEMORY ARCHITECTURE
Shared memory architecture is a tightly coupled architecture in which multiple processors
within a single system share a common memory, typically via a bus or through an
interconnection network. This approach is known as symmetric multiprocessing (SMP).
SMP has become popular on platforms ranging from personal workstations that support a
few microprocessors in parallel to large RISC (reduced instruction set computer)-based
machines, all the way up to the largest mainframes. In shared-memory architecture, a
processor can send messages to other processors using memory writes (which usually take
less than a microsecond), which is much faster than sending a message through a
communication mechanism. This architecture provides high-speed data access for a limited
number of processors, but it is not scalable beyond about 64 processors, when the
interconnection network becomes a bottleneck. Shared-memory architecture is illustrated
in figure 2.3.
Shared-memory architecture has a number of disadvantages also. These are cost, limited
extensibility and low availability. High cost is incurred owing to the interconnection that
links each processor to each memory module or disk. Usually, shared-memory architectures
have large memory caches at each processor, so that referencing of the shared memory is
avoided whenever possible. Moreover, the caches need to be kept coherent; that is, if a
processor performs a write operation to a memory location, the data in that memory location
should be either updated or removed from any processor where the data is cached.
Maintaining cache-coherency becomes an increasing overhead with increasing number of
processors. Consequently, shared-memory architecture is not scalable beyond about 64
processors when the interconnection network becomes a bottleneck; hence, extensibility is
limited. Finally, in shared-memory architecture memory space is shared by all processors in
the system; therefore, a memory fault may affect most of the processors, thereby limiting
data availability.
Examples of shared-memory parallel database systems include XPRS [Hong, 1992], DBS3 [Bergsten et al., 1991], and Volcano [Graefe, 1990], as well as ports of major commercial
DBMSs on shared-memory multiprocessors.
SHARED-DISK ARCHITECTURE
Shared-disk is a loosely coupled architecture in which all processors share a common set of
disks. Shared-disk systems are sometimes called clusters. In this architecture, each processor
can access all disks directly via an interconnection network, but each processor has its own
private memory. Shared-disk architecture can be optimized for applications that are
inherently centralized and require high availability and performance.
DEC (Digital Equipment Corporation) cluster running Rdb was one of the early commercial
users of the shared-disk database architecture. Rdb is now owned by Oracle and is called
Oracle Rdb.
SHARED-NOTHING ARCHITECTURE
Shared-nothing architecture, often known as massively parallel processing (MPP) architecture, is a multiprocessor architecture in which
each processor is part of a complete system with its own memory and disk storage. The
communication between the processors is incorporated via high-speed interconnection
network. The database is partitioned among all disks associated with the system, and data is
transparently available to all users in the system. Each processor functions as a server for the
data that is stored on its own disk (or disks). The shared-nothing architecture is depicted
in figure 2.5.
Shared-nothing architecture has several disadvantages also. The main drawback of shared-
nothing architecture is the cost of communication for accessing non-local disk, which is
much higher than that of other parallel system architectures as sending data involves
software interaction at both ends. Shared-nothing architecture entails more complexity than
shared-memory architecture, as it requires the implementation of distributed database functions, which becomes very complicated when the number of nodes is large. Although shared-nothing systems
are more scalable, the performance is optimal only when requested data is available in local
storage. Furthermore, the addition of new nodes in the system presumably requires
reorganization of the database to deal with the load-sharing issues.
The Teradata database machine was among the earliest commercial systems using shared-
nothing database architecture. The Grace and the Gamma research prototypes also made use
of shared-nothing architecture.
HIERARCHICAL ARCHITECTURE
The hierarchical architecture is a combination of shared-memory, shared-disk and shared-
nothing architectures. At the top level, the system consists of a number of nodes connected
by an interconnection network, and the nodes do not share disks or memory with one
another. Thus, the top level takes the form of a shared-nothing architecture. Each node of the
system is a shared-memory system with a few processors. Alternatively, each node can be a
shared-disk system, and each of these subsystems sharing a set of disks can be a shared-
memory system. Thus, hierarchical system is built as a hierarchy, with shared-memory
architecture with a few processors at the base, and shared-nothing architecture at the top,
with possibly shared-disk architecture in the middle. The hierarchical architecture is
illustrated in figure 2.6.
Today, several commercial parallel database systems run on these hierarchical architectures.
Data Partitioning
Data partitioning has its origins in centralized system where it is used to partition files, either
because the file is too big for one disk or because the file access rate cannot be supported by
a single disk. Data partitioning allows parallel database systems to exploit the I/O
parallelism of multiple disks by reading and writing on them in parallel. I/O
parallelism refers to reducing the time required to retrieve relations from the disk by
partitioning the relations on multiple disks. Partitioning of a relation involves distributing its
tuples over several disks, so that each tuple resides on one disk. The most common form of data partitioning in a parallel database environment is known as horizontal partitioning. A number of data partitioning techniques for parallel database systems have been proposed, which are described in the following.
ROUND-ROBIN
The simplest partitioning strategy distributes tuples among the fragments in a round-
robin fashion. This is the partitioned version of the classic entry-sequence file. Assume that
there are n disks, D1, D2, ..., Dn, among which the data are to be partitioned. This strategy
scans the relation in some order and sends the ith tuple to disk number (i mod n). The round-robin scheme ensures an even distribution of tuples across the disks.
Round-robin partitioning is excellent for applications that wish to read the entire relation
sequentially for each query. The problem with round-robin partitioning is that it is very
difficult to process point queries and range queries. A point query involves the retrieval of
tuples from a relation that satisfies a particular attribute value whereas a range
query involves the retrieval of tuples from a relation that satisfies the attribute value within a
given range. For example, the retrieval of all tuples from the Student relation [Chapter
1, Example 1.6] with the value “Kolkata” for the attribute “city” is a point query. An
example of a range query is retrieving all tuples from the Employee relation [Chapter
1, Example 1.1] that satisfy the criterion “the value of the salary attribute is within the range
50,000 to 80,000”. With round-robin partitioning scheme, processing of range queries as
well as point queries is complicated, as it requires the searching of all disks.
Example 2.1.
Let us consider the Employee relation [Chapter 1, Example 1.1] with 1,000 tuples that is to
be distributed among 10 disks. Using round-robin data partitioning technique, tuple number
1 will be stored on disk number 1 and tuple number 2 will be stored on disk number 2. All
those tuples that have the same digit in the right most position of the tuple number will be
stored on the same disk. In this case, tuple numbers 1, 11, 21, 31, 41, ..., 991 of the
Employee relation will be stored on disk number 1, because “tuple number mod 10”
produces the result 1 for all these tuples. The data distribution across the disks is uniform here, and each disk will store 100 tuples. However, the tuples are placed without regard to their attribute values, so processing the point query “designation = ‘Manager’” for the relation requires searching all 10 disks. Similarly, the range query “salary between 50,000 and 80,000” is also very difficult to process when the round-robin partitioning technique is used.
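A minimal Python sketch of the round-robin placement rule follows; it reproduces the behaviour described in Example 2.1, with disks identified simply by the remainder of the tuple number.

def round_robin_disk(tuple_number, n_disks):
    # Round-robin placement: the i-th tuple goes to disk (i mod n).
    return tuple_number % n_disks

placements = [round_robin_disk(i, 10) for i in range(1, 1001)]
print(placements[:5])        # [1, 2, 3, 4, 5]
print(placements.count(1))   # 100 -> tuples 1, 11, 21, ..., 991 share one disk
# A point query such as designation = 'Manager' gives no hint about tuple
# numbers, so every one of the 10 disks must still be searched.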
HASH PARTITIONING
This strategy chooses one or more attributes from the given relational schema as the
partitioning attributes. A hash function is chosen whose range is within 0 to n−1, if there
are n disks. Each tuple of the original relation is hashed based on the partitioning attributes.
If the hash function returns i then the tuple is placed on disk Di.
Hash partitioning is ideally suited for applications that want only sequential and associative
accesses to the data. Tuples are stored by applying a hashing function to an attribute. The
hash function specifies the placement of the tuple on a particular disk. Associative access
(point query) to the tuples with a specific attribute value can be directed to a single disk,
avoiding the overhead of starting queries on multiple disks. If the hash function is a good
randomizing function and the partitioning attributes form a key of the relation, then the
number of tuples in each disk is almost the same without much variance. In this case, the
time required to scan the entire relation is 1/n of the time required to scan the same relation
in a single-disk system, because it is possible to scan multiple disks in parallel. However,
hash partitioning is not well suited for point queries on non-partitioning attributes. It is not suitable for range queries either, because hashing scatters tuples with contiguous attribute values across the disks.
Database systems pay considerable attention to clustering related data together in physical
storage. If a set of tuples is routinely accessed together, the database system attempts to store
them on the same physical page. Hashing tends to randomize data rather than cluster it. Hash
partitioning mechanisms are provided by Arbre, Bubba [Copeland, Alexander, Bougherty
and Keller, 1988], Gamma, and Teradata.
Example 2.2.
Assume that the relational schema Student (rollno, name, course-name, year-of-
admission) has 1,000 tuples and is to be partitioned among 10 disks. Further assume that the
attribute rollno is the hash partitioning attribute and the hash function is R mod n,
where R represents the values of the attribute rollno in student tuples and n represents the
total number of disks. Using the hash partitioning technique, all those tuples of the Student relation that have the same digit in the rightmost position of the roll number will be stored on the same disk. Hence, all tuples whose roll numbers belong to the set A = {1, 11, 21, 31,
41, ..., 991} will be stored on disk number 1. Similarly, all those student records whose roll
numbers belong to the set B = {2, 12, 22, 32, ..., 992} will be stored on disk number 2. In this
case, it is easier to process point queries that involve the attribute rollno, since each such query is directed to a single disk. Range queries on rollno, however, still require searching all the disks, because hashing scatters contiguous roll numbers. If rollno is the key of the relation Student, the distribution of tuples among the disks is also uniform. Processing of point queries and range queries that involve non-partitioning attributes such as name, course-name and year-of-admission remains difficult.
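The following Python sketch (the Student tuples are invented) mirrors Example 2.2: the hash function R mod n directs a point query on rollno to a single disk, whereas a range query on rollno must still visit every disk.

def hash_partition(tuples, n, key):
    # Place each tuple on the disk returned by the hash function h(R) = R mod n.
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[t[key] % n].append(t)
    return disks

students = [{"rollno": r, "name": "student-%d" % r} for r in range(1, 1001)]
disks = hash_partition(students, 10, key="rollno")

# Point query on the partitioning attribute: only one disk is searched.
target = 251
result = [t for t in disks[target % 10] if t["rollno"] == target]

# Range query on rollno (say 10 <= rollno <= 50): contiguous roll numbers are
# scattered by the hash function, so all ten disks must be searched.
in_range = [t for d in disks for t in d if 10 <= t["rollno"] <= 50]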
RANGE PARTITIONING
Range partitioning clusters together tuples with similar attribute values in the same partition.
This strategy distributes contiguous attribute-value ranges to different disks. It chooses a partitioning attribute A and a partition vector. Using the range-partitioning technique, a relation is partitioned in the following way. Let {v0, v1, ..., vn−2} denote the partition vector on the partitioning attribute A, such that if i < j then vi < vj. Consider a tuple t such that t[A] = x. If x < v0, then t goes on disk D0; if x ≥ vn−2, then t goes on disk Dn−1; and if vi ≤ x < vi+1, then t goes on disk Di+1.
Range partitioning is good for sequential and associative access, and is also good for
clustering data. It derives its name from the typical SQL range queries such as “salary
between 50,000 and 80,000”. The problem with range partitioning is that it risks data skew,
where all the data is placed in one partition, and execution skew, in which all the execution
occurs in one partition.
Hashing and round-robin are less susceptible to these skew problems. Range partitioning can
minimize skew by choosing a non-uniform partition vector that balances the load across the partitions. Bubba uses this
concept by considering the access frequency (heat) of each tuple when creating partitions of
a relation.
Example 2.3.
Let us assume that the Department relation [Chapter 1, Example 1.1] has 50 tuples and is to
be distributed among 3 disks. Further assume that deptno is the partitioning attribute and that the partition vector defines the three ranges deptno < 20, 20 ≤ deptno ≤ 40 and deptno > 40. Using the range-partitioning technique, all tuples that have deptno values less than 20 will be stored on disk number 1, all tuples that have deptno values between 20 and 40 will be stored on disk number 2, and all tuples that have deptno values greater than 40 will be stored on disk number 3. In this case, processing of
point queries and range queries that involve the deptno attribute is easier. However,
processing of point queries and range queries that involve non-partitioning attributes is
difficult.
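As a sketch only (the Department tuples are invented, and disks are numbered from 0), the placement rule of range partitioning can be written in Python as follows; the partition vector [20, 41] realizes the three deptno ranges used in Example 2.3.

import bisect

def range_partition(tuples, partition_vector, key):
    # With partition vector [v0, ..., v(n-2)] there are n disks: a value x goes
    # to disk 0 if x < v0, to disk i+1 if v(i) <= x < v(i+1), and to the last
    # disk if x >= v(n-2).  bisect_right returns exactly that disk index.
    n = len(partition_vector) + 1
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[bisect.bisect_right(partition_vector, t[key])].append(t)
    return disks

departments = [{"deptno": d, "dname": "dept-%d" % d} for d in range(1, 51)]
disks = range_partition(departments, [20, 41], key="deptno")

# A range query such as "deptno between 25 and 35" touches only disk 1,
# because contiguous deptno values are clustered on the same disk.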
Chapter Summary
• Computer architectures comprising multiple interconnected processors are broadly classified into two categories: loosely coupled systems and tightly coupled systems.
• In loosely coupled systems, processors do not share memory, but in tightly coupled
systems processors share memory. Loosely coupled systems are referred to as
distributed systems, whereas tightly coupled systems are referred to as parallel
processing systems.
• Parallel processing systems divide a large task into many smaller tasks, and execute
these smaller tasks concurrently on several processors.
• Parallel processing increases the system performance in terms of two important
properties, namely, scaleup and speedup.
• Alternative architectural models for parallel databases are shared-memory, shared-
disk, shared-nothing and hierarchical.
• Shared-memory – Shared-memory is a tightly coupled architecture in which
multiple processors within a single system share system memory.
• Shared-disk – Shared-disk is a loosely coupled architecture in which multiple
processors share a common set of disks. Shared-disk systems are sometimes
called clusters.
• Shared-nothing – Shared-nothing, often known as MPP, is a multiple processor
architecture in which processors do not share common memory or common set
of disks.
• Hierarchical – This model is a hybrid of shared memory, shared-disk and
shared-nothing architectures.
• Data partitioning allows parallel database systems to exploit the I/O parallelism of
multiple disks by reading and writing on them in parallel. Different data-partitioning
techniques are round-robin, hash partitioning and range partitioning.
Chapter 3. Distributed Database Concepts
In the 1980s, distributed database systems evolved to overcome the limitations of centralized database management systems and to cope with the rapid changes in
communication and database technologies. This chapter introduces the fundamentals of
distributed database systems. Benefits and limitations of distributed DBMS over centralized
DBMS are briefly discussed. The objectives of a distributed system, the components of a
distributed system, and the functionality provided by a distributed system are also described
in this chapter.
This chapter is organized as follows. Section 3.1 presents the fundamentals of distributed databases, and the features of distributed DBMSs are described in Section 3.2. In Section 3.3, the pros and cons of distributed DBMSs are discussed, and an example of a distributed database system is presented in Section 3.4. The classification of distributed DBMSs is explained in Section 3.5, and the functions of distributed DBMSs are introduced in Section 3.6. Section 3.7 illustrates the components of a distributed database system, and Date’s 12 objectives for distributed database systems are discussed in Section 3.8.
• The distribution is transparent –. users must be able to interact with the system as if
it is a single logical system. This applies to the system performance and method of
accessing amongst other things.
• The transactions are transparent –. each transaction must maintain database
integrity across multiple databases. Transactions may also be divided into
subtransactions; each subtransaction affects one database system.
• Local applications –. These applications require access to local data only and do not
require data from more than one site.
• Global applications –. These applications require access to data from other remote
sites in the distributed system.
Every site in a distributed DBMS may have its own local database depending on the
topology of the Distributed DBMS.
Faster data access. End-users in a distributed system often work with only a subset of the
entire data. If such data are locally stored and accessed, data accessing in distributed
database system will be much faster than it is in a remotely located centralized system.
Speeding up of query processing. A distributed database system makes it possible to
process data at several sites simultaneously. If a query involves data stored at several sites, it
may be possible to split the query into a number of subqueries that can be executed in
parallel. Thus, query processing becomes faster in distributed systems.
Increased local autonomy. In a distributed system, the primary advantage of sharing data
by means of data distribution is that each site is able to retain a degree of control over data
that are stored locally. In a centralized system, the database administrator is responsible for
controlling the entire data in the database system. In a distributed system, there is a global
database administrator responsible for the entire system. A part of these responsibilities is
delegated to local database administrators residing at each site. Each local administrator may
have a different degree of local autonomy or independence depending on the design of the
distributed database system. The possibility of local autonomy is often a major advantage of
distributed databases.
Increased availability. If one site fails in a distributed system, the remaining sites may be
able to continue the transactions of the failed site. In a distributed system, data may be
replicated at several sites; therefore, a transaction requiring a particular data item from a
failed site may find it at other sites. Thus, the failure of one site does not necessarily imply
the shutdown of the system. In distributed systems, some mechanism is required to detect
failures and to recover from them. The system must stop using the services of the failed site. Finally, when the failed site recovers, it must be smoothly integrated back
into the system using appropriate mechanisms. Although recoveries in distributed systems
are more complex than in centralized systems, the ability to continue normal execution in
spite of the failure of one site increases availability. Availability is crucial for database
systems used in real-time applications.
Increased reliability. As data may be replicated in distributed systems, a single data item
may exist at several sites. If one site fails, a transaction requiring a particular data item from
that site may access it from other sites. Therefore, the failure of a node or a communication
link does not necessarily make the data inaccessible; thus, reliability increases.
Better performance. The data in a distributed system are dispersed to match business
requirements; therefore, data are stored near the site where the demand for them is the
greatest. As the data are located near the greatest-demand site and given the inherent
parallelism of distributed DBMSs, speed of database access may be better than in a remote
centralized database. Moreover, as each site handles only a part of the entire database, there
may not be the same level of contention for CPU and I/O services that characterizes a
centralized DBMS.
Reduced operating costs. In the 1960s, computing power was estimated to be proportional to the square of the cost of the equipment; hence, three times the cost would provide nine times the power. This was known as Grosch’s Law. It is now generally accepted that it costs much less to create a system of smaller computers with computing power equivalent to that of a single large computer. Thus, it is more cost-effective
for corporate divisions and departments to obtain separate computers. It is also much more
cost-effective to add workstations to a network than to update a mainframe system. The
second potential cost savings occurs where databases are geographically remote and the
applications require access to distributed data. In such cases, owing to the relative expense of
transmitting data across the network as opposed to the cost of local access, it may be much
more economic to partition the application and perform the processing locally at each site.
Modular extensibility. Modular extension in a distributed system is much easier. New sites
can be added to the network without affecting the operations of other sites. Such flexibility
allows organizations to extend their system in a relatively rapid and easier way. Increasing
data size can usually be handled by adding processors and storage power to the network.
A distributed DBMS has a number of disadvantages also. These are listed in the following.
Increased complexity. Management of distributed data is a far more complicated task than the management of centralized data. A distributed DBMS that hides the distributed nature from
the users and provides an acceptable level of performance, reliability and availability is
inherently more complex. In a distributed system, data may be replicated, which adds
additional complexity to the system. All database related tasks such as transaction
management, concurrency control, query optimization and recovery management are more
complicated than in centralized systems.
Increased maintenance and communication cost. The procurement and maintenance costs
of a distributed DBMS are much higher than those of a centralized system, as complexity
increases. Moreover, a distributed DBMS requires a network to provide communication
among different sites. An additional ongoing communication cost is incurred owing to the
use of this network. There are also additional labour costs to manage and maintain the local
DBMSs and the underlying network.
Security. As data in a distributed DBMS are located at multiple sites, the probability of
security lapses increases. Further, all communications between different sites in a distributed
DBMS are conveyed through the network, so the underlying network has to be made secure
to maintain system security.
Lack of standards. A distributed system can consist of a number of sites that are
heterogeneous in terms of processors, communication networks (communication mediums
and communication protocols) and DBMSs. This lack of standards significantly limits the
potential of distributed DBMSs. There are also no tools or methodologies to help users to
convert a centralized DBMS into a distributed DBMS.
Maintenance of integrity is very difficult. Database integrity refers to the validity and
consistency of stored data. Database integrity is usually expressed in terms of constraints. A
constraint specifies a condition or a proposition that the database is not permitted to violate.
Enforcing integrity constraints generally requires access to a large amount of data that define
the constraint but are not involved in the actual update operation itself. The communication
and processing costs that are required to enforce integrity constraints in a distributed DBMS
may be prohibitive.
Lack of experience. Distributed DBMSs have not been widely accepted, although many
protocols and problems are well-understood. Consequently, we do not yet have the same
level of experience in industry as we have with centralized DBMSs. For a prospective
adopter of this technology, this may be a significant restriction.
Database design more complex. The design of a distributed DBMS involves fragmentation
of data, allocation of fragments to specific sites and data replication, besides the normal
difficulties of designing a centralized DBMS. Therefore, the database design for a distributed
system is much more complex than for a centralized system.
Each local branch can access its local data via local applications, as well as data from other branches via global applications. During normal operations that are requested
from a branch, local applications need only to access the database of that particular branch.
These applications are completely executed by the processor of the local branch where they
are initiated, and are therefore called local applications. Similarly, there exist many
applications that require projects, clients and employees information from other branches
also. These applications are called global applications or distributed applications.
Heterogeneous systems are usually constructed over a number of existing individual sites
where each site has its own local databases and local DBMS software, and integration is
considered at a later stage. To allow the necessary communications among the different sites
in a heterogeneous distributed system, interoperability between different DBMS products is
required. If a distributed system provides DBMS transparency, users are unaware of the
different DBMS products in the system, and therefore it allows users to submit their queries
in the language of the DBMS at their local sites. Achieving DBMS transparency increases
system complexity. Heterogeneity may occur at different levels in a distributed database
system as listed in the following.
• different hardware
• different DBMS software
• different hardware and different DBMS software
If the hardware is different but the DBMS software is the same in a distributed system, the translation between sites is straightforward, involving only changes of codes and word lengths.
Moreover, the differences among the sites at lower levels in a distributed system are usually
managed by the communication software. Therefore, homogeneous distributed DBMS refers
to a DDBMS with the same DBMS at each site, even if the processors and/or the operating
systems are not the same.
If the DBMS software is different at different sites, the execution of global transactions
becomes very complicated, as it involves the mapping of data structures in one data model to
the equivalent data structures in another data model. For example, relations in the relational
data model may be mapped into records and sets in the network data model. The translation
of query languages is also necessary. If both the hardware and DBMS software are different
at different sites in a distributed system, the processing of global transactions becomes
extremely complex, because translations of both hardware and DBMS software are required.
The provision of a common conceptual schema, which is formed by integrating individual
local conceptual schemas, adds extra complexity to the distributed processing. The
integration of data models in different sites can be very difficult owing to the semantic
heterogeneity.
Some relational systems that are part of a heterogeneous distributed DBMS use gateways,
which convert the language and data model of each different DBMS into the language and
data model of the relational system. Gateways are mechanisms that provide access to other
systems. In a gateway, one vendor (e.g., Oracle) provides single direction access through its
DBMS to another database managed by a different vendor’s DBMS (e.g., IBM DB2). The
two DBMSs need not share the same data model. For example, many RDBMS vendors
provide gateways to hierarchical and network DBMSs. However, the gateway approach has
some serious limitations as listed below.
• The gateway between two systems may be only a query translator, that is, it may not
support transaction management even for a pair of systems.
• The gateway approach generally does not address the issue of homogenizing the
structural and representational differences between different database schemas; that is,
it is only concerned with the problem of translating a query expressed in one language
into an equivalent expression in another language.
In this context, one solution is a multi-database system (MDBS) that resides on top of
existing databases and file systems and provides a single logical database to its users. An
MDBS maintains only the global schema against which users issue queries and updates.
[MDBSs are discussed in detail in Chapter 6.]
Functions of Distributed DBMS
A distributed DBMS manages the storage and processing of logically related data on
interconnected computers wherein both data and processing functions are distributed among
several sites. Thus, a distributed DBMS has at least all the functionality of a centralized
DBMS. In addition, it must provide the following functions to be classified as distributed.
Mapping and I/O interfaces. A distributed DBMS must provide mapping techniques to
determine the data location of local and remote fragments. It must also provide I/O interfaces
to read or write data from or to permanent local storage.
Extended query processing and optimization. The distributed DBMS supports query
processing techniques to retrieve answers of local queries as well as global queries. It also
provides query optimization both for local and global queries to find the best access strategy.
In the case of global query optimization, the global query optimizer will determine which
database fragments are to be accessed in order to execute global queries.
Distributed backup and recovery services. Backup services in distributed DBMSs ensure
the availability and reliability of a database in case of failures. Recovery services in
distributed DBMSs take account of failures of individual sites and of communication links
and preserve the database in the consistent state that existed prior to the failure.
Support for global system catalog. A distributed DBMS must contain a global system
catalog to store data distribution details for the system. This feature includes tools for
monitoring the database, gathering information about database utilization and providing a
global view of data files existing at the various sites.
Distributed security control. A distributed DBMS offers security services to provide data
privacy at both local and remote databases. This feature is used to maintain appropriate
authorization/access privileges to the distributed data.
Chapter Summary
• A DDBMS consists of a single logical database that is split into a number
of partitions or fragments. Each partition is stored on one or more computers under
the control of a separate DBMS, with the computers connected by a communication
network.
• A distributed DBMS provides a number of advantages over centralized DBMS, but it
has several disadvantages also.
• A distributed system can be classified as homogeneous distributed DBMS or
heterogeneous distributed DBMS.
• In a distributed system, if all sites use the same DBMS product, it is called a homogeneous distributed database system.
• In a heterogeneous distributed system, different sites may run different DBMS
products, which need not be based on the same underlying data model.
• A distributed DBMS consists of the components computer workstations, computer
network, communication media, transaction processor and data processor.
Chapter 4. Overview of Computer
Networking
This chapter presents the fundamentals of computer networking. Different types of
communication networks and different network topologies are briefly described. Network
protocol is a set of rules that govern the data transmission between the nodes within a
network. A wide range of network protocols used to transmit data within connection-oriented as well as connectionless computer networks is presented. The concepts of the internet and the world-wide web are also introduced in this chapter.
The outline of this chapter is as follows. Section 4.1 introduces the fundamentals of
computer networks. The types of computer networks are described in Section 4.2,
and Section 4.3 introduces different communication schemes. In Section 4.4, network
topologies are discussed, and the OSI model is presented in Section 4.5. The network protocols are briefly described in Section 4.6, and Section 4.7 presents the concepts of the internet and the world-wide web.
Introduction to Networking
Data communications and networking is the fastest growing technology today. The rapid
development in this technology changes the business scenario and the modern computing
environment. Technological advancement in the network field merges data processing
technology with data communication technology. The development of the internet and the
world-wide web has made it possible to share information among millions of users
throughout the world with thousands of machines connected to the network.
Data communication is the exchange of data between two devices via some form of
transmission medium. To exchange data, the communicating devices must be part of some
communication system that is made up of a combination of hardware and software. A data
communication system is made up of five components. These are message, sender, receiver,
medium and protocol. Message is the information to be communicated. Sender is the device
that sends the message and receiver is the device that receives the message. Transmission
medium is the physical path through which a message travels from the sender to the
receiver. The transmission medium may be cables (twisted-pair wire, coaxial cable, fibre-
optic cable), laser or radio waves. Protocol is a set of rules that determine how information
is to be sent, interpreted and processed between the nodes in a network.
A computer network is the collection of devices referred to as nodes (or sites or hosts)
connected by transmission media. A node can be a computer, a printer or any other device
that is capable of receiving and/or sending information that is generated by other nodes in the
network and/or by itself. The transmission media is often called communication channel or
communication link. Computer networks use distributed processing in which a task is
divided into several subtasks that are executed in parallel on multiple computers connected
to the network. Therefore, in the context of distributed database systems, it is necessary to
introduce the fundamentals of computer networks.
Types of Computer Networks
Computer networks are broadly classified into three different categories. These are local area
networks (LANs), metropolitan area networks (MANs) and wide area networks (WANs).
This classification is done based on the geographical area (i.e., distance) covered by a
network and the physical structure of the network.
Local Area Network (LAN) –. A LAN usually connects the computers within a single
office, building or campus. A LAN covers a small geographical area of a few kilometers.
LANs are designed to share resources between personal computers or workstations. The
resources that are shared within a LAN may be hardware, software or data. In addition to
geographical area, LANs are distinguishable from other types of networks by their
communication media and topology. The data transfer rates for LANs are 10 to 2,500 Mbps,
and LANs are highly reliable.
Wide Area Network (WAN) –. A WAN allows sharing of information over a large
geographical area that may comprise a country, a continent or even the whole world. Since
WANs cover a large geographical area, the communication channels in a WAN are relatively
slow and less reliable than those in LANs. The data transfer rate for a WAN generally ranges
from 33.6 kbps to 45 Mbps. WANs may utilize public, leased or private communication
devices for sharing information, thereby covering a large geographical area. A WAN that is
totally owned and used by a single company is called an enterprise network.
Metropolitan Area Network (MAN) –. A MAN is a special case of the WAN that generally
covers the geographical area within a city. A MAN may be a single network (e.g., a cable
television network) or it may consist of several LANs so that information can be shared
LAN-to-LAN as well as device-to-device. A MAN may be totally owned by a private
company or it may be a service provided by a public company (e.g., local telephone
company).
Communication Schemes
In terms of physical communication pathway, networks can be classified into two different
categories: point-to-point (also called unicast) networks and broadcast (also called multi-
point) networks.
A point-to-point network provides a dedicated link between each pair of nodes in the
network. The entire capacity of the channel is reserved for the transmission between the two
nodes. The receiver and the sender are identified by their addresses, which are included in the frame header. Data transfer between two nodes follows one of the many possible links,
some of which may involve visiting other intermediate nodes. The intermediate node checks
the destination address in the frame header and passes it to the next node. This is known
as switching. The most common communication media for point-to-point networks are
coaxial cables or fiber-optic cables.
In broadcast networks a common communication channel is shared by all the nodes in the
network. In such networks, the capacity of the channel is shared either spatially or temporally among the nodes. If several nodes use the communication channel
simultaneously, then it is called a spatially shared broadcast network. A special case of
broadcasting is multicasting where the message is sent to a subset of all the nodes in the
network.
Network Topologies
The term network topology refers to the physical layout of a computer network. The
topology of a network is the geometric representation of the relationships between all
communication links and nodes. Five basic network topologies are star, ring, bus, tree and
mesh.
Star Topology –. In star topology, each node has a dedicated point-to-point connection only
to a central controller node, known as the hub; but the nodes are not directly connected to
each other. If one node wants to send some information to another node in the network, it
sends the information to the central controller node, and the central controller node passes it
to the destination node. The major advantage of the star topology is that it is less expensive
and easier to install and reconfigure. Another benefit is that even if one node fails in star
topology, others remain active; thus, it is easier to identify the fault and isolate the faulty
node. However, possible failure of the central controller node is a major disadvantage. The
star topology is illustrated in figure 4.1.
Ring Topology –. In ring topology, each node has a dedicated connection with two
neighboring nodes on either side of it. The data transfer within a ring topology is
unidirectional. Each node in the ring incorporates a repeater, and when a signal passes
through the network, it receives the signal, checks the address and copies the message if it is
the destination node or retransmits it. In ring topology, the data transmission is controlled by
a token. The token circulates in the ring all the time, and the bit pattern within the token
represents whether the network is free or in use. If a node wants to transmit data it receives
the token and checks the bit pattern of it. If it indicates that the network is free, the node
changes the bit pattern of the token and then puts the data on the ring. After transmitting the
data to the receiver, the token again comes back to the sender, and the sender changes the bit
pattern of the token to indicate that the network is free for further transmission. A ring
topology is relatively easy to install and reconfigure. However, unidirectional traffic can be a
major disadvantage. Further, a break in the ring can disable the entire network. This problem
can be solved by using a dual ring or a switch. The ring topology is depicted in figure 4.2.
Bus Topology –. In bus topology, all the nodes in the network are connected by a common
communication channel, known as the backbone. Bus topology uses multi-point
communication scheme. In this topology, the nodes are connected to the backbone by drop
lines and taps. A drop line is a connection between a node and the backbone. A tap is a
connector that is used to make contact with the metallic core. The advantage of bus topology
is its ease of installation. The major disadvantage of this topology is that it is very difficult to
isolate a fault and reconfigure the network. Furthermore, a fault or break in the backbone
stops all transmission within the network. The bus topology is illustrated in figure 4.3.
Tree Topology –. Tree topology is a variation of star topology. In tree topology, not every node is directly connected to the central hub. A majority of the nodes are connected to the
central hub via a secondary hub. The central hub is the active hub and it contains a repeater
that regenerates the bit patterns of the received signal before passing them out. The
secondary hubs may be active or passive. A passive hub simply provides a physical
connection between the attached nodes. The addition of secondary hubs provides extra
advantages. First, it increases the number of nodes that can be attached to the central hub.
Second, it allows the network to isolate and prioritize communications from different
computers. The tree topology is illustrated in figure 4.4.
Mesh Topology –. In mesh topology, each node has a dedicated point-to-point connection
with every other node in the network. Therefore, a fully connected mesh network
with n nodes has n (n − 1)/2 physical communication links. Mesh topology provides several
advantages over other topologies. First, the use of dedicated links guarantees that each link
carries only its own data load, thereby eliminating traffic problems. Second, mesh topology
is robust; even if one link fails, other links remain active. Third, in a mesh network, as data
are transmitted through dedicated links, privacy and security are inherently maintained.
Finally, fault detection and fault isolation are much easier in mesh topology. The main
disadvantage of mesh topology is that it requires a large number of cables and I/O ports;
thus, installation and reconfiguration are very difficult. Another disadvantage is that it is
more costly than other topologies. Hence, mesh networks are implemented in limited
applications. The mesh topology is depicted in figure 4.5.
Each layer in the OSI model provides a particular service to its upper layer, hiding the
detailed implementation. The application layer provides user interfaces and allows users to
access network resources. It supports a number of services such as electronic mail, remote-
file access, distributed information services and shared database management. The
presentation layer deals with the syntax and semantics of the information that is to be
exchanged between two systems. The responsibility of this layer is to transform, encrypt and
compress data. Session establishment and termination are controlled by the session layer.
The transport layer is responsible for the delivery of the entire message from source to
destination, and the network layer is responsible for the delivery of packets from source to
destination across multiple networks. The data link layer organizes bits into frames and
provides node-to-node delivery. The physical layer deals with the mechanical and electrical
specifications of the interface and the communication medium that are required to transmit a
bit stream over a physical medium. The details of these layers are not discussed here.
Interested readers can consult [Tanenbaum, 1997] or any other network books available in
the market.
Network Protocols
A network protocol is a set of rules that determine how information is sent, interpreted and processed between the nodes of a network. It represents an agreement between the communicating nodes.
The key elements of a protocol are syntax, semantics and timing. The syntax of a protocol
represents the structure or format of the data that are to be communicated. The semantics of a protocol refers to the meaning of each section of bits, and the timing of a protocol specifies when data should be sent and how fast they should be sent. Several network protocols are
described in the following section.
• IP –. IP is responsible for moving packets from node to node. IP forwards each packet
based on a four-byte destination address (the IP number). The internet authorities
assign ranges of numbers to different organizations. The organizations assign groups
of their numbers to departments. IP operates on gateway machines that move data
from department to organization, organization to region, and then around the world.
• TCP –. TCP is responsible for verifying the correct delivery of data from client to
server. Data can be lost in the intermediate network. TCP can detect errors and loss of
data and retransmit the data until the data is correctly and completely received by the
destination node.
• Sockets –. Sockets is the name given to the package of subroutines that provide access
to TCP/IP on most systems.
The TCP part and the IP part correspond to the transport layer and the network layer,
respectively, of the OSI model. In TCP/IP communications, the common applications are
remote login and file transfer.
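As a hedged illustration of the sockets interface mentioned above (the host name and port are placeholders, not taken from the text), the short Python sketch below opens a TCP connection, sends a request and reads the reply; IP routes the individual packets, while TCP ensures that the byte stream arrives completely and in order.

import socket

HOST, PORT = "example.com", 80          # hypothetical server and port

# Open a TCP connection through the sockets interface.
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    # The application simply writes to and reads from a reliable byte stream;
    # error detection, retransmission and ordering are handled by TCP.
    sock.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    reply = sock.recv(4096)
print(reply.decode(errors="replace"))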
• Naming services –. Names are necessary for identification, can be specific to the node
(unique name) or for a group of nodes (group name) and can be changed. Naming
services are provided by name management protocol (NMP).
• Session services –. Session services provide a connection-oriented, reliable, full-duplex message service to user processes using the session management protocol (SMP).
• Datagram services –. Using user datagram protocol (UDP), datagrams can be sent
to a specific name or all members of a group, or can be broadcast to the entire LAN.
• Diagnostic services –. Diagnostic services provide the ability to query the status of
nodes on the network using diagnostic and monitoring protocol (DMP).
APPC provides commands for managing a session, sending and receiving data, and
transaction management using a two-phase commit protocol. Instead of establishing a new
session for every request, in APPC communication is managed through a subsystem. The
subsystem maintains a queue of long-running sessions between the user machine and
subsystems on the server machine. In most of the cases, a new request is sent through an
existing session. There is some overhead as programs communicate with the subsystem, but it is much less than the cost of creating and ending new sessions constantly.
The smallest APPC transaction consists of two operations: allocate and deallocate.
The allocate operation acquires temporary ownership of one of the existing sessions on the
server node. The deallocate operation frees the session and ends the conversation. Such a
transaction requires that a program be run on the server node, but it provides no data
and does not wait for a response. APPC programs use two statements to send and receive
data. These are Send_Data and Receive_and_Wait. LU 6.2 is a set of SNA parameters used
to support APPC when it runs on IBM’s System Network Architecture (SNA) network; thus,
sometimes APPC and LU 6.2 are considered identical.
DECnet
DECnet is a group of data communication products including a protocol suite, developed and
supported by Digital Equipment Corporation (Digital). The first version of DECnet was
released in 1975, which allowed two directly attached PDP-11 minicomputers to
communicate. In recent years, Digital has started supporting several non-proprietary
protocols. DECnet is a series of products that conform to DEC’s Digital Network
Architecture (DNA). DNA supports a variety of media and link implementations such as
Ethernet and X.25. DNA also offers a proprietary point-to-point link-layer protocol called
Digital Data Communications Message Protocol (DDCMP) and a 70-Mbps bus used in the
VAXcluster called the computer-room interconnect bus (CI bus).
DECnet routable protocol supports two different types of nodes. These are end nodes and
routing nodes. Both end nodes and routing nodes can send and receive network information,
but routing nodes can provide routing services for other DECnet nodes. Unlike TCP/IP and
some other network protocols, DECnet addresses are not associated with the physical
network to which the nodes are connected. Instead, DECnet locates hosts using area and
node address pairs. Areas can span many routers, and a single cable can support many areas.
DECnet is currently in its fifth major product release, which is known as Phase V and
DECnet/OSI.
AppleTalk
In the early 1980s, AppleTalk, a protocol suite, was developed by Apple Computer in
conjunction with the Macintosh Computer. AppleTalk is one of the early implementations of
a distributed client/server networking system. AppleTalk is Apple’s LAN routable protocol
that supports Apple’s proprietary LocalTalk access method as well as Ethernet and token
ring technologies.
AppleTalk networks are arranged hierarchically. The four basic components that form the
basis of an AppleTalk network are sockets, nodes, networks and zones. AppleTalk utilizes
addresses to identify and locate devices on a network in a manner similar to the one utilized
by the common protocols such as TCP/IP and IPX. The AppleTalk network manager and the
LocalTalk access method are built into all Macintoshes and Laser Writers.
The WAP protocol stack is designed to operate with a variety of bearer services with
emphasis on low-speed mobile communication, but is most suited for packet-switched bearer
services. The wireless datagram protocol (WDP) operates above the bearer services and
provides a connectionless unreliable datagram service similar to UDP, transporting data from
a sender to a receiver. The wireless transaction protocol (WTP) in a WAP stack can be
considered as equivalent to the TCP layer in the IP stack, but it is optimized for low
bandwidth. To run a WAP application, a complex infrastructure is required, consisting of a mobile client, a public land–mobile network (such as GSM), a public switched telephone network (such as ISDN), a WAP gateway, an IP network and a WAP application server, in addition to system components such as WAP portals, proxy servers, routers and firewalls. WAP
also provides a couple of advanced security features that are not available in the IP
environment. Wireless markup language (WML) is an XML-based markup language that is
specially designed for small mobile devices. In short, WAP provides a complete environment
for wireless applications.
The World-Wide Web (WWW or the Web) is a hypermedia-based system, which is a repository of information spread all over the world and linked together. The WWW is a
distributed client–server service where clients can access a service by using a web
browser. The service is provided by a number of web servers that are distributed over many
locations. A web browser is the software that provides the facility to access web resources.
The information on the Web is stored in the form of web pages, which are a collection of
text, graphics, pictures, sound and video. Web pages also contain links to other web pages,
called hyperlinks. A website is a collection of several web pages. The information on web
pages is generally represented by a special language other than simple text, called hypertext
markup language (HTML). The protocol that determines the exchange of information
between the web server and the web browser is called hypertext transfer protocol
(HTTP). The location or address of each resource on the Web is unique and it is represented
by a uniform resource locator (URL). The World-Wide Web has established itself as the
most popular way of accessing information via the Internet.
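Purely as an illustration (the URL is a placeholder), the following Python sketch does what a web browser does behind the scenes: it takes a URL, issues an HTTP request to the web server it identifies, and receives the HTML source of the requested web page.

from urllib.parse import urlparse
from urllib.request import urlopen

url = "http://example.com/index.html"         # hypothetical URL

parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path) # protocol, web server, resource

with urlopen(url, timeout=5) as response:     # HTTP request/response exchange
    html = response.read().decode(errors="replace")
print(html[:200])                             # beginning of the HTML web page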
Chapter Summary
• Data communication is the exchange of data between two devices via some form of
transmission medium. A data communication system is made up of five components,
namely, message, sender, receiver, medium and protocol.
• A computer network is a collection of nodes (or sites or hosts) connected by transmission media; each node is capable of receiving and/or sending information generated by other nodes in the network and/or by itself.
• Computer networks are broadly classified into three different categories: LANs,
WANs and MANs. Depending on the communication pathway, networks can also be
classified into two different categories: point-to-point networks and broadcast
networks.
• Network topology represents the physical layout of a computer network. There are
five different network topologies: star, bus, ring, tree and mesh. Sometimes, hybrid
topologies are also adopted in specific networks.
• A protocol is a set of rules that governs the data transmission between the nodes in a network. The OSI model is a layered reference model that describes how two different systems can communicate with each other regardless of their underlying architecture.
• The Internet consists of a number of separate but interconnected networks. The
World-Wide Web (WWW) is a repository of information spread all over the world
and linked together.
Chapter 5. Distributed Database Design
This chapter introduces the basic principles of distributed database design and related
concepts. All distributed database design concepts, such as fragmentation, replication, and
data allocation are discussed in detail. The different types of fragmentations are illustrated
with examples. The benefits of fragmentation, objectives of fragmentation, different
allocation strategies, and allocation of replicated and non-replicated fragments are explained
here briefly. Different types of distribution transparencies are also discussed in this chapter.
The outline of this chapter is as follows. Section 5.1 presents the basic concepts of distributed database design. The objectives of data distribution are introduced in Section 5.2. In Section 5.3, data fragmentation, an important issue in distributed database design, is explained briefly with examples. Section 5.4 focuses on the allocation of fragments and the measure of costs and benefits of fragment allocation. In Section 5.5, different types of distribution transparencies are presented.
The definition and allocation of fragments must be based on how the database is to be used.
After designing the database schemas, the design of application programs is required to
access and manipulate the data into the distributed database system. In the design of a
distributed database system, precise knowledge of application requirements is necessary,
since database schemas must be able to support applications efficiently. Thus, the database
design should be based on both quantitative and qualitative information, which collectively
represents application requirements. Quantitative information is used in allocation, while
qualitative information is used in fragmentation. The quantitative information of application
requirements may include the following:
• The frequency with which a transaction is run, that is, the number of transaction requests per unit time. In the case of applications that are issued from multiple sites, it is necessary to know the frequency of activation of each transaction at each site.
• The site from which a transaction is run (also called site of origin of the transaction).
• The performance criteria for transactions.
Characterizing these features is not trivial. Moreover, this information is typically given for global relations and must be properly translated in terms of all the fragmentation alternatives that are considered during database design.
Top-down design process–. In this process, the database design starts from the global
schema design and proceeds by designing the fragmentation of the database, and then by
allocating the fragments to the different sites, creating the physical images. The process is completed by performing, at each site, the physical design of the data allocated to it.
The global schema design involves designing both the global conceptual schema and the global external schemas (view design). In the global conceptual schema design step, the user needs to specify the data entities and to determine the applications that will run on the database, as well as statistical information about these applications. At this stage, the design of local
conceptual schemas is considered. The objective of this step is to design local conceptual
schemas by distributing the entities over the sites of the distributed system. Rather than
distributing relations, it is quite common to partition relations into subrelations, which are
then distributed to different sites. Thus, in a top-down approach, the distributed database
design involves two phases, namely, fragmentation and allocation.
The fragmentation phase is the process of clustering information in fragments that can be
accessed simultaneously by different applications, whereas the allocation phase is the
process of distributing the generated fragments among the sites of a distributed database
system. In the top-down design process, the last step is the physical database design, which
maps the local conceptual schemas into physical storage devices available at corresponding
sites. The top-down design process is best suited for those distributed systems that are developed from scratch.
Bottom-up design process–. In the bottom-up design process, the issue of integration of
several existing local schemas into a global conceptual schema is considered to develop a
distributed system. When several existing databases are aggregated to develop a distributed
system, the bottom-up design process is followed. This process is based on the integration of
several existing schemas into a single global schema. It is also possible to aggregate several
existing heterogeneous systems for constructing a distributed database system using the
bottom-up approach. Thus, the bottom-up design process requires the following steps:
1. The selection of a common database model for describing the global schema of the
database
2. The translation of each local schema into the common data model
3. The integration of the local schemas into a common global schema.
Any one of the above design strategies is followed to develop a distributed database system.
1. Centralized–. In this strategy, the system consists of a single database and DBMS stored at one site, with users distributed across the communication network. Remote users can access the centralized data over the network; thus, this strategy is similar to distributed processing.
In this approach, locality of reference is the lowest at all sites, except the central site
where the data are stored. The communication cost is very high since all users, except those at the central site, have to use the network for all data accesses. Reliability and availability are very low, since the failure of the central site results in the loss of the entire database system.
2. Fragmented (or Partitioned)–. This strategy partitions the entire database into
disjoint fragments, where each fragment is assigned to one site. In this strategy,
fragments are not replicated.
If fragments are stored at the site where they are used most frequently, locality of
reference is high. As there is no replication of data, storage costs are low. Reliability and availability are also low, but still higher than in the centralized data allocation strategy, as the failure of a site results in the loss of local data only. Communication costs are incurred only for global transactions; hence, if the data distribution is designed properly, performance should be good and communication costs low.
3. Complete replication–. In this strategy, each site of the system maintains a complete
copy of the entire database. Since all the data are available at all sites, locality of
reference, availability and reliability, and performance are maximized in this
approach.
Storage costs are very high in this case. Since all data are held locally, no communication costs are incurred for retrievals by global transactions; however, the communication costs for updating data items are the highest of all the strategies. To overcome this problem, snapshots are sometimes
used. A snapshot is a copy of the data at a given time. The copies are updated
periodically, so they may not be always up-to-date. Snapshots are also sometimes used
to implement views in a distributed database, to reduce the time taken for performing a
database operation on a view.
4. Selective replication–. This strategy is a combination of fragmentation, replication and centralization. Its objective is to utilize the advantages of all the other strategies while avoiding their disadvantages. This strategy is the most commonly used because of its flexibility [see table 5.1].
[Table 5.1 compares the data allocation strategies; only the complete replication row is recoverable here: locality of reference highest, reliability and availability highest, performance best for reading, storage costs highest, and communication costs high for updating but low for reading.]
Data Fragmentation
In a distributed system, a global relation may be divided into several non-overlapping subrelations, called fragments, which are then allocated to different sites. This process is called data fragmentation. The objective of data fragmentation design is to determine non-overlapping
fragments, which are logical units of allocation. Fragments can be designed by grouping a
number of tuples or attributes of relations. Each group of tuples or attributes that constitute a
fragment has the same properties.
1. Better usage–. In general, applications work with views rather than with entire relations. Therefore, it is beneficial to fragment relations into subrelations and to use these subrelations, stored at different sites, as the units of distribution.
2. Improved efficiency–. Fragmented data can be stored close to where it is most
frequently used. In addition, data that are not required by local applications are not
stored locally, which may result in faster data access, thereby increasing the
efficiency.
3. Improved parallelism or concurrency–. With a fragment as the unit of distribution,
a transaction can be divided into several subtransactions that operate on different
fragments in parallel. This increases the degree of concurrency or parallelism in the
system, thereby allowing transactions to execute in parallel in a safe way.
4. Better security–. Data that are not required by local applications are not stored locally
and, consequently, are not available to unauthorized users of the distributed system,
thus, improving the security.
1. Completeness–. If a relation instance R is decomposed into fragments R1, R2, ... , Rn,
each data item in R must appear in at least one of the fragments Ri. This property is
identical to the lossless decomposition property of normalization and it is necessary in
fragmentation to ensure that there is no loss of data during data fragmentation.
2. Reconstruction–. If a relation R is decomposed into fragments R1, R2, ... , Rn, it must
be possible to define a relational operation that will reconstruct the relation R from the
fragments R1, R2, ... , Rn. This rule ensures that constraints defined on the data in the
form of functional dependencies are preserved during data fragmentation.
3. Disjointness–. If a relation instance R is decomposed into fragments R1, R2, ... , Rn, and
if a data item is found in the fragment Ri, then it must not appear in any other
fragment. This rule ensures minimal data redundancy. In the case of vertical fragmentation, the primary key attributes must be repeated in every fragment to allow reconstruction and to preserve functional dependencies. Therefore, for vertical fragmentation, disjointness is defined only on the non-primary-key attributes of a relation.
HORIZONTAL FRAGMENTATION
Horizontal fragmentation partitions a relation along its tuples, that is, horizontal fragments
are subsets of the tuples of a relation. A horizontal fragment is produced by specifying a
predicate that performs a restriction on the tuples of a relation. In this fragmentation, the
predicate is defined by using the selection operation of the relational algebra. For a given
relation R, a horizontal fragment is defined as σp(R), where p is a predicate based on one or more attributes of the relation R.
In some cases, the choice of horizontal fragmentation strategy, that is, the predicates or
search conditions for horizontal fragmentation is obvious. However, in some cases, it is very
difficult to choose the predicates for horizontal fragmentation and it requires a detailed
analysis of the application. The predicates may be simple involving single attribute, or may
be complex involving multiple attributes. Further, the predicates for each attribute may be
single or multi-valued, and the values may be discrete or may involve a range of values.
Rather than using complicated predicates directly, the fragmentation strategy involves finding minterm predicates, built from a set of simple predicates, that can be used as the basis for the fragmentation schema. A minterm predicate is the conjunction of simple predicates, each taken in either its natural or its negated form, and the set of simple predicates used should be complete and minimal. A set of predicates is complete if and only if any two tuples in the same fragment are referenced with the same probability by any transaction. A predicate is relevant if there is at least one transaction that accesses the resulting fragments differently; a set of predicates is minimal if every predicate in it is relevant.
• Consider a predicate P1 which partitions the tuples of a relation R into two parts which
are referenced differently by at least one application. Assume that P = P1.
• Consider a new simple predicate Pi that partitions at least one fragment of P into two
parts which are referenced in a different way by at least one application;
Set P ← P ∪ Pi. Non-relevant predicates should be eliminated from P, and this procedure is repeated until the set of minterm fragments of P is complete.
Example 5.1.
Let us consider the relational schema Project [Chapter 3, Section 3.1.3] where project-type
represents whether the project is an inside project (within the country) or abroad project
(outside the country). Assume that P1 and P2 are two horizontal fragments of the relation
Project, which are obtained by using the predicate “whether the value of project-type
attribute is ‘inside’ or ‘abroad’”, as listed in the following:
• P1: σproject-type = “inside”(Project)
• P2: σproject-type = “abroad”(Project)
The descriptions of the Project relation and the horizontal fragments of this relation are
illustrated in figure 5.1.
Figure 5.1. The Project relation and its horizontal fragments P1 and P2
These horizontal fragments satisfy all the correctness rules of fragmentation as shown below:
• Completeness–. Each tuple in the relation Project appears either in fragment P1 or P2.
Thus, it satisfies completeness rule for fragmentation.
• Reconstruction–. The Project relation can be reconstructed from the horizontal
fragments P1 and P2 by using the union operation of relational algebra, which ensures
the reconstruction rule.
Thus, P1 ∪ P2 = Project.
• Disjointness–. The fragments P1 and P2 are disjoint, since there can be no such
project whose project-type is both “inside” and “abroad”.
In this example, the predicate set {project-type = “inside”, project-type = “abroad”} is
complete.
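These rules can also be checked mechanically. The following minimal Python sketch (not part of the original text; the sample tuples are assumed for illustration) verifies completeness, reconstruction, and disjointness for a horizontal fragmentation, with relations modelled simply as sets of tuples:

    def check_horizontal_fragments(relation, fragments):
        """Check the correctness rules for a horizontal fragmentation.

        relation  -- the global relation, as a set of tuples
        fragments -- a list of fragments, each a set of tuples
        """
        union = set().union(*fragments)
        completeness = relation <= union            # every tuple appears in some fragment
        reconstruction = union == relation          # R = R1 ∪ R2 ∪ ... ∪ Rn
        disjointness = sum(len(f) for f in fragments) == len(union)  # no tuple repeated
        return completeness, reconstruction, disjointness

    # Illustration with the Project relation of Example 5.1 (sample tuples assumed)
    project = {("Pr-1", "inside"), ("Pr-2", "abroad"), ("Pr-3", "inside")}
    p1 = {t for t in project if t[1] == "inside"}   # σ project-type = "inside"
    p2 = {t for t in project if t[1] == "abroad"}   # σ project-type = "abroad"
    print(check_horizontal_fragments(project, [p1, p2]))   # (True, True, True)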
Example 5.2.
Let us consider the distributed database of a manufacturing company that has three sites in
eastern, northern, and southern regions. The company has a total of 20 products out of which
the first 10 products are produced in the eastern region, the next five products are produced
in the northern region, and the remaining five products are produced in the southern region.
The global schema of this distributed database includes several relational schemas such
as Branch, Product, Supplier, and Sales. In this example, the horizontal fragmentation of
Sales and Product has been considered.
Assume that there are two values for the region attribute, “eastern” and “northern”, in the relational schema Sales (depo-no, depo-name, region). Let us consider an application that can be generated from any site of the distributed system and involves an SQL query on the Sales relation.
If the query is initiated at site 1, it references Sales whose region is “eastern” with 80 percent
probability. Similarly, if the query is initiated at site 2, it references Sales whose region is
“northern” with 80 percent probability, whereas if the query is generated at site 3, it
references Sales of “eastern” and “northern” with equal probability. It is assumed that the
products produced in a region come to the nearest sales depot for sales. Now, the set of simple predicates is {p1, p2}, where
• p1: region = “eastern”
• p2: region = “northern”
Since the set of predicates {p1, p2} is complete and minimal, the process is terminated. Note that the relevant predicates cannot be deduced merely by analysing the code of an application; the access frequencies at the different sites must also be known. In this case, the minterm predicates are as follows:
• X1: (region = “eastern”) AND (region = “northern”)
• X2: (region = “eastern”) AND NOT (region = “northern”)
• X3: NOT (region = “eastern”) AND (region = “northern”)
• X4: NOT (region = “eastern”) AND NOT (region = “northern”)
Since (region = “eastern”) ⇒ NOT (region = “northern”) and (region = “northern”) ⇒ NOT (region = “eastern”), and the region attribute takes only these two values, X1 and X4 are contradictory, while X2 and X3 reduce to the predicates p1 and p2, respectively.
For the global relation Product (product-id, product-name, price, product-type), the set
of predicates are as follows:
• P1: product-id ≤ 10
• P2: 10 < product-id ≤ 15
• P3: product-id > 15
• P4: product-type = “consumable”
• P5: product-type = “non-consumable”
It is assumed that applications are generated at site 1 and site 2 only. It is further assumed
that the applications that involve queries about consumable products are issued at site 1,
while the applications that involve queries about non-consumable products are issued at site
2. In this case, the fragments after reduction with the minimal set of predicates are listed in
the following:
• F1: product-id ≤ 10
• F2: (10 < product-id ≤ 15) AND (product-type = “consumable”)
• F3: (10 < product-id ≤ 15) AND (product-type = “non-consumable”)
• F4: product-id > 15.
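To make the notion of minterm fragments more concrete, the following Python sketch (an illustration only; the sample Product tuples are assumptions, not data from the text) enumerates the conjunctions of a set of simple predicates, each taken in positive or negated form, and keeps only those that select at least one tuple:

    from itertools import product

    def minterm_fragments(simple_predicates, tuples):
        """Enumerate minterm predicates (conjunctions of the simple predicates in
        positive or negated form) and return the non-empty fragments they select."""
        fragments = []
        for signs in product([True, False], repeat=len(simple_predicates)):
            selected = [t for t in tuples
                        if all(p(t) if keep else not p(t)
                               for p, keep in zip(simple_predicates, signs))]
            if selected:                      # drop contradictory or empty minterms
                fragments.append((signs, selected))
        return fragments

    # Simple predicates for the Product relation of Example 5.2
    predicates = [
        lambda t: t["product-id"] <= 10,
        lambda t: 10 < t["product-id"] <= 15,
        lambda t: t["product-type"] == "consumable",
    ]
    # A few sample Product tuples (assumed for illustration)
    products = [
        {"product-id": 3,  "product-type": "consumable"},
        {"product-id": 12, "product-type": "consumable"},
        {"product-id": 14, "product-type": "non-consumable"},
        {"product-id": 18, "product-type": "consumable"},
    ]
    for signs, frag in minterm_fragments(predicates, products):
        print(signs, [t["product-id"] for t in frag])

With these sample tuples, the surviving minterms correspond closely to the fragments F1–F4 above; contradictory combinations, such as product-id ≤ 10 together with 10 < product-id ≤ 15, select no tuples and are eliminated.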
VERTICAL FRAGMENTATION
Vertical fragmentation partitions a relation along its attributes, that is, vertical fragments are
subsets of attributes of a relation. A vertical fragment is defined by using the projection
operation of relational algebra. For a given relation R, a vertical fragment is defined as
Π_{a1, a2, ..., an}(R)
The choice of vertical fragmentation strategy is more complex than that of horizontal
fragmentation, since a number of alternatives are available. One solution is to consider the
affinity of one attribute to another. Two different types of approaches have been identified for attribute partitioning in the vertical fragmentation of global relations. A common starting point is to record, for each transaction, which pairs of attributes it accesses together, for example in a triangular matrix such as the following (an entry of 1 indicates that the two attributes are used together by the transaction):
         a1   a2   a3   a4
    a1         1    0    1
    a2              0    1
    a3                   0
    a4
In this process, a matrix is produced for each transaction and an overall matrix is produced
showing the sum of all accesses for each attribute pair. Pairs with high affinity should appear
in the same vertical fragment and pairs with low affinity may appear in different fragments.
This technique was first proposed for centralized database design [Hoffer and Severance,
1975] and then it was extended for distributed environment [Navathe et al., 1984].
Example 5.3.
In this case, the Project relation is partitioned into two vertical fragments V1 and V2, which
are described below [figure 5.2]:
Figure 5.2. The vertical fragments V1 and V2 of the Project relation
Here, the primary key of the Project relation is project-id, which is repeated in both vertical fragments V1 and V2 so that the original base relation can be reconstructed from the fragments.
• Disjointness–. The vertical fragments V1 and V2 are disjoint except for the primary key project-id, which appears in both fragments and is necessary for reconstruction.
Hence, vertical fragmentation also ensures the correctness rules for fragmentation.
Bond energy algorithm –. The Bond Energy Algorithm (BEA) is the most suitable algorithm
for vertical fragmentation [Navathe et al., 1984]. The bond energy algorithm uses attribute
affinity matrix (AA) as input and produces a clustered affinity matrix (CA) as output by
permuting rows and columns of AA. The generation of CA from AA involves three different
steps: initialization, iteration, and row ordering, which are illustrated in the following:
• Initialization–. In this step, one column from AA is selected and is placed into the
first column of CA.
• Iteration–. In this step, the remaining n – i columns are taken from AA and they are
placed in one of the possible i + 1 positions in CA that makes the largest contribution
to the global neighbour affinity measure. It is assumed that i number of columns are
already placed into CA.
• Row ordering–. In this step, rows are ordered in the same way as columns are
ordered. The contribution of a column Ak, which is placed between Ai and Aj, can be
represented as follows:
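In the standard formulation of the bond energy algorithm, the contribution of placing column Ak between columns Ai and Aj is usually written as

    cont(Ai, Ak, Aj) = 2 bond(Ai, Ak) + 2 bond(Ak, Aj) − 2 bond(Ai, Aj)

where bond(Ax, Ay) = Σz aff(Az, Ax) · aff(Az, Ay), that is, the scalar product of the two corresponding columns of the attribute affinity matrix.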
Now, for a given set of attributes many orderings are possible; for n attributes there are n! possible orderings. One efficient approach to the ordering problem is searching for
clusters. The BEA proceeds by linearly traversing the set of attributes. In each step, one of
the remaining attributes is added and is inserted in the current order of attributes in such a
way that the maximal contribution is achieved. This is first done for the columns. Once all
the columns are determined, the row ordering is adapted to the column ordering, and the
resulting affinity matrix exhibits the desired clustering. To compute the contribution to the
global affinity value, the loss incurred through separation of previously joint columns is
subtracted from the gain, obtained by adding a new column. The contribution of a pair of
columns is the scalar product of the columns, which is maximal if the columns exhibit the
same value distribution.
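A compact Python sketch of the procedure is given below (an illustration only; it assumes the attribute affinity matrix is supplied as a NumPy array, and the helper names are not taken from the text):

    import numpy as np

    def bond(aa, x, y):
        # Bond between columns x and y: scalar product of their affinity columns.
        return float(np.dot(aa[:, x], aa[:, y]))

    def contribution(aa, order, pos, k):
        # Net gain of placing column k at position pos within the current order:
        # 2*bond(left, k) + 2*bond(k, right) - 2*bond(left, right).
        left = order[pos - 1] if pos > 0 else None
        right = order[pos] if pos < len(order) else None
        gain = 0.0
        if left is not None:
            gain += 2 * bond(aa, left, k)
        if right is not None:
            gain += 2 * bond(aa, k, right)
        if left is not None and right is not None:
            gain -= 2 * bond(aa, left, right)
        return gain

    def bond_energy(aa):
        n = aa.shape[0]
        order = list(range(min(2, n)))            # initialization: first two columns
        for k in range(2, n):                     # iteration: place remaining columns
            best = max(range(len(order) + 1),
                       key=lambda pos: contribution(aa, order, pos, k))
            order.insert(best, k)
        return order, aa[np.ix_(order, order)]    # row ordering follows column ordering

    # Clustered affinity matrix for the AA matrix of Example 5.4 (attributes A2, A3, A4)
    aa = np.array([[45, 45, 0], [45, 55, 0], [0, 0, 40]])
    print(bond_energy(aa))

For the affinity matrix of Example 5.4, the resulting ordering keeps A2 and A3 adjacent and places A4 apart from them, which is the clustering used in that example.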
Example 5.4.
Consider Q = {Q1, Q2, Q3, Q4} as a set of queries, A = {A1, A2, A3, A4} as a set of attributes for
the relation R, and S = {S1, S2, S3} as a set of sites in the distributed system. Assume that A1 is
the primary key of the relation R, and the following matrices represent the attribute usage
values of the relation R and the application access frequencies at the different sites:
A1 A2 A3 A4
Q1 0 1 1 0
Q2 1 1 1 0
Q3 1 0 0 1
Q4 0 0 1 0
S1 S2 S3 Sum
Q1 10 20 0 30
Q2 5 0 10 15
Q3 0 35 5 40
Q4 0 10 0 10
Since A1 is the primary key of the relation R, the following attribute affinity matrix is
considered here:
A2 A3 A4
A2 45 45 0
A3 45 55 0
A4 0 0 40
Now, the bond energy algorithm is applied to this attribute affinity matrix to obtain the clustered affinity matrix. Since A4 has zero affinity with both A2 and A3, the clustering places A2 and A3 together and A4 in a separate cluster. Hence, the relation R can be vertically partitioned into two fragments, one containing the attributes {A1, A2, A3} and the other containing the attributes {A1, A4}, where the primary key A1 is repeated in both fragments.
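The attribute affinity values shown above can be reproduced with a short NumPy computation (a sketch; the variable names are assumptions). The affinity of two attributes is simply the total frequency of the queries that use both of them:

    import numpy as np

    # Attribute usage matrix use(Qk, Ai) and access frequencies acc(Qk, Sl) from Example 5.4
    use = np.array([[0, 1, 1, 0],
                    [1, 1, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]])
    acc = np.array([[10, 20, 0],
                    [5, 0, 10],
                    [0, 35, 5],
                    [0, 10, 0]])
    freq = acc.sum(axis=1)          # total frequency of each query: [30, 15, 40, 10]

    # aff(Ai, Aj) = sum, over the queries k that use both Ai and Aj, of freq(k)
    aff = (use.T * freq) @ use
    print(aff[1:, 1:])              # rows/columns A2, A3, A4:
                                    # [[45 45  0]
                                    #  [45 55  0]
                                    #  [ 0  0 40]]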
MIXED FRAGMENTATION
Mixed fragmentation is a combination of horizontal and vertical fragmentation. This is also
referred to as hybrid or nested fragmentation. A mixed fragment consists of a horizontal
fragment that is subsequently vertically fragmented, or a vertical fragment that is then
horizontally fragmented. A mixed fragment is defined by using selection and projection
operations of relational algebra. For example, a mixed fragment for a given relation R can be
defined as follows:
Π_{a1, a2, ..., an}(σ_ρ(R))  or  σ_ρ(Π_{a1, a2, ..., an}(R))
where ρ is a predicate based on one or more attributes of the relation R and a1, a2, ..., an are attributes of the relation R.
Example 5.5.
Let us consider the same Project relation used in the previous example. The mixed fragments
of the above Project relation can be defined as follows:
Hence, first the Project relation is partitioned into two vertical fragments P1 and P2 and then
each of the vertical fragments is subsequently divided into two horizontal fragments, which
are shown in figure 5.3.
DERIVED FRAGMENTATION
A derived fragment is a horizontal fragment that is based on the horizontal fragmentation of a parent relation; it does not depend on the properties of its own attributes. Derived fragmentation is used to facilitate the join between fragments. The term child is used to refer to the relation that contains the foreign key, and the term parent is used for the relation containing the targeted primary key. Derived fragmentation is defined by using the semi-join operation of relational algebra. For a given child relation C and parent relation P, the derived fragmentation of C can be represented as follows:
Ci = C ⋉ Pi, 1 ≤ i ≤ w
where Pi = σ_Fi(P), Fi is the predicate according to which the primary horizontal fragment Pi of the parent relation is defined, and w is the number of horizontal fragments of the parent relation.
If a relation contains more than one foreign key, it will be necessary to select one of the
referenced relations as the parent relation. The choice can be based on one of the following
two strategies:
Example 5.6.
Let us consider the Department and Employee relations together [Chapter 1, Example 1.1].
Assume that the Department relation is horizontally fragmented according to the deptno so
that data relating to a particular department is stored locally. For instance,
Hence, in the Employee relation, each employee belongs to a particular department (deptno), which references the deptno field in the Department relation. Thus, it should be beneficial to store the Employee relation using the same fragmentation strategy as the Department relation.
This can be achieved by derived fragmentation strategy that partitions the Employee relation
horizontally according to the deptno as follows:
It can be shown that derived fragmentation also satisfies the correctness rules of fragmentation; in particular, the reconstruction rule holds because R = ∪ Ri, 1 ≤ i ≤ n.
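A small Python sketch of derived fragmentation via a semi-join is shown below (illustrative only; the sample tuples, department numbers, and fragment names are assumptions rather than data from the text):

    def semi_join(child, parent_fragment, key):
        # C ⋉ Pi: keep the child tuples whose key value appears in the parent fragment.
        keys = {t[key] for t in parent_fragment}
        return [t for t in child if t[key] in keys]

    department = [{"deptno": 10, "dept-name": "Accounts"},
                  {"deptno": 20, "dept-name": "Research"}]
    employee = [{"emp-id": 1, "emp-name": "A", "deptno": 10},
                {"emp-id": 2, "emp-name": "B", "deptno": 20},
                {"emp-id": 3, "emp-name": "C", "deptno": 10}]

    # Primary horizontal fragments of the parent relation Department
    dept_fragments = {"D1": [t for t in department if t["deptno"] == 10],
                      "D2": [t for t in department if t["deptno"] == 20]}

    # Derived fragments of the child relation Employee: Ei = Employee ⋉ Di
    emp_fragments = {name: semi_join(employee, frag, "deptno")
                     for name, frag in dept_fragments.items()}
    print(emp_fragments)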
NO FRAGMENTATION
A final strategy of fragmentation is not to fragment a relation at all. If a relation contains a small number of tuples and is not updated frequently, then it is better not to fragment the relation. It will be more sensible to leave the relation as a whole and simply replicate the
relation at each site of the distributed system.
The Allocation of Fragments
The allocation of fragments is a critical performance issue in the context of distributed
database design. Before allocation of fragments into different sites of a distributed system, it
is necessary to identify whether the fragments are replicated or not. The allocation of non-
replicated fragments can be handled easily by using “best-fit” approach. In best-fit approach,
the best cost-effective allocation strategy is selected among several alternatives of possible
allocation strategies. Replication of fragments adds extra complexity to the fragment allocation issue.
For allocating replicated fragments, one of the following methods can be used:
• In the first approach, the set of all sites in the distributed system is determined where
the benefit of allocating one replica of the fragment is higher than the cost of
allocation. One replica of the fragment is allocated to such beneficial sites.
• In the alternative approach, allocation of fragments is done using best-fit method
considering fragments are not replicated, and then progressively replicas are
introduced starting from the most beneficial. This process is terminated when addition
of replicas is no more beneficial.
Both the above approaches have some limitations. In the first approach, determining the cost and the benefit of each replica of the fragment is very complicated, whereas in the latter approach, each additional replica becomes progressively less beneficial. Moreover, reliability and availability increase if there are two or three copies of the fragment, but further copies give a less than proportional increase.
HORIZONTAL FRAGMENTS
1. In this case, using best-fit approach for non-replicated fragments, the fragment Ri of
relation R is allocated at site j where the number of references to the fragment Ri is
maximum. The number of local references of Ri at site j is as follows:
2. In this case, using the first approach for replicated fragments, the fragment Ri of
relation R is allocated at site j, where the cost of retrieval references of applications is
larger than the cost of update references to Ri from applications at any other site.
Hence, Bij can be evaluated as follows:
where C is a constant that measures the ratio between the cost of an update and a
retrieval access. Typically, update accesses are more expensive and C ≥ 1. Ri is allocated at all sites j* where Bij* is positive. When no Bij is positive, a single copy of Ri is placed at the site where Bij is maximum.
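The expressions referred to in the two cases above are commonly written as follows (given here as an assumption about the intended notation, following the usual best-fit and all-beneficial-sites formulations):

    (1) number of local references of Ri at site j:   Bij = Σk fkj · (rki + uki)

    (2) benefit of placing a replica of Ri at site j:  Bij = Σk fkj · rki − C · Σk Σj′≠j fkj′ · uki

where fkj is the frequency of application k at site j, and rki and uki are the numbers of retrieval and update references, respectively, made by application k to fragment Ri; the sums range over the applications that access Ri.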
VERTICAL FRAGMENTS
In this case, the benefit is calculated by vertically partitioning a fragment Ri into two vertical
fragments Rs and Rt allocated at site s and site t, respectively. The effect of this partition is
listed below.
• It is assumed that there are two sets of applications As and At issued at site s and site t, respectively, which use only the attributes of Rs and Rt, respectively, and thus become local to sites s and t.
• There is a set A1 of applications previously local to r (the site at which Ri was originally allocated), which use only the attributes of Rs or only the attributes of Rt. One additional remote reference is now required for these applications.
• There is a set A2 of applications previously local to r, which reference attributes of
both Rs and Rt. These applications make two additional remote references.
• There is a set A3 of applications at sites different than r, s, or t, which reference
attributes of both Rs and Rt. These applications make one additional remote reference.
Transparencies in Distributed Database Design
The main types of transparency in a distributed database system are the following:
• Distribution transparency
• Transaction transparency
• Performance transparency
• DBMS transparency.
If a system supports a higher degree of distribution transparency, the user sees a single
integrated schema with no details of fragmentation, allocation, or distribution. The
distributed DBMS stores all the details in the distribution catalog. All these distribution
transparencies are discussed in the following:
1. Fragmentation transparency–. With fragmentation transparency, the user is not aware that the data are fragmented at all and therefore writes queries against the global relations, exactly as in a centralized database. Fragmentation transparency is the highest level of distribution transparency.
Example 5.7.
2. Location transparency–. With location transparency, the user is aware of how the data are fragmented but still does not have any idea regarding the location of the fragments.
Location transparency is the middle level of distribution transparency. To retrieve data
from a distributed database with location transparency, the end user or programmer
has to specify the database fragment names but need not specify where these
fragments are located in the distributed system.
Example 5.8.
Let us assume that the tuples of the above Employee relation are horizontally partitioned into two fragments EMP1 and EMP2 depending on the selection predicates “emp-id ≤ 100” and “emp-id > 100”. Hence, the user is aware that the Employee relation is horizontally fragmented into the two relations EMP1 and EMP2, but has no idea at which sites these fragments are stored. Thus, the user will write the
following SQL statement for the above query “retrieve the names of all employees of
branch number 10”:
Example 5.9.
Let us assume that the horizontal fragments EMP1 and EMP2 of the Employee relation in Example 5.8 are replicated and stored at different sites of the distributed
system. Further, assume that the distributed DBMS supports replication transparency.
In this case, the user will write the following SQL statement for the query “retrieve the
names of all employees of branch number 20 whose salary is greater than Rs. 50,000”:
If the distributed system does not support replication transparency, then the user will
write the following SQL statement for the above query considering there are a number
of replicas of fragments EMP1 and EMP2 of Employee relation:
Similarly, the above query can be rewritten as follows for a distributed DBMS with
replication transparency which does not exhibit location transparency:
Example 5.10.
Here, it is assumed that replicas of fragments P1 and P2 of the Project relation are
allocated to different sites of the distributed system such as site1, site3, and site4.
5. Naming transparency–. Naming transparency means that the users are not aware of
the actual name of the database objects in the system. If a system supports naming
transparency, the user will specify the alias names of database objects for data
accessing. In a distributed database system, each database object must have a unique
name. The distributed DBMS must ensure that no two sites create a database object
with the same name. To ensure this, one solution is to create a central name
server, which ensures the uniqueness of names of database objects in the system.
However, this approach has several disadvantages as follows:
1. Loss of some local autonomy, because during creation of new database objects, each site has to check the uniqueness of names of database objects with the central name server.
2. Performance may be degraded if the central site becomes a bottleneck.
3. Low availability and reliability, because if the central site fails, the remaining sites cannot create any new database object. As availability decreases, reliability also decreases.
A second solution is to prefix a database object with the identifier of the site that created it. This naming method is also able to identify each fragment along with each of its copies. Therefore, copy 2 of fragment 1 of the Employee relation created at site 3 can be referred to as S3.Employee.F1.C2. The only problem with this approach is that it results in loss of distribution transparency.
One solution that can overcome the disadvantages of the above two approaches is the use
of aliases (sometimes called synonyms) for each database object. It is the responsibility of
the distributed database system to map an alias to the appropriate database object.
The distributed system R* differentiates between an object’s printname and its system-wide name (global name). The printname is the name through which the users refer to the database
object. The system-wide name or global name is a globally unique internal identifier for the
database object that is never changed. The system-wide name contains four components as
follows:
• Creator ID–. This represents a unique identifier for the user who created the database object.
• Creator site ID–. It indicates a globally unique identifier for the site from which the
database object was created.
• Local name–. It represents an unqualified name for the database object.
• Birth-site ID–. This represents a globally unique identifier for the site at which the
object was initially stored.
For example, a system-wide name of the form projectleader@india.localBranch@kolkata represents an object with local name localBranch, created by the user projectleader at the India site and initially stored at the Kolkata site.
Example 5.11.
In this example, data distribution transparencies for update application have been considered.
Assume that the relational schema Employee (emp-id, emp-name, designation, salary,
emp-branch, project-no) is fragmented as follows:
Consider that an update request is generated in the distributed system: the branch of the employee with emp-id 55 is to be changed from emp-branch 10 to emp-branch 20. The queries written by the user are illustrated in the following for different levels of transparency, namely fragmentation transparency, location transparency, and local mapping transparency.
• Fragmentation transparency:
    Update Employee
    set emp-branch = 20
    where emp-id = 55.
• Location transparency:
    Select emp-name, project-no into $emp-name, $project-no from Emp1
    where emp-id = 55
    Select salary, design into $salary, $design from Emp2
    where emp-id = 55
    Insert into Emp3 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp4 (emp-id, salary, design)
    values (55, $salary, $design)
    Delete from Emp1 where emp-id = 55
    Delete from Emp2 where emp-id = 55
• Local mapping transparency:
    Select emp-name, project-no into $emp-name, $project-no from Emp1 at site 1
    where emp-id = 55
    Select salary, design into $salary, $design from Emp2 at site 2
    where emp-id = 55
    Insert into Emp3 at site 3 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp3 at site 7 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp4 at site 4 (emp-id, salary, design)
    values (55, $salary, $design)
    Insert into Emp4 at site 8 (emp-id, salary, design)
    values (55, $salary, $design)
    Delete from Emp1 at site 1 where emp-id = 55
    Delete from Emp1 at site 5 where emp-id = 55
    Delete from Emp2 at site 2 where emp-id = 55
    Delete from Emp2 at site 6 where emp-id = 55
Here, it is assumed that the fragment Emp1 has two replicas stored at sites 1 and 5, the fragment Emp2 has two replicas stored at sites 2 and 6, the fragment Emp3 has two replicas stored at sites 3 and 7, and the fragment Emp4 has two replicas stored at sites 4 and 8.
Transaction Transparency
Transaction transparency in a distributed DBMS ensures that all distributed transactions
maintain the distributed database integrity and consistency. A distributed transaction can
update data stored at many different sites connected by a computer network. Each transaction
is divided into several subtransactions (represented by an agent), one for each site that has
to be accessed. Transaction transparency ensures that the distributed transaction will be
successfully completed only if all subtransactions executing in different sites associated with
the transaction are completed successfully. Thus, a distributed DBMS requires complex
mechanism to manage the execution of distributed transactions and to ensure the database
consistency and integrity. Moreover, transaction transparency becomes more complex due to
fragmentation, allocation, and replication schemas in distributed DBMS. Two further aspects
of transaction transparency are concurrency transparency and failure transparency,
which are discussed in the following:
Functions that are lost due to failures will be picked up by another network node and
continued.
Performance Transparency
Performance transparency in a distributed DBMS ensures that the system performs its tasks as a centralized DBMS would. In other words, performance transparency in a distributed environment assures that the system does not suffer any performance degradation due to the distributed architecture and that it will choose the most cost-effective strategy to execute a request. In a distributed environment, the distributed query processor maps a data request into an ordered sequence of operations on local databases. In this context, the added complexity of fragmentation, allocation, and replication schemas has to be considered. The distributed query processor has to take decisions regarding issues such as which fragment to access, which copy of a replicated fragment to use, and at which site an operation is to be executed.
The distributed query processor determines an execution strategy that would be optimized
with respect to some cost function. Typically, the costs associated with a distributed data
request include the following:
• The access time (I/O) cost involved in accessing the physical data on disk.
• The CPU time cost incurred when performing operations on data in main memory.
• The communication cost associated with the transmission of data across the network.
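Taken together, these components are often combined into a single weighted cost function; a simple additive model along these lines (the weights are an assumption for illustration, not values from the text) is

    total cost = wI/O · (number of disk accesses) + wCPU · (number of CPU instructions) + wCOM · (amount of data transmitted between sites)

where the weighting coefficients reflect the relative speeds of the disks, the processors, and the network at the sites involved.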
A number of query processing and query optimization techniques have been developed for
distributed database system: some of them minimize the total cost of query execution time
[Sacco and Yao, 1982], and some of them attempt to maximize the parallel execution of
operations [Epstein et al., 1978] to minimize the response time of queries.
DBMS Transparency
DBMS transparency in a distributed environment hides the knowledge that the local
DBMSs may be different and is, therefore, only applicable to heterogeneous distributed
DBMSs. This is also known as heterogeneity transparency, which allows the integration of
several different local DBMSs (relational, network, and hierarchical) under a common global
schema. It is the responsibility of distributed DBMS to translate the data requests from the
global schema to local DBMS schemas to provide DBMS transparency.
Chapter Summary
• Distributed database design involves the following important issues: fragmentation,
replication, and allocation.
• Fragmentation–. A global relation may be divided into a number of
subrelations, called fragments, which are then distributed among sites. There are
two main types of fragmentation: horizontal and vertical. Horizontal fragments
are subsets of tuples and vertical fragments are subsets of attributes. Two other types of fragmentation are mixed and derived fragmentation.
• Allocation–. Allocation involves the issue of allocating fragments among sites.
• Replication–. The distributed database system may maintain a copy of
fragment at several different sites.
• Fragmentation must ensure the correctness rules – completeness, reconstruction, and
disjointness.
• Alternative data allocation strategies are centralized, partitioned, selective replication,
and complete replication.
• Transparency hides the implementation details of the distributed systems from the
users. Different transparencies in distributed systems are distribution transparency,
transaction transparency, performance transparency, and DBMS transparency.
Chapter 6. Distributed DBMS Architecture
This chapter introduces the architecture of different distributed systems such as client/server
system and peer-to-peer distributed system. Owing to the diversity of distributed systems, it
is very difficult to generalize the architecture of distributed DBMSs. Different alternative
architectures of the distributed database systems and the advantages and disadvantages of
each system are discussed in detail. This chapter also introduces the concept of a multi-
database system (MDBS), which is used to manage the heterogeneity of different DBMSs in
a heterogeneous distributed DBMS environment. The classification of MDBSs and the
architecture of such databases are presented in detail.
The outline of this chapter is as follows. Section 6.2 introduces different alternative
architectures of client/server systems and pros and cons of these systems. In Section
6.3 alternative architectures for peer-to-peer distributed systems are discussed. Section
6.4 focuses on MDBSs. The classifications of MDBSs and their corresponding architectures
are illustrated in this section.
Introduction
The architecture of a system reflects the structure of the underlying system. It defines the
different components of the system, the functions of these components and the overall
interactions and relationships between these components. This concept is true for general
computer systems as well as for software systems. The software architecture of a program or
computing system is the structure or structures of the system, which comprises software
elements or modules, the externally visible properties of these elements and the relationships
between them. Software architecture can be thought of as the representation of an
engineering system and the process(es) and discipline(s) for effectively implementing the
design(s) of such a system.
A distributed database system can be considered as a large-scale software system; thus, the
architecture of a distributed system can be defined in a manner similar to that of software
systems. This chapter introduces the different alternative reference architectures of
distributed database systems such as client/server, peer-to-peer and MDBSs.
Client/Server System
In the late 1970s and early 1980s smaller systems (minicomputers) were developed that required less power and air conditioning. The term client/server was first used in the 1980s, and it gained acceptance in referring to personal computers (PCs) on a network. In the late 1970s, Xerox developed the standards and technology that are familiar today as Ethernet. This provided a standard means for linking together computers from different manufacturers and formed the basis for modern local area networks (LANs) and wide area networks (WANs). The client/server system was developed to cope with the rapidly changing business
environment. The general forces that drive the move to client/server systems are as follows:
• A strong business requirement for decentralized computing horsepower.
• Standard, powerful computers with user-friendly interfaces.
• Mature, shrink-wrapped user applications with widespread acceptance.
• Inexpensive, modular systems designed with enterprise-class features, such as power and network redundancy and file archiving, together with network protocols to link them together.
• Growing cost/performance advantages of PC-based platforms.
Usually, a client is defined as a requester of services, and a server is defined as the provider
of services. A single machine can be both a client and a server depending on the software
configuration. Sometimes, the term server or client refers to the software rather than the
machines. Generally, server software runs on powerful computers dedicated for exclusive
use of business applications. On the other hand, client software runs on common PCs or
workstations. The properties of a server are:
• Passive (slave)
• Waiting for requests
• On request, serves clients and sends a reply.
The properties of a client are:
• Active (master)
• Sending requests
• Waits until reply arrives.
A server can be stateless or stateful. A stateless server does not keep any information
between requests. A stateful server can remember information between requests.
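The passive-server/active-client interaction described above can be illustrated with a small Python socket sketch (not from the text; the host, port, and message contents are arbitrary assumptions):

    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 9090

    def server():
        # Passive role: wait for a request, serve it, and send back a reply.
        with socket.create_server((HOST, PORT)) as srv:
            conn, _addr = srv.accept()
            with conn:
                request = conn.recv(1024).decode()
                conn.sendall(("result set for: " + request).encode())

    def client():
        # Active role: send a request and wait until the reply arrives.
        with socket.create_connection((HOST, PORT)) as sock:
            sock.sendall(b"SELECT emp-name FROM Employee")
            print(sock.recv(1024).decode())

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.5)          # give the server a moment to start listening
    client()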
The client/server environment is more difficult to maintain for a variety of reasons.
Two-tier architecture –. A generic client/server architecture has two types of nodes on the
network: clients and servers. As a result, these generic architectures are sometimes referred to as two-tier architectures. In a two-tier client/server architecture, the user system interface is
usually located in the user’s desktop environment, and the database management services are
usually located on a server that services many clients. Processing management is split
between the user system interface environment and the database management server
environment. The general two-tier architecture of a Client/Server system is illustrated
in figure 6.2.
In a two-tier client/server system, it may occur that multiple clients are served by a single
server, called multiple clients–single server approach. Another alternative is multiple
servers providing services to multiple clients, which is called multiple clients–multiple
servers approach. In the case of multiple clients–multiple servers approach, two alternative
management strategies are possible: either each client manages its own connection to the
appropriate server or each client communicates with its home server, which further
communicates with other servers as required. The former approach simplifies server code but
complicates the client code with additional functionalities, which leads to a heavy (fat)
client system. On the other hand, the latter approach loads the server machine with all data
management responsibilities and thus leads to a light (thin) client system. Depending on the
extent to which the processing is shared between the client and the server, a server can be
described as fat or thin. A fat server carries the larger proportion of the processing load,
whereas a thin server carries a lesser processing load.
The two-tier client/server architecture is a good solution for distributed computing when
work groups of up to 100 people are interacting on a local area network simultaneously. It
has a number of limitations also. The major limitation is that performance begins to
deteriorate when the number of users exceeds 100. A second limitation of the two-tier
architecture is that implementation of processing management services using vendor
proprietary database procedures restricts flexibility and choice of DBMS for applications.
Three-tier architecture –. Some networks of client/server architecture consist of three
different kinds of nodes: clients, application servers, which process data for the clients, and
database servers, which store data for the application servers. This arrangement is
called three-tier architecture. The three-tier architecture (also referred to as multi-
tier architecture) emerged to overcome the limitations of the two-tier architecture. In the
three-tier architecture, a middle tier was added between the user system interface client
environment and the database management server environment. The middle tier can perform
queuing, application execution, and database staging. There are various ways of
implementing the middle tier, such as transaction processing monitors, message servers, web
servers, or application servers. The typical three-tier architecture of a client/server system is
depicted in figure 6.3.
The most basic type of three-tier architecture has a middle layer consisting of transaction
processing (TP) monitor technology. The TP monitor technology is a type of message
queuing, transaction scheduling and prioritization service where the client connects to the TP
monitor (middle tier) instead of the database server. The transaction is accepted by the
monitor, which queues it and takes responsibility for managing it to completion, thus, freeing
up the client. TP monitor technology also provides a number of services such as updating
multiple DBMSs in a single transaction, connectivity to a variety of data sources including
flat files, non-relational DBMSs and the mainframe, the ability to attach priorities to
transactions and robust security. When all these functionalities are provided by third-party
middleware vendors, it complicates the TP monitor code, which is then referred to as TP
heavy, and it can service thousands of users. On the other hand, if all these functionalities are embedded in the DBMS, the arrangement can be considered a two-tier architecture and is referred to as TP Lite. A limitation of TP monitor technology is that the implementation code is usually written in a lower-level language, and it is not yet widely available in popular visual toolsets.
In general, a multi-tier (or n-tier) architecture may deploy any number of distinct services,
including transitive relations between application servers implementing different functions of
business logic, each of which may or may not employ a distinct or shared database system.
Peer-to-Peer Distributed System
Considering the complexity associated with discovering, communicating with and managing
the large number of computers involved in a distributed system, the software module at each
node in a peer-to-peer distributed system is typically structured in a layered manner. Thus,
the software modules of peer-to-peer applications can be divided into the three layers, known
as the base overlay layer, the middleware layer, and the application layer. The base
overlay layer deals with the issue of discovering other participants in the system and creating
a mechanism for all nodes to communicate with each other. This layer ensures that all
participants in the network are aware of other participants. The middleware layer includes
additional software components that can be potentially reused by many different
applications. The functionalities provided by this layer include the ability to create a
distributed index for information in the system, a publish/subscribe facility and security
services. The functions provided by the middleware layer are not necessary for all
applications, but they are developed to be reused by more than one application. The
application layer provides software packages intended to be used by users and developed so
as to exploit the distributed nature of the peer-to-peer infrastructure. There is no standard
terminology across different implementations of the peer-to-peer system, and thus, the term
“peer-to-peer” is used for general descriptions of the functionalities required for building a
generic peer-to-peer system. Most of the peer-to-peer systems are developed as single-
application systems.
Distributed DBMS (DDBMS) component. The DDBMS component is the controlling unit
of the entire system. This component provides the different levels of transparencies such as
data distribution transparency, transaction transparency, performance transparency and
DBMS transparency (in the case of a heterogeneous DDBMS). Ozsu has identified four major components of a DDBMS, as listed below:
1. The user interface handler –. This component is responsible for interpreting user
commands as they come into the system and formatting the result data as it is sent to
the user.
2. The semantic data controller –. This component is responsible for checking integrity
constraints and authorizations that are defined in the GCS, before processing the user
requests.
3. The global query optimizer and decomposer –. This component determines an
execution strategy to minimize a cost function and translates the global queries into
local ones using the global and local conceptual schemas as well as the global system
catalog. The global query optimizer is responsible for generating the best strategy to
execute distributed join operations.
4. The distributed execution monitor –. It coordinates the distributed execution of the
user request. This component is also known as distributed transaction manager. During
the execution of distributed queries, the execution monitors at various sites may and
usually do communicate with one another.
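The way these four components cooperate can be sketched, very schematically, in Python (all class and method names below are illustrative assumptions and do not correspond to any real DDBMS implementation):

    class UserInterfaceHandler:
        def parse(self, command):
            # Interpret the user command and wrap it as an internal request.
            return {"query": command}

    class SemanticDataController:
        def check(self, request):
            # Check integrity constraints and authorizations defined in the GCS.
            if "drop" in request["query"].lower():
                raise PermissionError("operation not authorized")
            return request

    class GlobalQueryOptimizer:
        def decompose(self, request, sites=("site1", "site2")):
            # Translate the global query into local queries, one per relevant site.
            return [{"site": s, "query": request["query"]} for s in sites]

    class DistributedExecutionMonitor:
        def execute(self, local_queries):
            # Coordinate the execution of the local queries and merge the results.
            return ["result from " + q["site"] for q in local_queries]

    def process(command):
        request = UserInterfaceHandler().parse(command)
        request = SemanticDataController().check(request)
        local_queries = GlobalQueryOptimizer().decompose(request)
        return DistributedExecutionMonitor().execute(local_queries)

    print(process("select emp-name from Employee"))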
DC component. The DC component is the software that enables all sites to communicate
with each other. The DC component contains all information about the sites and the links.
Global system catalog (GSC). The GSC provides the same functionality as system catalog
of a centralized DBMS. In addition to metadata of the entire database, a GSC contains all
fragmentation, replication and allocation details considering the distributed nature of a
DDBMS. It can itself be managed as a distributed database and thus, it can be fragmented
and distributed, fully replicated or centralized like any other relations in the system. [The
details of GSC management will be introduced in Chapter 12, Section 12.2].
Local DBMS (LDBMS) component. The LDBMS component is a standard DBMS, running at each site that has a database, and it is responsible for controlling the local data. Each LDBMS
component has its own local system catalog that contains the information about the data
stored at that particular site. In a homogeneous DDBMS, the LDBMS component is the same
product, replicated at each site, while in a heterogeneous DDBMS, there must be at least two
sites with different DBMS products and/or platforms. The major components of an LDBMS
are as follows:
1. The local query optimizer –. This component is used as the access path selector and
responsible for choosing the best access path to access any data item for the execution
of a query (the query may be a local query or part of a global query executed at that
site).
2. The local recovery manager –. The local recovery manager ensures the consistency of the local database in spite of failures.
3. The run-time support processor –. This component physically accesses the database
according to the commands in the schedule generated by the query optimizer and is
responsible for managing main memory buffers. The run-time support processor is the
interface to the operating system and contains the database buffer (or cache) manager.
Distributed Data Independence
The reference architecture of a DDBMS is an extension of ANSI–SPARC architecture;
therefore, data independence is supported by this model. Distributed data independence
means that upper levels are unaffected by changes to lower levels in the distributed database
architecture. Like a centralized DBMS, both distributed logical data
independence and distributed physical data independence are supported by this
architecture. In a distributed system, the user queries data irrespective of its location,
fragmentation or replication. Furthermore, any changes made to the GCS do not affect the
user views in the global external schemas. Thus, distributed logical data independence is
provided by global external schemas in distributed database architecture. Similarly, the GCS
provides distributed physical data independence in the distributed database environment.
Multi-Database System (MDBS)
An MDBS is software through which multiple databases can be manipulated and accessed using a single data manipulation language and a single common data model (i.e., through a single application) in a heterogeneous environment, without interfering with the normal execution of the individual database systems. The MDBS has developed from a requirement to manage and retrieve data
from multiple databases within a single application while providing complete autonomy to
individual database systems. To support DBMS transparency, MDBS resides on top of
existing databases and file systems and presents a single database to its users. An MDBS
maintains a global schema against which users issue queries and updates, and this global
schema is constructed by integrating the schemas of local databases. To execute a global
query, the MDBS first translates it into a number of subqueries, and converts these
subqueries into appropriate local queries for running on local DBMSs. After completion of
execution, local results are merged and the final global result for the user query is generated.
An MDBS controls multiple gateways and manages local databases through these gateways.
MDBSs can be classified into two different categories based on the autonomy of the
individual DBMSs. These are non-federated MDBSs and federated MDBSs. A federated
MDBS is again categorized as loosely coupled federated MDBS and tightly coupled
federated MDBS based on who manages the federation and how the components are
integrated. Further, a tightly coupled federated MDBS can be classified as single federation
tightly coupled federated MDBS and multiple federations tightly coupled federated
MDBS. The complete taxonomy of MDBSs [Sheth and Larson, 1990] is depicted in figure
6.5.
Figure 6.5. Taxonomy of Multi-database Systems
Two types of federated MDBSs (FMDBSs) have been identified, namely, loosely coupled FMDBS and tightly coupled FMDBS, depending on how multiple component databases are integrated. An
FMDBS is loosely coupled if it is the user’s responsibility to create and maintain the
federation and there is no control enforced by the federated system and its administrators.
Similarly, an FMDBS is tightly coupled if the federation and its administrator(s) have the
responsibility for creating and maintaining the integration and they actively control the
access to component databases. A federation is built by a selective and controlled integration
of its components. A tightly coupled FMDBS may have one or more federated schemas. A
tightly coupled FMDBS is said to have single federation if it allows the creation and
management of only one federated schema. On the other hand, a tightly coupled FMDBS is
said to have multiple federations if it allows the creation and management of multiple
federated schemas. A loosely coupled FMDBS always supports multiple federated schemas.
In a tightly coupled FMDBS, federated schema takes the form of schema integration. For
simplicity, a single (logical) federation is considered for the entire system, and it is
represented by a GCS. A number of export schemas are integrated into the GCS, where the
export schemas are created through negotiation between the local databases and the GCS.
Thus, in an FMDBS, the GCS is a subset of local conceptual schemas and consists of the
data that each local DBMS agrees to share. The GCS of a tightly coupled FMDBS involves
the integration of either parts of the local conceptual schemas or the local external schemas.
Global external schemas are generated through negotiation between global users and the
GCS. The reference architecture of a tightly coupled FMDBS is depicted in figure 6.7.
Chapter Summary
This chapter introduces several alternative architectures for a distributed database system, such as client/server, peer-to-peer and multi-database systems.
Chapter 7. Distributed Transaction Management
The organization of this chapter is as follows. Section 7.1 introduces the basic concepts of
transaction management, and ACID properties of transactions are described in Section 7.2.
In Section 7.3, the objectives of distributed transaction management are explained, and a
model for transaction management in distributed systems is described in Section 7.4. The
classification of transactions is introduced in Section 7.5.
The basic difference between a query and a transaction is the fundamental properties of a
transaction: atomicity and durability. A transaction ensures the consistency of the database
irrespective of the facts that several transactions are executed concurrently and that failures
may occur during their execution. A transaction can be considered to be made up of a
sequence of read and write operations on the database, together with some computational
steps. In that sense, a transaction may be thought of as a program with embedded database
access queries. The difference between a query and a transaction is illustrated with an
example in the next section.
Each transaction has to terminate. The outcome of the termination depends on the success or
failure of the transaction. Once a transaction starts execution, it may terminate in one of the
following possible ways:
• The transaction aborts if a failure occurred during its execution.
• The transaction commits if it was completed successfully.
Example 7.1.
• “Update employee salary by 20% for the employee whose emp-id is E001 in the
Employee relation”.
This query can be represented as a transaction with the name salary_update by using the
embedded SQL notation as listed below.
Begin-transaction salary_update
    begin
        EXEC SQL UPDATE EMPLOYEE
        SET SALARY = SALARY * 1.2
        WHERE EMP-ID = “E001”
    end
The Begin-transaction and end statements represent the beginning and end of a transaction
respectively, which is treated as a program that performs a database access.
Atomicity. Atomicity refers to the fact that each transaction is a single logical unit of work in
a database environment that may consist of a number of operations. Thus, either all
operations of a transaction are to be completed successfully, or none of them are carried out
at all. This property is known as all-or-nothing property of a transaction. It is the
responsibility of the DBMS to maintain atomicity even when failures occur in the system. If
the execution of a transaction is interrupted by any sort of failures, the DBMS will determine
what actions should be performed with that transaction to recover from the failed state. To
maintain atomicity during failures, either the system should complete the remaining
operations of the transaction or it should undo all the operations that have already been
executed by the transaction before the failure.
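The all-or-nothing behaviour can be demonstrated in a few lines of Python using SQLite (a sketch with an assumed schema; any DBMS with transaction support behaves analogously):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (emp_id TEXT PRIMARY KEY, salary REAL)")
    conn.execute("INSERT INTO employee VALUES ('E001', 1000.0)")
    conn.commit()

    try:
        with conn:  # one transaction: commits on success, rolls back on failure
            conn.execute("UPDATE employee SET salary = salary * 1.2 "
                         "WHERE emp_id = 'E001'")
            raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass

    # The update was undone, so the salary is unchanged.
    print(conn.execute("SELECT salary FROM employee").fetchone())   # (1000.0,)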
The execution of a transaction may be interrupted owing to a variety of reasons such as bad
input data, deadlock, system crash, processor failures, power failure, or media and
communication links failures (in case of distributed system). The recovery after a transaction
failure is divided into two different categories based on the different types of failures that can
occur in a database environment. The activity of ensuring transaction atomicity in the case of
transaction aborts due to failures caused by bad input data, deadlock and other factors
(except system crash and hardware failures) is called transaction recovery. On the other
hand, the activity of preserving transaction atomicity in the case of system crash or hardware
failures is called crash recovery.
Consistency. Referring to the correctness of data, this property states that a transaction must
transform the database from one consistent state to another consistent state. It is the
responsibility of both the semantic data controller (integrity manager) and the concurrency
control mechanisms to maintain consistency of the data. The semantic data controller can
ensure consistency by enforcing all the constraints that have been specified in the database
schema, such as integrity and enterprise constraints. Similarly, the responsibility of a
concurrency control mechanism is to disallow transactions from reading or updating “dirty
data”. Dirty data refers to the state of the data at which the data have been updated by a
transaction, but the transaction has not yet committed. (The details of distributed
concurrency control mechanisms are discussed in Chapter 8). A classification of consistency
has been defined [Gray et al., 1976] based on dirty data, which groups databases into four levels of consistency (degree 0 to degree 3).
Thus, it is clear that a higher degree of consistency covers all the lower levels. These
different degrees of consistency provide flexibility to application developers to define
transactions that can operate at different levels.
1. Lost update –. This problem occurs when an apparently successful, completed update operation made by one transaction (the first) is overwritten by another transaction (the second). The lost update problem is illustrated in Example 7.2.
Example 7.2.
Assume that there are two concurrent transactions, T1 and T2, occurring in
a database environment. The transaction T1 is withdrawing $500 from an account with
balance B, with an initial value of $2,000, while the transaction T2 is depositing an
amount of $600 into the same account. If these two transactions are executed serially,
then the balance B of the account will be $2,100. Consider that both the transactions
are started nearly at the same time, and both have read the same value of B, that is,
$2,000. Owing to the concurrent access to the same data item, when the transaction T1
is decrementing the value of B to $1,500, the transaction T2 is incrementing the value
of B to $2,600. Thus, the update to the data item B made by the transaction T1 will be overwritten by the transaction T2. This is known as the lost update problem.
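Written as an interleaved schedule, in the notation used in the later examples, one possible interleaving that produces the lost update of Example 7.2 is:
• T1: read(B) [B = $2,000]
• T2: read(B) [B = $2,000]
• T1: B ← B − 500; write(B) [B = $1,500]
• T2: B ← B + 600; write(B) [B = $2,600; the update made by T1 is lost]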
2. Cascading aborts –. This problem occurs when one transaction allows other
transactions to read its uncommitted data and decides to abort; the transactions that
have read or modified its uncommitted data will also abort. Consider the
previous Example 7.2. If transaction T1 permits transaction T2 to read its uncommitted data, then, when transaction T1 aborts, transaction T2 must also abort as a consequence.
The type of isolation that is ensured by not allowing uncommitted data to be seen by
other transactions is called cursor stability. It is obvious that higher levels of
consistency provide more isolation among transactions. If different degrees of
consistency are considered from the point of view of the isolation property, then
degree 0 prevents lost update problem, degree 2 prevents cascading aborts and degree
3 provides the full isolation by restricting both the above problems.
A set of isolation levels has been defined by ANSI SQL2 [ANSI, 1992] based on the
events dirty read, non-repeatable or fuzzy read and phantom read, when full isolation is
not maintained in a database environment. These are read uncommitted, read committed,
repeatable read and anomaly serializable. Dirty read refers to the situation when a
transaction reads the dirty data of another transaction. Fuzzy read or non-repeatable
read occurs when one transaction reads the value of a data item that is modified or deleted
by another transaction. If the first transaction attempts to re-read the data item, either that
data item may not be found or it may read a different value. Similarly, it is possible that
when a transaction is performing a search operation depending on a selection predicate,
another transaction may be inserting new tuples into the database satisfying that predicate.
This is known as phantom read. The different isolation levels based on these events are
listed below.
• Read uncommitted –. At this isolation level, all the three above events may occur.
• Read committed –. At this isolation level, non-repeatable read and phantom read are
possible, but dirty read cannot occur.
• Repeatable read –. Only phantom read is possible at this isolation level.
• Anomaly serializable –. At this isolation level, none of the above events – dirty read,
phantom read and fuzzy read – is possible.
Example 7.3.
Assume that there are two concurrent transactions, T1 and T2, occurring in a
database environment. The transaction T1 is depositing $500 to an account X with an initial
value $3,000, while the transaction T2 is withdrawing an amount $1,000 from the same
account X. Further assume that the transaction T1 is executed first and allows the
transaction T2 to read the uncommitted value $3,500 for X and then aborts. This problem is
known as dirty read problem, which is illustrated in the following:
• T1: write(X)
• T2: read(X)
• T1: abort
Thus, the transaction T2 reads the dirty data of the transaction T1.
Example 7.4.
In a DBMS, consider a situation when one transaction T1 reads the value of a data item X,
while another transaction T2 updates the same data item. Now, if the transaction T1 reads the
same data item X again, it will read some different value, because the data item is already
manipulated by the transaction T2. This problem is known as non-repeatable or fuzzy read
problem, which is illustrated in the following:
• T1: read(X)
• T2: write(X)
• T2: end transaction
• T1: read(X)
Example 7.5.
Durability. Durability refers to the fact that the effects of a successfully completed
(committed) transaction are permanently recorded in the database and will not be affected by
a subsequent failure. This property ensures that once a transaction commits, the changes are
durable and cannot be erased from the database. It is the responsibility of the recovery
subsystem of a DBMS to maintain durability property of transactions. The recovery manager
determines how the database is to be recovered to a consistent state (database recovery) in
which all the committed actions are reflected.
1. CPU and main memory utilization should be improved. Most of the typical
database applications spend much of their time waiting for I/O operations rather than
on computations. In a large system, the concurrent execution of these I/O bound
applications can turn into a bottleneck in main memory or in CPU time utilization. To
alleviate this problem, that is, to improve CPU and main memory utilization, a
transaction manager should adopt specialized techniques to deal with specific database
applications. This aspect is common to both centralized and distributed DBMSs.
2. Response time should be minimized. To improve the performance of transaction
executions, the response time of each individual transaction must be considered and
should be minimized. In the case of distributed applications, it is very difficult to
achieve an acceptable response time owing to the additional time required to
communicate between different sites.
3. Availability should be maximized. Although the availability in a distributed system
is better than that in a centralized system, it must be maximized for transaction
recovery and concurrency control in distributed databases. The algorithms
implemented by the distributed transaction manager must not block the execution of
those transactions that do not strictly need to access a site that is not operational.
4. Communication cost should be minimized. In a distributed system an additional
communication cost is incurred for each distributed or global application, because a
number of message transfers are required between sites to control the execution of a
global application. These messages are not only used to transfer data, but are required
to control the execution of the application. Preventative measures should be adopted
by the transaction manager to minimize the communication cost.
In a distributed DBMS, all these modules exist in each local DBMS. In addition, a global
transaction manager or transaction coordinator is required at each site to control the
execution of global transactions as well as of local transactions initiated at that site.
Therefore, an abstract model of transaction management at each site of a distributed system
consists of two different submodules: the transaction manager and the transaction
coordinator (which are described below).
Transaction manager –. The transaction manager at each site manages the execution of the
transactions that access data stored at that local site. Each such transaction may be a local
transaction or part of a global transaction. The structure of the transaction manager is similar
to that of its counterpart in a centralized system, and it is responsible for the following:
Classification of Transactions
At this point it is useful to discuss the classification of transactions. These classifications
have been defined based on various criteria such as the lifetime of the transaction,
organization of read and write operations within the transaction and the structure of the
transaction. Depending on their lifetime or duration, transactions can be classified into two
broad categories: short-duration transactions and long-duration transactions. A short-
duration transaction (also called online transaction) requires a very short
execution/response time and accesses a relatively small portion of the database. On the other
hand, a long-duration transaction (also called a batch transaction) requires a longer
execution/response time and generally accesses a larger portion of the database. Most statistical applications – report generation, complex queries, image processing – are characterized by long-duration transactions, whereas most common database applications, such as railway reservation, banking systems and project management, involve short-duration transactions.
Another way to divide transactions into different classes is based on the organization of the
read and the write operations within the transaction. In some transactions, read and write
operations are organized in a mixed manner; they do not appear in any specific order. These
are called general transactions. Some transactions restrict the organization of read and write
operations in such a way that all the read operations within the transaction are performed
before the execution of any write operation. These transactions are called two-step
transactions. Similarly, some transactions may require that a data item be read before it can be written (updated) within the transaction. These transactions are known as restricted (or read-before-write) transactions. If a transaction is characterized by
features of both two-step and restricted transactions, then it is called a restricted two-step
transaction.
Transactions can also be classified into three broad categories based on the structure of
transactions, namely, flat transactions, nested transactions and workflow models, which
are discussed in the following:
1. A flat transaction has a single start point (begin_transaction) and a single end point
or termination point (end_transaction). Most of the typical transaction management
database applications are characterized by flat transactions. Flat transactions are
relatively simple and perform well in short activities, but they are not well-suited for
longer activities.
2. A nested transaction includes other transactions within its own start point and end
point; thus, several transactions, called subtransactions, may be nested into one
transaction. A nested transaction is further classified into two categories: closed nested
transaction and open nested transaction. In a closed nested transaction commit
occurs in a bottom-up fashion, that is, parent transactions are committed after the
commit of subtransactions or child transactions that are nested within it, although
subtransactions are started after the parent transaction. On the other hand, open
nested transactions provide flexibility by allowing the commit of the parent
transaction before the commit of subtransactions. An example of an open nested
transaction is “Saga” [Garcia-Molina and Salem, 1987], which is a sequence of
transactions that can be interleaved with other transactions. In “Saga”, only two levels
of nesting are permitted, and it does not support full atomicity at the outer level. Thus,
flexibility is provided with respect to ensuring the ACID properties of transactions.
There are two general issues involved in workflows: specification of workflows and
execution of workflows. The key issues involved in specifying a workflow are as
follows:
Example 7.6.
Let us consider the transactions T1, T2, T3 and T4, where T1 represents a general
transaction, T2 represents a two-step transaction, T3 represents a restricted transaction
and T4 represents a two-step restricted transaction.
T1: {read(a1), read(b1), write(a1), read(b2), write(b2), write(b1), commit}
T2: {read(a1), read(b1), read(b2), write(a1), write(b2), write(b1), commit}
T3: {read(a1), read(b1), read(b2), write(a1), read(c1), write(b2), read(c2), write(b1), write(c1), write(c2), commit}
T4: {read(a1), read(b1), read(b2), read(c1), read(c2), write(a1), write(b2), write(b1), write(c1), write(c2), commit}
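These structural properties can be checked mechanically. The following Python sketch (operations encoded as (operation, data item) pairs; helper names are illustrative) tests whether a transaction is two-step and whether it is restricted (read-before-write); a transaction satisfying both is a restricted two-step transaction.

def is_two_step(ops):
    """True if every read operation appears before any write operation."""
    first_write = next((i for i, (op, _) in enumerate(ops) if op == 'write'), len(ops))
    return all(op == 'write' for op, _ in ops[first_write:])

def is_restricted(ops):
    """True if every data item is read before it is written (read-before-write)."""
    already_read = set()
    for op, item in ops:
        if op == 'read':
            already_read.add(item)
        elif item not in already_read:
            return False
    return True

# T4 of Example 7.6 (the commit operation is omitted): it satisfies both properties,
# so it is a restricted two-step transaction.
T4 = [('read', 'a1'), ('read', 'b1'), ('read', 'b2'), ('read', 'c1'), ('read', 'c2'),
      ('write', 'a1'), ('write', 'b2'), ('write', 'b1'), ('write', 'c1'), ('write', 'c2')]
print(is_two_step(T4), is_restricted(T4))   # True True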
Example 7.7.
Example 7.8.
Let us consider the example of patient’s health checkup transaction. The entire health
checkup activity consists of the following tasks and involves the following data.
• Task1 (T1): A patient requests health checkup registration, and the Patient database is accessed.
• Task2 (T2 & T3): Some medical tests have been performed, and the patient’s health
information is stored in the Health database. In this case, several medical tests can be
performed simultaneously. For simplicity, here it is assumed that two medical tests
have been performed.
• Task3 (T4): The doctor’s advice is taken, and hence the Doctor database is accessed.
• Task4 (T5): A bill is generated and the billing information is recorded into the
database.
In this patient’s health checkup transaction, there is a serial dependency of T2 and T3 on T1, of T4 on T2 and T3, and of T5 on T4. Hence, T2 and T3 can be performed in parallel, and T4 waits until their completion. The corresponding workflow is illustrated in figure 7.2.
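A minimal sketch (task names follow Example 7.8; the task bodies are placeholders rather than real database operations) of how this workflow can be represented as a dependency graph and executed so that T2 and T3 run in parallel:

from concurrent.futures import ThreadPoolExecutor

# Workflow of Example 7.8: T1 -> (T2 || T3) -> T4 -> T5.
def make_task(name):
    def task():
        print(f"{name} executed")
        return name
    return task

tasks = {t: make_task(t) for t in ("T1", "T2", "T3", "T4", "T5")}
depends_on = {"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"], "T5": ["T4"]}

done = set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(tasks):
        # Tasks whose dependencies are all satisfied and that have not run yet.
        ready = [t for t in tasks if t not in done and all(d in done for d in depends_on[t])]
        # T2 and T3 become ready together and are submitted in parallel.
        futures = {t: pool.submit(tasks[t]) for t in ready}
        for t, f in futures.items():
            f.result()          # wait for completion before scheduling successors
            done.add(t)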
Chapter Summary
• A transaction consists of a set of operations that represents a single logical unit of
work in a database environment and moves the database from one consistent state to
another. ACID property of a transaction represents atomicity, consistency, isolation
and durability properties.
• Atomicity property stipulates that either all the operations of a transaction should be
completed successfully or none of them are carried out at all.
• Consistency property ensures that a transaction transforms the database from one
consistent state to another consistent state.
• Isolation property indicates that transactions are executed independently of one
another in a database environment; that is, they are isolated from one another.
• Durability refers to the fact that the effects of a successfully completed (committed)
transaction are permanently recorded in the database and are not affected by a
subsequent failure.
• The major components or subsystems of a distributed transaction management system
are the transaction manager and the transaction coordinator. The transaction manager
at each site manages the execution of those transactions that access data stored at that
local site. The transaction coordinator at each site in a distributed system coordinates
the execution of both local and global transactions initiated at that site.
• Depending on their lifetime or duration, transactions can be classified into two broad
categories: short-duration transactions and long-duration transactions. Based on
the organization of read and write operations within them, transactions are classified
as general, two-step, restricted and restricted two-step. Transactions can also be classified into
three broad categories based on the structure of transactions: flat transactions, nested
transactions and workflow models.
Chapter 8. Distributed Concurrency Control
This chapter introduces the different concurrency control techniques to handle concurrent
accesses of data items in a distributed database environment. In this chapter, the concurrency
control anomalies and distributed serializability are discussed in detail. To maintain
distributed serializability, both pessimistic as well as optimistic concurrency control
mechanisms are employed, which are also briefly discussed in this chapter.
1. Lost update anomaly – This anomaly can occur when an apparently successfully completed update operation made by one user (transaction) is overwritten by another user (transaction). This anomaly is illustrated in example 8.1.
Example 8.1.
Consider the example of an online electronic funds transfer system accessed via
remote automated teller machines (ATMs). Assume that one customer (T1) is
withdrawing $500 from an account with balance B, initially $2,000, while another
customer (T2) is depositing an amount $600 into the same account via ATM from a
different place. If these two tasks are executed serially, then the balance B will be
$2,100. Further assume that both the tasks are started nearly at the same time and both
read the same value $2,000. Due to concurrent access to the same data item, when customer T1 decrements the value of B to $1,500, customer T2 increments the value of B to $2,600 (based on the value $2,000 it read earlier). Thus, the update to the data item B made by customer T1 is overwritten by customer T2, and the final balance becomes $2,600 instead of $2,100. Here, the consistency of the data is violated owing to this anomaly, known as the lost update anomaly.
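The interleaving of Example 8.1 can be reproduced with a short sketch (a plain variable stands in for the stored balance; no real DBMS is involved):

# Simplified illustration of the lost update anomaly of Example 8.1.
# Both transactions read the balance before either one writes it back,
# so one update overwrites the other.
balance = 2000

t1_local = balance          # T1 reads B = 2000
t2_local = balance          # T2 reads B = 2000 (before T1 writes)

t1_local -= 500             # T1 withdraws $500 locally
t2_local += 600             # T2 deposits $600 locally

balance = t1_local          # T1 writes B = 1500
balance = t2_local          # T2 writes B = 2600 -- T1's update is lost

print(balance)              # 2600, instead of the correct serial result 2100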
2. Uncommitted dependency or dirty read anomaly – This problem occurs when one
transaction allows other transactions to read its data before it has committed and then
decides to abort. The dirty read anomaly is explained in example 8.2.
Example 8.2.
3. Inconsistent analysis anomaly – This anomaly is illustrated in example 8.3.
Example 8.3.
Let us consider the example of a banking system in which two transactions T3 and T4
are accessing the database concurrently. The transaction T3 is calculating the average
balance of all customers of a particular branch while the transaction T4 is depositing
$1,000 into one customer account with balance balx at the same time. Hence, the
average balance calculated by the transaction T3 is incorrect, because the
transaction T4 incremented and updated the balance of one customer during the
execution of the transaction T3. This anomaly, known as inconsistent analysis
anomaly, can be avoided by preventing the transaction T4 from reading and updating
the value of balx until the transaction T3 completes its execution.
4. Non-repeatable or fuzzy read anomaly – This anomaly is illustrated in example 8.4.
Example 8.4.
Assume that in the employee database, one transaction T1 reads the salary of an
employee while another transaction T2 is updating the salary of the same employee (or
the same employee record is deleted from the employee database by the
transaction T2) concurrently. Now, if the transaction T1 attempts to re-read the value of the salary of the same employee, it will read a different value (or the record will not be found), since the employee database has been updated by the transaction T2. Thus, two read operations within the same transaction T1 return different values. This anomaly, which occurs due to concurrent access to the same data item, is known as the fuzzy read anomaly.
5. Phantom read anomaly – This anomaly occurs when, while a transaction is performing some operation on the database based on a selection predicate, another transaction inserts new tuples satisfying that predicate into the same database. This is known as phantom read. For example, assume that in the employee database, one transaction T1 retrieves all employees belonging to the R&D department while another transaction T2 inserts new employees into that department. Hence, if the transaction T1 re-executes the same operation, the retrieved data set will contain additional (phantom) tuples that have been inserted by the transaction T2 in the meantime.
Besides all the above anomalies, multiple-copy consistency problem can arise in a
distributed database system as a result of data distribution. This problem occurs when data
items are replicated and stored at different locations in a DDBMS. To maintain the
consistency of the global database, when a replicated data item is updated at one site, all
other copies of the same data item must be updated. If any copy is not updated, the database
becomes inconsistent. The updates of the replicated data items are carried out either
synchronously or asynchronously to preserve the consistency of the data.
Distributed Serializability
A transaction consists of a sequence of read and write operations together with some
computation steps that represents a single logical unit of work in a database environment. All
the operations of a transaction together form a single unit of atomicity; that is, either all the operations should be completed successfully or none of them are carried out at all. Ozsu has defined a formal
notation for the transaction concept. According to this formalization, a transaction Ti is a
partial ordering of its operations and the termination condition. The partial order P = {Σ, α}
defines an ordering among the elements of Σ (called the domain), where Σ consists of the
read and write operations and the termination condition (abort, commit) of the transaction Ti,
and α indicates the execution order of these operations within Ti.
In serializability, the ordering of read and write operations is important. In this context, it is necessary to introduce the concept of conflicting transactions. If two transactions perform read or write operations on different data items, then they are not conflicting, and hence the execution order is not important. If two transactions perform read operations on the same data item, then they are non-conflicting. Two transactions are said to be conflicting if they access
the same data item concurrently, and at least one transaction performs a write operation.
Thus, concurrent read-read operation by two transactions is non-conflicting, but concurrent
read-write, write-read and write-write operations by two transactions are conflicting. The
execution order is important if there is a conflicting operation caused by two transactions.
If concurrent accesses of data items are allowed in a database environment, then the schedule
may be non-serial, which indicates that the consistency of data may be violated. To allow
maximum concurrency and to preserve the consistency of data, it is sometimes possible to
generate a schedule from a non-serial schedule that is equivalent to a serial schedule or serial
execution order by swapping the order of the non-conflicting operations. If schedule S1 is
derived from a non-serial schedule S2 by swapping the execution order of non-conflicting
operations in S2, and if S1 produces the same output as that of a serial schedule S3 with the
same set of transactions, then S1 is called a serializable schedule. This type of serializability is known as conflict serializability, and schedule S1 is said to be conflict equivalent to schedule S2. A conflict serializable schedule orders any conflicting operations in the same
way as a serial execution. A directed precedence graph or a serialization graph can be
produced to test for conflict serializability.
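As a small illustration, the following Python sketch (the schedule encoding and helper names are assumptions, not from the text) builds a precedence graph by adding an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj, and then tests the graph for a cycle:

from collections import defaultdict

def precedence_graph(schedule):
    """schedule: list of (txn, op, item) with op in {'R', 'W'}, in execution order."""
    edges = defaultdict(set)
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # Conflicting pair: same data item, different transactions, at least one write.
            if x == y and ti != tj and 'W' in (op_i, op_j):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the precedence graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in edges[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))

# A non-serializable interleaving: T1 and T2 each read x before the other writes it.
S = [('T1', 'R', 'x'), ('T2', 'R', 'x'), ('T1', 'W', 'x'), ('T2', 'W', 'x')]
print(has_cycle(precedence_graph(S)))  # True -> not conflict serializable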
Formally, distributed serializability can be defined as follows. Consider S to be the union of all local serializable schedules S1, S2, . . ., Sn in a distributed system. The global schedule S is said to be distributed serializable if, for each pair of conflicting operations Oi and Oj from distinct transactions Ti and Tj (possibly originating at different sites), Oi precedes Oj in the total ordering S if and only if Ti precedes Tj in all of the local schedules in which they appear together. To attain serializability, a DDBMS must incorporate synchronization techniques that control the relative ordering of conflicting operations.
Example 8.5.
Consider the following transaction Ti:
Read(a);
Read(b);
a := a + b;
Write(a);
Commit;
The partial ordering of the above transaction Ti can be formally represented as P = {Σ, α}, where Σ = {R(a), R(b), W(a), C} and α = {(R(a), W(a)), (R(b), W(a)), (W(a), C), (R(a), C), (R(b), C)}.
Here, the ordering relation specifies the relative ordering of all operations with respect to the termination condition. The partial ordering of a transaction makes it possible to derive the corresponding directed acyclic graph (DAG) for the transaction. The DAG of the above transaction Ti is illustrated in figure 8.1.
In a distributed database system, the lock manager or scheduler is responsible for managing
locks for different transactions that are running on that system. When any transaction
requires read or write lock on data items, the transaction manager passes this request to the
lock manager. It is the responsibility of the lock manager to check whether that data item is
currently locked by another transaction or not. If the data item is locked by another
transaction and the existing locking mode is incompatible with the lock requested by the
current transaction, the lock manager does not allow the current transaction to obtain the
lock; hence, the current transaction is delayed until the existing lock is released. Otherwise,
the lock manager permits the current transaction to obtain the desired lock and the
information is passed to the transaction manager. In addition to these rules, some systems initially allow the current transaction to acquire a read lock on a data item, if that is compatible with the existing lock, and later convert the lock into a write lock. This is called upgradation of a lock, and it increases the level of concurrency. Similarly, to allow maximum concurrency, some systems permit the current transaction to acquire a write lock on a data item and later convert it into a read lock; this is called downgradation of a lock.
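A minimal single-site sketch of such a lock manager (lock modes reduced to shared 'S' and exclusive 'X'; class and method names are illustrative) that grants, queues, and upgrades locks:

from collections import defaultdict

COMPATIBLE = {('S', 'S'): True, ('S', 'X'): False,
              ('X', 'S'): False, ('X', 'X'): False}

class LockManager:
    def __init__(self):
        self.held = defaultdict(dict)      # item -> {txn: mode}
        self.waiting = defaultdict(list)   # item -> [(txn, mode), ...]

    def request(self, txn, item, mode):
        holders = self.held[item]
        if all(COMPATIBLE[(m, mode)] for t, m in holders.items() if t != txn):
            self.held[item][txn] = mode          # grant (or upgrade) the lock
            return True
        self.waiting[item].append((txn, mode))   # delay the transaction
        return False

    def release(self, txn, item):
        self.held[item].pop(txn, None)
        # Re-try waiting requests in FIFO order; incompatible ones are re-queued.
        pending, self.waiting[item] = self.waiting[item], []
        for t, m in pending:
            self.request(t, item, m)

lm = LockManager()
print(lm.request('T1', 'a', 'S'))  # True  -- shared lock granted
print(lm.request('T2', 'a', 'S'))  # True  -- shared locks are compatible
print(lm.request('T2', 'a', 'X'))  # False -- upgrade blocked while T1 holds 'S'
lm.release('T1', 'a')              # T2's queued upgrade is now granted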
Consider the following two transactions Ti and Tj:
Ti: Read(a); a := a + 2; Write(a); Read(b); b := b * 2; Write(b); Commit;
Tj: Read(a); a := a * 10; Write(a); Read(b); b := b − 5; Write(b); Commit;
In the above transactions, assume that transaction Ti acquires a lock on data item a initially.
The transaction Ti releases the lock on data item a as soon as the associated database
operation (read or write) is executed. However, the transaction Ti may acquire a lock on
another data item, say b, after releasing the lock on data item a. Although it seems that
maximum concurrency is allowed here and it is beneficial for the system, it permits
transaction Ti to interfere with other transactions, say Tj, which results in the loss of total
isolation and atomicity. Therefore, to guarantee serializability, the locking and releasing
operations on data items by transactions need to be synchronized. The most well-known
solution is two-phase locking (2PL) protocol.
The 2PL protocol simply states that no transaction should acquire a lock after it releases one
of its locks. According to this protocol, the lifetime of each transaction is divided into two
phases: Growing phase and Shrinking phase. In the growing phase, a transaction can
obtain locks on data items and can access data items, but it cannot release any locks. On the
other hand, in the shrinking phase a transaction can release locks but cannot acquire any new
locks after that. Thus, the end of the growing phase of a transaction marks the beginning of its shrinking phase. It is not necessary for each transaction to acquire all its locks simultaneously and then start processing. Normally, each transaction obtains some locks initially, does some processing, and then requests the additional locks that are required.
However, it never releases any lock until it has reached a stage where no more locks are
required. If upgradation and downgradation of locks are allowed, then upgradation of locks
can take place only in the growing phase, whereas downgradation of locks can occur in the
shrinking phase.
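The two-phase rule itself can be enforced with a small amount of per-transaction bookkeeping, as in the following sketch (names are illustrative; the actual granting of locks is assumed to be handled by a lock manager such as the one sketched above):

class TwoPhaseLockingViolation(Exception):
    pass

class Transaction2PL:
    """Tracks the growing/shrinking phases of a single transaction under 2PL."""
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.shrinking = False      # becomes True after the first lock release
        self.locks = set()

    def lock(self, item, mode):
        if self.shrinking:
            # The 2PL rule: no lock may be acquired after any lock has been released.
            raise TwoPhaseLockingViolation(
                f"{self.txn_id} tried to lock {item} in its shrinking phase")
        self.locks.add((item, mode))

    def unlock(self, item, mode):
        self.shrinking = True       # entering the shrinking phase
        self.locks.discard((item, mode))

t = Transaction2PL("Ti")
t.lock("a", "X")
t.unlock("a", "X")
try:
    t.lock("b", "X")                # violates 2PL: Ti has already released a lock
except TwoPhaseLockingViolation as e:
    print(e)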
When a transaction releases a lock on a data item, the lock manager allows another
transaction waiting for that data item to obtain the lock. However, if the former transaction is
aborted and rolled back, then the latter transaction is forced to be aborted and rolled back to
maintain the consistency of data. Thus, the rollback of a single transaction leads to a series of
rollbacks, called cascading rollback. One solution to the cascading rollback problem is to
implement 2PL in such a way that a transaction can release locks only when it has reached
the end of the transaction or when the transaction is committed. This is called rigorous
2PL protocol. To avoid cascading rollback, another possible variation in 2PL is strict 2PL
protocol, in which only write locks are held until the end of the transaction. In a distributed
database system, the 2PL protocol is further classified into several categories such as
centralized 2PL, primary copy 2PL, distributed 2PL and majority locking depending on how
the lock manager activities are managed.
Centralized 2PL
In the centralized 2PL method, the lock manager or scheduler is a central component, and
hence a single site is responsible for managing all activities of the lock manager for the entire
distributed system. Before accessing any data item at any site, appropriate locks must be
obtained from the central lock manager. If a global transaction Ti is initiated at site Si of the
distributed system, then the centralized 2PL for that global transaction should be
implemented in the following way.
The transaction manager at the site Si (called transaction coordinator) partitions the global
transaction into several subtransactions using the information stored in the global system
catalog. Thus, the transaction coordinator is responsible for maintaining consistency of data.
If the data items are replicated, then it is also the responsibility of the transaction coordinator
to ensure that all replicas of the data item are updated. Therefore, for a write operation, the
transaction coordinator requests exclusive locks on all replicas of the data item that are
stored at different sites. However, for a read operation, the transaction coordinator can select
any one of the replicas of the data item that are available, preferably at its own site. The local
transaction managers of the participating sites request and release locks from/to the
centralized lock manager following the normal rules of the 2PL protocol. Participating sites
are those sites that are involved in the execution of the global transaction (subtransactions).
After receiving the lock request, the centralized lock manager checks whether that lock
request is compatible with the existing locking status or not. If it is compatible, the lock
manager sends a message to the corresponding site that the lock has been granted; otherwise,
the lock request is put in a queue until it can be granted.
In some systems, a slight variation of this method is followed. Here, the transaction coordinator sends lock requests to the centralized lock manager on behalf of the participating sites. In this case, a global update operation that involves n sites requires a minimum of 2n + 3 messages to implement the centralized 2PL method: 1 lock request from the transaction coordinator, 1 lock grant message from the centralized lock manager, n update messages from the transaction coordinator, n acknowledgement messages from the n participating sites and 1 unlock request from the transaction coordinator, as illustrated in figure 8.3.
The main advantage of the centralized 2PL method in DDBMSs is that it is a straightforward
extension of the 2PL protocol for centralized DBMSs; thus, it is less complicated and
implementation is relatively easy. In this case, the deadlock detection can be handled easily,
because the centralized lock manager maintains all the locking information. Hence, the
communication cost also is relatively low. In the case of DDBMSs, the disadvantages of
implementing the centralized 2PL method are system bottlenecks and lower reliability. For
example, as all lock requests go to the centralized lock manager, that central site may
become a bottleneck to the system. Similarly, the failure of the central site causes a major
failure of the system; thus, it is less reliable. This method also hampers the local autonomy.
Primary Copy 2PL
Whenever a transaction is initiated at a site, the transaction coordinator will determine where
the primary copy is located, and then it will send lock requests to the appropriate lock
manager. If the transaction requires an update operation of a data item, it is necessary to
exclusively lock the primary copy of that data item. Once the primary copy of a data item is
updated, the change is propagated to the slave copies immediately to prevent other
transactions reading the old value of the data item. However, it is not mandatory to carry out
the updates on all copies of the data item as an atomic operation. The primary copy 2PL
method ensures that the primary copy is always updated.
The main advantage of the primary copy 2PL method is that it overcomes the bottleneck of the centralized 2PL approach and also increases reliability. This approach incurs lower communication costs because less remote locking is required. The major
disadvantage of this method is that it is only suitable for those systems where the data items
are selectively replicated, updates are infrequent and sites do not always require the latest
version of the data. In this approach, deadlock handling is more complex, as the locking
information is distributed among multiple lock managers. In the primary copy 2PL method
there is still a degree of centralization in the system, as the primary copy is only handled by
one site (primary site). This latter disadvantage can be partially overcome by nominating
backup sites to hold locking information.
Distributed 2PL
Distributed 2PL method implements the lock manager at each site of a distributed system.
Thus, the lock manager at each site is responsible for managing locks on data items that are
stored locally. If the database is not replicated, then distributed 2PL becomes the same as
primary copy 2PL. If the database is replicated, then distributed 2PL implements a read-one-
write-all (ROWA) replica control protocol. In ROWA replica control protocol, any copy of
a replicated data item can be used for a read operation, but for a write operation, all copies of
the replicated data item must be exclusively locked before performing the update operation.
In distributed 2PL, when a global transaction is initiated at a particular site, the transaction
coordinator (the transaction manager of that site is called transaction coordinator) sends lock
requests to lock managers of all participating sites. In response to the lock requests the lock
manager at each participating site can send a lock granted message to the transaction
coordinator. However, the transaction coordinator does not wait for the lock granted message
in some implementations of the distributed 2PL method. The operations that are to be
performed at a participating site are passed by the lock manager of that site to the
corresponding transaction manager instead of the transaction coordinator. At the end of the
operation the transaction manager at each participating site can send the corresponding
message to the transaction coordinator. In an alternative approach, the transaction manager at
a participating site can also pass the “end of operation” message to its own lock manager,
who can then release the locks and inform the transaction coordinator. The communication
between the participating sites and the transaction coordinator when executing a global
transaction using distributed 2PL is depicted in figure 8.4.
In distributed 2PL method, locks are handled in a decentralized manner; thus, it overcomes
the drawbacks of centralized 2PL method. This protocol ensures better reliability than
centralized 2PL and primary copy 2PL methods. However, the main disadvantage of this
approach is that deadlock handling is more complex owing to the presence of multiple lock
managers. Moreover, the communication cost is much higher than in primary copy 2PL, as all replicas of a data item must be locked before an update operation. In the distributed 2PL method, a global
update operation that involves n participating sites may require a minimum of 5n messages.
These are n lock requests, n lock granted messages, n update messages, n acknowledgement
messages and n unlock requests.
Majority Locking Protocol
When a transaction needs to read or write a data item that is replicated at m sites, the
transaction coordinator must send a lock request to more than half of the m sites where the
data item is stored. Each lock manager determines whether the lock can be granted
immediately or not. If the lock request is not compatible with the existing lock status, the
response is delayed until the request is granted. The transaction coordinator waits for a
certain timeout period to receive lock granted messages from the lock managers, and if the
response is not received within that time period, it cancels its request and informs all sites
about the cancellation of the lock request. The transaction cannot proceed until the
transaction coordinator obtains locks on a majority of copies of the data item. In majority
locking protocol, any number of transactions can simultaneously hold a shared lock on a
majority of copies of a data item, but only one transaction can acquire an exclusive lock on a
majority of copies. However, the execution of a transaction can proceed in the same manner
as in distributed 2PL method after obtaining locks on a majority of copies of the data items
that are required by the transaction.
This protocol is an extension of the distributed 2PL method, and hence it avoids the
bottlenecks of centralized 2PL method. The main disadvantage of this protocol is that it is
more complicated and hence the deadlock handling also is more complex. In this method,
locking a data item that has m copies requires at least (m + 1)/2 messages for lock requests and (m + 1)/2 messages for unlock requests; thus, the communication cost is higher than that of the centralized 2PL and primary copy 2PL methods.
Biased Protocol
The biased protocol is another replica control approach. This approach assigns more
preference to requests for shared locks on data items than to requests for exclusive locks.
Like in majority locking protocol, in biased protocol a lock manager is implemented at each
site to manage locks for all data items stored at that site. The implementation of the biased
protocol in a distributed environment is described below.
In this protocol, when a transaction needs to read a data item that is replicated at m sites,
the transaction coordinator simply sends a lock request to the lock manager of one site that
contains a replica of that data item. Similarly, when a transaction needs to lock a data item in
exclusive mode, the transaction coordinator sends lock requests to lock managers at all sites
that contain a replica of that data item. As before, if the lock request is not compatible with
the existing lock status, the response is delayed until the request is granted.
The main advantage of the biased protocol is that it imposes less overhead on read operations
than the majority locking protocol. This scheme is beneficial for common transactions where
the frequency of read operations is much higher than the frequency of write operations.
However, this approach incurs the same additional overhead on write operations as the distributed 2PL protocol. In this case, deadlock handling is also as complicated as in the majority locking protocol.
Quorum Consensus Protocol
The quorum consensus protocol is another replica control approach in which a weight is assigned to each site that holds a replica of a data item, and each data item P is assigned a read quorum Pr and a write quorum Pw. The read and write quorums must satisfy the following two conditions:
• Pr + Pw > T
• 2 * Pw > T
where T is the total weight of all sites of the distributed system where the data item P resides.
To perform a read operation on the data item P, a sufficient number of replicas of the data item P stored at different sites must be read so that their total weight is greater than or equal to Pr. On the other hand, to perform a write operation on the data item P, a sufficient number of copies of the data item P stored at different sites must be written so that their total weight is greater than or equal to Pw. In both cases, the protocol ensures that locks are acquired on a
majority of the replicas of the data item P stored at different sites before executing a read or
a write operation on P. Hence, the transaction coordinator would send lock requests to
different lock managers accordingly.
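A small sketch of the quorum test a transaction coordinator could apply before sending lock requests (the site weights and quorum values below are chosen purely for illustration):

# Hypothetical configuration: weight of each site's replica of the data item P.
weights = {"S1": 1, "S2": 1, "S3": 2, "S4": 1}
T = sum(weights.values())     # total weight = 5
Pr, Pw = 2, 4                 # read and write quorums assigned to P

# The assigned quorums must satisfy Pr + Pw > T and 2 * Pw > T.
assert Pr + Pw > T and 2 * Pw > T

def quorum_reached(sites, quorum):
    """True if the replicas at the given sites carry enough weight to form the quorum."""
    return sum(weights[s] for s in sites) >= quorum

print(quorum_reached({"S1"}, Pr))             # False: weight 1 is below the read quorum
print(quorum_reached({"S3"}, Pr))             # True: the heavier site alone suffices
print(quorum_reached({"S1", "S2", "S3"}, Pw)) # True: total weight 4 meets the write quorum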
The major benefit of the quorum consensus protocol is that it allows the cost of read and write operations to be reduced selectively by appropriately assigning the read and write quorums to data items. For example, a read operation on a data item with a small read quorum requires reading fewer replicas of the data item. However, the write quorum of that data item may be higher, and hence, to succeed, a write operation needs to obtain locks on more of the available replicas of the data item. Furthermore, if higher weights are assigned to some sites, fewer sites need to be accessed to acquire locks on a data item. The quorum consensus protocol can simulate the majority locking protocol and the biased protocol if appropriate weights and quorums are assigned. The disadvantages of the majority locking protocol, such as the complexity of deadlock handling, are shared by this protocol as well.
There are two primary methods for generating unique timestamp values in a distributed
environment: centralized and distributed. In the centralized approach, a single site is
responsible for assigning unique timestamp values to all transactions that are generated at
different sites of the distributed system. The central site can use a logical counter or its own
system clock for assigning timestamp values. The centralized approach is very simple, but
the major disadvantage of this approach is that it uses a central component for generating
timestamp values, which is vulnerable to system failures. In the distributed approach, each
site generates unique local timestamp values by using either a logical counter or its own
system clock. To generate unique global timestamp values, unique local timestamp values
are concatenated with site identifiers and the pair <local timestamp value, site identifier>
represents a unique global timestamp value across the whole system. The site identifier is
appended in the least significant position to ensure that the global timestamp value generated
in one site is not always greater than those generated in another site. This approach still has a problem: the system clocks at different sites may not be synchronized, and one site may generate local timestamp values at a faster rate than other sites. If logical counters are
used at each site to generate local timestamp values, there is the possibility of different sites
generating the same value on the counter. The general approach that is used to generate
unique global timestamp values in a distributed system is discussed in the following.
In this approach, each site generates unique local timestamp values based on its own local
logical counter. To maintain uniqueness, each site appends its own identifier to the local
counter value, in the least significant position. Thus, the global timestamp value is a two-
tuple of the form <local counter value, site identifier>. To ensure the synchronization of all
logical counters in the distributed system, each site includes its timestamp value in inter-site
messages. After receiving a message, each site compares its timestamp value with the
timestamp value in the message and if its timestamp value is smaller, then the site changes
its own timestamp to a value that is greater than the message timestamp value. For instance,
if site1 sends a message to site2 with its current timestamp value <8, 1>, where the current
timestamp value of site2 is <10, 2>, then site2 would not change its timestamp value. On the
other hand, if the current timestamp value of site2 is <6, 2>, then site2 would change its
current timestamp value to <11, 2>.
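A minimal sketch of this scheme (the counter update here uses the smallest value that satisfies the rule; class and method names are illustrative):

class TimestampGenerator:
    """Generates globally unique timestamps of the form <counter, site_id> and keeps
    the local logical counter loosely synchronized via incoming message timestamps."""
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = 0

    def next_timestamp(self):
        self.counter += 1
        # The site identifier occupies the least significant position, so timestamps
        # generated at different sites can never be equal.
        return (self.counter, self.site_id)

    def on_message(self, msg_timestamp):
        msg_counter, _ = msg_timestamp
        if self.counter < msg_counter:
            # Any value greater than the message counter satisfies the rule;
            # the minimal choice is msg_counter + 1.
            self.counter = msg_counter + 1

site2 = TimestampGenerator(site_id=2)
site2.counter = 6
site2.on_message((8, 1))   # the local counter (6) is smaller, so it is advanced past 8
print(site2.counter)       # 9 (any value greater than 8, such as 11 in the text, is valid)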
In the basic TO algorithm, each transaction is assigned a unique timestamp value by its
transaction manager. A timestamp ordering scheduler (which is a software module) is
implemented to check the timestamp of each new operation against those of conflicting
operations that have already been scheduled. The basic TO implementation distributes the
schedulers along with the database. At each site, if the timestamp value of a new operation is
found greater than the timestamp values of all conflicting operations that have already been
scheduled, then it is accepted; otherwise, the corresponding transaction manager assigns a
new timestamp value to the new operation, and the entire transaction is restarted. A
timestamp ordering scheduler guarantees that transactions are conflict serializable and that
the results are equivalent to a serial schedule in which the transactions are executed
chronologically, that is, in the order of the timestamp values. The comparison between the
timestamp values of different transactions can be done only when the timestamp ordering
scheduler has received all the operations to be scheduled. In a distributed database
environment, it may happen that operations come to the scheduler one at a time, and an
operation may reach out of sequence. To overcome this, each data item is assigned two
timestamp values: one is a read_timestamp value, which is the timestamp value of the
last transaction that had read the data item, and the other is a write_timestamp value, which
is the timestamp value of the last transaction that had updated the data item. The
implementation of the basic TO algorithm for a transaction T in a DDBMS is described
below.
The coordinating transaction manager assigns a timestamp value to the transaction T, say
ts(T), determines the sites where each data item is stored, and sends the relevant operations
to these sites. Let us consider that the transaction T requires to perform a read operation on a
data item x located at a remote site. The data item x has a read_timestamp value and a write_timestamp value, denoted by Rx and Wx respectively. The read operation on the data item x proceeds only if ts(T) > Wx, and a new value, ts(T), is assigned to the read_timestamp of the data item x; otherwise, the operation is rejected and a corresponding message is sent to the coordinating transaction manager. In the latter case, that is, if ts(T) < Wx, the older transaction T is trying to read a value of the data item x that has already been updated by a younger transaction. The older transaction T is too late to read the previous outdated value of the data item x, and the transaction T is aborted. Similarly, the transaction T can perform a write operation on the data item x only when the conditions ts(T) > Rx and ts(T) > Wx are satisfied. When an operation (read or write) of a transaction is
rejected by the timestamp ordering scheduler, the transaction should be restarted with a new
timestamp value, and it is assigned by the corresponding transaction manager.
To maintain the consistency of data, the data processor must execute the accepted operations
in the order in which the scheduler passes them. The timestamp ordering scheduler maintains
a queue for each data item to enforce the ordering, and delays the transfer of the next
accepted operation on the same data item until an acknowledgement is received from the
data processor regarding the completion of the previous operation.
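A simplified sketch of the basic TO acceptance test for a single data item (restart handling and the per-item operation queue are omitted; names are illustrative):

class DataItem:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0    # timestamp of the youngest transaction that has read the item
        self.write_ts = 0   # timestamp of the youngest transaction that has written the item

def to_read(item, ts):
    """Basic TO rule for a read by a transaction with timestamp ts."""
    if ts <= item.write_ts:
        return "reject"                       # the item was written by a younger transaction
    item.read_ts = max(item.read_ts, ts)      # record the largest reader timestamp
    return item.value

def to_write(item, ts, new_value):
    """Basic TO rule for a write by a transaction with timestamp ts."""
    if ts <= item.read_ts or ts <= item.write_ts:
        return "reject"                       # a younger transaction already read or wrote it
    item.value, item.write_ts = new_value, ts
    return "accept"

x = DataItem()
print(to_write(x, ts=5, new_value=10))   # accept
print(to_read(x, ts=3))                  # reject: ts(T) = 3 is not greater than Wx = 5
print(to_read(x, ts=7))                  # 10, and Rx becomes 7
print(to_write(x, ts=6, new_value=20))   # reject: ts(T) = 6 is not greater than Rx = 7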
The major advantage of the basic TO algorithm is that deadlock never occurs, because a transaction that is rejected by the timestamp ordering scheduler is simply restarted with a new timestamp value. However, in this approach a transaction may have to be restarted numerous times; this is the price paid for deadlock freedom.
Conservative TO Algorithm
The Conservative TO algorithm is a technique for eliminating transaction restarts during
timestamp ordering scheduling and thereby reducing the system overhead. The conservative
TO algorithm differs from the basic TO algorithm in the way the operations are executed in
each method. In basic TO algorithm, when an operation is accepted by the timestamp
ordering scheduler, it passes the operation to the data processor for execution as soon as
possible. In the case of conservative TO algorithm, when an operation is received by the
timestamp ordering scheduler, the scheduler delays the execution of the operation until it is
sure that no future restarts are possible. Thus, the execution of operations is delayed to
ensure that no operation with a smaller timestamp value can arrive at the timestamp ordering
scheduler. The scheduler will never reject an operation if the satisfaction of the above
condition is assured. Thus, the operations of each transaction are buffered here until an
ordering can be established, so that rejections are not possible, and they are executed in that
order. The conservative TO algorithm is implemented in the following way.
In conservative TO algorithm, each scheduler maintains one queue for each transaction
manager in the distributed system. Whenever an operation is received by a timestamp
ordering scheduler, it buffers the operation according to the increasing order of timestamp
values in the appropriate queue for the corresponding transaction manager. For instance, the
timestamp ordering scheduler at site p stores all the operations that it has received from the
transaction manager of site q in a queue Qpq. The scheduler at each site passes a read or write
operation with the smallest timestamp value from these queues to the data processor for
execution. The execution of the operation can proceed after establishing an ordering of the
buffered operations so that rejections cannot take place.
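A simplified sketch of this buffering scheme (one queue per transaction manager; null operations and transaction classes are not modeled, and the dispatch condition that every queue be non-empty corresponds to the stricter variant discussed below):

import heapq

class ConservativeScheduler:
    """Keeps one queue of buffered operations per transaction manager and only
    dispatches when every queue is non-empty, so no earlier operation can arrive later."""
    def __init__(self, site_ids):
        self.queues = {s: [] for s in site_ids}   # site -> min-heap of (ts, op)

    def receive(self, site, ts, op):
        heapq.heappush(self.queues[site], (ts, op))

    def dispatch(self):
        # Wait (return nothing) until at least one operation is buffered from every site.
        if any(not q for q in self.queues.values()):
            return None
        # Choose the buffered operation with the smallest timestamp overall.
        site = min(self.queues, key=lambda s: self.queues[s][0][0])
        return heapq.heappop(self.queues[site])

sched = ConservativeScheduler(["S1", "S2"])
sched.receive("S1", 4, "read(x)")
print(sched.dispatch())           # None: nothing buffered yet from S2
sched.receive("S2", 2, "write(y)")
print(sched.dispatch())           # (2, 'write(y)'): smallest timestamp dispatched first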
The above implementation of the conservative TO algorithm suffers from some major
problems. First, if some transaction managers never send any operations to a scheduler, the
scheduler will get stuck and stop outputting. To overcome this problem, it is necessary that
each transaction manager communicate regularly with every scheduler in the system, which
is infeasible in large networks. Another problem is that although conservative TO reduces
the number of restarts of the transactions, it is not guaranteed that they will be eliminated
completely. For instance, consider that the timestamp scheduler at site S1 has chosen an
operation with the smallest timestamp value, say t(x), and passed it to the data processor for
execution. It may happen that the site S2 has sent an operation to the scheduler of site S1
with a timestamp value less than t(x) that may still be in transit in the network. When this
operation reaches the site S1, it will be rejected, because it violates the TO rule. In this case,
the operation wants to access a data item that is already accessed by another operation with a
higher timestamp value.
One solution to the above problems is to ensure that each scheduler chooses an operation for
execution and passes it to the data processor only when there is at least one operation in each
queue. In the absence of real traffic, when a transaction manager does not have any
operations to execute, it will send a null operation to each timestamp scheduler in the
system periodically. Thus, the null operations are sent periodically to ensure that the
operations that will be sent to schedulers in future will have higher timestamp values than the
null operations. This approach is more restrictive and is called extremely conservative TO
algorithm.
Another solution to the above problem is transaction classes, which is less restrictive and
reduces communications compared to extremely conservative TO algorithm. In this
technique, transaction classes are defined based on readset and writeset of transactions,
which are known in advance. A transaction T is a member of a class C if readset(T) is a
subset of readset(C) and writeset(T) is a subset of writeset(C). However, it is not necessary
that transaction classes should be disjoint. The conservative TO algorithm has to maintain
one queue for each class instead of maintaining one queue for the transaction manager at
each site in the distributed system. With this modification, the timestamp scheduler chooses an operation of a transaction, with the smallest timestamp value, for execution, provided that there is at least one buffered operation in each class to which the transaction belongs.
Example 8.6.
Hence, T1 is a member of classes C1 and C2, T2 is a member of classes C2 and C3, and T3
is a member of class C3. Now, the conservative TO algorithm has to maintain one queue for
each class C1, C2, and C3 instead of maintaining one queue for each transaction manager at
each site in the distributed system. The timestamp scheduler will choose an operation of a transaction, with the smallest timestamp value, for execution, provided that there is at least one operation in each of the classes C1, C2 and C3 to which the transaction belongs.
Multi-version TO Algorithm
Multi-version TO algorithm is another timestamp-based protocol that reduces restart
overhead of transactions. In multi-version TO algorithm, the updates of data items do not
modify the database, but each write operation creates a new version of a data item while
retaining the old version. When a transaction requires reading a data item, the system selects
one of the versions that ensures serializability. Multi-version TO algorithm is implemented
in the following way.
In this method, for each data item P in the system, the database holds a number of versions P1, P2, P3, . . ., Pn. Furthermore, for each version of the data item P, the
system stores three values: the value of version Pi, read_timestamp(Pi), which is the largest
timestamp value of all transactions that have successfully read the version Pi, and
write_timestamp(Pi), which is the timestamp value of the transaction that created the
version Pi. The existence of versions of data items is transparent to users. The transaction
manager assigns a timestamp value to each transaction, which is used to keep track of the
timestamp values of each version. When a transaction T with timestamp value ts(T) attempts
to perform a read or write operation on the data item P, the multi-version TO algorithm uses
the following two rules to ensure serializability.
1. If the transaction T wants to read the data item P, a version Pj is chosen that has the
largest write_timestamp value that is less than ts(T), that is, write_timestamp(Pj) <
ts(T). In this case, the read operation is sent to the data processor for execution and the
value of read_timestamp(Pj) is reset as ts(T).
2. If the transaction T wishes to perform a write operation on the data item P, it must be ensured that the data item P has not already been read by some other transaction whose timestamp value is greater than ts(T). Thus, if Pj is the version of the data item P with the largest write_timestamp value that is less than ts(T), the write operation is permitted and sent to the data processor only if read_timestamp(Pj) < ts(T). If the operation is permitted, a new version of the data item P, say Pk, is created with read_timestamp(Pk) = write_timestamp(Pk) = ts(T). Otherwise, the transaction T is aborted and restarted with a new timestamp value.
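A compact sketch of these two rules for a single data item (version bookkeeping simplified; function names are illustrative):

class Version:
    def __init__(self, value, write_ts):
        self.value = value
        self.read_ts = write_ts     # largest timestamp that has read this version
        self.write_ts = write_ts    # timestamp of the transaction that created it

def latest_before(versions, ts):
    """The version with the largest write_timestamp value that is less than ts."""
    candidates = [v for v in versions if v.write_ts < ts]
    return max(candidates, key=lambda v: v.write_ts) if candidates else None

def mv_read(versions, ts):
    v = latest_before(versions, ts)
    v.read_ts = max(v.read_ts, ts)          # rule 1: record the read; reads are never rejected
    return v.value

def mv_write(versions, ts, value):
    v = latest_before(versions, ts)
    if v is not None and v.read_ts >= ts:   # rule 2: permitted only if read_timestamp(Pj) < ts(T)
        return "abort"
    versions.append(Version(value, ts))     # create a new version Pk with timestamp ts(T)
    return "accept"

P = [Version(value=0, write_ts=0)]          # initial version of the data item P
print(mv_read(P, ts=5))                     # 0, and the read_timestamp of this version becomes 5
print(mv_write(P, ts=3, value=9))           # abort: a transaction with timestamp 5 has read it
print(mv_write(P, ts=6, value=9))           # accept: a new version with timestamp 6 is created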
The multi-version TO algorithm requires more storage space to store the values of different
versions of data items with respect to time. To save storage space, older versions of data
items can be deleted if they are no longer required. The multi-version TO algorithm is a
suitable concurrency control mechanism for DBMSs that are designed to support
applications that inherently have a notion of versions of database objects such as engineering
databases and document databases.
Optimistic Concurrency Control
The optimistic concurrency control techniques can be implemented either based on locking
or based on timestamp ordering. The optimistic concurrency control technique for DDBMS
based on timestamp ordering [proposed by Ceri and Owicki] is implemented in the following
way.
Whenever a global transaction is generated at a site, the transaction manager of that site
(called the transaction coordinator) divides the transaction into several subtransactions, each
of which can execute at many sites. Each local execution can then proceed without delay and
follows the sequence of phases of optimistic concurrency control technique until it reaches
the validate phase. The transaction manager assigns a timestamp value to each transaction at
the beginning of the validate phase that is copied to all its subtransactions. Hence, no
timestamp values are assigned to data items and a timestamp value is assigned to each
transaction only at the beginning of the validate phase, because its early assignment may
cause unnecessary transaction restarts. Let us consider that a transaction Tp is divided into a
number of subtransactions, and a subtransaction of Tp that executes at site q is denoted by Tpq.
To perform the local validation in the validate phase, the following three rules are followed:
1. For each transaction Tr, if ts(Tr) < ts(Tpq) (that is, the timestamp value of each
transaction Tr is less than the timestamp value of the subtransaction Tpq) and if each
transaction Tr has finished its write phase before Tpq has started its read phase, the
validation succeeds.
2. If there is any transaction Tr such that ts(Tr) < ts(Tpq) that completes its write phase while Tpq is in its read phase, then the validation succeeds if none of the data items updated by the transaction Tr are read by the subtransaction Tpq, and the transaction Tr completes writing its updates into the database before the subtransaction Tpq starts writing. Thus, WS(Tr) ∩ RS(Tpq) = Φ, where WS(Tr) denotes the write set of the transaction Tr (the set of data items written by Tr) and RS(Tpq) denotes the read set of the subtransaction Tpq (the set of data items read by Tpq).
3. If there is any transaction Tr such that ts(Tr) < ts(Tpq) that completes its read phase before Tpq finishes its read phase, then the validation succeeds if the updates made by the transaction Tr do not affect the read phase or the write phase of the subtransaction Tpq. In this case, WS(Tr) ∩ RS(Tpq) = Φ and WS(Tr) ∩ WS(Tpq) = Φ.
In this algorithm, Rule 1 is used to ensure that the transactions are executed serially according to their timestamp ordering. Rule 2 indicates that the set of data items written by the older transaction is not read by the younger transaction, and that the older transaction completes its write phase before the younger transaction enters its validate phase. Rule 3 states that the updates on data items made by the older transaction do not affect the read phase or the write phase of the younger transaction, but, unlike Rule 2, it does not require the older transaction to finish writing its data items before the younger transaction starts writing.
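A simplified sketch of this validation test for one subtransaction (phase boundaries are represented by simple numeric markers and read/write sets by Python sets; the field names are illustrative):

def validate(tpq, others):
    """tpq and each element of others: dicts with keys 'ts', 'read_set', 'write_set',
    'read_start', 'read_end', 'write_end' (event-counter markers for the phase boundaries)."""
    for tr in others:
        if tr['ts'] >= tpq['ts']:
            continue                                   # only older transactions are checked
        if tr['write_end'] <= tpq['read_start']:
            continue                                   # rule 1: Tr finished before Tpq started reading
        if tr['write_end'] <= tpq['read_end'] and not (tr['write_set'] & tpq['read_set']):
            continue                                   # rule 2: Tr's writes miss Tpq's read set
        if tr['read_end'] <= tpq['read_end'] and \
           not (tr['write_set'] & tpq['read_set']) and \
           not (tr['write_set'] & tpq['write_set']):
            continue                                   # rule 3: disjoint read and write sets
        return False                                   # no rule applies: validation fails
    return True

tpq = {'ts': 10, 'read_set': {'x'}, 'write_set': {'x'},
       'read_start': 5, 'read_end': 8, 'write_end': 9}
tr = {'ts': 7, 'read_set': {'y'}, 'write_set': {'y'},
      'read_start': 4, 'read_end': 6, 'write_end': 7}
print(validate(tpq, [tr]))   # True: Tr's write set is disjoint from Tpq's read and write sets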
The major advantage of optimistic concurrency control algorithm is that a higher level of
concurrency is allowed here, since the transactions can proceed without imposed delays. This
approach is very efficient when there are few conflicts, but conflicts can result in the rollback
of individual transactions. However, there are no cascading rollbacks, because the rollback
involves only a local copy of the data. A disadvantage of optimistic concurrency control
algorithm is that it incurs higher storage cost. To validate a transaction, the optimistic
approach has to store the read and the write sets of several other terminated transactions.
Another disadvantage with this approach is starvation. It may happen that a long transaction
fails repeatedly in the validate phase during successive trials. This problem can be overcome
by allowing the transaction exclusive access to the data items after a specified number of
trials, but this reduces the level of concurrency to a single transaction.
Chapter Summary
• Concurrency control is the activity of coordinating concurrent accesses to a database
in a multi-user system, while preserving the consistency of the data. A number of
anomalies can occur owing to concurrent accesses on data items in a DDBMS, which
can lead to the inconsistency of the database. These are lost update anomaly,
uncommitted dependency or dirty read anomaly, inconsistent analysis anomaly,
non-repeatable or fuzzy read anomaly, phantom read anomaly and multiple-copy
consistency problem. To prevent data inconsistency, it is essential to guarantee
distributed serializability of concurrent transactions in a DDBMS.
• A distributed schedule or global schedule (the union of all local schedules) is said to be distributed serializable if the execution order at each local site is serializable and the local serialization orders are identical.
• The concurrency control techniques are classified into two broad
categories: Pessimistic concurrency control mechanisms and Optimistic
concurrency control mechanisms.
• Pessimistic algorithms synchronize the concurrent execution of transactions early in
their execution life cycle.
• Optimistic algorithms delay the synchronization of transactions until the transactions
are near to their completion. The pessimistic concurrency control algorithms are
further classified into locking-based algorithms, timestamp-based
algorithms and hybrid algorithms.
• In locking-based concurrency control algorithms, the synchronization of transactions
is achieved by employing physical or logical locks on some portion of the database or
on the entire database.
• In the case of timestamp-based concurrency control algorithms, the synchronization of
transactions is achieved by assigning timestamps both to the transactions and to the
data items that are stored in the database.
• To allow maximum concurrency and to improve efficiency, some locking-based concurrency control algorithms also involve timestamp ordering; these are called hybrid concurrency control algorithms.
Chapter 9. Distributed Deadlock Management
This chapter introduces different deadlock management techniques to handle deadlock
situations in a distributed database system. Distributed deadlock prevention, distributed
deadlock avoidance, and distributed deadlock detection and recovery methods are briefly
discussed in this chapter.
The outline of this chapter is as follows. Section 9.1 addresses the deadlock problem, and the
deadlock prevention methods for distributed systems are discussed in Section 9.2. In Section
9.3, distributed deadlock avoidance is represented. The techniques for distributed deadlock
detection and recovery are focused on in Section 9.4.
Introduction to Deadlock
In a database environment, a deadlock is a situation when transactions are endlessly waiting
for one another. Any lock-based concurrency control algorithm and some timestamp-based
concurrency control algorithms may result in deadlocks, as these algorithms require
transactions to wait for one another. In lock-based concurrency control algorithms, locks on
data items are acquired in a mutually exclusive manner; thus, it may cause a deadlock
situation. Whenever a deadlock situation arises in a database environment, outside
interference is required to continue with the normal execution of the system. Therefore, the
database systems require special procedures to resolve the deadlock situation.
Deadlock situations can be characterized by wait-for graphs, directed graphs that indicate
which transactions are waiting for which other transactions. In a wait-for graph, nodes of the
graph represent transactions and edges of the graph represent the waiting-for relationships
among transactions. An edge is drawn in the wait-for graph from transaction Ti to
transaction Tj, if the transaction Ti is waiting for a lock on a data item that is currently held by
the transaction Tj. Using wait-for graphs, it is very easy to detect whether a deadlock
situation has occurred in a database environment or not. There is a deadlock in the system if
and only if the corresponding wait-for graph contains a cycle.
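A minimal Python sketch of this cycle test is given below; the wait-for graph is represented as an adjacency list and the transaction names are illustrative only.

def has_cycle(wfg):
    """Return True if the wait-for graph contains a cycle, i.e., a deadlock."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, finished
    color = {t: WHITE for t in wfg}

    def dfs(t):
        color[t] = GREY
        for u in wfg.get(t, []):
            if color.get(u, WHITE) == GREY:            # back edge: a cycle is found
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wfg)

# Two transactions waiting for each other: T1 -> T2 and T2 -> T1.
wfg = {"T1": ["T2"], "T2": ["T1"]}
print(has_cycle(wfg))                     # True, so a deadlock has occurred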
Example 9.1.
Let us assume that in a database environment there are two
transactions T1 and T2 respectively. Further assume that currently the transaction T1 is holding
an exclusive lock on data item P, and the transaction T2 is holding an exclusive lock on data
item Q. Now, if the transaction T1 requires a write operation on data item Q, the
transaction T1 has to wait until the transaction T2 releases the lock on the data item Q.
However, in the meantime if the transaction T2 requires a read or a write operation on the
data item P, the transaction T2 also has to wait for the transaction T1. In this situation, both
the transactions T1 and T2 have to wait for each other indefinitely to release their respective
locks, and no transaction can proceed for execution; thus a deadlock situation arises.
One solution to the above problem is using the unique timestamp value of each transaction as
the priority of that transaction. One way to obtain a global timestamp value for every
transaction in a distributed system is assigning a unique local timestamp value to each
transaction by its local transaction manager and then appending the site identifier to the low-
order bits of this value. Thus, timestamp values are unique throughout the distributed system
and do not require that clocks at different sites are synchronized precisely. Based on
timestamp values, there are two different techniques for deadlock prevention: Wait-
die and Wound-Wait.
In both of the above methods a deadlock situation cannot arise, and it is always the younger
transaction that is restarted. The main difference between the two techniques is whether or not
they preempt an active transaction. In the wait-die method a transaction can only be restarted
when it requests access to a new data item for the first time. A transaction that has acquired
locks on all the required data items will never be aborted. In the wound-wait method, it is
possible that a transaction that is already holding a lock on a data item is aborted, because an
older transaction may request the same data item. The wait-die method penalizes older
transactions with waiting: older transactions wait for younger ones and tend to wait longer as
they get older, while it is the younger transactions that are aborted. On the other hand, the
wound-wait method favors older transactions more strongly, as an older transaction never
waits for a younger transaction but preempts (wounds) it. In the wait-die method, it is possible
that a younger transaction is aborted and restarted a number of times if the older transaction
holds the corresponding lock for a long time, while in the wound-wait method the younger
transaction is wounded and restarted at most once, because after restarting it simply waits for
the older transaction.
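To make the difference between the two rules concrete, the following minimal Python sketch shows the decision taken when a requesting transaction conflicts with the transaction holding a lock, assuming that a smaller timestamp means an older transaction; the function names are illustrative only.

def wait_die(ts_requester, ts_holder):
    """Non-preemptive rule: an older requester waits; a younger requester dies (restarts)."""
    return "wait" if ts_requester < ts_holder else "abort the requester (die)"

def wound_wait(ts_requester, ts_holder):
    """Preemptive rule: an older requester wounds (aborts) the holder; a younger requester waits."""
    return "abort the holder (wound)" if ts_requester < ts_holder else "wait"

# T1 (timestamp 5, older) conflicts with T2 (timestamp 9, younger):
print(wait_die(5, 9))      # 'wait': the older requester waits for the younger holder
print(wound_wait(5, 9))    # 'abort the holder (wound)': the younger holder is preempted
print(wait_die(9, 5))      # 'abort the requester (die)': the younger requester is restarted
print(wound_wait(9, 5))    # 'wait': the younger requester waits for the older holder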
The main disadvantage of a deadlock prevention scheme is that it may result in unnecessary
waiting and rollback. Furthermore, some deadlock prevention schemes may require more
sites in a distributed system to be involved in the execution of a transaction.
Deadlock avoidance methods are more suitable than deadlock prevention schemes for
database environments, but they require runtime support for deadlock management, which
adds runtime overheads in transaction execution. Further, in addition to requiring
predeclaration of locks, a principal disadvantage of this technique is that it forces locks to be
obtained sequentially, which tends to increase response time.
The corresponding global wait-for graph (GWFG) in figure 9.2 illustrates that a deadlock has
occurred in the distributed system, although no deadlock has occurred locally.
There are three different techniques for detecting deadlock situations in a distributed
system: Centralized deadlock detection, Hierarchical deadlock
detection and Distributed deadlock detection, which are described in the following.
To minimize communication cost, the DDC should only send the changes that have to be
made in the LWFGs to the lock managers. These changes represent the addition or removal
of edges in the LWFGs. The actual length of a period for global deadlock detection is a
system design decision, and it is a trade-off between the communication cost of the deadlock
detection process and the cost of detecting deadlocks late. The communication cost increases
if the length of the period is smaller, whereas deadlocks in the system remain undetected for
longer if the length of the deadlock detection period is larger.
The centralized deadlock detection approach is very simple, but it has several drawbacks.
This method is less reliable, as the failure of the central site makes the deadlock detection
impossible. The communication cost is very high in this case, as other sites in the distributed
system send their LWFGs to the central site. Another disadvantage of centralized deadlock
detection technique is that false detection of deadlocks can occur, for which the deadlock
recovery procedure may be initiated, although no deadlock has occurred. In this method,
unnecessary rollbacks and restarts of transactions may also result owing to phantom
deadlocks. [False deadlocks and phantom deadlocks are discussed in detail in Section 9.4.4.]
In figure 9.3, local deadlock detection is performed at leaf nodes site1, site2, site3, site4 and
site5. The deadlock detector at site6, namely DD12, is responsible for detecting any deadlock
situation involving its child nodes site1 and site2. Similarly, site3, site4 and site5 send their
LWFGs to site7 and the deadlock detector at site7 searches for any deadlocks involving its
adjacent child nodes. A global deadlock detector exists at the root of the tree that would
detect the occurrence of global deadlock situations in the entire distributed system. Here, the
global deadlock detector resides at site8 and would detect the deadlocks involving site6 and
site7.
The performance of the hierarchical deadlock detection approach depends on the hierarchical
organization of nodes in the system. This organization should reflect the network topology
and the pattern of access requests to different sites of the network. This approach is more
reliable than centralized deadlock detection as it reduces the dependence on a central site;
also it reduces the communication cost. However, its implementation is considerably more
complicated, particularly with the possibility of site and communication link failures. In this
approach, detection of false deadlocks can also occur.
In this approach, a LWFG is constructed for each site by the respective local deadlock
detectors. An additional external node is added to the LWFGs, as each site in the distributed
system receives the potential deadlock cycles from other sites. In the distributed deadlock
detection algorithm, the external node Tex is added to the LWFGs to indicate whether any
transaction from any remote site is waiting for a data item that is being held by a transaction
at the local site or whether any transaction from the local site is waiting for a data item that is
currently being used by any transaction at any remote site. For instance, an edge from the
node Ti to Tex exists in the LWFG, if the transaction Ti is waiting for a data item that is
already held by any transaction at any remote site. Similarly, an edge from the external
node Tex to Ti exists in the graph, if a transaction from a remote site is waiting to acquire a
data item that is currently being held by the transaction Ti at the local site. Thus, the local
detector checks for two things to determine a deadlock situation. If a LWFG contains a cycle
that does not involve the external node Tex, then it indicates that a deadlock has occurred
locally and it can be handled locally. On the other hand, a global deadlock potentially exists
if the LWFG contains a cycle involving the external node Tex. However, the existence of such
a cycle does not necessarily imply that there is a global deadlock, as the external
node Tex represents different agents.
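The following self-contained Python sketch illustrates this local test; the LWFG contents are hypothetical and the external node is simply labelled "Tex".

def cycle_exists(graph, nodes):
    """Depth-first test: is there a cycle that uses only the given nodes?"""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GREY
        for m in graph.get(n, []):
            if m not in color:                     # edge leaving the allowed node set
                continue
            if color[m] == GREY or (color[m] == WHITE and dfs(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

def classify(lwfg):
    local_nodes = [n for n in lwfg if n != "Tex"]
    if cycle_exists(lwfg, local_nodes):
        return "local deadlock: resolve at this site"
    if cycle_exists(lwfg, list(lwfg)):
        return "cycle through Tex: potential global deadlock, inform the other sites"
    return "no deadlock indication at this site"

# T1 waits for T2, T2 waits for a remote transaction (edge to Tex), and a remote
# transaction waits for T1 (edge from Tex).
lwfg = {"T1": ["T2"], "T2": ["Tex"], "Tex": ["T1"]}
print(classify(lwfg))      # 'cycle through Tex: potential global deadlock, ...'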
The LWFGs are merged so as to determine global deadlock situations. To avoid sites
transmitting their LWFGs to each other, a simple strategy is followed here. According to this
strategy, one timestamp value is allocated to each transaction and the following rule is
imposed: the site Si transmits its LWFG to the site Sk if a transaction, say Tk, at site Sk is
waiting for a data item that is currently being held by a transaction Ti at site Si and ts(Ti) <
ts(Tk). The site Sk then adds the received information to its own LWFG and checks for cycles
not involving the external node Tex in the extended graph. If
there is no cycle in the extended graph, the process continues until a cycle appears and it may
happen that the entire GWFG is constructed and no cycle is detected. In this case, it is
decided that there is no deadlock in the entire distributed system. On the other hand, if the
GWFG contains a cycle not involving the external node Tex, it is concluded that a deadlock
has occurred in the system. The distributed deadlock detection method is illustrated in figure
9.4.
The major benefit of the distributed deadlock detection algorithm is that it is potentially more
robust than the centralized or hierarchical methods. However, deadlock detection is more complicated
in this method, because no single site contains all the information that is necessary to detect a
global deadlock situation in the system, and therefore substantial inter-site communication is
required, which increases the communication overhead. Another disadvantage of distributed
deadlock detection approach is that it is more vulnerable to the occurrence of false deadlocks
than centralized or hierarchical methods.
False Deadlocks
To handle deadlock situations in distributed database systems, a number of messages are
transmitted among the sites. In particular, in both the centralized and the hierarchical
deadlock detection methods, LWFGs are propagated periodically to one or more deadlock
detector sites in the system. The periodic nature of this transmission process causes two
problems. First, the delay that is associated with the transmission of messages that is
necessary for deadlock detection can cause the detection of false deadlocks. For instance,
consider that at a particular time the deadlock detector has received the information that the
transaction Ti is waiting for the transaction Tj. Further assume that after some time the
transaction Tj releases the data item requested by the transaction Ti and then requests a data
item that is currently being held by the transaction Ti. If the deadlock detector receives the
information that the transaction Tj has requested for a data item that is held by the
transaction Ti before receiving the information that the transaction Ti is not blocked by the
transaction Tj any more, a false deadlock situation is detected.
Another problem is that a transaction Ti that blocks another transaction may be restarted for
reasons that are not related to the deadlock detection. In this case, until the restart message of
the transaction Ti is transmitted to the deadlock detector, the deadlock detector can find a
cycle in the wait-for graph that includes the transaction Ti. Hence, a deadlock situation is
detected by the deadlock detector, and this is called a phantom deadlock. When the
deadlock detector detects a phantom deadlock, it may unnecessarily restart a transaction
other than Ti. To avoid unnecessary restarts for phantom deadlocks, special safety measures
are required.
Chapter Summary
• In a database environment, a deadlock is a situation where transactions are waiting for
one another indefinitely. Deadlock situations can be represented by wait-for
graphs, directed graphs that indicate which transactions are waiting for which other
transactions. The handling of deadlock situations is more complicated in a distributed
DBMS than in a centralized DBMS, as it involves a number of sites. Three general
techniques are available for deadlock resolution in a distributed database
system: distributed deadlock prevention, distributed deadlock avoidance and
distributed deadlock detection and recovery from deadlock.
• Distributed Deadlock Prevention – Distributed deadlock prevention is a cautious
scheme in which a transaction is restarted when the distributed database system
suspects that a deadlock might occur. Deadlock prevention approach may be
preemptive or non-preemptive. Two different techniques are available for deadlock
prevention based on transaction timestamp values: Wait-die and Wound-Wait.
• Distributed Deadlock Avoidance – Distributed deadlock avoidance is an alternative
method in which each data item in the database system is numbered and each
transaction requests locks on these data items in that numeric order to avoid deadlock
situations.
• Distributed Deadlock Detection and Recovery – In distributed deadlock detection and
recovery method, first it is checked whether any deadlock situation has occurred in the
distributed system. After the detection of a deadlock situation in the system, a victim
transaction is chosen and aborted to resolve the deadlock situation. There are three
different techniques for detecting deadlock situations in the distributed
system: Centralized deadlock detection, Hierarchical deadlock detection and
Distributed deadlock detection.
• To handle the deadlock situations in distributed database systems, a number of
messages are transmitted periodically among the sites, which may cause two
problems: false deadlock and phantom deadlock.
Chapter 10. Distributed Recovery
Management
This chapter focuses on distributed recovery management. In the context of distributed
recovery management, two terms are closely related to each other: reliability and availability.
The concepts of reliability and availability are explained here. The chapter also
introduces different types of failures that may occur in a distributed DBMS. Different
distributed recovery protocols are also briefly presented, as well as the concepts of
checkpointing and cold restart.
The organization of this chapter is as follows. Section 10.1 introduces the concepts of
reliability and availability. In Section 10.2 different types of failures that may occur in a
distributed database environment are discussed. Section 10.3 presents the different steps
that should be followed after a failure has occurred. In Section 10.4, local recovery protocols
are described, and distributed recovery protocols are presented in Section 10.5. Network
partitioning is focused on in Section 10.6.
In the context of distributed recovery management, two terms are often used: reliability and
availability. Hence, it is necessary to differentiate between these two
terms. Reliability refers to the probability that the system under consideration does not
experience any failures in a given time period. In general, reliability is applicable for non-
repairable (or non-recoverable) systems. However, the reliability of a system can be
measured in several ways that are based on failures. To apply reliability to a distributed
database, it is convenient to split the reliability problem into two separate parts, as it is very
much associated with the specification of the desired behavior of the database: application-
dependent part and application-independent part. The application-independent specification
of reliability consists of requiring that transactions preserve their ACID properties. The
application-dependent part consists of requiring that transactions fulfill the system’s general
specifications, including the consistency constraints. The recovery manager only deals with
the application-independent specification of reliability.
Availability refers to the probability that the system can continue its normal execution
according to the specification at a given point in time in spite of failures. Failures may occur
prior to the given point of time, but if they have all been repaired by then, the system is
available to continue its normal operations at that point of time. Thus, availability
refers to both the resilience of the DBMS to various types of failures and its capability to
recover from them. In contrast with reliability, availability is applicable for repairable
systems. In distributed recovery management, reliability and availability are two related
aspects. It is accepted that it is easier to develop highly available systems, but it is very
difficult to develop highly reliable systems.
• Transaction failure–. There are two types of errors that may cause one or more
transactions to fail: application software errors and system errors. A transaction can
no longer continue owing to logical errors in the application software that is used for
database accessing, such as bad input, data not found, overflow or resource limit
exceeded; this is an application software error. A system error occurs when the system enters
an undesirable state (e.g., deadlock), resulting in the failure of one or more transactions.
• System crash–. A hardware malfunction, or a bug in the software (either in the
database software or in the operating system) may result in the loss of main memory,
which forces the processing of transactions to halt.
• Disk failure–. A disk block loses its contents as a result of either a head crash or an
unreadable media or a failure during a data transfer operation. This also leads the
processing of transactions to halt.
In a distributed DBMS, several different types of failures can occur in addition to the failures
listed above for a centralized DBMS. These are site failure, loss of messages, communication
link failure and network partition, as described in the following.
• Site failure–. In a distributed DBMS, system failures are typically referred to as site
failures, and they can occur owing to a variety of reasons as mentioned above. In this
context, it is necessary to mention that only system failures that result in the loss of the
contents of main memory are considered here. A site failure may be total or partial. A
partial failure indicates the failure of some sites in the distributed system while the
other sites remain operational. A total failure indicates the simultaneous failure of all
sites in a distributed system.
• Loss of messages–. The loss of messages, or improperly ordered messages, is the
responsibility of the underlying computer network protocol. It is assumed that these
types of failures are handled by the data communication component of the distributed
DBMS, so these are not considered here.
• Communication link failure–. The performance of a distributed DBMS is highly
dependent on the ability of all sites in the network to communicate reliably with one
another. Although the communication technology has improved significantly, link
failure can still occur in a distributed DBMS.
• Network partition–. Owing to communication link failure, it may happen that the
entire network gets split into two or more partitions, known as network partition. In
this case, all sites in the same partition can communicate with each other, but cannot
communicate with sites in the other partitions.
In some cases, it is very difficult to distinguish the type of failure that has occurred in the
system. Generally, the system identifies that a failure has occurred and initiates the
appropriate technique for recovery from the failure.
The recovery protocols are classified into two different categories based on the method of
executing the update operations: in-place updating and out-place updating. In in-place
updating method, the update operation changes the values of data items in the stable storage,
thereby losing the old values of data items. In the case of out-place updating method, the
update operation does not change the values of data items in the stable storage, but stores the
new values separately. These updated values are integrated into the stable storage
periodically. The immediate modification technique is a log-based in-place updating
recovery protocol. Shadow paging is a non-log-based out-place updating recovery protocol.
One in-place updating recovery protocol and one out-place updating recovery protocol are
described in the next sections.
The database log is used to protect against system failures in the following way.
To facilitate the recovery procedure, it is essential that the log buffers are written into the
stable storage before the corresponding write to the permanent database. This is known
as write-ahead log (WAL) protocol. If updates were made to the database before the log
record was written and if a failure occurred, the recovery manager would have no way of
undoing the operations.
Under WAL protocol, using immediate modification technique, the recovery manager
preserves the consistency of data in the following way, when failures occur.
• Any transaction that has both a transaction start and a transaction commit information
in the log must be redone. The redo procedure performs all the writes to the database
using the updated values of data items in the log records for the transaction, in the
order in which they were written to the log. It is to be noted that redoing updates whose
values have already been written into the database has no harmful effect, even though the
rewriting is unnecessary, because the redo operation is idempotent.
• Any transaction that has a transaction start record in the log but has no transaction
commit record has to be undone. Using the old values of data items, the recovery
manager preserves the database in the consistent state that existed before the start of
the transaction. The undo operations are performed in the reverse of the order in which the
corresponding log records were written to the log.
Example 10.1.
Let us consider a banking transaction T1 that transfers $1,000 from account A to account B,
where the balance of account A is $5,000 and the balance of account B is $2,500 before the
transaction starts. Using immediate modification recovery technique, the following
information will be written into the database log and into the permanent database to recover
from a failure.
<T1, start>
<T1, A, 5000, 4000>
<T1, B, 2500, 3500>
<T1, commit>
If the transaction T1 fails in between the log records <T1, start> and <T1, commit>, then it
should be undone. On the other hand, if the database log contains both the records <T1,
start> and <T1, commit>, the transaction should be redone.
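A minimal Python sketch of this redo/undo processing is given below, using a simplified log-record layout (transaction, data item, old value, new value) and an in-memory dictionary in place of the permanent database; both simplifications are for illustration only.

def recover(log, db):
    """Redo committed transactions and undo uncommitted ones over an in-memory 'database'."""
    committed = {rec[0] for rec in log if rec[1] == "commit"}
    started = {rec[0] for rec in log if rec[1] == "start"}

    # Redo: reapply the new values of committed transactions in forward log order.
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            txn, item, old_value, new_value = rec
            db[item] = new_value

    # Undo: restore the old values of unfinished transactions in reverse log order.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] in started and rec[0] not in committed:
            txn, item, old_value, new_value = rec
            db[item] = old_value
    return db

log = [("T1", "start"), ("T1", "A", 5000, 4000), ("T1", "B", 2500, 3500), ("T1", "commit")]
print(recover(log, {"A": 4000, "B": 2500}))      # {'A': 4000, 'B': 3500}: T1 is redone

If the <T1, commit> record were missing from this log, the same scan would instead undo T1 and restore A to 5,000 and B to 2,500.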
Shadow Paging
Shadow paging is a non-log-based out-place updating recovery protocol in which duplicate
stable-storage pages are maintained during the lifetime of a transaction. These are known
as the current page and the shadow page. When the transaction starts, the two pages are identical.
Whenever an update is made, the old stable-storage page, called the shadow page, is left
intact for recovery purpose and a new page with the updated data item values is written into
the stable-storage database, called the current page. During the transaction processing, the
current page is used to record all updates to the database. When the transaction completes,
the current page becomes the shadow page. If a system failure occurs before the successful
completion of the transaction, the current page is considered as garbage.
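A toy Python sketch of this copy-on-write idea is shown below; a real implementation operates on stable-storage pages through a page table, whereas this illustration simply keeps two dictionaries.

class ShadowPagedDB:
    def __init__(self, pages):
        self.shadow = dict(pages)       # shadow page table: left intact during the transaction
        self.current = dict(pages)      # current page table: used to record all updates

    def write(self, page_id, value):
        self.current[page_id] = value   # copy-on-write: only the current entry changes

    def commit(self):
        self.shadow = dict(self.current)   # the current page table becomes the shadow one

    def recover_after_crash(self):
        self.current = dict(self.shadow)   # current pages are discarded as garbage

db = ShadowPagedDB({"P1": "old-P1", "P2": "old-P2"})
db.write("P1", "new-P1")
db.recover_after_crash()
print(db.current)                       # {'P1': 'old-P1', 'P2': 'old-P2'}: the old state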
During normal processing, checkpoints are taken periodically. Taking a checkpoint involves
the following activities:
• All log records in the main memory are written into secondary storage.
• All modified pages in the database buffer are written into secondary storage.
• A checkpoint record is written to the log. This record indicates the identifiers of all
transactions that are active at the time of checkpoint.
If a system failure occurs during checkpointing, the recovery process will not find an
end_checkpoint record and will conclude that checkpointing is not completed. When a
system failure occurs, the database log is checked and all transactions that have committed
since the end_checkpoint are redone, and all transactions that were active at the time of the
crash are undone. In this case, the redo operation only needs to start from the end_checkpoint
record in the log. Similarly, the undo operation can go in the reverse direction, starting from
the end of the log and stopping at the end_checkpoint record.
Some catastrophic failure may cause the loss of log information on stable storage, or loss of
the stable database. In this case, it is not possible to reconstruct the current consistent state of
the database. Cold restart technique is required to restore the database to a previous
consistent state to maintain the consistency of data in case of loss of information on stable
storage. In a distributed database system, the cold restart technique is much more
complicated than in a centralized DBMS, because the state of the distributed database as a
whole is to be restored to the previous consistent state. The previous consistent state is
marked in a database log by a checkpoint. Hence, when the log information on stable storage
is lost, it is not possible to restore the database to the previous consistent state using
checkpointing.
Several techniques have been proposed for performing cold restarts in a distributed database,
and most of them produce a high overhead for the system. One solution to handle this
situation is to store an archive copy of both the database and the log on a different storage
medium. Thus, the DBMS deals with three different levels of memory hierarchy. When loss-
of-log failures occur, the database is recovered from the archive copy by redoing and
undoing the transactions as stored in the archive log. The overhead of writing the entire
database and the log information into a different storage is a significant one. The archiving
activity can be performed concurrently with normal processing whenever any changes are
made to the database. Each archive version contains only the changes that have occurred
since the previous archiving.
Another technique for performing cold restart in a distributed database is to use local logs,
local dumps and global checkpoint. A global checkpoint is a set of local checkpoints that
are performed at all sites of the network and are synchronized by the following criterion: if
the updates of a transaction are recorded in the local checkpoint at one site, then the updates
of that transaction must also be recorded in the local checkpoints of all other sites at which it
has executed.
The reconstruction is relatively easier if global checkpoints are available. First, at the failed
site the latest local checkpoint that can be considered as safe is determined, and this determines
which earlier global state has to be reconstructed. Then all other sites are requested to re-
establish the local states of the corresponding local checkpoints. The main difficulty with this
approach is the recording of global checkpoints.
The following section introduces two distributed recovery protocols: two-phase commit
(2PC) protocol and three-phase commit (3PC) protocol, a non-blocking protocol. To
discuss these protocols, the transaction model described in Chapter 7, Section 7.4 is
considered.
• Phase 1–. The coordinator writes a begin_commit record in its log and sends a
“prepare” message to all the participating sites. The coordinator waits for the
participants to respond for a certain time interval (timeout period).
• Phase 2–. When a participant receives a “prepare” message, it may return an “abort”
vote or a “ready_commit” vote to the coordinator. If the participant returns an “Abort”
vote, it writes an abort record into the local database log and waits for the coordinator
for a specified time period. On the other hand, if the participant returns a
“Ready_commit” vote, it writes a ready_commit record into the corresponding database log
and waits for the coordinator for a specified time interval. After receiving votes from
participants, the coordinator decides whether the transaction will be committed or
aborted. The transaction is aborted, if even one participant votes to abort or fails to
vote within the specified time period. In this case, the coordinator writes an “abort”
record in its log and sends a “global_abort” message to all participants and then waits
for the acknowledgements for a specified time period. Similarly, the transaction is
committed if the coordinator receives commit votes from all participants. Here, the
coordinator writes a “commit” record in its log and sends a “global_commit” message
to all participants. After sending the “global_commit” message to all participants, the
coordinator waits for acknowledgements for a specified time interval. If a participating
site fails to send the acknowledgement within that time limit, the coordinator resends
the global decision until the acknowledgement is received. Once all
acknowledgements have been received by the coordinator, it writes an end_transaction
record in its log.
When a participating site receives a “global_abort” message from the coordinator, it writes
an “abort” record in its log and, after aborting the transaction, sends an acknowledgement to
the coordinator.
If a participant fails to receive a vote instruction from the coordinator, it simply times out
and aborts. Therefore, a participant can abort a transaction before the voting. In 2PC, each
participant has to wait for either the “global_commit” or the “global_abort” message from
the coordinator. If the participant fails to receive the vote instruction from the coordinator, or
the coordinator fails to receive the response from a participant, then it is assumed that a site
failure has occurred and the termination protocol is invoked. The termination protocol is
followed by the operational sites only. The failed sites follow the recovery protocol on
restart. The 2PC protocol is illustrated in figure 10.1.
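The message flow described above can be summarized by the following bare-bones Python sketch of the coordinator; the Participant class, its vote() and decide() methods and the list-based logs are hypothetical stand-ins, and timeouts, acknowledgements and retransmissions are omitted.

def two_phase_commit(coordinator_log, participants):
    coordinator_log.append("begin_commit")
    # Phase 1 (voting): send "prepare" to every participant and collect the votes.
    votes = [p.vote("prepare") for p in participants]
    # Phase 2 (decision): commit only if every participant voted ready_commit.
    if all(v == "ready_commit" for v in votes):
        coordinator_log.append("commit")
        decision = "global_commit"
    else:
        coordinator_log.append("abort")
        decision = "global_abort"
    for p in participants:
        p.decide(decision)                      # acknowledgements are not modelled here
    coordinator_log.append("end_transaction")
    return decision

class Participant:
    def __init__(self, will_commit):
        self.will_commit = will_commit
        self.log = []
    def vote(self, message):
        self.log.append("ready_commit" if self.will_commit else "abort")
        return self.log[-1]
    def decide(self, decision):
        self.log.append(decision)

print(two_phase_commit([], [Participant(True), Participant(False)]))   # 'global_abort'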
Coordinator
The coordinator may timeout in three different states, namely, wait,
abort and commit. Timeouts during the abort and the commit states are handled in the same
manner; thus, only two cases are considered in the following.
1. Timeout in the wait state–. In this state, the coordinator is waiting for participants to
vote whether they want to commit or abort the transaction and timeout occurs. In this
situation, the coordinator cannot commit the transaction, as it has not received all
votes. However, it can decide to globally abort the transaction. Hence, the coordinator
writes an “abort” record in the log and sends a “global_abort” message to all
participants.
2. Timeout in the commit or abort state–. In this state, the coordinator is waiting for all
participants to acknowledge whether they have successfully committed or aborted and
timeout occurs. In this case, the coordinator resends the “global_commit” or
“global_abort” message to participants that have not acknowledged.
Participant
A participant may timeout in two different states. These are initial and ready as shown
in figure 10.1.
Figure 10.1. Two-phase Commit Protocol
1. Timeout in the initial state–. In this state, the participant is waiting for a “prepare”
message from the coordinator and timeout occurs, which indicates that the coordinator
must have failed while in the initial state. Hence, the participant can unilaterally abort
the transaction. If the participant subsequently receives a “prepare” message, either it
can send an “abort” vote to the coordinator or it can simply ignore it.
2. Timeout in the ready state–. In this state, the participant has voted to commit and is
waiting for the “global_abort” or “global_commit” decision from the coordinator, and
timeout occurs. In this case, the participant cannot unilaterally abort the transaction, as
it had voted to commit. The participant also cannot commit the transaction, because it
does not know the global decision. Hence, the participant is blocked until it can learn
from someone the fate of the transaction. The participant could contact any of the
other participants to know the fate of the transaction. This is known as cooperative
termination protocol.
The cooperative termination protocol reduces the likelihood of blocking, but blocking is still
possible. If the coordinator fails and all participants detect this as a result of executing the
termination protocol, then they can elect a new coordinator and thereby resolve the blocking.
The state transition diagram for 2PC is depicted in figure 10.2.
Coordinator failure
The following are the three possible cases of failure of the coordinator.
1. Failure in initial state–. Here, the coordinator has not yet started the commit
procedure. Therefore, it will start the commit procedure on recovery.
2. Failure in wait state–. In this case, the coordinator has sent the “prepare” message to
participants. On recovery, the coordinator will again start the commit procedure from
the beginning; thus, it will send the “prepare” message to all participants once more.
3. Failure in commit or abort state–. In this case, the coordinator has sent the global
decision to all participants. On restart, if the coordinator has received all
acknowledgements, it can complete successfully. Otherwise, the coordinator will
initiate the termination protocol.
Participant failure
The objective of the recovery procedure for a participant is to ensure that it performs the
same action as all other participants on restart, and this restart can be performed
independently. The following are the three possible cases of failure of the participant.
1. Failure in initial state–. Here the participant has not yet voted to commit the
transaction. Hence, the participant can unilaterally abort the transaction, because the
participant has failed before sending the vote. In this case, the coordinator cannot
make a global commit decision without this participant’s vote.
2. Failure in ready state–. In this case, the participant has sent its vote to the
coordinator. On recovery, the participant will invoke the termination protocol.
3. Failure in commit or abort state–. In this case, the participant has completed the
transaction; thus, on restart no further action is necessary.
In linear 2PC, participants can communicate with one another. In this communication
scheme, an ordering is maintained among the sites in the distributed system. Let us consider
that the sites are numbered as 1, 2, 3, ..., n such that site number 1 is the coordinator and the
others are participants. The 2PC is implemented by a forward chain of communication from
the coordinator to the participant n in the voting phase and a backward chain of
communication from participant n to the coordinator in the decision phase. In the voting
phase, the coordinator passes the voting instruction to site 2; site 2 adds its vote and passes it to site
3, site 3 combines its vote and passes it to site 4, and so on. When the nth participant adds its
vote, the global decision is obtained and it is passed backward to the participants, and
eventually back to the coordinator. The linear 2PC reduces the number of messages
compared with centralized 2PC, but it does not provide any parallelism and thus suffers from poor
response-time performance. It is suitable for networks that do not have broadcasting
capability [Bernstein et al., 1987]. The linear 2PC communication scheme is illustrated
in figure 10.4.
Another alternative popular communication scheme for 2PC protocol is distributed 2PC, in
which all participants can communicate with each other directly. In this communication
scheme, the coordinator sends a “prepare” message to all participants in the voting phase. In
the decision phase, each participant sends its decision to all other participants and waits for
messages from all other participants. As the participants can reach a decision on their own,
the distributed 2PC eliminates the requirement for the decision phase of the 2PC protocol.
The distributed 2PC communication scheme is illustrated in figure 10.5.
Coordinator
In the 3PC protocol, the coordinator may timeout in three different states: wait, precommit and
commit (or abort).
1. Timeout in the wait state–. The action taken here is identical to that in the
coordinator timeout in the wait state for the 2PC protocol. In this state, the coordinator
can decide to globally abort the transaction. Therefore, the coordinator writes an
“abort” record in the log and sends a “global_abort” message to all participants.
2. Timeout in the precommit state–. In this case, the coordinator does not know
whether the non-responding participants have already moved to the precommit state or
not, but the coordinator can decide to commit the transaction globally as all
participants have voted to commit. Hence, the coordinator sends a “prepare-to-
commit” message to all participants to move them into the commit state, and then
globally commits the transaction by writing a commit record in the log and sending
“global_commit” message to all participants.
3. Timeout in the commit or abort state–. In this state, the coordinator is waiting for all
participants to acknowledge whether they have successfully committed or aborted and
timeout occurs. Hence, the participants are at least in the precommit state and can
invoke the termination protocol as listed in case (ii) and case (iii) in the following
section. Therefore, the coordinator is not required to take any special action in this
case.
Participant
A participant may timeout in three different states: initial, ready and precommit.
1. Timeout in the initial state–. In this case, the action taken is identical to that in the
termination protocol of 2PC.
2. Timeout in the ready state–. In this state, the participant has voted to commit and is
waiting for the global decision from the coordinator. As the communication with the
coordinator is lost, the termination protocol continues by electing a new coordinator
(the election protocol is discussed later). The new coordinator terminates the
transaction by invoking a termination protocol.
3. Timeout in the precommit state–. In this case, the participant has received the
“prepare-to-commit” message and is waiting for the final “global_commit” message
from the coordinator. This case is handled in the same way as described in case (ii)
above.
In the above two cases, the new coordinator terminates the transaction by invoking the
termination protocol. The new coordinator does not keep track of participant failures during
termination, but it simply guides operational sites for the termination of the transaction. The
termination protocol for the new coordinator is described below.
• If the new coordinator is in the wait state, it will globally abort the transaction.
• If the new coordinator is in the precommit state, it will globally commit the transaction
and send a “global_commit” message to all participants, as no participant is in the
abort state.
• If the new coordinator is in the abort state, it will move all participants to the abort
state.
The 3PC protocol is a non-blocking protocol, as the operational sites can properly terminate
all ongoing transactions. The state transition diagram for 3PC is shown in figure 10.7.
1. Coordinator failure in wait state–. In this case, the participants have already
terminated the transaction during the coordinator failure. On recovery, the coordinator
has to learn from other sites regarding the fate of the transaction.
2. Coordinator failure in precommit state–. The participants have already terminated
the transaction. As it is possible to move into the abort state from the precommit state
during coordinator failure, on restart, the coordinator has to learn from other sites
regarding the fate of the transaction.
3. Participant failure in precommit state–. On recovery, the participant has to learn
from other participants how they terminated the transaction.
ELECTION PROTOCOL
Whenever the participants detect the failure of the coordinator, the election protocol is
invoked to elect a new coordinator. Thus, one of the participating sites is elected as the new
coordinator for terminating the ongoing transaction properly. Using linear ordering, the
election protocol can be implemented in the following way. It is assumed here that each
site Si has an order i in the sequence, the lowest being the coordinator, and each site knows
the ordering and identification of other sites in the system. For electing a new coordinator, all
operational participating sites are asked to send a message to other participants that have
higher identification numbers. In this case, the site Si sends messages to the sites Si+1, Si+2,
Si+3 and so on. Whenever a participant receives a message from a lower-numbered participant,
that participant stops sending messages. Eventually, each participant knows whether there is
an operational site with a lower number or not. If not, then it becomes the new coordinator. If
the elected new coordinator also times out during this process, then election protocol is again
invoked.
This protocol is relatively efficient and most participants stop sending messages quite
quickly. When a failed site with a lower identification number recovers, it forces all higher-
numbered sites to elect it as the new coordinator, regardless of whether a new coordinator has
already been elected or not.
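The outcome of this election can be sketched as follows in Python: each operational site notionally sends a message to every higher-numbered operational site, and a site that hears from no lower-numbered site becomes the new coordinator. The site numbers and the set of operational sites are hypothetical.

def elect_new_coordinator(sites, operational):
    """sites: site numbers in ascending order; operational: the set of sites that are up."""
    received_from_lower = set()
    for s in sites:                              # lower-numbered sites send first
        if s in operational:
            for higher in sites:
                if higher > s and higher in operational:
                    received_from_lower.add(higher)
    candidates = [s for s in sites if s in operational and s not in received_from_lower]
    return candidates[0] if candidates else None

# Sites 1 to 5, where the old coordinator (site 1) and site 2 have failed:
print(elect_new_coordinator([1, 2, 3, 4, 5], {3, 4, 5}))   # 3 becomes the new coordinator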
Network Partition
Owing to communication link failures, network partition can occur in a distributed system. If
the network is split into only two partitions, then it is called simple partitioning. On the
other hand, if the network is split into more than two partitions, then it is called multiple
partitioning. It is very difficult to maintain the consistency of the database in the case of
network partition, as messages are lost. In the case of non-replicated databases, a transaction
is allowed to proceed during network partitioning, if it does not require any data from a site
outside the partition. Otherwise, the transaction must wait until the sites from which it
requires data are available. In the case of replicated data, the procedure becomes much more
complicated.
Pessimistic Protocols
Pessimistic protocols emphasize the consistency of the database, and therefore do not
allow the processing of transactions during network partitioning if there is no guarantee that
consistency can be maintained. The termination protocols that deal with network partitioning
in the case of non-replicated databases are pessimistic. To minimize blocking, it is
convenient to allow the termination of the transaction by at least one group, possibly the
largest group, during a network partition. However, during network partition, it is impossible
for a group to determine whether it is the largest one or not. In this case, the termination
protocol can use a pessimistic concurrency control algorithm such as primary copy
2PL or majority locking as described in Chapter 8, Section 8.5.
In the case of primary copy 2PL, for any transaction, only the partition that contains the
primary copies of the data items can execute the transaction. Hence, the recovery of the
network involves simply propagating all the updates to every other site. However, this
method is vulnerable to the failure of the site that contains the primary copy. In many
subnets, a primary site failure is more likely to occur than a partition; thus, this approach can
increase blocking, instead of reducing it.
The above problem can be solved if majority locking technique is used. The basic idea of
majority locking protocol is that before committing (or aborting) a transaction, a majority of
sites must agree to commit (or abort) the transaction. The majority locking protocol cannot
be applied with 2PC protocol, as it requires a specialized commit protocol.
Both of the above approaches are simple, and recovery using these approaches is very much
straightforward. However, they require that each site is capable of differentiating network
partitioning from site failures. Another straightforward generalization of the majority locking
protocol, known as quorum-based protocol (or weighted majority protocol), can be used
as a replica control method for replicated databases as well as a commit method to ensure
transaction atomicity in the case of network partitioning. In the case of non-replicated
databases the integration of the quorum-based protocols with commit protocol is necessary.
In quorum-based protocols, a weight, usually known as a vote, is assigned to each site. The
basic rules for the quorum-based protocol are as follows:
1. In the network, each site i is assigned a weight (or vote) Vi, which is a positive
integer. Assume that V is the sum of the votes of all sites in the network.
2. A transaction must collect a commit quorum Vc before committing the transaction.
3. A transaction must collect an abort quorum Va before aborting the transaction.
4. Vc + Va > V.
The last rule ensures that a transaction cannot be aborted and committed at the same time.
Rules 2 and 3 ensure that a transaction has to obtain sufficient votes before it is terminated.
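A small numerical sketch of these rules is given below; the vote assignment and the quorum values are hypothetical.

votes = {"site1": 1, "site2": 1, "site3": 1, "site4": 1, "site5": 1}   # hypothetical votes
V = sum(votes.values())                  # total number of votes, V = 5
Vc, Va = 3, 3                            # chosen commit quorum and abort quorum

assert Vc + Va > V                       # rule 4: the two quorums must overlap

def quorum_reached(responding_sites, quorum):
    """True if the sites that responded together hold at least the required quorum."""
    return sum(votes[s] for s in responding_sites) >= quorum

print(quorum_reached({"site1", "site2", "site3"}, Vc))   # True: a commit quorum exists
print(quorum_reached({"site4", "site5"}, Va))            # False: no abort quorum yet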
The integration of the 3PC protocol with quorum-based protocols requires a minor modification
to phase 3. To move from the precommit state to the commit state, it is necessary for the
coordinator to obtain a commit quorum from the participants. It is not necessary to
implement rule 3 explicitly, because a transaction that is in the wait or ready state is
always willing to abort the transaction; thus, an abort quorum already exists. When network
partition occurs, the sites in each partition elect a new coordinator to terminate the
transaction. If the newly elected coordinator fails, it is not possible to know whether a
commit quorum or an abort quorum was reached. Therefore, it is necessary for the
participants to take an explicit decision before joining either the commit quorum or the abort
quorum; the votes cannot be changed afterwards. Since the wait and the ready states do not
satisfy these requirements, one new state “preabort” is added between ready and abort states.
The state transition diagram for the modified 3PC is illustrated in figure 10.8.
During a network partition, a transaction is terminated under the quorum 3PC protocol by the
newly elected coordinator in each partition, which collects the states of the participants and
commits the transaction only if a commit quorum can be obtained, or aborts it only if an abort
quorum can be obtained. The quorum 3PC protocol is a blocking protocol, but it is capable of
handling site failures as well as network partitioning.
One protocol for enforcing consistency of replicated data items is ROWA (read-one-write-
all) protocol [described in Chapter 8]. There are various versions of the ROWA protocol. In the
ROWA-A (read-one-write-all-available) protocol, the update is performed on all the available
copies of the replicated data item, and unavailable copies are updated later. In this approach,
the coordinator sends an
update message to all sites where replicas of the data item reside and waits for update
confirmation. If the coordinator times out before receiving acknowledgements from all sites,
it is assumed that the non-responding sites are unavailable and the coordinator continues the
update with available sites to terminate the transaction. There are two problems with this
approach. In this approach, it may so happen that one site containing the replica of a data
item has already performed the update and sent the acknowledgement, but as the coordinator
has not received it within the specified time interval, the coordinator has considered the site
as unavailable. Another problem is that some sites might have been unavailable when the
transaction started, but might have recovered since then and might have started executing
transactions. Therefore, the coordinator must perform a validation before committing. To
perform this validation, the coordinator sends an “inquiry” message to all sites. If the
coordinator gets a reply from a site that was unavailable previously (that is, the coordinator
had not received any acknowledgement from this site previously), the coordinator terminates
the transaction by aborting it; otherwise, the coordinator proceeds to commit the transaction.
The difficulty with the quorum-consensus protocol is that transactions are required to obtain
a quorum even to read data. This significantly and unnecessarily slows read access to the
database.
Chapter Summary
• The major objective of a recovery manager is to employ an appropriate technique for
recovering the system from a failure, and to preserve the database in the consistent
state that existed prior to the failure.
• Reliability refers to the probability that the system under consideration does not
experience any failures in a given time period.
• Availability refers to the probability that the system can continue its normal execution
according to the specification at a given point in time in spite of failures.
• There are different types of failures that may occur in a distributed database
environment such as site failure, link failure, loss of message and network partition.
• Two distributed recovery protocols are 2PC protocol and 3PC protocol. 2PC is a
blocking protocol whereas 3PC is a non-blocking protocol.
• The termination protocols that are used to handle network partition can be classified
into two categories: pessimistic protocols and optimistic protocols. In the case of non-
replicated databases, the termination protocols that are used to deal with network
partition are pessimistic. In the case of replicated databases, both types of termination
protocols can be used.
Chapter 11. Distributed Query Processing
This chapter introduces the basic concepts of distributed query processing. A query involves
the retrieval of data from the database. In this chapter, the different phases and subphases of
distributed query processing are briefly discussed with suitable examples. An important
phase in distributed query processing is distributed query optimization. This chapter focuses
on different query optimization strategies, distributed cost model and cardinalities of
intermediate result relations. Finally, efficient algorithms for centralized and distributed
query optimization are also presented in this chapter.
The organization of this chapter is as follows. Section 11.1 introduces the fundamentals of
query processing, and the objectives of distributed query processing are presented in Section
11.2. The different steps for distributed query processing such as query transformation, query
fragmentation, global query optimization and local query optimization are briefly discussed
in Section 11.3. In Section 11.4, join strategies in fragmented relations are illustrated. The
algorithms for global query optimization are presented in Section 11.5.
Example 11.1.
Let us consider two relations, Employee and Department, which are stored in a centralized
DBMS, and the following query:
• “Retrieve the names of all employees whose department location is ‘inside’ the
campus”
where, empid and deptno are the primary keys for the relations Employee and Department
respectively and deptno is a foreign key of the relation Employee.
Two equivalent relational algebraic expressions that correspond to the above query are
sketched below.
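Assuming, for illustration, that the employee name attribute is ename and that the Department relation carries a location attribute whose value is 'inside' for departments located inside the campus, the two expressions can be written as:
• Πename (σlocation = 'inside' ∧ Employee.deptno = Department.deptno (Employee × Department))
• Πename (σEmployee.deptno = Department.deptno (Πename, deptno (Employee) × Πdeptno (σlocation = 'inside' (Department))))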
In the first relational algebraic expression the projection and the selection operations have
been performed after calculating the Cartesian product of two relations Employee and
Department, whereas in the second expression the Cartesian Product has been performed
after performing the selection and the projection operation from individual relations.
Obviously, the use of computing resources is smaller in the second expression. Thus, in a
centralized DBMS, it is easier to choose the optimum execution strategy based on a number
of relational algebraic expressions that are equivalent to the same query. In a distributed
context, the query processing is significantly more difficult because the choice of optimum
execution strategy depends on some other factors such as data transfer among sites and the
selection of the best site for query execution. The problem of distributed query processing is
discussed below with the help of an example.
Example 11.2.
Let us consider the query of Example 11.1 in a distributed database environment where the
Employee and Department relations are fragmented and stored at different sites. For
simplicity, let us assume that the Employee relation is horizontally fragmented into two
partitions EMP1 and EMP2, which are stored at site1 and site2 respectively, and the
Department relation is horizontally fragmented into two relations DEPT1 and DEPT2, which
are stored at site3 and site4 respectively, as listed below.
• EMP1 = σdeptno <= 10 (Employee)
• EMP2 = σdeptno > 10 (Employee)
• DEPT1 = σdeptno <= 10 (Department)
• DEPT2 = σdeptno > 10 (Department)
Further assume that the above query is generated at Site5 and the result is required at that
site. Two different strategies to execute the query in the distributed environment are depicted
in figure 11.2.
In the first strategy, all data are transferred to Site5 before processing the query. In the
second strategy, selection operations are performed individually on the fragmented relations
DEPT1 and DEPT2 at Site3 and Site4 respectively, and then the resultant data D1 and D2 are
transferred to Site1 and Site2 respectively. After evaluating the join operations D1 with
EMP1 at Site1 and D2 with EMP2 at Site2, the resultant data are transmitted to Site5 and the
final projection operation is performed.
To calculate the costs of the above two different execution strategies, let us assume that the
cost of accessing a tuple from any relation is 1 unit, and the cost of transferring a tuple
between any two sites is 10 units. Further consider that the number of tuples in Employee
and Department relations are 1,000 and 20 respectively, among which the location of 8
departments are inside the campus. For the sake of simplicity, assume that the tuples are
uniformly distributed among sites, and the relations Employee and Department are locally
clustered on attributes “deptno” and “location” respectively.
1. The cost of transferring 10 tuples of DEPT1 from Site3 and 10 tuples of DEPT2 from
Site4 to Site5 = (10 + 10) * 10 = 200
2. The cost of transferring 500 tuples of EMP1 from Site1 and 500 tuples of EMP2 from
Site2 to Site5 = (500 + 500) * 10 = 10,000
3. The cost of producing selection operations from DEPT1 and DEPT2 = 20 * 1 = 20
4. The cost of performing join operation of EMP1 and EMP2 with resultant selected data
from Department relation = 1,000 * 8 * 1 = 8,000
5. The cost of performing the projection operation at Site5 to retrieve the employee names =
8 * 1 = 8
The cost of performing the projection operation and the selection operation on a tuple is the
same, because in both cases it is equal to the cost of accessing a tuple. Obviously, execution
strategy B is much more cost-effective than strategy A. Furthermore, slower communication
between the sites and a higher degree of fragmentation would increase the cost difference
between the alternative query processing/execution strategies.
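As a quick arithmetic check, the following Python sketch recomputes the five cost components listed above from the stated unit costs (1 unit per tuple access and 10 units per tuple transfer) and totals them.

access_cost, transfer_cost = 1, 10               # 1 unit per tuple access, 10 per tuple transfer

cost_transfer_departments = (10 + 10) * transfer_cost       # component 1: 200
cost_transfer_employees   = (500 + 500) * transfer_cost     # component 2: 10,000
cost_select_departments   = 20 * access_cost                # component 3: 20
cost_join                 = 1000 * 8 * access_cost          # component 4: 8,000
cost_project_names        = 8 * access_cost                 # component 5: 8

total = (cost_transfer_departments + cost_transfer_employees +
         cost_select_departments + cost_join + cost_project_names)
print(total)                                     # 18,228 units in total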
Query Decomposition
Query decomposition is the first phase in distributed query processing. The objective of this
phase is to transform a query in a high-level language on global relations into a relational
algebraic query on global relations. The information required for the transformation is
available in the global conceptual schema describing the global relations. In this phase, the
syntactical and semantic correctness of the query is also checked. In the context of both
centralized DBMS and distributed DBMS, the query decomposition phase is the same. The
four successive steps of query decomposition are normalization, analysis,
simplification and query restructuring, which are briefly discussed in the following
sections.
NORMALIZATION
In the normalization step, the query is converted into a normalized form to facilitate further
processing. A given query can be arbitrarily complex depending on the
predicates (WHERE clause in SQL) specified in the query. In this step, the complex query is
generally converted into one of the two possible normal forms by using a few transformation
rules, which are listed below.
• P1 ∧ P2 ⇔ P2 ∧ P1
• P1 ∨ P2 ⇔ P2 ∨ P1
• P1 ∧ (P2 ∧ P3) ⇔ (P1 ∧ P2) ∧ P3
• P1 ∨ (P2 ∨ P3) ⇔ (P1 ∨ P2) ∨ P3
• P1 ∧ (P2 ∨ P3) ⇔ (P1 ∧ P2) ∨ (P1 ∧ P3)
• P1 ∨ (P2 ∧ P3) ⇔ (P1 ∨ P2) ∧ (P1 ∨ P3)
• ¬(P1 ∧ P2) ⇔ ¬P1 ∨ ¬P2
• ¬(P1 ∨ P2) ⇔ ¬P1 ∧ ¬P2
• ¬(¬P1) ⇔ P1
The two possible normal forms are conjunctive normal form and disjunctive normal
form.
Conjunctive Normal Form: In conjunctive normal form, preference is given to the ∧ (AND)
operator; the predicate is a sequence of conjuncts connected by the ∧ (AND) operator, where
each conjunct contains one or more simple predicates connected by the ∨ (OR) operator. For
instance,
(P11 ∨ P12 ∨ ... ∨ P1n) ∧ (P21 ∨ P22 ∨ ... ∨ P2n) ∧ ... ∧ (Pm1 ∨ Pm2 ∨ ... ∨ Pmn),
where each Pij represents a simple predicate, is in conjunctive normal form.
Disjunctive Normal Form: In disjunctive normal form, preference is given to the ∨ (OR)
operator; the predicate is a sequence of disjuncts connected by the ∨ (OR) operator, where
each disjunct contains one or more simple predicates connected by the ∧ (AND) operator. For
instance,
(P11 ∧ P12 ∧ ... ∧ P1n) ∨ (P21 ∧ P22 ∧ ... ∧ P2n) ∨ ... ∨ (Pm1 ∧ Pm2 ∧ ... ∧ Pmn)
is in disjunctive normal form. In this normal form, a query can be processed as independent
conjunctive subqueries connected by union operations.
Example 11.3.
Here, in a disjunctive normal form such as the one shown above, each disjunct connected by
the ∨ (OR) operator can be processed as an independent conjunctive subquery.
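As a concrete illustration of the normalization step, the following Python sketch applies it to a hypothetical predicate built from three simple predicates P1, P2 and P3, using the to_cnf and to_dnf helpers of the sympy library.

from sympy import symbols
from sympy.logic.boolalg import to_cnf, to_dnf

P1, P2, P3 = symbols("P1 P2 P3")
predicate = P1 & (P2 | ~P3)        # a predicate of the form P1 ∧ (P2 ∨ ¬P3)

print(to_cnf(predicate))           # P1 & (P2 | ~P3): already in conjunctive normal form
print(to_dnf(predicate))           # (P1 & P2) | (P1 & ~P3): disjunctive normal form

Each of the two disjuncts of the disjunctive normal form, P1 ∧ P2 and P1 ∧ ¬P3, can then be processed as an independent conjunctive subquery and the results combined by a union operation.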
ANALYSIS
The objective of the analysis step is to reject normalized queries that are incorrectly
formulated or contradictory. The query is lexically and syntactically analyzed in this step by
using the compiler of the high-level query language in which the query is expressed. In
addition, this step verifies whether the relations and attributes specified in the query are
defined in the global conceptual schema or not. It is also checked in the analysis step
whether the operations on database objects specified in the given query are correct for the
object type. When incorrectness is detected in any of the above aspects, the query is returned
to the user with an explanation; otherwise, the high-level query is transformed into an
internal form for further processing. The incorrectness in the query is detected based on the
corresponding query graph or relation connection graph, which can be constructed as
follows:
• A node is created in the query graph for the result and for each base relation specified
in the query.
• An edge between two nodes is drawn in the query graph for each join operation and
for each project operation in the query. An edge between two nodes that are not result
nodes represents a join operation, whereas an edge whose destination node is the result
node represents a project operation.
• A node in the query graph that is not the result node is labelled by a select operation or
a self-join operation specified in the query.
The relation connection graph is used to check the semantic correctness of the subset of
queries that do not contain disjunction or negation. Such a query is semantically incorrect, if
its relation connection graph is not connected. A join graph for a query is a subgraph of the
relation connection graph which represents only join operations specified in the query, and it
can be derived from the corresponding query graph. The join graph is useful in the query
optimization phase. A query is contradictory if its normalized attribute connection
graph [Rosenkrantz & Hunt, 1980] contains a cycle for which the valuation sum is negative.
The construction of normalized attribute connection graph for a query is described in the
following:
• A node is created for each attribute referenced in the query and an additional node is
created for a constant 0.
• A directed edge between two attribute nodes is drawn to represent each join operation
in the query. Similarly, a directed edge between an attribute node and a constant 0
node is drawn for each select operation specified in the query.
• A weight is assigned to each edge depending on the inequality condition mentioned in
the query. A weight v is assigned to the directed edge a1 → a2, if there is an
inequality condition in the query that satisfies a1 ≤ a2 + v. Similarly, a weight −v is
assigned to the directed edge 0 → a1, if there is an inequality condition in the query
that satisfies a1 ≥ v.
Example 11.4.
• “Retrieve the names, addresses and course names of all those students whose year
of admission is 2008 and course duration is 4 years”.
• Select sname, address, course-name from Student, Course where year = 2008 and
duration = 4 and Student.course-id = Course.course-id.
Here, course-id is a foreign key of the relation Student. The above SQL query can be
syntactically and semantically incorrect for several reasons. For instance, the attributes
sname, address, course-name, year and duration may not be declared in the corresponding
schema or the relations Student, Course may not be defined in the global conceptual schema.
Furthermore, if the operations “= 2008” and “= 4” are incompatible with data types of the
attributes year and duration respectively, then the above SQL query is incorrect.
The query graph and the join graph for the above SQL query are depicted in figures
11.4(a) and 11.4(b), respectively.
In the above SQL query, if the join condition between two relations (that is, Student.course-
id = Course.course-id) is missing, then there would be no line between the nodes
representing the relations Student and Course in the corresponding query graph [figure
11.4(a)]. Therefore, the SQL query is deemed semantically incorrect as the relation
connection graph is disconnected. In this case, either the query is rejected or an implicit
Cartesian product between the relations is assumed.
Example 11.5.
Let us consider the query “Retrieve all those student names who are admitted into courses
where the course duration is greater than 3 years and at most 2 years”, which involves the
relations Student and Course.
• Select sname from Student, Course where duration > 3 and Student.course-id =
Course.course-id and duration ≤ 2.
The normalized attribute connection graph for the above query is illustrated in figure 11.5.
In the above normalized attribute connection graph, there is a cycle between the nodes
duration and 0 with a negative valuation sum, which indicates that the query is contradictory.
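The negative cycle can be verified from the construction rules given above (treating the
predicate duration > 3 as duration ≥ 3 for the purpose of assigning weights):
duration ≤ 2 gives the edge duration → 0 with weight +2
duration ≥ 3 gives the edge 0 → duration with weight −3
valuation sum of the cycle = 2 + (−3) = −1 < 0
Since the sum is negative, no value of duration can satisfy both predicates, and the query is
contradictory.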
SIMPLIFICATION
In this step, all redundant predicates in the query are detected and the common
subexpressions are eliminated to transform the query into a simpler and efficiently
computable form. This transformation must achieve the semantic correctness. Typically,
view definitions, access restrictions and integrity constraints are considered in this step, some
of which may introduce redundancy in the query. The well-known idempotency rules of
Boolean algebra are used to eliminate redundancies from the given query, which are listed
below.
• P ∧ P ⇔ P
• P ∨ P ⇔ P
• P ∧ true ⇔ P
• P ∧ false ⇔ false
• P ∨ true ⇔ true
• P ∨ false ⇔ P
• P ∧ (¬P) ⇔ false
• P ∨ (¬P) ⇔ true
• P ∧ (P ∨ Q) ⇔ P
• P ∨ (P ∧ Q) ⇔ P
Example 11.6.
Let us consider the following view definition and a query on the view that involves the
relation Employee (empid, ename, salary, designation, deptno).
• Create view V1 as select empid, ename, salary from Employee where deptno =
10;
• Select * from V1 where deptno = 10 and salary > 10000;
• Select empid, ename, salary from Employee where (deptno = 10 and salary >
10000) and deptno = 10;
Here, the predicates are redundant and the WHERE condition reduces to “deptno = 10 and
salary > 10,000”.
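After the redundant predicate is removed by using the idempotency rule P ∧ P ⇔ P, the query
that is actually executed is equivalent to the following.
• Select empid, ename, salary from Employee where deptno = 10 and salary > 10000;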
QUERY RESTRUCTURING
In this step, the query in the high-level language is rewritten into an equivalent relational
algebraic form. This step involves two substeps. Initially, the query is converted into an
equivalent relational algebraic form and then the relational algebraic query is restructured to
improve performance. The relational algebraic query is represented by a query
tree or operator tree which can be constructed as follows:
In relational data model, the conversion from SQL query to relational algebraic form can be
done in an easier way. The leaf nodes in the query tree are created from the FROM clause of
the SQL query. The root node is created as a project operation involving the result attributes
from the SELECT clause specified in SQL query. The sequence of relational algebraic
operations, which depends on the WHERE clause of SQL query, is directed from the leaves
to the root of the query tree. After generating the equivalent relational algebraic form from
the SQL query, the relational algebraic query is restructured by using transformation
rules from relational algebra, which are listed below. In listing these rules, three different
relations R, S and T are considered, where the relation R is defined over the attributes A =
{A1, A2, . . ., An} and the relation S is defined over the attributes B = {B1, B2, . . ., Bn}.
1. Commutativity of binary operators–. The Cartesian product, join, union and intersection
operations of relational algebra are commutative:
R × S ⇔ S × R and R ⋈ S ⇔ S ⋈ R
Similarly, R ∪ S ⇔ S ∪ R and R ∩ S ⇔ S ∩ R
This rule is not applicable to set difference and semijoin operations of relational
algebra.
2. Associativity of binary operators–. Cartesian product and natural join operation are
always associative:
(R × S) × T ⇔ R × (S × T) and
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
3. Idempotence of unary operators–. Several successive unary operations such as
selection and projection on the same relation may be grouped together. Conversely, a
single unary operation on several attributes can be separated into several successive
unary operations as shown below:
σp(A1)(σq(A2)(R)) = σp(A1) ∧ q(A2)(R), where p and q denote predicates.
ПA″(ПA′(R)) = ПA″(R),
where A, A′ and A″ are sets of attributes defined on relation R, A′ and A″ are
subsets of A, and A″ is a subset of A′.
4. Commutativity of selection with projection–. Selection and projection operations on the
same relation can be commuted, provided the attributes involved in the selection
predicate are retained by the projection:
ПA′(σp(R)) ⇔ σp(ПA′(R)), where the predicate p involves only attributes of A′.
5. Commutativity of selection with binary operators–. A selection operation can be
commuted with Cartesian product and join operations if the selection predicate involves
only attributes of one of the operand relations:
σp(A)(R × S) ⇔ σp(A)(R) × S and σp(A)(R ⋈ S) ⇔ σp(A)(R) ⋈ S,
where the attribute A belongs to the relation R.
Similarly, selection operation can be commuted with union and set difference
operations if both relations are defined over the same schema:
σp(A)(R ∪ T) ⇔ σp(A)(R) ∪ σp(A)(T) and σp(A)(R − T) ⇔ σp(A)(R) − σp(A)(T)
6. Commutativity of projection with binary operators–. A projection operation can be
commuted with Cartesian product and join operations, where the projected attribute set
C = A′ ∪ B′, A′ being a subset of the attributes of R and B′ a subset of the attributes of S
(for the join, C must also contain the join attributes):
ПC(R × S) ⇔ ПA′(R) × ПB′(S)
Similarly, projection operation can be commuted with the union operation:
ПC(R ∪ S) ⇔ ПC(R) ∪ ПC(S)
Proofs of the above transformation rules are available in [Aho et al., 1979]. By applying
these rules, a large number of equivalent query trees can be constructed for a given query
among which the most cost-effective one is chosen at the optimization phase. However,
generating an excessively large number of operator trees for the same query is not practical in
this approach. Therefore, the above transformation rules are used in a methodical way to
construct the query tree for a given query. The transformation rules are used in the following
sequence.
• Unary operations in the query are separated first to simplify the query.
• Unary operations on the same relation are grouped so that common expressions are
computed only once [rule no. (iii)].
• Unary operations are commuted with binary operations [rule no. (v) & (vi)].
• Binary operations are ordered.
Example 11.7.
Let us consider the following SQL query which involves the relations Student (s-id, sname,
address, course-id, year) and Course (course-id, course-name, duration, course-fee,
intake-no, coordinator):
• Select sname, course-name from Student, Course where s-id > 0 and year = 2007
and Student.course-id = Course.course-id and duration = 4 and intake-no = 60.
The query tree for the above SQL query is depicted in figure 11.6.
Figure 11.6. Query tree
An equivalent query tree is constructed by applying the above transformation rules as shown
in figure 11.7.
Figure 11.7. Equivalent Query Tree of Figure 11.6
Here, unary operations specified in the query are separated first, and all unary operations on
the same relation are grouped together to reconstruct the query tree. However, the query tree
in figure 11.7 need not necessarily represent the optimal tree.
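The restructured tree can also be written as a relational algebraic expression. The expression
below is a sketch consistent with the transformation rules above, although it is not necessarily
identical to the tree of figure 11.7:
Пsname, course-name(σs-id > 0 ∧ year = 2007(Student) ⋈course-id σduration = 4 ∧ intake-no = 60(Course))
Here, the unary selections have been grouped and pushed down to the individual relations, and
the join predicate Student.course-id = Course.course-id is applied as a join of the two reduced
operands.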
Query Fragmentation
In query fragmentation phase a relational algebraic query on global relations is converted
into an algebraic query expressed on physical fragments, called a fragment query,
considering data distribution in the distributed database. A global relation can be
reconstructed from its fragments using the reconstruction rules of fragmentation [as
described in Chapter 5, Section 5.3.2]. For horizontal fragmentation, the reconstruction rule
is the union operation of relational algebra, and for vertical fragmentation the reconstruction
rule is the join operation of relational algebra. Thus, query fragmentation is defined through
fragmentation rules, and it uses the information stored in the fragment schema of the
distributed database. For the sake of simplicity, the replication issue is not considered here.
An easier way to generate a fragment query is to replace the global relations at the leaves of
the query tree or the operator tree of the distributed query with their reconstruction rules. The
relational algebraic query tree generated by applying the reconstruction rules is known
as a generic tree. This approach is not very efficient, because a generic tree usually permits
further simplification and restructuring. Therefore, to generate a simpler and optimized
query from a generic query, reduction techniques are used where the reduction techniques
are dependent on the types of fragmentation. Reduction techniques for different types of
fragmentation are illustrated with examples in the next section.
Example 11.8.
Let us consider the following SQL query that involves the relation Employee (empid,
ename, salary, designation, deptno):
Assume that the Employee relation is partitioned into two horizontal fragments EMP1 and
EMP2 depending on the selection predicates as mentioned below:
Now, the relation Employee can be reconstructed from its horizontal fragments EMP1 and
EMP2 by using the following reconstruction rule.
Therefore, in the generic tree of the above SQL query, the leaf node corresponding to the
Employee relation can be replaced by the reconstruction rule EMP1 ∪ EMP2. Here, the
selection predicate contradicts the definition of horizontal fragment EMP1, thereby producing
an empty relation. This operation can be eliminated from the generic tree as shown in figure
11.8.
Figure 11.8. Reduction for Horizontal Fragmentation with Selection
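A hypothetical instance of this reduction, with the fragment predicates and the query assumed
only for illustration, is the following:
EMP1 = σdeptno ≤ 10(Employee) and EMP2 = σdeptno > 10(Employee)
• Select * from Employee where deptno = 15;
The generic form of the query is σdeptno = 15(EMP1 ∪ EMP2). Since the predicate deptno = 15
contradicts the defining predicate deptno ≤ 10 of EMP1, the subtree σdeptno = 15(EMP1) produces
an empty relation and can be eliminated, leaving the reduced query σdeptno = 15(EMP2).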
Example 11.9.
Let us assume that the relation Employee (empid, ename, salary, designation, deptno) is
horizontally fragmented into two partitions EMP1 and EMP2 and the relation Department
(deptno, dname, location) is horizontally fragmented into two relations DEPT1 and
DEPT2. These horizontally fragmented relations are defined in the following:
The reconstruction rules for the above horizontal fragments are as follows:
Let us consider the following SQL query in a distributed environment, which involves the
relations Employee and Department.
The generic query tree and the reduced query tree for the above query are depicted in figure
11.9. Thus, the commutativity of join operation with union operation is very important in
distributed DBMSs, because it allows a join operation of two relations to be implemented as
a union operation of partial join operations, where each part of the union operation can be
executed in parallel.
Figure 11.9. Reduction for Horizontal Fragmentation with Join
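The effect of this reduction can be sketched with assumed fragment predicates, chosen only for
illustration so that Employee and Department are fragmented on the same predicate:
EMP1 = σdeptno ≤ 10(Employee), EMP2 = σdeptno > 10(Employee)
DEPT1 = σdeptno ≤ 10(Department), DEPT2 = σdeptno > 10(Department)
• Select ename, dname from Employee, Department where Employee.deptno =
Department.deptno;
The generic form (EMP1 ∪ EMP2) ⋈deptno (DEPT1 ∪ DEPT2) expands into the union of four
partial joins. Because the fragments are defined on disjoint ranges of deptno, the cross terms
EMP1 ⋈ DEPT2 and EMP2 ⋈ DEPT1 are empty, so the reduced query is (EMP1 ⋈deptno DEPT1)
∪ (EMP2 ⋈deptno DEPT2), and the two partial joins can be executed in parallel at the sites
holding the corresponding fragments.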
Example 11.10.
Let us assume that the Employee relation is vertically fragmented into two relations
EMP1 and EMP2 as defined below:
In this query, the projection operation on relation EMP2 is redundant, since the attributes
ename and salary are not part of EMP2. The generic query tree of the above query is
illustrated in figure 11.10(a). By commutating the projection operation with join operation
and removing the vertical fragment EMP2, the reduced query tree is produced as shown
in figure 11.10(b).
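A hypothetical instance of this reduction, with the vertical fragments and the query assumed
only for illustration (empid is retained in both fragments so that Employee can be
reconstructed), is the following:
EMP1 = Пempid, ename, salary(Employee) and EMP2 = Пempid, designation, deptno(Employee)
• Select ename, salary from Employee;
The generic form is Пename, salary(EMP1 ⋈empid EMP2). Since both ename and salary belong to
EMP1, the projection can be commuted with the join and the useless fragment EMP2 can be
removed, giving the reduced query Пename, salary(EMP1).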
Example 11.11.
Let us assume that the relation Department (deptno, dname, location) is horizontally
fragmented into two relations DEPT1 and DEPT2. These horizontally fragmented
relations are defined in the following:
Further assume that the fragmentation of Employee relation is derived from Department
relation.
The generic query tree for the derived fragmentation is depicted in figure 11.11(a).
In the above generic query tree, the selection operation on fragment DEPT1 is redundant and
can be eliminated. Similarly, as the relation DEPT2 is defined depending on the selection
predicate “deptno > 10”, the entire selection operation can be eliminated from the above
generic tree. By eliminating selection operation and commutating join operation with union
operation, the reduced query tree is produced which is shown in figure 11.11(b).
SEARCH SPACE
The search space is defined as the set of equivalent query trees for a given query that can be
generated by using transformation rules. In query optimization, join trees are particularly
important, because they determine the join order of relations involved in a given query,
which affects the performance of query processing. If a given query involves many operators
and many relations, then the search space is very large, because it contains a large number of
equivalent query trees for the given query. Hence, the query optimization process becomes
more expensive than the actual execution, and therefore query optimizers typically impose
some restrictions on the size of the search space to be considered. Most query optimizers use
heuristics rules that order the relational algebraic operations (selection, projection and
Cartesian product) in the query. Another restriction is one that is imposed on the shape of the
join tree. There are two different kinds of join trees, known as linear join trees and bushy
join trees. In a linear join tree, at least one operand of each operator node is a base relation.
On the other hand, a bushy join tree is more general and may have operators with no base
relations as operands. Linear join trees reduce the size of the search space, whereas bushy
join trees facilitate parallelism. The examples of a linear join tree and a bushy join tree are
illustrated in figures 11.12(a) and 11.12(b), respectively.
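For instance, for a query joining four relations R, S, T and U (chosen arbitrarily for
illustration), the two shapes can be written as follows:
Linear join tree: ((R ⋈ S) ⋈ T) ⋈ U, in which at least one operand of every join is a base relation.
Bushy join tree: (R ⋈ S) ⋈ (T ⋈ U), in which the innermost joins have no base relation in
common and can therefore be computed in parallel.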
OPTIMIZATION STRATEGY
There are three different kinds of optimization strategies, known as static optimization
strategy, dynamic optimization strategy and randomized optimization strategy. Broadly, in
static optimization the execution strategy is selected at compile time, before the query is
executed, using the database statistics available at that time; in dynamic optimization the
strategy is selected (or revised) at run time, when the exact sizes of the intermediate results
are known; and a randomized optimization strategy explores the search space with randomized
algorithms, such as iterative improvement or simulated annealing, to find a good, though not
necessarily optimal, strategy at a reasonable optimization cost.
Cost functions
In a distributed system, the cost of processing a query can be expressed in terms of the total
cost measures or the response time measures [Yu and Chang, 1984]. The total cost measure
is the sum of all cost components. If no relation is fragmented in the distributed system and
the given query includes selection and projection operations, then the total cost measure
involves the local processing cost only. However, when join and semijoin operations are
executed, communication costs between different sites may be incurred in addition to the
local processing cost. Local processing costs are usually evaluated in terms of the number of
disk accesses and the CPU processing time, whereas communication costs are expressed in
terms of the total amount of data transmitted. For geographically dispersed computer
networks, communication cost is normally the dominant consideration, but local processing
cost is of greater significance for local networks. Thus, most early distributed DBMSs
designed for wide area networks have ignored the local processing cost and concentrated on
minimizing the communication cost. Therefore, the total cost measure can be represented by
using the following formula.
Total cost = TCPU * insts + TI/O * ops + C0 + C1 * X
where TCPU is the CPU processing cost per instruction, insts represents the total number of
CPU instructions, TI/O is the I/O processing cost per I/O operation, ops represents the total
number of I/O operations, C0 is the start-up cost of initiating transmission, C1 is a
proportionality constant, and X is the amount of data to be transmitted. For wide area
networks, the above formula is simplified as follows.
Total cost = C0 + C1 * X
The response time measure is the time from the initiation of the query to the time when the
answer is produced. The response time measure must consider the parallel local processing
costs and the parallel communication costs. A general formula for response time can be
expressed as follows.
Response time = TCPU * seq_insts + TI/O * seq_ops + C0 + C1 * seq_X
where seq_insts represents the maximum number of CPU instructions that can be performed
sequentially, seq_ops represents the maximum number of I/O operations that can be
performed sequentially, and seq_X indicates the amount of data that can be transmitted
sequentially. If the local processing cost is ignored, then the above formula is simplified into
Response time = C0 + C1 * seq_X
In this case, any processing or communication that is done in parallel is ignored. The
following example illustrates the difference between total cost measure and response time
measure.
Example 11.12.
Let us consider that the amount of data to be transmitted from site1 to site2 is P, and the
amount of data to be transmitted from site3 to site2 is Q for the execution of a query as
shown in the figure 11.13.
Figure 11.13. Data Transmission for a Query
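Using the cost formulas above, and assuming that the two transfers can proceed in parallel
while local processing costs are ignored, the two measures for this example can be computed as
follows.
Total cost = 2 * C0 + C1 * (P + Q), since both amounts of data must be transmitted.
Response time = max{C0 + C1 * P, C0 + C1 * Q}, since the transfers from site1 and from site3
to site2 can take place simultaneously.
Thus, minimizing the response time encourages parallel data transfers, whereas minimizing the
total cost simply sums every transmission, however it is scheduled.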
Database statistics
The cost of an execution strategy depends on the size of the intermediate result relations that
are produced during the execution of a query. In distributed query optimization, it is very
important to estimate the size of the intermediate results of relational algebraic operations to
minimize the data transmission cost, because the intermediate result relations are transmitted
over the network. This estimation is done based on the database statistics of base relations
stored in the system catalog and the formulas used to predict the cardinalities of the results of
the relational algebraic operations. Typically, it is expected that a distributed DBMS will
hold the following information in its system catalog to predict the size of intermediate result
relations: for each relation and for each of its fragments, the cardinality and the size in bytes;
and for each attribute, its length, the number of its distinct values in each fragment and the
cardinality of its domain.
In addition to the above information, sometimes the system catalog also holds the join
selectivity factor for some pairs of relations. The join selectivity factor of two
relations R and S is a real value between 0 and 1, and can be defined as follows.
SFJ(R, S) = card(R ⋈ S) / (card(R) * card(S))
1. Selection Operation
The cardinality of a selection operation can be estimated as
card(σF(R)) = SFS(F) * card(R)
where SFS(F) is the selectivity factor of the selection formula F and is dependent on the
selection formula. The SFS(F)s can be calculated as follows.
SFS(A = value) = 1 / card(ПA(R))
SFS(A > value) = (max(A) − value) / (max(A) − min(A))
SFS(A < value) = (value − min(A)) / (max(A) − min(A))
SFS(p(Ai) ∧ p(Aj)) = SFS(p(Ai)) * SFS(p(Aj))
SFS(p(Ai) ∨ p(Aj)) = SFS(p(Ai)) + SFS(p(Aj)) − SFS(p(Ai)) * SFS(p(Aj))
In this case, Ai and Aj are two different attributes, and p(Ai) and p(Aj) denote the
selection predicates on them.
2. Projection Operation
The cardinality of a projection operation is difficult to estimate in general. In the common
case in which the set of projected attributes A contains a key of the relation R,
card(ПA(R)) = card(R).
3. Cartesian Product Operation
The cardinality of the Cartesian product of two relations is simply
card(R × S) = card(R) * card(S).
4. Join Operation
There is no general way to evaluate the cardinality of the join operation without
additional information about the join operation. Typically, the upper bound of the
cardinality of the join operation is the cardinality of the Cartesian product. However,
there is a common case in which the evaluation is simple. If the relation R with its
attribute A is joined with the relation S with its attribute B via an equijoin operation,
where A is a key of the relation R, and B is a foreign key of relation S, then the
cardinality of the join operation can be evaluated as follows.
card(R ⋈A=B S) = card(S)
5. Semijoin Operation
The selectivity factor of the semijoin operation R ⋉A S can be approximated as
SFSJ(R ⋉A S) = card(ПA(S)) / card(dom[A])
The above formula depends on only the attribute A of the relation S. Thus, it is often
called the selectivity factor of attribute A of S and is denoted by SFSJ (S.A). Now, the
cardinality of the semijoin operation can be expressed by the following formula.
card(R ⋉A S) = SFSJ(S.A) * card(R)
This formula is applicable for the common case where the attribute R.A is a foreign
key of the relation S. In this case, the semijoin selectivity factor is 1, because
card(ПA(S)) = card(dom[A]).
6. Union Operation
It is very difficult to evaluate the cardinality of the union operation. The simple formulas for
the upper bound and the lower bound of the cardinality of the union operation, denoted by
card(R ∪ S), are card(R) + card(S) and max{card(R), card(S)} respectively.
Like union operation, the formulas for evaluating the upper bound and lower bound
cardinalities of set difference operation, denoted by card(R − S), are card(R) and 0
respectively.
INGRES ALGORITHM
INGRES uses a dynamic query optimization strategy that partitions the high-level query into
smaller queries recursively. In this approach, a query is first decomposed into a sequence of
queries having a unique relation in common. Each of these single-relation queries is then
processed by a one-variable query processor (OVQP). The OVQP optimizes the access to a
single relation by selecting, based on the selection predicates, the best access method to that
relation. For instance, if there is a selection predicate in the query of the form <A = value>,
an index available on attribute A will be used. This algorithm first executes the unary
operations and tries to minimize the size of the intermediate results before performing binary
operations.
The detachment technique replaces a query q whose qualification has the form P1(R1.A1′) ∧
P2(R1.A1, R2.A2, . . ., Rn.An) by a one-relation query q′ that applies the predicate P1 to the
relation R1 and stores the result in a temporary relation, followed by a query q″ that is identical
to q except that R1 is replaced by that smaller temporary relation. Here, Ai and Ai′ represent
lists of attributes of relation Ri, P1 is a selection predicate involving attributes of the relation
R1, P2 is a selection predicate involving attributes of the relations R1, R2, . . ., Rn, and
V1, . . ., Vn represent the new names of the relations R1, . . ., Rn after decomposition. This step
is necessary to reduce the size of relations before performing binary operations. The
detachment technique extracts the selection operations, which are usually the most selective
ones.
Multi-relation queries that cannot be further detached are called irreducible. A query is said
to be irreducible if and only if the corresponding query graph is a chain with two nodes.
Irreducible queries are converted into single relation queries by tuple substitution. In tuple
substitution, for a given n-relation query, the tuples of one relation are substituted by their
values, thereby generating a set of (n − 1)-relation queries. Tuple substitution can be
implemented in the following way. Assume that the relation R is chosen for tuple
substitution in an n-relation query q1. For each tuple in R, the attributes referred to in q1 are
replaced by their actual values in tuples thereby producing a query q1′ with n − 1 relations.
Therefore, the total number of queries produced by tuple substitution is card(R). An example
of the INGRES algorithm is illustrated in the following.
Example 11.13.
Let us consider the following SQL query that involves three different
relations Student(sreg-no, sname, street, city, course-id), Course(course-id, cname,
duration, fees), and Teacher(T-id, name, designation, salary, course-id).
Using detachment technique, the above query can be replaced by the following queries, q1
and q2, where Course1 is an intermediate relation.
q2: Select sname, name from Student, Course1, Teacher where Student.course-id =
Course1.course-id and Course1.course-id = Teacher.course-id.
Similarly, the successive detachment of q2 may generate the queries q21 and q22 as follows.
q21: Select name, Teacher.course-id into Teacher1 from Teacher, Course1 where
Teacher.course-id = Course1.course-id.
Assume that in query q22, three tuples are selected, whose course-id values are C01, C03 and
C06. The tuple substitution of the Teacher1 relation produces three one-relation
subqueries which are listed below.
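To make the decomposition concrete, the remaining queries can be sketched as follows; the
selection predicate duration = 4 in the original query is an assumption used only for
illustration, while q22 follows directly from q2 and q21.
q: Select sname, name from Student, Course, Teacher where duration = 4 and Student.course-id
= Course.course-id and Course.course-id = Teacher.course-id.
q1: Select Course.course-id into Course1 from Course where duration = 4.
q22: Select sname, name from Student, Teacher1 where Student.course-id = Teacher1.course-id.
The tuple substitution of Teacher1 then produces, for each of its three tuples, a one-relation
subquery of the form
Select sname from Student where Student.course-id = 'C01';
(and similarly for 'C03' and 'C06'), in which the teacher name of the substituted tuple is
attached to the result as a constant.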
Simple Join Strategy
Let us consider a simple query that involves the joining of two relations R and S that are
stored at different sites. In performing R ⋈ S, the smaller relation should be transferred to the
site of the larger relation. Therefore, it is necessary to calculate the size of the
relations R and S. If the query involves one more relation T, then the obvious choice is to
transfer the intermediate result R ⋈ S or the relation T, whichever is smaller.
The difficulty with the simple join strategy is that the join operation may reduce or increase
the size of the intermediate results. Hence, it is necessary to estimate the size of the results of
join operations. One solution is to estimate the communication costs of all alternative
strategies and select the best one for which the communication cost is the minimum.
However, the number of alternative strategies will increase as the number of relations
increases.
Semijoin Strategy
The main drawback of simple join strategy is that the entire operand relation has to be
transmitted between the sites. The objective of the semijoin strategy is to minimize the
communication cost by replacing join operations of a query with semijoin operations. The
join operation between two relations R and S over the attribute A, which are stored at
different sites of the distributed system, can be replaced by semijoin operations as follows.
• R ⋈A S ⇔ (R ⋉A S) ⋈A S ⇔ R ⋈A (S ⋉A R) ⇔ (R ⋉A S) ⋈A (S ⋉A R)
It is necessary to estimate the cost of the above semijoin operations to understand the
benefits of semijoin operations over a join operation. The local processing costs are not
considered here for simplicity. The join operation R ⋈A S and the semijoin operation (R ⋉A S)
⋈A S can be implemented in the following way, assuming that the relations R and S are stored
at site1 and site2 respectively, and size(R) < size(S).
R ⋈A S:
1. The relation R is transferred from site1 to site2.
2. The join operation R ⋈A S is computed at site2.
(R ⋉A S) ⋈A S:
1. The projection operation ПA(S) is performed at site2 and result is sent to site1.
2. The site1 computes R ⋉A S, and the result is, say, T.
3. The result T is transferred to site2.
4. The join operation T ⋈A S is computed at site2.
The communication cost for the above join operation is C0 + C1 * size(R), whereas the
communication cost for the semijoin operation is 2C0 + C1*(size(ПA(S)) + size(R ⋉A S)).
Thus, the second operation is beneficial if size(ПA(S)) + size(R ⋉A S) is less than size(R),
because C0 is negligible compared to the cost of data transfer.
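For instance, with illustrative figures (assumed here, and expressed in the same units of data as
the cost formulas), if size(R) = 10,000, size(ПA(S)) = 1,000 and size(R ⋉A S) = 2,000, the
semijoin-based strategy transfers only 1,000 + 2,000 = 3,000 units of data instead of 10,000, and
is therefore beneficial. If, however, almost every tuple of R participates in the join, size(R ⋉A S)
approaches size(R) and the extra transfer of ПA(S) makes the simple join strategy cheaper.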
In general, the semijoin operation is useful to reduce the size of operand relations involved in
multiple join queries. The optimal semijoin program is called a full reducer, which reduces
a relation more than others [Chiu and Ho, 1980]. The determination of full reducers is a
difficult task. One solution is to evaluate the size of reduction for all possible semijoin
strategies and select the best one. Full reducers cannot be found in the group of queries that
have cycles in their join graph, known as cyclic queries. For other groups of queries,
called tree queries, full reducers exist, but the number of alternative semijoin strategies
increases as the number of relations increases, which complicates the above solution.
The semijoin strategy is beneficial if only a small number of tuples participate in the join
operation, whereas the simple join strategy is beneficial if most of the tuples participate in
the join operation, because semijoin strategy involves an additional data transfer cost.
Example 11.14.
Let us consider the join operation Student ⋈ Course ⋈ Teacher, which involves three different
relations Student, Course and Teacher joined over the attribute course-id.
Further assume that the relations Student, Course and Teacher are stored at site 1, site 2 and
site 3 respectively. The corresponding join graph is depicted in figure 11.14.
There are various ways to perform the above join operation, but to select the best one, some
more information must be known: size(Student), size(Teacher), size(Course),
size(Student ⋈ Course) and size(Course ⋈ Teacher). Moreover, all data transfer can be done
in parallel.
If the above join operation is replaced by semijoin operations, then the number of operations
will increase, but the operations will possibly be performed on smaller operands. In this case, instead of sending the entire
Student relation to site2, only the values for the join attribute course-id of Student relation
will be sent to site2. If the length of the join attribute is significantly less than the length of
the entire tuple, then the semijoin has a good selectivity in this case, and it reduces the
communication cost significantly. This is also applicable for performing join operation
between the relations Course and Teacher. However, the semijoin operation may increase the
local processing time, because one of the two operand relations must be accessed twice.
Distributed INGRES Algorithm
For a given query, all monorelation queries (unary operations such as selection and
projection) can be detached first and these are processed locally. Then the reduction
algorithm [Wong and Youssefi, 1976] is applied to the original query, and it produces two
different kinds of subqueries: irreducible subqueries and monorelation subqueries. Reduction
is a technique that separates all irreducible subqueries and monorelation subqueries from the
original query by detachment technique. Since monorelation subqueries are already
processed locally, no action is taken with respect to them. Assume that the reduction
technique produces a sequence of irreducible subqueries q1 → q2 → . . . → qn, where at
least one relation is common between two consecutive subqueries. It is mentioned in [Wong
and Youssefi, 1976] that such a sequence is unique.
Based on the sequence of irreducible subqueries and the size of each fragment, one subquery
is chosen from the list, say qi, that has at least two variables. For processing the subquery qi,
initially the best strategy is determined for the subquery. This strategy is expressed by a list
of pairs (F, S), where F denotes a fragment that is to be transferred to the processing site S.
After transferring all fragments to the corresponding processing sites, finally the subquery
qi is executed. This procedure is repeated for all irreducible subqueries in the sequence q1 →
q2 → . . .→ qn, and the algorithm terminates. This algorithm is represented in the following:
Begin
1. Detach all monorelation queries from the given query q and execute them by OVQPs (one-variable
query processors) at local sites in the same way as in the centralized INGRES algorithm.
2. Apply reduction technique to q, which will produce a list of monorelation subqueries and a
sequence of irreducible subqueries q1 → q2 →. . .→ qn.
3. Ignore all monorelation subqueries (since they are already processed by OVQPs at local sites).
4. for I = 1 to n, repeat Step5 to Step8 [n is the total number of irreducible subqueries].
5. Choose the irreducible subquery qi involving the smallest fragments.
6. Determine the best strategy, pairs of (F, S), for qi. [F represents a fragment and S represents the
processing site.]
7. For each pair (F, S), transfer the fragment F to the corresponding processing site S.
8. Execute the query qi. Check for the termination of for loop.
9. Output the result.
End.
The query optimization is basically done in Step5 and Step6 of the distributed INGRES
algorithm. In this algorithm, subqueries are produced depending on several components and
their dependency order. Since the relation involved in a subquery may be fragmented and
stored at different sites, the subquery cannot be further subdivided. The main difficulty in
step6 is to determine how to execute the subquery by selecting the fragments that will be
transferred to sites where the processing will take place. For an n-relation subquery,
fragments from n − 1 relations must be transferred to the site(s) of fragments of the
remaining relation, and then replicated there. Further, the non-transferred relations may be
divided into several equal fragments to increase parallelism. This approach is
called fragment-and-replicate, which performs a substitution of fragments rather than of
tuples as in centralized INGRES algorithm. The selection of non-transferred relations and the
number of processing sites depends on the objective function and the communication
scheme. The choice of the number of processing sites is a trade-off between the total time
measure and the response time measure. If the number of sites increases, the response time
decreases by parallelism, but the total time increases which leads to higher communication
cost. The cost of producing the result is ignored here.
The distributed INGRES algorithm is characterized by a limited search of the solution space,
in which an optimization decision is taken at each step without considering the consequences of
that decision on global optimization. The exhaustive search approach, in which all possible
strategies are evaluated to determine the best one, is an alternative to the limited search
approach. However, dynamic optimization strategy is beneficial, because the exact sizes of
the intermediate result relations are known.
Distributed R* Algorithm
The objective of distributed R* algorithm [Selinger and Adiba, 1980] is to reduce the total
cost measure which includes the local processing costs and communication costs. Distributed
R* algorithm uses an exhaustive search approach to select the best strategy. Although
exhaustive search incurs an overhead, it can be worthwhile, if the query is executed
frequently. The implemented version of distributed R* algorithm does not support
fragmentation or replication; thus, it involves relations as its basic units. This algorithm
chooses one site as the master site where the query is initiated. The query optimizer at the
master site can take all inter-site decisions, such as the selection of the execution sites, the
fragments which will be used, and the method for transferring data. The other participating
sites that store the relations involved in the query, called apprentice sites, can make the
remaining local decisions and generate local access plans for the query execution.
Distributed R* algorithm is implemented in the following way.
Distributed R* Algorithm
Begin
1. For each base relation Ri in the query tree, repeat Step2 and Step3.
2. Find each access path of Ri and determine the cost of each access path.
3. Determine the access path with the minimum cost.
4. For I = 1 to n, repeat Step5.
5. For each join order (Ri1, Ri2, . . ., Rin) of the relations, build the strategy (((APi1 ⋈ Ri2) ⋈
Ri3) ⋈ . . . ⋈ Rin) and compute the cost of the strategy. [APi1 denotes the best access path of Ri1.]
6. Select the strategy with minimum cost.
7. For J = 1 to M, repeat step8. [M is the total number of sites in the distributed system storing a
relation involved in the query]
8. Determine the local optimization strategy at site J and send.
End
In this algorithm, based on database statistics and formulas used to estimate the size of
intermediate results, and access path information, the query optimizer determines the join
ordering, the join algorithms and the access path for each fragment. In addition, it also
determines the sites of join results and the method of data transfer between sites. For
performing the join between two relations, there may be three candidate sites. These are the
site of the first relation, the site of the second relation, and the site of the result relation. Two
inter-site data transfer techniques are supported by distributed R* algorithm: ship-
whole and fetch-as-needed. In ship-whole technique, the entire relation is moved to the join
site and stored in a temporary relation before performing the join operation. If the join
algorithm is merge join, the incoming tuples are processed in a pipeline mode as they arrive
at the join site. In this case, the relation need not be stored in the temporary relation. The
fetch-as-needed technique is similar to a semijoin operation of the internal relation with the
external relation. In this method, the external relation is sequentially scanned, and for each tuple
the join value is sent to the site of the internal relation. The internal tuples that match with
the value of the external tuple are selected there, and then the selected tuples are sent to the
site of the external relation.
Ship-whole technique requires larger data transfer but fewer message transfers than fetch-as-
needed technique. Obviously, ship-whole technique is beneficial for smaller relations. On the
other hand, the fetch-as-needed technique is beneficial if the relation is larger and the join
operation has good selectivity. For performing the join operation between an external relation
and an internal relation over a join attribute, there are four possible join strategies. These join
strategies and their corresponding costs are calculated in the following, where LC
denotes the local processing cost, which involves CPU cost and I/O cost, CC represents the
communication cost, and A denotes the average number of tuples of the internal relation S that
match the value of one tuple of the external relation R.
Strategy 1: Ship the entire external relation to the site of the internal relation.
In this case, the tuples of external relation R can be joined with the tuples of internal
relation S, as they arrive. Therefore,
Total cost = LC(retrieving card(R) tuples from R) + CC(size(R)) + card(R) * LC(retrieving A
matching tuples from S)
Strategy 2: Ship the entire internal relation to the site of the external relation.
In this case, the tuples of internal relation S cannot be joined with the tuples of external
relation R, as they arrive. The entire internal relation S has to be stored in a temporary
relation. Hence,
Total cost = LC(retrieving card(S) tuples from S) + CC(size(S)) + LC(storing card(S) tuples in
T) + LC(retrieving card(R) tuples from R) + card(R) * LC(retrieving A matching tuples from T)
Strategy 3: Fetch tuples of the internal relation as needed for each tuple of the external
relation.
In this case, for each tuple in R, the join attribute value is sent to the site of the internal
relation S. Then the A tuples of S that match this value are retrieved and sent to the site of R
to be joined as they arrive. Therefore,
Total cost = LC(retrieving card(R) tuples from R) + card(R) * CC(length of one join attribute
value) + card(R) * LC(retrieving A matching tuples from S) + card(R) * CC(A * length of one
tuple of S)
Strategy 4: Move both relations to a third site and compute the join there.
In this case, the internal relation is first moved to the third site and stored in a temporary
relation. Then, the external relation is moved to the third site and the join is performed as the
tuples arrive. Hence,
Total cost = LC(retrieving card(S) tuples from S) + CC(size(S)) + LC(storing card(S) tuples in
T) + LC(retrieving card(R) tuples from R) + CC(size(R)) + card(R) * LC(retrieving A matching
tuples from T)
The cost of producing the final result is not considered here. In the case of distributed R*
algorithm, the complexity increases as the number of relations increases, as it uses the
exhaustive search approach based on several factors such as join order, join algorithm,
access path, data transfer mode and result site.
SDD-1 Algorithm
SDD-1 query optimization algorithm [Bernstein et al., 1981] is derived from the first
distributed query processing algorithm “hill-climbing”. In hill-climbing algorithm, initially a
feasible query optimization strategy is determined, and this strategy is refined recursively
until no more cost improvements are possible. The objective of this algorithm is to minimize
an arbitrary function which includes the total time measure as well as response time measure.
This algorithm does not support fragmentation and replication, and does not use semijoin
operation. The “hill-climbing” algorithm works in the following way, where the inputs to the
algorithm are the query graph, location of relations involved in the query, and relation
statistics.
First, this algorithm selects an initial feasible solution, which is a global execution schedule
that includes all inter-site communication. It is obtained by computing the cost of all the
execution strategies that transfer all the required relations to a single candidate result site and
selecting the minimum-cost strategy. Assume that this initial solution is S. The query
optimizer splits S into two strategies, S1 followed by S2, where S1 consists of sending one of
the relations involved in the join operation to the site of the other relation. There the two
relations are joined locally, and the resulting relation is sent to the chosen result site. If the
sum of the costs of the execution strategies S1 and S2 and the cost of local join processing is
less than the cost of S, then S is replaced by the schedule of S1 and S2. This process is
continued recursively until no more beneficial strategy is obtained. It is to be noted that if
an n-way join operation is involved in the given query, then S will be divided
into n subschedules instead of two.
The main disadvantage of this algorithm is that it involves higher initial cost, which may not
be justified by the strategies produced. Moreover, the algorithm gets stuck at a local
minimum-cost solution and fails to achieve the global minimum-cost solution.
In SDD-1 algorithm, lots of modifications to the hill-climbing algorithm have been done. In
SDD-1 algorithm, semijoin operation is introduced to improve join operations. The objective
of SDD-1 algorithm is to minimize the total communication time and the local processing
time, and response time is ignored here. This algorithm uses the database statistics in the
form of database profiles, where each database profile is associated with a relation. SDD-1
algorithm can be implemented in the following way.
In SDD-1 algorithm also, one initial solution is selected, and it is refined recursively. One
post-optimization phase is added here to improve the total time of the selected solution. The
main step of this algorithm consists of determining and ordering semijoin operations to
minimize cost. There are four phases in SDD-1 algorithm, known as initialization, selection
of beneficial semijoin operations, result site selection and post-optimization. In the
initialization phase, an execution strategy is selected that includes only local processing. In
this phase, a set of beneficial semijoins BS are also produced for use in the next phase. The
second phase selects the beneficial semijoin operations from BS recursively and modifies the
database statistics and BS accordingly. This step is terminated when all semijoin operations
in BS are appended to the execution strategy. The execution order of semijoin operations is
determined by the order in which the semijoins are appended to the execution strategy. In the
third phase, the result site is decided based on the cost of data transfer to each candidate site.
Finally, in the post-optimization phase, the semijoin operations are removed from the
execution strategy, which then involves only the relations stored at the result site. This is
necessary because the result site is chosen after all semijoin operations have been ordered.
Chapter Summary
• Query processing involves the retrieval of data from the database, in the contexts of
both centralized and distributed DBMSs. A query processor is a software module that
performs processing of queries. Distributed query processing involves four phases:
query decomposition, query fragmentation, global query optimization and local query
optimization.
• The objective of query decomposition phase is to transform a query in a high-level
language on global relations into a relational algebraic query on global relations. The
four successive steps of query decomposition are normalization, analysis,
simplification and query restructuring.
• In the normalization step, the query is converted into a normalized form to facilitate
further processing in an easier way. The two possible normal forms are conjunctive
normal form and disjunctive normal form.
• The objective of the analysis step is to reject normalized queries that are incorrectly
formulated or contradictory.
• In the simplification step, all redundant predicates in the query are detected and
common subexpressions are eliminated to transform the query into a simpler and
efficiently computable form.
• In the query restructuring step, the query in the high-level language is rewritten into an
equivalent relational algebraic form.
• In query fragmentation phase a relational algebraic query on global relations is
converted into an algebraic query expressed on physical fragments, called fragment
query, considering data distribution in distributed databases.
• A query typically has many possible execution strategies, and the process of choosing
a suitable one for processing a query is known as query optimization. Both centralized
query optimization and distributed query optimization are very important in the
context of distributed query processing.
• The optimal query execution strategy is selected by a software module, known as
query optimizer, which can be represented by three components: search space, cost
model and optimization strategy.
• The search space is defined as the set of equivalent query trees for a given query,
which can be generated by using transformation rules.
• There are three different kinds of optimization strategies, known as static optimization
strategy, dynamic optimization strategy and randomized optimization strategy.
• In a distributed system, the cost of processing a query can be expressed in terms of the
total cost measure or the response time measure.
Chapter 12. Distributed Database Security and
Catalog Management
This chapter focuses on distributed database security. Database security is an integral part of
any database system, and it refers to the preventive measures to ensure data consistency. In
this chapter, view management in distributed DBMSs is discussed in detail. The chapter also
introduces authorization control and data protection both in centralized and in distributed
contexts. In a DBMS, all security constraints and semantic integrity constraints are stored in
the system catalog. In this context, catalog management in distributed DBMSs is also
presented in this chapter.
The outline of this chapter is as follows. Section 12.1 introduces the concept of database
security. View management in distributed database context is described in Section 12.2.
In Section 12.3, authorization control and data protection are discussed, and Section
12.4 presents semantic integrity constraints. Catalog management in distributed database
systems is discussed in Section 12.5.
The rules that are used to control data manipulation are a part of database administration,
and generally they are defined by the database administrator (DBA). The cost of enforcing
semantic integrity constraints in a centralized DBMS is very high in terms of resource
utilization, and it can be prohibitive in a distributed environment. The semantic integrity
constraints are stored in the system catalog. In a distributed DBMS, the global system
catalog contains data distribution details in addition to all the information that is stored in a
centralized system catalog. Maintaining the system catalog in a distributed database environment
is a very complicated task.
View Management
A view can be defined as a virtual relation or table that is derived from one or more base
relations. A view does not necessarily exist in the database as a stored set of data values;
only view definitions are stored in the database. When a DBMS encounters a reference to a
view, it can be resolved in two different ways. One approach is to look up the definition of
the view and translate the definition into an equivalent request against the source tables of
the view, and then perform that request. This process is known as view resolution. Another
alternative approach stores the view as a temporary table in the database and maintains the
changes of the view as the underlying base tables are updated. This process is called view
materialization.
In relational data model, a view is derived as the result of a relational query on one or more
base relations. It is defined by associating the name of the view with the relational query that
specifies it. Views provide several advantages, which are listed in the following:
• Improved Security–. The major advantage of a view is that it provides data security.
It is possible to grant each user the privileges to access the database only through a
small set of views that contain the appropriate data for the user, thereby restricting and
controlling each user’s access to the database.
• Reduced Complexity–. A view simplifies queries by deriving data from more than
one table into a single table and thus converts multi-table queries into single-table
queries.
• Data sharing–. The same underlying base tables can be shared by different users in
different ways through views. Therefore, it provides the facility for data sharing and
for customizing the appearance of the database.
• Convenience–. Using views, only the portion of the data in which the users are
interested is presented to them. Thus, views reduce the complexity from the user’s point
of view and provide greater convenience.
In some cases, a view defined by a complex, multi-table query may take a long time to
process, because view resolution must join the tables together every time the view is
accessed.
Example 12.1.
Let us consider the relational schema Employee which is defined in Chapter 1, example 1.1.
Using the SQL query “Retrieve the name, designation and department number of all
employees whose designation is Manager from Employee relation”, a view with the name
V1 can be created as follows:
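Based on the Employee (empid, ename, salary, designation, deptno) schema used in example
11.6, the view definition would take the following form; the exact attribute names are assumed
here.
• Create view V1 as select ename, designation, deptno from Employee where designation =
'Manager';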
The execution of the above SQL statement will store a view definition into the system
catalog with the name V1. The result generated against the user request “Select * from V1”
is shown in figure 12.1.
Table 12.1. Result of Query Involving View V1
V1
ename        designation    deptno
J. Lee       Manager        10
D. Davis     Manager        30
A. Sasmal    Manager        20
Example 12.2.
Let us consider the query “Retrieve the name, designation and department name of all
employees whose designation is Manager”. The query involves the base relation Department
and the view V1. Using SQL, the query can be expressed as follows:
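Using the Department (deptno, dname, location) relation introduced earlier in this chapter and
the view V1, one possible form of the query is given below; the attribute names are assumed.
• Select ename, designation, dname from V1, Department where V1.deptno =
Department.deptno;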
Assume that a view V2 is created based on the above SQL query. Now any request against
the view V2 will be converted into a data request against the base relations upon which the
view V2 is created. Thus, the above query will be modified as listed below, and it will
produce the output shown in figure 12.2.
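After view resolution, the request against V2 reduces to a query on the base relations of the
following form (a sketch consistent with the definitions assumed above).
• Select ename, designation, dname from Employee, Department where designation =
'Manager' and Employee.deptno = Department.deptno;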
Figure 12.2. Result of Query Involving View V2 (V2)
View Updatability
All updates made to a base relation must be reflected in all views that encompass that base
relation. Similarly, it is expected that if a view is updated, then the changes will be reflected
in the underlying base relations, but all views cannot be manipulated in such a way. Views
are classified into two categories, known as updatable views and non-updatable views. The
updatability of a view depends on the query expression based on which the view is created.
A view is updatable only if the updates to the view can be propagated correctly to the base
relations without ambiguity. A view is generally non-updatable if the query based on which
the view is generated has any of the following properties: it involves a join that does not
preserve the keys of the underlying base relations, it uses aggregate functions or the DISTINCT
or GROUP BY constructs, or it does not include the primary key of the underlying base relation.
In addition, no tuple that is added through a view must violate the integrity constraints of the
base table.
Example 12.3.
The view V1 in example 12.1 is updatable, whereas the view V2 in example 12.2 is non-
updatable because the query expression for creating view V2 involves one base relation and
one view. The insertion of a new tuple <D. Jones, Manager, 40> into V1 can be propagated
to the base relation Employee without any ambiguity. The following view is a non-updatable
view.
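One possible form of such a view, assumed here only for illustration, is a view defined on top
of the non-updatable view V2.
• Create view V3 as select ename, dname from V2;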
The above view is non-updatable because it is derived from another view V2 which is non-
updatable. It must be noted that views derived by join are updatable if they include the keys
of the base relations.
The difficulty with view materialization is in maintaining the changes of the view while the
base tables are being updated. The process of updating a snapshot in response to changes to
the underlying data is called view maintenance. The objective of view maintenance is to
apply only those changes that are necessary to keep the view current. However, this can be
done when the system is idle.
Authorization Control
The granting of rights or privileges that enable users to have legitimate access to a system or
a system’s objects is called authorization. Authorization ensures that only authorized
users access the data. An authorization control mechanism must have the ability to identify
authorized users and thereby to restrict unauthorized access to data. Authorization control
has long been provided by operating systems, and more recently by distributed operating
systems, as a service of the file system. Generally, a centralized approach is used
for authorization control. In this approach, the centralized control creates database objects
and provides permission to other users to access these objects. Database objects are
identified by their external names. Another aspect of authorization is that different users can
have different privileges on the same database objects in a database environment.
A database is a collection of database objects. In relational data model, a database object can
be defined by its type, which is expressed as (view, relation, tuple, attribute), as well as by its
content using selection predicates. A right or privilege represents a relationship between a
user and a database object for a particular set of operations. In SQL, an operation is defined
by a high-level statement such as INSERT, DELETE, UPDATE, ALTER, SELECT,
GRANT, REFERENCES or ALL and privileges are defined using GRANT and REVOKE
statements. The keyword public is used to mean all users in the system. In centralized
authorization control, the DBA has all privileges on all database objects, and he/she is
allowed to grant (or take away) permissions to (from) other users.
In decentralized authorization control, the creator of a database object is the owner of that
object. The owner has the right to grant permission to other users to perform certain
operations on the database object. In this case, the authorization control is distributed among
the owners of database objects. If the owner grants the GRANT permission on a database
object to some other user, then that specified user can subsequently grant permissions to
other users on this specified database object. The revoking process must be recursive, and to
perform revoking the system must maintain a hierarchy of grants per database object where
the owner of the database object is the root.
The privileges on database objects are stored in the system catalog as authorization rules.
There are various ways to store authorization rules in the system catalog; authorization
matrix is one of them. In the authorization matrix, each row represents a user (or a subject),
each column represents a database object and each cell, which is a pair (user, database
object), indicates the authorized operations for a particular user on a particular database
object. The authorized operations are specified by their operation type, and in the case of
SELECT operation type, sometimes the selection predicates are also mentioned to further
restrict the access to database objects. The authorization matrix can be stored in three
different ways: by row, by column and by element. When the authorization matrix is stored
by row, each user is associated with the list of objects that can be accessed with the
corresponding access rights. Similarly, when the authorization matrix is stored by column,
each object is associated with the list of users who can access it with the corresponding
access rights. In both of the above two approaches, authorizations can be enforced
efficiently, but the manipulation of access rights per database object is not efficient as all
user profiles must be accessed. This disadvantage can be overcome if the authorization
matrix is stored by element, that is, by a relation (user, database object, right).
This approach provides faster access in right manipulation per user per database object. A
sample authorization matrix is shown in figure 12.3.
Example 12.4.
The following SQL statement grants SELECT, UPDATE and DELETE permissions on the
database object Employee to all users.
• Grant select, update, delete on Employee to public;
Similarly, the SQL statement
• Grant all on Employee to user2;
grants all permissions on the database object Employee to the user user2, and the statement
• Grant select on Employee to user1 with grant option;
allows user1 to grant the SELECT permission on the Employee database object to other users.
The following SQL statement takes away SELECT and INSERT rights on the Employee
database object from the user user4.
• Revoke select, insert on Employee from user4;
For remote user authentication in a distributed DBMS, two alternative approaches can be
adopted, as listed in the following:
• In the first approach, the information that is required for authenticating users (i.e.,
username and password) is replicated and stored at all sites in the system catalog.
• In the second case, each user in the distributed system is identified by its home site.
Whenever a user wants to login from a remote site a message is sent to its home site
for authentication, and then the user is identified by the home site.
The first approach is costly in terms of catalog management. The second approach is more
reasonable, because it restricts each user to identifying him(her)self at the home site.
However, site autonomy is not preserved in the second approach.
Authorization rules are used to restrict the actions performed by users on database objects
after authentication. In a distributed DBMS, the authorization rules form part of the system
catalog, and their allocation can be handled in two different ways, namely, full replication
and partial replication, as listed in the following:
• In full replication, authorization rules are replicated and stored at all sites of the
distributed system.
• In the case of partial replication, the authorization rules are replicated at the sites
where the referenced database objects are distributed.
The full replication approach allows authorization to be checked at compile time, but it is
costly in terms of data replication. The latter approach is
better if localization of reference is high, but it does not allow distributed authorization to be
controlled at compile time.
Views can be considered as composite database objects; thus, granting access to a view
translates into granting access to the underlying objects. The authorization granted on a view
depends on the access rights of the view creator on the underlying objects. If the view
definitions and authorization rules are fully replicated, then this translation becomes simpler,
and it can be done locally. If the view definitions and their underlying objects are distributed,
then the translation is difficult. In this case, the association information can be stored at the
site of each underlying object.
To simplify the authorization control and to reduce the amount of data stored, individual
users are typically divided into different classes known as groups, which are all granted the
same privileges. As in a centralized DBMS, in a distributed DBMS the group of all users can be
referred to as public, and all users at a particular site may be referred to as public@sitei. The
management of groups in distributed systems is very difficult, because the users belonging to
the same group may be distributed at different sites, and the access to a database object can
be granted to several groups, where the groups can also be distributed. There are several
alternative solutions as listed below.
The last two solutions decrease the degree of autonomy. It is obvious that full replication of
authorization information simplifies authorization control, as it can be done at compile time.
However, the overhead cost for maintaining replicas is significant, if there are many sites in
the distributed system.
In a distributed DBMS, the enforcement of semantic integrity assertions depends on the
category of the assertion, as described in the following.
• Enforcement of individual assertions–. There are two methods for the enforcement
of individual assertions in a distributed DBMS. If the update is an insert operation, all
individual assertions can be enforced at the site where the update is issued. If the
update is a delete or a modify operation, it is sent to the sites where the relation resides
and update should be performed there. The query processor performs update operation
for each fragment. The resulting tuples at each site are combined into one temporary
relation in the case of a delete statement, whereas in the case of a modify statement,
the resulting tuples are combined into two temporary relations. Each site involved in
the distributed update enforces the assertions relevant to that site.
• Enforcement of set-oriented assertions–. Two cases are considered here. In the case
of multiple-variable single-relation set-oriented assertions, the update (may be insert,
delete or modify) is sent to the sites where the relation resides, and they return one or
two temporary relations after performing the update, as in case of individual
assertions. These temporary relations are then sent to all sites storing the specified
relation, and each site enforces the assertions locally. Each site storing the specified
relation then sends a message to the site where the update is issued indicating whether
the assertions are satisfied or not. If any assertion is not satisfied at any site, then it is
the responsibility of the semantic integrity control subsystem to reject the entire
update program. In the case of multiple-variable multiple-relation set-oriented
assertions, the enforcement of assertions is done at the site where the update is issued.
Hence, after performing update at all fragments of the involved relations, all results
are centralized at the site where the update is issued, called query master site. The
query master enforces all assertions on the centralized result, and if any inconsistency
is found the update is rejected. On the other hand, if no inconsistencies are found, the
resultant tuples are sent to the corresponding sites where the fragments of the relation
reside.
• Enforcement of assertions involving aggregates–. The testing of these assertions is
the costliest, because it requires the calculation of aggregate functions. To efficiently
enforce these assertions, it is possible to produce compiled assertions that isolate
redundant data. These redundant data, known as concrete views, can then be stored at
each site where the associated relation resides [Bernstein and Blaustein, 1982].
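The idea of a concrete view can be sketched in a few lines of Python. In this hypothetical example, the assertion is that the total salary of each department must not exceed a fixed cap; the running per-department totals are the redundant data kept alongside the stored fragment so the aggregate need not be recomputed on every update. The cap, relation and class names are assumptions made for illustration.

SALARY_CAP_PER_DEPT = 1_000_000

class SalaryTotals:
    """Concrete view: per-department salary totals maintained incrementally."""
    def __init__(self):
        self.total = {}                      # deptno -> current sum of salaries

    def on_insert(self, deptno, salary):
        new_total = self.total.get(deptno, 0) + salary
        if new_total > SALARY_CAP_PER_DEPT:
            raise ValueError("assertion violated: departmental salary cap exceeded")
        self.total[deptno] = new_total

    def on_delete(self, deptno, salary):
        self.total[deptno] = self.total.get(deptno, 0) - salary

view = SalaryTotals()
view.on_insert(10, 400_000)
view.on_insert(10, 500_000)
# view.on_insert(10, 200_000)                # would raise: the cap would be exceeded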
Example 12.5.
The following is an example of an individual assertion that involves a single attribute salary
and a single relation Employee.
• Create assertion salary_constraint check (not exists (select * from Employee
where salary < 20000)).
The above assertion imposes the restriction that salary must not be less than 20,000, and this
condition must hold in every database state for the assertion to be satisfied.
Example 12.6.
The following is an example of a set-oriented assertion that involves two variables salary and
deptno and two relations Employee and Department.
The above assertion also represents a referential integrity constraint involving two relations
Employee and Department.
In a distributed database system, there are three alternative approaches for catalog
management, as described in the following.
• Centralized–. In this approach, the global system catalog is stored at a single site. All
other sites in the distributed system access catalog information from this central site.
This approach is very simple, but it has several limitations. The system is vulnerable
to the failure of the central site. The availability and reliability are very low in this
case. The major drawback of this approach is that it decreases site autonomy.
• Fully replicated–. In this case, catalog information is replicated at each site of the
distributed system. The availability and reliability are very high in this approach. The
site autonomy increases in this case. The disadvantage of this approach is the
information update overhead of the global system catalog.
• Fragmented and distributed–. This approach is adopted in the distributed system R*
to overcome the drawbacks of the above two approaches. In this approach, there is a
local catalog at each site that contains the metadata related to the data stored at that
site. Thus, the global system catalog is fragmented and distributed at the sites of a
distributed system. For database objects created at any site (the birth-site), it is the
responsibility of that site’s local catalog to store the definition, replication details and
allocation details of each fragment or replica of that object. Whenever a fragment or a
replica is moved to a different location, the local catalog at the corresponding site must
be updated. The birth-site of each global database object is recorded in each local
system catalog. Thus, local references to data can be performed locally. On the other
hand, if remote reference is required, the system-wide name [system-wide name is
discussed in Chapter 5, Section 5.5.1] reveals the birth-site of the database object,
and then the catalog information is accessed from that site.
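The fragmented and distributed approach can be pictured with a small sketch. In the following Python fragment, the name layout user@birth_site.object is only an illustration of a system-wide name, not the exact R* syntax: every site stores the catalog entries of the objects born at that site, and a remote reference is resolved by extracting the birth-site from the system-wide name.

local_catalogs = {
    "site1": {"alice@site1.Employee": {"fragments": ["site1", "site3"]}},
    "site2": {"bob@site2.Department": {"fragments": ["site2"]}},
}

def lookup(system_wide_name, current_site):
    birth_site = system_wide_name.split("@")[1].split(".")[0]
    if birth_site == current_site:
        return local_catalogs[current_site][system_wide_name]   # local reference
    return local_catalogs[birth_site][system_wide_name]         # ask the birth site

print(lookup("alice@site1.Employee", "site2"))   # resolved through the birth site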
Chapter Summary
• Database security refers to the preventive measures for maintaining data integrity.
Database security typically includes view management, data protection and
authorization control, and semantic integrity constraints, all of which can be defined
as database rules that the system will automatically enforce.
• A view can be defined as a virtual relation or table that is derived from one or more
base relations. A data request or an update request against a view can be resolved in
two different ways: view resolution and view materialization. Views can be
classified as updatable views and non-updatable views.
• Protection is required to prevent unauthorized disclosure, alteration or destruction of
data.
• The granting of rights or privileges that enable users to have legitimate access to a
system or a system’s objects is called authorization. The authorization control in
distributed DBMSs involves remote user authentication, management of distributed
authorization rules, view management and control of user groups.
• Database consistency is ensured by a set of restrictions or constraints, known
as semantic integrity constraints. There are two different types of semantic integrity
constraints: structural constraints and behavioural constraints. Semantic integrity
constraints can be expressed using assertions.
• An assertion is a predicate expressing an integrity constraint that must be satisfied by
the database. Assertions can be classified into three different categories: individual
assertions, set-oriented assertions and assertions involving aggregates.
• The global system catalog stores all the information that is required to access data
correctly and efficiently and to control authorization. There are three alternative
approaches for catalog management in a distributed system: centralized, fully
replicated, and fragmented and distributed.
Chapter 13. Mobile Databases and Object-
Oriented DBMS
The fundamentals of mobile databases and object-oriented database management systems
(OODBMSs) are introduced in this chapter. Portable computing devices coupled with
wireless communications allow clients to access data in databases from virtually anywhere
and at any time. However, communications are still restricted owing to security reasons,
cost, limited battery power and other factors. Mobile databases overcome some of these
limitations of mobile computing. OODBMSs provide an environment where users can avail
themselves of the benefits of both object-orientation and database management systems. The features
of an OODBMS proposed by the Object-Oriented Database Manifesto are described here.
The benefits and problems of OODBMSs over conventional DBMSs are also discussed in
this chapter.
This chapter is organized as follows. Section 13.1 introduces the basic concepts of mobile
databases. In Section 13.2, basic object-oriented concepts are discussed. The details of
OODBMSs are presented in Section 13.3.
Mobile Databases
Mobile computing has been gaining an increased amount of attention owing to its usefulness
in many applications such as medical emergencies, disaster management and other
emergency services. Recent advancements in portable and wireless technology have led to
mobile computing becoming a new dimension in data communication and processing.
Wireless computing creates a situation where machines no longer have fixed locations and
network addresses. The rapid expansion of cellular, wireless and satellite communications
makes it possible to access any data from anywhere, at any time. This feature is especially
useful to geographically dispersed organizations. However, business etiquette, practical
situations, security and costs may still limit communications. Furthermore, energy (battery
power) is a scarce resource for most of the mobile devices. These limitations are listed in the
following:
• The wireless networks have restricted bandwidth. The cellular networks have
bandwidths of the order of 10 Kbps, whereas wireless local area networks have
bandwidths of the order of 10 Mbps.
• The power supplies (i.e., batteries) in mobile stations have limited lifetimes. Even with
the new advances in battery technology, the typical lifetime of a battery is only a few
hours, which is reduced further with increased computation and disk operations.
• Mobile stations are not as continuously available as stationary ones because of power
restrictions. For the same reason, the amount of computation that can be performed on
mobile stations is restricted.
• As mobile stations can move, additional system functionality is required to track them.
Moreover, managing mobility is a complicated task, because it requires the
management of the heterogeneity of the base stations where the topology of the
underlying network changes continuously.
Therefore, it is not possible to establish online connections for as long as the users want and
whenever they want. In this context, mobile databases offer a solution for some of these
limitations.
A mobile database is a database that is portable and physically separate from a centralized
database server, but capable of communicating with that server from remote sites allowing
the sharing of corporate data. Mobile databases help users by facilitating the sharing of
corporate data on their laptops, PDA (Personal Digital Assistant) or other internet access
devices from remote sites. A mobile database environment has the following components:
• Corporate database server and DBMS –. The corporate database server stores the
corporate data, and the DBMS is used to manage it. This component provides
corporate applications.
• Remote database server and DBMS –. This server stores the mobile data, and the
DBMS is used to control mobile data. This component provides mobile applications.
• Mobile database platform –. This component involves laptops, PDAs or any other
internet access devices.
• Both-way communication link –. To establish a connection between the corporate
and mobile DBMSs, a both-way communication link is required.
In some mobile applications, the mobile user may logon to a corporate database server from
his/her mobile device and can work with data there, whereas in other applications the user
may download (or upload) data from (to) the corporate database server at the remote site, and
can work with it on a mobile device. The connection between the corporate and the mobile
databases is established for short periods of time at regular intervals. The main issues
associated with mobile databases are management of mobile databases and the
communication between the mobile and corporate databases.
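A minimal sketch of this intermittent synchronization, assuming a simple last-writer-wins rule based on timestamps (real mobile DBMSs provide far richer replication and conflict-resolution facilities), is given below in Python; the record keys and values are invented.

import time

corporate = {"cust42": ("Alice", 100.0)}          # key -> (value, last_modified)
mobile = dict(corporate)                          # copy downloaded to the device

def synchronize(corporate, mobile):
    # Reconcile the two copies: the version with the newer timestamp wins.
    for key in set(corporate) | set(mobile):
        c, m = corporate.get(key), mobile.get(key)
        if c is None or (m is not None and m[1] > c[1]):
            corporate[key] = m                    # upload the newer mobile change
        elif m is None or c[1] > m[1]:
            mobile[key] = c                       # download the newer corporate change

mobile["cust42"] = ("Alice Smith", time.time())   # change made while disconnected
synchronize(corporate, mobile)
print(corporate["cust42"] == mobile["cust42"])    # True after synchronization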
Mobile DBMS
A mobile DBMS provides database services for mobile devices, and now most of the major
DBMS vendors offer mobile DBMSs. A mobile DBMS must have the capability to
communicate with a range of major relational DBMSs, and it must work properly with
limited computing resources, which is required to match with the environment of mobile
devices. The additional functionalities that are required for a mobile DBMS are listed in the
following:
1. The major function of a mobile DBMS is to manage mobile databases and to create
customized mobile applications.
2. It must be able to communicate with the centralized database server through wireless
or internet access.
3. It must be able to capture data from various sources such as the internet.
4. It must be able to analyse data on a mobile device.
5. A mobile DBMS must be able to replicate and synchronize data on the centralized
database server and on the mobile device.
Currently, most of the mobile DBMSs provide only prepackaged SQL functions for mobile
applications, rather than supporting any extensive database querying or data analysis.
The data distribution alternatives in a mobile database environment are listed in the following:
1. The entire database is distributed among the wired (fixed) components, possibly with
full or partial replication. A base station (or fixed host managers or corporate database
server) manages its own database with additional functionality for locating mobile
units, and additional query and transaction management features to meet the
requirements of mobile environments.
2. The entire database is distributed among the wired and wireless components. Data
management responsibility is shared among the base stations or fixed hosts and the
mobile units.
Object-Oriented Concepts
Objects with the same properties and behaviours are grouped into classes. An object can be
an instance of only one class or an instance of several classes. Classes are organized in class
hierarchies. The process of forming a superclass is referred to as generalization, and the
process of forming a subclass is called specialization. A subclass inherits properties and
methods from a superclass and, in addition, may have specific properties and methods of its
own. An instance of a subclass can be used wherever an instance of its superclass is expected;
this is the principle of substitutability, which helps to take advantage of code
reusability. In some systems, a class may have more than one superclass, called multiple
inheritance, whereas in others it is restricted to only one superclass, called single
inheritance. Most models allow overriding of inherited properties and
methods. Overriding is the substitution of a property domain with a new domain or the
substitution of a method implementation with a different one. The subclass can override the
implementation of an inherited method or instance variable by providing an alternative
definition or implementation of the base class. The ability to apply the same methods to
different classes, or rather the ability to apply different methods with the same name to
different classes, is referred to as polymorphism. The process of selecting the appropriate
method based on an object’s type is called binding. If the determination of the object’s type
is carried out at run-time, then it is called dynamic binding or late binding. On the other
hand, if the object’s type is decided before run-time (i.e., at compile time), then it is
called static binding or early binding.
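These notions can be made concrete with a short Python example (the Employee and Manager classes are invented for illustration): Manager specializes Employee, overrides the inherited pay method, and the method actually executed is selected from the object's class at run time, i.e., by dynamic binding.

class Employee:                                   # superclass (generalization)
    def __init__(self, name, salary):
        self.name, self.salary = name, salary

    def pay(self):                                # method a subclass may override
        return self.salary

class Manager(Employee):                          # subclass (specialization)
    def __init__(self, name, salary, bonus):
        super().__init__(name, salary)
        self.bonus = bonus                        # property specific to the subclass

    def pay(self):                                # overriding the inherited method
        return self.salary + self.bonus

# Polymorphism with dynamic (late) binding: the same call, person.pay(), runs a
# different implementation depending on the actual class of each object.
staff = [Employee("Ana", 30000), Manager("Raj", 50000, 8000)]
print([person.pay() for person in staff])         # [30000, 58000]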
Database technology focuses on the static aspects of information storage, whereas software
engineering concentrates on modeling the dynamic aspects of software. Currently, it is
possible to combine these two technologies, allowing the concurrent modeling of both
data and the processes acting upon the data. Two such technologies are Object-Oriented
Database Management Systems (OODBMSs) and Object–Relational Database
Management Systems (ORDBMSs). The basic concepts of OODBMS are described in the
following section.
Features of OODBMS
The Object-Oriented Database System Manifesto proposes 13 mandatory features for
OODBMS based on two criteria [Atkinson et al., 1989]. The first criterion is that OODBMS
should be an object-oriented system, and the second criterion is that it should be a DBMS.
The first eight rules are applicable for object-oriented features and the last five rules are
applicable for DBMS characteristics. These features are listed in the following:
1. Support for complex objects.
2. Object identity.
3. Encapsulation.
4. Support for types or classes.
5. Class or type hierarchies (inheritance).
6. Overriding, overloading and late binding.
7. Computational completeness.
8. Extensibility.
9. Data persistence.
10. Secondary storage management (the ability to manage very large databases).
11. Support for concurrent users.
12. Recovery from hardware and software failures.
13. An ad hoc query facility.
In addition, the manifesto also proposes some optional features such as multiple inheritance,
type checking and type inferencing, distribution across a network, design transactions and
versions. However, the manifesto does not mention any direct rules for the support of
integrity, security, or views.
Benefits of OODBMS
An OODBMS provides several benefits, which are listed in the following:
Disadvantages of OODBMS
Although an OODBMS provides several benefits over relational data model, it has a number
of disadvantages also, as listed below.
Chapter Summary
• A mobile database is a database that is portable and physically separate from a
centralized database server but capable of communicating with that server from
remote sites allowing the sharing of corporate data. A mobile database environment
has four components. These are corporate database server and DBMS, remote
database and DBMS, mobile database platform, and both-way communication link.
• A mobile DBMS provides database services for mobile devices.
• An object-oriented database is a persistent and sharable collection of objects based on
an object-oriented data model.
• An object-oriented DBMS (OODBMS) is used to manage object-oriented databases,
and provides both the features of object-orientation and database management
systems.
Chapter 14. Distributed Database Systems
This chapter focuses on two popular distributed database systems SDD-1 and R*.
Architectures, transaction management, concurrency control techniques and query
processing of these distributed database systems are discussed in this chapter.
The outline of this chapter is as follows. Section 14.1 introduces the basic concepts of SDD-
1 distributed database system. The architecture of SDD-1 is introduced in Section
14.2. Section 14.2 also describes the concurrency control, reliability, catalog management
and query processing techniques of SDD-1 distributed database system. In Section 14.3,
another distributed database system R* is presented. The architecture of R* system is
described in Section 14.4, and the query processing of R* is discussed in Section 14.5. The
transaction management of R* distributed database system is illustrated in Section 14.6.
SDD-1 distributed database system supports the relational data model. An SDD-1 database
consists of a number of (logical) relations, where each relation is partitioned into a number of
subrelations called logical fragments, which are the units of data distribution. A logical
fragment is derived from a relation by using two different steps. Initially, the relation is
horizontally divided into a number of subrelations by using a simple predicate or selection
operation of relational algebra, and then it is again vertically partitioned into logical
fragments by using a projection operation. To facilitate the reconstruction of a relation from
its logical fragments, a unique tuple identifier is attached to each logical fragment. A logical
fragment can be stored in one or several sites in an SDD-1 distributed database system, and it
may be replicated. The allocation of logical fragments is done during database design. A
stored copy of a logical fragment at a site is called a stored fragment. SDD-1 distributed
database system provides data distribution transparency to the users; thus, users are unaware
of the details of fragmentation, replication and distribution of data.
General Architecture of SDD-1 Database
System
The general architecture of SDD-1 is a collection of three independent virtual
machines: Transaction Modules or Transaction Managers (TMs), Data Modules or
Data Managers (DMs) and a Reliable Network (RelNet) [Bernstein et al., 1980]. DMs are
responsible for managing all data that are stored in an SDD-1 database system. The
execution of any transaction in the SDD-1 database system is controlled by the TMs. The
RelNet facilitates the necessary communications among different sites that are required to
maintain an SDD-1 distributed database system. The architecture of SDD-1 distributed
database system is illustrated in figure 14.1.
Data Module (DM) – In an SDD-1 distributed database system, each DM controls local data
that are stored at that local site. Hence, each DM is a back-end DBMS that responds to
commands from TMs. Generally, each DM responds to four types of commands from TMs,
which are listed in the following:
1. Reading data from the local database into the workspace assigned to the transaction
at that DM.
2. Moving data between workspaces at different DMs.
3. Manipulating data in the local workspace at the DM.
4. Writing data from the local workspace into the permanent local database stored at that
DM.
Transaction Module (TM) – Each TM in an SDD-1 database system plans and manages the
execution of transactions in the system. A TM performs the following tasks:
The Reliable Network (RelNet) – This module interconnects TMs and DMs and provides
the following services:
1. Guaranteed delivery –. RelNet guarantees that a message will be delivered even if
the recipient is down at the time the message is sent, and even if the sender and the
receiver are never up simultaneously.
2. Transaction Control –. RelNet ensures that updates are posted either at all of the
involved DMs or at none of them, thus maintaining the atomicity of transactions.
3. Site Monitoring –. RelNet keeps information regarding the site failures in the system
and informs other sites about it.
4. A Global Network Clock –. It provides a virtual clock which is roughly synchronized
at all the sites in the system.
This architecture divides the SDD-1 distributed database into three subsystems, namely,
database management, management of distributed transactions and distributed DBMS
reliability with limited interactions.
The basic unit of user computation in SDD-1 is the transaction. A transaction essentially
corresponds to a program in a high-level host language with several DML statements
sprinkled within it. In SDD-1 database system, the execution of each transaction is
supervised by a TM (called transaction coordinator also) and proceeds in three phases,
known as read, execute and write. Each of these phases deals with individual problems in the
system. The read phase deals with concurrency control, the execute phase deals with
distributed query execution, and the write phase deals with the execution of updates at all
replicas of modified data. In read phase, the transaction coordinator (TC) (transaction
manager of the site where the query is initiated) determines which portion of the (logical)
database is to be read by the transaction, called the read set of the transaction. In addition,
the TC decides which stored fragments (physical data) are to be accessed to obtain the data
and then issues read commands to the corresponding DMs. To maintain the consistency of
the data, the TC instructs each DM to set aside a private copy of the fragment for use during
subsequent processing phases. The private copies obtained by the read phase are guaranteed
to be consistent even though the copies reside at distributed sites. As the data are consistent
and the copies are private, any operation on the data can be performed freely without any
interference from other transactions. Each DM sets aside the specified data in a private
workspace, and for each DM the private workspace is implemented by using a differential
file mechanism so that data are not actually copied. A page map is a function that associates
a physical storage location with each page, and it behaves like a private copy because pages
are never updated in place. Whenever a transaction wants to update a page, a new block of
secondary storage is allocated, and the modified page is written there. These pages cannot be
modified by other transactions, because if another transaction wants to update them, it has to
allocate a new physical storage location and write the updated page on it.
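The copy-on-write behaviour of the page map can be sketched as follows in Python; this is only an illustration of the idea, not the actual SDD-1 differential file implementation, and the page store is simulated by an in-memory dictionary.

class PageMap:
    """Private workspace over a shared page store: pages are never updated in place."""
    def __init__(self, store):
        self.store = store                         # page_id -> page contents
        self.map = {pid: pid for pid in store}     # logical page -> physical page
        self.next_free = max(store) + 1

    def read(self, page_id):
        return self.store[self.map[page_id]]

    def write(self, page_id, data):
        new_physical = self.next_free              # allocate a fresh block
        self.next_free += 1
        self.store[new_physical] = data
        self.map[page_id] = new_physical           # only this workspace sees the change

shared = {0: b"original page 0", 1: b"original page 1"}
workspace = PageMap(shared)
workspace.write(0, b"updated page 0")
print(workspace.read(0), shared[0])                # updated copy vs. untouched original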
The second phase, called the Execute phase, implements distributed query processing. In
this phase, a distributed program that takes as input the distributed workspace created by the
read phase is compiled and an access plan for the execution is generated and executed. The
execution of the program is supervised by the TC, and it involves several DMs at the
different sites. During the execute phase, all actions are performed in the local workspaces of
the transaction. The output of the program is a list of data items to be written into the
database (in the case of update transactions) or displayed to the user (in the case of
retrievals). This output list is produced in a workspace, not in the permanent database.
The final phase, called the write phase, either writes the modified data into permanent
database and/or displays retrieved data to the user. In the write phase, the output list
produced by the transaction is broadcast to the “relevant” DMs as write messages. A DM is
relevant if it contains a physical copy of some logical data item that is referenced in the
output list. The updated data are written in SDD-1 database using a write-all approach. The
atomicity of transactions is preserved by using a specialized commit protocol that allows the
commitment of transactions in case of involved site failures.
The three-phase processing of transactions in SDD-1 neatly partitions the key technical
challenges of distributed database management such as distributed concurrency control,
distributed query processing and distributed reliability.
Concurrency control in SDD-1 is based on an offline analysis of conflicts between
transactions, known as conflict graph analysis. Figure 14.2 illustrates the conflict graphs for
transactions T1 and T2, and T3 and T4, respectively.
The nodes of a conflict graph represent the read sets and write sets of transactions, and edges
represent conflicts among these sets. There is also an edge between the read set and write set
of each transaction. Here, the important point is that different kinds of edges require different
levels of synchronization, and strong synchronization such as locking is required only for
edges that participate in cycles. In figure 14.2, transactions T1 and T2 do not require
synchronization as strong as locking, but transactions T3 and T4 do require it.
It is not appropriate to do conflict graph analysis at run-time because too much inter-site
communication would be required to exchange information about conflicts. The conflict
graph analysis is done offline in the following way.
The database administrator defines transaction classes, which are named groups of
commonly executed transactions, according to their names, read sets, write sets, and the TM
at which they run. A transaction is a member of a class if the transaction’s read set and write
set are contained in the class’s read set and write set, respectively. Conflict graph analysis is
actually performed on these transaction classes instead of individual transactions. Two
transactions are said to be conflicting if their classes are conflicting. The output of the
conflict analysis is a table that indicates, for each class, which other classes conflict with it and
how much synchronization is required for each such conflict to ensure serializability.
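A simplified, offline check in the spirit of this analysis is sketched below in Python; the read sets, write sets and class names are invented, and the real SDD-1 analysis produces a finer-grained table of required synchronization levels rather than a yes/no answer.

def member_of(txn, cls):
    # A transaction belongs to a class if its read set and write set are contained
    # in the class's read set and write set, respectively.
    return txn["read"] <= cls["read"] and txn["write"] <= cls["write"]

def classes_conflict(c1, c2):
    # Two classes conflict if either may write data that the other reads or writes.
    return bool(c1["write"] & (c2["read"] | c2["write"]) or
                c2["write"] & (c1["read"] | c1["write"]))

payroll = {"read": {"Employee", "Department"}, "write": {"Employee"}}
report = {"read": {"Employee"}, "write": set()}

t = {"read": {"Employee"}, "write": {"Employee"}}
print(member_of(t, payroll))                 # True
print(classes_conflict(payroll, report))     # True: payroll writes what report reads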
TIMESTAMP-BASED PROTOCOLS
To synchronize two transactions that conflict dangerously, SDD-1 uses timestamp values of
transactions. In SDD-1, the order is determined by total ordering of the transaction’s
timestamps. Each transaction submitted to SDD-1 is assigned a globally unique timestamp
value by its own TM, and these timestamp values are generated by concatenating a TM
identifier to the right of the network clock time, so that timestamp values from different TMs
always differ in their lower-order bits. The timestamp of a transaction is also attached to all
read and write commands sent to the DMs. In addition, each read command contains a list of
classes that conflict dangerously with the transaction issuing the read, and this list is
generated using conflict graph analysis. The DM delays the execution of a read command
until it has processed all earlier write commands but not the later write commands for the
specified class. The DM–TM communication discipline is called piping, and it requires that
each TM send its write commands to DMs in their timestamp order. Moreover, the RelNet
guarantees that messages are received in the order they are sent.
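The construction of globally unique timestamps can be illustrated with a few lines of Python. The field widths chosen here are arbitrary; the point is that the clock value occupies the high-order bits and the TM identifier the low-order bits, so timestamps from different TMs never coincide and still sort by time first.

TM_ID_BITS = 8                                   # assumed width of the TM identifier

def make_timestamp(clock_tick, tm_id):
    return (clock_tick << TM_ID_BITS) | tm_id    # clock in high bits, TM id in low bits

ts_a = make_timestamp(1000, tm_id=3)
ts_b = make_timestamp(1000, tm_id=7)
ts_c = make_timestamp(1001, tm_id=1)
print(ts_a < ts_b < ts_c)                        # True: unique and ordered by time first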
To avoid excessive delays in waiting for write commands, idle TMs may send null (empty)
timestamped write commands. This synchronization protocol corresponds to locking and is
designed to avoid “race conditions”. However, there are several variations of this protocol
depending on the type of timestamp attached to read commands and the interpretation of the
timestamps by DMs. For example, read-only transactions can use a less expensive protocol
in which the DM selects the timestamp, thereby avoiding the possibility of rejection and
reducing delays. An important feature of concurrency control of SDD-1 database is the
availability of a variety of synchronization protocols. When all read commands have been
processed, consistent private copies of the read set have been set aside at all necessary DMs,
and the read phase is complete at this point.
ACCESS PLANNING
The simplest way to execute a distributed transaction is to move all of its read set to a single
DM and then execute the transaction at that DM. This approach is very simple, but it has two
drawbacks. The first one is that the read set of the distributed transaction may be very large,
and moving it between sites can be exorbitantly expensive. The second drawback is that
parallel processing is put to very little use. To overcome these drawbacks, the access planner
generates object programs in two phases, known as reduction and final processing.
In the reduction phase, data is eliminated from the read set of the transaction as much as is
economically feasible without changing the output of the transaction. In the final processing
phase, the reduced read set is moved to the designated final DM where the transaction is
executed. This technique improves the performance of the above simple approach by
decreasing communication cost and increasing parallelism. To reduce the volume of the read
sets of transaction, the reduction technique employs selection, projection and semijoin
operations of relational algebra. The cost-effectiveness of a semijoin operation depends on
the database state. The challenge of access planning is to construct a program of cost-
beneficial semijoin operations for a given transaction and a database state. The hill-climbing
algorithm is used here to construct a program of cost-beneficial semijoin operations. In hill-
climbing algorithm, the optimization starts with an initial feasible solution and recursively
improves it. This process terminates when no further cost-beneficial semijoin operations can
be found. [The details of the hill-climbing procedure are described in Chapter 11, Section 11.5.3.]
A final stage reorders the semijoin operations for execution to take maximal advantage of
their reductive power.
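A much-simplified greedy selection of cost-beneficial semijoins, in the spirit of the hill-climbing procedure, is sketched below in Python. The candidate semijoins and their cost and benefit figures are invented, and in a real optimizer applying one semijoin would change the estimates of the remaining candidates; they are held fixed here for brevity.

candidates = [
    {"name": "Emp semijoin Dept", "benefit": 5000, "cost": 800},
    {"name": "Proj semijoin Emp", "benefit": 1200, "cost": 1500},
    {"name": "Dept semijoin Proj", "benefit": 900, "cost": 400},
]

plan = []
while True:
    beneficial = [c for c in candidates if c["benefit"] > c["cost"]]
    if not beneficial:
        break                                    # no remaining cost-beneficial semijoin
    best = max(beneficial, key=lambda c: c["benefit"] - c["cost"])
    plan.append(best["name"])
    candidates.remove(best)

print(plan)     # ['Emp semijoin Dept', 'Dept semijoin Proj']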
DISTRIBUTED EXECUTION
The programs produced by the access planner are non-looping parallel programs and can be
represented as data flow graphs. To execute the programs, the TC issues commands to the
DMs involved in each operation as soon as all predecessors of the operation are ready to
produce output. The output of this phase is either stored as a temporary file at the final DM
that is to be written into the permanent database (if it is an update operation) or displayed to
the user (if it is a data retrieval request). The execute phase is complete at this point.
GUARANTEED DELIVERY
In the communication field, there are well-known techniques for reliable message delivery as
long as both sender and receiver are up. In ARPANET, errors due to duplicate messages,
missing messages, and damaged messages are detected and corrected by the network
software. In SDD-1, guaranteed delivery is ensured by an extended communication facility
called reliable network or RelNet, even when the sender and the receiver are not up
simultaneously to exchange messages. The RelNet employs a mechanism called spooler for
guaranteed message delivery.
TRANSACTION CONTROL
Transaction control handles failures of the final DM during the write phase. If the final DM
fails after sending some files but not all, the database becomes inconsistent as it stores the
partial effects of the transaction. Transaction control rectifies this type of inconsistencies in a
timely fashion. The basic technique employed here is a variant of the two-phase commit
protocol. During phase 1, the final DM transmits the files, but the receiving DMs do not
perform the updates. During phase 2, the final DM sends commit messages to involved DMs,
and each receiving DM updates its data upon receiving this message. If some DM, say DMi,
has received files but not a commit message and the final DM fails, then the data manager
DMi can consult with other DMs. If any DM has received a commit message, the data
manager DMi performs the updates. If none of the DMs have received a commit message,
none of them performs the update, thereby aborting the transaction. This technique offers
complete protection against failures of the final DM but is susceptible to multi-site failures.
For each data item X in the write command, the value of X is modified at the DM if and only
if the timestamp value of X is less than the timestamp value of the write command. Thus,
recent updates are never overwritten by write commands with the older timestamp value.
This has the same effect as processing write commands according to the timestamp order. A
major disadvantage of this approach is the apparent high cost of storing timestamps for every
data item in the database. However, this cost can be reduced to an acceptable level by
caching the timestamps. The write phase is completed when updates are made at all DMs,
and at this point the transaction execution is complete.
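This write rule can be expressed very compactly; the Python sketch below (an illustration of the rule, with a single invented data item) applies an update only when the incoming write carries a newer timestamp than the stored value, so stale writes are silently discarded.

database = {"X": {"value": 10, "ts": 105}}       # stored value with its timestamp

def apply_write(item, new_value, write_ts):
    entry = database[item]
    if entry["ts"] < write_ts:                   # only newer writes take effect
        entry["value"] = new_value
        entry["ts"] = write_ts

apply_write("X", 99, write_ts=100)               # stale write: ignored
apply_write("X", 42, write_ts=110)               # newer write: applied
print(database["X"])                             # {'value': 42, 'ts': 110}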
One simple approach to maintain the system catalog is to treat catalog information as user
data and manage it in the same way as database information. This approach allows catalogs
to be fragmented, distributed with arbitrary replication, and updated from arbitrary TMs.
This approach has some limitations; in particular, performance may degrade because every
catalog access incurs general transaction overhead and every access to remotely stored
catalogs incurs communication delays. These limitations can be overcome by caching
recently referenced catalog fragments at each TM, discarding them if rendered obsolete by
catalog updates.
This solution is more appropriate, because the catalogs are relatively static. However, it
requires a directory that tells where each catalog fragment is stored. This directory is called
the catalog locator, and a copy of it is stored at every DM. This solution is appropriate
because catalog locators are relatively small and quite static.
The R* Distributed Database System
In R* distributed database system, data are stored as relations or tables. The R* distributed
database system currently does not support fragmentation and replication of data. Moreover,
in R* it is not necessary that sites are geographically separated; different sites can be on the
same computer. The most important feature of R* distributed database system is that it
provides site autonomy. Site autonomy requires that the system be able to expand
incrementally and operate continuously. It is possible to add new sites without requiring the
existing sites to agree with the joining sites on global data structures or definitions. Another
important feature of R* distributed system is that it provides location transparency. The
major achievement of R* is that it provides most of the functionalities of a relational
database management system in a distributed environment.
Architecture of R*
The general architecture of R* distributed database system consists of three major
components [Astrahan et al., 1976]. These are a local database management system, a
data communication system that facilitates message transmission, and a transaction
manager that controls the implementation of multi-site transactions. The local DBMS
component is further subdivided into two components: storage system and database
language processor. The storage system of R* is responsible for retrieval and storage of
data, and this is known as Relational Storage System (RSS*). The database language
processor translates high-level SQL statements into operations on the storage system of R*.
The architecture of R* distributed database system is illustrated in figure 14.3.
The relational storage interface (RSI) is an internal interface that handles access to single
tuples of base relations. This interface and its supporting system, the RSS*, together constitute
a complete storage subsystem that manages devices, space allocation, storage buffers,
transaction consistency and locking, deadlock detection, transaction recovery and system
recovery. Furthermore, it maintains indexes on selected fields of base relations and pointer
chains across relations. The relational data interface (RDI) is the external interface that can
be called directly from a programming language or used to support various emulators and
other interfaces. The relational data system (RDS), which supports the RDI, provides
authorization, integrity enforcement, and support for alternative views of data. The high-
level SQL language is embedded within the RDI, and is used as the basis for all data
definition and manipulation. In addition, the RDS maintains the catalogs of external names,
as the RSS* uses only system-generated internal names. The RDS contains an optimizer
which chooses an appropriate access path for any given request among the paths supported
by the RSS*.
In R* distributed system, sites can communicate with each other using inter-system
communication (ISC) facility of customer information control system (CICS). Each site in
R* runs in a CICS address space, and CICS handles I/Os and message communications.
Unlike in RelNet, here the communication is not assumed to be reliable. However, the
delivered messages are correct, not replicated, and received in the same order in which they
are sent. All database access requests are made through an application program at each local
site of the R* system. All inter-site communications are between R* systems at different
sites, thus, remote application programs are not required in R* distributed database
environment.
Any transaction generated at a particular site is initiated and controlled by the transaction
manager at that site (known as transaction coordinator). The transaction manager at that site
implicitly performs a begin_transaction, and the implicit end_transaction is assumed when
the user completes a session. An R* database system can also be invoked using a user-
friendly interface (UFI), and in this case, SQL statements are individually submitted from the
UFI to R* storage system. Here each individual SQL statement is considered as an individual
transaction. The transaction manager at each site assigns a unique identifier to each
transaction originating at that site; this is generated by the concatenation of a local counter
value and the site identifier.
In R* database system, for processing any request the “process” model of computation is
used. According to this computation model, a process is created for each user on the first
request for a remote site and maintained until the end of the application, instead of allocating
a different process to each individual data request. Thus, a limited number of processes are
assigned to applications, thereby reducing the cost of process creation. User identification is
verified only once – at the time of creation of the process. A process activated at one site can
request the activation of another process at another site, in turn. Therefore, a computation in
R* may require the generation of a tree of processes, all of which belong to the same
application. In R* system, communication between processes can be done using sessions,
and these sessions are established when the remote process is created and are retained
throughout the life of the process. Thus, processes and sessions constitute a stable
computational structure for an application in R* distributed database system.
Query Processing in R*
In R* distributed database system, there are two main steps for query processing known
as compilation and execution [Chamberlin et al., 1981]. These are described in the
following.
Query compilation includes the generation and distribution of low-level programs for
accessing the RSS* for each high-level SQL query. These programs can be executed
repeatedly, and recompilation is required only when definitions of data used by the SQL
statement change. To determine whether recompilation is required or not, it is necessary to
store the information about the dependencies of the compiled query on data definitions. If the
query is interpreted, the determination of an access plan is done for every execution of the
query.
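The bookkeeping needed to decide whether recompilation is required can be sketched as follows in Python; the version numbers, query text and plan placeholder are assumptions standing in for whatever dependency information the real catalog records.

catalog_versions = {"Employee": 3, "idx_emp_salary": 1}

compiled_plans = {}                     # query text -> (plan, snapshot of dependencies)

def compile_query(query, depends_on):
    plan = "access module for: " + query            # placeholder for the compiled program
    compiled_plans[query] = (plan, {d: catalog_versions[d] for d in depends_on})
    return plan

def get_plan(query, depends_on):
    cached = compiled_plans.get(query)
    if cached and all(catalog_versions[d] == v for d, v in cached[1].items()):
        return cached[0]                             # definitions unchanged: reuse the plan
    return compile_query(query, depends_on)          # otherwise recompile

q = "select * from Employee where salary > 20000"
get_plan(q, ["Employee", "idx_emp_salary"])          # compiled and stored
catalog_versions["idx_emp_salary"] += 1              # an index definition changes
get_plan(q, ["Employee", "idx_emp_salary"])          # dependency changed: recompiled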
In R* distributed database system, there are several options for compilation of a query. In a
centralized approach, a single site is responsible for the compilation of all queries generated
at any site. This centralized approach is unacceptable, because it compromises the site
autonomy. The compilation of a query can be done at the site where it is generated; this
approach is also not acceptable because it does not preserve the local autonomy of other
sites. The access plan for a query execution determined by the originating site may be
rejected by other participating sites. In recursive compilation approach, each site performs
part of the query compilation and asks some other site to compile the remaining subquery.
This recursive approach also has the drawback in that it requires negotiation among sites for
compilation. The approach that is used for compilation of queries in R* distributed database
environment distributes the query compilation responsibilities among a master
site and apprentices sites. The site, where the query is generated, is called the master, and is
responsible for the global compilation of the query. The global plan specifies the order of
operations and the method that should be used to execute them. The other participating sites,
called apprentices, are responsible for selecting the best plan to access their local data for the
portion of the global plan that pertains to them. These local access plans must be consistent
with the global access plan.
The query compilation at the master site includes the following steps.
1. Parsing of the query – This step involves syntax verification of SQL query and the
resolution of database objects’ names, which corresponds to the transformation from
print names to system-wide names.
2. Accessing of catalog information – If the query involves only local data, then the
local catalog is accessed; otherwise, catalog information can be retrieved either from
local caches of remote data or from remote catalogs at the sites where the data are
stored.
3. Authorization checking – Authorization checking is done to determine whether the
user compiling the query has proper access privileges or not. It is performed on local
relations, initially at the master site and later at the apprentice sites.
4. Query optimization – In this step, an optimized global access plan is determined
based on I/O cost, CPU cost and communication cost.
5. Plan distribution – In this step, the full global plan is distributed to all apprentices
that specifies the order of operations and the methods to be used for them, together
with some guidelines for performing local operations, which may or may not be
followed by the apprentices. The global plan also includes the original SQL query,
catalog information, and the specification of parameters to be exchanged between sites
when the query is executed.
6. Local binding – The names used in the part of the global plan that is executed at the
master site are bound to internal names of database objects at that local site.
7. Storage of plan and dependencies – The local plan at the master site is converted
into a local access module that will be executed at the local instance of the RSS*. This
plan is stored together with the related dependencies that indicate the relations and
access methods used by the plan.
The master distributes to the apprentices portions of the global plan that contain the original
SQL statement, the global plan from which each apprentice can isolate its portion, and the
catalog information used by the master. It is to be noted that passing of high-level SQL
statement by the master provides more independence and portability, so that the apprentice
can generate a different local version of the optimized plan. The steps that are performed by
apprentice sites are parsing, local catalog accessing, authorization checking and local access
optimization. Finally, local names of database objects are bound, and plans and dependencies
are stored.
When all SQL statements of the same application program have been compiled by all
apprentice sites, the master requests the transaction manager to commit the compilation
considering it as an atomic unit. Here the two-phase commit protocol is used to ensure that
the compilation either succeeds at all sites or at none.
Execution –. In this step, plans sent to all apprentices in the compilation step are retrieved
and executed. The master sends execute requests to all apprentices, which in turn may send
the same request to their subordinates, thereby activating a tree of processes. Each
intermediate-level process receives result data, which is packed in blocks, from its
subordinates, and starts execution as soon as the first results are received. Then the
intermediate-level process sends the result to the higher-level processes, also packed in
blocks, thus, pipelining is achieved. A superior in the tree may send signals to its
subordinates to stop the transmission of results.
Transaction Management in R*
A transaction in R* system is a sequence of one or more SQL statements enclosed within
begin_transaction and end_transaction. It also serves as a unit of consistency and recovery
[Mohan et al., 1986]. Each transaction is assigned a unique transaction identifier, which is
the concatenation of a sequence number and the site identifier of the site at which the
transaction is initiated. Each process of a transaction is able to provisionally perform the
operations of the transaction in such a way that they can be undone if the transaction needs to
be aborted. Also, each database of the distributed database system has a log that is used for
recovery. The R* distributed database system, which is an evolution of the centralized
DBMS System R, like its predecessor, supports transaction serializability and uses the two-
phase locking (2PL) protocol as the concurrency control mechanism. In R*, concurrency
control is provided by 2PL protocol where transactions hold their locks until the commit or
abort, to provide isolation. The use of 2PL introduces the possibility of deadlocks in R*. The
R* distributed database system, instead of preventing deadlocks, allows them to occur and
then resolves them by deadlock detection and victim transaction abort.
The log records are carefully written sequentially in a file that is kept in non-volatile storage.
When a log record is written, the write can be done synchronously or asynchronously. In
synchronous write, called forcing a log record, the forced log record and all preceding ones
are immediately moved from the virtual memory buffers to stable storage. The transaction
writing the log record is not allowed to continue execution until this operation is completed.
On the other hand, in asynchronous write, the record gets written to the virtual memory
buffer storage and is allowed to migrate to the stable storage later on. The transaction writing
the record is allowed to continue execution before the migration takes place. It is to be noted
that a synchronous write increases the response time of the transaction compared to an
asynchronous write. The former case is called force-write, and the latter is called, simply,
write.
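The difference between the two kinds of log write can be pictured with a small sketch, in which stable storage is simulated by a list; this is only an illustration of the force/no-force distinction, not the R* log manager.

log_buffer = []                    # log records still in volatile (virtual memory) buffers
stable_log = []                    # log records already on stable storage

def write(record):
    log_buffer.append(record)      # asynchronous: the record migrates to disk later

def force(record):
    log_buffer.append(record)
    stable_log.extend(log_buffer)  # the record and all preceding ones are flushed now
    log_buffer.clear()             # only after this may the transaction proceed

write({"txn": "T1", "action": "update Employee"})
force({"txn": "T1", "action": "prepare"})            # e.g., before entering the prepared state
print(stable_log)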
In R* distributed database system, transactions are committed using two variations of the
standard two-phase commit protocol. These variations differ in terms of the number of
messages sent, the time for completion of the commit processing, the level of parallelism
permitted during the commit processing, the number of state transitions that the protocols go
through, the time required for recovery once a site becomes operational after a failure, the
number of log records written, and the number of those log records that are force-written to
stable storage. The conventional two-phase commit protocol is extended for transaction
commitment in R* system, which supports a tree of processes. The presumed abort
(PA) and the presumed commit (PC) protocols are defined to improve the performance of
distributed transaction commit.
In the PA protocol, when the coordinator finds no information about a transaction, the
transaction is presumed to have aborted, so aborts need not be acknowledged or recorded with
forced log writes. For committing transactions, the PA protocol does not change the
performance of the standard two-phase commit protocol in terms of log writes and messages sent.
In the PC protocol, each coordinator must record the names of its subordinates safely before
any of them can get into the prepared state. Then, when the coordinator aborts the transaction
on recovery from a crash that occurred after sending the prepare message, the restart process
will know whom to inform about the abort and from which subordinates acknowledgements
should be received. These modifications complete the PC protocol. The name arises from the
fact that in the no-information case the transaction is presumed to have committed, and
hence the response to an inquiry is a commit.
The PC protocol is more efficient than the PA protocol in cases where subordinates perform
update transactions successfully, as it does not require acknowledgements and forcing of
commit records. On the other hand, the PA protocol is beneficial for read-only transactions,
because the additional record with the participant information need not be forced in the log
of the coordinator. R* distributed database system provides facilities to select the protocol
for each individual transaction. The PA protocol is suggested for read-only transactions,
whereas the PC protocol is suggested for all other transactions.
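The essential difference in the no-information case can be reduced to a toy sketch: the coordinator answers an inquiry from its remembered outcomes if it still has them, and otherwise falls back on the protocol's presumption. This Python fragment illustrates only that rule; the full protocols also differ in which log records are forced and which messages are acknowledged, as described above.

class Coordinator:
    def __init__(self, presumption):
        self.presumption = presumption           # "abort" for PA, "commit" for PC
        self.outcomes = {}                       # transactions still remembered

    def record_outcome(self, txn, outcome):
        self.outcomes[txn] = outcome

    def forget(self, txn):
        self.outcomes.pop(txn, None)             # log information discarded

    def answer_inquiry(self, txn):
        return self.outcomes.get(txn, self.presumption)

pa = Coordinator("abort")
pa.record_outcome("T7", "abort")
pa.forget("T7")                                  # an aborted transaction is forgotten at once
print(pa.answer_inquiry("T7"))                   # 'abort': the presumption gives the right answer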
Chapter Summary
• SDD-1 is a distributed database management system developed by Computer
Corporation of America between 1976 and 1978. It supports relational data model.
• The general architecture of SDD-1 is a collection of three independent virtual
machines, namely, the Transaction Modules (TMs), Data Modules (DMs) and a
Reliable Network (RelNet).
• In SDD-1 database system, the execution of each transaction is processed in three
phases known as read, execute, and write. The read phase deals with concurrency
control, the execute phase deals with distributed query execution, and the write phase
deals with the execution of updates at all replicas of the modified data.
• R* is a distributed database system developed at the IBM San Jose Research
Laboratory in California. The objective of R* is to develop a cooperative
distributed database system where each site is a relational database system with site
autonomy.
• The general architecture of R* distributed database system consists of three major
components. These are a local database management system, a data communication
system, and a transaction manager. The local DBMS component is further subdivided
into two components, known as R* storage system (RSS*) and database language
processor.
• There are two steps involved in the processing of a query in R* environment, known
as compilation and execution. Sometimes, recompilation is also required.
• In R* system, two-phase commit protocol is used with variations for transaction
commitment. These are presumed abort protocol and presumed commit protocol.
Chapter 15. Data Warehousing and Data
Mining
This chapter focuses on the basic concepts of data warehousing, data mining and online
analytical processing (OLAP). A data warehouse is a subject-oriented, time-variant,
integrated, non-volatile repository of information for strategic decision-making. Data in
warehouses may be multi-dimensional. OLAP is the dynamic synthesis, analysis and
consolidation of large volumes of multi-dimensional data. Data mining is the process of
extracting valid, previously unknown, comprehensible and actionable information from large
volumes of multi-dimensional data, and it is used for making crucial business decisions.
Therefore, all the above three concepts are related to each other. Benefits and problems of
data warehouses and architecture of a data warehouse are briefly explained in this chapter.
The concepts of data mart and data warehouse schemas, and different data mining techniques
are also discussed.
The organization of this chapter is as follows. Section 15.1 introduces the concepts of data
warehousing. The architecture and the different components of a data warehouse are
described in Section 15.2. Database schemas for data warehouses are illustrated in Section
15.3, and the concept of a data mart is introduced in Section 15.4. OLAP technique and
OLAP tools are discussed in Section 15.5. In Section 15.6, the basic concept of data mining
is represented, and the different data mining techniques are focused on in Section 15.7.
In the 1990s, organizations began moving from traditional database systems to data
warehouse systems to achieve competitive advantage. A data warehouse integrates data from
multiple, heterogeneous sources and transforms them into meaningful information, thereby
allowing business managers to perform more substantive, accurate and consistent analysis.
Data warehousing improves the productivity of corporate decision-makers through
consolidation, conversion, transformation and integration of operational data, and provides a
consistent view of an enterprise. Data warehousing is an environment, not a product. It is an
architectural construct of information systems that provides current and historical decision
support information to users. In 1993, W.H. Inmon offered the following formal definition
for the data warehouse.
A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection
of data in support of management’s decision-making process.
There are numerous definitions of a data warehouse, but the above definition focuses on the
characteristics of the data that are stored in the data warehouse. The data in the warehouse is
used for analysis purposes; thus, it involves complex queries on the data. In recent years, a new
concept has been associated with the data warehouse, known as the data web house. A data web
house is a distributed data warehouse that is implemented on the Web with no central data
repository.
Operational Data Store (ODS)
An ODS is a repository of current and integrated operational data that are used for analysis; it
holds current and detailed data, whereas the data warehouse holds detailed data, historical data
and summarized data. The data coming from an ODS are often already structured, extracted from
the data sources and cleaned; therefore, the work of the data warehouse is simplified. The ODS is
often created when legacy operational systems are found to be incapable of meeting reporting
requirements.
Load Manager
The load manager (also known as front-end component) performs all the operations
associated with the extraction and loading of data from different data sources into the data
warehouse. The data may be extracted from data sources as well as ODSs. The load manager
is constructed by combining vendor data loading tools and custom-built programs. The size
and the complexity of this component may vary for different data warehouses.
Query Manager
The query manager (also known as back-end component) is responsible for managing user
queries. This component performs various operations such as directing queries to the
appropriate tables, scheduling the execution of queries, and generating query profiles to
allow the warehouse manager to determine which indexes and aggregations are appropriate.
The query profile is generated based on information that describes the characteristics of the
query such as frequency, target tables and size of the result sets. The query manager is
constructed by combining vendor end-user access tools, data warehouse monitoring tools,
database facilities and custom-built programs. The complexity of this component depends on
the functionalities provided by the end-user tools and the database.
Warehouse Manager
The warehouse manager is responsible for the overall management of the data in the data
warehouse. This component is constructed by using data management tools and custom-built
programs. The operations performed by this component include the creation of appropriate
indexes and aggregations, the generation of summary tables, and the backing up and archiving
of data. In some cases, the warehouse manager is also responsible for generating query profiles.
Detailed Data
The data warehouse stores all the detailed data in the database schema. In most cases, the
detailed data is not stored online, but is made available by aggregating the data to the next
level of detail. On a regular basis, detailed data is added to the warehouse to supplement the
aggregated data.
Summarized Data
The data warehouse stores all summarized data generated by the warehouse manager.
Summarized data may be lightly summarized or highly summarized. The objective of
summarized data is to speed up query processing. Although summarizing data incurs an initial
operational cost, this is offset because it removes the need to repeatedly perform summary
operations (such as sorting, grouping and aggregating) when answering user queries. The
summary data is updated continuously as new data are loaded into the data warehouse.
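To make this trade-off concrete, the following sketch (Python with the standard sqlite3 module) builds a hypothetical lightly summarized table from detailed sales records once, so that later queries can read the pre-computed totals instead of repeating the grouping and aggregation. All table and column names here are invented for illustration and are not taken from the text.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical detailed data already loaded into the warehouse.
cur.execute("CREATE TABLE sales_detail (product TEXT, region TEXT, sale_date TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?, ?)",
    [("P1", "East", "2023-01-05", 100.0),
     ("P1", "East", "2023-01-20", 150.0),
     ("P2", "West", "2023-01-11", 200.0)],
)

# The warehouse manager pre-computes a lightly summarized table once,
# so that user queries need not repeat the grouping and aggregation.
cur.execute("""
    CREATE TABLE sales_summary AS
    SELECT product, region, strftime('%Y-%m', sale_date) AS month,
           SUM(amount) AS total_amount, COUNT(*) AS num_sales
    FROM sales_detail
    GROUP BY product, region, month
""")

for row in cur.execute("SELECT * FROM sales_summary ORDER BY product"):
    print(row)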
Archive/Backup Data
The data warehouse also stores the detailed data and the summarized data for archiving and
backup. Although the summarized data can be generated from the detailed data, it is
necessary to back up the summarized data online. The data is transferred to storage archives
such as magnetic tapes or optical disks.
Metadata
Data about data are called metadata. The data warehouse stores the metadata information
used by all processes in the warehouse. A metadata repository should contain the following
information.
• A description of the structure of the data warehouse that includes the warehouse
schema, views, dimensions, hierarchies and derived data definitions, data mart
locations and contents, etc.
• Operational metadata that involves data linkage, currency of data, warehouse usage
statistics, and error reports and their trails.
• The summarization processes that involve dimension definitions and data on
granularity, partitions, summary measures, aggregation, and summarization.
• Details of data sources that include source databases and their contents, gateway
descriptions, data partitions, rules for data extractions, cleaning and transformation,
and defaults.
• Data related to system performance such as indexes, query profiles and rules for
timing and scheduling of refresh, update and replication cycles.
• Business metadata that includes business terms and definitions, data ownership
information and changing policies.
Metadata are used for a variety of purposes, which are listed below.
• In data extraction and loading processes, to map data sources to a common integrated
format of the data within the data warehouse.
• In the warehouse management process, to automate the generation of summary tables.
• In the query management process, to direct a query to the most appropriate data sources.
• In end-user access tools, to understand how to build a query.
The structure of metadata differs between processes, because the purpose is different.
Metadata are accordingly classified into three categories: build-time metadata, usage metadata
and control metadata. The metadata generated at the time of building a data warehouse are
termed build-time metadata. The usage metadata are derived from the build-time metadata when
the warehouse is in production, and these metadata are used by user queries and for data
administration. The control metadata are used by the databases and other tools to manage their
own operations.
End-User Access Tools
The principal purpose of a data warehouse is to support decision-makers, who interact with the
warehouse through end-user access tools. These tools can be grouped into the following categories.
1. Reporting and query tools – Reporting tools include production reporting tools and
report writers. Production reporting tools are used to generate regular operational
reports such as employee pay cheques, customer orders and invoices. Report writers
are inexpensive desktop tools used by end-users to generate different types of reports.
For relational databases, query tools are also designed to accept SQL or to generate SQL
statements to query the data stored in the warehouse.
2. Application development tools – In addition to reporting and query tools, user
access may require the development of in-house applications using graphical data
access tools designed primarily for client/server environments. Some of these
applications tools are also integrated with OLAP tools to access all major database
systems.
3. EIS tools – These tools were originally developed to support high-level strategic
decision-making. EIS (Executive Information System) tools are associated with mainframes
and enable users to build customized, graphical decision-support applications that provide
an overview of the organization’s data and access to external data sources.
4. OLAP tools – These are very important tools based on multi-dimensional data
models that assist sophisticated users by facilitating the analysis of the data using
complex, multi-dimensional views. Typical business applications that use these tools
include assessing the effectiveness of a marketing campaign, product sales forecasting
and capacity planning.
5. Data mining tools – Data mining tools are used to discover new meaningful
correlations, patterns and trends from large amounts of data. Generally, data mining
tools use statistical, mathematical and artificial intelligence techniques for mining
knowledge from large volumes of data.
Star Schema
A star schema is represented by a star-like structure that has a fact table containing
multi-dimensional factual data at the centre, surrounded by dimension tables containing
reference data that can be de-normalized. Since the bulk of the data in a data warehouse is
represented as facts, fact tables are usually very large, while the dimension tables are
relatively small. The fact table holds detailed or summarized factual data, and each dimension
is represented by a single, highly de-normalized table. In a star schema, each tuple in the
fact table corresponds to one and only one tuple in each dimension table, but one tuple in a
dimension table may correspond to more than one tuple in the fact table. Therefore, there is an
N:1 relationship between the fact table and a dimension table. The structure of a star schema
is illustrated in Example 15.1.
The advantage of a star schema is that it is easy to understand, and hierarchies can be defined
easily. Moreover, it reduces the number of physical joins and requires low maintenance.
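As a small illustrative sketch (not the book's Example 15.1, whose tables differ), the following Python/sqlite3 fragment defines a tiny star schema: one large fact table that references two de-normalized dimension tables, giving the N:1 relationship described above. All table and column names are assumptions made for this example.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: relatively small, de-normalized reference data.
cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT)""")
cur.execute("""CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    day TEXT,
    month TEXT,
    year INTEGER)""")

# Fact table: usually very large, with an N:1 relationship to each
# dimension (every fact row references exactly one row per dimension).
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    quantity INTEGER,
    amount REAL)""")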
Example 15.1.
Snowflake Schema
A variation of the star schema in which the dimension tables do not contain de-normalized data
is called a snowflake schema. In a snowflake schema, since the dimension tables contain
normalized data, a dimension table may have its own dimension tables. To support attribute
hierarchies, the dimension tables of a star schema can be normalized to create a snowflake schema.
Example 15.2.
In Example 15.1, a new dimension table, Department of Employee, has been created to
convert the star schema into a snowflake schema. The resultant snowflake schema is illustrated
in Figure 15.4.
Figure 15.4. An Example of Snowflake Schema
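A minimal sketch of the same normalization step, again with assumed table and column names: the department attributes are moved out of the de-normalized Employee dimension into a separate Department dimension table that the Employee dimension references.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Snowflaked dimensions: Employee no longer repeats department details;
# it references a separate, normalized Department dimension instead.
cur.execute("""CREATE TABLE dim_department (
    dept_id INTEGER PRIMARY KEY,
    dept_name TEXT,
    location TEXT)""")
cur.execute("""CREATE TABLE dim_employee (
    emp_id INTEGER PRIMARY KEY,
    emp_name TEXT,
    designation TEXT,
    dept_id INTEGER REFERENCES dim_department(dept_id))""")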
Fact Constellation Schema
A fact constellation schema contains more than one fact table that share one or more dimension tables.
Example 15.3.
In Example 15.1, assume that the IT Company is also a manufacturer of some products. In
this case, the product information and the project information can be represented by using a
fact constellation schema. The corresponding fact constellation schema is illustrated in Figure
15.5.
Figure 15.5. An Example of Fact Constellation Schema
Data Marts
A subset of a data warehouse that supports the requirements of a particular department or
business function is called a data mart. The data mart can be standalone or linked centrally to
the corporate data warehouse. Data marts may contain some overlapping data. The
characteristics that differentiate a data mart from a data warehouse include the following.
• A data mart focuses on only the requirements of users associated with one department
or business function.
• Generally, a data mart does not contain detailed operational data.
• A data mart contains less data compared to a data warehouse; thus, it is easier to
understand and navigate.
There are two different approaches for building a data mart. In the first approach, several
data marts are created first, and these are eventually integrated into a data warehouse. In the
second approach, the corporate data warehouse is built, and one or more data marts are created
to fulfill immediate business requirements. The architecture of a data mart may be two-tier
or three-tier depending on database applications. In the three-tier architecture of data marts,
tier 1 represents the data warehouse, tier 2 represents the data mart and tier 3 represents the
end-user tools.
A data mart is often preferable to a data warehouse for the following reasons.
• A data mart satisfies the requirements of a group of users in a particular department or
business function; therefore, it is easier to understand and analyse.
• A data mart improves the response time of end-user queries, as it stores less volume of
data than a data warehouse.
• The cost of implementing a data mart is much less than that of a data warehouse, and
the implementation is less time-consuming.
• A data mart normally stores less volume of data; thus, data cleaning, loading,
transformation and integration are much easier in this case. The implementation and
establishment of a data mart is less complicated than that of a data warehouse.
• The potential users of a data mart can be defined easily, and they can be targeted
easily to obtain support for a data mart project rather than for a corporate data
warehouse project.
OLAP Tools
OLAP tools are based on the concepts of multi-dimensional databases and allow a
sophisticated user to analyse the data using elaborate, multi-dimensional, complex views.
There are three main categories of OLAP tools, known as multi-dimensional
OLAP (MOLAP or MD-OLAP), relational OLAP (ROLAP) and hybrid
OLAP (HOLAP) [Berson and Smith, 2004]. These are described in the following.
Multi-dimensional OLAP – MOLAP tools utilize specialized data structures and multi-dimensional
database management systems to organize, navigate and analyse data. Data structures in MOLAP
tools use array technology and, in most cases, provide improved storage techniques that minimize
disk space requirements through sparse data management. The advantage of using a data cube is
that it allows fast indexing of pre-computed summarized data. This architecture offers excellent
performance when the data is used as designed and the focus is on data for a specific
decision-support application. In addition, some OLAP tools treat time as a special dimension to
enhance their ability to perform time series analysis. The architecture of MOLAP tools is
depicted in Figure 15.6.
Figure 15.6. MOLAP Architecture
Applications that require iterative and comprehensive time series analysis of trends are well
suited for MOLAP technology. The underlying data structures in MOLAP tools are limited
in their ability to support multiple subject areas and to provide access to detailed data. In this
case, navigation and analysis of data are also limited, because the data is designed according
to previously determined requirements. To support new requirements, data may need
physical reorganization. Moreover, MOLAP tools require a different set of skills and tools to
build and maintain the database, thereby increasing the cost and complexity of support.
Examples of MOLAP tools include Arbor Software’s Essbase, Oracle’s Express Server, Pilot
Software’s Lightship Server and Sinper’s TM/1.
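To illustrate the idea of a pre-computed cube, the sketch below (plain Python, with invented data) aggregates detail rows once into a cube keyed by (product, region, month), including 'ALL' roll-ups for each dimension; summarized values are then obtained by direct lookup rather than by re-scanning the data. Real MOLAP servers use specialized array storage, so the dictionary here is only a stand-in.

from collections import defaultdict

# Hypothetical detail rows: (product, region, month, amount).
detail = [
    ("P1", "East", "2023-01", 100.0),
    ("P1", "East", "2023-01", 150.0),
    ("P1", "West", "2023-02", 80.0),
    ("P2", "East", "2023-01", 200.0),
]

# Pre-compute the cube once (MOLAP-style), including the higher-level
# aggregates obtained by rolling each dimension up to 'ALL'.
cube = defaultdict(float)
for product, region, month, amount in detail:
    for p in (product, "ALL"):
        for r in (region, "ALL"):
            for m in (month, "ALL"):
                cube[(p, r, m)] += amount

# Summarized data is now answered by direct lookup, not by scanning.
print(cube[("P1", "East", "2023-01")])  # 250.0
print(cube[("P1", "ALL", "ALL")])       # 330.0
print(cube[("ALL", "ALL", "ALL")])      # 530.0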
Relational OLAP – ROLAP, also called multi-relational OLAP, is the fastest growing style
of OLAP technology. These tools support RDBMS products directly through a dictionary
layer of metadata, thereby avoiding the requirement to create a static multi-dimensional data
structure. ROLAP tools facilitate the creation of multiple multi-dimensional views of the
two-dimensional relations. To improve performance, some ROLAP tools have developed
enhanced SQL engines to support the complexity of multi-dimensional analysis. To provide
flexibility, some ROLAP tools recommend or require the use of highly de-normalized
database designs such as the star schema. The architecture of ROLAP tools is illustrated
in Figure 15.7.
ROLAP tools provide the benefit of full analytical functionality while maintaining the
advantages of relational data. These tools depend on a specialized schema design, and their
technology is limited by their non-integrated, disparate tier architecture. As the data is
physically separated from the analytical processing, sometimes the scope of analysis is
limited. In ROLAP tools, any change in the dimensional structure requires a physical
reorganization of the database, which is very time-consuming. Examples of ROLAP tools are
Information Advantage (Axsys), MicroStrategy (DSS Agent/DSS Server), Platinum/Prodea
Software (Beacon) and Sybase (HighGate Project).
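By contrast, a ROLAP tool leaves the data in ordinary relational tables and generates SQL over them. The following sketch (Python/sqlite3, with assumed table and column names) shows the kind of query such a tool might issue: the fact table is joined to its dimension tables and grouped by two dimensions to produce one multi-dimensional view.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER)")
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, time_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Books"), (2, "Games")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, 2022), (2, 2023)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 10.0), (1, 2, 20.0), (2, 2, 30.0)])

# A multi-dimensional view computed on the fly from two-dimensional
# relations: total sales per (category, year).
query = """
    SELECT p.category, t.year, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY p.category, t.year
"""
for row in cur.execute(query):
    print(row)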
Hybrid OLAP – HOLAP tools, also called Managed Query Environment (MQE) tools, are a
relatively recent development. These tools provide limited analysis capability, either directly
against RDBMS products, or by leveraging an intermediate MOLAP server. HOLAP tools
deliver selected data directly from the DBMS or via a MOLAP server to the desktop in the
form of a data cube where it is stored, analyzed and maintained locally. This reduces the
overhead of creating the structure each time the query is executed. Once the data is in the
data cube, users can perform multi-dimensional analysis against it. The architecture for
HOLAP tools is depicted in Figure 15.8.
The simplicity of installation and administration of HOLAP tools attracts vendors to promote
this technology with significantly reduced cost and maintenance. However, this architecture
results in significant data redundancy and may cause problems for networks that support
many users. In this technology, only a limited amount of data can be maintained efficiently.
Examples of HOLAP tools are Cognos Software’s PowerPlay, Andyne Software’s Pablo,
Dimensional Insight’s CrossTarget, and Speedware’s Media.
Data mining is the process of extracting valid, previously unknown, comprehensible and
actionable information from large databases and using it to make crucial business decisions.
Data mining is essentially a system that learns from existing data by scanning it and
discovers patterns and correlations between attributes.
Knowledge Discovery in Database (KDD) Vs. Data
Mining
The term KDD was formalized in 1989 to refer to the broad, high-level process of seeking
knowledge from data. The term data mining was coined later, and this technique is used to
present and analyse data for business decision-makers. In fact, data mining is one step in the
knowledge discovery process. There are various steps in the knowledge discovery process, known
as selection, preprocessing, transformation, data mining, and interpretation and evaluation.
These steps are described below.
1. In the selection step, the data that are relevant to some criteria are selected or
segmented.
2. Preprocessing is the data cleaning step where unnecessary information is removed.
3. In the transformation step, data is converted to a form that is suitable for data mining.
In this step, the data is made usable and navigable.
4. The data mining step is concerned with the extraction of patterns from the data.
5. In the interpretation and evaluation step, the patterns obtained in the data mining stage
are converted into knowledge, which is used to support decision-making.
KDD and data mining can be distinguished by using the following definitions [Fayyad et al.,
1996].
KDD is the process of identifying a valid, potentially useful and ultimately understandable
structure in data. KDD involves several steps as mentioned above.
Data mining is a step in the KDD process concerned with the algorithmic means by which
patterns or structures are enumerated from the data under acceptable computational
efficiency limitations.
The structures that are the outcomes of the data mining process must meet certain criteria so
that they can be considered as knowledge. These criteria are validity, understandability,
utility, novelty and interestingness.
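Before turning to the individual techniques, the five KDD steps listed above can be pictured as a small pipeline. The sketch below is purely illustrative: each stage is a placeholder function over invented purchase records, and the data mining stage performs only a trivial frequency count.

from collections import Counter

# Hypothetical raw records: (customer, item, amount); None marks bad data.
raw = [("c1", "bread", 2.0), ("c2", "milk", None), ("c1", "milk", 1.5),
       ("c3", "bread", 2.0), ("c2", "bread", 2.0)]

def selection(records):
    # Selection: keep the records relevant to the analysis (here, all of them).
    return list(records)

def preprocessing(records):
    # Preprocessing (data cleaning): drop records with missing values.
    return [r for r in records if None not in r]

def transformation(records):
    # Transformation: convert to a form suitable for mining (one item per row).
    return [item for (_, item, _) in records]

def data_mining(items):
    # Data mining: extract a simple pattern, the frequency of each item.
    return Counter(items)

def interpretation(patterns):
    # Interpretation and evaluation: turn the pattern into a usable statement.
    item, count = patterns.most_common(1)[0]
    return f"Most frequently purchased item: {item} ({count} purchases)"

print(interpretation(data_mining(transformation(preprocessing(selection(raw))))))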
Predictive Modeling
Predictive modeling is similar to the human learning experience in using observations to
form a model of the important characteristics of some phenomenon. Predictive modeling is
used to analyse an existing database to determine some essential features (model) about the
data set. The model is developed using a supervised learning approach that has two
different phases. These are training and testing. In the training phase, a model is developed
using a large sample of historical data, known as the training set. In the testing phase, the
model is tested on new, previously unseen, data to determine its accuracy and physical
performance characteristics. Two methods are used in predictive modeling. These
are classification and value prediction.
• Classification – Classification involves finding rules that partition the data into
disjoint groups. The input for the classification is the training data set, whose class
labels are already known. Classification analyses the training data set and constructs a
model based on the class label, and assigns a class label to the future unlabelled
records. There are two techniques for classification, known as tree
induction and neural induction. In tree induction technique, the classification is
achieved by using a decision tree, while in neural induction the classification is
achieved by using a neural network. A neural network contains a collection of
connected nodes with input, output and processing at each node. There may be several
hidden processing layers between the visible input and output layers. The decision tree
is the most popular technique for classification, and it presents the analysis in an
intuitive way.
• Value prediction – Value prediction is used to estimate a continuous numeric value
that is associated with a database record (a small regression sketch follows this list).
There are two different techniques for value prediction that use traditional statistical
methods, known as linear regression and non-linear regression. Both techniques are
well-known methods and are easy to understand. In linear regression, an attempt is made
to fit a straight line through the plotted data such that the line is the best
representation of the average of all observations at each point in the plot. The
limitation of linear regression is that the technique only works well with linear data
and is sensitive to the presence of outliers. Non-linear regression, too, is not flexible
enough to handle all possible shapes of the data plot. Applications of value prediction
include credit card fraud detection, target mailing list identification, etc.
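As a sketch of the value prediction technique referred to in the list above, the following fragment fits a straight line to a hypothetical set of observations using ordinary least squares and then predicts the value for a new record; the data are invented for illustration.

# Hypothetical observations: x = years as a customer, y = yearly spend.
xs = [1, 2, 3, 4, 5]
ys = [120.0, 150.0, 185.0, 210.0, 240.0]

# Ordinary least squares for a straight line y = slope * x + intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
slope = num / den
intercept = mean_y - slope * mean_x

# Value prediction for a previously unseen record (x = 6).
print(f"predicted spend: {slope * 6 + intercept:.2f}")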
Clustering
Clustering is a method of grouping data into an unknown number of groups, known as clusters or
segments, so that the data in a group represent similar trends and patterns. This approach uses
unsupervised learning to discover homogeneous subpopulations in a database and thereby improve
the accuracy of the profiles. Clustering constitutes a major class of data mining algorithms:
the algorithm attempts to partition the data space automatically into a set of clusters, either
deterministically or probabilistically. Clustering is less precise than the other operations and
is therefore less sensitive to redundant and irrelevant features.
Two clustering methods, demographic clustering and neural clustering, are distinguished by the
allowable input data, the method used to calculate the distance between records, and the
presentation of the resulting clusters for analysis.
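A minimal sketch of the clustering idea follows, using a hand-rolled one-dimensional k-means over invented spend values. Demographic and neural clustering tools are far more elaborate; this fragment only shows how records are grouped around automatically adjusted cluster centres.

# Tiny one-dimensional k-means: partition invented spend values into k clusters.
values = [10.0, 45.0, 80.0, 12.0, 11.0, 85.0, 82.0, 47.0]
k = 3
centres = values[:k]  # naive initialization from the first k values

for _ in range(20):  # a few refinement passes are enough for this data
    clusters = [[] for _ in range(k)]
    for v in values:
        # Assign each value to its nearest cluster centre.
        nearest = min(range(k), key=lambda i: abs(v - centres[i]))
        clusters[nearest].append(v)
    # Recompute each centre as the mean of its assigned values.
    centres = [sum(c) / len(c) if c else centres[i]
               for i, c in enumerate(clusters)]

print("centres:", [round(c, 1) for c in centres])
print("clusters:", clusters)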
Link Analysis
The objective of link analysis is to establish links, called associations, between the
individual records, or sets of records in a database. There are numerous applications of data
mining that fit into this framework, and one such popular application is market-basket
analysis. Three different techniques are used for link analysis, known as associations
discovery, sequential pattern discovery and similar time sequence discovery.
Associations discovery technique finds items that indicate the presence of other items in the
same event. These affinities between items are represented by association rules. In sequential
pattern discovery, patterns are searched between events such that the presence of one set of
items is followed by another set of items in a database of events over a period of time.
Similar time sequence discovery is used to discover links between two sets of time-dependent
data, and is based on the degree of similarity between the patterns that both time series
demonstrate.
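A minimal market-basket sketch of associations discovery: for one hypothetical rule, bread implies milk, the support and confidence are computed by simple counting over invented baskets. Real tools (for example, Apriori-style algorithms) search all sufficiently frequent item sets rather than a single hand-picked pair.

# Hypothetical market baskets (one set of items per customer transaction).
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

antecedent, consequent = "bread", "milk"
n = len(baskets)
with_antecedent = sum(1 for b in baskets if antecedent in b)
with_both = sum(1 for b in baskets if antecedent in b and consequent in b)

# Support: fraction of all transactions containing both items.
# Confidence: fraction of transactions with the antecedent that also
# contain the consequent.
support = with_both / n
confidence = with_both / with_antecedent
print(f"{antecedent} => {consequent}: support={support:.2f}, confidence={confidence:.2f}")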
Deviation Detection
Deviation detection identifies outlying points in a particular data set and explains whether
they are due to noise or other impurities being present in the data or due to trivial reasons.
Applications of deviation detection involve forecasting, fraud detection, customer retention
etc. Deviation detection is often a source of true discovery because it identifies outliers that
express deviation from some previously known expectation or norm. Deviation detection can
be performed by using statistical and visualization techniques or as a by-product of data
mining.
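A minimal sketch of deviation detection using a simple statistical test: values lying more than a chosen number of standard deviations from the mean are flagged as outliers. The data and the threshold are invented for illustration; real applications typically combine such tests with visualization or with the other data mining operations.

import statistics

# Hypothetical daily transaction amounts; one value deviates sharply.
amounts = [100.2, 99.5, 101.1, 98.7, 100.9, 99.8, 100.4, 101.6,
           98.9, 100.0, 99.3, 250.0]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)
threshold = 2.0  # flag points more than 2 standard deviations away

outliers = [x for x in amounts if abs(x - mean) / stdev > threshold]
print(f"mean={mean:.1f} stdev={stdev:.1f} outliers={outliers}")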
Chapter Summary
• A data warehouse is a subject-oriented, time-variant, integrated, non-volatile
repository of information for strategic decision-making.
• A data web house is a distributed data warehouse that is implemented on the Web
with no central data repository.
• The components of a data warehouse are operational data source, load manager, query
manager, warehouse manager, detailed data, summarized data, archive/backup data,
metadata, end-user access tools and data warehouse background processes.
• The three different types of database schemas that are used to represent data in
warehouses are star schema, snowflake schema and fact constellation schema.
• A subset of a data warehouse that supports the requirements of a particular department
or business function is called a data mart.
• OLAP is the dynamic synthesis, analysis and consolidation of large volumes of multi-
dimensional data. There are three different categories of OLAP tools, which are
MOLAP tools, ROLAP tools and HOLAP tools.
• Data mining is the process of extracting valid, previously unknown, comprehensible
and actionable information from large volumes of multi-dimensional data and it is
used to make crucial business decisions. The different data mining techniques are
predictive modeling, database segmentation or clustering, link analysis and deviation
detection.
Bibliography
[bib01_01] Aho A., Y. Sagiv, and J.D. Ullman, “Equivalence Among Relational
Expressions”, SIAM Journal on Computing, Vol. 8, Issue 2, pp 218–246, 1979.
[bib01_04] Apers P.M.G., A.R. Hevner, and S.B. Yao, “Optimization Algorithms for
Distributed Queries”, IEEE Transactions on Software Engineering, Vol. 9, Issue 1, pp
57–68, 1983.
[bib01_05] Astrahan M.M., M.W. Blasgen, D.D. Chamberlin, K.P. Eswaran, J.N. Gray,
P.P. Griffiths, W.F. King, R.A. Lorie, P.R. McJones, J.W. Mehl, G.R. Putzolu, I.L.
Traiger, B.W. Wade, and V. Watson, “System R: A Relational Database Management
System”, ACM Transactions on Database System, Vol. 1, Issue 2, pp 97–137, June 1976.
[bib01_06] Atkinson M., F. Bancilhon, D.J. Dewitt, K. Dittrich, D. Maier, and S. Zdonik,
“The Object-Oriented Database System Manifesto”, Proceedings of 1st International
Conference on Deductive and Object-Oriented Databases, pp 40–57, December 1989.
[bib01_07] Beeri C., P.A. Bernstein, and N. Goodman, “A Model for Concurrency in
Nested Transaction Systems”, Journal of the ACM, Vol. 36, Issue 2, pp 230–269, April
1989.
[bib01_12] Bernstein P.A., N. Goodman, E. Wong, C.L. Reeve, and J.B. Rothnie, “Query
Processing in a System for Distributed Databases (SDD-1)”, ACM Transactions on
Database Systems, Vol. 6, Issue 4, pp 602–625, December 1981.
[bib01_13] Bernstein P.A., D.W. Shipman, and J.B. Rothnie, “Concurrency Control
in a System for Distributed Databases (SDD-1)”, ACM Transactions on Database
Systems, Vol. 5, Issue 1, pp 19–51, March 1980.
[bib01_14] Berson A. and S.J. Smith, Data Warehousing, Data Mining & OLAP, New
Delhi, Tata McGraw-Hill edition, ISBN 0–07–006272–2, 2004.
[bib01_18] Bright M., A. Hurson, and S.H. Pakzad, “A Taxonomy and Current Issues
in Multidatabase Systems”, IEEE Computer, Vol. 25, Issue 3, pp 50–60,
March 1992.
[bib01_20] Ceri S. and G. Pelagatti, Distributed Databases – Principles & Systems, New
York, McGraw-Hill International Edition, Computer Science Series, 1985.
[bib01_21] Ceri S., M. Negri, and G. Pelagatti, “Horizontal Data Partitioning in Database
Design”, Proceedings of ACM SIGMOD International Conference on Management of
Data, pp 128–136, June 1982.
[bib01_22] Ceri S. and S. Owicki, “On the Use of Optimistic Methods for Concurrency
Control in Distributed Databases”, Proceedings of 6th Berkeley Workshop on Distributed
Data Management and Computer Networks, pp 117–130, February 1982.
[bib01_23] Chamberlin D. et al., “Support for Repetitive Transactions and Ad Hoc Queries
in System R”, ACM Transactions on Database Systems, Vol. 6, Issue 1, 1981.
[bib01_24] Chiu D.M. and Y.C. Ho, “A Methodology for Interpreting Tree Queries into
Optimal Semi-join Expressions”, Proceedings of ACM SIGMOD International
Conference on Management of Data, pp 169–178, May 1980.
[bib01_25] Codd E., “Twelve Rules for On-line Analytical Processing”, Computer
World, April 1995.
[bib01_26] Connolly T. and C. Begg, Database Systems – A Practical Approach to
Design, Implementation and Management, New Delhi, Pearson Education, 2003.
[bib01_29] Date C.J., An Introduction to Database Systems, 7th edition, Reading, MA,
Addison-Wesley, 2000.
[bib01_32] Dewitt D.J. and J. Gray, “Parallel Database Systems: The Future of High
Performance Database Systems”, ACM Communications, Vol. 35, Issue 6, pp 85–98, June
1992.
[bib01_33] Elmasri R. and S.B. Navathe, Fundamentals of Database Systems, 2nd edition,
Menlo Park, CA, Benjamin-Cummings, 1994.
[bib01_36] Fernandez E.B., R.C. Summers, and C. Wood, Database Security and
Integrity, Addison-Wesley, 1981.
[bib01_38] Forouzan B.A., Data Communications and Networking, New Delhi, Tata
McGraw-Hill Edition, ISBN 0–07–043563–4, 2000.
[bib01_43] Gray J.N., R.A. Lorie, G.R. Putzolu, and I.L. Traiger, “Granularity of Locks
and Degrees of Consistency in a Shared Data Base”, in G.N. Nijssen (ed.), Modelling in
Data Base Management Systems, Amsterdam, North Holland, pp 365–394, 1976.
[bib01_48] Hevner A.R. and S.B. Yao, “Query Processing in Distributed Database
Systems”, IEEE Transactions on Software Engineering, Vol. 5, Issue 3, pp 177–182,
March 1979.
[bib01_49] Hoffer H.A. and D.G. Severance, “The Use of Cluster Analysis in Physical
Database Design”, Proceedings of 1st International Conference on Very Large Data
Bases, pp 69–86, September 1975.
[bib01_51] Inmon W.H., Building the Data Warehouse, New York, John Wiley and Sons,
1993.
[bib01_55] Lin W.K. and J. Notle, “Basic Timestamp, Multiple Version Timestamp, and
Two-Phase Locking”, Proceedings of 9th International Conference on Very Large Data
Bases, pp 109–119, October–November 1983.
[bib01_56] Lohman G. and L.F. Mackert, “R* Optimizer Validation and Performance
Evaluation for Distributed Queries”, Proceedings of 11th International Conference on
Very Large Data Bases, pp 149–159, August 1986.
[bib01_63] Pujari A.K., Data Mining Techniques, Hyderabad India, Universities Press,
ISBN 81–7371–380–4, 2001.
[bib01_69] Ray C., S. Chattopadhyay, and S. Bhattacharya, “Binding XML Data with
Constraint Information into Java Object Model”, International Workshop on CCCT,
2004, University of Texas, Austin, USA, August 2004.
[bib01_73] Ray C., S. Tripathi, A. Chatterjee, and A. Das, “An efficient Bi-directional
String Matching Algorithm for Statistical Estimation”, International Symposium on Data,
Information & Knowledge Spectrum (ISDIKS07), Kochi, India, December 2007, pp 73–
79.
[bib01_74] Rosenkrantz D.J. and H.B. Hunt, “Processing Conjunctive Predicates and
Queries”, Proceedings of 6th International Conference on Very Large Data Bases, pp 64–
72, October, 1980.
[bib01_75] Rothnie J.B., Jr. P.A. Bernstein, S. Fox, N. Goodman, M. Hammer, T.A.
Landers, C. Reeve, D.W. Shipman, and E. Wong, “Introduction to a System for
Distributed Databases (SDD-1)”, ACM Transactions on Database Systems, Vol. 5, Issue
1, pp 1–17, March 1980.
[bib01_77] Sacco M.S. and S.B. Yao, “Query Optimization in Distributed Database
Management Systems”, in M.C. Yovits (ed.), Advances in Computers, New York
Academic Press, Vol. 21, pp 225–273, 1982.
[bib01_78] Selinger P.G. and M. Adiba, “Access Path Selection in Distributed Database
Management Systems”, Proceedings of 1st International conference on Databases, pp
204–215, 1980.
[bib01_79] Sheth A.P. and J.A. Larson, “Federated Database Systems for Managing
Distributed, Heterogeneous, and Autonomous Databases”, ACM Computing Surveys, Vol.
22, Issue 3, pp 183–236, September 1990.
[bib01_80] Sheth A.P. and J.A. Larson, “Federated Databases: Architectures and
Integration”, ACM Computing Surveys, Special Issue on Heterogeneous Databases,
September 1990.
[bib01_81] Silberschatz A., H.F. Korth, and S. Sudarshan, Database System Concepts,
New York, McGraw-Hill Publication, 1986.
[bib01_83] Smith J.M., P.A. Bernstein, U. Dayal, N. Goodman, T. Landers, K.Lin, and E.
Wong, “Multibase: Integrating Heterogeneous Distributed Database
Systems”, Proceedings of National Computer Conference, pp 487–499, May 1981.
[bib01_85] Stonebraker M., P. Kreps, W. Wong, and G. Held, “The Design and
Implementation of INGRES”, ACM Transactions on Database Systems, Vol. 1 Issue 3,
pp 198–222, September 1976.
[bib01_86] Tanenbaum A.S., Computer Networks, 3rd edition, Englewood Cliffs, NJ,
Prentice-Hall, 1997.
[bib01_90] Yu C.T. and C.C. Chang, “Distributed Query Processing”, ACM Computing
Surveys, Vol. 16, Issue 4, pp 399–433, December 1984.