
UNIT-I DISTRIBUTED DATA PROCESSING:

Distributed Database:
A distributed database is a collection of multiple interconnected databases, which are spread
physically across various locations that communicate via a computer network.

Features
 Databases in the collection are logically interrelated with each other. Often they represent
a single logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not share a
multiprocessor configuration; each site is an independent computer.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous with
a transaction processing system.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed and accessed by
numerous users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1. Replication –
In this approach, the entire relation is stored redundantly at 2 or more sites. If the entire
database is available at all sites, it is a fully redundant database. Hence, in replication, systems
maintain copies of data.
This is advantageous as it increases the availability of data at different sites, and query
requests can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly kept up to date: any
change made at one site must be recorded at every site where the relation is stored, or else it
may lead to inconsistency. This is a lot of overhead. Also, concurrency control becomes far more
complex, as concurrent access now needs to be checked over a number of sites.
2. Fragmentation –
In this approach, the relations are fragmented (i.e., divided into smaller parts) and
each fragment is stored at the different sites where it is required. It must be ensured
that the fragments can be used to reconstruct the original relation (i.e., there
is no loss of data).
Fragmentation is advantageous because it does not create copies of data, so consistency is not a
problem.

Fragmentation of relations can be done in two ways:

 Horizontal fragmentation – Splitting by rows –
The relation is fragmented into groups of tuples so that each tuple is assigned to at least
one fragment.
 Vertical fragmentation – Splitting by columns –
The schema of the relation is divided into smaller schemas. Each fragment must contain a
common candidate key so as to ensure a lossless join.
In certain cases, a hybrid approach combining fragmentation and replication is used.
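To make the two fragmentation styles concrete, here is a minimal Python sketch (purely illustrative: the EMPLOYEE relation, its attributes and values are made up, and no real DBMS fragments data this way in memory) showing row-wise splitting, column-wise splitting around a shared key, and lossless reconstruction:

# Sketch: fragmenting a relation held as a list of dicts (hypothetical data).
employee = [
    {"eno": 1, "name": "Asha", "dept": "CS", "salary": 50000},
    {"eno": 2, "name": "Ravi", "dept": "EEE", "salary": 45000},
]

# Horizontal fragmentation: split the rows by a predicate.
frag_cs = [t for t in employee if t["dept"] == "CS"]
frag_rest = [t for t in employee if t["dept"] != "CS"]
# The union of the fragments reconstructs the original relation.
assert sorted(frag_cs + frag_rest, key=lambda t: t["eno"]) == employee

# Vertical fragmentation: split the columns; each fragment keeps the key "eno".
frag_personal = [{"eno": t["eno"], "name": t["name"], "dept": t["dept"]} for t in employee]
frag_pay = [{"eno": t["eno"], "salary": t["salary"]} for t in employee]

# Reconstruction by a lossless join on the shared key.
pay_by_eno = {t["eno"]: t["salary"] for t in frag_pay}
rebuilt = [dict(p, salary=pay_by_eno[p["eno"]]) for p in frag_personal]
assert rebuilt == employee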
Applications of Distributed Database:
 It is used in corporate management information systems.
 It is used in multimedia applications.
 It is used in military control systems, hotel chains, etc.
 It is also used in manufacturing control systems.

Advantages of Distributed Database System :


1) There is fast data processing, as several sites participate in request processing.
2) Reliability and availability of the system are high.
3) It has reduced operating costs.
4) It is easier to expand the system by adding more sites.
5) It has improved sharing ability and local autonomy.
Disadvantages of Distributed Database System :
1) The system becomes complex to manage and control.
2) The security issues must be carefully managed.
3) The system requires deadlock handling during transaction processing; otherwise,
the entire system may end up in an inconsistent state.
4) There is a need for standardization in the processing of distributed database
systems.
Types Of DDBMS:
Distributed databases can be broadly classified into homogeneous and heterogeneous distributed
database environments, each with further sub-divisions, as discussed below.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating systems.
Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
 Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems, DBMS
products and data models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing
user requests.
Types of Heterogeneous Distributed Databases
 Federated − The heterogeneous database systems are independent in nature and
integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.
Problem Areas of Distributed Database Systems:
Distributed Database Systems are a kind of DBMS where databases are present at different
locations and connected via a network. Each site in a distributed database is capable of
accessing and processing local data as well as remote data. Although a distributed DBMS is
capable of effective communication and data sharing, it still suffers from the various
disadvantages given below.
 Complex nature :
Distributed databases are a network of many computers present at different locations,
and they provide an outstanding level of performance, availability, and of course
reliability. Therefore, the nature of a distributed DBMS is comparatively more complex
than a centralized DBMS, and complex software is required. Managing the replication
of data across sites adds even more complexity to its nature.

 Overall Cost :
Various costs such as maintenance cost, procurement cost, hardware cost,
network/communication costs, labor costs, etc., add up to the overall cost, making it
costlier than a normal DBMS.

 Security issues:
In a distributed database, along with controlling data redundancy, the security of the
data as well as of the network is a prime concern, since a network can easily be attacked
for data theft and misuse.

 Integrity Control:
In a vast distributed database system, maintaining data consistency is important. All
changes made to data at one site must be reflected at all the sites. The communication
and processing cost of enforcing the integrity of data is high in a distributed DBMS.

 Lacking Standards:
Although it provides effective communication and data sharing, there are still no
standard rules and protocols for converting a centralized DBMS to a large distributed
DBMS. This lack of standards decreases the potential of distributed DBMSs.
 Lack of Professional Support:
Due to a lack of adequate communication standards, it is not possible to link different
equipment produced by different vendors into a smoothly functioning network. Thus
several good resources may not be available to the users of the network.
 Complex data design:
Designing a distributed database is more complex than designing a centralized database.

Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the degree
to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.
Architectural Models
Some of the common architectural models are −

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture
Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions include mainly user interface. However, they have
some functions like consistency checking and transaction management.
The two different client-server architectures are −

 Single Server Multiple Client
 Multiple Server Multiple Client

Peer-to-Peer Architecture for DDBMS
In these systems, each peer acts both as a client and a server for imparting database services. The
peers share their resources with other peers and coordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.

Multi - DBMS Architectures


This is an integrated database system formed by a collection of two or more autonomous
database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views, each comprising a subset of the
integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database, comprising the
global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.


 Model without multi-database conceptual level.

Transparency:
Transparency in a DDBMS refers to hiding the details of data distribution from the
user, so that the system appears as a single database; the user does not need to know how the
distribution is implemented.
In Distributed Database Management System, there are four types of transparencies, which
are as follows –
 Transaction transparency
 Performance transparency
 DBMS transparency
 Distribution transparency

1. Transaction transparency-
This transparency makes sure that all the transactions that are distributed preserve
distributed database integrity and regularity. Also, it is to understand that distribution
transaction access is the data stored at multiple locations It is very complex due to the use
of fragmentation, allocation, and replication structure of DBMS.

2. Performance transparency-
This transparency requires a DDBMS to work in a way that if it is a centralized database
management system. Also, the system should not undergo any downs in performance as
its architecture is distributed. Likewise,. This has another complexity to take under
consideration which is the fragmentation, replication, and allocation structure of DBMS.

3. DBMS transparency-
This transparency is only applicable to heterogeneous DDBMSs (databases whose
sites use different operating systems, products, and data models), as it
hides the fact that the local DBMSs may be different. It is one of the most
complicated transparencies to provide in a generalized way.

4. Distribution transparency-
Distribution transparency helps the user to recognize the database as a single thing or a
logical entity, and if a DDBMS displays distribution data transparency, then the user does
not need to know that the data is fragmented.
Distribution transparency has five types, which are discussed below –
 Fragmentation transparency-
In this type of transparency, the user does not have to know about fragmented data;
database accesses are based on the global schema.
 Location transparency-
If this type of transparency is provided by DDBMS, then it is necessary for the user to
know how the data has been fragmented, but knowing the location of the data is not
necessary.
 Replication transparency-
In replication transparency, the user does not know about the copying of fragments.
Replication transparency is related to concurrency transparency and failure transparency.
 Local Mapping transparency-
In local mapping transparency, the user needs to specify both the fragment names and the
locations of the data items, taking into account any duplications that may exist. This makes
queries more difficult and time-consuming for the user to formulate.
 Naming transparency-
As in a centralized DBMS, each item in a distributed database must have a unique name.
Naming transparency relieves the user of constructing such unique, site-qualified names.
Global Directory Issues:
A global directory is an extension of the normal directory that, for a distributed DBMS or a
multi-DBMS using a global conceptual schema, also includes information about the location and
the makeup of the fragments.
• Relevant for a distributed DBMS or a multi-DBMS that uses a global conceptual schema.
• Includes information about the location of the fragments as well as the makeup of the
fragments.
• The directory is itself a database that contains meta-data about the actual data stored in the
database.
Three issues −
 A directory may either be global to the entire database or local to each site.
 A directory may be maintained centrally at one site, or in a distributed fashion by
distributing it over a number of sites. If the system is distributed, the directory is always
distributed.
 Replication may be a single copy or multiple copies; multiple copies provide more
reliability.
UNIT-II DISTRIBUTED DATABASE DESIGN
DISTRIBUTED DATABASE DESIGN:

Distributed database design strategies can be broadly divided into replication and
fragmentation. However, in most cases, a combination of the two is used.
Data Replication
Data replication is the process of storing separate copies of the database at two or more sites. It is
a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
 Reliability − In case of failure of any site, the database system continues to work since a
copy is available at another site(s).
 Reduction in Network Load − Since local copies of data are available, query processing
can be done with reduced network usage, particularly during prime hours. Data updating
can be done at non-prime hours.
 Quicker Response − Availability of local copies of data ensures quick query processing
and consequently quick response time.
 Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become simpler
in nature.
Disadvantages of Data Replication
 Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
 Increased Cost and Complexity of Data Updating − Each time a data item is updated,
the update needs to be reflected in all the copies of the data at the different sites. This
requires complex synchronization techniques and protocols.
 Undesirable Application – Database coupling − If complex update mechanisms are not
used, removing data inconsistency requires complex co-ordination at application level.
This results in undesirable application – database coupling.
Some commonly used replication techniques are −

 Snapshot replication
 Near-real-time replication
 Pull replication
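As a rough illustration of the update overhead described above, the following Python sketch models eager (synchronous) propagation of a write to every copy; the site names and data are hypothetical, and a real system would do this over the network with commit protocols:

# Sketch: fully replicated data, with writes pushed to every copy.
sites = {"site_A": {}, "site_B": {}, "site_C": {}}  # each dict is one site's copy

def replicated_write(key, value):
    # Every copy must be updated before the write is acknowledged;
    # this is the overhead replication pays for availability.
    for copy in sites.values():
        copy[key] = value

def local_read(site, key):
    # Any site can answer reads from its own copy (and in parallel).
    return sites[site].get(key)

replicated_write("emp:1", "Asha")
assert local_read("site_B", "emp:1") == local_read("site_C", "emp:1") == "Asha"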
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table
are called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid
(combination of horizontal and vertical). Horizontal fragmentation can further be classified into
two techniques: primary horizontal fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from
the fragments whenever required. This requirement is called “reconstructiveness.”
Advantages of Fragmentation
 Since data is stored close to the site of usage, efficiency of the database system is
increased.
 Local query optimization techniques are sufficient for most queries since data is locally
available.
 Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
 When data from different fragments are required, the access speeds may be very low.
 In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
 Lack of back-up copies of data in different sites may render the database ineffective in
case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to
maintain reconstructiveness, each fragment should contain the primary key field(s) of the table.
Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students in
a Student table having the following schema.
STUDENT

Regd_No Name Course Address Semester Fees Marks

Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance to values of one or more
fields. Horizontal fragmentation should also confirm to the rule of reconstructiveness. Each
horizontal fragment must have all columns of the original base table.
For example, in the student schema, if the details of all students of the Computer Science course
need to be maintained at the School of Computer Science, then the designer will horizontally
fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE Course = 'Computer Science';
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques are
used. This is the most flexible fragmentation technique since it generates fragments with minimal
extraneous information. However, reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
 At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
 At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.
Semantic data control in distributed databases:
 Semantic data is information that allows a machine to understand the meaning of
information.
 It describes the technology and methods that convey the meaning of information.
 Semantic data systems are designed to represent the real world as accurately as
possible within the dataset.
 Data modeling is the process of creating and extending data models, which are visual
representations of data and its organization.
Example: ER diagrams
 Semantic data controls are similar to the ones in a centralized database system.
 They must, however, consider the fragmentation of relations and the distribution of
fragments across multiple sites.
 Semantic data control includes:
 View management
 Data security
 Semantic integrity control

View management:
View management in a distributed DBMS:

 The definition of views in a DDBMS is similar to that in a centralized DBMS.


 However, a view in a DDBMS may be derived from fragmented relations stored at

different sites.

• Views are conceptually the same as the base relations, therefore we store them in the (possibly)
distributed directory.

– Thus, views might be centralized at one site, partially replicated, fully replicated

– Queries on views are translated into queries on base relations, yielding distributed queries due
to possible fragmentation of data.

• Views derived from distributed relations may be costly to evaluate

– Optimizations are important, e.g., snapshots

– A snapshot is a static view

∗ does not reflect the updates to the base relations

∗ managed as temporary relations: the only access path is sequential scan


∗ typically used when selectivity is small (no indices can be used efficiently)

∗ is subject to periodic recalculation

Semantic integrity control:

Semantic integrity control defines and enforces the integrity
constraints of the database system.
The integrity constraints are as follows –
 Data type integrity constraint
 Entity integrity constraint
 Referential integrity constraint
 Data Type Integrity Constraint
A data type constraint restricts the range of values and the type of operations that can be applied
to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel number, hostel
name and capacity. The hostel number should start with capital letter "H" and cannot be NULL,
and the capacity should not be more than 150. The following SQL command can be used for data
definition
CREATE TABLE HOSTEL (
    H_NO VARCHAR2(5) NOT NULL,
    H_NAME VARCHAR2(15),
    CAPACITY INTEGER,
    CHECK ( H_NO LIKE 'H%' ),
    CHECK ( CAPACITY <= 150 )
);
Entity Integrity Control :
Entity integrity control enforces the rules so that each tuple can be uniquely identified from other
tuples. For this a primary key is defined. A primary key is a set of minimal fields that can
uniquely identify a tuple. Entity integrity constraint states that no two tuples in a table can have
identical values for primary keys and that no field which is a part of the primary key can have
NULL value
For example, in the above hostel table, the hostel number can be assigned as the primary key
through the following SQL statement
CREATE TABLE HOSTEL (
    H_NO VARCHAR2(5) PRIMARY KEY,
    H_NAME VARCHAR2(15),
    CAPACITY INTEGER
);
Referential Integrity Constraint:
Referential integrity constraint lays down the rules of foreign keys. A foreign key is a field in a
data table that is the primary key of a related table. The referential integrity constraint lays down
the rule that the value of the foreign key field should either be among the values of the primary
key of the referenced table or be entirely NULL.
For example, let us consider a student table where a student may opt to live in a hostel. To
include this, the primary key of hostel table should be included as a foreign key in the student
table. The following SQL statement incorporates this
CREATE TABLE STUDENT (
    S_ROLL INTEGER PRIMARY KEY,
    S_NAME VARCHAR2(25) NOT NULL,
    S_COURSE VARCHAR2(10),
    S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);
Query processing issues:
In a distributed database system, processing a query comprises optimization at both the global
and the local level. The query enters the database system at the client or controlling site. Here,
the user is validated, and the query is checked, translated, and optimized at a global level.

Mapping Global Queries into Local Queries


The process of mapping global queries to local ones can be realized as follows −
 The tables required in a global query have fragments distributed across multiple sites.
The local databases have information only about local data. The controlling site uses the
global data dictionary to gather information about the distribution and reconstructs the
global view from the fragments.
 If there is no replication, the global optimizer runs local queries at the sites where the
fragments are stored. If there is replication, the global optimizer selects the site based
upon communication cost, workload, and server speed.
 The global optimizer generates a distributed execution plan so that the least amount of
data transfer occurs across the sites. The plan states the location of the fragments, the
order in which the query steps need to be executed, and the processes involved in
transferring intermediate results.
 The local queries are optimized by the local database servers. Finally, the local query
results are merged together through union operation in case of horizontal fragments and
join operation for vertical fragments.

Objectives of Query Processing:

The main objective of query processing in a distributed
environment is to transform a high-level query on a distributed database, which is seen as a single
database by the users, into an efficient execution strategy expressed in a low-level language on
local databases.

An important aspect of query processing is query optimization. Because many execution
strategies are correct transformations of the same high-level query, the one that optimizes
(minimizes) resource consumption should be retained.

Good measures of resource consumption are:

o The total cost that will be incurred in processing the query. It is the sum of all costs incurred
in processing the operations of the query at the various sites and in intersite communication.

o The response time of the query. This is the elapsed time for executing the query. Since
operations can be executed in parallel at different sites, the response time of a query may be
significantly less than its total cost. Obviously, the total cost should be minimized.

o In a distributed system, the total cost to be minimized includes CPU, I/O, and communication
costs. This cost can be reduced by cutting the number of I/O operations through fast access
methods to the data and efficient use of main memory.

The communication cost is the time needed for exchanging data between the sites participating
in the execution of the query.

o In centralized systems, only CPU and I/O costs have to be considered.
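These cost components are often combined into a single weighted-sum total-cost function. The following formulation is a common one in distributed query optimization texts; the weight symbols are illustrative names, not notation used elsewhere in these notes:

Total cost = T_CPU × (number of CPU instructions)
           + T_I/O × (number of disk I/Os)
           + T_MSG × (number of messages sent)
           + T_TR × (number of bytes transmitted)

Here T_CPU, T_I/O, T_MSG and T_TR are the unit costs of executing one instruction, performing one disk I/O, initiating one message, and transmitting one byte, respectively. On a WAN the last two terms usually dominate; in a centralized system only the first two apply.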

Layers of Query Processing


Query processing has four layers:
• Query Decomposition
• Data Localization
• Global Query Optimization
• Distributed Query Execution
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing
the global relations.
• Query decomposition can be viewed as four successive steps:
• Normalization
• Analysis
• Simplification
• Restructuring

Data Localization
• The input to the second layer is an algebraic query on global relations. The main role of the
second layer is to localize the query's data using data distribution information in the fragment
schema.
• This layer determines which fragments are involved in the query and transforms the distributed
query into a query on fragments.

Generating a query on fragments is done in two steps
• First, the query is mapped into a fragment query by substituting each relation by its
reconstruction program (also called materialization program).
• Second, the fragment query is simplified and restructured to produce another "good" query.
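For example (an illustrative fragment schema, not one defined elsewhere in these notes), suppose EMP is horizontally fragmented by department into EMP1 = σ DEPT='CS' (EMP) and EMP2 = σ DEPT≠'CS' (EMP). A query on EMP is first mapped onto the fragments by substituting the reconstruction program, and then simplified:

σ DEPT='CS' (EMP)
→ σ DEPT='CS' (EMP1 ∪ EMP2)                  (substitute the reconstruction program)
→ σ DEPT='CS' (EMP1) ∪ σ DEPT='CS' (EMP2)    (push the selection through the union)
→ EMP1                                       (the selection on EMP2 yields nothing, by EMP2's fragment predicate)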

Global Query Optimization


• The input to the third layer is an algebraic query on fragments. The goal of query optimization
is to find an execution strategy for the query which is close to optimal.
• The previous layers have already optimized the query, for example, by eliminating redundant
expressions. However, this optimization is independent of fragment characteristics such as
fragment allocation and cardinalities.
Query optimization consists of finding the "best" ordering of operators in the query, including
communication operators that minimize a cost function.

• The output of the query optimization layer is an optimized algebraic query with communication
operators included on fragments. It is typically represented and saved (for future executions) as a
distributed query execution plan.

Distributed Query Execution


• The last layer is performed by all the sites that hold fragments involved in the query.
• Each subquery executing at one site, called a local query, is then optimized using the local
schema of the site and executed.
Characterization of Query Processors

The first four characteristics hold for both centralized and distributed query processors while the
next four characteristics are particular to distributed query processors in tightly-integrated
distributed DBMSs.

Languages

Types of Optimization

Optimization Timing

Statistics

Decision Sites

Exploitation of the Network Topology


Exploitation of Replicated Fragments

Use of Semijoins

Types of Optimization

Exhaustive search

query optimization aims at choosing the “best” point in the solution space of all possible
execution strategies.

search the solution space to predict the cost of each strategy

select the strategy with minimum cost.

Although this method is effective in selecting the best strategy, it may incur a significant
processing cost for the optimization itself.

The problem is that the solution space can be large; that is, there may be many equivalent
strategies, even with a small number of relations.
Heuristics

popular way of reducing the cost of exhaustive search

restrict the solution space so that only a few strategies are considered

regroup common subexpressions

perform selection, projection first

replace a join by a series of semijoins

reorder operations to reduce intermediate relation size

optimize individual operations to minimize data communication.


Randomized strategies

Find a very good solution, not necessarily the best one, but avoid the high cost of optimization
in terms of memory and time consumption.

Optimization Timing

Optimization can be done statically before executing the query or dynamically as the query is
executed.

Static
Static query optimization is done at query compilation time.

Thus the cost of optimization may be amortized over multiple query executions.

This timing is appropriate for use with the exhaustive search method.

Since the sizes of the intermediate relations of a strategy are not known until run time, they must
be estimated using database statistics.

Dynamic

run time optimization

database statistics are not needed to estimate the size of intermediate results

The main advantage over static query optimization is that the actual sizes of intermediate
relations are available to the query processor, thereby minimizing the probability of a bad choice.

The main shortcoming is that query optimization, an expensive task, must be repeated for each
execution of the query. Therefore, this approach is best for ad-hoc queries.
Hybrid

Provides the advantages of static query optimization.

The approach is basically static, but dynamic query optimization may take place at run time
when a high difference between the predicted sizes and the actual sizes of intermediate relations
is detected.

If the error in the estimated sizes exceeds a threshold, reoptimize at run time.
Statistics

The effectiveness of query optimization relies on statistics on the database.

Dynamic query optimization requires statistics in order to choose which operators should be
done first.

Static query optimization is even more demanding since the size of intermediate relations must
also be estimated based on statistical information.

statistics for query optimization typically bear on fragments, and include fragment cardinality
and size as well as the size and number of distinct values of each attribute.

To minimize the probability of error, more detailed statistics such as histograms of attribute
values are sometimes used.

The accuracy of statistics is achieved by periodic updating.

With static optimization, significant changes in the statistics used to optimize a query might
result in query reoptimization.
Decision Sites

Centralized decision approach

A single site generates the strategy, that is, determines the “best” schedule. This is simpler, but
it needs knowledge about the entire distributed database.

Distributed decision approach

Cooperation among the various sites determines the schedule (elaboration of the best strategy);
each site needs only local information.

Hybrid decision approach

One site makes the major decisions, that is, determines the global schedule, while the other
sites make local decisions, that is, optimize the local subqueries.
Network Topology

Distributed query optimization can be divided into two separate problems:

selection of the global execution strategy, based on intersite communication, and selection of
each local execution strategy, based on a centralized query processing algorithm.

Wide area networks (WAN) – point-to-point

communication cost will dominate; ignore all other cost factors

global schedule to minimize communication cost

local schedules according to centralized query optimization

UNIT-III QUERY OPTIMIZATION

DDBMS Query Optimization in Centralized Systems

The optimal access path is determined once the alternative access paths are derived for
processing a relational algebra expression. We first discuss query optimization in a centralized
system, and then query optimization in a distributed system.

Query processing is performed in a centralized system with the following goals:

 Minimize query response time (the time taken to generate user query results).
 Maximize the throughput of the system (the number of requests handled within a given
amount of time).
 Reduce the amount of storage and memory required for processing.
 Increase parallelism.

Query Parsing and Translation


Initially, the SQL query is scanned. Then it is parsed to look for syntactical errors and
correctness of data types. If the query passes this step, the query is decomposed into smaller
query blocks. Each block is then translated to equivalent relational algebra expression.
Steps for Query Optimization
Query optimization involves three steps, namely query tree generation, plan generation, and
query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The tables of the
query are represented as leaf nodes. The relational algebra operations are represented as the
internal nodes. The root represents the query as a whole.
During execution, an internal node is executed whenever its operand tables are available. The
node is then replaced by the result table. This process continues for all internal nodes until the
root node is executed and replaced by the result table.
Approaches to Query Optimization
Among the approaches for query optimization, exhaustive search and heuristics-based algorithms
are mostly used.
Exhaustive Search Optimization
In these techniques, for a query, all possible query plans are initially generated and then the best
plan is selected. Though these techniques provide the best solution, it has an exponential time
and space complexity owing to the large solution space. For example, dynamic programming
technique.
Heuristic Based Optimization
Heuristic based optimization uses rule-based optimization approaches for query optimization.
These algorithms have polynomial time and space complexity, which is lower than the
exponential complexity of exhaustive search-based algorithms. However, these algorithms do not
necessarily produce the best query plan.
Some of the common heuristic rules are −
 Perform select and project operations before join operations. This is done by moving the
select and project operations down the query tree. This reduces the number of tuples
available for join.
 Perform the most restrictive select/project operations at first before the other operations.
 Avoid cross-product operations, since they result in very large intermediate tables.
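For instance, using the STUDENT and HOSTEL tables from the integrity-control examples earlier in these notes, a selection that mentions only HOSTEL attributes can be pushed below the join:

π S_NAME ( σ CAPACITY>100 ( STUDENT ⋈ HOSTEL ) )
→ π S_NAME ( STUDENT ⋈ σ CAPACITY>100 ( HOSTEL ) )

The selection now shrinks the HOSTEL operand before the join is evaluated, so far fewer tuples take part in the join.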
Distributed query optimization algorithms:

1. Semi-join Algorithm
2. INGRES Algorithm
3. System R Algorithm
4. System R* Algorithm
5. Hill Climbing Algorithm
6. SDD-1 Algorithm

Hill Climbing Algorithm

1. Refinements of an initial feasible solution are recursively computed until no more cost
improvements can be made.
2. In the hill climbing algorithm, semijoins, data replication, and fragmentation are not used.
3. It was devised for wide area point-to-point networks.
4. It is the first distributed query processing algorithm.
5. The hill-climbing algorithm proceeds as follows:
i. Select an initial feasible execution strategy ES0
• i.e., a global execution schedule that includes all intersite communication
• Determine the candidate result sites, where a relation referenced in the query exists
• Compute the cost of transferring all the other referenced relations to each candidate site
• ES0 = the candidate site with minimum cost
ii. Split ES0 into two strategies: ES1 followed by ES2
• ES1: send one of the relations involved in the join to the other relation’s site
• ES2: send the join result to the final result site
iii. Replace ES0 with the split schedule if it gives a lower cost
iv. Recursively apply steps ii and iii on ES1 and ES2 until no more benefit can be gained
v. Check for redundant transmissions in the final plan and eliminate them
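The cost-comparison step at the heart of the algorithm can be sketched as follows; this is a hedged Python toy with made-up relation sizes and a deliberately crude cost model (cost = bytes shipped, join size estimated as the smaller operand), not the algorithm's actual formulas:

# Sketch: one hill-climbing refinement (illustrative numbers only).
def transfer_cost(size):
    return size  # hypothetical: shipping cost proportional to relation size

def initial_strategy(sizes, result_site):
    # ES0: ship every referenced relation straight to the result site.
    return sum(transfer_cost(s) for site, s in sizes.items() if site != result_site)

def split_strategy(r1_size, r2_size):
    # ES1: ship the smaller relation to the other relation's site and join there.
    # ES2: ship the (estimated) join result on to the result site.
    est_join_size = min(r1_size, r2_size)  # crude selectivity assumption
    return transfer_cost(min(r1_size, r2_size)) + transfer_cost(est_join_size)

sizes = {"site1": 1000, "site2": 200}               # relation sizes at their home sites
es0 = initial_strategy(sizes, result_site="site3")  # 1000 + 200 = 1200
es_split = split_strategy(1000, 200)                # 200 + 200 = 400
best = min(es0, es_split)  # keep the cheaper schedule and recurse on its parts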
Transactions
A transaction is a program comprising a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more
database operations such as insert, delete, update or retrieve. It is an atomic process that is
either performed to completion entirely or not performed at all. A transaction involving only
data retrieval without any data update is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads data item from storage to main memory.
 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.
Transaction Operations
The low level operations performed in a transaction are −
 begin_transaction − A marker that specifies start of transaction execution.
 read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be rolled
back.
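The sketch below (plain Python, with a dictionary standing in for stable storage) walks one transaction through these low-level operations; it is illustrative only, not a DBMS API:

# Sketch: the low-level operations of a single transaction.
storage = {"X": 100}  # stable storage (hypothetical)

def run_transaction():
    buffer = {}                          # begin_transaction: private workspace
    try:
        buffer["X"] = storage["X"]       # read_item(X): storage -> main memory
        buffer["X"] = buffer["X"] + 50   # modify_item(X): change value in memory
        storage["X"] = buffer["X"]       # write_item(X): main memory -> storage
        return "commit"                  # commit: the change is now permanent
    except Exception:
        return "rollback"                # rollback: buffer discarded, storage untouched

assert run_transaction() == "commit" and storage["X"] == 150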
Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
 Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
 Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
 Failed − The transaction goes from partially committed state or active state to failed state
when it is discovered that normal execution can no longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back after failure and the
database has been restored to the state it was in before the transaction began.

Goals of transaction management in a distributed database:

The goal of transaction management in a distributed database is to control the execution of
transactions so that transactions have the atomicity, durability, serializability and isolation
properties.

Objectives of Distributed Transaction Management

This section introduces the objectives of a transaction management subsystem of a distributed
DBMS. The responsibilities of the transaction manager in a distributed DBMS are the same as
those of a corresponding subsystem in a centralized database system, but its task is complicated
owing to fragmentation, replication and the overall distributed nature of the database. The
consistency of the database should not be violated because of transaction executions. Therefore,
the primary objective of a transaction manager is to execute transactions in an efficient way in
order to preserve the ACID properties of transactions (local as well as global transactions)
irrespective of concurrent execution of transactions and system failures.
CPU and main memory utilization should be improved
Most typical database applications spend much of their time waiting for I/O operations rather
than on computations. In a large system, the concurrent execution of these I/O-bound
applications can turn into a bottleneck in main memory or in CPU time utilization. To alleviate
this problem, that is, to improve CPU and main memory utilization, a transaction manager
should adopt specialized techniques to deal with specific database applications.
Response time should be minimized
To improve the performance of transaction executions, the response time of each individual
transaction must be considered and should be minimized. In the case of distributed applications,
it is very difficult to achieve an acceptable response time owing to the additional time required
to communicate between different sites.

Availability should be maximized

Although the availability in a distributed system is better than that in a centralized system, it
must be maximized for transaction recovery and concurrency control in distributed databases.
The algorithms implemented by the distributed transaction manager must not block the
execution of those transactions that do not strictly need to access a site that is not operational.

Communication cost should be minimized

In a distributed system an additional communication cost is incurred for each distributed or
global application, because a number of message transfers are required between sites to control
the execution of a global application.
Important features of a transaction processing system:

The following are the important features of a Transaction Processing System:

 Rapid response

The response time of a TPS is important because a business cannot afford to have its
customers waiting for long periods of time before making a transaction. TPS systems are
designed to process transactions virtually instantly; a transaction takes only a few seconds.

 Reliability

A transaction processing system should be reliable, because customers will not tolerate
mistakes. A TPS should always contain sufficient safety and security measures.

 Inflexibility

A TPS requires every transaction to be processed in exactly the same way, regardless of the user,
the customer or the time of day. Transactions must be processed the same way each time to
maximize efficiency. If a TPS were flexible, different types of data would be entered in different
orders.

 Historical data

A TPS produces information on a historical basis, because it generates information from
transactions that have already taken place in the organization.
 Link with external environment

A transaction processing system establishes a relation with the external environment, because it
distributes information to its customers and suppliers.

 Distribution of information to other systems

A TPS produces information and distributes it to other types of systems. For example, sales
processing systems supply information to the general ledger systems.

 Controlled access

Since TPS systems can be such a powerful business tool, they must allow only authorized
employees to access them at any time. Restricted access to the system ensures that only
employees who have the authority will be able to process and control transactions.

Concurrency control in centralized database systems: Concurrency Control Techniques:

Concurrency control techniques ensure that multiple transactions are executed simultaneously
while maintaining the ACID properties of the transactions and serializability in the schedules.

Various concurrency control techniques are:


1. Two-phase locking Protocol
2. Time stamp ordering Protocol
3. Multi version concurrency control
4. Validation concurrency control
These are briefly explained below.
1. Two-Phase Locking Protocol : Locking is an operation that secures permission to read
or permission to write a data item. Two-phase locking is a process used to gain ownership of
shared resources without creating the possibility of deadlock. The 3 activities taking place in
the two-phase update algorithm are:
(i). Lock Acquisition
(ii). Modification of Data
(iii). Release Lock
Two phase locking prevents deadlock from occurring in distributed systems by releasing all the
resources it has acquired, if it is not possible to acquire all the resources required without
waiting for another process to finish using a lock. This means that no process is ever in a state
where it is holding some shared resources, and waiting for another process to release a shared
resource which it requires. This means that deadlock cannot occur due to resource contention.
A transaction in the Two-Phase Locking Protocol passes through two phases:
 i) Growing Phase: In this phase a transaction can only acquire locks but cannot release any
lock. The point when a transaction acquires all the locks it needs is called the Lock Point.
 (ii) Shrinking Phase: In this phase a transaction can only release locks but cannot acquire
any.
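A minimal Python sketch of the two-phase discipline follows (a single-threaded toy that ignores lock modes and waiting queues; the class and its names are hypothetical):

# Sketch: a transaction that obeys two-phase locking.
class TwoPhaseTransaction:
    def __init__(self, lock_table):
        self.lock_table = lock_table  # shared map: data item -> owning transaction
        self.held = set()
        self.shrinking = False        # set once the first lock is released

    def lock(self, item):
        # Growing phase only: acquiring after any release violates 2PL.
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after releasing")
        holder = self.lock_table.get(item)
        if holder is not None and holder is not self:
            raise RuntimeError("item held by another transaction: wait or abort")
        self.lock_table[item] = self
        self.held.add(item)           # the last acquisition is the lock point

    def unlock_all(self):
        # Shrinking phase: release everything; no further lock() is allowed.
        self.shrinking = True
        for item in self.held:
            del self.lock_table[item]
        self.held.clear()

locks = {}
t = TwoPhaseTransaction(locks)
t.lock("X"); t.lock("Y")  # growing phase
t.unlock_all()            # shrinking phase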
2. Time Stamp Ordering Protocol : A timestamp is a tag that can be attached to any
transaction or any data item, which denotes a specific time at which the transaction or the data
item was last used in some way. A timestamp can be implemented in 2 ways. One is to directly
assign the current value of the clock to the transaction or data item. The other is to attach the
value of a logical counter that is incremented as new timestamps are required. The timestamp
of a data item can be of 2 types:
 (i) W-timestamp(X): This means the latest time when the data item X has been written
into.
 (ii) R-timestamp(X): This means the latest time when the data item X has been read from.
These 2 timestamps are updated each time a successful read/write operation is performed
on the data item X.
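Built on these two timestamps, the basic timestamp-ordering checks can be sketched as follows (a simplified Python toy; a real scheduler would also handle aborts, restarts and recovery):

# Sketch: basic timestamp-ordering rules for read and write.
rts = {}  # R-timestamp(X): timestamp of the youngest reader of X
wts = {}  # W-timestamp(X): timestamp of the youngest writer of X

def read(x, ts):
    if ts < wts.get(x, 0):   # a younger transaction already wrote X
        raise RuntimeError("too late to read: abort and restart")
    rts[x] = max(rts.get(x, 0), ts)

def write(x, ts):
    if ts < rts.get(x, 0) or ts < wts.get(x, 0):  # X already used by a younger txn
        raise RuntimeError("too late to write: abort and restart")
    wts[x] = ts

read("X", ts=5)   # sets R-timestamp(X) = 5
write("X", ts=7)  # allowed, since 7 >= 5; sets W-timestamp(X) = 7
# write("X", ts=3) would abort: transaction 3 is older than X's readers/writers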
3. Multiversion Concurrency Control: Multiversion schemes keep old versions of data items
to increase concurrency. Multiversion two-phase locking: each successful write results in the
creation of a new version of the data item written. Timestamps are used to label the versions.
When a read(X) operation is issued, select an appropriate version of X based on the timestamp
of the transaction.
4. Validation Concurrency Control : The optimistic approach is based on the assumption that
the majority of the database operations do not conflict. The optimistic approach requires
neither locking nor time stamping techniques. Instead, a transaction is executed without
restrictions until it is committed. Using an optimistic approach, each transaction moves
through 2 or 3 phases, referred to as read, validation and write.
 (i) During read phase, the transaction reads the database, executes the needed computations
and makes the updates to a private copy of the database values. All update operations of
the transactions are recorded in a temporary update file, which is not accessed by the
remaining transactions.
 (ii) During the validation phase, the transaction is validated to ensure that the changes
made will not affect the integrity and consistency of the database. If the validation test is
positive, the transaction goes to the write phase. If the validation test is negative, the
transaction is restarted and the changes are discarded.
 (iii) During the write phase, the changes are permanently applied to the database.

1. Distributed Concurrency Control Algorithms: Timestamp Concurrency Control
Algorithms
Timestamp-based concurrency control algorithms use a transaction’s timestamp to coordinate
concurrent access to a data item to ensure serializability. A timestamp is a unique identifier given
by DBMS to a transaction that represents the transaction’s start time.
These algorithms ensure that transactions commit in the order dictated by their timestamps. An
older transaction should commit before a younger transaction, since the older transaction enters
the system before the younger one. Timestamp-based concurrency control techniques generate
serializable schedules such that the equivalent serial schedule is arranged in order of the age of
the participating transactions.
Some of the timestamp-based concurrency control algorithms are −
• Basic timestamp ordering algorithm.
• Conservative timestamp ordering algorithm.
• Multiversion algorithm based upon timestamp ordering.
Timestamp-based ordering follows three rules to enforce serializability −
• Access Rule − When two transactions try to access the same data item simultaneously, for
conflicting operations, priority is given to the older transaction. This causes the younger
transaction to wait for the older transaction to commit first.
• Late Transaction Rule − If a younger transaction has written a data item, then an older
transaction is not allowed to read or write that data item. This rule prevents the older transaction
from committing after the younger transaction has already committed.
• Younger Transaction Rule − A younger transaction can read or write a data item that has
already been written by an older transaction.

2. Optimistic Concurrency Control Algorithm

In systems with low conflict rates, the task of validating every transaction for serializability may
lower performance. In these cases, the test for serializability is postponed to just before commit.
Since the conflict rate is low, the probability of aborting transactions which are not serializable is
also low. This approach is called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three phases −
• Execution Phase − A transaction fetches data items to memory and performs operations upon
them.
• Validation Phase − A transaction performs checks to ensure that committing its changes to the
database passes serializability test.
• Commit Phase − A transaction writes back modified data item in memory to the disk.
This algorithm uses three rules to enforce serializability in validation phase −
• Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing,
then Ti’s execution phase cannot overlap with Tj’s commit phase. Tj can commit only after Ti
has finished execution.
• Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then
Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can start executing only after Ti
has already committed.
• Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also
writing, then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can start to commit
only after Ti has already committed.
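A hedged Python sketch of the validation test implied by these rules (simplified to read/write-set overlap between transactions whose phases overlapped; the Txn structure is hypothetical):

# Sketch: optimistic validation via read/write-set intersection.
from dataclasses import dataclass, field

@dataclass
class Txn:
    start: int                 # when the execution phase began
    finish: int                # when the commit phase ended
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

def validate(txn, committed):
    for other in committed:
        if other.finish <= txn.start:
            continue           # other finished before txn started: no conflict
        # Phases overlapped: txn must not have touched what other wrote.
        if other.write_set & (txn.read_set | txn.write_set):
            return False       # validation fails: restart txn, discard its updates
    return True

t1 = Txn(start=0, finish=5, write_set={"X"})
t2 = Txn(start=3, finish=9, read_set={"X"})
assert validate(t2, committed=[t1]) is False  # t2 read X while t1 was writing it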
Deadlock management in distributed DBMS:
Deadlock is a state of a database system having two or more transactions, when each transaction
is waiting for a data item that is being locked by some other transaction. A deadlock can be
indicated by a cycle in the wait-for-graph. This is a directed graph in which the vertices denote
transactions and the edges denote waits for data items.
For example, in the following wait-for-graph, transaction T1 is waiting for data item X which is
locked by T3. T3 is waiting for Y which is locked by T2 and T2 is waiting for Z which is locked
by T1. Hence, a waiting cycle is formed, and none of the transactions can proceed executing.
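Detecting such a deadlock amounts to finding a cycle in the wait-for-graph, as in this Python sketch (the graph below encodes exactly the T1 → T3 → T2 → T1 example above):

# Sketch: deadlock detection = cycle detection in the wait-for-graph.
wait_for = {"T1": ["T3"], "T3": ["T2"], "T2": ["T1"]}  # edge Ti -> Tj: Ti waits for Tj

def has_cycle(graph):
    visited, on_path = set(), set()
    def dfs(node):
        if node in on_path:
            return True        # back to a node on the current path: cycle found
        if node in visited:
            return False
        visited.add(node)
        on_path.add(node)
        if any(dfs(n) for n in graph.get(node, [])):
            return True
        on_path.discard(node)  # backtrack
        return False
    return any(dfs(n) for n in graph)

assert has_cycle(wait_for)     # T1 -> T3 -> T2 -> T1 is a deadlock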

Deadlock Handling in Centralized Systems


There are three classical approaches for deadlock handling, namely −

 Deadlock prevention.
 Deadlock avoidance.
 Deadlock detection and removal.
All of the three approaches can be incorporated in both a centralized and a distributed database
system.
Deadlock Prevention
The deadlock prevention approach does not allow any transaction to acquire locks that will lead
to deadlocks. The convention is that when more than one transaction requests a lock on the
same data item, only one of them is granted the lock.
Deadlock Avoidance
The deadlock avoidance approach handles deadlocks before they occur. It analyzes the
transactions and the locks to determine whether or not waiting leads to a deadlock.
There are two algorithms for this purpose, namely wait-die and wound-wait (both sketched in
code after the list below). Let us assume that there are two transactions, T1 and T2, where T1
tries to lock a data item which is already locked by T2. The algorithms are as follows −
 Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger than
T2, T1 is aborted and later restarted.
 Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1
is younger than T2, T1 is allowed to wait.
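As promised above, a minimal sketch of the two rules, assuming a smaller timestamp means an older transaction (the function and return-value names are illustrative):

def wait_die(requester_ts, holder_ts):
    # Older requesters may wait; younger requesters "die" (abort, restart later).
    return "wait" if requester_ts < holder_ts else "abort_requester"

def wound_wait(requester_ts, holder_ts):
    # Older requesters "wound" (abort) the younger holder; younger requesters wait.
    return "abort_holder" if requester_ts < holder_ts else "wait"

# T1 (timestamp 5, older) requests an item held by T2 (timestamp 9, younger):
print(wait_die(5, 9))    # 'wait'         -- T1 waits
print(wound_wait(5, 9))  # 'abort_holder' -- T2 is aborted and restarted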
Deadlock Detection and Removal
The deadlock detection and removal approach runs a deadlock detection algorithm periodically
and removes a deadlock if one exists.
Since there are no precautions while granting lock requests, some of the transactions may be
deadlocked. To detect deadlocks, the lock manager periodically checks whether the wait-for-graph
has cycles. If the system is deadlocked, the lock manager chooses a victim transaction from each
cycle. The victim is aborted, rolled back, and then restarted later. Some of the methods used
for victim selection are −
 Choose the youngest transaction.
 Choose the transaction with the fewest data items.
 Choose the transaction that has performed the fewest updates.
 Choose the transaction with the least restart overhead.
 Choose the transaction which is common to two or more cycles.
This approach is primarily suited for systems with low transaction rates, where a fast response to
lock requests is needed.
Deadlock Handling in Distributed Systems
Transaction processing in a distributed database system is also distributed, i.e. the same
transaction may be processed at more than one site. The two main deadlock handling concerns
in a distributed database system that are not present in a centralized system are transaction
location and transaction control.
Transaction Location
Transactions in a distributed database system are processed in multiple sites and use data items
in multiple sites. The amount of data processing is not uniformly distributed among these sites.
The time period of processing also varies.
Thus the same transaction may be active at some sites and inactive at others. When two
conflicting transactions are located at a site, it may happen that one of them is in an inactive state.
This condition does not arise in a centralized system. This concern is called transaction location
issue.
Transaction Control
Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding the
choice of where to process the transaction and how to designate the center of control, like −
 One server may be selected as the center of control.
 The center of control may travel from one server to another.
 The responsibility of controlling may be shared by a number of servers.
Distributed Deadlock Prevention
Just like in centralized deadlock prevention, in the distributed deadlock prevention approach a
transaction must acquire all its locks before starting to execute. This prevents deadlocks.
The site where the transaction enters is designated as the controlling site. The controlling site
sends messages to the sites where the data items are located to lock the items. Then it waits for
confirmation. When all the sites have confirmed that they have locked the data items, the
transaction starts. If any site or communication link fails, the transaction has to wait until it has
been repaired.
Though the implementation is simple, this approach has some drawbacks −
 Pre-acquisition of locks involves long communication delays, which increases the time
required for a transaction.
 In case of site or link failure, a transaction has to wait for a long time so that the sites
recover. Meanwhile, in the running sites, the items are locked. This may prevent other
transactions from executing.
 If the controlling site fails, it cannot communicate with the other sites. These sites
continue to keep the locked data items in their locked state, thus resulting in blocking.
Distributed Deadlock Avoidance
As in centralized systems, distributed deadlock avoidance handles deadlocks prior to occurrence.
Additionally, in distributed systems, the transaction location and transaction control issues need
to be addressed. Due to the distributed nature of a transaction, the following conflicts may occur
−
 Conflict between two transactions in the same site.
 Conflict between two transactions in different sites.
In case of conflict, one of the transactions may be aborted or allowed to wait as per distributed
wait-die or distributed wound-wait algorithms.
Let us assume that there are two transactions, T1 and T2. T1 arrives at Site P and tries to lock a
data item which is already locked by T2 at that site. Hence, there is a conflict at Site P. The
algorithms are as follows −
 Distributed Wait-Die
o If T1 is older than T2, T1 is allowed to wait. T1 can resume execution after Site P
receives a message that T2 has either committed or aborted successfully at all sites.
o If T1 is younger than T2, T1 is aborted. The concurrency control at Site P sends a
message to all sites where T1 has visited to abort T1. The controlling site notifies
the user when T1 has been successfully aborted in all the sites.
 Distributed Wound-Wait
o If T1 is older than T2, T2 needs to be aborted. If T2 is active at Site P, Site P
aborts and rolls back T2 and then broadcasts this message to other relevant sites. If
T2 has left Site P but is active at Site Q, Site P broadcasts that T2 has been
aborted; Site Q then aborts and rolls back T2 and sends this message to all sites.
o If T1 is younger than T2, T1 is allowed to wait. T1 can resume execution after Site
P receives a message that T2 has completed processing.
Distributed Deadlock Detection
Just like centralized deadlock detection approach, deadlocks are allowed to occur and are
removed if detected. The system does not perform any checks when a transaction places a lock
request. For implementation, global wait-for-graphs are created. Existence of a cycle in the
global wait-for-graph indicates deadlocks. However, it is difficult to spot deadlocks since
transaction waits for resources across the network.
Alternatively, deadlock detection algorithms can use timers. Each transaction is associated with a
timer which is set to a time period in which a transaction is expected to finish. If a transaction
does not finish within this time period, the timer goes off, indicating a possible deadlock.
Another tool used for deadlock handling is a deadlock detector. In a centralized system, there is
one deadlock detector. In a distributed system, there can be more than one deadlock detector. A
deadlock detector can find deadlocks for the sites under its control. There are three alternatives
for deadlock detection in a distributed system, namely (a sketch of the centralized alternative
follows the list) −
 Centralized Deadlock Detector − One site is designated as the central deadlock detector.
 Hierarchical Deadlock Detector − A number of deadlock detectors are arranged in
hierarchy.
 Distributed Deadlock Detector − All the sites participate in detecting deadlocks and
removing them.
UNIT-IV DDBS:
Reliability issues in ddbs’s:
Reliability is defined as a measure of the success with which the system conforms
to some authoritative specification of its behavior. When the behavior deviates from that
which is specified for it, this is called a failure. The reliability of the system is inversely
related to the frequency of failures.
The reliability of a system can be measured in several ways, which are based on
the incidence of failures. Measures include Mean Time Between Failures (MTBF), Mean
Time To Repair (MTTR), and availability, defined as the fraction of the time that the
system meets its specification. MTBF is the average amount of time between system failures in a
network. MTTR is the average amount of time the system takes to repair a failure.

A reliable DDBMS is one that can continue to process user requests even when the
underlying system is unreliable, i.e., when failures occur.
Data replication + easy scaling = reliable system
Distribution enhances system reliability (but is not enough by itself)
◦ A number of protocols need to be implemented to exploit distribution and
replication
Reliability is closely related to the problem of how to maintain the atomicity and
durability properties of transactions
◦ System, State and Failure
◦ Reliability refers to a system that consists of a set of components.
◦ The system has a state, which changes as the system operates.
◦ The behavior of the system: an authoritative specification indicates the valid
behavior of each system state.
◦ Any deviation of a system from the behavior described in the specification is
considered a failure.
◦ An internal state of a system from which further processing, by the normal
algorithms of the system, can lead to a failure not attributable to a subsequent
fault is called an erroneous state.
◦ The part of the state which is incorrect is an error.
◦ An error in the internal states of the components of a system or in the design of a
system is a fault.
Fault --(causes)--> Error --(results in)--> Failure
Hard faults − permanent; the resulting failures are called hard failures.
Soft faults − transient or intermittent; they account for more than 90% of all failures; the
resulting failures are called soft failures.

Types of Failures
Transaction Failures

Transactions can fail for a number of reasons. Failure can be due to an error in the transaction
caused by incorrect input data as well as the detection of a present or potential deadlock.
Furthermore, some concurrency control algorithms do not permit a transaction to proceed or
even to wait if the data that they attempt to access are currently being accessed by another
transaction. This might also be considered a failure. The usual approach to take in cases of
transaction failure is to abort the transaction, thus resetting the database to its state prior to the
start of this transaction.

The frequency of transaction failures is not easy to measure. An early study reported that in
System R, 3% of the transactions aborted abnormally. In general, it can be stated that
(1) within a single application, the ratio of transactions that abort themselves is rather constant,
being a function of the incorrect data, the available semantic data control features, and so on; and
(2) the number of transaction aborts by the DBMS due to concurrency control considerations
(mainly deadlocks) is dependent on the level of concurrency (i.e., number of concurrent
transactions), the interference of the concurrent applications, the granularity of locks, and so on.

Site (System) Failures

The reasons for system failure can be traced back to a hardware or to a software failure. The
important point from the perspective of this discussion is that a system failure is always assumed
to result in the loss of main memory contents. Therefore, any part of the database that was in
main memory buffers is lost as a result of a system failure. However, the database that is stored
in secondary storage is assumed to be safe and correct. In distributed database terminology,
system failures are typically referred to as site failures, since they result in the failed site being
unreachable from other sites in the distributed system

We typically differentiate between partial and total failures in a distributed system. Total
failure refers to the simultaneous failure of all sites in the distributed system; partial failure
indicates the failure of only some sites while the others remain operational. It is this
aspect of distributed systems that makes them more available.

Media Failures

Media failure refers to the failures of the secondary storage devices that store the database. Such
failures may be due to operating system errors, as well as to hardware faults such as head
crashes or controller failures. The important point from the perspective of DBMS reliability is
that all or part of the database that is on the secondary storage is considered to be destroyed and
inaccessible. Duplexing of disk storage and maintaining archival copies of the database are
common techniques that deal with this sort of catastrophic problem.

Media failures are frequently treated as problems local to one site and therefore not specifically
addressed in the reliability mechanisms of distributed DBMSs. Techniques for dealing with
them fall under local recovery management; site failures are handled by the distributed
recovery functions discussed later.

Communication Failures

The three types of failures described above are common to both centralized and distributed
DBMSs. Communication failures, however, are unique to the distributed case. There are a
number of types of communication failures. The most common ones are the errors in the
messages, improperly ordered messages, lost (or undeliverable) messages, and communication
line failures. The first two errors are the responsibility of the computer
network; we will not consider them further. Therefore, in our discussions of distributed DBMS
reliability, we expect the underlying computer network hardware and software to ensure that two
messages sent from a process at some originating site to another process at some destination site
are delivered without error and in the order in which they were sent.

Commit and Recovery Protocols in DDBS:

Commit Protocol:

In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system, the
transaction manager should convey the decision to commit to all the servers in the various sites
where the transaction is being executed and uniformly enforce the decision. When processing is
complete at each site, it reaches the partially committed transaction state and waits for all other
transactions to reach their partially committed states. When it receives the message that all the
sites are ready to commit, it starts to commit. In a distributed system, either all sites commit or
none of them does.
The different distributed commit protocols are −

 One-phase commit
 Two-phase commit
 Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The steps in
distributed one-phase commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This waiting
time is called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a decision
to commit or abort. This is called the commit point. Then, it sends this message to all the
slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows (a coordinator-side sketch in code follows the
phase descriptions) −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received “DONE” message from all slaves,
it sends a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit, it
sends a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen
when the slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves, it
considers the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it
considers the transaction as aborted.
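As noted above, a minimal coordinator-side sketch of the two phases, assuming each slave exposes prepare(), commit(), and abort() operations (a hypothetical interface; a real 2PC implementation also logs each step and handles timeouts):

def two_phase_commit(slaves):
    # Phase 1 (prepare): collect each slave's vote.
    votes = [slave.prepare() for slave in slaves]   # "Ready" or "Not Ready"
    # Phase 2 (commit/abort): commit only on unanimous "Ready".
    if all(v == "Ready" for v in votes):
        for slave in slaves:
            slave.commit()   # slaves answer with "Commit ACK"
        return "committed"
    for slave in slaves:
        slave.abort()        # slaves answer with "Abort ACK"
    return "aborted"

class FakeSlave:
    def __init__(self, vote): self.vote = vote
    def prepare(self): return self.vote
    def commit(self): pass
    def abort(self): pass

print(two_phase_commit([FakeSlave("Ready"), FakeSlave("Ready")]))      # committed
print(two_phase_commit([FakeSlave("Ready"), FakeSlave("Not Ready")]))  # aborted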
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase

 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the “Commit ACK”/”Abort ACK”
message is not required.
Recovery Protocol:

What is recovery in distributed databases?

Recovery is the most complicated process in distributed databases. Recovery of a failed system
in the communication network is very difficult.

For example:
Consider that location A sends a message to location B and expects a response from B, but B is
unable to receive it. There are several possible causes for this situation, which are as follows.

 Message was failed due to failure in the network.
 Location B sent message but not delivered to location A.
 Location B crashed down.
 So it is actually very difficult to find the cause of failure in a large communication network.
 Distributed commit in the network is also a serious problem which can affect the recovery in a
distributed databases.

Two-phase commit protocol in Distributed databases

 Two-phase protocol is a type of atomic commitment protocol. This is a distributed algorithm
which can coordinate all the processes that participate in the database and decide to commit or
terminate the transactions. The protocol is based on commit and terminate action.
 The two-phase protocol ensures that all participants accessing the database server can
receive and implement the same action (commit or terminate) in case of local network failure.
 Two-phase commit protocol provides automatic recovery mechanism in case of a system
failure.
 The location at which the original transaction takes place is called the coordinator, and the
locations where the subprocesses take place are called cohorts.

Commit request (prepare) phase:
In this phase the coordinator attempts to prepare all cohorts and takes the necessary steps to
commit or terminate the transaction.

Commit phase:
In the commit phase, based on the voting of the cohorts, the coordinator decides whether to
commit or terminate the transaction.

Concurrency problems in distributed databases.

Some problems which occur while accessing the database are as follows:

1. Failure at local locations
When a system recovers from failure, its database is outdated compared to other locations, so it is
necessary to update the database.

2. Failure at communication location
The system should have the ability to manage temporary failures in the communication network of
a distributed database. In this case, a partition occurs, which can limit the communication between
two locations.

3. Dealing with multiple copies of data
It is very important to maintain multiple copies of distributed data at different locations.

4. Distributed commit
While committing a transaction which is accessing databases stored on multiple locations, if
failure occurs on some location during the commit process then this problem is called as
distributed commit.

5. Distributed deadlock
Deadlock can occur at several locations due to recovery and concurrency problems (multiple
locations accessing the same system in the communication network).
UNIT-V PARALLEL DATABASE SYSTEM:
Parallel DBMS is a Database Management System that runs through multiple processors and
disks. They combine two or more processors also disk storage that helps make operations and
executions easier and faster. They are designed to execute concurrent operations.
Parallel Databases :
Nowadays organizations need to handle a huge amount of data with a high transfer rate. For such
requirements, the client-server or centralized system is not efficient. With the need to improve
the efficiency of the system, the concept of the parallel database comes into the picture. A parallel
database system seeks to improve the performance of the system through parallelization.
Need :
Multiple resources like CPUs and Disks are used in parallel. The operations are performed
simultaneously, as opposed to serial processing. A parallel server can allow access to a single
database by users on multiple machines. It also performs many parallelization operations like
data loading, query processing, building indexes, and evaluating queries.
Advantages :
Here, we will discuss the advantages of parallel databases. Let’s have a look.
1. Performance Improvement –
By connecting multiple resources like CPU and disks in parallel we can significantly increase
the performance of the system.

2. High availability –
In a parallel database, nodes have less contact with each other, so the failure of one node
doesn’t cause the failure of the entire system. This amounts to significantly higher database
availability.

3. Proper resource utilization –

Due to parallel execution, the CPU will never be idle. Thus, proper utilization of
resources is achieved.
4. Increased reliability –
When one site fails, the execution can continue with another available site that has a copy of
the data, making the system more reliable.
Performance Measurement of Databases
Here, we will emphasize the performance measurement factor-like Speedup and Scale-up. Let’s
understand it one by one with the help of examples.
Speedup –
The ability to execute the tasks in less time by increasing the number of resources is called
Speedup.
Speedup = time_original / time_parallel
where,
time_original = time required to execute the task using 1 processor
time_parallel = time required to execute the task using 'n' processors

fig. Ideal Speedup curve

Example –

fig. A CPU requires 3 minutes to execute a process

fig. ‘n’ CPU requires 1 min to execute a process by dividing into smaller tasks
Scale-up –
The ability to maintain the performance of the system when both workload and resources
increase proportionally.
Scaleup = volume_parallel / volume_original
where,
volume_parallel = volume executed in a given amount of time using 'n' processors
volume_original = volume executed in a given amount of time using 1 processor

fig. Ideal Scaleup curve

Example –
20 users are using a CPU at 100% efficiency. If we try to add more users, it is not possible
for a single processor to handle the additional users. A new processor can be added to serve the
users in parallel, providing in effect 200% of the original capacity.
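Both measures translate directly into code; a small sketch using the formulas and the examples above:

def speedup(time_original, time_parallel):
    return time_original / time_parallel

def scaleup(volume_parallel, volume_original):
    return volume_parallel / volume_original

# The CPU example: 3 minutes on one processor, 1 minute on 'n' processors.
print(speedup(3.0, 1.0))    # 3.0 (ideal speedup with n processors would be n)

# The user example: 20 users originally; doubling processors lets the system
# serve 40 users in the same time.
print(scaleup(40.0, 20.0))  # 2.0 (linear scaleup)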

The main architectures for parallel DBMS are:

1. Shared Memory System

A Shared Memory System is an architecture of Database Management System where every
computer processor is able to access and process data from multiple memory modules or units
through an interconnection channel. This architecture is also commonly known as SMP or
Symmetric Multi-Processing. A Shared Memory System keeps large cache memories at each
processor, so repeated referencing of the shared memory is avoided.
Advantages of Shared Memory:

 Data is easily accessed from various processors.
 A single processor can send messages to other processors efficiently.
Disadvantage of Shared Memory:

 Waiting time for every single processor increases when many of them are in use.
 Bandwidth is also a problem.
2. Shared Disk System

A Shared Disk System is an architecture of Database Management System where every
computer processor can access multiple disks through an interconnection network. Each
processor also has and uses its own local memory, so access to shared data is more
efficient.

Advantages of Shared Disk System:

 The fault tolerance can be achieved using this system.
Disadvantages of Shared Disk System:

 The addition of processors can slow down existing processors.
 A Shared Disk System has limited scalability, since the interconnection to the disks can
become a bottleneck.
3. Shared Nothing System

A Shared Nothing System is an architecture of Database Management System where every
processor has its own disk and memory for the purpose of efficient workflows. The
processors communicate with other processors through an interconnection network. Each
processor acts like a server that stores data on its own disk, enabling efficient and effective
workflows.

Advantages of Shared Nothing System

 This system has more scalability.
 Processors and disks can be added as per requirement in a shared nothing
system.
Disadvantages of Shared Nothing System

 Must require the partitioning of data.
 The cost needed for this system is higher.

Hierarchical System or Non-Uniform Memory Architecture

 Hierarchical model system is a hybrid of shared memory system, shared disk system and shared
nothing system.
 Hierarchical model is also known as Non-Uniform Memory Architecture (NUMA).
 In this system each group of processors has a local memory, but processors from one group
can access memory associated with another group in a coherent manner.
 NUMA uses local and remote memory (memory from another group); hence it takes a longer
time for processors to communicate with each other.
Advantages of NUMA
 Improves the scalability of the system.
 The memory bottleneck (shortage of memory) problem is minimized in this architecture.
Disadvantages of NUMA
 The cost of the architecture is higher compared to other architectures.
Parallel query processing and optimization:
In a distributed database system, processing a query comprises optimization at both the global
and the local level. The query enters the database system at the client or controlling site. Here,
the user is validated, and the query is checked, translated, and optimized at the global level.
(Figure: architecture of distributed query processing, omitted.)
Mapping Global Queries into Local Queries


The process of mapping global queries to local ones can be realized as follows −
 The tables required in a global query have fragments distributed across multiple sites. The
local databases have information only about local data. The controlling site uses the
global data dictionary to gather information about the distribution and reconstructs the
global view from the fragments.
 If there is no replication, the global optimizer runs local queries at the sites where the
fragments are stored. If there is replication, the global optimizer selects the site based
upon communication cost, workload, and server speed.
 The global optimizer generates a distributed execution plan so that the least amount of data
transfer occurs across the sites. The plan states the location of the fragments, the order in
which query steps need to be executed, and the processes involved in transferring
intermediate results.
 The local queries are optimized by the local database servers. Finally, the local query
results are merged together through a union operation in the case of horizontal fragments
and a join operation for vertical fragments.
For example, let us consider that the following Project schema is horizontally fragmented
according to City, the cities being New Delhi, Kolkata and Hyderabad.
PROJECT (PId, City, Department, Status)

Suppose there is a query to retrieve details of all projects whose status is “Ongoing”.
The global query will be −
$$\sigma_{status = "ongoing"}(PROJECT)$$
Query in New Delhi’s server will be −
$$\sigma_{status = "ongoing"}(NewD\_PROJECT)$$
Query in Kolkata’s server will be −
$$\sigma_{status = "ongoing"}(Kol\_PROJECT)$$
Query in Hyderabad’s server will be −
$$\sigma_{status = "ongoing"}(Hyd\_PROJECT)$$
In order to get the overall result, we need to union the results of the three queries as follows −
$$\sigma_{status = "ongoing"}(NewD\_PROJECT) \cup \sigma_{status = "ongoing"}(Kol\_PROJECT) \cup \sigma_{status = "ongoing"}(Hyd\_PROJECT)$$
Distributed Query Optimization
Distributed query optimization requires evaluation of a large number of query trees, each of
which produces the required results of a query. This is primarily due to the presence of large
amounts of replicated and fragmented data. Hence, the target is to find an optimal solution instead
of the best solution.
The main issues for distributed query optimization are −

 Optimal utilization of resources in the distributed system.
 Query trading.
 Reduction of solution space of the query.
Optimal Utilization of Resources in the Distributed System
A distributed system has a number of database servers in the various sites to perform the
operations pertaining to a query. Following are the approaches for optimal resource utilization −
Operation Shipping − In operation shipping, the operation is run at the site where the data is
stored and not at the client site. The results are then transferred to the client site. This is
appropriate for operations where the operands are available at the same site. Example: Select and
Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database server,
where the operations are executed. This is used in operations where the operands are distributed
at different sites. This is also appropriate in systems where the communication costs are low, and
local processors are much slower than the client server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data fragments
are transferred to the high-speed processors, where the operation runs. The results are then sent
to the client site.

Query Trading
In query trading algorithm for distributed database systems, the controlling/client site for a
distributed query is called the buyer and the sites where the local queries execute are called
sellers. The buyer formulates a number of alternatives for choosing sellers and for reconstructing
the global results. The target of the buyer is to achieve the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is
created from local optimized query plans proposed by the sellers combined with the
communication cost for reconstructing the final result. Once the global optimal plan is
formulated, the query is executed.
Reduction of Solution Space of the Query
Optimal solution generally involves reduction of solution space so that the cost of query and data
transfer is reduced. This can be achieved through a set of heuristic rules, just as heuristics in
centralized systems.
Following are some of the rules −
 Perform selection and projection operations as early as possible. This reduces the data
flow over communication network.
 Simplify operations on horizontal fragments by eliminating selection conditions which are
not relevant to a particular site.
 In case of join and union operations comprising fragments located at multiple sites,
transfer the fragmented data to the site where most of the data is present and perform the
operation there.
 Use semi-join operations to qualify tuples that are to be joined, as shown in the sketch
after this list. This reduces the amount of data transfer, which in turn reduces communication
cost.
 Merge the common leaves and sub-trees in a distributed query tree.
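A minimal sketch of the semi-join rule from the list above: the site holding one relation ships only its join-column values, so only matching tuples cross the network for the final join (the relation contents are hypothetical):

# Site 1 holds (emp, dept); site 2 holds (dept, city).
employees_site1 = [("E1", "D1"), ("E2", "D2"), ("E3", "D9")]
depts_site2     = [("D1", "NY"), ("D2", "LA")]

join_keys = {d for d, _ in depts_site2}                      # small message: keys only
reduced = [e for e in employees_site1 if e[1] in join_keys]  # semi-join at site 1
# Only 'reduced' (2 tuples instead of 3) is shipped for the final join.
final = [(emp, dept, city) for emp, dept in reduced
         for d, city in depts_site2 if dept == d]
print(final)  # [('E1', 'D1', 'NY'), ('E2', 'D2', 'LA')]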
Load balancing:
A load balancer is a device that acts as a reverse proxy and distributes network or application
traffic across a number of servers. Load balancing is the approach of distributing load units (i.e.,
jobs/tasks) across the network that is connected to the distributed system. Load balancing is
done by the load balancer. The load balancer is a framework that can handle the load and is
used to distribute the tasks to the servers. The load balancer allocates the first task to the first
server and the second task to the second server.

Purpose of Load Balancing in Distributed Systems:

 Security: A load balancer provides safety to your site with practically no changes to
your application.
 Protect applications from emerging threats: The Web Application Firewall (WAF) in
the load balancer shields your site.
 Authenticate User Access: The load balancer can request a username and password prior
to granting access to your site, to safeguard against unauthorized access.
 Protect against DDoS attacks: The load balancer can detect and drop distributed
denial of service (DDoS) traffic before it gets to your site.
 Performance: Load balancers can reduce the load on your web servers and optimize
traffic for a better client experience.
 SSL Offload: Handling SSL (Secure Sockets Layer) traffic on the load balancer
removes the overhead from web servers, resulting in additional resources being available
for your web application.
 Traffic Compression: A load balancer can compress site traffic, giving your clients a
much better experience with your site.

Load Balancing Approaches (a round-robin sketch follows this list):

 Round Robin
 Least Connections
 Least Time
 Hash
 IP Hash
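As noted above, a minimal sketch of the round-robin approach, the simplest of these (the server names are illustrative; real balancers add health checks and weights):

import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)  # endless rotation over servers

    def pick(self):
        return next(self._cycle)  # next server in strict rotation

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
print([lb.pick() for _ in range(5)])
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b']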
Classes of Load Balancing Algorithms:
Following are some of the different classes of load balancing algorithms.
 Static: In this model, if any node is found to have a heavy load, a task can
be picked arbitrarily and moved to some other arbitrary node.
 Dynamic: It uses the current state information for load balancing. These are better
algorithms than static ones.
 Deterministic: These algorithms use processor and process characteristics to allocate
processes to the nodes.
 Centralized: The system state information is gathered by a single node.

Advantages of Load Balancing:


 Load balancers minimize server response time and maximize throughput.
 Load balancers ensure high availability and reliability by sending requests only to online
servers.
 Load balancers do continuous health checks to monitor the servers’ capability of handling
requests.
Migration:
Another important policy to be used by a distributed operating system that supports process
migration is to decide about the total number of times a process should be allowed to migrate.

Migration Models:
 Code segment
 Resource segment
 Execution segment
 Code segment: It contains the actual code.
 Resource segment: It contains references to external resources required by the process.
 Execution segment: It stores the current execution state of the process, comprising
private data, the stack, and the program counter.
 Weak migration: In weak migration only the code segment is moved.
 Strong migration: In this migration, both the code segment and the execution segment are
moved. The migration can also be initiated by the source.
Mobile database:
A Mobile database is a database that can be connected to a mobile computing device over a
mobile network (or wireless network). Here the client and the server have wireless
connections. In today’s world, mobile computing is growing very rapidly, and there is huge
potential in the field of databases. It is applicable on various devices such as Android-based
mobile databases, iOS-based mobile databases, etc. Common examples of mobile databases
are Couchbase Lite, ObjectBox, etc.

Features of Mobile database:
Here, we will discuss the features of the mobile database as follows.
 A cache is maintained to hold frequently used data and transactions so that they are not lost
due to connection failure.
 As the use of laptops, mobile phones, and PDAs increases, databases increasingly reside on
mobile systems.
 Mobile databases are physically separate from the central database server.
 Mobile databases resided on mobile devices.
 Mobile databases are capable of communicating with a central database server or other
mobile clients from remote sites.
 With the help of a mobile database, mobile users are able to keep working even when the
wireless connection is poor or non-existent (disconnected operation).
 A mobile database is used to analyze and manipulate data on mobile devices.
A mobile database typically involves three parties:
1. Fixed Hosts –
It performs the transactions and data management functions with the help of database
servers.

2. Mobiles Units –
These are portable computers that move around a geographical region that includes the
cellular network that these units use to communicate to base stations.

3. Base Stations –
These are two-way radio installations at fixed locations that pass communications between the
mobile units and the fixed hosts.
Limitations :
Here, we will discuss the limitations of mobile databases as follows.
 It has limited wireless bandwidth.
 Wireless communication speeds are low.
 Database access consumes considerable battery power, which is limited on mobile devices.
 It is less secure.
 It is hard to make theft-proof.
 DISTRIBUTED OBJECT MANAGEMENT
 Distributed object management has the same objective with regards to object-oriented data
management as the traditional database systems had for relational databases: transparent
management of “objects” that are distributed across a number of sites.
 Thus users have an integrated, “single image” view of the objectbase while it is physically
distributed among a number of sites.
 Maintaining such an environment requires that problems similar to those of relational
distributed databases be addressed, with the additional opportunities and difficulties posed by
object-orientation. In the remainder, we will review these issues and indicate some of the
approaches.
 ARCHITECTURE
 The first step that system developers seem to take, on the way to providing true distribution
by peer-to-peer communication, is to develop client-server systems.
 (Figure: a distributed objectbase architecture, in which user applications issue user queries
through client and server software, with transparent access to the objectbase via a
communication subsystem across sites.)
Features of Distributed Object Systems
 Object Interface Specification ...
 Object Manager ...
 Registration/Naming Service ...
 Object Communication Protocol ...
 Development Tools ...
 Security
Advantages
 Resource sharing − Sharing of hardware and software resources.
 Openness − Flexibility of using hardware and software of different vendors.
 Concurrency − Concurrent processing to enhance performance.
 Scalability − Increased throughput by adding new resources.
Distributed objects might be used:

1. to share information across applications or users.
2. to synchronize activity across several machines.
3. to increase performance associated with a particular task.
Multi databases:

Multiple databases require separate database administration, and a distributed database system
requires coordinated administration of the databases and network protocols. A parallel server can
consolidate several databases to simplify administrative tasks.

Multiple databases can provide greater availability than a single instance accessing a single
database, because an instance failure in a distributed database system does not prevent access to
data in the other databases: only the database owned by the failed instance is inaccessible. A
parallel server, however, allows continued access to all data when one instance fails, including
data accessed by the instance running on the failed node.

A parallel server accessing a single consolidated database avoids the need for distributed
updates, inserts, or deletions and more expensive two-phase commits by allowing a transaction
on any node to write to multiple tables simultaneously, regardless of which nodes usually write
to those tables.

Client-Server Systems

Any Oracle configuration can run in a client-server environment. In Oracle, a client application
runs on a remote computer using Net8 to access an Oracle server through a network. The
performance of this configuration is typically limited to the power of the single server node.

Figure 1-11 illustrates an Oracle client-server system.

Figure 1-11 Client-Server System

The client-server configuration allows you to off-load processing from the computer that runs an
Oracle server. If you have too many applications running on one machine, you can off-load them
to improve performance. However, if your database server is reaching its processing limits you
might want to move either to a larger machine or to a multinode system.

For compute-intensive applications, you could run some applications on one node of a multinode
system while running Oracle and other applications on another node or on several other nodes. In
this way you could use various nodes of a parallel machine as client nodes and one as a server
node.

If the database has several distinct, high-throughput parts, a parallel server running on high-
performance nodes can provide quick processing for each part of the database while also
handling occasional access across parts.

A client-server configuration requires that the network convey all communications between
clients and the database. This may not be appropriate for high-volume communications as is
required for many batch applications.

Advantages                   Disadvantages
Modular development          Costly software
Reliability                  Large overhead
Lower communication costs    Data integrity
Better response              Improper data distribution
