Distributed Database:
A distributed database is a collection of multiple interconnected databases, which are spread
physically across various locations that communicate via a computer network.
Features
Databases in the collection are logically interrelated with each other. Often they represent
a single logical database.
Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
A distributed database is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not synonymous with
a transaction processing system.
It ensures that the data modified at any site is universally updated.
It is used in application areas where large volumes of data are processed and accessed by
numerous users simultaneously.
It is designed for heterogeneous database platforms.
It maintains confidentiality and data integrity of the databases.
Distributed Data Storage:
There are two ways in which data can be stored at different sites:
1. Replication –
In this approach, the entire relation is stored redundantly at two or more sites. If the entire
database is available at all sites, it is a fully redundant database. Hence, in replication, systems
maintain copies of data.
This is advantageous as it increases the availability of data at different sites, and query
requests can now be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated: any
change made at one site must be recorded at every site where that relation is stored, or else it
may lead to inconsistency. This is a lot of overhead. Also, concurrency control becomes far
more complex, as concurrent access now needs to be checked across a number of sites.
2. Fragmentation –
In this approach, the relations are fragmented (i.e., divided into smaller parts) and each
fragment is stored at the site where it is required. It must be ensured that the fragments are
such that the original relation can be reconstructed from them (i.e., there is no loss of data).
Fragmentation is advantageous because it does not create copies of data, so consistency is not
a problem.
Disadvantages of Distributed Databases:
Overall Cost:
Various costs such as maintenance cost, procurement cost, hardware cost,
network/communication costs, labor costs, etc., add up to the overall cost and make a
distributed database costlier than a conventional DBMS.
Security Issues:
In a distributed database, besides controlling data redundancy, the security of both the data
and the network is a prime concern, since a network can easily be attacked for data theft and
misuse.
Integrity Control:
In a vast distributed database system, maintaining data consistency is important. All changes
made to data at one site must be reflected at all sites. Enforcing the integrity of data in a
distributed DBMS therefore incurs high communication and processing costs.
Lack of Standards:
Although a distributed DBMS provides effective communication and data sharing, there are
no standard rules and protocols for converting a centralized DBMS into a large distributed
DBMS. This lack of standards decreases the potential of distributed DBMSs.
Lack of Professional Support:
Due to the lack of adequate communication standards, it is not always possible to link
equipment produced by different vendors into a smoothly functioning network. Thus several
good resources may not be available to the users of the network.
Complex Database Design:
Designing a distributed database is more complex than designing a centralized database.
Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −
Distribution − It states the physical distribution of data across the different sites.
Autonomy − It indicates the distribution of control of the database system and the degree
to which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.
Architectural Models
Some of the common architectural models are −
Client-server systems
Peer-to-peer systems
Multi-DBMS architectures
Transparency:
Transparency in a DDBMS refers to hiding the details of how data is distributed from the
user, so that the system presents a single, unified view of the data.
In a Distributed Database Management System, there are four types of transparency, which
are as follows –
Transaction transparency
Performance transparency
DBMS transparency
Distribution transparency
1. Transaction transparency –
This transparency ensures that all distributed transactions preserve the integrity and
consistency of the distributed database. Note that a distributed transaction accesses data
stored at multiple locations; managing this is very complex due to the fragmentation,
allocation, and replication structure of the DBMS.
2. Performance transparency –
This transparency requires a DDBMS to perform as if it were a centralized database
management system; the system should not suffer any degradation in performance because
its architecture is distributed. This adds another complexity to take into consideration: the
fragmentation, replication, and allocation structure of the DBMS.
3. DBMS transparency-
This transparency is only applicable to heterogeneous types of DDBMS (Databases that
have different sites and use different operating systems, products, and data models) as it
hides the fact that the local DBMS may be different. This transparency is one of the most
complicated transparencies to make use of as a generalization.
4. Distribution transparency-
Distribution transparency helps the user to recognize the database as a single thing or a
logical entity, and if a DDBMS displays distribution data transparency, then the user does
not need to know that the data is fragmented.
Distribution transparency has five types, which are discussed below –
Fragmentation transparency –
With this type of transparency, the user does not need to know that the data is fragmented;
database accesses are therefore expressed against the global schema.
Location transparency-
If this type of transparency is provided by the DDBMS, the user needs to know how the data
has been fragmented, but does not need to know where the data is located.
Replication transparency-
In replication transparency, the user does not know about the copying of fragments.
Replication transparency is related to concurrency transparency and failure transparency.
Local Mapping transparency –
In local mapping transparency, the user needs to specify both the fragment names and the
locations of data items, taking into account any duplications that may exist. Queries are
therefore more difficult and time-consuming for the user to formulate than under the other
forms of distribution transparency.
Naming transparency –
As in a centralized DBMS, each item in a distributed database must have a unique name; the
DDBMS is responsible for ensuring this uniqueness across all sites.
Global Directory Issues:
A global directory is an extension of the normal directory. It is relevant for a distributed
DBMS or a multi-DBMS that uses a global conceptual schema.
• It includes information about the location of the fragments as well as the makeup of the
fragments.
• The directory is itself a database that contains meta-data about the actual data stored in
the database.
Three issues arise −
➢ A directory may either be global to the entire database or local to each site.
➢ A directory may be maintained centrally at one site, or in a distributed fashion by
distributing it over a number of sites. If the system is distributed, the directory is always
distributed.
➢ Replication may be single copy or multiple copies; multiple copies provide more
reliability.
UNIT-II DISTRIBUTED DATABASE DESIGN
DISTRIBUTED DATABASE DESIGN:
The strategies can be broadly divided into replication and fragmentation. However, in most
cases, a combination of the two is used.
Data Replication
Data replication is the process of storing separate copies of the database at two or more sites. It is
a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
Reliability − In case of failure of any site, the database system continues to work since a
copy is available at another site(s).
Reduction in Network Load − Since local copies of data are available, query processing
can be done with reduced network usage, particularly during prime hours. Data updating
can be done at non-prime hours.
Quicker Response − Availability of local copies of data ensures quick query processing
and consequently quick response time.
Simpler Transactions − Transactions require less number of joins of tables located at
different sites and minimal coordination across the network. Thus, they become simpler
in nature.
Disadvantages of Data Replication
Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
Increased Cost and Complexity of Data Updating − Each time a data item is updated,
the update needs to be reflected in all the copies of the data at the different sites. This
requires complex synchronization techniques and protocols.
Undesirable Application – Database coupling − If complex update mechanisms are not
used, removing data inconsistency requires complex co-ordination at application level.
This results in undesirable application – database coupling.
Some commonly used replication techniques are −
Snapshot replication
Near-real-time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table
are called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid
(combination of horizontal and vertical). Horizontal fragmentation can further be classified into
two techniques: primary horizontal fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from
the fragments whenever required. This requirement is called “reconstructiveness.”
Advantages of Fragmentation
Since data is stored close to the site of usage, efficiency of the database system is
increased.
Local query optimization techniques are sufficient for most queries since data is locally
available.
Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
When data from different fragments are required, the access speeds may be very low.
In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
Lack of back-up copies of data in different sites may render the database ineffective in
case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to
maintain reconstructiveness, each fragment should contain the primary key field(s) of the table.
Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students in
a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table according to the values of one or more
fields. Horizontal fragmentation should also conform to the rule of reconstructiveness: each
horizontal fragment must have all the columns of the original base table.
For example, in the student schema, if the details of all students of Computer Science Course
needs to be maintained at the School of Computer Science, then the designer will horizontally
fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
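As a sketch of the reconstructiveness requirement (OTHER_STD, holding the remaining students, and STD_INFO, holding the non-fee columns together with Regd_No, are hypothetical complementary fragments not created above), the original STUDENT relation can be rebuilt by a union of its horizontal fragments and a join of its vertical fragments −

$$STUDENT = COMP\_STD \cup OTHER\_STD$$
$$STUDENT = STD\_INFO \bowtie_{Regd\_No} STD\_FEES$$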
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques are
used. This is the most flexible fragmentation technique since it generates fragments with minimal
extraneous information. However, reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.
Semantic data control in distributed databases:
Semantic data is information that allows a machine to understand the meaning of
information.
It describes the technology and methods that convey the meaning of information.
Semantic data systems are designed to represent the real world as accurately as possible
within the dataset.
Data modeling is the process of creating and extending data models, which are visual
representations of data and its organization.
Example: ER diagrams.
Semantic data controls are similar to those in a centralized database system, but they must
also consider the fragmentation of relations and the distribution of fragments across multiple
sites.
Semantic data control includes:
View management
Data security
Semantic integrity control
View management:
Views in a distributed DBMS may be derived from relations that are fragmented and stored at
different sites.
• Views are conceptually the same as the base relations; therefore we store them in the (possibly)
distributed directory.
– Thus, views might be centralized at one site, partially replicated, or fully replicated.
– Queries on views are translated into queries on base relations, yielding distributed queries due
to possible fragmentation of data.
Semantic Integrity Control:
Semantic integrity control defines and enforces the integrity constraints of the database system.
The integrity constraints are as follows –
Data type integrity constraint
Entity integrity constraint
Referential integrity constraint
Data Type Integrity Constraint
A data type constraint restricts the range of values and the type of operations that can be applied
to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel number, hostel
name and capacity. The hostel number should start with capital letter "H" and cannot be NULL,
and the capacity should not be more than 150. The following SQL command can be used for data
definition
CREATE TABLE HOSTEL (
  H_NO VARCHAR2(5) NOT NULL,
  H_NAME VARCHAR2(15),
  CAPACITY INTEGER,
  CHECK ( H_NO LIKE 'H%' ),
  CHECK ( CAPACITY <= 150 )
);
Entity Integrity Control :
Entity integrity control enforces the rule that each tuple can be uniquely identified from the other
tuples. For this, a primary key is defined. A primary key is a minimal set of fields that can
uniquely identify a tuple. The entity integrity constraint states that no two tuples in a table can
have identical values for the primary key and that no field which is part of the primary key can
have a NULL value.
For example, in the above hostel table, the hostel number can be assigned as the primary key
through the following SQL statement
CREATE TABLE HOSTEL ( H_NO VARCHAR2(5) PRIMARY KEY,
H_NAME VARCHAR2(15), CAPACITY INTEGER);
Referential Integrity Constraint:
Referential integrity constraint lays down the rules of foreign keys. A foreign key is a field in a
data table that is the primary key of a related table. The referential integrity constraint lays down
the rule that the value of the foreign key field should either be among the values of the primary
key of the referenced table or be entirely NULL.
For example, let us consider a student table where a student may opt to live in a hostel. To
include this, the primary key of hostel table should be included as a foreign key in the student
table. The following SQL statement incorporates this
CREATE TABLE STUDENT ( S_ROLL INTEGER PRIMARY KEY,
S_NAME VARCHAR2(25) NOT NULL, S_COURSE VARCHAR2(10),
S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL);
Query processing issues:
In a distributed database system, processing a query involves optimization at both the global
and the local level. The query enters the database system at the client or controlling site. Here,
the user is validated, and the query is checked, translated, and optimized at a global level.
The main measures of query processing cost are −
o The total cost that will be incurred in processing the query. It is the sum of all times incurred
in processing the operations of the query at various sites and in intersite communication.
o The response time of the query. This is the time elapsed for executing the query. Since
operations can be executed in parallel at different sites, the response time of a query may be
significantly less than its total cost. Obviously, the total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU, I/O, and communication
costs. This cost can be minimized by reducing the number of I/O operations through fast access
methods to the data and efficient use of main memory.
The communication cost is the time needed for exchanging data between the sites participating
in the execution of the query.
Data Localization
• The input to the second layer is an algebraic query on global relations. The main role of the
second layer is to localize the query's data using data distribution information in the fragment
schema.
• This layer determines which fragments are involved in the query and transforms the distributed
query into a query on fragments.
Generating a query on fragments is done in two steps
• First, the query is mapped into a fragment query by substituting each relation by its
reconstruction program (also called materialization program).
• Second, the fragment query is simplified and restructured to produce another "good" query.
• The output of the query optimization layer is an optimized algebraic query with communication
operators included on fragments. It is typically represented and saved (for future executions) as a
distributed query execution plan.
The first four characteristics hold for both centralized and distributed query processors, while the
remaining ones are particular to distributed query processors in tightly-integrated distributed
DBMSs.
Languages
Types of Optimization
Optimization Timing
Statistics
Decision Sites
Use of Semijoins
Types of Optimization
Exhaustive search –
Query optimization aims at choosing the “best” point in the solution space of all possible
execution strategies. Although this method is effective in selecting the best strategy, it may incur
a significant processing cost for the optimization itself. The problem is that the solution space can
be large; that is, there may be many equivalent strategies, even with a small number of relations.
Heuristics –
Heuristics restrict the solution space so that only a few strategies are considered. The aim is to
find a very good solution, not necessarily the best one, while avoiding the high cost of
optimization in terms of memory and time consumption.
Optimization Timing
Optimization can be done statically, before executing the query, or dynamically, as the query is
executed.
Static
Static query optimization is done at query compilation time.
Thus the cost of optimization may be amortized over multiple query executions.
this timing is appropriate for use with the exhaustive search method.
Since the sizes of the intermediate relations of a strategy are not known until run time, they must
be estimated using database statistics.
Dynamic
database statistics are not needed to estimate the size of intermediate results
The main advantage over static query optimization is that the actual sizes of intermediate
relations are available to the query processor, thereby minimizing the probability of a bad choice.
The main shortcoming is that query optimization, an expensive task, must be repeated for each
execution of the query. Therefore, this approach is best for ad-hoc queries.
Hybrid
provide the advantages of static query optimization
The approach is basically static, but dynamic query optimization may take place at run time
when a high difference between predicted sizes and actual size of intermediate relations is
detected.
if the error in estimate sizes > threshold, reoptimize at run time
Statistics
The effectiveness of query optimization relies on statistics on the database.
Dynamic query optimization requires statistics in order to choose which operators should be
done first.
Static query optimization is even more demanding since the size of intermediate relations must
also be estimated based on statistical information.
statistics for query optimization typically bear on fragments, and include fragment cardinality
and size as well as the size and number of distinct values of each attribute.
To minimize the probability of error, more detailed statistics such as histograms of attribute
values are sometimes used.
The accuracy of statistics is achieved by periodic updating.
With static optimization, significant changes in statistics used to optimize a query might result in
query reoptimization.
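As an illustration of how such statistics can be used (a minimal sketch; the bucket boundaries and tuple counts are hypothetical, not drawn from any real catalog), an equi-width histogram on an attribute can be used to estimate the selectivity of a range predicate:

# Estimate the selectivity of "value <= threshold" from an attribute histogram.
# Each bucket is (upper_bound, tuple_count); boundaries and counts are made up.
buckets = [(100, 40), (200, 25), (300, 20), (400, 15)]
total = sum(count for _, count in buckets)

def estimate_selectivity(threshold):
    selected = 0.0
    lower = 0
    for upper, count in buckets:
        if threshold >= upper:                     # bucket fully selected
            selected += count
        elif threshold > lower:                    # bucket partially selected
            selected += count * (threshold - lower) / (upper - lower)
        lower = upper
    return selected / total

print(estimate_selectivity(250))   # 0.75: estimated fraction of matching tuples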
Decision Sites
Centralized decision approach – a single site generates the strategy, i.e., determines the “best”
schedule. This is the simpler approach.
Distributed decision approach – cooperation among the various sites to determine the schedule
(elaboration of the best strategy).
Hybrid decision approach – one site makes the major decisions, i.e., determines the global
schedule, while the other sites make local decisions, i.e., optimize the local subqueries.
Network Topology
Distributed query optimization can be divided into two separate problems: selection of the global
execution strategy, based on intersite communication, and selection of each local execution
strategy, based on a centralized query processing algorithm.
Wide area networks (WAN), point-to-point –
The communication cost will dominate; ignore all other cost factors.
The global schedule is chosen to minimize communication cost; local schedules are determined
according to centralized query optimization.
The objectives of query optimization are −
Minimization of query response time (the time taken to generate user query results).
Maximization of the throughput of the system (the number of requests handled within a given
amount of time).
Reduction of the amount of storage and memory required for processing.
Increased parallelism.
Some well-known distributed query processing algorithms are −
1. Semi-join Algorithm
2. INGRES Algorithm
3. System R Algorithm
4. System R* Algorithm
5. Hill Climbing Algorithm
6. SDD-1 Algorithm
Hill Climbing Algorithm:
1. Refinements of an initial feasible solution are recursively computed until no more cost
improvements can be made.
2. In the hill climbing algorithm, semijoins, data replication, and fragmentation are not used.
3. It was devised for wide area point-to-point networks.
4. It was the first distributed query processing algorithm.
5. The hill-climbing algorithm proceeds as follows (a small numeric sketch follows the steps):
i. Select initial feasible execution strategy ES0
• i.e., a global execution schedule that includes all intersite communication
• Determine the candidate result sites, where a relation referenced in the query exist
• Compute the cost of transferring all the other referenced relations to each candidate site
• ES0 = candidate site with minimum cost
ii. Split ES0 into two strategies: ES1 followed by ES2
• ES1: send one of the relations involved in the join to the other relation’s site –
• ES2: send the join result to the final result site
iii. Replace ES0 with the split schedule if the split gives a lower total cost
iv. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
v. Check for redundant transmissions in the final plan and eliminate them
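A minimal, illustrative sketch of this greedy refinement (the site names, relation sizes, result size, and the simple "cost = data shipped" function below are hypothetical stand-ins for real network statistics; as a simplification, the final result site is taken as fixed at the query site):

# Hypothetical relations: name -> (site, size in KB); join result size is assumed.
relations = {"EMP": ("site1", 400), "ASG": ("site2", 1000)}
result_size = 250
result_site = "site3"          # simplifying assumption: the result must end up here

def transfer_cost(size_kb):
    return size_kb             # assume cost is proportional to the data shipped

# Step i: initial feasible strategy ES0 = ship both relations to the result site.
best_cost = sum(transfer_cost(size) for _, size in relations.values())
plan = f"ship EMP and ASG to {result_site}, join there"

# Steps ii-iv: split ES0 into ES1 (ship one relation to the other's site and join
# there) followed by ES2 (ship the join result to the result site); keep a split
# whenever it lowers the total cost.
for shipped, kept in (("EMP", "ASG"), ("ASG", "EMP")):
    _, ship_size = relations[shipped]
    join_site = relations[kept][0]
    split_cost = transfer_cost(ship_size) + transfer_cost(result_size)
    if split_cost < best_cost:
        best_cost = split_cost
        plan = f"ship {shipped} to {join_site}, join, ship result to {result_site}"

print(plan, "| cost:", best_cost)
# ship EMP to site2, join, ship result to site3 | cost: 650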
Transactions
A transaction is a program comprising a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more database
operations such as insert, delete, update, or retrieve. It is an atomic process that is either
performed to completion in its entirety or not performed at all. A transaction involving only data
retrieval without any data update is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
read_item() − reads data item from storage to main memory.
modify_item() − change value of item in the main memory.
write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.
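A minimal sketch of how a single update decomposes into these low-level tasks (the storage dictionary and item name are hypothetical stand-ins for disk pages and buffers):

# Hypothetical "storage" and the three low-level tasks of one update operation.
storage = {"X": 100}
main_memory = {}

def read_item(name):                 # storage -> main memory
    main_memory[name] = storage[name]

def modify_item(name, delta):        # change the value in main memory
    main_memory[name] += delta

def write_item(name):                # main memory -> storage
    storage[name] = main_memory[name]

# The update "X := X + 50" as a transaction's low-level operations:
read_item("X")
modify_item("X", 50)
write_item("X")
print(storage["X"])   # 150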
Transaction Operations
The low level operations performed in a transaction are −
begin_transaction − A marker that specifies start of transaction execution.
read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
end_transaction − A marker that specifies end of transaction.
commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be rolled
back.
Transaction States
A transaction may go through a subset of five states: active, partially committed, committed,
failed and aborted.
Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
Failed − The transaction goes from partially committed state or active state to failed state
when it is discovered that normal execution can no longer proceed or system checks fail.
Aborted − This is the state after the transaction has been rolled back after failure and the
database has been restored to its state that was before the transaction began.
Rapid response
The response time of a TPS is important because a business cannot afford to have its customers
waiting for long periods of time before completing a transaction. TPS systems are designed to
process transactions virtually instantly; a transaction takes only a few seconds.
Reliability
A transaction processing system should be reliable, because customers will not tolerate mistakes.
A TPS should always contain sufficient safety and security measures.
Inflexibility
A TPS requires every transaction to be processed in exactly the same way, regardless of the user,
the customer or the time of day. Transactions must be processed in the same way each time to
maximize efficiency. If a TPS were flexible, different types of data would be entered in different
orders.
Historical data
A TPS produces information on a historical basis, because it generates information by taking into
account transactions that have already taken place in the organization.
Link with external environment
A transaction processing system establishes a relation with the external environment, because a
TPS distributes information to its customers and suppliers.
A TPS also produces information and distributes it to other types of systems. For example, sales
processing systems supply information to the general ledger system.
Controlled access
Since TPS systems can be such a powerful business tool, they must allow only authorized
employees to access them at any time. Restricted access to the system ensures that only
employees who have the authority will be able to process and control transactions.
In systems with low conflict rates, the task of validating every transaction for serializability may
lower performance. In these cases, the test for serializability is postponed to just before commit.
Since the conflict rate is low, the probability of aborting transactions which are not serializable is
also low. This approach is called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three phases −
• Execution Phase − The transaction fetches data items into memory and performs operations
upon them.
• Validation Phase − The transaction performs checks to ensure that committing its changes to
the database passes the serializability test.
• Commit Phase − The transaction writes the modified data items in memory back to the disk.
This algorithm uses three rules to enforce serializability in the validation phase (a simplified
sketch follows the rules below) −
• Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing,
then Ti’s execution phase cannot overlap with Tj’s commit phase. Tj can commit only after Ti
has finished execution.
• Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then
Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can start executing only after Ti
has already committed.
• Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also
writing, then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can start to commit
only after Ti has already committed.
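A minimal sketch of this validation test (the transaction structure, timestamps, and read/write sets below are hypothetical, and the three rules are collapsed into one simplified overlap check):

class Txn:
    def __init__(self, name, read_set, write_set, start):
        self.name = name
        self.read_set = set(read_set)
        self.write_set = set(write_set)
        self.start = start            # start of the execution phase
        self.commit_time = None       # set when the commit phase finishes

def validate(tj, committed):
    # tj may enter its commit phase only if no transaction that committed while
    # tj was executing wrote an item that tj read or wrote (rules 1-3, simplified).
    for ti in committed:
        if ti.commit_time is None or ti.commit_time <= tj.start:
            continue                  # ti finished before tj started: no overlap
        if ti.write_set & tj.read_set or ti.write_set & tj.write_set:
            return False              # read-write or write-write conflict
    return True

t1 = Txn("T1", read_set=[], write_set=["x"], start=1)
t1.commit_time = 5
t2 = Txn("T2", read_set=["x"], write_set=["y"], start=3)
print(validate(t2, [t1]))   # False: T1 committed a write to x while T2 was executing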
Deadlock management in distributed DBMS:
A deadlock is a state of a database system having two or more transactions, where each
transaction is waiting for a data item that is locked by some other transaction. A deadlock can be
indicated by a cycle in the wait-for graph. This is a directed graph in which the vertices denote
transactions and the edges denote waits for data items.
For example, in the following wait-for graph, transaction T1 is waiting for data item X, which is
locked by T3. T3 is waiting for Y, which is locked by T2, and T2 is waiting for Z, which is locked
by T1. Hence, a waiting cycle is formed, and none of the transactions can proceed.
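A minimal sketch of detecting such a cycle (the dictionary below encodes the example above: T1 waits for T3, T3 for T2, T2 for T1):

wait_for = {"T1": ["T3"], "T3": ["T2"], "T2": ["T1"]}   # edge Ti -> Tj: Ti waits for Tj

def has_cycle(graph):
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, []):
            if nxt in on_stack:                 # back edge: cycle found
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False
    return any(dfs(t) for t in graph if t not in visited)

print(has_cycle(wait_for))   # True: T1 -> T3 -> T2 -> T1 is a deadlock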
There are three approaches to handling deadlocks −
Deadlock prevention.
Deadlock avoidance.
Deadlock detection and removal.
All of the three approaches can be incorporated in both a centralized and a distributed database
system.
Deadlock Prevention
The deadlock prevention approach does not allow any transaction to acquire locks that will lead
to deadlocks. The convention is that when more than one transactions request for locking the
same data item, only one of them is granted the lock.
Deadlock Avoidance
The deadlock avoidance approach handles deadlocks before they occur. It analyzes the
transactions and the locks to determine whether or not waiting leads to a deadlock.
.
There are two algorithms for this purpose, namely wait-die and wound-wait (a small sketch
follows the two rules below). Let us assume that there are two transactions, T1 and T2, where T1
tries to lock a data item which is already locked by T2. The algorithms are as follows −
Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger than
T2, T1 is aborted and later restarted.
Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1
is younger than T2, T1 is allowed to wait.
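A minimal sketch of the two rules, using transaction start timestamps to decide who waits and who is aborted (the timestamps and helper names are hypothetical):

def wait_die(requester_ts, holder_ts):
    # Older transactions (smaller timestamp) may wait; younger ones die.
    return "wait" if requester_ts < holder_ts else "abort requester (restart later)"

def wound_wait(requester_ts, holder_ts):
    # Older transactions wound (abort) the lock holder; younger ones wait.
    return "abort holder (restart later)" if requester_ts < holder_ts else "wait"

# T1 started at time 10 (older), T2 at time 20 (younger); T1 requests a lock held by T2.
print(wait_die(10, 20))     # wait
print(wound_wait(10, 20))   # abort holder (restart later)
# T2 requests a lock held by T1.
print(wait_die(20, 10))     # abort requester (restart later)
print(wound_wait(20, 10))   # wait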
Deadlock Detection and Removal
The deadlock detection and removal approach runs a deadlock detection algorithm periodically
and removes deadlock in case there is one.
Since there are no precautions while granting lock requests, some of the transactions may become
deadlocked. To detect deadlocks, the lock manager periodically checks whether the wait-for
graph has cycles. If the system is deadlocked, the lock manager chooses a victim transaction from
each cycle. The victim is aborted, rolled back, and then restarted later. Some of the methods used
for victim selection are −
Transaction Location
Transactions in a distributed database system are processed in multiple sites and use data items
in multiple sites. The amount of data processing is not uniformly distributed among these sites.
The time period of processing also varies.
Thus the same transaction may be active at some sites and inactive at others. When two
conflicting transactions are located in a site, it may happen that one of them is in inactive state.
This condition does not arise in a centralized system. This concern is called transaction location
issue.
Transaction Control
Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding the
choice of where to process the transaction and how to designate the center of control, like −
Distributed DBMS Reliability:
A reliable DDBMS is one that can continue to process user requests even when the
underlying system is unreliable, i.e., when failures occur.
Data replication + easy scaling = reliable system.
Distribution enhances system reliability, but by itself it is not enough:
◦ a number of protocols need to be implemented to exploit distribution and
replication.
Reliability is closely related to the problem of how to maintain the atomicity and
durability properties of transactions
◦ System, State and Failure
◦ Reliability refers to a system that consists of a set of components.
◦ The system has a state, which changes as the system operates.
◦ The behavior of the system : authoritative specification indicates the valid
behavior of each system state.
◦ Any deviation of a system from the behavior described in the specification is
considered a failure.
◦ The internal state of a system such that there exist circumstances in which further
processing, by the normal algorithms of the system, will lead to a failure which is
not attributed to a subsequent fault, is called erroneous state.
◦ The part of the state which is incorrect is an error.
◦ An error in the internal states of the components of a system or in the design of a
system is a fault.
A fault causes an error, which results in a failure:
Fault → Error → Failure
Hard faults
Permanent
Resulting failures are called hard failures
Soft faults
Transient or intermittent
Account for more than 90% of all failures
Resulting failures are called soft failures
Transactions can fail for a number of reasons. Failure can be due to an error in the transaction
caused by incorrect input data as well as the detection of a present or potential deadlock.
Furthermore, some concurrency control algorithms do not permit a transaction to proceed or
even to wait if the data that they attempt to access are currently being accessed by another
transaction. This might also be considered a failure. The usual approach to take in cases of
transaction failure is to abort the transaction, thus resetting the database to its state prior to the
start of this transaction.
The frequency of transaction failures is not easy to measure. An early study reported that in
System R, 3% of the transactions aborted abnormally. In general, it can be stated that
(1) within a single application, the ratio of transactions that abort themselves is rather constant,
being a function of the incorrect data, the available semantic data control features, and so on; and
(2) the number of transaction aborts by the DBMS due to concurrency control considerations
(mainly deadlocks) is dependent on the level of concurrency (i.e., the number of concurrent
transactions), the interference of the concurrent applications, the granularity of locks, and so on.
The reasons for system failure can be traced back to a hardware or to a software failure. The
important point from the perspective of this discussion is that a system failure is always assumed
to result in the loss of main memory contents. Therefore, any part of the database that was in
main memory buffers is lost as a result of a system failure. However, the database that is stored
in secondary storage is assumed to be safe and correct. In distributed database terminology,
system failures are typically referred to as site failures, since they result in the failed site being
unreachable from other sites in the distributed system
We typically differentiate between partial and total failures in a distributed system. Total
failure refers to the simultaneous failure of all sites in the distributed system; partial failure
indicates the failure of only some sites while the others remain operational. As indicated in
Chapter 1, it is this aspect of distributed systems that makes them more available.
Media failure refers to the failures of the secondary storage devices that store the database. Such
failures may be due to operating system errors, as well as to hardware faults such as head
crashes or controller failures. The important point from the perspective of DBMS reliability is
that all or part of the database that is on the secondary storage is considered to be destroyed and
inaccessible. Duplexing of disk storage and maintaining archival copies of the database are
common techniques that deal with this sort of catastrophic problem.
Media failures are frequently treated as problems local to one site and therefore not specifically
addressed in the reliability mechanisms of distributed DBMSs. We consider techniques for
dealing with them in Section 12.3.5 under local recovery management. We then turn our
attention to site failures when we consider distributed recovery functions.
Commit Protocols:
In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system, the
transaction manager should convey the decision to commit to all the servers in the various sites
where the transaction is being executed and uniformly enforce the decision. When processing is
complete at each site, it reaches the partially committed transaction state and waits for all other
transactions to reach their partially committed states. When it receives the message that all the
sites are ready to commit, it starts to commit. In a distributed system, either all sites commit or
none of them does.
The different distributed commit protocols are −
One-phase commit
Two-phase commit
Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The steps in
distributed commit are −
After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
The slaves wait for “Commit” or “Abort” message from the controlling site. This waiting
time is called window of vulnerability.
When the controlling site receives “DONE” message from each slave, it makes a decision
to commit or abort. This is called the commit point. Then, it sends this message to all the
slaves.
On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of the one-phase commit protocol. The
steps performed in the two phases are as follows (a coordinator-side sketch follows the two
phases) −
Phase 1: Prepare Phase
After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received “DONE” message from all slaves,
it sends a “Prepare” message to the slaves.
The slaves vote on whether they still want to commit or not. If a slave wants to commit, it
sends a “Ready” message.
A slave that does not want to commit sends a “Not Ready” message. This may happen
when the slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves, it
considers the transaction as committed.
After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it
considers the transaction as aborted.
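A minimal sketch of the controlling site's side of two-phase commit (the messaging helpers and site names are hypothetical placeholders for real network calls; it assumes every slave has already sent its "DONE" message):

# Minimal two-phase-commit coordinator sketch (hypothetical messaging helpers).
def send(slave, message):
    print(f"-> {slave}: {message}")

def receive_vote(slave):
    # Placeholder: in a real system this would wait for "Ready"/"Not Ready"
    # from the slave, with a timeout treated as "Not Ready".
    return "Ready"

def two_phase_commit(slaves):
    # Phase 1: Prepare - ask every slave whether it can still commit.
    for s in slaves:
        send(s, "Prepare")
    votes = [receive_vote(s) for s in slaves]

    # Phase 2: Commit/Abort - the decision depends on the votes.
    decision = "Global Commit" if all(v == "Ready" for v in votes) else "Global Abort"
    for s in slaves:
        send(s, decision)
    # The coordinator then waits for "Commit ACK"/"Abort ACK" from every slave
    # before considering the transaction committed or aborted.
    return decision

print(two_phase_commit(["site_A", "site_B", "site_C"]))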
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
Recovery is the most complicated process in distributed databases. Recovery of a failed system
in the communication network is very difficult.
For example:
Consider that, location A sends message to location B and expects response from B but B is
unable to receive it. There are several problems for this situation which are as follows.
Commit request phase:
In the commit-request (prepare) phase, the coordinator attempts to prepare all cohorts and takes
the necessary steps to commit or abort the transaction.
Commit phase:
The commit phase is based on the voting of the cohorts; the coordinator decides to commit or
abort the transaction.
Some problems which occur while accessing the database are as follows:
4. Distributed commit
While committing a transaction which is accessing databases stored on multiple locations, if
failure occurs on some location during the commit process then this problem is called as
distributed commit.
5. Distributed deadlock
Deadlock can occur at several locations due to recovery problem and concurrency problem
(multiple locations are accessing same system in the communication network).
UNIT-V PARALLEL DATABASE SYSTEM:
A parallel DBMS is a Database Management System that runs across multiple processors and
disks. It combines two or more processors with disk storage, which helps make operations and
executions easier and faster. Parallel DBMSs are designed to execute concurrent operations.
Parallel Databases :
Nowadays organizations need to handle huge amounts of data with a high transfer rate. For such
requirements, the client-server or centralized system is not efficient. With the need to improve
the efficiency of the system, the concept of the parallel database comes into the picture. A parallel
database system seeks to improve the performance of the system through parallelism.
Need:
Multiple resources like CPUs and disks are used in parallel. The operations are performed
simultaneously, as opposed to serial processing. A parallel server can allow access to a single
database by users on multiple machines. It also parallelizes many operations such as data
loading, query processing, building indexes, and evaluating queries.
Advantages :
Here, we will discuss the advantages of parallel databases. Let’s have a look.
1. Performance Improvement –
By connecting multiple resources like CPU and disks in parallel we can significantly increase
the performance of the system.
2. High Availability –
In a parallel database, nodes have less contact with each other, so the failure of one node does
not cause the failure of the entire system. This amounts to significantly higher database
availability.
3. Proper Resource Utilization –
Due to parallel execution, the CPU will never be idle. Thus, there is proper utilization of
resources.
4. Increased Reliability –
When one site fails, execution can continue with another available site which has a copy of the
data, making the system more reliable.
Performance Measurement of Databases
Here, we will emphasize the performance measurement factor-like Speedup and Scale-up. Let’s
understand it one by one with the help of examples.
Speedup –
The ability to execute the tasks in less time by increasing the number of resources is called
Speedup.
Speedup=time original/time parallel
Where ,
time original = time required to execute the task using 1 processor
time parallel = time required to execute the task using 'n' processors
Example –
Fig.: with 'n' CPUs, a process divided into smaller tasks requires only 1 minute to execute.
Scale-up –
The ability to maintain the performance of the system when both workload and resources
increase proportionally.
Scaleup = Volume Parallel/Volume Original
Where ,
Volume Parallel = volume executed in a given amount of time using 'n' processor
Volume Original = volume executed in a given amount of time using 1 processor
Example –
20 users are using a CPU at 100% efficiency. If we try to add more users, a single processor
cannot handle the additional users. A new processor can be added to serve the users in parallel,
providing 200% of the original capacity.
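A minimal numeric sketch of the two formulas (all timings and transaction volumes below are made-up values, used only to show how the ratios are computed):

# Speedup: same task, more processors.
time_original = 60.0    # seconds with 1 processor (hypothetical)
time_parallel = 12.0    # seconds with 5 processors (hypothetical)
speedup = time_original / time_parallel
print("Speedup:", speedup)        # 5.0 -> linear speedup

# Scale-up: workload and resources grow together.
volume_original = 100   # transactions handled in a fixed interval by 1 processor
volume_parallel = 480   # transactions handled in the same interval by 5 processors
scaleup = volume_parallel / volume_original
print("Scale-up:", scaleup)       # 4.8 -> slightly sub-linear scale-up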
Waiting time for every single processor increases when all of them are in use.
Bandwidth is also a problem.
2. Shared Disk System
Hierarchical model system is a hybrid of shared memory system, shared disk system and shared
nothing system.
Hierarchical model is also known as Non-Uniform Memory Architecture (NUMA).
In this system, each group of processors has a local memory, but processors from other groups
can access memory associated with another group in a coherent manner.
NUMA uses local and remote memory (memory from another group), so it takes longer to
communicate between groups.
Advantages of NUMA
Improves the scalability of the system.
Memory bottleneck(shortage of memory) problem is minimized in this architecture.
Disadvantages of NUMA
The cost of the architecture is higher compared to other architectures.
Parallel query processing and optimization:
In a distributed database system, processing a query involves optimization at both the global
and the local level. The query enters the database system at the client or controlling site, where
the user is validated and the query is checked, translated, and optimized at a global level.
As an example, suppose there is a query to retrieve the details of all projects whose status is
“Ongoing”, where the PROJECT relation is horizontally fragmented across servers at New
Delhi, Kolkata, and Hyderabad.
The global query will be −
$$\sigma_{status = "ongoing"}(PROJECT)$$
Query in New Delhi’s server will be −
$$\sigma_{status = "ongoing"}(NewD\_PROJECT)$$
Query in Kolkata’s server will be −
$$\sigma_{status = "ongoing"}(Kol\_PROJECT)$$
Query in Hyderabad’s server will be −
$$\sigma_{status = "ongoing"}(Hyd\_PROJECT)$$
In order to get the overall result, we need to union the results of the three queries as follows −
$$\sigma_{status = "ongoing"}(NewD\_PROJECT) \cup \sigma_{status = "ongoing"}(Kol\_PROJECT) \cup \sigma_{status = "ongoing"}(Hyd\_PROJECT)$$
Distributed Query Optimization
Distributed query optimization requires evaluation of a large number of query trees each of
which produce the required results of a query. This is primarily due to the presence of large
amount of replicated and fragmented data. Hence, the target is to find an optimal solution instead
of the best solution.
The main issues for distributed query optimization are optimal utilization of resources in the
distributed system, query trading, and reduction of the solution space of the query.
Query Trading
In query trading algorithm for distributed database systems, the controlling/client site for a
distributed query is called the buyer and the sites where the local queries execute are called
sellers. The buyer formulates a number of alternatives for choosing sellers and for reconstructing
the global results. The target of the buyer is to achieve the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is
created from local optimized query plans proposed by the sellers combined with the
communication cost for reconstructing the final result. Once the global optimal plan is
formulated, the query is executed.
Reduction of Solution Space of the Query
Optimal solution generally involves reduction of solution space so that the cost of query and data
transfer is reduced. This can be achieved through a set of heuristic rules, just as heuristics in
centralized systems.
Following are some of the rules −
Perform selection and projection operations as early as possible. This reduces the data flow
over the communication network.
Simplify operations on horizontal fragments by eliminating selection conditions that are not
relevant to a particular site.
In case of join and union operations comprising fragments located at multiple sites, transfer
the fragmented data to the site where most of the data is present and perform the operation
there.
Use the semi-join operation to qualify tuples that are to be joined (see the sketch after this
list). This reduces the amount of data transfer, which in turn reduces communication cost.
Merge the common leaves and sub-trees in a distributed query tree.
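A minimal sketch of the semi-join rule above (the EMP and ASG tuples and their site assignment are hypothetical): rather than shipping all of EMP to ASG's site, only the join-attribute values of ASG are shipped to EMP's site, EMP is reduced, and only the matching tuples travel for the final join.

# Site 1 holds EMP, site 2 holds ASG (hypothetical tuples).
emp = [{"eno": 1, "name": "A"}, {"eno": 2, "name": "B"}, {"eno": 3, "name": "C"}]
asg = [{"eno": 1, "proj": "P1"}, {"eno": 3, "proj": "P2"}]

# Step 1: ship only the join-attribute values of ASG to EMP's site.
asg_keys = {t["eno"] for t in asg}

# Step 2: semi-join EMP ⋉ ASG - keep only EMP tuples that will find a match.
emp_reduced = [t for t in emp if t["eno"] in asg_keys]

# Step 3: ship the reduced EMP to ASG's site and perform the actual join there.
result = [{**e, **a} for e in emp_reduced for a in asg if e["eno"] == a["eno"]]
print(result)   # only employees 1 and 3 were transferred and joined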
Load balancing:
A load balancer is a device that acts as a reverse proxy and distributes network or application
traffic across a number of servers. Load balancing is the approach of distributing load units (i.e.,
jobs/tasks) across the nodes connected in the distributed system. Load balancing is done by the
load balancer, a framework that can handle the load and is used to distribute the tasks to the
servers. The load balancer allocates the first task to the first server, the second task to the second
server, and so on. Common load-balancing methods include (a round-robin sketch follows this
list):
Round Robin
Least Connections
Least Time
Hash
IP Hash
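A minimal sketch of the Round Robin method listed above (the server names and tasks are hypothetical):

from itertools import cycle

servers = ["server-1", "server-2", "server-3"]          # hypothetical back-ends
tasks = ["job-A", "job-B", "job-C", "job-D", "job-E"]

# Round robin: hand each incoming task to the next server in circular order.
assignment = {}
next_server = cycle(servers)
for task in tasks:
    assignment[task] = next(next_server)

print(assignment)
# {'job-A': 'server-1', 'job-B': 'server-2', 'job-C': 'server-3',
#  'job-D': 'server-1', 'job-E': 'server-2'}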
Classes of Load Balancing Algorithms:
Following are some of the different classes of load balancing algorithms.
Static: In this model, if any node is found with a heavy load, a task can be taken from it
arbitrarily and moved to some other arbitrarily chosen node.
Dynamic: These algorithms use the current state information of the system for load
balancing. They perform better than static algorithms.
Deterministic: These algorithms use processor and process characteristics to allocate
processes to the nodes.
Centralized: The system state information is gathered by a single node.
Migration Models:
A process to be migrated consists of three segments −
Code segment
Resource segment
Execution segment
Code segment: It contains the actual code.
Resource segment: It contains references to the external resources required by the process.
Execution segment: It stores the current execution state of the process, consisting of private
data, the stack, and the program counter.
Weak migration: In weak migration, only the code segment is moved.
Strong migration: In strong migration, both the code segment and the execution segment are
moved. The migration can also be initiated by the source.
Mobile database:
A mobile database is a database that can be connected to a mobile computing device over a
mobile network (or wireless network). Here the client and the server have wireless
connections. In today’s world, mobile computing is growing very rapidly, and there is huge
potential in the field of databases. Mobile databases are available for different platforms, such
as Android-based and iOS-based mobile databases. Common examples of mobile databases
are Couchbase Lite, ObjectBox, etc.
2. Mobile Units –
These are portable computers that move around a geographical region covered by the
cellular network, which these units use to communicate with base stations.
3. Base Stations –
These are two-way radio installations in fixed locations that relay communications between
the mobile units and the fixed hosts.
Limitations:
Here, we will discuss the limitations of mobile databases.
It has limited wireless bandwidth.
Wireless communication speed is low.
It requires considerable battery power, which is limited on mobile units.
It is less secure.
It is hard to make theft-proof.
DISTRIBUTED OBJECT MANAGEMENT
Distributed object management has the same objective with regards to object-oriented data
management as the traditional database systems had for relational databases: transparent
management of “objects” that are distributed across a number of sites.
Thus users have an integrated, “single image” view of the objectbase while it is physically
distributed among a number of sites .
Maintaining such an environment requires that problems similar to those of relational
distributed databases be addressed, with the additional opportunities and difficulties posed by
object-orientation. In the remainder, we will review these issues and indicate some of the
approaches.
ARCHITECTURE
The first step that system developers seem to take, on the way to providing true distribution
by peer-to-peer communication, is to develop client-server systems.
[Figure: user applications and user queries accessing the distributed objectbase transparently
through client and server software and a communication subsystem at each site.]
Features of Distributed Object Systems
Object Interface Specification
Object Manager
Registration/Naming Service
Object Communication Protocol
Development Tools
Security
Advantages
Resource sharing − Sharing of hardware and software resources.
Openness − Flexibility of using hardware and software of different vendors.
Concurrency − Concurrent processing to enhance performance.
Scalability − Increased throughput by adding new resources.
Distributed objects might be used :
Multiple databases require separate database administration, and a distributed database system
requires coordinated administration of the databases and network protocols. A parallel server can
consolidate several databases to simplify administrative tasks.
Multiple databases can provide greater availability than a single instance accessing a single
database, because an instance failure in a distributed database system does not prevent access to
data in the other databases: only the database owned by the failed instance is inaccessible. A
parallel server, however, allows continued access to all data when one instance fails, including
data accessed by the instance running on the failed node.
A parallel server accessing a single consolidated database avoids the need for distributed
updates, inserts, or deletions and more expensive two-phase commits by allowing a transaction
on any node to write to multiple tables simultaneously, regardless of which nodes usually write
to those tables.
Client-Server Systems
Any Oracle configurations can run in a client-server environment. In Oracle, a client application
runs on a remote computer using Net8 to access an Oracle server through a network. The
performance of this configuration is typically limited to the power of the single server node.
The client-server configuration allows you to off-load processing from the computer that runs an
Oracle server. If you have too many applications running on one machine, you can off-load them
to improve performance. However, if your database server is reaching its processing limits you
might want to move either to a larger machine or to a multinode system.
For compute-intensive applications, you could run some applications on one node of a multinode
system while running Oracle and other applications on another node or on several other nodes. In
this way you could use various nodes of a parallel machine as client nodes and one as a server
node.
If the database has several distinct, high-throughput parts, a parallel server running on high-
performance nodes can provide quick processing for each part of the database while also
handling occasional access across parts.
A client-server configuration requires that the network convey all communications between
clients and the database. This may not be appropriate for high-volume communications as is
required for many batch applications.