ADBMS Notes

The document discusses distributed database systems. It defines a distributed database as a collection of interconnected databases spread across physical locations that communicate over a network. It describes factors like organizational units being distributed globally and the need for sharing data across units as reasons for using distributed databases. It also discusses homogeneous vs heterogeneous distributed database management systems and different approaches for data storage and replication in distributed databases.


At the end of this session, you will be able to understand:

Introduction to Distributed Databases, Homogeneous Distributed DBMS vs. Heterogeneous Distributed DBMS, Distributed DBMS Storage, and Data Replication in Distributed DBMS.

Introduction to Distributed DBMS

A distributed database system consists of loosely coupled sites that share no physical component.
The database systems that run on each site are independent of each other.

Transactions may access data at one or more sites.

A distributed database is a collection of multiple interconnected databases, spread physically across various locations, that communicate via a computer network.

Factors Encouraging Distributed DBMS

Distributed Nature of Organizational Units − Most organizations today are subdivided into multiple units that are physically distributed over the globe. Each unit requires its own set of local data.

Need for Sharing of Data − The multiple organizational units often need to communicate with each
other and share their data and resources. This demands common databases or replicated
databases that should be used in a synchronized manner.

Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) work upon diversified systems which may have common data. Distributed database systems aid both kinds of processing by providing synchronized data.
Homogeneous Vs Heterogeneous Distributed DBMS

In a homogeneous distributed database:

All sites have identical software.

All sites are aware of each other and agree to cooperate in processing user requests.

Each site surrenders part of its autonomy in terms of the right to change schemas or software.

The system appears to the user as a single system.

In a heterogeneous distributed database:

Different sites may use different schemas and software

Difference in schema is a major problem for query processing

Difference in software is a major problem for transaction processing

Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction
processing

Distributed DBMS Storage

Replication − The system maintains multiple copies of data, stored at different sites, for faster retrieval and fault tolerance.

Fragmentation − A relation is partitioned into several fragments stored at distinct sites.

Replication and fragmentation can be combined

The relation is partitioned into several fragments, and the system maintains several identical replicas of each such fragment.

Replication in Distributed DBMS

A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.

Full replication of a relation is the case where the relation is stored at all sites.

Fully redundant databases are those in which every site contains a copy of the entire database.

Partial replication of a relation is the case where the relation is stored at selected sites.

If a relation is stored at only a single site, the database is known as unreplicated.

Data Replication in Distributed DBMS

• Advantages of Replication

Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.

Parallelism: queries on r may be processed by several nodes in parallel.

Reduced data transfer: relation r is available locally at each site containing a replica of r.
Data Replication in Distributed DBMS…..

• Disadvantages of Replication

Increased cost of updates: each replica of relation r must be updated.

Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data
unless special concurrency control mechanisms are implemented.

One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.
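The primary-copy idea above can be sketched in a few lines of Python. This is a toy illustration, not a real DBMS component; the class and method names (`ReplicatedRelation`, `update`, `read`) are hypothetical.

```python
class ReplicatedRelation:
    """Sketch of primary-copy replication (all names are illustrative).

    Every update is applied to the designated primary copy first; the
    primary then propagates the change to the other replicas, so
    concurrency control only needs to serialize updates at one site.
    """

    def __init__(self, sites):
        # One copy of the relation (a dict of row_id -> row) per site.
        self.copies = {site: {} for site in sites}
        self.primary = sites[0]  # choose one copy as the primary copy

    def update(self, row_id, row):
        # Apply the update on the primary copy first...
        self.copies[self.primary][row_id] = row
        # ...then propagate it to every other replica.
        for site, copy in self.copies.items():
            if site != self.primary:
                copy[row_id] = row

    def read(self, site, row_id):
        # Reads can be served locally from any replica.
        return self.copies[site][row_id]


r = ReplicatedRelation(["A", "B", "C"])
r.update(1, {"name": "John"})
print(r.read("C", 1))  # {'name': 'John'}
```

After an update, every replica holds the same row, which is the consistency the primary-copy scheme is meant to preserve.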

At the end of this session, you will be able to :

understand the concepts of EE, Pipelining, and Materialization

Materialization

Materialized evaluation walks the parse or expression tree of the relational algebra operation, and
performs the innermost or leaf-level operations first

The intermediate result of each operation is materialized — an actual, but temporary, relation —and becomes
input for subsequent operations.

The cost of materialization is the sum of the individual operations plus the cost of writing the intermediate
results to disk — a function of the blocking factor (number of records per block) of the temporaries.

The problem with materialization is that it produces lots of temporary files and lots of I/O.

In the previous section, we took a brief look at materialization and how to evaluate multiple operations of an expression.

Materialization is an easy approach for evaluating the multiple operations of a given query and storing the results in temporary relations. A result can be the output of a join condition, a selection condition, and so on. Thus, materialization is the process of creating and storing the results of the evaluated operations for the user query. It is similar to cache memory, where searched data is stored temporarily. We can easily understand the working of materialization through a pictorial representation of the expression: an operator tree is used to represent an expression.

The materialization uses the following approach for evaluating operations of the given expression:

In the operator tree, we begin from the lowest-level operations (at the bottom of the tree) in the expression.
The inputs to the lowest level operations are stored in the form of relations in the database. For example,
suppose we want to fetch the name of the student as 'John' from the 'Student' relation.

The relation expression will be:

σ(name = "John") (Student)

In this example, there is only one operation of selecting the name from the given relation. Also, this operation
is the lowest-level operation. So, we will begin by evaluating this selection operation.

Now, we will use an appropriate algorithm which is suitable for evaluating the operation.

Like in our example, we will use an appropriate selection algorithm for retrieving the name from the Student relation.

Then, store the result of the operation in temporary relations.

• We use these temporary relations for evaluating the next-level operation in the operator tree. The result
works as an input for every next level up in the tree.

• Repeat these steps until the operator at the root of the tree is evaluated and the final result of the expression is generated.

We call the described evaluation Materialized evaluation because the result of one operation is materialized and used in the evaluation of the next operation, and so on.
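The steps above can be sketched in Python. This is a minimal illustration, assuming relations are lists of dicts; the helpers `select`, `project`, and `evaluate_materialized` are invented names, and each intermediate result is "materialized" into a temporary list before the next operator reads it.

```python
# Toy sketch of materialized evaluation (helper names are illustrative).
# Relations are lists of dicts; every operator's result is written to a
# temporary relation before the next operator consumes it.

def select(relation, predicate):
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    return [{a: t[a] for a in attrs} for t in relation]

def evaluate_materialized(student):
    temporaries = []                     # the "temporary relations"
    # Lowest-level operation first: sigma name='John' (Student).
    temp1 = select(student, lambda t: t["name"] == "John")
    temporaries.append(temp1)
    # The materialized result feeds the next operator up the tree.
    temp2 = project(temp1, ["name"])
    temporaries.append(temp2)
    return temp2, temporaries

student = [{"id": 1, "name": "John"}, {"id": 2, "name": "Mary"}]
result, temps = evaluate_materialized(student)
print(result)      # [{'name': 'John'}]
print(len(temps))  # 2 temporary relations were created
```

Even this two-operator query already creates two temporaries, which is exactly the I/O cost that pipelining (discussed later) tries to avoid.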

Cost Estimation of Materialized Evaluation

The process of estimating the cost of the materialized evaluation is different from the process of estimating
the cost of an algorithm. It is because in analyzing the cost of an algorithm, we do not include the cost of
writing the results on to the disks. But in the evaluation of an expression, we not only compute the cost of all
operations but also include the cost of writing the result of currently evaluated operation to disk.

To estimate the cost of the materialized evaluation, we consider that results are stored in the buffer, and
when the buffer fills completely, the results are stored to the disk.

Let a total of br blocks be written. We can estimate br as:

br = ⌈nr / fr⌉

Here, nr is the estimated number of tuples in the result relation r and fr is the number of records of relation r
that fits in a block. Thus, fr is a blocking factor of the resultant relation r.


With this, we also need to calculate the transfer time by estimating the number of required disk seeks. This is because the disk head may have moved in between successive writes of blocks. Thus, we can estimate:

Number of seeks = ⌈br / bb⌉

Here, bb defines the size of the output buffer, i.e., measured in blocks.
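These two estimates are easy to compute. Below is a small sketch with made-up numbers (10,000 result tuples, 40 records per block, a 50-block output buffer); the function names are illustrative.

```python
import math

def blocks_written(n_r, f_r):
    """b_r = ceil(n_r / f_r): blocks needed to hold the result relation."""
    return math.ceil(n_r / f_r)

def seeks(b_r, b_b):
    """Number of seeks = ceil(b_r / b_b) for an output buffer of b_b blocks."""
    return math.ceil(b_r / b_b)

# Illustrative numbers, not from the text.
b_r = blocks_written(10_000, 40)  # 250 blocks
print(b_r, seeks(b_r, 50))        # 250 5
```

Note how a larger output buffer b_b directly reduces the number of seeks, which is why allocating extra buffer blocks (as described next) pays off.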

We can optimize the cost of the materialization process by using the concept of double buffering. Double buffering is the method of using two buffers, where one buffer is filled by the continuously executing algorithm while the other is being written out. It makes the algorithm execute faster by performing CPU activities in parallel with I/O activities. We can also reduce the number of seeks by allocating extra blocks to the output buffer and writing out multiple blocks at once.

In the earlier section, we learned about materialization, in which we evaluate the multiple operations of a given expression via temporary relations. However, this has the drawback of producing a large number of temporary files, which makes query evaluation less efficient. The evaluation of a query should be highly efficient in producing an effective output.

Here, we will discuss another method of evaluating the multiple operations of an expression that works more
efficiently than materialization. Such a more efficient way is known as Pipelining. Pipelining helps in improving
the efficiency of the query-evaluation by decreasing the production of a number of temporary files.

We reduce the construction of temporary files by merging multiple operations into a pipeline. The result of the currently executing operation passes to the next operation for its execution, and the chain continues until all operations are completed and we get the final output of the expression. This type of evaluation process is known as Pipelined Evaluation.

Pipelining

With pipelined evaluation, operations form a queue, and results are passed from one operation to another
as they are calculated, hence the technique’s name.


General approach: restructure the individual operation algorithms so that they take streams of tuples as
both input and output.

Limitation: not all operation algorithms can be restructured to take streams of tuples as both input and output.

So for instance, algorithms that require sorting can only use pipelining if the input is already
sorted beforehand, since sorting by nature cannot be performed until all tuples to be sorted are known.
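Pipelined evaluation maps naturally onto Python generators, where each operator consumes and produces a stream of tuples and nothing is ever written to a temporary relation. A minimal sketch under that analogy (helper names are illustrative):

```python
# Sketch of pipelined evaluation using generators: each operator takes a
# stream of tuples in and yields a stream of tuples out, so no temporary
# relation is materialized. Names are illustrative.

def scan(relation):
    for tup in relation:
        yield tup

def select(stream, predicate):
    for tup in stream:
        if predicate(tup):
            yield tup

def project(stream, attrs):
    for tup in stream:
        yield {a: tup[a] for a in attrs}

student = [{"id": 1, "name": "John"}, {"id": 2, "name": "Mary"}]

# The three operators form a pipeline; tuples flow through one at a time.
pipeline = project(select(scan(student), lambda t: t["name"] == "John"),
                   ["name"])
print(list(pipeline))  # [{'name': 'John'}]
```

A sort operator would break this picture: it would have to drain its entire input stream before yielding anything, which is exactly the limitation described above.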

Advantages of Pipeline

There are following advantages of creating a pipelining of operations:

It reduces the cost of query evaluation by eliminating the cost of reading and writing the temporary relations,
unlike the materialization process.

If we combine the root operator of a query evaluation plan in a pipeline with its inputs, the process of generating query results becomes quick. As a result, it is beneficial for users, who can view the results of their queries as soon as the outputs are generated. Otherwise, users would need to wait a long time to get and view any query results.

Pipelining vs. Materialization

Pipelining: It is a modern approach to evaluating multiple operations.
Materialization: It is a traditional approach to evaluating multiple operations.

Pipelining: It does not use any temporary relations for storing the results of the evaluated operations.
Materialization: It uses temporary relations for storing the results of the evaluated operations, so it needs more temporary files and I/O.

Pipelining: It is a more efficient way of query evaluation, as it quickly generates the results.
Materialization: It is less efficient, as it takes time to generate the query results.

Pipelining: It requires memory buffers at a high rate for generating outputs; insufficient memory buffers will cause thrashing.
Materialization: It does not have any higher requirements for memory buffers for query evaluation.

Pipelining: Poor performance if thrashing occurs.
Materialization: No thrashing occurs in materialization; thus, in such cases, materialization has better performance.

Pipelining: It optimizes the cost of query evaluation, as it does not include the cost of reading and writing temporary storage.
Materialization: The overall cost includes the cost of the operations plus the cost of reading and writing results to temporary storage.
At the end of this session, you will be able to:

Introduction to Parallel Database

Difference Between Parallel Database and Distributed Database

Goal of Parallel Database

Introduction to Parallel Database

Organizations of every size benefit from databases because they improve the management of information.

The database has a specialized program, a server that oversees all user requests for data and adheres to strict
rules for security and system integrity.

If an organization has a large user base and millions of records to process, it may turn to a parallel database
approach.

Parallel databases are fast, flexible and reliable.

A parallel database system seeks to improve performance through parallelization of various operations, such
as loading data, building indexes and evaluating queries.

• Although the entire database may be stored on a single server, the client-server and centralized database systems are not efficient enough to handle such business requirements.

• The need to improve the efficiency gave birth to the concept of Parallel Databases.

• A parallel database system improves the performance of data processing by using multiple resources in parallel; for example, multiple CPUs and disks are used in parallel.

• It also performs many parallelization operations like data loading and query processing.

• A parallel database solves this problem by splitting database operations into separate tasks, each running
on a separate computer.

• The computers share the workload, allowing more database processing than is possible with a single
server.
Difference Between Parallel Database and Distributed Database

More advanced approaches use several computers and many files, sometimes at different locations.

Parallel and distributed methods improve access speed for very large databases, access for geographically
dispersed organizations and reliability for applications that depend on uptime.

A distributed database houses data in two or more server computers at separate locations.

For example, a Chicago server and a Kansas City server may share a link over the Internet such that the Chicago database receives shipment records from Kansas City every night.

A typical parallel database resides in one location with one set of files, though several computers share the
workload.

A parallel database receives Structured Query Language (SQL) requests from users.

The server breaks these down into a series of steps, then executes them.

A standard database server performs all the steps by itself whereas a parallel database assigns steps to
different computers.

When each computer finishes its task, the database assembles the information and sends results back to the
user.

Because each computer works on only part of the work, together they finish a SQL request in much less time.

As an organization's database requirements grow, you add computers to the parallel database to meet the
increased workload.

Distributed databases improve access, as each local office has its own database.

Most SQL transactions take place at the office level without the delays incurred by long-distance data networks.

Each local database has information in common with the others, but may also have data unique to the location.

Periodically, the local databases synchronize over a long-distance network to stay current with each other.

By contrast, a parallel database doesn't improve access to remote locations.

Goals of Parallel Databases


Improve performance: The performance of the system can be improved by connecting multiple CPUs and disks in parallel. Many small processors can also be connected in parallel.

Improve availability of data: Data can be copied to multiple locations to improve the availability of data.

For example: if a module contains a relation (table in database) which is unavailable then it is important to
make it available from another module.

Improve reliability: Reliability of system is improved with completeness, accuracy and availability of data.

Provide distributed access of data: Companies having many branches in multiple cities can access data with
the help of parallel database system.

At the end of this session, you will be able to:

• Design of Parallel Database

• Advantage of Parallel Database

• Parallel Query Evaluation

Parallel Database Design

• A parallel database system improves the performance of data processing by using multiple resources in parallel; multiple CPUs, disks and memory are used in parallel.

• The database community distinguishes three types of architecture for parallel database systems:

1. Shared nothing:

• Each processor has its own memory and its own peripherals.

• Communication is only through messages.

• This design has great fault isolation and scales well.

• The problem is the communication bottleneck.
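As a toy illustration of the shared-nothing idea, a relation can be spread across nodes by hashing a partitioning key, so that each node owns a disjoint slice of the data and all other access happens via messages. Here each "node" is just a Python list and `hash_partition` is an invented helper.

```python
# Sketch of shared-nothing data placement: rows are assigned to nodes by
# hashing the partitioning key. Node structure and names are hypothetical.

def hash_partition(relation, key, n_nodes):
    partitions = [[] for _ in range(n_nodes)]
    for row in relation:
        # Each node owns the rows whose key hashes to its index.
        partitions[hash(row[key]) % n_nodes].append(row)
    return partitions

rows = [{"id": i, "val": i * 10} for i in range(8)]
parts = hash_partition(rows, "id", 3)
print([len(p) for p in parts])  # [3, 3, 2]
```

Because every row lands on exactly one node, a query can be fanned out to all nodes and each node scans only its own partition, which is where the scalability of this design comes from.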

2. Shared disk:

Each processor has its own memory, but it shares the disk subsystem with other processors.

In addition, there is a communication network.

This design allows for an easy migration from a single node configuration.
It offers good fault isolation and goes well with electronic disks for performance improvement.

3. Shared everything:

This is what has come to be known as an SMP (symmetric multiprocessor).

Memory and peripherals are shared among many processors.

The advantages are easy programming and load balancing.

The problem is low fault isolation because of the shared memory.

The above distinction is more of a taxonomy than an accurate distinction of real system architectures.

An important question, however, is which architecture should be used for a specific database application.

• At the top level, the system is partitioned with respect to main memory and peripherals; there is a communication system with high bandwidth and low latency.

• This puts it into the shared nothing category.

• Each node, however, could be an SMP that is responsible for a certain partition of the peripherals.

• This implies both shared disks and shared everything. If you think this is too far out, consider this:

Even today there are large distributed systems with SMP nodes that in total run a distributed database system. We will find SMP structures on processor chips, complete with partitioned caches, shared higher-level caches, etc.

Advantage of Parallel Database

Speed

The main advantage to parallel databases is speed.


The server breaks up a user database request into parts and dispatches each part to a separate computer.

They work on the parts simultaneously and merge the results, passing them back to the user.

This speeds up most data requests, allowing faster access to very large databases.

Reliability

• A parallel database, properly configured, can continue to work despite the failure of any computer in the
cluster.

• The database server senses that a specific computer is not responding and reroutes its work to the
remaining computers.

Capacity

• As more users request access to the database, the computer administrators add more computers to the
parallel server, boosting its overall capacity.

• A parallel database, for example, allows a large online retailer to have thousands of users accessing
information at the same time.

• This level of performance is not possible with a single server.

When Is Parallel Processing Not Advantageous

The following guidelines describe situations when parallel processing is not advantageous.

• In general, parallel processing is less advantageous when the cost of synchronization becomes too high
and therefore throughput decreases.

• If many users on a large number of nodes modify a small set of data, then synchronization is likely to be
very high. However, if they just read data, then no synchronization is required.

• Parallel processing is not advantageous when there is contention between instances on a single block or row.

For example, it would not be effective to use a table with one row used primarily as a sequence-numbering tool.

Parallel Query Evaluation Technique

The two techniques used in parallel query evaluation are as follows:


1. Inter-query parallelism: This technique allows multiple queries to run on different processors simultaneously, which improves the throughput of the system.

• For example: if there are 6 queries and each query takes 3 seconds to evaluate, the total time to complete the evaluation sequentially is 18 seconds. Inter-query parallelism (with six processors) achieves this task in only 3 seconds. However, inter-query parallelism is difficult to achieve every time.

2. Intra-query parallelism: In this technique, a query is divided into sub-queries which can run simultaneously on different processors, minimizing the query evaluation time.

• Intra-query parallelism improves the response time of the system.

For example: if we have a query which takes 10 seconds to complete the evaluation process, we may achieve this task in only 2 seconds by using intra-query evaluation, as the query is divided into sub-queries.
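The two techniques can be sketched with Python thread pools. The "queries" below are trivial stand-ins (summing lists) for real SQL evaluation, so the point is the shape of the parallelism, not the work itself.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(q):
    return sum(q)  # stand-in for evaluating one query

queries = [[1, 2], [3, 4], [5, 6]]

# Inter-query parallelism: different whole queries on different workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    inter_results = list(pool.map(run_query, queries))

# Intra-query parallelism: one query split into sub-queries whose
# partial results are combined afterwards.
big_query = list(range(100))
halves = [big_query[:50], big_query[50:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    intra_result = sum(pool.map(run_query, halves))

print(inter_results, intra_result)  # [3, 7, 11] 4950
```

Inter-query parallelism improves throughput (more queries finished per second), while intra-query parallelism improves the response time of a single query, matching the distinction drawn above.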

Optimization of Parallel Query

• Parallel query optimization means selecting the most efficient query evaluation plan.

• Parallel query optimization plays an important role in developing systems that minimize the cost of query evaluation.

• Two factors play a very important role in parallel query optimization:

a) the total time spent to find the best plan;

b) the amount of time required to execute the plan.

Query optimization is done with an aim to:

• speed up queries by finding the plan which gives the fastest result on execution;

• increase the performance of the system;

• select the best query evaluation plan;

• avoid unwanted plans.
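The plan-selection step itself can be sketched in a couple of lines: given cost estimates for a set of candidate plans, the optimizer picks the cheapest one. The plan names and cost numbers below are invented.

```python
# Toy sketch of cost-based plan selection; plan names and estimated
# costs are made up for illustration.

candidate_plans = {
    "hash_join_then_filter": 120.0,
    "filter_then_hash_join": 45.0,
    "nested_loop_join": 900.0,
}

def best_plan(plans):
    # Select the evaluation plan with the minimum estimated cost.
    return min(plans, key=plans.get)

print(best_plan(candidate_plans))  # filter_then_hash_join
```

In a real optimizer the difficulty is in producing trustworthy cost estimates and bounding the search itself (factor (a) above), not in this final comparison.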

In this session we have discussed:

• Benefits of Parallel Database

• Query Evaluation of Parallel Database

At the end of this session, you will be able to:

• Concept of Fragmentation in Distributed Database

• Types of Fragmentation in Distributed DBMS

• Transparency in Distributed DBMS


Introduction to Fragmentation

Decomposing a database into multiple smaller units, called FRAGMENTS, which are logically related but physically dispersed, is known as fragmentation.

Characteristics of Fragmentation

• Fragmentation must be complete.

• It must be possible to reconstruct the original database from the fragments.

A relation can be fragmented in three ways:

Introduction to Horizontal Fragmentation

• Horizontal fragmentation:

• It is a horizontal subset of a relation which contains those tuples that satisfy the selection conditions.

• Specified in the SELECT operation of the relational algebra on single or multiple attributes

• Consider the Employee relation with selection condition (Dno = 5): σ(Dno=5) Employee.

• All tuples satisfying this condition form a subset which is a horizontal fragment of the Employee relation.

Introduction to Vertical Fragmentation

• Vertical fragmentation divides a relation “vertically” by columns.

• It is a subset of a relation which is created by a subset of columns.

• Consider the Employee relation: Employee( Eno, Name, Birthdate, Gender, Designation and Salary)

• A vertical fragment can be created by keeping the values of Eno, Name, Birthdate, and Salary.

• Each fragment must include the primary key attribute of the parent relation Employee.

• In this way all vertical fragments of a relation are connected.

• PROJECT operation of the relational algebra is used ∏ Eno, Name, Birthdate, Salary (Employee)

Introduction to Mixed Fragmentation:

• A combination of Vertical fragmentation and Horizontal fragmentation.

• This is achieved by SELECT-PROJECT operations, represented by πLi(σCi(R)).

• Example: select the name and salary of all male employees from the Employee relation whose salary = $50,000.
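The three fragmentation styles can be sketched over a toy Employee relation. The rows and values below are invented, and relations are modeled as lists of dicts.

```python
# Sketch of horizontal, vertical, and mixed fragmentation on a toy
# Employee relation (data is invented for illustration).

employee = [
    {"Eno": 1, "Name": "Arun", "Dno": 5, "Gender": "M", "Salary": 60000},
    {"Eno": 2, "Name": "Ravi", "Dno": 4, "Gender": "M", "Salary": 50000},
    {"Eno": 3, "Name": "Mia",  "Dno": 5, "Gender": "F", "Salary": 70000},
]

# Horizontal fragment: sigma (Dno = 5) (Employee) -- a subset of tuples.
horizontal = [t for t in employee if t["Dno"] == 5]

# Vertical fragment: pi Eno, Name, Salary (Employee) -- a subset of
# columns, always keeping the primary key Eno so fragments reconnect.
vertical = [{a: t[a] for a in ("Eno", "Name", "Salary")} for t in employee]

# Mixed fragment: pi Name, Salary (sigma Gender='M' AND Salary=50000).
mixed = [{a: t[a] for a in ("Name", "Salary")}
         for t in employee if t["Gender"] == "M" and t["Salary"] == 50000]

print(mixed)  # [{'Name': 'Ravi', 'Salary': 50000}]
```

Note that the vertical fragment keeps the primary key in every tuple; without it, the original relation could not be reconstructed by joining the fragments, violating the completeness requirement above.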

Transparency in Distributed Database

• Degree to which system user may remain unaware of the details of how and where the data items are
stored in a distributed system

• It is the property of distributed databases by virtue of which the internal details of the distribution are hidden from the users.

• The DDBMS designer may choose to fragment tables, replicate the fragments and store them at different
sites.

Consider transparency issues in relation to:

• Fragmentation transparency

• Replication transparency

• Location transparency

Transparency in Distributed Database

a) Fragmentation transparency

• Fragmentation transparency enables users to query upon any table as if it were unfragmented. Thus, it
hides the fact that the table the user is querying on is actually a fragment or union of some fragments.

• It also conceals the fact that the fragments are located at diverse sites.

b) Replication transparency

• Replication transparency ensures that the replication of databases is hidden from the users. It enables users to query upon a table as if only a single copy of the table exists.

• Replication transparency is associated with concurrency transparency and failure transparency. Whenever
a user updates a data item, the update is reflected in all the copies of the table.

c) Location Transparency

• Location transparency ensures that the user can query on any table(s) or fragment(s) of a table as if they
were stored locally in the user’s site.
• In any distributed database system, the designer should ensure that all the stated transparencies are
maintained to a considerable extent.

Concurrency Control in Distributed Systems

• Concurrency controlling techniques ensure that multiple transactions are executed simultaneously while
maintaining the ACID properties of the transactions and serializability in the schedules.

Types of Concurrency Control Techniques in Distributed Systems:

i. Distributed Two-phase Locking Algorithm

ii. Distributed Timestamp Concurrency Control

iii. Conflict Graphs

Concurrency Control in Distributed Systems

Distributed Two-Phase Locking Algorithm

• The basic principle of distributed two-phase locking is the same as the basic two-phase locking protocol.

• However, in a distributed system there are sites designated as lock managers.

• A lock manager controls lock acquisition requests from transaction monitors.

• In order to enforce co-ordination between the lock managers in various sites, at least one site is given the
authority to see all transactions and detect lock conflicts.

• Depending upon the number of sites that can detect lock conflicts, distributed two-phase locking approaches can be of three types −

Centralized two-phase locking :

• In this approach, one site is designated as the central lock manager.

• All the sites in the environment know the location of the central lock manager and obtain lock from it
during transactions.
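A centralized lock manager can be sketched as a single table of lock owners that every site consults. This toy version handles only exclusive locks, and the class name and interface (`CentralLockManager`, `acquire`, `release`) are hypothetical.

```python
# Sketch of a centralized 2PL lock manager: one site grants or refuses
# every lock request. Only exclusive locks are modeled here.

class CentralLockManager:
    def __init__(self):
        self.owners = {}  # data item -> transaction currently holding it

    def acquire(self, txn, item):
        # Grant the lock only if the item is free or already held by the
        # same transaction; otherwise the requester must wait.
        holder = self.owners.get(item)
        if holder is None or holder == txn:
            self.owners[item] = txn
            return True
        return False

    def release(self, txn, item):
        if self.owners.get(item) == txn:
            del self.owners[item]


mgr = CentralLockManager()
print(mgr.acquire("T1", "x"))  # True: T1 gets the lock on x
print(mgr.acquire("T2", "x"))  # False: T2 conflicts and must wait
mgr.release("T1", "x")
print(mgr.acquire("T2", "x"))  # True: now T2 can proceed
```

Because all conflicts surface at one site, detection is simple, but that site is also a single point of failure and a potential bottleneck, which motivates the primary-copy and fully distributed variants that follow.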

Primary copy two-phase locking

• In this approach, a number of sites are designated as lock control centers.

• Each of these sites has the responsibility of managing a defined set of locks.

• All the sites know which lock control center is responsible for managing lock of which data table/fragment
item.
Distributed two-phase locking

• In this approach, there are a number of lock managers, where each lock manager controls locks of data
items stored at its local site.

• The location of the lock manager is based upon data distribution and replication.

Distributed Timestamp Concurrency Control

• In a centralized system, the timestamp of any transaction is determined by the physical clock reading.

• In a distributed system, however, any site's local physical/logical clock readings cannot be used as global timestamps, since they are not globally unique.

• So, a timestamp comprises a combination of the site ID and that site's clock reading.

• For implementing timestamp ordering algorithms, each site has a scheduler that maintains a separate
queue for each transaction manager.

• During transaction, a transaction manager sends a lock request to the site’s scheduler.

• The scheduler puts the request to the corresponding queue in increasing timestamp order. Requests are
processed from the front of the queues in the order of their timestamps, i.e. the oldest first.
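The (clock reading, site ID) timestamp can be sketched with Python tuples, whose built-in ordering compares the clock reading first and breaks ties by site ID, making every timestamp globally unique. The site names below are invented.

```python
# Sketch of globally unique timestamps: a (local clock, site id) pair.
# Tuple comparison orders by clock first, then by site id, so no two
# sites can ever generate equal timestamps.

def make_timestamp(local_clock, site_id):
    return (local_clock, site_id)

t1 = make_timestamp(10, "site_A")
t2 = make_timestamp(10, "site_B")  # same clock reading, different site
t3 = make_timestamp(9, "site_C")

# Requests are processed oldest first, exactly as the scheduler's
# per-transaction queues would order them.
print(sorted([t1, t2, t3]))
# [(9, 'site_C'), (10, 'site_A'), (10, 'site_B')]
```

Putting the clock reading in the first position keeps the ordering close to real time, while the site ID tiebreak guarantees uniqueness even when two sites read identical clock values.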

Conflict Graphs

• For this technique, transaction classes are defined.

• A transaction class contains two sets of data items, called the read set and the write set.

• A transaction belongs to a particular class if the transaction’s read set is a subset of the class’ read set and
the transaction’s write set is a subset of the class’ write set.

• In the read phase, each transaction issues its read requests for the data items in its read set.

• In the write phase, each transaction issues its write requests.

• A conflict graph is created for the classes to which active transactions belong.

• This contains a set of vertical, horizontal, and diagonal edges.

• A vertical edge connects two nodes within a class and denotes conflicts within the class.

• A horizontal edge connects two nodes across two classes and denotes a write-write conflict among
different classes.

• A diagonal edge connects two nodes across two classes and denotes a write-read or a read-write conflict
among two classes.
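The horizontal and diagonal edges between two classes can be derived directly from their read and write sets (vertical edges, being intra-class, are omitted here). This is a toy sketch; `edge_types` and the example sets are invented.

```python
# Sketch of classifying conflict-graph edges between two transaction
# classes from their read/write sets. Names and data are illustrative.

def edge_types(class1, class2):
    """Return the cross-class edges implied by two classes' sets."""
    edges = []
    if class1["write"] & class2["write"]:
        edges.append("horizontal")  # write-write conflict across classes
    if (class1["write"] & class2["read"]) or (class1["read"] & class2["write"]):
        edges.append("diagonal")    # write-read or read-write conflict
    return edges

c1 = {"read": {"a", "b"}, "write": {"b"}}
c2 = {"read": {"b"}, "write": {"b", "c"}}
print(edge_types(c1, c2))  # ['horizontal', 'diagonal']
```

An edge appears only when the relevant sets actually intersect, which is why declaring tight read and write sets for each class keeps the conflict graph, and hence the required synchronization, small.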

E-R Diagrams

• E-R diagrams are graphical representations of data modeling.

• E-R diagrams help in defining the relationships between different entities.

E-R diagrams use symbols to represent three different types of information:

Rectangular boxes are used to represent entities.

Diamonds are normally used to represent relationships.

Ovals are used to represent attributes.

Entity

A data entity is anything, real or abstract, about which users want to store data, e.g., worker, payment, book, etc.

Relationship

A relationship is an association that exists between one or more entities.

e.g., "the customer places an order"; here, "places" defines the relationship between the customer and the order that the customer places.

Attribute

Attributes are the characteristics which are common to all or most instances of a particular entity.

e.g., Name, Address, Phone number, Employee ID, etc. are all attributes of the entity Employee.

An attribute which uniquely identifies one and only one instance of an entity is called a primary key.

A primary key can be a combination of attributes.

E.g., Employee ID is a primary key for Employee.

An example E-R diagram is illustrated below, which is between movie, theatre, show, actor and screen.


Example E-R diagram: Hospital Management System.

Entity-Relationship Diagram

Rectangles represent entity sets.

Diamonds represent relationship sets.

Lines link attributes to entity sets and entity sets to relationship sets.

Underlining indicates primary key attributes.

Relationship Sets with Attributes

Roles
• Entity sets of a relationship need not be distinct

• The labels “manager” and “worker” are called roles; they specify how employee entities interact via the
works_for relationship set.

• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

• Role labels are optional, and are used to clarify semantics of the relationship
