3 Distribution Design


Outline

• Introduction
• Background
• Distributed Database Design
  - Fragmentation
  - Data distribution
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multidatabase Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Design Problem

• In the general setting: making decisions about the placement of data and
  programs across the sites of a computer network, as well as possibly
  designing the network itself.

• In a distributed DBMS, the placement of applications entails
  - placement of the distributed DBMS software; and
  - placement of the applications that run on the database.



Dimensions of the Problem
Level of sharing: Three possibilities

• No sharing: Each application and its data execute at one site, and there
is no communication with any other program or access to any data file
at other sites. This characterizes the very early days of networking and
is probably not very common today.

• Data sharing: All the programs are replicated at all the sites,
but data files are not. Accordingly, user requests are handled at the
site where they originate and the necessary data files are moved around
the network.

• Data-plus-program sharing: Both data and programs may be shared,
meaning that a program at a given site can request a service from
another program at a second site, which, in turn, may have to access a
data file located at a third site.
Dimensions of the Problem
Access pattern behavior

• It is possible to identify two alternatives. The access patterns of user
  requests may be
  - static, so that they do not change over time, or
  - dynamic.

• It is easier to plan for and manage static environments than dynamic
  distributed systems.

• Unfortunately, it is difficult to find many real-life distributed
  applications that would be classified as static.
Dimensions of the Problem

Level of knowledge

• The third dimension of classification is the level of knowledge about the
  access pattern behavior.

• One possibility is that the designers do not have any information about
  how users will access the database.

• This is a theoretical possibility, but it is very difficult, if not
  impossible, to design a distributed DBMS that can effectively cope with
  this situation.

• The more practical alternatives are that the designers have complete
  information, where the access patterns can reasonably be predicted and
  do not deviate significantly from these predictions, or partial
  information, where there are deviations from the predictions.
Distribution Design

 Top-down
 mostly in designing systems from scratch

 mostly in homogeneous systems

 Bottom-up
 when the databases already exist at a number of sites
Top-Down Design
Requirements analysis

• Requirements analysis defines the environment of the system and
  “elicits both the data and processing needs of all potential database
  users”.

• The requirements study also specifies where the final system is
  expected to stand with respect to the objectives of a distributed DBMS.

• These objectives are defined with respect to performance, reliability
  and availability, economics, and expandability (flexibility).
Top-Down Design

View design and Conceptual design

• The requirements document is input to two parallel activities:
  - view design and
  - conceptual design.

• The view design activity deals with defining the interfaces for end
  users.

• The conceptual design is the process by which the enterprise is examined
  to determine entity types and relationships among these entities.
Top-Down Design
Entity analysis and Functional analysis

• One can possibly divide this process into two related activity groups:
  - entity analysis and
  - functional analysis.
• Entity analysis is concerned with determining the entities, their
  attributes, and the relationships among them.
• Functional analysis is concerned with determining the fundamental
  functions with which the modeled enterprise is involved.
• The results of these two steps need to be cross-referenced to get a
  better understanding of which functions deal with which entities.
Top-Down Design
Statistical information

• In conceptual design and view design activities the user needs to
  specify the data entities and must determine the applications that will
  run on the database as well as statistical information about these
  applications.

• Statistical information includes the specification of the frequency of
  user applications, the volume of various information, and the like.
Top-Down Design
Distribution design

• The global conceptual schema (GCS) and access pattern information
  collected as a result of view design are inputs to the distribution
  design step.

• The objective at this stage, which is the focus of this chapter, is to
  design the local conceptual schemas (LCSs) by distributing the entities
  over the sites of the distributed system.

• It is possible to treat each entity as a unit of distribution.

Top-Down Design
Distribution design
• There is a relationship between the conceptual design and the view
  design.

• In one sense, the conceptual design can be interpreted as being an
  integration of user views.

• Even though this view integration activity is very important, the
  conceptual model should support not only the existing applications, but
  also future applications.

• View integration should be used to ensure that entity and relationship
  requirements for all the views are covered in the conceptual schema.
Top-Down Design
Distribution design

• Rather than distributing relations, it is quite common to divide them
  into sub-relations, called fragments, which are then distributed.
• Thus, the distribution design activity consists of two steps:
  - fragmentation and
  - allocation.
• The last step in the design process is the physical design, which maps
  the local conceptual schemas to the physical storage devices available
  at the corresponding sites.
• The inputs to this process are the local conceptual schema and the
  access pattern information about the fragments in them.
Distribution Design Issues

 Why fragment at all?

 How to fragment?

 How much to fragment?

 How to test correctness?

 How to allocate?

 Information requirements?
Fragmentation

• Can't we just distribute relations?

• What is a reasonable unit of distribution?
  - relation
    - views are subsets of relations, so whole relations give poor locality
    - extra communication
  - fragments of relations (sub-relations)
    - concurrent execution of a number of transactions that access
      different portions of a relation
    - views that cannot be defined on a single fragment will require
      extra processing
    - semantic data control (especially integrity enforcement) is more
      difficult
Example
Relation Schemes

EMP(ENO, ENAME, TITLE, SAL, PNO, RESP, DUR)
PROJ(PNO, PNAME, BUDGET)

• Relation keys (tuple identifiers): (ENO, PNO) for EMP; PNO for PROJ.

• Tabular form

  EMP:   ENO | ENAME | TITLE | SAL | PNO | RESP | DUR
  PROJ:  PNO | PNAME | BUDGET
Example
Relation Instances
Example
Normalized Relations

Figure 3.3
Example
Transparent Access

SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE

[Figure: the query is issued against five sites (Tokyo, Paris, Boston,
Montreal, New York) linked by a communication network; each site stores
some of the employee, project, and assignment fragments.]
Fragmentation Alternatives – Horizontal

PROJ1 : projects with budgets less than $200,000
PROJ2 : projects with budgets greater than or equal to $200,000

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
Example
Fragmentation Alternatives – Vertical

PROJ1 : information about project budgets
PROJ2 : information about project names and locations

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston
Degree of Fragmentation

• There is a finite number of alternatives, ranging from individual tuples
  or attributes at one extreme to whole relations at the other.

• The task is to find the suitable level of partitioning within this range.
Correctness of Fragmentation

 Completeness
 Decomposition of relation R into fragments R1, R2, ..., Rn is
complete if and only if each data item in R can also be
found in some Ri
 Reconstruction
 If relation R is decomposed into fragments R1, R2, ..., Rn,
then there should exist some relational operator ∇ such
that
$R = \nabla_{1 \le i \le n} R_i$
 Disjointness
 If relation R is decomposed into fragments R1, R2, ..., Rn,
and data item di is in Rj, then di should not be in any other
fragment Rk (k ≠ j ).
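The three rules can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch, assuming relations are lists of tuples and reconstruction is by union; the helper names are hypothetical, and the PROJ rows are the ones used elsewhere in these slides.

```python
# PROJ rows as (PNO, PNAME, BUDGET, LOC); helper names are illustrative only.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
    ("P5", "CAD/CAM", 500000, "Boston"),
]
BUDGET = 2  # position of the BUDGET attribute in each tuple


def horizontal_fragments(relation, predicates):
    """Apply one selection per predicate and return the list of fragments."""
    return [[t for t in relation if p(t)] for p in predicates]


def is_complete(relation, fragments):
    """Every tuple of the relation appears in at least one fragment."""
    return all(any(t in f for f in fragments) for t in relation)


def is_reconstructible(relation, fragments):
    """The union of the fragments gives back exactly the relation."""
    union = set()
    for f in fragments:
        union |= set(f)
    return union == set(relation)


def is_disjoint(fragments):
    """No tuple appears in more than one fragment."""
    seen = set()
    for f in fragments:
        for t in f:
            if t in seen:
                return False
            seen.add(t)
    return True


good = horizontal_fragments(
    PROJ, [lambda t: t[BUDGET] < 200000, lambda t: t[BUDGET] >= 200000]
)
# Overlapping ranges: P1 (150000) satisfies both predicates.
bad = horizontal_fragments(
    PROJ, [lambda t: t[BUDGET] < 200000, lambda t: t[BUDGET] > 140000]
)
```

Here `good` passes all three checks, while `bad` is complete and reconstructible but fails the disjointness check.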
Allocation Alternatives

• Non-replicated
  - partitioned : each fragment resides at only one site
• Replicated
  - fully replicated : each fragment at each site
  - partially replicated : each fragment at some of the sites
• Rule of thumb:
  If (read-only queries) / (update queries) ≥ 1, replication is
  advantageous; otherwise replication may cause problems.
Comparison of Replication Alternatives

                      Full replication      Partial replication  Partitioning
QUERY PROCESSING      Easy                  Same difficulty      Same difficulty
DIRECTORY MANAGEMENT  Easy or nonexistent   Same difficulty      Same difficulty
CONCURRENCY CONTROL   Moderate              Difficult            Easy
RELIABILITY           Very high             High                 Low
REALITY               Possible application  Realistic            Possible application
Information Requirements

 Four categories:
 Database information
 Application information
 Communication network information
 Computer system information
Fragmentation

 Horizontal Fragmentation (HF)


 Primary Horizontal Fragmentation (PHF)
 Derived Horizontal Fragmentation (DHF)

 Vertical Fragmentation (VF)


 Hybrid Fragmentation (HF)
PHF – Information Requirements
• Database information
  - relationships (links) among relations

    [Figure: link L1 from SKILL(TITLE, SAL) to EMP(ENO, ENAME, TITLE);
    the other relations shown are PROJ(PNO, PNAME, BUDGET, LOC) and
    ASG(ENO, PNO, RESP, DUR).]

  - cardinality of each relation: card(R)


PHF - Information Requirements
• Application information
  - simple predicates : Given R[A1, A2, …, An], a simple predicate pj is

      pj : Ai θ Value

    where θ ∈ {=, <, ≤, >, ≥, ≠}, Value ∈ Di, and Di is the domain of Ai.
    For relation R we define Pr = {p1, p2, …, pm}
    Example :
      PNAME = "Maintenance"
      BUDGET ≤ 200000
  - minterm predicates : Given R and Pr = {p1, p2, …, pm},
    define M = {m1, m2, …, mz} as

      $M = \{ m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^* \}, \quad 1 \le j \le m, \; 1 \le i \le z$

    where pj* = pj or pj* = ¬(pj).
PHF – Information Requirements

Example

m1: PNAME="Maintenance" ∧ BUDGET≤200000

m2: NOT(PNAME="Maintenance") ∧ BUDGET≤200000

m3: PNAME="Maintenance" ∧ NOT(BUDGET≤200000)

m4: NOT(PNAME="Maintenance") ∧ NOT(BUDGET≤200000)
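Enumerating minterms is mechanical: each simple predicate appears either in its natural or its negated form, giving 2^m conjunctions for m predicates. A minimal sketch, assuming predicates are handled as plain strings (the function names `minterms` and `render` are hypothetical):

```python
from itertools import product


def minterms(simple_predicates):
    """Return all 2^m minterms as lists of (predicate_text, is_positive)."""
    return [
        list(zip(simple_predicates, signs))
        for signs in product([True, False], repeat=len(simple_predicates))
    ]


def render(minterm):
    """Write a minterm as a conjunction, negating where required."""
    return " AND ".join(
        p if positive else f"NOT({p})" for p, positive in minterm
    )


Pr = ['PNAME="Maintenance"', "BUDGET<=200000"]
M = [render(m) for m in minterms(Pr)]
for text in M:
    print(text)
```

The four strings produced are the same conjunctions as m1 to m4 above, although the enumeration order differs from the numbering on the slide.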


Example
Consider relation PAY of Figure 3.3. The following are some of the
possible simple predicates that can be defined on PAY.
PHF – Information Requirements
• Application information
  - minterm selectivities: sel(mi)
    - The number of tuples of the relation that would be accessed by a
      user query which is specified according to a given minterm
      predicate mi.
  - access frequencies: acc(qi)
    - The frequency with which a user application qi accesses data.
    - Access frequency for a minterm predicate can also be defined.
Primary Horizontal Fragmentation

Definition :
$R_j = \sigma_{F_j}(R), \quad 1 \le j \le w$
where Fj is a selection formula, which is (preferably) a minterm
predicate.
Therefore, a horizontal fragment Ri of relation R consists of all the
tuples of R which satisfy a minterm predicate mi.

Given a set of minterm predicates M, there are as many horizontal
fragments of relation R as there are minterm predicates.
The set of horizontal fragments is also referred to as minterm fragments.
PHF – Algorithm

Given: A relation R, the set of simple predicates Pr


Output: The set of fragments of R = {R1, R2,…,Rw} which
obey the fragmentation rules.

Preliminaries :
 Pr should be complete
 Pr should be minimal
Example

We assume that the non-negativity of the BUDGET values is a feature of the
relation that is enforced by an integrity constraint. Otherwise, a simple
predicate of the form 0 ≤ BUDGET also needs to be included in Pr.

Example 3.7 demonstrates one of the problems of horizontal partitioning.
If the domains of the attributes participating in the selection formulas
are continuous and infinite, as in Example 3.7, it is quite difficult to
define the set of formulas F = {F1, F2, …, Fn} that would fragment the
relation properly. One possible course of action is to define ranges, as
we have done in Example 3.7. However, there is always the problem of
handling the two endpoints. For example, if a new tuple with a BUDGET
value of, say, $600,000 were to be inserted into PROJ, one would have to
review the fragmentation to decide if the new tuple is to go into PROJ2 or
if the fragments need to be revised and a new fragment needs to be
defined as
Example
Completeness of Simple Predicates
• A set of simple predicates Pr is said to be complete if and only if any
  two tuples of the same minterm fragment defined on Pr have the same
  probability of being accessed by any application.

• Example :
  - Assume PROJ[PNO,PNAME,BUDGET,LOC] has two applications defined on it.
    (1) Find the budgets of projects at each location.
    (2) Find projects with budgets less than $200000.
Completeness of Simple Predicates

According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}

which is not complete with respect to (2).


Modify
Pr ={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”, BUDGET≤200000,BUDGET>200000}

which is complete.
Minimality of Simple Predicates

• If a predicate influences how fragmentation is performed (i.e., causes
  a fragment f to be further fragmented into, say, fi and fj), then there
  should be at least one application that accesses fi and fj differently.
• In other words, the simple predicate should be relevant in determining
  a fragmentation. With mi and mj the minterm predicates that define fi
  and fj, the predicate is relevant if and only if

    $\frac{acc(m_i)}{card(f_i)} \neq \frac{acc(m_j)}{card(f_j)}$

• If all the predicates of a set Pr are relevant, then Pr is minimal.
Minimality of Simple Predicates

Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}

is minimal (in addition to being complete). However, if we add


PNAME = “Instrumentation”

then Pr is not minimal.


COM_MIN Algorithm
Given: a relation R and a set of simple predicates Pr
Output: a complete and minimal set of simple predicates Pr'
for Pr

Rule 1: a relation or fragment is partitioned into at least two


parts which are accessed differently by at least one
application.
COM_MIN Algorithm
• Initialization :
  - find a pi ∈ Pr such that pi partitions R according to Rule 1
  - set Pr' = {pi} ; Pr ← Pr – {pi} ; F ← {fi}
• Iteratively add predicates to Pr' until it is complete
  - find a pj ∈ Pr such that pj partitions some fk, defined according to
    a minterm predicate over Pr', according to Rule 1
  - set Pr' = Pr' ∪ {pj} ; Pr ← Pr – {pj} ; F ← F ∪ {fj}
  - if ∃ pk ∈ Pr' which is nonrelevant then
      Pr' ← Pr' – {pk}
      F ← F – {fk}
PHORIZONTAL Algorithm
Makes use of COM_MIN to perform fragmentation.
Input: a relation R and a set of simple predicates Pr
Output: a set of minterm predicates M according to which
relation R is to be fragmented

• Pr' ← COM_MIN(R, Pr)
• determine the set M of minterm predicates
• determine the set I of implications among pi ∈ Pr'
• eliminate the contradictory minterms from M
PHF – Example
• Two candidate relations : PAY and PROJ.
• Fragmentation of relation PAY
  - Application: Check the salary info and determine raise.
  - Employee records are kept at two sites, so the application runs at
    two sites
  - Simple predicates
      p1 : SAL ≤ 30000
      p2 : SAL > 30000
    Pr = {p1, p2}, which is complete and minimal, so Pr' = Pr
  - Minterm predicates
      m1 : (SAL ≤ 30000)
      m2 : NOT(SAL ≤ 30000) = (SAL > 30000)
PHF – Example

PAY1
TITLE        SAL
Mech. Eng.   27000
Programmer   24000

PAY2
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
PHF – Example
• Fragmentation of relation PROJ
  - Applications:
    (1) Find the name and budget of projects given their location.
        Issued at three sites.
    (2) Access project information according to budget.
        One site accesses ≤ 200000, the other accesses > 200000.
  - Simple predicates
    For application (1)
      p1 : LOC = “Montreal”
      p2 : LOC = “New York”
      p3 : LOC = “Paris”
    For application (2)
      p4 : BUDGET ≤ 200000
      p5 : BUDGET > 200000
  - Pr = Pr' = {p1,p2,p3,p4,p5}
PHF – Example

• Fragmentation of relation PROJ continued
  - Minterm fragments left after elimination
      m1 : (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
      m2 : (LOC = “Montreal”) ∧ (BUDGET > 200000)
      m3 : (LOC = “New York”) ∧ (BUDGET ≤ 200000)
      m4 : (LOC = “New York”) ∧ (BUDGET > 200000)
      m5 : (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
      m6 : (LOC = “Paris”) ∧ (BUDGET > 200000)
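Five simple predicates give 32 candidate minterms, and most of them are contradictory (for example, a project cannot be located in both Montreal and Paris). The sketch below eliminates the contradictory ones by brute force over a small sample domain. It assumes, as the example implicitly does, that LOC takes only the three listed values; the sample budgets and the function name `satisfiable` are illustrative choices, not part of the slides.

```python
from itertools import product

# Simple predicates p1..p5 of the PROJ example as functions of (loc, budget).
Pr = {
    "p1": lambda loc, budget: loc == "Montreal",
    "p2": lambda loc, budget: loc == "New York",
    "p3": lambda loc, budget: loc == "Paris",
    "p4": lambda loc, budget: budget <= 200000,
    "p5": lambda loc, budget: budget > 200000,
}

# Assumption: LOC ranges over these three cities only. One sample budget on
# each side of the threshold is enough to exercise p4 and p5.
LOCS = ["Montreal", "New York", "Paris"]
BUDGETS = [100000, 300000]


def satisfiable(signs):
    """True if some (loc, budget) gives every predicate its required sign."""
    return any(
        all(p(loc, b) == s for p, s in zip(Pr.values(), signs))
        for loc in LOCS
        for b in BUDGETS
    )


all_minterms = list(product([True, False], repeat=len(Pr)))
kept = [m for m in all_minterms if satisfiable(m)]
print(len(all_minterms), len(kept))
```

Under the stated domain assumption, exactly six sign patterns survive, each with one location predicate and one budget predicate in positive form, which matches m1 to m6.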
PHF – Example

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal

PROJ3
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  135000  New York

PROJ4
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York

PROJ6
PNO  PNAME              BUDGET  LOC
P4   Maintenance        310000  Paris
PHF – Correctness
• Completeness
  - Since Pr' is complete and minimal, the selection predicates are
    complete

• Reconstruction
  - If relation R is fragmented into FR = {R1, R2, …, Rr}

    $R = \bigcup_{R_i \in F_R} R_i$
• Disjointness
  - Minterm predicates that form the basis of fragmentation should be
    mutually exclusive.
Derived Horizontal Fragmentation

• Defined on a member relation of a link according to a selection
  operation specified on its owner.
  - Each link is an equijoin.
  - Equijoin can be implemented by means of semijoins.

[Figure: link L1 from SKILL(TITLE, SAL) to EMP(ENO, ENAME, TITLE); link L2
from EMP and link L3 from PROJ(PNO, PNAME, BUDGET, LOC) to
ASG(ENO, PNO, RESP, DUR).]
DHF – Definition
Given a link L where owner(L)=S and member(L)=R, the derived horizontal
fragments of R are defined as

$R_i = R \ltimes_F S_i, \quad 1 \le i \le w$

where w is the maximum number of fragments that will be defined on R
and

$S_i = \sigma_{F_i}(S)$

where Fi is the formula according to which the primary horizontal
fragment Si is defined.
DHF – Example
Given link L1 where owner(L1)=PAY and member(L1)=EMP

EMP1 = EMP ⋉ PAY1
EMP2 = EMP ⋉ PAY2

where
PAY1 = σ SAL≤30000 (PAY)
PAY2 = σ SAL>30000 (PAY)

EMP1
ENO  ENAME      TITLE
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E7   R. Davis   Mech. Eng.

EMP2
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E8   J. Jones   Syst. Anal.
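The derivation of EMP1 and EMP2 can be reproduced with a small semijoin over in-memory tuples. This is only a sketch: the `semijoin` helper and the positional attribute indexes are hypothetical, and the PAY and EMP rows are the ones shown in the fragment tables of this example.

```python
# PAY rows as (TITLE, SAL); EMP rows as (ENO, ENAME, TITLE).
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
]


def semijoin(member, owner_fragment, member_attr, owner_attr):
    """Keep the member tuples whose join value occurs in the owner fragment."""
    keys = {t[owner_attr] for t in owner_fragment}
    return [t for t in member if t[member_attr] in keys]


# Primary horizontal fragments of the owner relation PAY.
PAY1 = [t for t in PAY if t[1] <= 30000]
PAY2 = [t for t in PAY if t[1] > 30000]

# Derived horizontal fragments of the member relation EMP (join on TITLE).
EMP1 = semijoin(EMP, PAY1, member_attr=2, owner_attr=0)
EMP2 = semijoin(EMP, PAY2, member_attr=2, owner_attr=0)
```

Because every TITLE in EMP appears in exactly one PAY fragment, the two derived fragments are complete and disjoint.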
DHF – Correctness
 Completeness
 Referential integrity
 Let R be the member relation of a link whose owner is
relation S which is fragmented as FS = {S1, S2, ..., Sn}.
Furthermore, let A be the join attribute between R and S.
Then, for each tuple t of R, there should be a tuple t' of S
such that
t[A] = t' [A]
 Reconstruction
 Same as primary horizontal fragmentation.
 Disjointness
 Simple join graphs between the owner and the member
fragments.
Example
Let us continue with the distribution design of the database we started in
Example 3.11. We already decided on the fragmentation of relation EMP
according to the fragmentation of PAY (Example 3.12
Example
Let us now consider ASG. Assume that there are the following two
applications:
Exercises-Example

Given relation PAY as in Figure 3.3, let p1: SAL < 30000 and
p2: SAL ≥ 30000 be two simple predicates. Perform a
horizontal fragmentation of PAY with respect to these
predicates to obtain PAY1 and PAY2. Using the fragmentation
of PAY, perform further derived horizontal fragmentation for
EMP. Show completeness, reconstruction, and disjointness of
the fragmentation of EMP.
Vertical Fragmentation
 Has been studied within the centralized context
 design methodology
 physical clustering
 More difficult than horizontal, because more
alternatives exist.
Two approaches :
 grouping
 attributes to fragments
 splitting
 relation to fragments
Vertical Fragmentation
 Overlapping fragments
 grouping
 Non-overlapping fragments
 splitting
We do not consider the replicated key attributes to
be overlapping.
Advantage:
Easier to enforce functional dependencies
(for integrity checking etc.)
VF – Information Requirements

• Application information
  - Attribute affinities
    - a measure that indicates how closely related the attributes are
    - obtained from more primitive usage data
  - Attribute usage values
    - Given a set of queries Q = {q1, q2, …, qq} that will run on the
      relation R[A1, A2, …, An],

        use(qi, Aj) = 1 if attribute Aj is referenced by query qi
                      0 otherwise

      use(qi, •) can be defined accordingly



VF – Definition of use(qi,Aj)

Consider the following 4 queries for relation PROJ

q1: SELECT BUDGET
    FROM PROJ
    WHERE PNO=Value
q2: SELECT PNAME, BUDGET
    FROM PROJ
q3: SELECT PNAME
    FROM PROJ
    WHERE LOC=Value
q4: SELECT SUM(BUDGET)
    FROM PROJ
    WHERE LOC=Value

Let A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC

     A1  A2  A3  A4
q1    1   0   1   0
q2    0   1   1   0
q3    0   1   0   1
q4    0   0   1   1
VF – Affinity Measure aff(Ai,Aj)
The attribute affinity measure between two attributes Ai and Aj of a
relation R[A1, A2, …, An] with respect to the set of applications
Q = {q1, q2, …, qq} is defined as follows :

$aff(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access})$

where

$\text{query access} = \sum_{\text{all sites}} \text{access frequency of a query} \times \frac{\text{access}}{\text{execution}}$
VF – Calculation of aff(Ai, Aj)

Assume each query in the previous example accesses the attributes once
during each execution. Also assume the access frequencies

     S1  S2  S3
q1   15  20  10
q2    5   0   0
q3   25  25  25
q4    3   0   0

Then
aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix AA is

     A1  A2  A3  A4
A1   45   0  45   0
A2    0  80   5  75
A3   45   5  53   3
A4    0  75   3  78
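The AA matrix follows directly from the usage matrix and the per-site access frequencies. A minimal sketch, assuming one access per execution as stated above; the function name `affinity_matrix` and the list-of-lists layout are illustrative choices.

```python
# use(qi, Aj): rows are q1..q4, columns are A1..A4.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
# Access frequencies: rows are q1..q4, columns are sites S1..S3.
ACC = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, acc):
    """aff(Ai, Aj) = sum, over queries using both, of their total frequency."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for q, row in enumerate(use):
        total = sum(acc[q])  # one access per execution, summed over all sites
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += total
    return aa


AA = affinity_matrix(USE, ACC)
for row in AA:
    print(row)
```

This version also fills in the diagonal entries, which the later bond calculations in this deck make use of.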
VF – Clustering Algorithm

• Take the attribute affinity matrix AA and reorganize the attribute
  orders to form clusters where the attributes in each cluster
  demonstrate high affinity to one another.
• Bond Energy Algorithm (BEA) has been used for clustering of entities.
  BEA finds an ordering of entities (in our case attributes) such that
  the global affinity measure is maximized.

  $AM = \sum_i \sum_j$ (affinity of Ai and Aj with their neighbors)
Bond Energy Algorithm
Input: The AA matrix
Output: The clustered affinity matrix CA which is a perturbation of AA
 Initialization: Place and fix one of the columns of AA in CA.
 Iteration: Place the remaining n-i columns in the remaining i+1
positions in the CA matrix. For each column, choose the placement
that makes the most contribution to the global affinity measure.
 Row order: Order the rows according to the column ordering.
Bond Energy Algorithm

“Best” placement? Define contribution of a placement:

cont(Ai, Ak, Aj) = 2bond(Ai, Ak) + 2bond(Ak, Aj) – 2bond(Ai, Aj)

where

$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) \, aff(A_z, A_y)$
BEA – Example

Note that the diagonal values are not computed since they are meaningless.

Attribute Affinity Matrix


BEA – Example
Let us consider the AA matrix i.e. Attribute Affinity Matrix and study the contribution of
moving attribute A4 between attributes A1 and A2 given by the formula
BEA – Example
Consider the following AA matrix and the corresponding CA matrix where A1 and
A2 have been placed. Place A3:

Ordering (0-3-1) :
cont(A0,A3,A1) = 2bond(A0 , A3)+2bond(A3 , A1)–2bond(A0 , A1)
= 2* 0 + 2* 4410 – 2*0 = 8820
Ordering (1-3-2) :
cont(A1,A3,A2) = 2bond(A1 , A3)+2bond(A3 , A2)–2bond(A1,A2)
= 2* 4410 + 2* 890 – 2*225 = 10150
Ordering (2-3-4) :
cont (A2,A3,A4) = 1780
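The bond and contribution values above can be checked with a few lines of code. The sketch below also runs the placement loop to the end; it assumes the first two columns (A1, A2) are fixed initially and uses `None` for the empty boundary column written as A0 (or A4 in the last ordering) above. Function names are hypothetical, and attributes are indexed from 0, so index 2 is A3.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Sum over z of aff(Az, Ax) * aff(Az, Ay); a boundary column gives 0."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, left, mid, right):
    """Net contribution of placing column mid between left and right."""
    return (2 * bond(aa, left, mid) + 2 * bond(aa, mid, right)
            - 2 * bond(aa, left, right))


def bea_order(aa):
    """Place columns one at a time at the position with the highest cont."""
    order = [0, 1]  # assumption: A1 and A2 are placed first
    for k in range(2, len(aa)):
        best_pos, best_val = None, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            val = cont(aa, left, k, right)
            if best_val is None or val > best_val:
                best_pos, best_val = pos, val
        order.insert(best_pos, k)
    return order


print(cont(AA, None, 2, 0), cont(AA, 0, 2, 1), cont(AA, 1, 2, None))
print(bea_order(AA))
```

The three printed contributions are 8820, 10150 and 1780, and the resulting column order is A1, A3, A2, A4.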
BEA – Example

• Therefore, the CA matrix has the form

  A1  A3  A2
  45  45   0
   0   5  80
  45  53   5
   0   3  75

• When A4 is placed, the final form of the CA matrix (after row
  organization) is

       A1  A3  A2  A4
  A1   45  45   0   0
  A3   45  53   5   3
  A2    0   5  80  75
  A4    0   3  75  78
BEA – Example

Note: Although, as noted in the book, it doesn’t make sense to compute the
diagonal values in the AA matrix, we show them here since they are used in the
following calculations.
Now we start applying the BEA algorithm. We first fix the first two columns and
the Clustered Affinity (CA) Matrix looks like the following:
BEA – Example
Next we consider placing A3 – there are three places where it can be placed:
(a) to the left of A1, which has a contribution of 16950;
(b) in between A1 and A2, which has a contribution of 22050, and
(c) to the right of A2, which has a contribution of 21450.
Thus, the best ordering is A1, A3, A2, resulting in the following CA matrix:
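VF – Algorithm
How can you divide a set of clustered attributes {A1, A2, …, An} into
two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that there are no
(or minimal) applications that access both (or more than one) of the
sets?

[Figure: the clustered affinity matrix over attributes A1, …, Am, split at
a point on the diagonal into an upper-left block TA (attributes A1, …, Ai)
and a lower-right block BA (attributes Ai+1, …, Am).]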
VF – Algorithm
Define
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications that access only TA
CBQ = total number of accesses to attributes by applications that access only BA
COQ = total number of accesses to attributes by applications that access both TA and BA
Then find the point along the diagonal that maximizes
$CTQ \times CBQ - COQ^2$
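For the four-attribute PROJ example, the split point can be found by trying every position along the diagonal. This is a sketch under the same assumptions as the earlier calculation (one access per execution, frequencies summed over sites); the function name `best_split` is hypothetical, and attributes are indexed from 0 in the clustered order A1, A3, A2, A4.

```python
USE = [
    [1, 0, 1, 0],  # q1 uses A1, A3
    [0, 1, 1, 0],  # q2 uses A2, A3
    [0, 1, 0, 1],  # q3 uses A2, A4
    [0, 0, 1, 1],  # q4 uses A3, A4
]
ACC = [[15, 20, 10], [5, 0, 0], [25, 25, 25], [3, 0, 0]]
ORDER = [0, 2, 1, 3]  # clustered column order A1, A3, A2, A4


def best_split(order, use, acc):
    """Return (z, TA, BA) for the split point maximizing CTQ*CBQ - COQ^2."""
    best = None
    for i in range(1, len(order)):
        ta, ba = set(order[:i]), set(order[i:])
        ctq = cbq = coq = 0
        for q, row in enumerate(use):
            attrs = {a for a, used in enumerate(row) if used}
            total = sum(acc[q])
            if attrs <= ta:
                ctq += total      # query touches only TA
            elif attrs <= ba:
                cbq += total      # query touches only BA
            else:
                coq += total      # query touches both
        z = ctq * cbq - coq ** 2
        if best is None or z > best[0]:
            best = (z, order[:i], order[i:])
    return best


z, TA, BA = best_split(ORDER, USE, ACC)
print(z, TA, BA)
```

With these inputs the best split is {A1, A3} versus {A2, A4}, with z = 45 × 75 − 8² = 3311. In an actual fragment definition the key attribute would additionally be repeated in each fragment.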
VF – Algorithm
Two problems :
• Cluster forming in the middle of the CA matrix
  - Shift a row up and a column left and apply the algorithm to find the
    “best” partitioning point
  - Do this for all possible shifts
  - Cost O(m²)
• More than two clusters
  - m-way partitioning
  - try 1, 2, …, m–1 split points along the diagonal and try to find the
    best point for each of these
  - Cost O(2^m)
VF – Correctness
A relation R, defined over attribute set A and key K, generates the vertical
partitioning FR = {R1, R2, …, Rr}.
• Completeness
  - The following should be true for A:

    $A = \bigcup A_{R_i}$
• Reconstruction
  - Reconstruction can be achieved by

    $R = \bowtie_K R_i, \quad \forall R_i \in F_R$

• Disjointness
  - TIDs are not considered to be overlapping since they are maintained by the system
  - Duplicated keys are not considered to be overlapping
Hybrid Fragmentation

Uses a combination of horizontal and vertical fragmentation to generate the
fragments we need.
Two approaches
1. Generate a set of horizontal fragments and then vertically fragment one or
more of these horizontal fragments.
2. Generate a set of vertical fragments and then horizontally fragment one or
more of these vertical fragments.
Either way, the final fragments produced are the same.
This fragmentation approach provides the most flexibility for the
designers, but at the same time it is the most expensive approach with
respect to reconstruction of the original table.
Example

The nonfragmented version of the EMP table.


Example
Let’s assume that employee salary information needs to be maintained in a
separate fragment from the nonsalary information.
• A vertical fragmentation plan will generate the EMP_SAL and EMP_NON_SAL
vertical fragments.
• The nonsalary information needs to be fragmented into horizontal fragments,
where each fragment contains only the rows that match the city where the
employees work.
• We can achieve this by applying horizontal fragmentation to the
EMP_NON_SAL fragment of the EMP table.
The following three SQL statements show how this is achieved.

CREATE TABLE NON_SAL_MPLS_EMPS AS
SELECT *
FROM EMP_NON_SAL
WHERE Loc = 'Minneapolis';

CREATE TABLE NON_SAL_LA_EMPS AS
SELECT *
FROM EMP_NON_SAL
WHERE Loc = 'LA';

CREATE TABLE NON_SAL_NY_EMPS AS
SELECT *
FROM EMP_NON_SAL
WHERE Loc = 'New York';
Fragment Allocation
 Problem Statement
Given
F = {F1, F2, …, Fn} fragments
S ={S1, S2, …, Sm} network sites
Q = {q1, q2,…, qq} applications
Find the "optimal" distribution of F to S.
 Optimality
 Minimal cost
 Communication + storage + processing (read & update)
 Cost in terms of time (usually)
 Performance
Response time and/or throughput
 Constraints
 Per site constraints (storage & processing)
Information Requirements
• Database information
  - selectivity of fragments
  - size of a fragment
• Application information
  - access types and numbers
  - access localities
• Communication network information
  - bandwidth
  - latency
  - communication overhead
• Computer system information
  - unit cost of storing data at a site
  - unit cost of processing at a site
Allocation

File Allocation (FAP) vs Database Allocation (DAP):

• Fragments are not individual files
  - relationships have to be maintained

• Access to databases is more complicated
  - remote file access model not applicable
  - relationship between allocation and query processing

• Cost of integrity enforcement should be considered

• Cost of concurrency control should be considered
Allocation – Information Requirements
• Database Information
  - selectivity of fragments
  - size of a fragment
• Application Information
  - number of read accesses of a query to a fragment
  - number of update accesses of a query to a fragment
  - a matrix indicating which queries update which fragments
  - a similar matrix for retrievals
  - originating site of each query
• Site Information
  - unit cost of storing data at a site
  - unit cost of processing at a site
• Network Information
  - communication cost/frame between two sites
  - frame size
Allocation Model
General Form
  min(Total Cost)
  subject to
    response time constraint
    storage constraint
    processing constraint

Decision Variable

  xij = 1 if fragment Fi is stored at site Sj
        0 otherwise
Allocation Model

• Total Cost

  $\sum_{\text{all queries}} \text{query processing cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{cost of storing a fragment at a site}$

• Storage Cost (of fragment Fj at Sk)

  (unit storage cost at Sk) × (size of Fj) × xjk

• Query Processing Cost (for one query)

  processing component + transmission component
Allocation Model

• Query Processing Cost
  Processing component

    access cost + integrity enforcement cost + concurrency control cost

  - Access cost

    $\sum_{\text{all sites}} \sum_{\text{all fragments}} (\text{no. of update accesses} + \text{no. of read accesses}) \times x_{ij} \times \text{local processing cost at a site}$

  - Integrity enforcement and concurrency control costs
    - Can be similarly calculated
Allocation Model

• Query Processing Cost
  Transmission component

    cost of processing updates + cost of processing retrievals

  - Cost of updates

    $\sum_{\text{all sites}} \sum_{\text{all fragments}} \text{update message cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{acknowledgment cost}$

  - Retrieval Cost

    $\sum_{\text{all fragments}} \min_{\text{all sites}} (\text{cost of retrieval command} + \text{cost of sending back the result})$
Allocation Model

• Constraints
  - Response Time

    execution time of query ≤ max. allowable response time for that query

  - Storage Constraint (for a site)

    $\sum_{\text{all fragments}} \text{storage requirement of a fragment at that site} \le \text{storage capacity at that site}$

  - Processing constraint (for a site)

    $\sum_{\text{all queries}} \text{processing load of a query at that site} \le \text{processing capacity of that site}$
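To make the shape of the model concrete, here is a deliberately simplified sketch: non-replicated allocation, a storage term, and an access term with a flat surcharge for remote accesses. All numbers, names, and the simplifications (no integrity or concurrency control cost, no response time or processing constraint) are assumptions for illustration, not part of the model as stated above.

```python
from itertools import product

# Hypothetical toy instance: 2 fragments, 2 sites, 3 queries.
SIZES = [100, 50]            # size of F1, F2
UNIT_STORAGE = [1, 2]        # unit storage cost at S1, S2
CAPACITY = [120, 200]        # storage capacity of S1, S2
# Each query: (originating site, {fragment index: number of accesses}).
QUERIES = [(0, {0: 10}), (1, {1: 20}), (1, {0: 5})]
LOCAL_COST = 1               # processing cost per access
REMOTE_COST = 4              # extra transmission cost per remote access


def total_cost(placement):
    """placement[f] is the site holding fragment f (no replication)."""
    storage = sum(UNIT_STORAGE[s] * SIZES[f] for f, s in enumerate(placement))
    access = 0
    for origin, accesses in QUERIES:
        for f, count in accesses.items():
            remote = REMOTE_COST if placement[f] != origin else 0
            access += count * (LOCAL_COST + remote)
    return storage + access


def feasible(placement):
    """Storage constraint: fragments at a site must fit its capacity."""
    used = [0] * len(CAPACITY)
    for f, s in enumerate(placement):
        used[s] += SIZES[f]
    return all(u <= c for u, c in zip(used, CAPACITY))


def best_allocation():
    """Exhaustive search over all placements; only viable for toy sizes."""
    candidates = [
        p for p in product(range(len(CAPACITY)), repeat=len(SIZES))
        if feasible(p)
    ]
    return min(candidates, key=total_cost)


best = best_allocation()
print(best, total_cost(best))
```

In this toy instance, placing both fragments at S1 violates the storage constraint, and the cheapest feasible placement puts F1 at S1 and F2 at S2 with a total cost of 255. The search enumerates m^n placements for n fragments and m sites, so it does not scale beyond very small inputs.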
Allocation Model

 Solution Methods
 FAP is NP-complete
 DAP also NP-complete

 Heuristics based on
 single commodity warehouse location (for FAP)
 knapsack problem
 branch and bound techniques
 network flow
Allocation Model

• Attempts to reduce the solution space
  - assume all candidate partitionings known; select the “best”
    partitioning
  - ignore replication at first
  - sliding window on fragments
