Distributed Database Systems
Fall 2012
Distributed Query Optimization
SL05
Basic Concepts
Distributed Cost Model
Database Statistics
Joins and Semijoins
Query Optimization Algorithms
DDBS12, SL05
1/52
M. Bohlen
Basic Concepts/1
- Query optimization: the process of producing an optimal (or close to optimal) query execution plan, which represents an execution strategy
  - The main task in query optimization is to consider different orderings of the operations
- Centralized query optimization:
  - Find (the best) query execution plan in the space of equivalent query trees
  - Minimize an objective cost function
  - Gather statistics about relations
- Distributed query optimization brings additional issues:
  - Linear query trees are not necessarily a good choice
  - Bushy query trees are not necessarily a bad choice
  - What to ship and where to ship the relations
  - How to ship relations (ship as a whole, ship as needed)
  - When to use semi-joins instead of joins
Basic Concepts/2
- Search space: the set of alternative query execution plans (query trees)
  - Typically very large
  - The main issue is to optimize joins
  - For N relations, there are $O(N!)$ equivalent join trees that can be obtained by applying commutativity and associativity rules
- Example: 3 equivalent query trees (join trees) of the joins in the following query
    SELECT ENAME, RESP
    FROM   EMP, ASG, PROJ
    WHERE  EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
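To make the size of the search space concrete, the following minimal sketch enumerates the N! left-deep join orders of the example query and flags those that would need a Cartesian product. The names `JOIN_EDGES`, `needs_cartesian_product`, and `left_deep_orders` are illustrative assumptions, not part of the slides; bushy trees would add further alternatives that are not counted here.

```python
from itertools import permutations

# Join predicates of the example query: EMP-ASG (on ENO) and ASG-PROJ (on PNO).
JOIN_EDGES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}


def needs_cartesian_product(order):
    """True if some relation in a left-deep order has no join predicate
    with any relation placed before it."""
    for i in range(1, len(order)):
        joined = any(frozenset({order[i], prev}) in JOIN_EDGES for prev in order[:i])
        if not joined:
            return True
    return False


def left_deep_orders(relations):
    """Enumerate all N! left-deep join orders, each with a Cartesian-product flag."""
    return [(order, needs_cartesian_product(order)) for order in permutations(relations)]


orders = left_deep_orders(["EMP", "ASG", "PROJ"])
for order, has_cp in orders:
    note = " (needs Cartesian product)" if has_cp else ""
    print(" JOIN ".join(order) + note)
```

For three relations this yields 3! = 6 orders, two of which start by combining EMP and PROJ, which share no join predicate.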
Basic Concepts/3
- Reduction of the search space
  - Restrict by means of heuristics
    - Perform unary operations before binary operations, etc.
  - Restrict the shape of the join tree
    - Consider the type of trees (linear trees vs. bushy trees)
[Figures: Linear Join Tree; Bushy Join Tree]
Basic Concepts/4
- There are two main strategies to scan the search space
  - Deterministic
  - Randomized
- Deterministic scan of the search space
  - Start from base relations and build plans by adding one relation at each step
  - Breadth-first strategy (BFS): build all possible plans before choosing the best plan (dynamic programming approach)
  - Depth-first strategy (DFS): build only one plan (greedy approach)
Basic Concepts/5
- Randomized scan of the search space
  - Search for optimal solutions around a particular starting point
  - e.g., iterative improvement or simulated annealing techniques
  - Trades optimization time for execution time
    - Does not guarantee that the best solution is obtained, but avoids the high cost of optimization
  - The strategy is better when more than 5-6 relations are involved
Distributed Cost Model/1
Two different types of cost functions can be used
- Reduce total time
  - Reduce each cost component (in terms of time) individually, i.e., do as little for each cost component as possible
  - Optimize the utilization of the resources (i.e., increase system throughput)
- Reduce response time
  - Do as many things in parallel as possible
  - May increase total time because of increased total activity
Distributed Cost Model/2
- Total time: sum of the time of all individual components
  - Local processing time: CPU time + I/O time
  - Communication time: fixed time to initiate a message + time to transmit the data
  $\text{Total time} = T_{CPU} \cdot \#\text{instructions} + T_{I/O} \cdot \#\text{I/Os} + T_{MSG} \cdot \#\text{messages} + T_{TR} \cdot \#\text{bytes}$
- The individual components of the total cost have different weights:
  - Wide area networks
    - Message initiation and transmission costs are high
    - Local processing cost is low (fast mainframes or minicomputers)
    - Ratio of communication to I/O costs is 20:1
  - Local area networks
    - Communication and local processing costs are more or less equal
    - Ratio of communication to I/O costs is 1:1.6 (10MB/s network)
Distributed Cost Model/3
- Response time: elapsed time between the initiation and the completion of a query
  $\text{Response time} = T_{CPU} \cdot \#\text{seq instructions} + T_{I/O} \cdot \#\text{seq I/Os} + T_{MSG} \cdot \#\text{seq messages} + T_{TR} \cdot \#\text{seq bytes}$
  where #seq x (x in instructions, I/Os, messages, bytes) is the maximum number of x which must be done sequentially.
- Any processing and communication done in parallel is ignored
Distributed Cost Model/4
- Example: Query at site 3 with data from sites 1 and 2, where x and y denote the amounts of data shipped from site 1 and site 2.
  - Assume that only the communication cost is considered
  - $\text{Total time} = T_{MSG} \cdot 2 + T_{TR} \cdot (x + y)$
  - $\text{Response time} = \max\{T_{MSG} + T_{TR} \cdot x,\ T_{MSG} + T_{TR} \cdot y\}$
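The difference between the two cost functions can be checked numerically. The sketch below evaluates both formulas for the two-site example; the values of `T_MSG`, `T_TR`, `x`, and `y` are hypothetical and chosen only for illustration.

```python
def total_time(t_msg, t_tr, transfers):
    """Total time: every message and every transmitted unit counts,
    regardless of whether transfers overlap."""
    return t_msg * len(transfers) + t_tr * sum(transfers)


def response_time(t_msg, t_tr, transfers):
    """Response time: transfers run in parallel, so only the slowest one counts."""
    return max(t_msg + t_tr * size for size in transfers)


# Hypothetical values: x units shipped from site 1, y units from site 2.
T_MSG, T_TR = 5, 1
x, y = 1000, 4000

print("total time:   ", total_time(T_MSG, T_TR, [x, y]))
print("response time:", response_time(T_MSG, T_TR, [x, y]))
```

With these numbers the total time is 5010 while the response time is 4005, since the smaller transfer is hidden behind the larger one.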
Database Statistics/1
- The primary cost factor is the size of intermediate relations
  - that are produced during the execution and
  - must be transmitted over the network, if a subsequent operation is located on a different site
- It is costly to compute the size of the intermediate relations precisely.
- Instead, global statistics of relations and fragments are computed and used to provide approximations
Database Statistics/2
- Let $R(A_1, A_2, \ldots, A_k)$ be a relation fragmented into $R_1, R_2, \ldots, R_r$.
- Relation statistics
  - min and max values of each attribute: $\min\{A_i\}$, $\max\{A_i\}$
  - length of each attribute: $length(A_i)$
  - number of distinct values in each domain: $card(dom(A_i))$
- Fragment statistics
  - cardinality of the fragment: $card(R_i)$
  - cardinality of each attribute of each fragment: $card(\pi_{A_i}(R_j))$, $card(A_i)$
Database Statistics/3
- Selectivity factor of an operation: the proportion of tuples of an operand relation that participate in the result of that operation
- Assumption: independent attributes and uniform distribution of attribute values
- Selectivity factor of selection
  $SF_\sigma(A = value) = \frac{1}{card(\pi_A(R))}$
  $SF_\sigma(A > value) = \frac{\max(A) - value}{\max(A) - \min(A)}$
  $SF_\sigma(A < value) = \frac{value - \min(A)}{\max(A) - \min(A)}$
Database Statistics/4
- Properties of the selectivity factor of the selection
  $SF_\sigma(p(A_i) \wedge p(A_j)) = SF_\sigma(p(A_i)) \cdot SF_\sigma(p(A_j))$
  $SF_\sigma(p(A_i) \vee p(A_j)) = SF_\sigma(p(A_i)) + SF_\sigma(p(A_j)) - (SF_\sigma(p(A_i)) \cdot SF_\sigma(p(A_j)))$
  $SF_\sigma(A \in \{values\}) = SF_\sigma(A = value) \cdot card(\{values\})$
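The selectivity formulas for selections translate directly into small functions. This is a minimal sketch under the stated assumptions (independent attributes, uniform value distribution); the function names and the sample numbers are illustrative and not taken from the slides.

```python
def sf_eq(distinct_values):
    """SF(A = value) = 1 / card(pi_A(R))."""
    return 1 / distinct_values


def sf_gt(value, min_a, max_a):
    """SF(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)


def sf_lt(value, min_a, max_a):
    """SF(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)


def sf_and(sf_i, sf_j):
    """Conjunction of predicates on independent attributes."""
    return sf_i * sf_j


def sf_or(sf_i, sf_j):
    """Disjunction: sum minus the overlap counted twice."""
    return sf_i + sf_j - sf_i * sf_j


def sf_in(distinct_values, n_values):
    """SF(A in {values}) = SF(A = value) * card({values})."""
    return sf_eq(distinct_values) * n_values


# Hypothetical attribute with values in [0, 100] and 50 distinct values.
print(sf_gt(75, 0, 100))       # fraction of tuples with A > 75
print(sf_and(0.25, 0.5))       # both predicates must hold
print(sf_or(0.25, 0.5))        # at least one predicate holds
```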
Database Statistics/5
- Cardinality of intermediate results
  - Selection
    $card(\sigma_P(R)) = SF_\sigma(P) \cdot card(R)$
  - Projection
    - More difficult: correlations between projected attributes are unknown
    - Simple if the projected attribute is a key: $card(\pi_A(R)) = card(R)$
  - Cartesian product
    $card(R \times S) = card(R) \cdot card(S)$
  - Union
    - upper bound: $card(R \cup S) \le card(R) + card(S)$
    - lower bound: $card(R \cup S) \ge \max\{card(R), card(S)\}$
  - Set difference
    - upper bound: $card(R - S) = card(R)$
    - lower bound: 0
Database Statistics/6
- Selectivity factor for joins
  $SF_{\bowtie} = \frac{card(R \bowtie S)}{card(R) \cdot card(S)}$
- Cardinality of joins
  - Upper bound: cardinality of the Cartesian product
    $card(R \bowtie S) \le card(R) \cdot card(S)$
  - General case (if SF is given):
    $card(R \bowtie S) = SF_{\bowtie} \cdot card(R) \cdot card(S)$
  - Special case: R.A is a key of R and S.A is a foreign key of S
    - each S-tuple matches with at most one tuple of R
    $card(R \bowtie_{R.A = S.A} S) = card(S)$
Database Statistics/7
- Selectivity factor for semijoins: fraction of R-tuples that join with S-tuples
  - An approximation is the selectivity of A in S
    $SF_{\ltimes}(R \ltimes_A S) = SF_{\ltimes}(S.A) = \frac{card(\pi_A(S))}{card(dom[A])}$
- Cardinality of semijoin (general case):
  $card(R \ltimes_A S) = SF_{\ltimes}(S.A) \cdot card(R)$
- Example: R.A is a foreign key referencing S (S.A is a primary key)
  - Then SF = 1 and the result size corresponds to the size of R
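The cardinality estimates for joins, semijoins, and unions can be sketched as follows. The function names and the sample statistics are hypothetical; the formulas are the ones stated on the Database Statistics slides.

```python
def card_join(sf_join, card_r, card_s):
    """General case: card(R join S) = SF_join * card(R) * card(S)."""
    return sf_join * card_r * card_s


def card_key_fk_join(card_s):
    """R.A is a key of R, S.A a foreign key: every S-tuple matches
    at most one R-tuple, so the result has card(S) tuples."""
    return card_s


def sf_semijoin(distinct_a_in_s, domain_size):
    """SF_semijoin(S.A) = card(pi_A(S)) / card(dom[A])."""
    return distinct_a_in_s / domain_size


def card_semijoin(sf, card_r):
    """card(R semijoin_A S) = SF_semijoin(S.A) * card(R)."""
    return sf * card_r


def union_bounds(card_r, card_s):
    """(lower bound, upper bound) for card(R union S)."""
    return max(card_r, card_s), card_r + card_s


# Hypothetical statistics: card(R) = 1000, card(S) = 200,
# S.A has 50 distinct values out of a domain of 200 values.
sf = sf_semijoin(50, 200)
print(card_semijoin(sf, 1000))     # estimated size of R semijoin S
print(card_join(0.01, 1000, 200))  # estimated join size for a given SF
print(union_bounds(1000, 200))
```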
Join Ordering in Fragment Queries/1
- Join ordering is an important aspect in a centralized DBMS, and it is even more important in a DDBMS since joins between fragments that are stored at different sites may increase the communication time.
- Two approaches exist:
  - Optimize the ordering of joins directly
    - INGRES and distributed INGRES
    - System R and System R*
  - Replace joins by combinations of semijoins in order to minimize the communication costs
    - Hill Climbing and SDD-1
Join Ordering in Fragment Queries/2
- Direct join ordering of two relations/fragments located at different sites
  - Move the smaller relation to the other site
  - We have to estimate the size of R and S
Join Ordering in Fragment Queries/3
- Direct join ordering of queries involving more than two relations is substantially more complex
- Example: Consider the following query and the respective join graph, where we also make assumptions about the locations of the three relations/fragments
  $PROJ \bowtie_{PNO} ASG \bowtie_{ENO} EMP$
  [Figure: join graph. The plans for this example imply EMP at Site 1, ASG at Site 2, and PROJ at Site 3.]
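Join Ordering in Fragment Queries/4
- Example (contd.): The query can be evaluated in at least 5 different ways.
  - Plan 1:
    - $EMP \rightarrow$ Site 2
    - Site 2: $EMP' = EMP \bowtie ASG$
    - $EMP' \rightarrow$ Site 3
    - Site 3: $EMP' \bowtie PROJ$
  - Plan 2:
    - $ASG \rightarrow$ Site 1
    - Site 1: $EMP' = EMP \bowtie ASG$
    - $EMP' \rightarrow$ Site 3
    - Site 3: $EMP' \bowtie PROJ$
  - Plan 3:
    - $ASG \rightarrow$ Site 3
    - Site 3: $ASG' = ASG \bowtie PROJ$
    - $ASG' \rightarrow$ Site 1
    - Site 1: $ASG' \bowtie EMP$
  - Plan 4:
    - $PROJ \rightarrow$ Site 2
    - Site 2: $PROJ' = PROJ \bowtie ASG$
    - $PROJ' \rightarrow$ Site 1
    - Site 1: $PROJ' \bowtie EMP$
  - Plan 5:
    - $EMP \rightarrow$ Site 2
    - $PROJ \rightarrow$ Site 2
    - Site 2: $EMP \bowtie PROJ \bowtie ASG$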
Join Ordering in Fragment Queries/5
- To select a plan, a lot of information is needed, including
  - $size(EMP)$, $size(ASG)$, $size(PROJ)$
  - $size(EMP \bowtie ASG)$, $size(ASG \bowtie PROJ)$
  - Possibilities of parallel execution if response time is used
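To see why these sizes matter, the sketch below compares the five plans by the amount of data each one ships between sites, ignoring local processing and parallelism. The sizes in `SIZES` are invented for illustration, and the keys `EMP_ASG` and `ASG_PROJ` stand for the two intermediate join results.

```python
def plan_costs(size):
    """Data shipped by each of the five plans (total-time view, transfers only)."""
    return {
        "Plan 1": size["EMP"] + size["EMP_ASG"],    # EMP -> Site 2, EMP' -> Site 3
        "Plan 2": size["ASG"] + size["EMP_ASG"],    # ASG -> Site 1, EMP' -> Site 3
        "Plan 3": size["ASG"] + size["ASG_PROJ"],   # ASG -> Site 3, ASG' -> Site 1
        "Plan 4": size["PROJ"] + size["ASG_PROJ"],  # PROJ -> Site 2, PROJ' -> Site 1
        "Plan 5": size["EMP"] + size["PROJ"],       # EMP and PROJ -> Site 2
    }


# Hypothetical sizes, chosen only for illustration.
SIZES = {"EMP": 400, "ASG": 1000, "PROJ": 100, "EMP_ASG": 1000, "ASG_PROJ": 300}

costs = plan_costs(SIZES)
for plan, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(plan, cost)
```

With other size estimates a different plan wins, which is why the optimizer needs both the base relation sizes and the intermediate result sizes.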
Semijoin Based Algorithms/1
- Semijoins can be used to efficiently implement joins
  - The semijoin acts as a size reducer (similar to a selection) such that smaller relations need to be transferred
- Consider two relations: R located at site 1 and S located at site 2
  - Solution with semijoins: replace one or both operand relations/fragments by a semijoin, using the following rules:
    $R \bowtie_A S \iff (R \ltimes_A S) \bowtie_A S$
    $\phantom{R \bowtie_A S} \iff R \bowtie_A (S \ltimes_A R)$
    $\phantom{R \bowtie_A S} \iff (R \ltimes_A S) \bowtie_A (S \ltimes_A R)$
- The semijoin is beneficial if the cost to produce and send it to the other site is less than the cost of sending the whole operand relation and doing the actual join.
Semijoin Based Algorithms/2
- Cost analysis $R \bowtie_A S$ vs. $(R \ltimes_A S) \bowtie S$, assuming that $size(R) < size(S)$
  - Perform the join $R \bowtie S$:
    - $R \rightarrow$ Site 2
    - Site 2 computes $R \bowtie S$
  - Perform the semijoins $(R \ltimes S) \bowtie S$:
    - $S' = \pi_A(S)$
    - $S' \rightarrow$ Site 1
    - Site 1 computes $R' = R \ltimes S'$
    - $R' \rightarrow$ Site 2
    - Site 2 computes $R' \bowtie S$
  - Semijoin is better if: $size(\pi_A(S)) + size(R \ltimes S) < size(R)$
- The semijoin approach is better if the semijoin acts as a sufficient reducer (i.e., only a few tuples of R participate in the join)
- The join approach is better if almost all tuples of R participate in the join
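The cost comparison can be traced on toy data. In this sketch relations are lists of dictionaries and size is counted as the number of attribute values shipped; both choices, the helper names, and the sample relations `R` and `S` are assumptions made for illustration.

```python
def size(relation):
    """Size of a relation, counted as the number of attribute values it holds."""
    return sum(len(row) for row in relation)


def project(relation, attr):
    """pi_attr(relation): one single-attribute tuple per distinct value."""
    return [{attr: v} for v in sorted({row[attr] for row in relation})]


def semijoin(relation, attr, projection):
    """Keep the tuples of `relation` whose attr value occurs in `projection`."""
    values = {row[attr] for row in projection}
    return [row for row in relation if row[attr] in values]


def shipping_costs(r, s, attr):
    """Return (cost of shipping R as a whole, cost of the semijoin program)."""
    join_cost = size(r)                     # R -> Site 2
    s_proj = project(s, attr)               # S' = pi_A(S), computed at Site 2
    r_reduced = semijoin(r, attr, s_proj)   # R' = R semijoin S', computed at Site 1
    semijoin_cost = size(s_proj) + size(r_reduced)
    return join_cost, semijoin_cost


# Hypothetical data: only two of the ten R-tuples find a join partner in S.
R = [{"A": i, "B": f"b{i}", "C": f"c{i}"} for i in range(10)]
S = [{"A": a, "D": f"d{a}"} for a in (2, 2, 5, 42)]

join_cost, semijoin_cost = shipping_costs(R, S, "A")
print("ship R:", join_cost, "semijoin program:", semijoin_cost)
```

Here the semijoin is a strong reducer, so shipping the projection plus the reduced relation is cheaper than shipping R as a whole.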
INGRES Algorithm/1
- INGRES uses a dynamic query optimization algorithm that recursively breaks a query into smaller pieces. It is based on the following ideas:
  - An n-relation query q is decomposed into n subqueries $q_1 \rightarrow q_2 \rightarrow \cdots \rightarrow q_n$
    - Each $q_i$ is a mono-relation (mono-variable) query
    - The output of $q_i$ is consumed by $q_{i+1}$
  - For the decomposition two basic techniques are used: detachment and substitution
  - There is a processor that can efficiently process mono-relation queries
    - Optimizes each query independently for the access to a single relation
INGRES Algorithm/2
- Detachment: break a query q into $q' \rightarrow q''$, based on a common relation that is the result of q', i.e.
  - The query
      q:   SELECT R2.A2, ..., Rn.An
           FROM   R1, R2, ..., Rn
           WHERE  P1(R1.A1')
           AND    P2(R1.A1, ..., Rn.An)
    is decomposed by detachment of the common relation R1 into
      q':  SELECT R1.A1
           INTO   R1'
           FROM   R1
           WHERE  P1(R1.A1')
      q'': SELECT R2.A2, ..., Rn.An
           FROM   R1', R2, ..., Rn
           WHERE  P2(R1'.A1, ..., Rn.An)
- Detachment reduces the size of the relation on which the query q'' is defined.
INGRES Algorithm/3
- Example: Consider query q1: Names of employees working on the CAD/CAM project
    q1:  SELECT EMP.ENAME
         FROM   EMP, ASG, PROJ
         WHERE  EMP.ENO = ASG.ENO
         AND    ASG.PNO = PROJ.PNO
         AND    PROJ.PNAME = "CAD/CAM"
- Decompose q1 into $q_{11} \rightarrow q'$:
    q11: SELECT PROJ.PNO
         INTO   JVAR
         FROM   PROJ
         WHERE  PROJ.PNAME = "CAD/CAM"
    q':  SELECT EMP.ENAME
         FROM   EMP, ASG, JVAR
         WHERE  EMP.ENO = ASG.ENO
         AND    ASG.PNO = JVAR.PNO
INGRES Algorithm/4
- Example (contd.): The successive detachments may transform q' into $q_{12} \rightarrow q_{13}$:
    q':  SELECT EMP.ENAME
         FROM   EMP, ASG, JVAR
         WHERE  EMP.ENO = ASG.ENO
         AND    ASG.PNO = JVAR.PNO
    q12: SELECT ASG.ENO
         INTO   GVAR
         FROM   ASG, JVAR
         WHERE  ASG.PNO = JVAR.PNO
    q13: SELECT EMP.ENAME
         FROM   EMP, GVAR
         WHERE  EMP.ENO = GVAR.ENO
- q1 is now decomposed by detachment into $q_{11} \rightarrow q_{12} \rightarrow q_{13}$
- q11 is a mono-relation query
- q12 and q13 are multi-relation queries, which cannot be further detached; they are also called irreducible
INGRES Algorithm/5
- Tuple substitution converts an irreducible query q into mono-relation queries.
  - Choose a relation R1 in q for tuple substitution
  - For each tuple in R1, replace the R1-attributes referred to in q by their actual values, thereby generating a set of subqueries q' with n - 1 relations, i.e.,
    $q(R_1, R_2, \ldots, R_n)$ is replaced by $\{q'(t_{1i}, R_2, \ldots, R_n),\ t_{1i} \in R_1\}$
- Example (contd.): Assume GVAR consists only of the tuples {E1, E2}. Then q13 is rewritten with tuple substitution in the following way
    q13:  SELECT EMP.ENAME
          FROM   EMP, GVAR
          WHERE  EMP.ENO = GVAR.ENO
    q131: SELECT EMP.ENAME
          FROM   EMP
          WHERE  EMP.ENO = "E1"
INGRES Algorithm/6
- Example (contd.):
    q132: SELECT EMP.ENAME
          FROM   EMP
          WHERE  EMP.ENO = "E2"
- q131 and q132 are mono-relation queries
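Tuple substitution for q13 can be sketched over small in-memory relations. The contents of `EMP` are invented for illustration; `GVAR` holds the two tuples assumed in the example, and each call to `mono_relation_query` plays the role of one generated subquery (q131, q132).

```python
# Hypothetical relation contents; only the GVAR tuples come from the example.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
GVAR = [{"ENO": "E1"}, {"ENO": "E2"}]


def mono_relation_query(emp, eno):
    """One substituted subquery:
    SELECT EMP.ENAME FROM EMP WHERE EMP.ENO = <eno>."""
    return [row["ENAME"] for row in emp if row["ENO"] == eno]


def tuple_substitution(emp, gvar):
    """Replace q13 (over EMP and GVAR) by one mono-relation query per GVAR tuple
    and collect the results."""
    result = []
    for t in gvar:
        result.extend(mono_relation_query(emp, t["ENO"]))
    return result


print(tuple_substitution(EMP, GVAR))
```

The number of generated subqueries equals the cardinality of the substituted relation, which is why it pays to substitute the smallest relation.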
Distributed INGRES Algorithm
- The distributed INGRES query optimization algorithm is very similar to the centralized INGRES algorithm.
  - In addition to the centralized INGRES, the distributed one should break up each query $q_i$ into sub-queries that operate on fragments; only horizontal fragmentation is handled.
  - Optimization with respect to a combination of communication cost and response time
System R Algorithm/1
- The System R (centralized) query optimization algorithm
  - Performs static query optimization based on exhaustive search of the solution space and a cost function (IO cost + CPU cost)
    - Input: relational algebra tree
    - Output: optimal relational algebra tree
    - Dynamic programming technique is applied to reduce the number of alternative plans
  - The optimization algorithm consists of two steps
    1. Predict the best access method to each individual relation (mono-relation query)
       - Consider using index, file scan, etc.
    2. For each relation R, estimate the best join ordering
       - R is first accessed using its best single-relation access method
       - Efficient access to inner relation is crucial
  - Considers two different join strategies
    - (Indexed-) nested loop join
    - Sort-merge join
System R Algorithm/2
- Example: Consider query q1: Names of employees working on the CAD/CAM project
  $PROJ \bowtie_{PNO} ASG \bowtie_{ENO} EMP$
- Join graph [Figure: join graph]
- Indexes
  - EMP has an index on ENO
  - ASG has an index on PNO
  - PROJ has an index on PNO and an index on PNAME
System R Algorithm/3
- Example (contd.): Step 1: Select the best single-relation access paths
  - EMP: sequential scan (because there is no selection on EMP)
  - ASG: sequential scan (because there is no selection on ASG)
  - PROJ: index on PNAME (because there is a selection on PROJ based on PNAME)
System R Algorithm/4
- Example (contd.): Step 2: Select the best join ordering for each relation
  - $(EMP \times PROJ)$ and $(PROJ \times EMP)$ are pruned because they are Cartesian products (CPs)
  - $(ASG \bowtie PROJ)$ is pruned because (we assume) it has higher cost than $(PROJ \bowtie ASG)$; similar for $(ASG \bowtie EMP)$
  - Best total join order is $((PROJ \bowtie ASG) \bowtie EMP)$, since it uses the indexes best
    - Select PROJ using index on PNAME
    - Join with ASG using index on PNO
    - Join with EMP using index on ENO
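The dynamic programming idea can be illustrated with a strongly simplified sketch. It builds left-deep plans bottom-up, prunes Cartesian products, and keeps only the cheapest plan per set of relations. The cost model (sum of estimated intermediate result cardinalities) and all cardinalities and selectivity factors are assumptions made here; System R's actual cost function also accounts for access paths and I/O and CPU cost, which this sketch ignores.

```python
from itertools import combinations


def best_left_deep_plan(card, join_sf):
    """card: relation name -> cardinality.
    join_sf: frozenset({a, b}) -> join selectivity factor for each join predicate.
    Returns (cost, result cardinality, join order) of the cheapest left-deep plan."""
    rels = sorted(card)
    # best[set of relations] = (cost so far, result cardinality, join order)
    best = {frozenset([r]): (0, card[r], (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for r in sorted(subset):            # r is joined last
                rest = subset - {r}
                if rest not in best:
                    continue                    # rest needs a Cartesian product
                sfs = [join_sf[frozenset({r, p})]
                       for p in rest if frozenset({r, p}) in join_sf]
                if not sfs:
                    continue                    # joining r would be a Cartesian product
                cost, size, order = best[rest]
                new_size = size * card[r]
                for sf in sfs:
                    new_size *= sf
                candidate = (cost + new_size, new_size, order + (r,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(rels)]


# Hypothetical statistics; PROJ has cardinality 1 after the selection on PNAME.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 1}
JOIN_SF = {
    frozenset({"EMP", "ASG"}): 1 / 400,   # each ASG tuple matches one EMP tuple
    frozenset({"ASG", "PROJ"}): 0.1,      # assumed: 10% of ASG belong to the project
}

cost, size, order = best_left_deep_plan(CARD, JOIN_SF)
print(order, cost)
```

Under these assumed statistics the cheapest order starts with PROJ and ASG and joins EMP last, in line with the join order chosen in the example.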
Distributed System R Algorithm/1
- The System R* query optimization algorithm is an extension of the System R query optimization algorithm with the following main characteristics:
  - Only whole relations can be distributed, i.e., fragmentation and replication are not considered
  - Query compilation is a distributed task, coordinated by a master site, where the query is initiated
  - The master site makes all inter-site decisions, e.g., selection of the execution sites, join ordering, method of data transfer, ...
  - The local sites do the intra-site (local) optimizations, e.g., local joins, access paths
- Join ordering and data transfer between different sites are the most critical issues to be considered by the master site
Distributed System R Algorithm/2
- Two methods for inter-site data transfer
  - Ship whole: The entire relation is shipped to the join site and stored in a temporary relation
    - Larger data transfer
    - Smaller number of messages
    - Better if relations are small
  - Fetch as needed: The outer relation is sequentially scanned, and for each tuple the join value is sent to the site of the inner relation and the matching inner tuples are sent back (i.e., semijoin)
    - Number of messages = O(cardinality of outer relation)
    - Data transfer per message is minimal
    - Better if relations are large and the selectivity is good
Distributed System R Algorithm/3
- Four main join strategies for $R \bowtie S$:
  - R is outer relation
  - S is inner relation
- Notation:
  - LT denotes local processing time
  - CT denotes communication time
  - s denotes the average number of S-tuples that match an R-tuple
- Strategy 1: Ship the entire outer relation to the site of the inner relation, i.e.,
  - Retrieve outer tuples
  - Send them to the inner relation site
  - Join them as they arrive
  $\text{Total cost} = LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) + LT(\text{retrieve } s \text{ tuples from } S) \cdot card(R)$
Distributed System R Algorithm/4
- Strategy 2: Ship the entire inner relation to the site of the outer relation. We cannot join the tuples as they arrive; they need to be stored.
  - The inner relation S needs to be stored in a temporary relation T
  $\text{Total cost} = LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) + LT(\text{store } card(S) \text{ tuples in } T) + LT(\text{retrieve } card(R) \text{ tuples from } R) + LT(\text{retrieve } s \text{ tuples from } T) \cdot card(R)$
Distributed System R Algorithm/5
- Strategy 3: Fetch tuples of the inner relation as needed for each tuple of the outer relation.
  - For each R-tuple, the join attribute A is sent to the site of S
  - The s matching S-tuples are retrieved and sent to the site of R
  $\text{Total cost} = LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(length(A)) \cdot card(R) + LT(\text{retrieve } s \text{ tuples from } S) \cdot card(R) + CT(s \cdot length(S)) \cdot card(R)$
Distributed System R Algorithm/6
- Strategy 4: Move both relations to a third site and compute the join there.
  - The inner relation S is first moved to a third site and stored in a temporary relation T.
  - Then the outer relation is moved to the third site and its tuples are joined as they arrive.
  $\text{Total cost} = LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) + LT(\text{store } card(S) \text{ tuples in } T) + LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) + LT(\text{retrieve } s \text{ tuples from } T) \cdot card(R)$
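The four cost formulas can be compared side by side once LT and CT are given concrete shapes. The sketch below assumes simple linear models (`lt` charges one unit per tuple for both retrieving and storing, `ct` charges a fixed message cost plus one unit per byte); these models and all statistics are hypothetical.

```python
def strategy_costs(card_r, card_s, size_r, size_s, s, len_a, len_s, lt, ct):
    """Total cost of the four join strategies for R (outer) join S (inner).
    lt(n): local time to retrieve or store n tuples; ct(b): time to ship b bytes."""
    return {
        "1: ship outer": lt(card_r) + ct(size_r) + lt(s) * card_r,
        "2: ship inner": (lt(card_s) + ct(size_s) + lt(card_s)
                          + lt(card_r) + lt(s) * card_r),
        "3: fetch as needed": (lt(card_r) + ct(len_a) * card_r
                               + lt(s) * card_r + ct(s * len_s) * card_r),
        "4: third site": (lt(card_s) + ct(size_s) + lt(card_s)
                          + lt(card_r) + ct(size_r) + lt(s) * card_r),
    }


def lt(n_tuples):
    """Assumed local processing model: one time unit per tuple."""
    return n_tuples


def ct(n_bytes):
    """Assumed communication model: 100 units per message plus 1 unit per byte."""
    return 100 + n_bytes


# Hypothetical statistics: small outer relation R, large inner relation S.
costs = strategy_costs(card_r=100, card_s=10_000, size_r=5_000, size_s=500_000,
                       s=2, len_a=4, len_s=50, lt=lt, ct=ct)
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(name, cost)
```

With a small outer and a large inner relation, shipping the outer relation is cheapest here; fetch as needed pays the per-message overhead once per outer tuple in each direction.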
Hill-Climbing Algorithm/1
- Hill-Climbing query optimization algorithm
  - Refinements of an initial feasible solution are recursively computed until no more cost improvements can be made
  - Semijoins, data replication, and fragmentation are not used
  - Devised for wide area point-to-point networks
  - The first distributed query processing algorithm
Hill-Climbing Algorithm/2
- The hill-climbing algorithm proceeds as follows
  1. Select initial feasible execution strategy ES0
     - i.e., a global execution schedule that includes all intersite communication
     - Determine the candidate result sites, where a relation referenced in the query exists
     - Compute the cost of transferring all the other referenced relations to each candidate site
     - ES0 = candidate site with minimum cost
  2. Split ES0 into two strategies: ES1 followed by ES2
     - ES1: send one of the relations involved in the join to the other relation's site
     - ES2: send the join result to the final result site
  3. Replace ES0 with the split schedule which gives
     $cost(ES1) + cost(\text{local join}) + cost(ES2) < cost(ES0)$
  4. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
  5. Check for redundant transmissions in the final plan and eliminate them
Hill-Climbing Algorithm/3
- Example: What are the salaries of engineers who work on the CAD/CAM project?
  $\pi_{SAL}(PAY \bowtie_{TITLE} EMP \bowtie_{ENO} (ASG \bowtie_{PNO} (\sigma_{PNAME = \text{"CAD/CAM"}}(PROJ))))$
- Schemas: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR), PROJ(PNO, PNAME, BUDGET, LOC), PAY(TITLE, SAL)
- Statistics
    Relation  Size  Site
    EMP          8     1
    PAY          4     2
    PROJ         1     3
    ASG         10     4
- Assumptions:
  - Size of relations is defined as their cardinality
  - Minimize total cost
  - Transmission cost between two sites is 1
  - Ignore local processing cost
  - $size(EMP \bowtie PAY) = 8$, $size(PROJ \bowtie ASG) = 2$, $size(ASG \bowtie EMP) = 10$
Hill-Climbing Algorithm/4
- Example (contd.): Determine initial feasible execution strategy
  - Alternative 1: Resulting site is site 1
    $\text{Total cost} = cost(PAY \rightarrow \text{Site 1}) + cost(ASG \rightarrow \text{Site 1}) + cost(PROJ \rightarrow \text{Site 1}) = 4 + 10 + 1 = 15$
  - Alternative 2: Resulting site is site 2
    $\text{Total cost} = 8 + 10 + 1 = 19$
  - Alternative 3: Resulting site is site 3
    $\text{Total cost} = 8 + 4 + 10 = 22$
  - Alternative 4: Resulting site is site 4
    $\text{Total cost} = 8 + 4 + 1 = 13$
- Therefore ES0 = $EMP \rightarrow$ Site 4; $PAY \rightarrow$ Site 4; $PROJ \rightarrow$ Site 4
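Step 1 of the algorithm can be reproduced with a few lines. The sketch uses the statistics of the example (size and site per relation) and the stated assumption that shipping a relation costs its size; the names `RELATIONS` and `initial_strategy` are illustrative.

```python
# Statistics from the example: relation -> (size, site).
RELATIONS = {"EMP": (8, 1), "PAY": (4, 2), "PROJ": (1, 3), "ASG": (10, 4)}


def initial_strategy(relations):
    """For every candidate result site, sum the sizes of all relations
    that would have to be shipped there; pick the cheapest site."""
    sites = sorted({site for _, site in relations.values()})
    costs = {
        site: sum(size for size, home in relations.values() if home != site)
        for site in sites
    }
    best_site = min(costs, key=costs.get)
    return best_site, costs


best_site, costs = initial_strategy(RELATIONS)
print(costs)
print("ES0: ship all other relations to site", best_site)
```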
Hill-Climbing Algorithm/5
- Example (contd.): Candidate split
  - Alternative 1: ES1, ES2, ES3
    - ES1: $EMP \rightarrow$ Site 2
    - ES2: $(EMP \bowtie PAY) \rightarrow$ Site 4
    - ES3: $PROJ \rightarrow$ Site 4
    $\text{Total cost} = cost(EMP \rightarrow \text{Site 2}) + cost((EMP \bowtie PAY) \rightarrow \text{Site 4}) + cost(PROJ \rightarrow \text{Site 4}) = 8 + 8 + 1 = 17$
  - Alternative 2: ES1, ES2, ES3
    - ES1: $PAY \rightarrow$ Site 1
    - ES2: $(PAY \bowtie EMP) \rightarrow$ Site 4
    - ES3: $PROJ \rightarrow$ Site 4
    $\text{Total cost} = cost(PAY \rightarrow \text{Site 1}) + cost((PAY \bowtie EMP) \rightarrow \text{Site 4}) + cost(PROJ \rightarrow \text{Site 4}) = 4 + 8 + 1 = 13$
- Neither alternative is better than ES0, so keep ES0 (or take alternative 2, which has the same cost)
Hill-Climbing Algorithm/6
- Problems
  - Greedy algorithm determines an initial feasible solution and iteratively improves it
  - If there are local minima, it may not find the global minimum
  - An optimal schedule with a high initial cost would not be found, since it won't be chosen as the initial feasible solution
- Example: A better schedule is
  - $PROJ \rightarrow$ Site 4
  - $ASG' = (PROJ \bowtie ASG) \rightarrow$ Site 1
  - $(ASG' \bowtie EMP) \rightarrow$ Site 2
  - $\text{Total cost} = 1 + 2 + 2 = 5$
SDD-1
- The SDD-1 algorithm extends the hill climbing algorithm with semijoins and has the following properties:
  - Considers semijoins
    $cost(R \ltimes_A S) = C_{MSG} + size(\pi_A(S)) \cdot C_{TR}$
    $benefit(R \ltimes_A S) = (1 - SF_{\ltimes}(S.A)) \cdot size(R) \cdot C_{TR}$
  - Does not consider replication and fragmentation
  - Cost of transferring the result to the user site from the final result site is not considered
  - Can minimize either total time or response time
- The SDD-1 algorithm works with and updates a database profile:
    R    size(R)
    R1      1500
    R2      3000
    R3      2000

    A      SF_semijoin  size(pi_A)
    R1.A           0.3          36
    R2.A           0.8         320
    R2.B           1.0         400
    R3.B           0.4          80
SDD-1 Algorithm
- Step 1: Include all local processing in the execution strategy ES.
- Step 2: Update database profile with effects of local processing.
- Step 3: Determine beneficial semijoins, i.e., $cost(\ltimes_i) < benefit(\ltimes_i)$.
- Step 4: Remove the most beneficial semijoin and append it to ES.
- Step 5: Update the database profile.
- Step 6: Update the set of beneficial semijoins; possibly include new ones.
- Step 7: If there are beneficial semijoins, go back to Step 4.
- Step 8: Find the site where the largest amount of data resides and select it as the result site.
- Step 9: For each $R_i$ at the result site, remove semijoins of the form $R_i \ltimes R_j$ where the total cost of ES without this semijoin is smaller than the cost with it.
- Step 10: Permute the order of semijoins if doing so would improve the total cost of ES.
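Steps 3 and 4 can be tried out on the database profile shown for SDD-1. The sketch below makes several assumptions that are not stated on the slides: the join predicates are R1.A = R2.A and R2.B = R3.B (suggested by the attribute names), $C_{MSG} = 0$ and $C_{TR} = 1$, and "most beneficial" is taken to mean the largest difference between benefit and cost.

```python
# Database profile from the slide.
SIZE = {"R1": 1500, "R2": 3000, "R3": 2000}
# attribute -> (semijoin selectivity factor, size of the projection on it)
ATTR = {"R1.A": (0.3, 36), "R2.A": (0.8, 320), "R2.B": (1.0, 400), "R3.B": (0.4, 80)}

# Candidate semijoins as (reduced relation, reducing attribute), assuming the
# join predicates R1.A = R2.A and R2.B = R3.B.
CANDIDATES = [("R2", "R1.A"), ("R2", "R3.B"), ("R1", "R2.A"), ("R3", "R2.B")]

C_MSG, C_TR = 0, 1  # assumed cost constants


def cost(attr):
    """cost = C_MSG + size(pi_A(S)) * C_TR."""
    return C_MSG + ATTR[attr][1] * C_TR


def benefit(rel, attr):
    """benefit = (1 - SF(S.A)) * size(R) * C_TR."""
    return (1 - ATTR[attr][0]) * SIZE[rel] * C_TR


def beneficial_semijoins(candidates):
    """Keep semijoins with cost < benefit, most beneficial (largest net gain) first."""
    kept = [(rel, attr) for rel, attr in candidates if cost(attr) < benefit(rel, attr)]
    return sorted(kept, key=lambda sj: benefit(*sj) - cost(sj[1]), reverse=True)


for rel, attr in beneficial_semijoins(CANDIDATES):
    print(f"{rel} semijoin on {attr}: cost={cost(attr)}, benefit={benefit(rel, attr):.0f}")
```

Under these assumptions only the two semijoins that reduce R2 are beneficial, and the one using R1.A would be appended to ES first. After that, the profile has to be updated before the remaining candidates are evaluated again.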
Conclusion
- Distributed query optimization is more complex than centralized query processing, since
  - bushy query trees are not necessarily a bad choice
  - one needs to decide what, where, and how to ship the relations between the sites
- Query optimization searches for the optimal query plan (tree)
- For N relations, there are $O(N!)$ equivalent join trees. To cope with the complexity, heuristics and/or restricted types of trees are considered.
- There are two main strategies in query optimization: randomized and deterministic.
- Semi-joins can be used to implement a join. The semi-joins require more operations to perform, but the amount of data transferred is reduced.
- INGRES, System R, and Hill Climbing are the query optimization algorithms covered, each with a distributed variant or extension (distributed INGRES, System R*, SDD-1).
Course Project
- Hand in of project: December 23, 2012
- Report
  - problem definition
  - running example
  - description of solution
  - evaluation
  - strengths, weaknesses, limitations
- Report (5 pages) and implementation (source code, data, steps to install and run) as zip/tar file
Course Exam
- Exam date: 16.01.2013
- Exam time: 12:15 - 12:45
- Exam location: BIN 2.E.13
- Exam form and procedure
  - oral, 20 minutes
  - 10 minutes about project (demo, code, algorithm)
  - 10 minutes about a topic of the course
- During exam: present solutions on examples
- Prepare suitable examples beforehand