Today’s topic
Distributed Query Processing
Distributed DBMS Page 7-9. 1
Query Processing
high-level user query → query processor → low-level data manipulation commands
Query Processing Components
Query language that is used
SQL
Query execution methodology
The steps that one goes through in executing high-level user
queries.
Query optimization
How do we determine the “best” execution plan?
Selecting Alternatives
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND DUR > 37
Operations involved: Project, Select, Join
Strategy 1
Π_ENAME(σ_{DUR>37 ∧ EMP.ENO=ASG.ENO}(EMP × ASG))
Strategy 2
Π_ENAME(EMP ⋈_ENO (σ_{DUR>37}(ASG)))
Strategy 2 avoids Cartesian product, so is “better”
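The difference between the two strategies can be made concrete with a small sketch. The sample tuples and the function names `strategy_1` and `strategy_2` below are hypothetical and only illustrate how large the intermediate result is in each case.

```python
from itertools import product

# Hypothetical sample data: EMP(ENO, ENAME) and ASG(ENO, PNO, DUR).
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee"), ("E4", "J. Miller")]
ASG = [("E1", "P1", 12), ("E2", "P1", 24), ("E2", "P2", 48),
       ("E3", "P3", 40), ("E4", "P2", 18)]

def strategy_1(emp, asg):
    """Cartesian product first, then select, then project on ENAME."""
    pairs = list(product(emp, asg))  # EMP x ASG: |EMP| * |ASG| tuples
    selected = [(e, a) for e, a in pairs if a[2] > 37 and e[0] == a[0]]
    return {e[1] for e, _ in selected}, len(pairs)

def strategy_2(emp, asg):
    """Select on ASG first, then join with EMP on ENO, then project on ENAME."""
    long_asg = [a for a in asg if a[2] > 37]  # selection DUR > 37 on ASG
    eno_values = {a[0] for a in long_asg}
    joined = [e for e in emp if e[0] in eno_values]
    return {e[1] for e in joined}, len(long_asg)

names_1, intermediate_1 = strategy_1(EMP, ASG)
names_2, intermediate_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, intermediate_1, intermediate_2)
```

Both functions return the same names, but Strategy 1 materializes every pair of EMP and ASG tuples before filtering, while Strategy 2 only carries the ASG tuples that satisfy DUR > 37 into the join.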
What is the Problem?
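Fragments and their sites
Site 1: ASG1 = σ_{ENO≤“E3”}(ASG)
Site 2: ASG2 = σ_{ENO>“E3”}(ASG)
Site 3: EMP1 = σ_{ENO≤“E3”}(EMP)
Site 4: EMP2 = σ_{ENO>“E3”}(EMP)
Site 5: Result
Strategy 1 (work is done at the sites that store the fragments)
Site 1: ASG1’ = σ_{DUR>37}(ASG1), sent to Site 3
Site 2: ASG2’ = σ_{DUR>37}(ASG2), sent to Site 4
Site 3: EMP1’ = EMP1 ⋈_ENO ASG1’, sent to Site 5
Site 4: EMP2’ = EMP2 ⋈_ENO ASG2’, sent to Site 5
Site 5: result = EMP1’ ∪ EMP2’
Strategy 2 (all fragments are sent to Site 5)
Sites 1 to 4: send ASG1, ASG2, EMP1, EMP2 to Site 5
Site 5: result2 = (EMP1 ∪ EMP2) ⋈_ENO σ_{DUR>37}(ASG1 ∪ ASG2)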
Cost of Alternatives
Assume:
size(EMP) = 400, size(ASG) = 1000
tuple access cost = 1 unit; tuple transfer cost = 10 units
Strategy 1
produce ASG': (10+10) × tuple access cost = 20
transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost = 200
produce EMP': (10+10) × tuple access cost × 2 = 40
transfer EMP' to result site: (10+10) × tuple transfer cost = 200
Total cost = 460
Strategy 2
transfer EMP to site 5: 400 × tuple transfer cost = 4,000
transfer ASG to site 5: 1000 × tuple transfer cost = 10,000
produce ASG': 1000 × tuple access cost = 1,000
join EMP and ASG': 400 × 20 × tuple access cost = 8,000
Total cost = 23,000
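The arithmetic above can be reproduced with a short sketch. The constant `QUALIFYING_PER_SITE` is an assumption read off the "(10+10)" terms: 10 ASG tuples with DUR > 37 at each of the two ASG sites. The function names are hypothetical.

```python
TUPLE_ACCESS = 1     # cost units per tuple accessed
TUPLE_TRANSFER = 10  # cost units per tuple transferred
SIZE_EMP, SIZE_ASG = 400, 1000
QUALIFYING_PER_SITE = 10  # assumption: qualifying ASG tuples at each ASG site

def strategy_1_cost():
    """Select and join at the fragment sites, then ship the small results."""
    q = 2 * QUALIFYING_PER_SITE           # the (10+10) term
    produce_asg = q * TUPLE_ACCESS        # 20
    transfer_asg = q * TUPLE_TRANSFER     # 200
    produce_emp = q * TUPLE_ACCESS * 2    # 40
    transfer_emp = q * TUPLE_TRANSFER     # 200
    return produce_asg + transfer_asg + produce_emp + transfer_emp

def strategy_2_cost():
    """Ship both relations to site 5 and do all the work there."""
    q = 2 * QUALIFYING_PER_SITE
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER   # 4,000
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER   # 10,000
    produce_asg = SIZE_ASG * TUPLE_ACCESS      # 1,000
    join = SIZE_EMP * q * TUPLE_ACCESS         # 8,000
    return transfer_emp + transfer_asg + produce_asg + join

print(strategy_1_cost(), strategy_2_cost())
```

Under these assumptions the transfer terms account for most of Strategy 2's cost, which is why where the work is done matters as much as the order of the operations.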
Query Optimization Objectives
Minimize a cost function
I/O cost + CPU cost + communication cost
These might have different weights in different distributed
environments
Wide area networks
communication cost will dominate (80 – 200 ms)
low bandwidth
low speed
high protocol overhead
most algorithms ignore all other cost components
Local area networks
communication cost not that dominant (1 – 5 ms)
total cost function should be considered
Complexity of Relational Operations
Assume: relations of cardinality n, sequential scan

Operation                                        Complexity
Select, Project (without duplicate elimination)  O(n)
Project (with duplicate elimination), Group      O(n log n)
Join, Semi-join, Division, Set Operators         O(n log n)
Cartesian Product                                O(n²)
Query Optimization Issues – Types of Optimizers
Exhaustive search
cost-based
optimal
combinatorial complexity in the number of relations
Heuristics
not optimal
regroup common sub-expressions
perform selection, projection first
replace a join by a series of semijoins
reorder operations to reduce intermediate relation size
optimize individual operations
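The combinatorial cost of exhaustive search can be illustrated with a minimal sketch that enumerates every left-deep join order. The cardinalities, the selectivities and the cost model (sum of intermediate result sizes) are all hypothetical assumptions chosen for illustration, not a real optimizer's cost function.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

# Hypothetical cardinalities and join selectivities.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}  # pairs without an entry have no join predicate (Cartesian product)

def plan_cost(order):
    """Assumed cost model: sum of intermediate result cardinalities."""
    joined = [order[0]]
    size = Fraction(CARD[order[0]])
    cost = Fraction(0)
    for rel in order[1:]:
        sel = Fraction(1)
        for prev in joined:
            sel *= SEL.get(frozenset((prev, rel)), Fraction(1))
        size = size * CARD[rel] * sel
        cost += size
        joined.append(rel)
    return cost

def exhaustive_search(relations):
    """Cost every permutation and keep the cheapest one."""
    return min(permutations(relations), key=plan_cost)

best = exhaustive_search(["EMP", "ASG", "PROJ"])
print(best, plan_cost(best), factorial(len(CARD)))
```

With n relations the search costs n! orders, which is why heuristics such as "avoid Cartesian products" are used to prune the space: here the orders that start with EMP and PROJ are far more expensive than the rest.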
Query Optimization Issues – Optimization Granularity
Single query at a time
cannot use common intermediate results
Multiple queries at a time
efficient if many similar queries
decision space is much larger
Query Optimization Issues – Optimization Timing
Static
compilation ⇒ optimize prior to the execution
difficult to estimate the size of the intermediate results ⇒ error propagation
can amortize over many executions
Dynamic
run time optimization
exact information on the intermediate relation sizes
have to reoptimize for multiple executions
Hybrid
compile using a static algorithm
if the error in estimated sizes > threshold, reoptimize at run time
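The hybrid rule can be sketched as a simple check performed at run time. The function name, the use of relative error and the default threshold of 0.5 are assumptions for illustration; the slides do not specify how the error is measured.

```python
def needs_reoptimization(estimated_size, actual_size, threshold=0.5):
    """Return True when the relative estimation error exceeds the threshold.

    The relative-error measure and the default threshold are assumptions.
    """
    if estimated_size <= 0:
        # No usable estimate: reoptimize as soon as any tuples show up.
        return actual_size > 0
    relative_error = abs(actual_size - estimated_size) / estimated_size
    return relative_error > threshold

# The static plan assumed 1,000 tuples in an intermediate relation.
print(needs_reoptimization(1000, 1200))  # small error: keep the static plan
print(needs_reoptimization(1000, 5000))  # large error: reoptimize at run time
```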
Query Optimization Issues – Statistics
Relation
cardinality
size of a tuple
fraction of tuples participating in a join with another relation
Common assumptions
independence between different attribute values
uniform distribution of attribute values within their domain
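A minimal sketch of how the two common assumptions turn statistics into size estimates. The function names and the numbers of distinct values are hypothetical.

```python
def equality_selectivity(distinct_values):
    """Uniformity assumption: every distinct value is equally likely."""
    return 1 / distinct_values

def conjunction_selectivity(*selectivities):
    """Independence assumption: selectivities on different attributes multiply."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def estimated_cardinality(cardinality, selectivity):
    """Expected number of tuples that satisfy the predicate."""
    return cardinality * selectivity

# Hypothetical statistics: ASG has 1,000 tuples, DUR has 50 distinct
# values and PNO has 20 distinct values.
sel = conjunction_selectivity(equality_selectivity(50), equality_selectivity(20))
print(estimated_cardinality(1000, sel))  # estimate for DUR = c1 AND PNO = c2
```

When attribute values are correlated or skewed, these estimates can be far off, which is one source of the error propagation that static optimization suffers from.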
Query Optimization Issues – Decision Sites
Centralized
single site determines the “best” schedule
simple
need knowledge about the entire distributed database
Distributed
cooperation among sites to determine the schedule
need only local information
cost of cooperation
Hybrid
one site determines the global schedule
each site optimizes the local subqueries
Query Optimization Issues – Network Topology
Wide area networks (WAN) – point-to-point
characteristics
low bandwidth
low speed
high protocol overhead
communication cost will dominate; ignore all other cost
factors
global schedule to minimize communication cost
local schedules according to centralized query optimization
Local area networks (LAN)
communication cost not that dominant
total cost function should be considered
broadcasting can be exploited (joins)
special algorithms exist for star networks
Query Optimization Issues – Replicated Fragments
Process of localization
Distributed queries expressed on global relations are
mapped into queries on physical fragments of relations by
translating relations into fragments
Query Optimization Issues – Use of Semijoins
Reduce the size of the operand relations
Size of data exchanged between sites is reduced
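The reduction can be sketched as follows. The sample tuples and the helper names `semijoin` and `join` are hypothetical; the point is that only the join attribute values of ASG need to travel to EMP's site, and only the matching EMP tuples travel back.

```python
# Hypothetical sample data; EMP is stored at one site and ASG at another.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
    {"ENO": "E4", "ENAME": "J. Miller"},
]
ASG = [
    {"ENO": "E1", "PNO": "P1", "DUR": 12},
    {"ENO": "E2", "PNO": "P1", "DUR": 24},
    {"ENO": "E2", "PNO": "P2", "DUR": 6},
]

def semijoin(r, s, attr):
    """R semijoin S on attr: the tuples of R that have a match in S."""
    s_values = {t[attr] for t in s}  # only this projection is shipped
    return [t for t in r if t[attr] in s_values]

def join(r, s, attr):
    """Plain nested-loop equijoin on attr."""
    return [{**tr, **ts} for tr in r for ts in s if tr[attr] == ts[attr]]

emp_reduced = semijoin(EMP, ASG, "ENO")       # computed at EMP's site
full = join(EMP, ASG, "ENO")                  # ships all of EMP
via_semijoin = join(emp_reduced, ASG, "ENO")  # ships only matching EMP tuples
print(len(EMP), len(emp_reduced), full == via_semijoin)
```

The semijoin pays for one extra transfer (the projected join column), so it helps only when the reduction in shipped tuples outweighs that cost.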
Distributed Query Processing Methodology
Input: Calculus Query on Distributed Relations
CONTROL SITE
1. Query Decomposition (uses GLOBAL SCHEMA)
   output: Algebraic Query on Distributed Relations
2. Data Localization (uses FRAGMENT SCHEMA)
   output: Fragment Query
3. Global Optimization (uses STATS ON FRAGMENTS)
   output: Optimized Fragment Query with Communication Operations
LOCAL SITES
4. Local Optimization (uses LOCAL SCHEMAS)
   output: Optimized Local Queries
Query Decomposition
Decomposes a calculus query into an algebraic query on
global relations in the following steps:
Calculus query is rewritten in normalized form
Normalized query is analyzed semantically
Correct query is simplified
Calculus query is restructured as an algebraic query
Data Localization
Input is an algebraic query on distributed relations
Determines which fragments are involved in the query
and transforms a distributed query into a fragment
query
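A minimal sketch of the fragment-selection step, assuming horizontal fragments of ASG defined by a predicate on ENO (ENO ≤ "E3" and ENO > "E3"). The dictionary `FRAGMENTS` and the function `relevant_fragments` are hypothetical names, and the string comparison assumes single-digit employee numbers.

```python
# Fragment definitions: each fragment is described by a predicate on ENO.
# Assumption: ENO values are "E1".."E9", so string comparison orders them.
FRAGMENTS = {
    "ASG1": lambda eno: eno <= "E3",
    "ASG2": lambda eno: eno > "E3",
}

def relevant_fragments(eno_value):
    """Fragments that can contain tuples with ENO = eno_value."""
    return [name for name, holds in FRAGMENTS.items() if holds(eno_value)]

# A selection ENO = "E5" on ASG only needs to be evaluated on ASG2;
# the subquery on ASG1 would return nothing and can be dropped.
print(relevant_fragments("E5"))
print(relevant_fragments("E2"))
```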
Global Query Optimization
Find an optimal execution strategy
Find best ordering of operations to minimize the cost
Local Query Optimization
Done at all sites having fragments involved
Each sub-query is optimized using local schema of the
site
Uses algorithms of centralized systems
Restructuring
Convert relational calculus to relational algebra
Make use of query trees
Example
Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.
SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME ≠ “J. Doe”
AND PNAME = “CAD/CAM”
AND (DUR = 12 OR DUR = 24)
Query tree (root first)
Π_ENAME                          Project
  σ_{DUR=12 OR DUR=24}           Select
    σ_{PNAME=“CAD/CAM”}          Select
      σ_{ENAME≠“J. DOE”}         Select
        ⋈_PNO                    Join
          PROJ
          ⋈_ENO                  Join
            ASG
            EMP
Example
Recall the previous example:
Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO=EMP.ENO
AND ASG.PNO=PROJ.PNO
AND ENAME≠“J. Doe”
AND PROJ.PNAME=“CAD/CAM”
AND (DUR=12 OR DUR=24)
Query tree, written as an expression:
Π_ENAME(σ_{DUR=12 OR DUR=24}(σ_{PNAME=“CAD/CAM”}(σ_{ENAME≠“J. DOE”}(PROJ ⋈_PNO (ASG ⋈_ENO EMP)))))
Equivalent Query
Π_ENAME(σ_{PNAME=“CAD/CAM” ∧ (DUR=12 ∨ DUR=24) ∧ ENAME≠“J. DOE”}(⋈_{PNO,ENO}(ASG, PROJ, EMP)))
Restructuring
Π_ENAME
  ⋈_PNO
    Π_PNO(σ_{PNAME = "CAD/CAM"}(PROJ))
    Π_{PNO,ENAME}
      ⋈_ENO
        Π_{PNO,ENO}(σ_{DUR = 12 ∨ DUR = 24}(ASG))
        Π_{ENO,ENAME}(σ_{ENAME ≠ "J. Doe"}(EMP))
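That pushing selections and projections toward the leaves preserves the result can be checked on a small instance. The sample tuples and the function names are hypothetical; relations are modelled as EMP(ENO, ENAME), ASG(ENO, PNO, DUR) and PROJ(PNO, PNAME).

```python
# Hypothetical sample data.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]
ASG = [("E1", "P1", 12), ("E2", "P1", 24), ("E3", "P1", 36), ("E2", "P2", 12)]
PROJ = [("P1", "CAD/CAM"), ("P2", "Maintenance")]

def original_tree():
    """Join all three relations, then apply every selection, then project."""
    joined = [(e, a, p) for e in EMP for a in ASG for p in PROJ
              if e[0] == a[0] and a[1] == p[0]]
    return {e[1] for e, a, p in joined
            if e[1] != "J. Doe" and p[1] == "CAD/CAM" and a[2] in (12, 24)}

def restructured_tree():
    """Select and project at the leaves, then join the reduced relations."""
    proj_pno = {p[0] for p in PROJ if p[1] == "CAD/CAM"}          # PNO only
    asg_small = {(a[1], a[0]) for a in ASG if a[2] in (12, 24)}   # (PNO, ENO)
    emp_small = {(e[0], e[1]) for e in EMP if e[1] != "J. Doe"}   # (ENO, ENAME)
    pno_ename = {(pno, ename)
                 for pno, eno in asg_small
                 for emp_eno, ename in emp_small if eno == emp_eno}
    return {ename for pno, ename in pno_ename if pno in proj_pno}

print(original_tree(), restructured_tree())
```

Both trees return the same set of names, but the restructured version joins smaller inputs because each relation is filtered and trimmed to the needed attributes first.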
Cost Functions
Total Time (or Total Cost)
Reduce each cost (in terms of time) component individually
Do as little of each cost component as possible
Optimizes the utilization of the resources
Response Time
Do as many things as possible in parallel
May increase total time because of increased total activity
Total Cost
Summation of all cost factors
Total cost = CPU cost + I/O cost + communication cost
CPU cost = unit instruction cost × no. of instructions
I/O cost = unit disk I/O cost × no. of disk I/Os
communication cost = message initiation + transmission
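The summation can be written as a small function. The parameter names and the sample unit costs are hypothetical, and splitting communication cost into a per-message initiation term and a per-byte transmission term is an assumption about how "message initiation + transmission" is counted.

```python
def total_cost(n_instructions, n_disk_ios, n_messages, n_bytes,
               unit_cpu, unit_io, unit_msg, unit_tr):
    """Total cost = CPU cost + I/O cost + communication cost."""
    cpu_cost = unit_cpu * n_instructions
    io_cost = unit_io * n_disk_ios
    # Assumption: initiation is paid per message, transmission per byte.
    communication_cost = unit_msg * n_messages + unit_tr * n_bytes
    return cpu_cost + io_cost + communication_cost

# Hypothetical workload and unit costs.
cost = total_cost(n_instructions=1000, n_disk_ios=50, n_messages=2, n_bytes=400,
                  unit_cpu=1, unit_io=10, unit_msg=100, unit_tr=1)
print(cost)
```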
Total Cost Factors
Wide area network
message initiation and transmission costs high
local processing cost is low
ratio of communication to I/O costs = 20:1
Local area networks
communication and local processing costs are more or less
equal
ratio = 1:1.6
Response Time
Elapsed time between the initiation and the completion of a query
Response time = CPU time + I/O time + communication time
CPU time = unit instruction time × no. of sequential instructions
I/O time = unit I/O time × no. of sequential I/Os
communication time = unit msg initiation time × no. of sequential msgs + unit transmission time × no. of sequential bytes
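The word "sequential" is what separates response time from total time. A minimal sketch, counting communication only and assuming two sites each send one message to a third site in parallel; the function names and unit times are hypothetical.

```python
def transfer_time(n_bytes, unit_msg, unit_tr):
    """Time for one message: initiation plus transmission of its bytes."""
    return unit_msg + unit_tr * n_bytes

def total_time(transfers, unit_msg, unit_tr):
    """Total time adds up every transfer, whether or not they overlap."""
    return sum(transfer_time(b, unit_msg, unit_tr) for b in transfers)

def response_time(transfers, unit_msg, unit_tr):
    """Parallel transfers: only the longest one is on the sequential path."""
    return max(transfer_time(b, unit_msg, unit_tr) for b in transfers)

# Hypothetical case: site 1 sends 100 bytes and site 2 sends 300 bytes
# to site 3 at the same time.
sizes = [100, 300]
print(total_time(sizes, unit_msg=5, unit_tr=1))
print(response_time(sizes, unit_msg=5, unit_tr=1))
```

Minimizing response time therefore favors plans with more parallel activity, even when that raises the total time.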