Ch-2 (B) Overview of Query Processing
2.2 Query Processing
The activities involved in retrieving data from the database.
In declarative languages such as SQL, which are suitable for
human use but ill suited as the system's internal
representation of a query:
The user specifies what data is required rather than how it is
to be retrieved.
This gives the responsibility for selecting the best strategy to
the DBMS,
Which prevents users from choosing strategies that are
known to be inefficient
And gives the DBMS more control over system performance
Cont…
The aims of query processing
To transform a query written in a high-level language into
a low-level language (implementing the relational algebra)
To determine the strategy that is the most cost-effective and
efficient.
To execute the strategy to retrieve the required data.
[Figure: Phases of query processing. At compile time, Query Decomposition (using the system catalog) produces a relational algebra expression (an intermediate form of the query); Query Optimization (using database statistics) produces an execution plan; Code Generation produces the generated code. At runtime, Query Execution runs the generated code against the main databases and returns the query output.]
2.2.1 Query Decomposition
The aims of query decomposition
To transform a high-level query into a relational algebra
query.
To check that the query is syntactically and semantically
correct.
The typical stages of query decomposition are:
Analysis
Normalization
Semantic analysis
Simplification and
Query restructuring
1) Analysis
In this stage,
The query is lexically and syntactically analyzed using the
techniques of programming language compilers.
Verifies that the relations and attributes specified in the
query are defined in the system catalog.
Verifies that any operations applied to database objects are
appropriate for the object type.
Checks that the operations on attributes do not conflict
with the types of the attributes, e.g., a comparison >
operation applied to an attribute of type string.
Transforms the query into some internal representation
1) Analysis
Example:
Table Name: Staff
StaffNo fName lName Position Sex DOB Salary branchNo
Cont…
On completion of this stage, the high-level query has been
transformed into some internal representation that is more
suitable for processing.
The internal form that is typically chosen is some kind of tree,
which is constructed as follows:
A leaf node is created for each base relation in the query
A non-leaf node is created for each intermediate relation
produced by a relational algebra operation.
The root of the tree represents the result of the query
The sequence of operations is derived from the leaves to the
root
Cont…
Example: Find all managers who work at a London branch
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND (s.position = 'Manager' AND b.city = 'London');
[Figure: Example relational algebra tree. The leaves are the base relations Staff and Branch; the root is the join ⋈s.branchNo=b.branchNo. After restructuring, the tree corresponds to σposition='Manager'(Staff) ⋈Staff.branchNo=Branch.branchNo σcity='London'(Branch).]
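To make the tree representation concrete, here is a minimal sketch in Python (the Node class and field names are illustrative, not any DBMS's actual internals) that builds the tree for the example query: leaf nodes for the base relations, non-leaf nodes for the selections and the join, and the root representing the result.

    # A minimal sketch of the internal tree representation described above.
    # Node names and fields are illustrative, not a particular DBMS's API.
    class Node:
        def __init__(self, op, children=(), detail=None):
            self.op = op                  # 'relation', 'select', 'join', ...
            self.children = list(children)
            self.detail = detail          # relation name or predicate text

    # Leaf nodes: one per base relation in the query
    staff  = Node('relation', detail='Staff')
    branch = Node('relation', detail='Branch')

    # Non-leaf nodes: one per relational algebra operation
    sel_staff  = Node('select', [staff],  "position = 'Manager'")
    sel_branch = Node('select', [branch], "city = 'London'")

    # Root: the join whose result is the answer to the query
    root = Node('join', [sel_staff, sel_branch],
                "Staff.branchNo = Branch.branchNo")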
2) Normalization
Converts the query into a normalized form that can be more
easily manipulated.
There are two different normal forms: conjunctive normal
form and disjunctive normal form.
Conjunctive normal form
A sequence of conjuncts that are connected with the
∧ (AND) operator.
Each conjunct contains one or more terms connected by the
∨ (OR) operator.
(p11 ∨ p12 ∨ ··· ∨ p1n) ∧ ··· ∧ (pm1 ∨ pm2 ∨ ··· ∨ pmn)
A conjunctive selection contains only those tuples that satisfy all
conjuncts
Cont…
Disjunctive normal form
A sequence of disjuncts that are connected with the ∨ (OR)
operator.
Each disjunct contains one or more terms connected by the
∧ (AND) operator.
(p11 ∧ p12 ∧ ··· ∧ p1n) ∨ ··· ∨ (pm1 ∧ pm2 ∧ ··· ∧ pmn)
A disjunctive selection contains those tuples formed by the union
of all tuples that satisfy the disjuncts.
Example:
(position='Manager' ∨ salary > 20000) ∧ branchNo = 'B003'    (conjunctive normal form)
(position='Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')    (disjunctive normal form)
Cont…
Example: Consider the following query:
Find the names of employees who have been working on
project P1 for 12 or 24 months.
The query in SQL:
SELECT ENAME FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND ASG.PNO = 'P1' AND DUR = 12 OR DUR = 24
The qualification in conjunctive normal form:
EMP.ENO=ASG.ENO ∧ ASG.PNO='P1' ∧ (DUR=12 ∨ DUR=24)
The qualification in disjunctive normal form:
(EMP.ENO=ASG.ENO ∧ ASG.PNO='P1' ∧ DUR=12) ∨
(EMP.ENO=ASG.ENO ∧ ASG.PNO='P1' ∧ DUR=24)
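As a rough illustration of normalization (assuming the sympy library is available; it is not part of these slides), the sketch below models each atomic predicate of the EMP/ASG example as a Boolean symbol and converts the qualification into both normal forms:

    # Sketch: normalizing a qualification with sympy's Boolean algebra helpers.
    from sympy import symbols
    from sympy.logic.boolalg import to_cnf, to_dnf

    # Symbols standing for the atomic predicates of the example query
    join_pred, pno_p1, dur_12, dur_24 = symbols('join_pred pno_p1 dur_12 dur_24')

    # The intended meaning of "for 12 or 24 months" groups the DUR terms
    qualification = join_pred & pno_p1 & (dur_12 | dur_24)

    print(to_cnf(qualification))  # conjunction of disjuncts (conjunctive normal form)
    print(to_dnf(qualification))  # disjunction of conjuncts (disjunctive normal form)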
3) Semantic Analysis
Applied to normalized queries
Rejects contradictory queries:
Qualification condition cannot be satisfied by any tuple
Rejects incorrectly formulated queries:
Condition components do not contribute to generation of the
result.
A query is contradictory if its predicate cannot be satisfied
by any tuple.
Example:
(position='Manager' ∧ position='Assistant') ∨ salary > 20000
The first disjunct is contradictory (no tuple can have both positions), so it is
equivalent to False ∨ salary > 20000, which could be simplified to (salary > 20000).
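A small sketch of the contradiction check, again assuming sympy is available: if a conjunct (together with the obvious domain constraint that a tuple has only one position) is unsatisfiable, it is contradictory and can be replaced by False.

    # Sketch: detecting a contradictory conjunct with a satisfiability check.
    from sympy import symbols
    from sympy.logic.inference import satisfiable

    is_manager, is_assistant, high_salary = symbols('is_manager is_assistant high_salary')

    # position = 'Manager' and position = 'Assistant' cannot both hold;
    # that domain constraint is encoded explicitly as mutual exclusion.
    conjunct  = is_manager & is_assistant
    exclusion = ~(is_manager & is_assistant)

    if not satisfiable(conjunct & exclusion):
        print("contradictory conjunct: replace it with False")

    # The full qualification, False OR high_salary, then reduces to high_salary.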
4) Simplification
The objectives of the simplification stage are:
To detect redundant qualifications,
To eliminate common subexpressions, and
To transform the query to a semantically equivalent but
more easily and efficiently computed form.
Typically:
Access restrictions,
View definitions, and
Integrity constraints are considered at this stage, some of
which may also introduce redundancy.
If the user does not have the appropriate access to all the
components of the query, the query must be rejected.
Cont…
View definition
CREATE VIEW Staff3 AS SELECT staffNo, fName, lName,
salary, branchNo FROM Staff WHERE branchNo=‘B003’;
Cont…
Assuming that the user has the appropriate access privileges, an
initial optimization is to apply the well-known idempotency rules of
Boolean algebra, such as:
● p ∧ p = p        ● p ∧ false = false    ● p ∧ true = p
● p ∧ ¬p = false   ● p ∧ (p ∨ q) = p      ● p ∨ p = p
● p ∨ false = p    ● p ∨ true = true
● p ∨ ¬p = true    ● p ∨ (p ∧ q) = p
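These are exactly the rules a Boolean simplifier applies; a quick sketch, assuming sympy is available:

    # Sketch: the rules above applied automatically by a Boolean simplifier.
    from sympy import symbols
    from sympy.logic.boolalg import simplify_logic

    p, q = symbols('p q')

    print(simplify_logic(p & (p | q)))   # p        (p AND (p OR q) = p)
    print(simplify_logic(p | (p & q)))   # p        (p OR (p AND q) = p)
    print(simplify_logic(p & ~p))        # False    (p AND NOT p = false)
    print(simplify_logic(p | ~p))        # True     (p OR NOT p = true)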
Generally, in view resolution:
The view select-list is translated into the corresponding select-list of the
view-defining query,
The from-list of the query is modified to hold the names of the base
tables,
Qualifications from the WHERE clauses are combined, and
GROUP BY and HAVING clauses are modified.
5) Query restructuring
The final stage of query decomposition
The query is restructured to provide a more efficient
implementation.
Rewriting a query using relational algebra operators
Modifying the relational algebra expression to provide a more
efficient implementation
Heuristic approach: uses transformation rules to convert
one relational algebra expression into an equivalent form that
is known to be more efficient.
Example:
Apply the 'Select' before the 'Join' when finding managers working at a London branch.
Transformation Rules
Used in restructuring the query
In listing these rules, we use three relations R, S, and T, with
R defined over the attributes A = {A1, A2, …, An},
S defined over the attributes B = {B1, B2, …, Bn},
p, q, and r denoting predicates, and
L, L1, L2, M, M1, M2, and N denoting sets of attributes.
1. Conjunctive selection operations can cascade into individual
selection operations (and vice versa)
σp∧q∧r(R) = σp(σq(σr(R)))
Example
σbranchNo='B003' ∧ salary>15000(Staff) = σbranchNo='B003'(σsalary>15000(Staff))
Cont…
2. Commutativity of selection operations
σp(σq(R)) = σq(σp(R))
Example
σbranchNo='B003'(σsalary>15000(Staff)) = σsalary>15000(σbranchNo='B003'(Staff))
3. In a sequence of projection operations, only the last in the sequence
is required.
∏L∏M ··· ∏N(R) = ∏L(R)
Example
∏lName(∏branchNo,lName(Staff)) = ∏lName(Staff)
Cont…
4. Commutativity of selection and projection
If the predicate p involves only the attributes in the projection list,
then the selection and projection operations commute:
∏A1,…,Am(σp(R)) = σp(∏A1,…,Am(R))
Example
∏fName,lName(σlName='Beech'(Staff)) = σlName='Beech'(∏fName,lName(Staff))
5. Commutativity of Theta join (and Cartesian product)
R ⋈p S = S ⋈p R        R × S = S × R
Example:
σposition='Manager'(Staff) ⋈Staff.branchNo=Branch.branchNo σcity='London'(Branch)
= σcity='London'(Branch) ⋈Staff.branchNo=Branch.branchNo σposition='Manager'(Staff)
Cont…
10. Commutativity of projection and union
∏L(R ∪ S) = ∏L(R) ∪ ∏L(S)
11. Associativity of Theta join (and Cartesian product)
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
(R × S) × T = R × (S × T)
12. Associativity of union and intersection (but not set difference)
(R ∪ S) ∪ T = R ∪ (S ∪ T)
(R ∩ S) ∩ T = R ∩ (S ∩ T)
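As a small self-contained check of rules 1 and 4 (a sketch over made-up sample tuples, not tied to any DBMS), selections and projections over Python dictionaries behave exactly as the rules state:

    # Sketch: checking transformation rules 1 and 4 on a toy Staff relation.
    staff = [
        {'staffNo': 'SG5',  'lName': 'Brand', 'position': 'Manager',   'salary': 24000, 'branchNo': 'B003'},
        {'staffNo': 'SG37', 'lName': 'Beech', 'position': 'Assistant', 'salary': 12000, 'branchNo': 'B003'},
        {'staffNo': 'SL21', 'lName': 'White', 'position': 'Manager',   'salary': 30000, 'branchNo': 'B005'},
    ]

    def select(pred, rel):            # sigma
        return [t for t in rel if pred(t)]

    def project(attrs, rel):          # pi (duplicates ignored for simplicity)
        return [{a: t[a] for a in attrs} for t in rel]

    # Rule 1: a conjunctive selection equals a cascade of selections
    assert select(lambda t: t['branchNo'] == 'B003' and t['salary'] > 15000, staff) == \
           select(lambda t: t['branchNo'] == 'B003',
                  select(lambda t: t['salary'] > 15000, staff))

    # Rule 4: selection and projection commute when the predicate
    # uses only projected attributes
    assert project(['lName'], select(lambda t: t['lName'] == 'Beech', staff)) == \
           select(lambda t: t['lName'] == 'Beech', project(['lName'], staff))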
Cont…
Example: To justify the importance of query restructuring
Find all managers who work at a London branch
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND (s.position = 'Manager' AND b.city = 'London');
The equivalent relational algebra queries corresponding to this SQL
statement are:
1. σ(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)
2. σ(position='Manager') ∧ (city='London')(Staff ⋈Staff.branchNo=Branch.branchNo Branch)
3. σposition='Manager'(Staff) ⋈Staff.branchNo=Branch.branchNo σcity='London'(Branch)
Cont…
Assume there are:
1000 tuples in Staff and 50 tuples in Branch,
50 managers (one for each branch), and
5 London branches.
We compare these queries based on the number of disk
accesses required.
Assume there are no indexes or sort keys on either relation
and that the results of any intermediate operations are stored on
disk.
Assume tuples are accessed one at a time, and
Main memory is large enough to process entire relations for
each relational algebra operation.
Cont…
The first query calculates the Cartesian product of Staff and
Branch, which requires:
(1000 + 50) disk accesses to read the relations,
(1000 * 50) disk accesses to write the intermediate relation of
(1000 * 50) tuples that is the result of the Cartesian product, and
Another (1000 * 50) disk accesses to read each of
these tuples again to test them against the selection
predicate,
Giving a total cost of (1000 + 50) + 2*(1000*50) = 101,050
disk accesses.
Cont…
The second query joins Staff and Branch on the branch
number branchNo, which requires:
(1000 + 50) disk accesses to read each of the relations.
The join of the two relations has 1000 tuples, one for each
member of staff (a member of staff can only work at one
branch), so writing this intermediate result requires 1000 disk accesses.
The selection operation then requires 1000 disk accesses to read
the result of the join back,
Giving a total cost of 2*1000 + (1000 + 50) = 3050 disk
accesses.
The final query
First reads each Staff tuple to determine the manager tuples,
which requires 1000 disk accesses and produces a relation
with 50 tuples.
The second selection operation reads each Branch tuple to
determine the London branches, which requires 50 disk accesses
and produces a relation with 5 tuples.
Writing these two intermediate relations requires (50 + 5) disk accesses,
and the final join of the reduced Staff and Branch relations requires
another (50 + 5) disk accesses to read them,
Giving a total cost of 1000 + 50 + 2*(50 + 5) = 1160 disk
accesses.
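The arithmetic behind the three totals can be reproduced with a short sketch (the variable names are illustrative); it simply applies the stated assumptions about reading, writing, and re-reading intermediate results:

    # Sketch: disk-access estimates for the three equivalent strategies.
    n_staff, n_branch = 1000, 50      # tuples per relation
    n_managers, n_london = 50, 5      # tuples surviving each selection

    # 1. Cartesian product, then selection
    cart = n_staff * n_branch
    cost1 = (n_staff + n_branch) + cart + cart          # read, write product, re-read
    # 2. Join, then selection (join has one tuple per staff member)
    cost2 = (n_staff + n_branch) + n_staff + n_staff    # read, write join, re-read
    # 3. Selections first, then join of the reduced relations
    cost3 = (n_staff + n_branch) + (n_managers + n_london) + (n_managers + n_london)

    print(cost1, cost2, cost3)        # 101050 3050 1160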
Cont…
Exercise 4:
Find all books that have a price greater than 300 and whose
authors reside in New York.
Assume there are 100 tuples in the Catalog table and 50 tuples in the
Author table.
There are 40 books with a price greater than 300 in the
Catalog table and 10 authors who reside in New York in the
Author table.
Cont…
Further assume that there are no indexes or sort keys on either
relation and that the results of any intermediate operations are
stored on disk.
Also assume tuples are accessed one at a time, and
Main memory is large enough to process entire relations for
each relational algebra operation.
2.2.2 Query Optimization
The activity of choosing an efficient execution strategy for
processing a query.
An important aspect of query processing is query optimization.
As there are many equivalent transformations of the same
high-level query, the aim of query optimization is to choose the one
that minimizes resource usage.
Generally, the optimization criteria are:
Reduce the total execution time of the query:
Minimize the sum of the execution times of all the individual
operations that make up the query
Reduce the number of disk accesses
Reduce the response time of the query:
Ensure good use of resources
Maximize parallel operations (pipelining)
Cont…
Both methods (criteria) of query optimization depend on
database statistics to properly evaluate the different options
that are available.
The accuracy and currency of these statistics have a
significant bearing on the efficiency of the execution strategy
chosen.
The statistics cover information about relations, attributes and
indexes.
Example: The system catalog may store statistics
Giving the cardinality of relations
The number of distinct values for each attribute and
The number of levels in a multilevel index
Cont…
Keeping the statistics current can be problematic:
If the DBMS updates the statistics every time a tuple is
inserted, updated, or deleted, this would have a significant
impact on performance during peak periods.
An alternative approach is to update the statistics on a periodic
basis, for example nightly, or whenever the system is idle.
Various DBMS implementations have used different
optimization techniques to obtain efficient execution plans.
Some of the techniques are
Syntactical Optimization
Semantic Optimization
Heuristic (Rule-based) Optimization
Cost-based Optimization
Dynamic and Static query optimization
1) Dynamic query optimization
Query decomposition and optimization are carried out every time
the query is run.
Advantage: all information required to select an optimum
strategy is up to date.
Disadvantages:
The performance of the query is affected because the
query has to be parsed, validated, and optimized before it
can be executed.
In some cases, the number of execution strategies analyzed
needs to be reduced to keep the overhead costs
within acceptable limits, which may result in a strategy that is
not the best being selected, or the best strategy being
left out.
Cont…
2) Static query optimization
The query is parsed, validated, and optimized once, in an approach
similar to that taken by a compiler for a programming language.
The DBMS can analyze a large number of alternative strategies before
selecting the optimum strategy.
More suitable for queries that are executed frequently.
Advantages
The runtime overhead is removed
More time available to evaluate a larger number of execution
strategies.
Disadvantage:
The execution strategy that is chosen as being optimal when
the query is compiled may no longer be optimal when the
query is run.
A) Syntactical Optimization
Relies on the user's understanding of both the underlying
database schema and the distribution of the data within the
tables.
Tables are joined in the original order specified by the user.
Can be extremely efficient when accessing data in a relatively
static environment.
The drawbacks of this technique are:
It is up to the user to find the more efficient method of accessing
the data.
When queries change dynamically (e.g., embedded queries), they
need to be recompiled to improve their data access
performance.
B) Semantic optimization
Operates on the premise that the optimizer has a basic
understanding of the actual database schema.
C) Heuristic query optimization
Heuristic: problem-solving by experimental methods
Applying general rules to choose the most appropriate
internal query representation
Based on transformation rules for relational algebra operators
Used by most DBMSs to determine the best strategies.
Heuristic rules include:
Performing selections and projections as early as possible,
Computing common expressions only once and storing the result,
Combining a Cartesian product with a subsequent selection
whose predicate represents a join condition into a join
operation, and
Using associativity of binary operations to rearrange leaf
nodes so that the leaf nodes with the most restrictive
selections are executed first.
D) Cost-based query optimization
A method of optimizing the query by choosing the strategy that
results in the minimum cost.
The optimizer needs specific information about the stored data.
This information is system dependent and may include:
File sizes,
File structure types,
Available primary and secondary indexes, and
Attribute selectivity (the percentage of tuples expected to be
retrieved for a given predicate).
Its goal is not to produce the 'optimal' execution plan for
retrieving data, but to provide a reasonable execution plan.
Cont…
The cost of executing a query includes the following
components:
Secondary storage access cost:
The cost of accessing, reading, searching for, and
writing data blocks that reside on secondary storage.
Usually more important than the other components.
Basically, most database systems compare different
execution strategies in terms of the number of block
transfers between secondary storage and main
memory.
Storage cost:
The cost of storing any intermediate files that are
generated by an execution strategy for the query.
Cont…
Computation cost:
The cost of performing in-memory operations on the data
buffers during query execution, for example sorting and
merging records.
Memory usage cost:
The cost pertaining to the number of memory buffers
needed during query execution.
Communication cost:
The cost of communicating the query from the source to the
database and then returning the query results to where the query
originated.
Requires more attention in distributed database systems, where
the communication cost is the most significant.
Cost Estimation and statistics
Cost estimation depends on statistical information held in the
system catalog.
The dominant cost in query processing is usually that of disk
accesses, which are slow compared with memory accesses.
Many of the cost estimates are based on the cardinality of the
relation.
The success of estimating the size and cost of intermediate
relational algebra operations depends on the amount and
currency of the statistical information that the DBMS holds.
If we wish to maintain accurate statistics, then every time a
relation is modified we must also update the statistics.
Cont…
Typically, we would expect a DBMS to hold the following
types of information in its system catalog:
For each base relation R:
nTuples(R) (nR): the number of tuples (records) in
relation R (that is, its cardinality).
sR: the size of a tuple of relation R in bytes.
bFactor(R) (fR): the blocking factor of R (that is, the
number of tuples of R that fit into one block).
nBlocks(R) (bR): the number of blocks required to store
R. If the tuples of R are stored physically together, then:
nBlocks(R) = ⌈nTuples(R) / bFactor(R)⌉, i.e. bR = ⌈nR / fR⌉
Cont…
For each multilevel index I on attribute set A:
nLevelsA(I): the number of levels in I.
nLfBlocksA(I): the number of first-level (leaf) index
blocks in I.
For each attribute A of base relation R:
nDistinctA(R) (V(A, R)): the number of distinct
values that appear in relation R for attribute A,
equal to the size of ∏A(R); if A is a key of relation R, it equals nTuples(R).
minA(R), maxA(R): the minimum and maximum
possible values for attribute A in relation R.
SCA(R): the selection cardinality of attribute A in
relation R, which is the average number of tuples that satisfy
an equality condition on attribute A.
Cont…
If A is a key attribute of R, or the selection condition forces A to take
a single specified value, then SCA(R) = 1; otherwise
SCA(R) = ⌈nTuples(R) / nDistinctA(R)⌉.
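A minimal sketch of how these catalog statistics combine, using the ceiling formulas above (the numbers are made up):

    # Sketch: deriving nBlocks and the selection cardinality SCA from base statistics.
    import math

    n_tuples   = 3000   # nTuples(R)
    b_factor   = 30     # bFactor(R): tuples per block
    n_distinct = 500    # nDistinctA(R) for some attribute A

    n_blocks = math.ceil(n_tuples / b_factor)     # nBlocks(R) = ceil(nR / fR) = 100
    sca      = math.ceil(n_tuples / n_distinct)   # SCA(R) = 6 tuples per equality match
    sca_key  = 1                                  # if A is a key, exactly one tuple matches

    print(n_blocks, sca, sca_key)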
Selection operation (S = σp(R))
The selection operation in the relational algebra works on a
single relation R.
There are a number of different implementations for the
selection operation, depending on the structure of the file in
which the relation is stored, and on whether the attribute(s)
involved in the predicate have been indexed (or hashed).
The costs are given in terms of secondary storage accesses;
other costs, such as computation time and storage cost, are
ignored for the time being, as they are less significant.
The commonly used search algorithms and their associated
costs are discussed in the following slides.
Cont…
The main strategies that we consider are:
(S1) Linear search (unordered file, no index):
Retrieve every record in the file, and test whether its
attribute values satisfy the selection condition.
(S2) Binary search (ordered file, no index):
If the selection condition involves an equality comparison
on a key attribute on which the file is ordered, binary
search (which is more efficient than linear search) can be
used.
(S3) Using a primary index or hash key to retrieve a single
record:
If the selection condition involves an equality comparison
on a key attribute with a primary index (or a hash key),
use the primary index (or the hash key) to retrieve the
record.
Cont…
(S4) Using a primary index to retrieve multiple records:
If the comparison condition is >, ≥, <, or ≤ on a key field
with a primary index, use the index to find the record
satisfying the corresponding equality condition, and then
retrieve all the subsequent (or preceding) records in the ordered file.
(S5) Using a clustering index to retrieve multiple records:
If the selection condition involves an equality comparison
on a non-key attribute with a clustering index, use the
clustering index to retrieve all the records satisfying the
selection condition.
(S6) Using a secondary (B+-tree) index:
On an equality comparison, this search method can be
used to retrieve a single record if the indexing field has
unique values (is a key) or to retrieve multiple records if
the indexing field is not a key.
Can also be used to retrieve records on conditions
involving >, >=, <, or <= (i.e., for range queries).
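To make strategies S1 and S2 concrete, here is a rough sketch of a linear scan versus a binary search over an ordered file, using an in-memory Python list as a stand-in for block-by-block file access:

    # Sketch: linear search (S1) vs binary search on an ordered file (S2).
    import bisect

    def linear_select(records, predicate):
        # S1: examine every record, regardless of file order
        return [r for r in records if predicate(r)]

    def binary_select_eq(sorted_records, key, value):
        # S2: equality on the ordering key; O(log n) probes instead of O(n)
        # (building the keys list is for clarity; a real file probes blocks directly)
        keys = [key(r) for r in sorted_records]
        i = bisect.bisect_left(keys, value)
        out = []
        while i < len(sorted_records) and key(sorted_records[i]) == value:
            out.append(sorted_records[i])
            i += 1
        return out

    staff = sorted([{'staffNo': 'SG37', 'salary': 12000},
                    {'staffNo': 'SG5',  'salary': 24000},
                    {'staffNo': 'SL21', 'salary': 30000}],
                   key=lambda r: r['staffNo'])

    print(linear_select(staff, lambda r: r['salary'] > 15000))
    print(binary_select_eq(staff, lambda r: r['staffNo'], 'SG5'))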
Cont…
(J1) Nested-loop join (brute force):
For each record t of R (the outer loop), retrieve every record
s of S (the inner loop) and test whether the two records satisfy
the join condition t[A] = s[B].
(J3) Sort-merge join:
If the records of R and S are physically sorted on the join
attributes A and B respectively, the records of each file are
scanned only once each for matching with the other file, unless
both A and B are non-key attributes, in which case the method
needs to be modified slightly.
Cont…
(J4) Hash-join:
The records of files R and S are both hashed to the same
hash file, using the same hashing function on the join
attributes A of R and B of S as hash keys.
A single pass through the first file (say R) hashes its records into
the hash file buckets; a single pass through the other file (S) then
hashes each of its records to the appropriate bucket, where the
record is combined with all matching records from R.
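A compact sketch of the hash-join idea just described: build a hash table on R's join attribute, then make a single pass over S probing it (in-memory Python, ignoring the partitioning to disk that a real DBMS would do):

    # Sketch: in-memory hash join of R and S on R.A = S.B.
    from collections import defaultdict

    def hash_join(r_records, s_records, a, b):
        buckets = defaultdict(list)
        for r in r_records:                  # build phase: hash R on attribute A
            buckets[r[a]].append(r)
        result = []
        for s in s_records:                  # probe phase: single pass over S
            for r in buckets.get(s[b], []):  # combine with all matching R records
                result.append({**r, **s})
        return result

    staff  = [{'staffNo': 'SG5', 'branchNo': 'B003'},
              {'staffNo': 'SL21', 'branchNo': 'B005'}]
    branch = [{'branchNo': 'B003', 'city': 'London'}]

    print(hash_join(staff, branch, 'branchNo', 'branchNo'))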
Cost
Block nested-loop join
nBlocks(R) + (nBlocks(R) * nBlocks(S)), if the buffer has
only one block for R and one for S
nBlocks(R) + ⌈nBlocks(S) * (nBlocks(R) / (nBuffer − 2))⌉, if
(nBuffer − 2) blocks are available for R
nBlocks(R) + nBlocks(S), if all blocks of R can be read
into the database buffer
Indexed nested-loop join
nBlocks(R) + nTuples(R) * (nLevelsA(I) + 1), if the join
attribute A in S is the primary key
nBlocks(R) + nTuples(R) * (nLevelsA(I) + ⌈SCA(S) / bFactor(S)⌉),
for a clustering index I on attribute A
Cost…
Sort-merge join
nBlocks(R) * ⌈log2(nBlocks(R))⌉ + nBlocks(S) * ⌈log2(nBlocks(S))⌉, for the sorts
nBlocks(R) + nBlocks(S), for the merge
Hash join
3 * (nBlocks(R) + nBlocks(S)), if the hash index is held in
memory
2 * (nBlocks(R) + nBlocks(S)) * ⌈log_(nBuffer−1)(nBlocks(S)) − 1⌉
+ nBlocks(R) + nBlocks(S), otherwise
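The cost formulas above can be turned into a small calculator; this is only a sketch that follows the slide notation (nBlocks, nBuffer, SCA), with made-up inputs in the example calls:

    # Sketch: estimated block accesses for some of the join strategies listed above.
    import math

    def block_nested_loop(nBlocksR, nBlocksS, nBuffer):
        if nBuffer - 2 >= nBlocksR:                    # all of R fits in the buffer
            return nBlocksR + nBlocksS
        if nBuffer > 2:                                # (nBuffer - 2) blocks for R
            return nBlocksR + math.ceil(nBlocksS * (nBlocksR / (nBuffer - 2)))
        return nBlocksR + nBlocksR * nBlocksS          # one block each for R and S

    def sort_merge(nBlocksR, nBlocksS):
        sorts = (nBlocksR * math.ceil(math.log2(nBlocksR)) +
                 nBlocksS * math.ceil(math.log2(nBlocksS)))
        return sorts + nBlocksR + nBlocksS             # sort both relations, then merge

    def hash_join_cost(nBlocksR, nBlocksS, in_memory=True, nBuffer=100):
        if in_memory:
            return 3 * (nBlocksR + nBlocksS)
        passes = math.ceil(math.log(nBlocksS, nBuffer - 1) - 1)
        return 2 * (nBlocksR + nBlocksS) * passes + nBlocksR + nBlocksS

    print(block_nested_loop(200, 100, nBuffer=12))
    print(sort_merge(200, 100), hash_join_cost(200, 100))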
2.2.3 Materialization and Pipelining
Materialization
The process in which the results of intermediate relational
algebra operations are written temporarily to disk.
Pipelining
Also known as on-the-fly processing.
Used to improve the performance of queries.
In this case the results of one operation are passed to another
operation without creating a temporary relation to hold the
intermediate result,
Saving the cost of creating temporary relations and reading
the results back in.
Cont…
A buffer is created for each pair of adjacent operations to
hold the tuples being passed from the operations to second
one.
One drawback ,with pipelining is that the inputs to operation
are not necessarily available all at once for processing.
Example:
Position=’Manager’ and salary>20000(Staff)
If we assume that there is an index on the salary attribute, we
use the cascade of solution rule to transform this selection
into two operations:
Position=’Manager’(salary>20000(Staff))
63
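Python generators give a rough feel for pipelining: each operation pulls tuples from the one below it on demand, so no temporary relation is materialized between the two selections of the example (a sketch, not how any particular DBMS implements it):

    # Sketch: pipelined (on-the-fly) evaluation of
    # sigma position='Manager'(sigma salary>20000(Staff)) using generators.
    def scan(relation):
        for tuple_ in relation:          # producer: reads the base relation
            yield tuple_

    def select(pred, source):
        for tuple_ in source:            # consumer/producer: no temporary relation
            if pred(tuple_):
                yield tuple_

    staff = [{'staffNo': 'SG5', 'position': 'Manager', 'salary': 24000},
             {'staffNo': 'SG37', 'position': 'Assistant', 'salary': 12000}]

    pipeline = select(lambda t: t['position'] == 'Manager',
                      select(lambda t: t['salary'] > 20000, scan(staff)))

    for t in pipeline:                   # tuples flow through one at a time
        print(t)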
2.3 Query Optimization in Oracle
Oracle supports the two approaches to query optimization
Rule-based and
Cost-based
Cont…
An unbounded range scan, using the index on the rooms
column from the WHERE condition (rooms > 7). This access
path has rank 11.
Cont…
However, Oracle does not gather statistics automatically but
makes it the user's responsibility to generate these statistics
and keep them current, for example:
EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS('Manager');
QUESTIONS
[Figure: Disk structure; a disk page (block) is the same size as a memory page.]
Suppose you were given a chance to visit 10 pre-selected
cities in Ethiopia. The only constraint is time.
Would you visit the cities in just any order?
Or would you place the 10 cities in groups based on their proximity to
each other, start with one group, and move on to the next
group?
The important point here is that, with the second plan, you would have visited the
cities in a more organized manner, and the time constraint
mentioned earlier would have been dealt with efficiently.