Unit 4
1. What do you mean by query processing? What are the various steps involved in query processing?
Explain with the help of a block diagram.
Ans: Query processing includes translation of high-level queries into low-level expressions that can be
used at the physical level of the file system, query optimization and actual execution of the query to get
the result. It is a three-step process that consists of parsing and translation, optimization and execution
of the query submitted by the user (see Figure 8.1). These steps are discussed below:
1. Parsing and translation: Whenever a user submits a query in high-level language (such as SQL) for
execution, it is first translated into its internal representation suitable to the system. The internal
representation of the query is based on the extended relational algebra. Thus, an SQL query is first
translated into an equivalent extended relational algebra expression. During translation, the parser
checks the syntax of the user's query according to the rules of the query language. It also verifies that all
the attributes and relation names specified in the query are valid names in the schema of the
database being queried.
2. Optimization: Generally, there are several possible ways of executing a query (known as execution
strategies), and different execution strategies can have different costs. It is the responsibility of
the query optimizer, a component of the DBMS, to choose the least costly execution strategy. This process of
choosing a suitable execution strategy for processing a query is known as query optimization. The
metadata stored in special tables, collectively called the DBMS catalog, is used to find the best way of
evaluating a query.
3. Execution: The chosen execution strategy is finally submitted to the query-evaluation engine for
actual execution of the query to get the desired result.
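For example, assuming the PUBLISHER relation used elsewhere in this chapter, the SQL query SELECT Pname FROM PUBLISHER WHERE P_ID = 'P002' would be translated during parsing into the equivalent extended relational-algebra expression πPname (σP_ID=“P002” (PUBLISHER)), which the optimizer and the evaluation engine then work with.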
2. What is a query tree? How is it different from a query evaluation plan? Explain with the help of an
example.
Ans: The relational-algebra expressions are typically represented as query trees or parse trees. A query
tree (also known as operator tree or expression tree) is a tree data structure in which the input
relations are represented as the leaf nodes and the relational-algebra operations as the internal nodes.
When the query tree is executed, an internal node's operation is executed whenever its operands are
available; the internal node is then replaced by the relation that results from executing the operation.
The execution terminates when the root node is executed and the result of the query is
produced. For example, consider the following relational-algebra expression:
Each operation in the query (such as select, project, join, etc.) can be evaluated using one of several
different algorithms, which are discussed later in this chapter. The query tree representation does not
specify how to evaluate each operation in the query. To fully specify how to evaluate a query, each
operation in the query tree is annotated with the instructions that specify the algorithm or the index to
be used to evaluate that operation. The resultant tree structure is known as the query-evaluation
plan or query-execution plan or simply plan. One possible query-evaluation plan for our example query
tree is given in Figure 8.3. In this figure, the select operator σ is annotated with the instruction Linear
Scan that specifies a complete scan of the relation.
Figure 8.3 Query-evaluation Plan
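The following minimal Python sketch (not the book's Figures 8.2 and 8.3) illustrates the idea of a query tree and its annotation into an evaluation plan; the node structure and all names are illustrative assumptions:

```python
# A minimal sketch of a query tree: leaf nodes are relations, internal
# nodes are relational-algebra operations. All names are illustrative.
class Node:
    def __init__(self, op, children=(), annotation=None):
        self.op = op                  # e.g. "PROJECT Pname" or a relation name
        self.children = list(children)
        self.annotation = annotation  # algorithm/index choice

# Query tree for a hypothetical query: project Pname after selecting
# the PUBLISHER tuple with P_ID = 'P002'.
tree = Node("PROJECT Pname",
            [Node("SELECT P_ID='P002'",
                  [Node("PUBLISHER")])])

# Annotating each operation with an algorithm turns the query tree
# into a query-evaluation plan (compare Figure 8.3).
tree.children[0].annotation = "Linear Scan"
```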
Ans: If a relation entirely fits in the main memory, in-memory sorting (called internal sorting) can be
performed using any standard sorting technique, such as quicksort. If the relation does not fit entirely
in the main memory, external sorting is required. In external sorting, some part of the relation is read in
the main memory, sorted and written back to the disk. This process continues until the entire relation is
sorted. The most commonly used external sorting algorithm is the external sort-merge algorithm. The
external sort-merge algorithm is divided into two phases, namely sort phase and merge phase.
Sort phase: In this phase, the records in the file to be sorted are divided into several groups,
called runs, such that each run can fit in the available buffer space. An internal sort is applied to each run
and the resulting sorted runs are written back to the disk as temporary sorted runs. The number of
initial runs (ni) is computed as ni = ⌈bR/M⌉, where bR is the number of blocks containing records of
relation R and M is the size of the available buffer in blocks.
Merge phase: In this phase, the sorted runs created during the sort phase are merged into larger runs
of sorted records. The merge continues until all records are in one large run. The output of merge phase
is the sorted relation.
Figure 8.4 External Sort-merge Algorithm
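A minimal Python sketch of the two phases follows (not the book's Figure 8.4); lists stand in for disk blocks, which is an assumption of this sketch:

```python
import heapq, itertools

def external_sort_merge(blocks, M):
    """Two-phase external sort-merge sketch.
    blocks: list of lists, each inner list standing in for one disk block;
    M: number of available buffer blocks."""
    # Sort phase: read M blocks at a time, sort them in memory,
    # and write each result out as one sorted run.
    runs = []
    for i in range(0, len(blocks), M):
        run = sorted(itertools.chain.from_iterable(blocks[i:i + M]))
        runs.append(run)          # produces ceil(bR / M) initial runs
    # Merge phase: (M-1)-way merge of the sorted runs. A single merge
    # pass is shown; with more runs than M-1 buffers, merging repeats.
    return list(heapq.merge(*runs))

# Example: 4 blocks, M = 3 buffer blocks -> 2 initial runs, one merge pass
print(external_sort_merge([[5, 3], [9, 1], [7, 2], [8, 4]], M=3))
```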
5. Compare the linear search and binary search algorithms for select operation. Which one is better?
Ans: Algorithm 1 (Linear search algorithm): In the linear search (also called brute force) algorithm, every file
block is scanned and all the records are tested to see whether their attribute values satisfy the selection
condition. It is the simplest algorithm and can be applied to any file, regardless of the ordering of the file
or the availability of indexes. However, it may be slower than the other algorithms that are used for
implementing selection. Since the linear search algorithm searches all the blocks of the file, it
requires bR block transfers, where bR is the total number of blocks of relation R. If the selection condition
involves a key attribute, the scan is terminated when the required record is found, without looking
further at other records of the relation. The average cost in this case is equal to bR/2 block transfers if
the record is found. In the worst case, the select operations involving the key attribute may also require
the scanning of all the records (bR block transfers) if no record satisfies the condition or the record is at
the last location.
Algorithm 2 (Binary search): If the file is ordered on an attribute and the selection condition involves an
equality comparison on that attribute, binary search can be used to retrieve the desired tuples. Binary
search is more efficient than linear search, as it requires scanning fewer file blocks. For
example, for the relational-algebra operation σP_ID=“P002” (PUBLISHER), binary search algorithm can be
applied if the PUBLISHER relation is ordered on the basis of the attribute P_ID. The cost of this algorithm
is equal to ⌈log2(bR)⌉ block transfers. If the selection condition involves a non-key attribute, it is
possible that more than one block contains the required records.
In general, the binary search algorithm is not well suited for searching in databases, since sorted files
are rarely used unless they also have a corresponding primary index file.
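A toy Python sketch of the two algorithms over an in-memory list of records (the record layout and values are hypothetical):

```python
from bisect import bisect_left

def linear_search(records, attr, value):
    # Brute force: test every record; works on any file ordering.
    return [r for r in records if r[attr] == value]

def binary_search(sorted_records, keys, value):
    # Requires the file to be ordered on the search attribute;
    # 'keys' is the ordered list of that attribute's values.
    i = bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return sorted_records[i]
    return None

# Hypothetical PUBLISHER file ordered on P_ID
pubs = [{"P_ID": "P001"}, {"P_ID": "P002"}, {"P_ID": "P003"}]
print(binary_search(pubs, [p["P_ID"] for p in pubs], "P002"))
```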
6. What are file scans and index scans? Discuss the search algorithms for select operations that
involve the use of indexes.
Ans: The search algorithms that scan the records of file to locate and retrieve the records satisfying the
selection condition are known as file scans. The search algorithms that involve the use of indexes are
known as index scans. An index is also referred to as an access path, as it provides a path through which
the data can be found and retrieved. Though indexes provide fast, direct and ordered access, they also add
an extra overhead of accessing the blocks containing the index. The algorithms that use an index are
described here.
Algorithm 3 (use of primary index, equality on key attribute): If the selection condition involves an
equality comparison on a key attribute with a primary index, the index can be used to retrieve the
record satisfying the equality condition. For example, for the relational-algebra operation
σP_ID=“P002” (PUBLISHER), the primary index on the key attribute P_ID (if available) can be used to directly
retrieve the desired record. If B+-tree is used, the cost of this algorithm is equal to the height of the tree
plus one block transfer to retrieve the desired record.
Algorithm 4 (use of clustering index, equality on non-key attribute): If the selection condition
involves an equality comparison on the non-key attribute with a clustering index, the index can be used
to retrieve all the records satisfying the selection condition. For example, for the relational-algebra
expression σP_ID=“P002” (BOOK), the clustering index on the non-key attribute P_ID of the BOOK relation (if
available) can be used to retrieve the desired records. The cost of this algorithm is equal to the height of
the tree plus the number of blocks containing the desired records, say b.
Algorithm 5 (use of secondary index, equality on key or non-key attribute): If the selection condition
involves an equality condition on an attribute with a secondary index, the index can be used to retrieve
the desired records. This search method can retrieve a single record if the indexing field is a key
(candidate key) and multiple records if the indexing field is a non-key attribute. In the first case, only one
record is retrieved; thus, the cost is equal to the height of the tree plus one block transfer to retrieve the
desired record. In the second case, multiple records may be retrieved and each record may reside on a
different disk block; thus, the cost is equal to the height of the tree plus n block transfers, where n is the
number of records fetched. For example, for the relational-algebra expression σPname=“Hills
Publications” (PUBLISHER), the secondary index on the candidate key Pname (if available) can be used to
retrieve the desired record.
Algorithm 6 (use of primary index, comparison): If the selection condition involves comparisons of
the form A>v, A>=v, A<v or A<=v, where A is an attribute with a primary index and v is the attribute
value, the primary ordered index can be used to retrieve the desired records. The cost of this algorithm
is equal to the height of the tree plus bR/2 block transfers.
For comparison conditions of the form A>=v, the record satisfying the corresponding equality
condition A=v is first searched using the primary index. A file scan starting from that record up to the
end of the file returns all the records satisfying the selection condition. For A>v, the file scan starts from
the first tuple that satisfies the condition A>v—the record satisfying the condition A=v is not included.
For the conditions of the form A<v, the index is not searched, rather the file is simply scanned starting
from the first record until the record satisfying the condition A=v is encountered. For A<=v, the file is
scanned until the first record satisfying the condition A>v is encountered. In both the cases, the index is
not useful.
Algorithm 7 (use of secondary index, comparison):The secondary ordered index can also be used for
the retrievals based on comparison conditions such as <, <=, > or >=. For the conditions of the
form A<v or A<=v, the lowest-level index blocks are scanned from the smallest value up to v and for the
conditions of the form A>v or A>=v, the index blocks are scanned from v up to the end of the file. The
secondary index provides the pointers to the records and not the actual records. The actual records are
fetched using the pointers. Thus, fetching each record may require a disk access, since records may be
on different blocks. This implies that using the secondary index may even be more expensive than using
the linear search if the number of fetched records is large.
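As a rough illustration of an index scan (Algorithm 5), the sketch below builds an in-memory secondary index mapping attribute values to record positions; in a real DBMS the index is a disk structure such as a B+-tree and each pointer dereference may cost a block transfer. All records and names are hypothetical:

```python
from collections import defaultdict

# Hypothetical BOOK records; a secondary index maps an attribute value
# to record positions ("pointers"), not to the records themselves.
books = [{"P_ID": "P001", "Book_title": "DBMS"},
         {"P_ID": "P002", "Book_title": "C++"},
         {"P_ID": "P002", "Book_title": "Java"}]

index = defaultdict(list)               # secondary index on P_ID
for pos, rec in enumerate(books):
    index[rec["P_ID"]].append(pos)

# Equality selection via index scan: follow the pointers; each fetch
# may touch a different disk block in a real system.
matches = [books[pos] for pos in index["P002"]]
print(matches)
```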
8. Discuss the nested-loop join and its several variations. Also analyze their respective costs in terms
of block transfers.
Ans: Nested-loop join: The nested-loop join algorithm consists of a pair of nested for loops, one for each
relation, say R and S. One of these relations, say R, is chosen as the outer relation and the other
relation, S, as the inner relation. The outer relation is scanned row by row and, for each row in the outer
relation, the inner relation is scanned and searched for matching rows. In other words, for each
tuple r in R, every tuple s in S is retrieved and then checked whether the two tuples satisfy the join
condition r[A]=s[B]. If the join condition is satisfied, the values of these two tuples are
concatenated and the combined tuple is added to the result. The tuple constructed by concatenating the values of
the tuples r and s is denoted by r.s. Since this algorithm scans the entire inner relation, it is also called naïve
nested-loop join.
For natural join, this algorithm can be extended to eliminate the repeated attributes using a project
operation. The nested-loop algorithms for equijoin and natural join operations are given in Figure 8.7.
This algorithm is similar to the linear search algorithm for the select operation in a way that it does not
require any indexes and it can be used with any join condition. Thus, it is also called the brute
force algorithm.
An extra buffer block is required to hold the tuples resulted from the join operation. When this block is
filled, the tuples are appended to the disk file containing the join result (known as result file). This buffer
block can then be reused to hold additional result records.
Figure 8.7 Nested-loop Join
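A minimal Python sketch of the naïve nested-loop join (not the book's Figure 8.7); relations as plain lists of dictionaries are an assumption of this sketch:

```python
def nested_loop_join(R, S, A, B):
    """Naive nested-loop join sketch: R is the outer relation, S the
    inner; A and B are the join attributes (illustrative names)."""
    result = []
    for r in R:                 # scan the outer relation row by row
        for s in S:             # scan the entire inner relation per outer row
            if r[A] == s[B]:    # join condition r[A] = s[B]
                result.append({**r, **s})   # concatenated tuple r.s
    return result

publisher = [{"P_ID": "P001"}, {"P_ID": "P002"}]
book = [{"Book_title": "C++", "P_ID": "P002"}]
print(nested_loop_join(publisher, book, "P_ID", "P_ID"))
```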
The two variations of nested-loop join algorithm are block nested-loop join and indexed nested-loop join
algorithm, which are discussed here.
Block nested-loop join: In this algorithm, the relations are processed on a per-block basis rather than
on a per-tuple basis. Thus, each block in the inner relation S is read only once for each block in the outer
relation R (instead of once for each tuple in R). The algorithm for block nested-loop join operation is
given in Figure 8.8.
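A corresponding sketch of the block nested-loop join (not the book's Figure 8.8), with relations modelled as lists of blocks, each block a list of records:

```python
def block_nested_loop_join(R_blocks, S_blocks, A, B):
    """Block nested-loop join sketch: each inner block is read once per
    outer BLOCK rather than once per outer tuple."""
    result = []
    for rb in R_blocks:              # one block of the outer relation
        for sb in S_blocks:          # each inner block read once per rb
            for r in rb:
                for s in sb:
                    if r[A] == s[B]:
                        result.append({**r, **s})
    return result
```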
Index nested-loop join: If an index is available on the join attribute of one relation say S, it is taken as
the inner relation, and the index scan (instead of file scan) is used to retrieve the matching tuples. In this
algorithm, each tuple r in R is retrieved, and the index available on the join attribute of relation S is used
to directly retrieve all the matching tuples. Indexed nested-loop join can be used with an existing index
as well as with a temporary index—an index created by the query optimizer and destroyed when the query
is completed. If a temporary index is used to retrieve the matching records, the algorithm is
called temporary indexed nested-loop join. If indexes are available on both the relations, the relation
with fewer tuples can be used as the outer relation for better efficiency. The algorithm for indexed
nested-loop join operation is given in Figure 8.9.
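A sketch of indexed nested-loop join under the same assumptions, using an in-memory dictionary as a stand-in for a temporary index on S's join attribute:

```python
from collections import defaultdict

def indexed_nested_loop_join(R, S, A, B):
    """Indexed nested-loop join sketch: an index on S's join attribute
    (here a temporary, in-memory one) replaces the inner file scan."""
    index = defaultdict(list)        # temporary index on S.B
    for s in S:
        index[s[B]].append(s)
    # For each outer tuple, retrieve matching tuples directly via the index.
    return [{**r, **s} for r in R for s in index[r[A]]]
```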
Cost analysis
To analyze the cost of these join algorithms, consider the following join operation on the Online Book
database: BOOK ⋈ PUBLISHER. Further, assume the following information about these two relations,
consistent with the calculations below: PUBLISHER has 2000 tuples stored in 100 blocks, and BOOK has
5000 tuples stored in 500 blocks. This information will be used to estimate the cost of each algorithm.
Nested-loop join: The nested-loop join algorithm is expensive, as it requires scanning of every tuple
in S for each tuple in R. The pairs of tuples that need to be scanned are nR * nS, where nR and nS denote
the number of tuples in R and S, respectively. For example, if PUBLISHER is taken as the outer relation
and BOOK as the inner relation, 2000 * 5000 = 10,000,000 (that is, 10^7) pairs of tuples need to be scanned.
In terms of the number of block transfers, the cost of this algorithm is equal to bR + (nR * bS),
where bR and bS denote the number of blocks of R and S, respectively. For example, if PUBLISHER is
taken as the outer relation and BOOK is taken as the inner relation, 100+(2000*500) = 1,000,100 block
transfers are required, which is very high. On the other hand, if BOOK is taken as the outer relation
and PUBLISHER as the inner relation, then 500+(5000*100) = 500,500 block transfers are required.
Thus, by making the smaller relation the inner relation, the overall cost of the join operation can be
reduced.
If one or both of the relations fit entirely in the main memory, the cost of the join operation will
be bR + bS block transfers. This is the best-case scenario. For example, if either of the
relations PUBLISHER and BOOK (or both) fits entirely in the memory, 100+500=600 block transfers are
required, which is optimal.
Block nested-loop join: In the worst case, when neither of the relations fits entirely in the memory, this
algorithm requires bR + (bR * bS) block transfers. It is efficient to use the smaller relation as the outer
relation. For example, if PUBLISHER is taken as the outer relation and BOOK as the inner relation,
100+(100*500) = 50,100 block transfers are required (which is much less than the 1,000,100 or 500,500
block transfers as in the nested-loop join). In the best case, if one of the relations fits entirely in the main
memory, the algorithm requires bR +bS block transfers, which is the same as that of the nested-loop join.
In this case, the smaller relation is chosen as the inner relation.
Indexed nested-loop join: The cost of this algorithm can be computed as bR +nR *c, where c is the cost
of traversing the index and retrieving all the matching tuples of S.
9. If the tuples of the relations (on which the join operation is to be performed) are physically sorted
by the value of their respective join attributes, which algorithm can be used to perform the join
operation? Explain with the help of an example. Also discuss its cost in terms of block transfers.
Ans: If the tuples of two relations say, R and S are physically sorted by the value of their respective join
attributes, the sort-merge join algorithm can be used to compute equijoins and natural joins. Both the
relations are scanned concurrently in the order of the join attributes to match the tuples that have the
same values for the join attributes. Since the relations are in a sorted order, each tuple needs to be
scanned only once and hence, each block is also read only once. Thus, if both the relations are in sorted
order, the sort-merge join requires only a single pass through both the relations. The sort-merge join
algorithm is given in Figure 8.10(a) and an example of sort-merge join algorithm is given in Figure
8.10(b). In this figure, the relations R and S are sorted on the join attribute a1.
Figure 8.10 Implementing Sort-merge Join
The number of block transfers is equal to the sum of the number of blocks in both the
relations R and S, that is, bR + bS. Thus, the join operation BOOK ⋈ PUBLISHER requires 100+500=600
block transfers. If the relations are not sorted, they can be first sorted using external sorting.
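A compact Python sketch of the merge step, assuming both inputs are already sorted on their join attributes (not the book's Figure 8.10):

```python
def sort_merge_join(R, S, A, B):
    """Sort-merge join sketch; R and S must already be sorted on the
    join attributes A and B, so each relation is scanned once."""
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][A] < S[j][B]:
            i += 1
        elif R[i][A] > S[j][B]:
            j += 1
        else:
            # Join all S tuples sharing this value with R[i], then advance R;
            # the next R tuple may match the same S group again.
            v, k = R[i][A], j
            while k < len(S) and S[k][B] == v:
                result.append({**R[i], **S[k]})
                k += 1
            i += 1
    return result
```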
10. Explain the hash join algorithm. Analyze its cost in terms of block transfers. What is the additional
optimization in hybrid hash join?
Ans: In a hash join algorithm, a hash function h is used to partition the tuples of each relation into sets
of tuples with each set containing tuples that have the same hash value on the join attributes A of
relation R and B of relation S. The algorithm is divided into two phases, namely, the partitioning
phase and the probing phase. In the partitioning (also called building) phase, the tuples of
relations R and S are partitioned into the hash file buckets using the same hash function h. Let Pr1, Pr2, …,
Prn be the partitions of tuples in relation R and Ps1, Ps2, …, Psn be the partitions of tuples in
relation S. Here, n is the number of partitions of relations R and S. During the partitioning phase, a single
in-memory buffer with size one disk block is allocated for each partition to store the tuples that hash to
this partition. Whenever the in-memory buffer gets filled, its contents are appended to a disk subfile
that stores this partition. For a simple two-way join, the partitioning phase has two iterations. In the first
iteration, the relation R is partitioned into n partitions, and in the second iteration, the relation S is
partitioned into n partitions.
In the probing (also called matching or joining) phase, n iterations are required. In the ith iteration, the
tuples in the partition Pri are joined with the tuples in the corresponding partition Psi.
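A minimal in-memory Python sketch of the two phases (a real implementation writes partitions to disk subfiles; here partitions are plain lists and the hash function is illustrative):

```python
from collections import defaultdict

def hash_join(R, S, A, B, n=4):
    """Hash join sketch: both relations are partitioned with the same
    hash function h, then matching partitions are probed pairwise."""
    h = lambda v: hash(v) % n
    Pr, Ps = defaultdict(list), defaultdict(list)
    for r in R:                       # partitioning phase, iteration 1
        Pr[h(r[A])].append(r)
    for s in S:                       # partitioning phase, iteration 2
        Ps[h(s[B])].append(s)
    result = []
    for i in range(n):                # probing phase: n iterations
        for r in Pr[i]:
            for s in Ps[i]:
                if r[A] == s[B]:      # only Pr_i x Ps_i pairs are checked
                    result.append({**r, **s})
    return result
```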
Cost analysis
The hash join algorithm requires 3(bR + bS) + 4n block transfers. The first term, 3(bR + bS), represents that
each block is transferred three times—once when the relations are read into the main memory for
partitioning, second when they are written back to the disk and third, when each partition is read again
during the joining phase. The second term, 4n, represents an extra overhead (2n for each relation) of
reading and writing of each partition to and from the disk.
If the memory size is relatively large, a variation of hash join called hybrid hash join can be used for
better performance. In hybrid hash join technique, the probing phase for one of the partitions is
combined with the partitioning phase. The main objective of this technique is to join as many tuples
during the partitioning phase as possible in order to save the cost of storing these tuples back to the disk
and accessing them again during the joining (probing) phase.
To better understand the hybrid hash join algorithm, consider a hash function h(K)=K mod n that
creates n partitions of the relation, where n < M, and consider the join operation BOOK ⋈ PUBLISHER. In
the first iteration, the algorithm divides the buffer space among n partitions such that all the blocks of
first partition of the smaller relation PUBLISHER completely reside in the main memory. A single input
buffer (with size one disk block) is kept for each of the other partitions, and they are written back to the
disk as in the regular hash join. Thus, at the end of the first iteration of the partitioning phase, the first
partition of the relation PUBLISHER resides completely in the main memory and other partitions reside
in a disk subfile.
In the second iteration of the partitioning phase, the tuples of the second relation BOOK are
partitioned. If a tuple hashes to the first partition, it is immediately joined with the matching tuple
of PUBLISHER relation. If the tuple does not hash to the first partition, it is partitioned normally. Thus, at
the end of second iteration, all the tuples of BOOK relation that hash to the first partition have been
joined with the tuples of PUBLISHER relation. Now, there are n-1 pairs of partitions on the disk.
Therefore, during the joining phase, n-1 iterations are needed instead of n.
14. Discuss the algorithms for implementing union, intersection and set difference operations.
Ans: The set operations, namely, union, intersection, and set difference can be implemented by first
sorting both relations on the same attributes and then scanning each sorted relation once to
produce the result. For example, the operation R ∪ S can be implemented by concurrently scanning and
merging both the sorted relations—whenever the same tuple exists in both the relations, only one of them
is kept in the merged result. The operation R ∩ S can be implemented by adding only those tuples to the
merged result that exist in both the relations. Finally, the operation R - S is implemented by retaining
only those tuples that exist in R but are not present in S.
If the relations are initially sorted in the same order, all these operations require bR +bS block transfers. If
the relations are not sorted initially, the cost of sorting the relations has to be included. The algorithms
for the operations R ∪ S, R ∩ S, and R - S using sorting are given in Figure 8.12.
Figure 8.12 Implementing Set Operations
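Minimal merge-based sketches of the three operations (not the book's Figure 8.12), assuming both inputs are already sorted, duplicate-free lists of tuples:

```python
def merge_union(R, S):
    # Tuples common to both inputs are kept only once.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            out.append(R[i]); i += 1
        elif S[j] < R[i]:
            out.append(S[j]); j += 1
        else:
            out.append(R[i]); i += 1; j += 1
    return out + R[i:] + S[j:]

def merge_intersection(R, S):
    # Keep only tuples that appear in both inputs.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]: i += 1
        elif S[j] < R[i]: j += 1
        else: out.append(R[i]); i += 1; j += 1
    return out

def merge_difference(R, S):           # R - S
    # Keep tuples of R that do not appear in S.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]: out.append(R[i]); i += 1
        elif S[j] < R[i]: j += 1
        else: i += 1; j += 1
    return out + R[i:]
```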
15. Discuss the two ways of evaluating a query. Which one is better and why? Does the available
buffer space affect the query evaluation speed? If yes, how?
Ans: There are two ways of evaluating a query, namely, materialized evaluation and pipelined
evaluation.
Materialized evaluation: In this approach, each operation in the expression is evaluated one by one in
an appropriate order and the result of each operation is materialized (created) in a temporary relation,
which becomes the input for the subsequent operations. For example, consider this relational-algebra
expression:
This expression consists of three relational operations, join, select and project. The query tree of this
expression is given in Figure 8.13.
In materialized evaluation, the lowest-level operations (operations at the bottom of operator tree) are
evaluated first. The inputs to the lowest-level operations are the relations in the database. For example,
in Figure 8.13, the operation σCategory=“Novel” (BOOK) is evaluated first and the result is stored in a temporary
relation. The temporary relation and the relation PUBLISHER are then given as inputs to the next-level
operation, that is, the join operation. The join operation then performs a join on these two relations and
stores the result in another temporary relation, which is given as input to the project operation, which is
at the root of the tree.
Pipelined evaluation: In this type of evaluation, the operations form a pipeline and results are passed
from one operation to another as they are produced, rather than being stored in temporary relations.
For example, the operations given in Figure 8.13 can be placed in a pipeline. As soon as a tuple is
generated from the select operation, it is immediately passed to the join operation for processing.
Similarly, a tuple generated from the join operation is passed immediately to the project operation for
processing. Thus, the evaluation of these three operations in a pipelined manner directly generates the
final result without creating temporary relations.
It is clear from our discussion that pipelined evaluation is better as it increases the query-evaluation
efficiency by reducing the number of temporary relations. The system maintains one buffer for each pair
of adjacent operations to hold the tuples being passed from one operation to the next. Moreover, since
the results of the operations are not stored for a long time, less memory is required in the
pipelining approach. However, queries that involve aggregate or grouping functions cannot be
evaluated using pipelined evaluation, because such operations require all of their input tuples to be
available before they can produce any output.
The available buffer space definitely affects the query evaluation speed. If a single buffer is used to hold
the intermediate result, it forces the CPU to wait for an I/O operation to complete. This is because while
the filled buffer is being written back to the disk, the CPU has nothing to do. On the other hand, in case
of double buffering, two buffers are used—one of them continues with the execution of the algorithm
while the other is being written to the disk. This allows CPU activity to proceed in parallel with I/O activity,
which in turn results in faster query evaluation.
Demand-driven pipeline: In this approach, the parent operation requests the next tuple from its child
operations as required, in order to output its next tuple. Whenever an operation receives a request for
the next set of tuples, it computes those tuples and then returns them. Since the tuples are
generated lazily, on demand, this technique is also known as lazy evaluation.
Producer-driven pipeline: In this approach, each operation at the bottom of the pipeline produces
tuples without waiting for the request from the next higher-level operations. Each operation then puts
the output tuples in the output buffer associated with it. This output buffer is used as input buffer by the
operation at next higher level. The operation at any other level consumes the tuples from its input
buffer to produce the output tuples, and puts them in its output buffer. In any case, if the output buffer
is full, the operation has to wait until tuples are consumed by the operation at the next higher level. As
soon as the buffer has some space for more tuples, the operation again starts producing the tuples. This
process continues until all the tuples are generated. Since the tuples are generated eagerly, this
technique is also known as eager pipelining.
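A demand-driven pipeline maps naturally onto Python generators; the sketch below pipelines the select, join and project operations of the Figure 8.13 example without materializing temporary relations (the relation contents are made-up):

```python
# Each operator pulls tuples from its child only when asked for the
# next tuple, so no temporary relation is created.
def scan(relation):
    for t in relation:
        yield t

def select(child, pred):
    return (t for t in child if pred(t))

def join(left, right_rel, A, B):          # naive pipelined join
    for l in left:
        for r in right_rel:
            if l[A] == r[B]:
                yield {**l, **r}

def project(child, attrs):
    return ({a: t[a] for a in attrs} for t in child)

book = [{"Book_title": "Ivanhoe", "Category": "Novel", "P_ID": "P002"}]
publisher = [{"P_ID": "P002", "Pname": "Hills Publications"}]

plan = project(join(select(scan(book), lambda t: t["Category"] == "Novel"),
                    publisher, "P_ID", "P_ID"),
               ["Pname"])
print(list(plan))     # tuples flow through the pipeline on demand
```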
17. What are the two techniques of implementing query optimization? How would you estimate the
cost of the query?
Ans: There are two main techniques of implementing query optimization, namely, cost-based
optimization and heuristics-based optimization.
The cost of executing a query includes, among other components:
the cost of accessing secondary storage—cost of searching, reading and writing disk blocks.
the communication cost—cost of communicating the query result from one site to another.
To estimate such costs, the DBMS catalog maintains statistics about each relation R, such as:
the blocking factor of relation R, denoted by fR, which is the number of tuples that fit in one block.
the number of distinct values appearing in relation R for the attribute A, denoted by V(A, R). This
value is the same as the size of πA (R). If A is a key attribute, then V(A, R) = nR.
Assuming that the tuples of relation R are stored together physically in a file, then bR = ⌈nR/fR⌉,
where nR is the number of tuples in R and bR is the number of blocks.
In addition to relation sizes and index heights, real-world query optimizers also maintain some
other statistical information to increase the accuracy of cost estimates. For example, most databases
maintain histograms representing the distribution of values for each attribute. The values for the
attribute are divided into a number of ranges, and the number of tuples whose attribute value falls in a
particular range is associated with that range in the histogram. For example, the histograms for the
attributes Price and Page_count of the BOOK relation are shown in Figure 8.14.
A histogram takes very little space, so histograms on various attributes can be maintained in the system
catalog. A histogram can be of two types, namely equi-width histogram and equi-depth histogram.
In equi-width histograms, the range of values is divided into equal-sized sub-ranges. On the other hand,
in equi-depth histograms, the sub-ranges are chosen in such a way that the number of tuples within
each sub-range is equal.
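A sketch of how an optimizer might use an equi-width histogram to estimate the selectivity of a condition such as Price <= v; the bucket boundaries and counts are made-up numbers:

```python
# Equi-width histogram on a hypothetical Price attribute:
# ((low, high), tuple_count) per bucket.
histogram = [((0, 100), 400), ((100, 200), 900), ((200, 300), 700)]
n_tuples = sum(count for _, count in histogram)

def estimate_selectivity(value):
    """Estimate the fraction of tuples with Price <= value, assuming
    values are spread uniformly inside each bucket."""
    covered = 0.0
    for (lo, hi), count in histogram:
        if value >= hi:
            covered += count                           # whole bucket qualifies
        elif value > lo:
            covered += count * (value - lo) / (hi - lo)  # partial bucket
    return covered / n_tuples

print(estimate_selectivity(150))   # ~ (400 + 450) / 2000 = 0.425
```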
Ans: An expression is said to be equivalent to a given relational-algebra expression if, on every legal
database instance, it generates the same set of tuples as that of original expression irrespective of the
ordering of the tuples. For example, consider a relational-algebra expression for the query "Retrieve the
publisher name and category of the book C++":
πPname, Category (σBook_title=“C++” (BOOK ⋈ PUBLISHER))
In this expression, first the join operation between the BOOK and PUBLISHER relation is performed, and
then the tuples with book title C++ are retrieved from the resultant relation, and finally, the project
operation is applied to retrieve the attributes Pname and Category. When this sequence of operations is
followed, it produces a large intermediate relation, which results from the join operation. However, the
user is only interested in those tuples having Book_title="C++". Thus, another way to execute this query
is to first select only those tuples from the BOOK relation satisfying the condition Book_title="C++", and
then join it with the PUBLISHER relation. This reduces the size of the intermediate relation. The query
can now be represented by the following expression:
πPname, Category (σBook_title=“C++” (BOOK) ⋈ PUBLISHER)
This expression is equivalent to the original expression but produces fewer tuples in the
intermediate relation and thus, it is more efficient.
Ans: An equivalence rule states that expressions of two forms are equivalent if an expression of one
form can be replaced by an expression of the other form, and vice versa. While discussing the equivalence
rules, it is assumed that c, c1, c2, …, cn are the predicates, A, A1, A2, …, An are the set of attributes and R,
R1, R2, …, Rn are the relation names.
Rule 1: The selection condition involving the conjunction of two or more predicates can be
deconstructed into a sequence of individual select operations. That is,
σc1∧c2∧…∧cn (R) ≡ σc1 (σc2 (… (σcn (R)) …))
Rule 3: If a query contains a sequence of project operations, only the final (outermost) operation is
needed; the others can be omitted. That is,
πA1 (πA2 (… (πAn (R)) …)) ≡ πA1 (R)
This transformation is called cascading of project operator.
Rule 4: If the selection condition c involves only the attributes A1, A2, …, An that are present in the
projection list, the two operations can be commuted. That is,
πA1, A2, …, An (σc (R)) ≡ σc (πA1, A2, …, An (R))
Rule 5: Selections can be combined with Cartesian products and joins. That is,
a. σc (R1 × R2) ≡ R1 ⋈c R2
b. σc1 (R1 ⋈c2 R2) ≡ R1 ⋈c1∧c2 R2
Rule 6: Cartesian product and join operations are commutative. That is, R1 × R2 ≡ R2 × R1 and R1 ⋈c R2 ≡ R2 ⋈c R1
Rule 7: Cartesian product and join operations are associative. That is, (R1 × R2) × R3 ≡ R1 × (R2 × R3) and (R1 ⋈ R2) ⋈ R3 ≡ R1 ⋈ (R2 ⋈ R3)
Rule 8: The select operation distributes over the join operation in these two conditions:
a. If all the attributes in the selection condition c1 involve only the attributes of one of the relations
(say, R1), then the select operation distributes. That is, σc1 (R1 ⋈c R2) ≡ (σc1 (R1)) ⋈c R2
b. If the selection condition c1 involves only the attributes of R1 and c2 involves the attributes of R2, then
the select operation distributes. That is, σc1∧c2 (R1 ⋈c R2) ≡ (σc1 (R1)) ⋈c (σc2 (R2))
Rule 9: The project operation distributes over the join operation in these two conditions:
a. If the join condition c involves only the attributes in A1 ∪ A2, where A1 and A2 are the attributes
of R1 and R2 respectively, the project operation distributes. That is, πA1∪A2 (R1 ⋈c R2) ≡ (πA1 (R1)) ⋈c (πA2 (R2))
b. If the attributes A3 and A4 of R1 and R2, respectively, are involved in the join condition c, but not
in A1 ∪ A2, then, πA1∪A2 (R1 ⋈c R2) ≡ πA1∪A2 ((πA1∪A3 (R1)) ⋈c (πA2∪A4 (R2)))
Rule 10: The set operations union and intersection are commutative and associative. That is,
a. Commutative: R1 ∪ R2 ≡ R2 ∪ R1 and R1 ∩ R2 ≡ R2 ∩ R1
b. Associative: (R1 ∪ R2) ∪ R3 ≡ R1 ∪ (R2 ∪ R3) and (R1 ∩ R2) ∩ R3 ≡ R1 ∩ (R2 ∩ R3)
Rule 11: The select operation distributes over the union, intersection and set difference operations. That
is, σc (R1 θ R2) ≡ σc (R1) θ σc (R2), where θ stands for ∪, ∩ or −
Rule 12: The project operation distributes over the union operation. That is, πA (R1 ∪ R2) ≡ (πA (R1)) ∪ (πA (R2))
21. Discuss the steps followed by heuristic query optimizer to transform an initial query tree into an
optimal query tree. Give a suitable example in support of your answer.
Ans: The heuristic query optimizer follows these steps to transform the initial query tree into an optimal
query tree.
Conjunctive selection operations are transformed into cascading selection operations (equivalence
rule 1—refer Question 23).
Since the select operation commutes with other operations according to the equivalence rules 2,
4, 8 and 11, it is moved as far down the query tree as is permitted by the attributes involved in the selection
condition. That is, the select operation is performed as early as possible to reduce the number of tuples
returned.
The select and join operations that are more restrictive (i.e., result in relations with the fewest tuples) are
executed first. That is, the select and join operations are rearranged so that they are accomplished with
the least amount of system overhead. This is done by reordering the leaf nodes of the tree among
themselves by applying rules 6, 7 and 10 (commutativity and associativity of binary operations).
Any Cartesian product operation followed by a select operation is combined into a single join
operation according to the equivalence rule 5(a).
The cascaded project operations are broken down and the attributes in the projection list are moved
down the query tree as far as possible. New project operations can also be created as required.
Only the attributes that are required in the query result and in subsequent operations are kept
after each project operation. This eliminates the return of unnecessary attributes (rules 3, 4, 9
and 12).
The subtrees that represent groups of operations that can be executed by a single algorithm are
identified.
The initial query tree of this expression is given in Figure 8.15(a). After applying the select operation on
the BOOK relation early, the number of tuples can be reduced. Similarly, after applying the project
operation to extract the desired attributes P_ID, Pname and Address from the PUBLISHER relation,
and P_ID from the result of the select operation on the BOOK relation (P_ID is the join attribute), the size of the intermediate
result can be reduced further. The final expression now becomes
The transformed query tree of the final expression after applying the select and project operations
earlier is given in Figure 8.15(b).
Figure 8.15 Transformation from Initial Query Tree to Final Query Tree
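A toy sketch of one of the steps above (pushing a select below a join per equivalence rule 8(a)) on a tuple-encoded query tree; the tree encoding and the schema knowledge are assumptions of this sketch:

```python
# Push a select below a join when its condition touches only the left
# input (equivalence rule 8a). All names are hypothetical.
def push_select_below_join(tree):
    op, *rest = tree
    if op == "select" and rest[1][0] == "join":
        cond, (_, left, right, jcond) = rest
        attrs_of_left = {"Category"}          # assumed schema knowledge
        if cond[0] in attrs_of_left:          # cond applies only to left input
            return ("join", ("select", cond, left), right, jcond)
    return tree

initial = ("select", ("Category", "Novel"),
           ("join", ("rel", "BOOK"), ("rel", "PUBLISHER"), "P_ID"))
print(push_select_below_join(initial))
# -> ('join', ('select', ('Category', 'Novel'), ('rel', 'BOOK')),
#     ('rel', 'PUBLISHER'), 'P_ID')
```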
25. Consider the following relational-algebra queries on the Online Book database. Draw the initial
query trees for each of these queries.
Ans: