Unit 4
Computation of Data Cubes and OLAP Queries
Data warehouses contain huge volumes of data. OLAP servers demand that
decision support queries be answered in the order of seconds.
Therefore, it is crucial for data warehouse systems to support highly
efficient cube computation techniques, access methods, and query
processing techniques. At the core of multidimensional data analysis is
the efficient computation of aggregations across many sets of dimensions. In
SQL terms, these aggregations are referred to as group-by’s. Each group-by
can be represented by a cuboid, where the set of group-by’s forms a lattice
of cuboids defining a data cube.
One approach to cube computation extends SQL so as to include a compute
cube operator. The compute cube operator computes aggregates over all
subsets of the dimensions specified in the operation. This can require
excessive storage space, especially for large numbers of dimensions.
Suppose that you would like to create a data cube for Electronics sales that
contains the following: city, item, year, and sales in dollars.
Taking the three attributes city, item, and year as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids,
or group-by’s, that can be computed for this data cube is 2^3 = 8. The possible
group-by’s are the following: {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}, where () means that the group-by is
empty (i.e., the dimensions are not grouped). These group-by’s form a lattice
of cuboids for the data cube, as shown in the figure below. The base cuboid
contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions. The apex cuboid, or 0-D
cuboid, refers to the case where the group-by is empty. It contains the total
sum of all sales. The base cuboid is the least generalized (most specific) of
the cuboids. The apex cuboid is the most generalized (least specific) of the
cuboids, and is often denoted as all. If we start at the apex cuboid and
explore downward in the lattice, this is equivalent to drilling down within the
data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.
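As a small illustration (not part of the original example), the following Python sketch enumerates this lattice of cuboids for the three dimensions city, item, and year; the function name is purely illustrative.

from itertools import combinations

def enumerate_cuboids(dimensions):
    """List every group-by (cuboid) of the given dimensions, from the
    base cuboid (all dimensions) down to the apex cuboid ()."""
    return [subset
            for k in range(len(dimensions), -1, -1)
            for subset in combinations(dimensions, k)]

dims = ["city", "item", "year"]
for cuboid in enumerate_cuboids(dims):
    print(cuboid or "()")            # the empty tuple is the apex cuboid, 'all'
print(len(enumerate_cuboids(dims)))  # 2**3 = 8 cuboids

Walking this list from the base cuboid toward () corresponds to rolling up; the reverse direction corresponds to drilling down.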
An SQL query containing no group-by, such as “compute the sum of total
sales,” is a zero-dimensional operation. An SQL query containing one group-
by, such as “compute the sum of sales, group by city,” is a one-dimensional
operation. A cube operator on n dimensions is equivalent to a collection of
group by statements, one for each subset of the n dimensions. Therefore,
the cube operator is the n-dimensional generalization of the group by
operator. Based on the syntax of DMQL, the data cube in this example could
be defined as
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the
base cuboid. A statement such as
compute cube sales_cube
would explicitly instruct the system to compute the sales aggregate cuboids
for all of the eight subsets of the set {city, item, year}, including the empty
subset.
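To make the semantics of the compute cube operator concrete, here is a minimal, brute-force Python sketch that computes the sum-of-sales aggregate for every subset of {city, item, year} over a small in-memory relation. The sample rows, values, and function names are illustrative assumptions, not part of the DMQL example above.

from itertools import combinations
from collections import defaultdict

def compute_cube(rows, dimensions, measure):
    """Naively compute every cuboid: one sum-aggregate per subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            groups = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in subset)   # () for the apex cuboid
                groups[key] += row[measure]
            cube[subset] = dict(groups)
    return cube

# Hypothetical sample of the sales relation.
sales = [
    {"city": "Vancouver", "item": "TV",    "year": 2023, "sales_in_dollars": 1200.0},
    {"city": "Vancouver", "item": "Phone", "year": 2023, "sales_in_dollars": 800.0},
    {"city": "Toronto",   "item": "TV",    "year": 2024, "sales_in_dollars": 1500.0},
]

cube = compute_cube(sales, ["city", "item", "year"], "sales_in_dollars")
print(cube[()])          # apex cuboid: {(): 3500.0}
print(cube[("city",)])   # 1-D cuboid grouped by city

This brute-force computation already hints at the storage problem discussed next: the number of cuboids, and hence the work and space, grows exponentially with the number of dimensions.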
On-line analytical processing may need to access different cuboids for
different queries.
Therefore, it may seem like a good idea to compute all or at least some of
the cuboids in a data cube in advance. Pre-computation leads to fast
response times and avoids some redundant computation. A major
challenge related to this pre-computation, however, is that the required
storage space may explode if all of the cuboids in a data cube are pre-
computed, especially when the cube has many dimensions. The storage
requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen above,
is 2^n. However, in practice, many dimensions do have hierarchies. For
example, the dimension time is usually not explored at only one conceptual
level, such as year, but rather at multiple conceptual levels, such as in the
hierarchy “day < month < quarter < year”. For an n-dimensional data cube,
the total number of cuboids that can be generated is:

Total number of cuboids = ∏_{i=1}^{n} (L_i + 1) = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1)

where L_i is the number of levels associated with dimension i. One is added to
L_i in the equation above to include the virtual top level, all.
Example
If the cube has 10 dimensions and each dimension has 4 levels, what will be
the number of cuboids generated?
Solution
Here n = 10 and L_i = 4 for i = 1, 2, …, 10.
Thus,
Total number of cuboids = (4 + 1)^10 = 5 × 5 × … × 5 = 5^10 = 9,765,625 ≈ 9.8 × 10^6
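The same calculation can be scripted; the following minimal sketch (function and variable names are illustrative) evaluates the formula for the worked example above.

from math import prod

def total_cuboids(levels_per_dimension):
    """Total number of cuboids for a cube whose i-th dimension has L_i
    hierarchy levels; each factor is L_i + 1 to include the virtual level 'all'."""
    return prod(l + 1 for l in levels_per_dimension)

# Worked example above: 10 dimensions, each with 4 levels.
print(total_cuboids([4] * 10))   # 9765625, i.e. about 9.8 * 10**6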
As the above example shows, it is unrealistic to pre-compute and materialize all of
the cuboids that can possibly be generated for a data cube (or from a base
cuboid). If there are many cuboids, and these cuboids are large in size, a
more reasonable option is partial materialization, that is, to materialize only
some of the possible cuboids that can be generated.
Computation of Selected Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not pre-compute any of the cuboids. This
leads to computing expensive multidimensional aggregates on the fly,
which can be extremely slow.
2. Full materialization: Pre-compute all of the cuboids. The resulting
lattice of computed cuboids is referred to as the full cube. This choice
typically requires huge amounts of memory space in order to store all
of the pre-computed cuboids.
3. Partial materialization: Selectively compute a proper subset of the
whole set of possible cuboids. Alternatively, we may compute a subset
of the cube, which contains only those cells that satisfy some user-
specified criterion, such as where the tuple count of each cell is above
some threshold. We will use the term sub-cube to refer to the latter
case, where only some of the cells may be pre-computed for various
cuboids. Partial materialization represents an interesting trade-off
between storage space and response time.
The partial materialization of cuboids or sub-cubes should consider three
factors: (1) identify the subset of cuboids or sub-cubes to materialize; (2)
exploit the materialized cuboids or sub-cubes during query processing; and
(3) efficiently update the materialized cuboids or sub-cubes during load and
refresh.
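To illustrate factor (2), exploiting materialized cuboids during query processing, the sketch below answers a group-by on city by rolling up a hypothetical materialized (city, item) cuboid instead of recomputing from the base data. The function name, dimension values, and measures are illustrative assumptions.

from collections import defaultdict

def roll_up(cuboid, dims, keep_dims):
    """Answer a group-by on keep_dims by further aggregating a materialized
    cuboid that was computed over the (larger) dimension set dims."""
    keep_idx = [dims.index(d) for d in keep_dims]
    result = defaultdict(float)
    for cell, value in cuboid.items():
        result[tuple(cell[i] for i in keep_idx)] += value
    return dict(result)

# Hypothetical materialized (city, item) cuboid holding sum(sales) values.
city_item = {("Chicago", "TV"): 1200.0, ("Chicago", "Phone"): 800.0, ("Boston", "TV"): 1500.0}
print(roll_up(city_item, ["city", "item"], ["city"]))  # {('Chicago',): 2000.0, ('Boston',): 1500.0}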
The selection of the subset of cuboids or sub-cubes to materialize should
take into account the queries in the workload, their frequencies, and their
accessing costs. In addition, it should consider workload characteristics, the
cost for incremental updates, and the total storage requirements. A popular
approach is to materialize the set of cuboids on which other frequently
referenced cuboids are based. Alternatively, we can compute an iceberg
cube, which is a data cube that stores only those cube cells whose
aggregate value (e.g., count) is above some minimum support threshold.
Another common strategy is to materialize a shell cube. This involves pre-
computing the cuboids for only a small number of dimensions (such as 3 to
5) of a data cube. Queries on additional combinations of the dimensions can
be computed on-the-fly. An iceberg cube can be specified with an SQL query,
as shown in the following example.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
The compute cube statement specifies the pre-computation of the iceberg
cube, sales_iceberg, with the dimensions month, city, and customer_group,
and the aggregate measure count(*). The input tuples are in the salesInfo
relation. The cube by clause specifies that aggregates (group-by’s) are to be
formed for each of the possible subsets of the given dimensions, and the
having clause states the iceberg condition: only cells whose count is at least
min_sup are retained.
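As a rough illustration of what the iceberg condition does, the following Python sketch computes count-based cuboids over an in-memory salesInfo relation and keeps only the cells whose count meets the threshold. The sample rows, function name, and threshold value are illustrative assumptions.

from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dimensions, min_sup):
    """Compute count() for every cuboid, keeping only cells with count >= min_sup."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            counts = Counter(tuple(r[d] for d in subset) for r in rows)
            kept = {cell: c for cell, c in counts.items() if c >= min_sup}
            if kept:
                cube[subset] = kept
    return cube

# Hypothetical salesInfo tuples.
sales_info = [
    {"month": "Jan", "city": "Chicago", "customer_group": "retail"},
    {"month": "Jan", "city": "Chicago", "customer_group": "retail"},
    {"month": "Feb", "city": "Boston",  "customer_group": "wholesale"},
]

# Only cells seen at least twice survive; the Feb/Boston/wholesale cells are dropped.
print(iceberg_cube(sales_info, ["month", "city", "customer_group"], min_sup=2))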
Data Warehouse Back-End Tools
As already mentioned, data warehousing systems use various data
extraction and cleaning tools, and load and refresh utilities for populating
data warehouses. Below we describe the back-end tools and utilities.
Data Cleaning
The data warehouse involves large volumes of data from multiple sources,
which can lead to a high probability of errors and anomalies in the data.
Inconsistent field lengths, inconsistent descriptions, inconsistent value
assignments, missing entries, and violations of integrity constraints are some
examples. Three classes of data cleaning tools are popularly used to help
detect and correct such anomalies (a minimal rule-based check is sketched
after this list):
Data migration tools allow simple transformation rules to be
specified.
Data scrubbing tools use domain-specific knowledge to do the
scrubbing of data. Tools such as Integrity and Trillium fall in this
category.
Data auditing tools make it possible to discover rules and
relationships by scanning data. Thus, such tools may be considered
variants of data mining tools.
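The sketch below, referenced above, illustrates the kind of simple, rule-based checks such tools apply: it flags missing entries and inconsistent field lengths in a list of records. The field names, expected lengths, and record layout are purely illustrative assumptions.

def find_anomalies(records, required_fields, expected_lengths):
    """Flag records with missing entries or field values of unexpected length."""
    anomalies = []
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if value in (None, ""):
                anomalies.append((i, field, "missing entry"))
            elif field in expected_lengths and len(str(value)) != expected_lengths[field]:
                anomalies.append((i, field, "inconsistent field length"))
    return anomalies

# Hypothetical customer records with a missing city and a malformed postal code.
customers = [
    {"name": "A. Smith", "city": "Denver", "postal_code": "80203"},
    {"name": "B. Jones", "city": "",       "postal_code": "9021"},
]

print(find_anomalies(customers, ["name", "city", "postal_code"], {"postal_code": 5}))
# [(1, 'city', 'missing entry'), (1, 'postal_code', 'inconsistent field length')]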
Load
After extracting, cleaning, and transforming, data will be loaded into the data
warehouse. A load utility has to allow the system administrator to monitor
status, to cancel, suspend and resume a load, and to restart after failure with
no loss of data integrity. Sequential loads can take a very long time to
complete, especially when dealing with terabytes of data, so pipelined and
partitioned parallelism are typically used. Incremental loading is also more
popular than a full load with most commercial utilities, since it reduces the
volume of data that has to be incorporated into the data warehouse.
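The following sketch shows one way a batched, restartable load might look, assuming an SQLite warehouse table and a per-batch commit so that a failed load can resume from the last committed batch; the table name, schema, and batch size are illustrative assumptions, not a real load utility.

import sqlite3

def load_in_batches(conn, rows, batch_size=1000):
    """Append rows to the sales fact table, committing after each batch so a
    failed load can be restarted without losing the integrity of committed batches."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS sales_fact (city TEXT, item TEXT, year INTEGER, sales REAL)")
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", batch)
        conn.commit()  # checkpoint: everything up to this batch is durable

conn = sqlite3.connect(":memory:")
load_in_batches(conn, [("Chicago", "TV", 2024, 999.0), ("Boston", "Phone", 2024, 499.0)], batch_size=1)
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # 2

In a real utility, partitioned parallelism would additionally split the input across several such loaders, each working on a different partition of the data.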
Refresh
Refreshing a warehouse consists in propagating updates on source data to
correspondingly update the base data and derived data stored in the
warehouse. There are two sets of issues to consider: when to refresh, and
how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily or
weekly). Only if some OLAP queries need current data is it necessary to
propagate every update. The refresh policy is set by the warehouse
administrator depending on user needs, and may be different for different
sources.
Refresh techniques may also depend on the characteristics of the source and
the capabilities of the database servers. Extracting an entire source file or
database is usually too expensive, but may be the only choice for legacy
data sources. Most contemporary database systems provide replication
servers that support incremental techniques for propagating updates from a
primary database to one or more replicas. Such replication servers can be
used to incrementally refresh a warehouse when the sources change.
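As a rough sketch of incremental refresh under assumed names, the code below applies a change set captured from a source (inserted, updated, or deleted rows) to base data held as a dictionary and keeps a derived per-city total in step, rather than re-extracting the whole source.

def incremental_refresh(base, city_totals, changes):
    """Apply a change set {key: new_row or None} to the base data and update
    the derived per-city sales totals incrementally."""
    for key, new_row in changes.items():
        old_row = base.get(key)
        if old_row is not None:                 # back out the old contribution
            city_totals[old_row["city"]] -= old_row["sales"]
        if new_row is None:                     # deletion propagated from the source
            base.pop(key, None)
        else:                                   # insert or update
            base[key] = new_row
            city_totals[new_row["city"]] = city_totals.get(new_row["city"], 0.0) + new_row["sales"]

# Hypothetical base data, derived aggregate, and captured changes.
base = {1: {"city": "Chicago", "sales": 100.0}}
city_totals = {"Chicago": 100.0}
incremental_refresh(base, city_totals, {1: {"city": "Chicago", "sales": 150.0},
                                        2: {"city": "Boston", "sales": 80.0}})
print(city_totals)   # {'Chicago': 150.0, 'Boston': 80.0}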
Data Warehouse Tuning
Data warehouse tuning is the process of applying different strategies to the
various operations of a data warehouse so that its performance measures
improve. This requires a complete knowledge of the data warehouse. We can
tune different aspects of a data warehouse, such as performance, data load,
and queries. A data warehouse keeps evolving, and it is unpredictable what
query a user will pose in the future, which makes a data warehouse system
harder to tune. Tuning a data warehouse is difficult for the following reasons:
A data warehouse is dynamic; it never remains constant.
It is very difficult to predict what query the user is going to post in the
future.
Business requirements change with time.
Users and their profiles keep changing.
The user can switch from one group to another.
The data load on the warehouse also changes with time.
Data Warehouse Testing
Testing is very important for data warehouse systems to make them work
correctly and efficiently. There are three basic levels of testing performed on
a data warehouse:
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
This test is performed by the developer.
Integration Testing
In integration testing, the various modules of the application are
brought together and then tested against a number of inputs.
It is performed to check whether the various components work well
together after integration.
System Testing
In system testing, the whole data warehouse application is tested
together.
The purpose of system testing is to check whether the entire system
works correctly together or not.
System testing is performed by the testing team.
Since the size of the whole data warehouse is very large, it is usually
possible to perform only minimal system testing before the full test plan
can be enacted.