Unit 4
Computation of Data Cubes and OLAP Queries
Data warehouses contain huge volumes of data. OLAP servers demand that
decision support queries be answered in the order of seconds.
Therefore, it is crucial for data warehouse systems to support highly
efficient cube computation techniques, access methods, and query
processing techniques. At the core of multidimensional data analysis is
the efficient computation of aggregations across many sets of dimensions. In
SQL terms, these aggregations are referred to as group-by’s. Each group-by
can be represented by a cuboid, where the set of group-by’s forms a lattice
of cuboids defining a data cube.
One approach to cube computation extends SQL so as to include a compute
cube operator. The compute cube operator computes aggregates over all
subsets of the dimensions specified in the operation. This can require
excessive storage space, especially for large numbers of dimensions.
Suppose that you would like to create a data cube for Electronics sales that
contains the following: city, item, year, and sales in dollars.
Taking the three attributes city, item, and year as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids,
or group-by’s, that can be computed for this data cube is 2^3 = 8. The possible
group-by’s are the following: {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}, where () means that the group-by is
empty (i.e., the dimensions are not grouped). These group-by’s form a lattice
of cuboids for the data cube, as shown in the figure below. The base cuboid
contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions. The apex cuboid, or 0-D
cuboid, refers to the case where the group-by is empty. It contains the total
sum of all sales. The base cuboid is the least generalized (most specific) of
the cuboids. The apex cuboid is the most generalized (least specific) of the
cuboids, and is often denoted as all. If we start at the apex cuboid and
explore downward in the lattice, this is equivalent to drilling down within the
data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.
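As a small illustration (not part of the original example), the following Python sketch enumerates this lattice of cuboids for the three dimensions city, item, and year; the function name is purely illustrative.

from itertools import combinations

def enumerate_cuboids(dimensions):
    """List every group-by (cuboid) of the given dimensions, from the
    base cuboid (all dimensions) down to the apex cuboid ()."""
    return [subset
            for k in range(len(dimensions), -1, -1)
            for subset in combinations(dimensions, k)]

dims = ["city", "item", "year"]
for cuboid in enumerate_cuboids(dims):
    print(cuboid or "()")            # the empty tuple is the apex cuboid, 'all'
print(len(enumerate_cuboids(dims)))  # 2**3 = 8 cuboids

Walking this list from the base cuboid toward () corresponds to rolling up; the reverse direction corresponds to drilling down.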
An SQL query containing no group-by, such as “compute the sum of total
sales,” is a zero-dimensional operation. An SQL query containing one group-
by, such as “compute the sum of sales, group by city,” is a one-dimensional
operation. A cube operator on n dimensions is equivalent to a collection of
group by statements, one for each subset of the n dimensions. Therefore,
the cube operator is the n-dimensional generalization of the group by
operator. Based on the syntax of DMQL, the data cube in this example could
be defined as
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the
base cuboid. A statement such as
compute cube sales_cube
would explicitly instruct the system to compute the sales aggregate cuboids
for all of the eight subsets of the set {city, item, year}, including the empty
subset.
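To make the semantics of the compute cube operator concrete, here is a minimal, brute-force Python sketch that computes the sum-of-sales aggregate for every subset of {city, item, year} over a small in-memory relation. The sample rows, values, and function names are illustrative assumptions, not part of the DMQL example above.

from itertools import combinations
from collections import defaultdict

def compute_cube(rows, dimensions, measure):
    """Naively compute every cuboid: one sum-aggregate per subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            groups = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in subset)   # () for the apex cuboid
                groups[key] += row[measure]
            cube[subset] = dict(groups)
    return cube

# Hypothetical sample of the sales relation.
sales = [
    {"city": "Vancouver", "item": "TV",    "year": 2023, "sales_in_dollars": 1200.0},
    {"city": "Vancouver", "item": "Phone", "year": 2023, "sales_in_dollars": 800.0},
    {"city": "Toronto",   "item": "TV",    "year": 2024, "sales_in_dollars": 1500.0},
]

cube = compute_cube(sales, ["city", "item", "year"], "sales_in_dollars")
print(cube[()])          # apex cuboid: {(): 3500.0}
print(cube[("city",)])   # 1-D cuboid grouped by city

This brute-force computation already hints at the storage problem discussed next: the number of cuboids, and hence the work and space, grows exponentially with the number of dimensions.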
On-line analytical processing may need to access different cuboids for
different queries.
Therefore, it may seem like a good idea to compute all or at least some of
the cuboids in a data cube in advance. Pre-computation leads to fast
response times and avoids some redundant computation. A major
challenge related to this pre-computation, however, is that the required
storage space may explode if all of the cuboids in a data cube are pre-
computed, especially when the cube has many dimensions. The storage
requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen above,
is 2^n. However, in practice, many dimensions do have hierarchies. For
example, the dimension time is usually not explored at only one conceptual
level, such as year, but rather at multiple conceptual levels, such as in the
hierarchy “day < month < quarter < year”. For an n-dimensional data cube,
the total number of cuboids that can be generated is:

Total number of cuboids = ∏_{i=1}^{n} (L_i + 1) = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1)

where L_i is the number of levels associated with dimension i. One is added to
L_i in the equation above to include the virtual top level, all.
Example
If the cube has 10 dimensions and each dimension has 4 levels, what will be
the number of cuboids generated?
Solution
Here n = 10 and L_i = 4 for i = 1, 2, …, 10.
Thus,
Total number of cuboids = (4 + 1)^10 = 5 × 5 × … × 5 = 5^10 = 9,765,625 ≈ 9.8 × 10^6
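The same calculation can be scripted; the following minimal sketch (function and variable names are illustrative) evaluates the formula for the worked example above.

from math import prod

def total_cuboids(levels_per_dimension):
    """Total number of cuboids for a cube whose i-th dimension has L_i
    hierarchy levels; each factor is L_i + 1 to include the virtual level 'all'."""
    return prod(l + 1 for l in levels_per_dimension)

# Worked example above: 10 dimensions, each with 4 levels.
print(total_cuboids([4] * 10))   # 9765625, i.e. about 9.8 * 10**6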
As the above example shows, it is unrealistic to pre-compute and materialize all of
the cuboids that can possibly be generated for a data cube (or from a base
cuboid). If there are many cuboids, and these cuboids are large in size, a
more reasonable option is partial materialization, that is, to materialize only
some of the possible cuboids that can be generated.
Computation of Selected Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not pre-compute any of the cuboids. This
leads to computing expensive multidimensional aggregates on the fly,
which can be extremely slow.
2. Full materialization: Pre-compute all of the cuboids. The resulting
lattice of computed cuboids is referred to as the full cube. This choice
typically requires huge amounts of memory space in order to store all
of the pre-computed cuboids.
3. Partial materialization: Selectively compute a proper subset of the
whole set of possible cuboids. Alternatively, we may compute a subset
of the cube, which contains only those cells that satisfy some user-
specified criterion, such as where the tuple count of each cell is above
some threshold. We will use the term sub-cube to refer to the latter
case, where only some of the cells may be pre-computed for various
cuboids. Partial materialization represents an interesting trade-off
between storage space and response time.
The partial materialization of cuboids or sub-cubes should consider three
factors: (1) identify the subset of cuboids or sub-cubes to materialize; (2)
exploit the materialized cuboids or sub-cubes during query processing; and
(3) efficiently update the materialized cuboids or sub-cubes during load and
refresh.
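To illustrate factor (2), exploiting materialized cuboids during query processing, the sketch below answers a group-by on city by rolling up a hypothetical materialized (city, item) cuboid instead of recomputing from the base data. The function name, dimension values, and measures are illustrative assumptions.

from collections import defaultdict

def roll_up(cuboid, dims, keep_dims):
    """Answer a group-by on keep_dims by further aggregating a materialized
    cuboid that was computed over the (larger) dimension set dims."""
    keep_idx = [dims.index(d) for d in keep_dims]
    result = defaultdict(float)
    for cell, value in cuboid.items():
        result[tuple(cell[i] for i in keep_idx)] += value
    return dict(result)

# Hypothetical materialized (city, item) cuboid holding sum(sales) values.
city_item = {("Chicago", "TV"): 1200.0, ("Chicago", "Phone"): 800.0, ("Boston", "TV"): 1500.0}
print(roll_up(city_item, ["city", "item"], ["city"]))  # {('Chicago',): 2000.0, ('Boston',): 1500.0}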
The selection of the subset of cuboids or sub-cubes to materialize should
take into account the queries in the workload, their frequencies, and their
accessing costs. In addition, it should consider workload characteristics, the
cost for incremental updates, and the total storage requirements. A popular
approach is to materialize the set of cuboids on which other frequently
referenced cuboids are based. Alternatively, we can compute an iceberg
cube, which is a data cube that stores only those cube cells whose
aggregate value (e.g., count) is above some minimum support threshold.
Another common strategy is to materialize a shell cube. This involves pre-
computing the cuboids for only a small number of dimensions (such as 3 to
5) of a data cube. Queries on additional combinations of the dimensions can
be computed on-the-fly. An iceberg cube can be specified with an SQL query,
as shown in the following example.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
The compute cube statement specifies the pre-computation of the iceberg
cube, sales_iceberg, with the dimensions month, city, and customer_group,
and the aggregate measure count(*). The input tuples are in the salesInfo
relation. The cube by clause specifies that aggregates (group-by’s) are to be
formed for each of the possible subsets of the given dimensions, and the
having clause states the iceberg condition: only cells whose count is at least
min_sup are retained.
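As a rough illustration of what the iceberg condition does, the following Python sketch computes count-based cuboids over an in-memory salesInfo relation and keeps only the cells whose count meets the threshold. The sample rows, function name, and threshold value are illustrative assumptions.

from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dimensions, min_sup):
    """Compute count() for every cuboid, keeping only cells with count >= min_sup."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            counts = Counter(tuple(r[d] for d in subset) for r in rows)
            kept = {cell: c for cell, c in counts.items() if c >= min_sup}
            if kept:
                cube[subset] = kept
    return cube

# Hypothetical salesInfo tuples.
sales_info = [
    {"month": "Jan", "city": "Chicago", "customer_group": "retail"},
    {"month": "Jan", "city": "Chicago", "customer_group": "retail"},
    {"month": "Feb", "city": "Boston",  "customer_group": "wholesale"},
]

# Only cells seen at least twice survive; the Feb/Boston/wholesale cells are dropped.
print(iceberg_cube(sales_info, ["month", "city", "customer_group"], min_sup=2))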
Data Warehouse Back-End Tools
As already mentioned, data warehousing systems use various data
extraction and cleaning tools, and load and refresh utilities for populating
data warehouses. Below we describe the back-end tools and utilities.
Data Cleaning
The data warehouse involves large volumes of data from multiple sources,
which can lead to a high probability of errors and anomalies in the data.
Inconsistent field lengths, inconsistent descriptions, inconsistent value
assignments, missing entries, and violations of integrity constraints are some
examples. Three classes of data cleaning tools are popularly used to help
detect and correct such anomalies (a minimal rule-based check is sketched
after this list):
Data migration tools allow simple transformation rules to be
specified.
Data scrubbing tools use domain-specific knowledge to do the
scrubbing of data. Tools such as Integrity and Trillium fall in this
category.
Data auditing tools make it possible to discover rules and
relationships by scanning data. Thus, such tools may be considered
variants of data mining tools.
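The sketch below, referenced above, illustrates the kind of simple, rule-based checks such tools apply: it flags missing entries and inconsistent field lengths in a list of records. The field names, expected lengths, and record layout are purely illustrative assumptions.

def find_anomalies(records, required_fields, expected_lengths):
    """Flag records with missing entries or field values of unexpected length."""
    anomalies = []
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if value in (None, ""):
                anomalies.append((i, field, "missing entry"))
            elif field in expected_lengths and len(str(value)) != expected_lengths[field]:
                anomalies.append((i, field, "inconsistent field length"))
    return anomalies

# Hypothetical customer records with a missing city and a malformed postal code.
customers = [
    {"name": "A. Smith", "city": "Denver", "postal_code": "80203"},
    {"name": "B. Jones", "city": "",       "postal_code": "9021"},
]

print(find_anomalies(customers, ["name", "city", "postal_code"], {"postal_code": 5}))
# [(1, 'city', 'missing entry'), (1, 'postal_code', 'inconsistent field length')]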
Load
After extracting, cleaning, and transforming, data will be loaded into the data
warehouse. A load utility has to allow the system administrator to monitor
status, to cancel, suspend and resume a load, and to restart after failure with
no loss of data integrity. Sequential loads can take a very long time to
complete, especially when dealing with terabytes of data, so pipelined and
partitioned parallelism are typically used. Incremental loading is also more
popular than a full load with most commercial utilities, since it reduces the
volume of data that has to be incorporated into the data warehouse.
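The following sketch shows one way a batched, restartable load might look, assuming an SQLite warehouse table and a per-batch commit so that a failed load can resume from the last committed batch; the table name, schema, and batch size are illustrative assumptions, not a real load utility.

import sqlite3

def load_in_batches(conn, rows, batch_size=1000):
    """Append rows to the sales fact table, committing after each batch so a
    failed load can be restarted without losing the integrity of committed batches."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS sales_fact (city TEXT, item TEXT, year INTEGER, sales REAL)")
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", batch)
        conn.commit()  # checkpoint: everything up to this batch is durable

conn = sqlite3.connect(":memory:")
load_in_batches(conn, [("Chicago", "TV", 2024, 999.0), ("Boston", "Phone", 2024, 499.0)], batch_size=1)
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # 2

In a real utility, partitioned parallelism would additionally split the input across several such loaders, each working on a different partition of the data.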
Refresh
Refreshing a warehouse consists in propagating updates on source data to
correspondingly update the base data and derived data stored in the
warehouse. There are two sets of issues to consider: when to refresh, and
how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily or
weekly). Only if some OLAP queries need current data is it necessary to
propagate every update. The refresh policy is set by the warehouse
administrator depending on user needs, and may be different for different
sources.
Refresh techniques may also depend on the characteristics of the source and
the capabilities of the database servers. Extracting an entire source file or
database is usually too expensive, but may be the only choice for legacy
data sources. Most contemporary database systems provide replication
servers that support incremental techniques for propagating updates from a
primary database to one or more replicas. Such replication servers can be
used to incrementally refresh a warehouse when the sources change.
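As a rough sketch of incremental refresh under assumed names, the code below applies a change set captured from a source (inserted, updated, or deleted rows) to base data held as a dictionary and keeps a derived per-city total in step, rather than re-extracting the whole source.

def incremental_refresh(base, city_totals, changes):
    """Apply a change set {key: new_row or None} to the base data and update
    the derived per-city sales totals incrementally."""
    for key, new_row in changes.items():
        old_row = base.get(key)
        if old_row is not None:                 # back out the old contribution
            city_totals[old_row["city"]] -= old_row["sales"]
        if new_row is None:                     # deletion propagated from the source
            base.pop(key, None)
        else:                                   # insert or update
            base[key] = new_row
            city_totals[new_row["city"]] = city_totals.get(new_row["city"], 0.0) + new_row["sales"]

# Hypothetical base data, derived aggregate, and captured changes.
base = {1: {"city": "Chicago", "sales": 100.0}}
city_totals = {"Chicago": 100.0}
incremental_refresh(base, city_totals, {1: {"city": "Chicago", "sales": 150.0},
                                        2: {"city": "Boston", "sales": 80.0}})
print(city_totals)   # {'Chicago': 150.0, 'Boston': 80.0}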
Data Warehouse Tuning
Data warehouse tuning is the process of applying different strategies to the
various operations of a data warehouse so that its performance measures
improve. This requires a complete knowledge of the data warehouse. We can
tune different aspects of a data warehouse, such as performance, data load,
and queries. A data warehouse keeps evolving, and it is unpredictable what
query a user will pose in the future, which makes a data warehouse system
harder to tune. Tuning a data warehouse is difficult for the following reasons:
A data warehouse is dynamic; it never remains constant.
It is very difficult to predict what query the user is going to post in the
future.
Business requirements change with time.
Users and their profiles keep changing.
The user can switch from one group to another.
The data load on the warehouse also changes with time.
Data Warehouse Testing
Testing is very important for data warehouse systems to make them work
correctly and efficiently. There are three basic levels of testing performed on
a data warehouse:
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
This test is performed by the developer.
Integration Testing
In integration testing, the various modules of the application are
brought together and then tested against a number of inputs.
It is performed to check whether the various components work well
together after integration.
System Testing
In system testing, the whole data warehouse application is tested
together.
The purpose of system testing is to check whether the entire system
works correctly together or not.
System testing is performed by the testing team.
Since the size of the whole data warehouse is very large, it is usually
possible to perform only minimal system testing before the full test plan
can be enacted.