
OLAP/Data Warehouses

Yannis Kotidis
What is a Database?
• From Wikipedia:
– A database is a structured collection of records or data. A computer database relies upon software to
organize the storage of data. The software models the database structure in what are known as
database models. The model in most common use today is the relational model. Other models such as
the hierarchical model and the network model use a more explicit representation of relationships …

– Database management systems (DBMS) are the software used to organize and maintain the database.
These are categorized according to the database model that they support. The model tends to
determine the query languages that are available to access the database. A great deal of the internal
engineering of a DBMS, however, is independent of the data model, and is concerned with managing
factors such as performance, concurrency, integrity, and recovery from hardware failures. ...

2
Note

• The term “database” is often used interchangeably for both the data and the system that manages it

3
Basic Database Usage (1): Querying

[Diagram: statements (selecting columns and rows) are applied to relations and return results; e.g., a relation with columns A B C D E is projected onto columns A and D]

4
Basic Database Usage (2): Updates
• Banking transaction: transfer 100 euro from
account A to account B
– What can go wrong?

Account   Balance
A         275   (−100)
B          64   (+100)

5
Issue 1: Partial results
• System failure prior to adding funds to
account B (but after deleting them from A)

Account   Balance
A         175   (−100 applied)
B          64   (+100 never applied: SYSTEM FAILURE)

6
Issue 2: No isolation
• To an observer that monitors all funds, money seems to temporarily disappear (and then reappear)

Account   Balance
A         175   (−100 applied)
B          64   (+100 pending)      total funds appear reduced by 100

7
Issue 3: lost update
• Two concurrent transactions on account A
– T1: remove 100
– T2: remove 50
Account   Balance
A         275
B          64

T1: read balance of A (275), subtract 100, write balance 175
T2: read balance of A (275), subtract 50,  write balance 225

Whichever write happens last overwrites the other, so one of the two updates is lost.

8
Programming abstraction: Transactions
• Implement real-world transactions

Begin → Run → {Commit | Abort}

• DBMSs guarantee ACID properties


– Atomicity
– Consistency
– Isolation
– Durability

9
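A minimal SQL sketch of the transfer expressed as a single transaction (the accounts table and its columns are illustrative, not part of the slides):

  -- Transfer 100 euro from account A to account B atomically
  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
  UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
  COMMIT;
  -- If anything fails before COMMIT, a ROLLBACK undoes both updates (atomicity),
  -- and isolation hides the intermediate state from concurrent transactions.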
Atomicity (A.C.I.D.)
• The "all or nothing" property.
– Programmer needn't worry about partial states persisting.
– Two possible outcomes: the transaction commits or rolls back (aborts)

Begin → Run → {Commit | Abort}

• Examples:
– T1: Delete person from consultants table, insert person into
employees table
– T2: Transfer funds from account A to account B

10
Consistency (A.C.I.D)
• The database should start out "consistent" (in a legal state) and, at the end of the transaction, remain "consistent".
• What "consistent" means is defined to the system by the database administrator
  – integrity constraints
  – other notions of consistency must be handled by the application.

11
Integrity or correctness of data
• Would like data to be “accurate” or
“correct” at all times

EMP:   Name     Age
       John      52
       Jim       24
       Martha     1

CREATE TABLE EMP (
  Name varchar(255) NOT NULL,
  Age int,
  CHECK (Age >= 18)
);

12
Integrity/consistency constraints
• Predicates data must satisfy
• Examples:
– age >= 18 and age < 65
– x is key of relation R
– x → y holds in R
– Domain(x) = {Red, Blue, Green}
– no employee should make more than twice the
average salary

13
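The last constraint cannot be expressed as a simple column CHECK; as a sketch, a query like the following (assuming an EMP table with a salary column, which is not part of the slides) would report any violating rows:

  SELECT *
  FROM EMP
  WHERE salary > 2 * (SELECT AVG(salary) FROM EMP);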
Isolation (A.C.I.D)
• Each transaction must appear to be executed
as if no other transaction is executing at the
same time.
• Transfer funds from A to B (T1).
• Another teller makes a query on A and B (T2).
• Without isolation, T2 might observe the transferred funds as already removed from A but not yet added to B!
  – With isolation, the result is as if T2 ran entirely before or entirely after T1, independently of when the two transactions were submitted

14
Durability (A.C.I.D.)
• Once committed, a transaction's effects should not disappear.
– Of course, they may be overwritten by subsequent
committed transactions.

15
Implementation
• A, C, and D are mostly guaranteed by recovery
(usually implemented via logging).
• I is mostly guaranteed by concurrency control
(usually implemented via locking).
• Of course, life is not so simple. For example,
recovery typically requires concurrency
control and depends on certain behavior by
the buffer manager…

16
Operational DBs: OLTP systems
• OLTP= On-Line Transaction Processing
– order update: pull up order# XXX and update status
flag to “completed”
update Orders set status=“Completed”
where orderID=“XXX”
The row for orderID=“XXX” is located via an index on Orders.orderID

17
Reconstruction of logical records
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24

• List projects & hours assigned to employee Nick Long


Select Pname,Hours
From Employees E, Projects P, Assignments A
Where E.Ename = “Nick Long”
And E.EmpID=A.EmpID
And A.ProjID=P.ProjID
18
Physical Plan (step a): IndexSeek
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24

Nick Long
σE.name=“Nick Long”(Employees) <102,Nick Long>

Index on Employees.Ename
19
Physical Plan (step b):
INLJ(Employees,Assignments)
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24

EmpID=102
<102,2,24>
Employees Assignments <102,3,8>

Index on Assignments.EmpID
20
Physical Plan (step c):
INLJ(Assignments,Projects)
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24

ProjID=2
<2,Web_TV>
ProjID=3
Assignments Projects <3,Web_portal>

Index on Projects.ProjID (primary key)


21
On-Line Transaction Processing
• Examples
– order update: pull up order# XXX and update status flag to “completed”
– banking: transfer 100 euros from account #A to account #B
• Transactions:
– Implement structured, repetitive clerical data processing tasks
– Require detailed, up-to-date data
– Are (most of the times) short-lived
• read and/or update a few records
• Integrity of the database is critical
– DBMS should manage hundreds or thousands of concurrent
transactions
• Systems supporting this kind of activity are called transactional
systems
– Most traditional database management systems

22
Transactional Systems
• Transactional systems are optimized primarily for the
here and now

• Can support many simultaneous users


– concurrent read/write access

• Transactional systems don’t necessarily record all


previous data states
– E.g. customer updates its address (moves to new town)

• Lots of data gets thrown away or archived


– Old orders are deleted/archived to reduce size

23
Analytical queries on a production
system?
• CEO wants to report total sales per store in Athens,
for stores with at least 500 sales
• 3 tables: Sales(custid, productid,storeid,amt)
Stores(storeid, manager,addressid)
Addresses(addressid,number,street,city)
SELECT Stores.storeid, SUM(amt) as totalSales Aggregation
FROM Sales, Stores, Addresses
WHERE Stores.storeid = Sales.storeid Joins
AND Stores.addressid=Addresses.addresid
AND Addresses.city=“Athens”
GROUP BY Stores.storeid Group by
HAVING count(*) >= 500 Filter/Aggregation
24
Logical Plan
π storeId, totalSales
  σ COUNT(*) >= 500
    γ Stores.storeId, SUM(amt) → totalSales, COUNT(*)
      Sales ⋈ Stores ⋈ σ city=“Athens”(Addresses)

Sales(custid, productid, storeid, amt)
Stores(storeid, manager, addressid)
Addresses(addressid, number, street, city)

What happens if new sales take place while this query executes?
Sad realization
• Analytical queries on an operational database often take forever
– Schema favors small atomic actions
• Excessive normalization results in costly joins
– Need to scan LOTS of records
• Indexes are not very useful when queries are not
selective
– Interference with daily transactions
• Overhead of OLTP engine (logging, locking)

26
My employees & their projects
EmpID Ename ProjID Pname City Hours
101 John Smith 3 Web_portal Thessaloniki 16
102 Nick Long 2 Web_TV Athens 24
103 Susan Goal 3 Web_portal Thessaloniki 8
104 John English 4 Billing Athens 32
105 Alice Web 4 Billing Athens 24
106 Patricia Kane 4 Billing Athens 24

• This schema is bad for OLTP (it is only in 1NF)
  – Update anomalies, repetition of values
• But it is all we need for reporting on our employees and their projects!

27
OLAP:
ONLINE ANALYTICAL PROCESSING
OLAP (Online analytical processing)
• OLAP is the process of creating and
summarizing historical, multidimensional data
– Enhances organizational understanding of data
– Supports informed decision-making through
Decision Support Systems and Business
Intelligence
– Enables users to manipulate and explore data
easily and intuitively

34
Data Analytics Stack
OLAP
• Well-defined computations over data categorized by multiple dimensions of interest
• Enables users to easily and selectively extract and query data in order to analyse it from different points of view

Data Mining
• Seeks to find relationships and patterns in data
• Frequent itemsets, association rules, clustering, sentiment analysis

Machine Learning
• Builds models for prediction, classification, etc.
• Classification, forecasting, churn
OLAP Examples
A. Group sales data (facts) across different dimensions: Product, Customer, Location (point of sale) and Time
   ▪ Dimensions identify what, who, where & when
B. Compute interesting stats on selected measures

Examples:
1. “Average January sales (€) for all stores in Attika”
2. “Number of shoes over 100€ sold to female customers between ages 18 and 25”
3. “Top-10 product-categories whose sales (%) increased the most over the past year“

Can you identify the dimensions in these queries???
1st query in more detail
“Average January sales (€) for all stores in Attika”

• 1st dimension denotes when (time)
• 2nd dimension denotes where (location)
• A common aggregate function: AVG() over the available measure (sales €)
• Other examples: Max(), Min(), Count(), StDev(), Median()
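As a sketch in SQL, query 1 can be written against a star schema like the SALES/TIME/LOCATION tables used later in these slides (attribute names such as TIME.month and LOCATION.state are assumptions about that schema):

  SELECT AVG(SALES.amount)
  FROM SALES, TIME, LOCATION
  WHERE SALES.time_key = TIME.time_key
    AND SALES.location_key = LOCATION.location_key
    AND TIME.month = 'January'        -- the "when" dimension
    AND LOCATION.state = 'Attika';    -- the "where" dimension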


OLAP vs. OLTP
                    OLTP                         OLAP
User                Clerk, IT professional       Knowledge worker
Function            Day-to-day operations        Decision support
DB design           Application-oriented         Subject-oriented
                    (E-R based)                  (star, snowflake)
Data                Current, isolated            Historical, consolidated
View                Detailed, flat relational    Summarized, multidimensional
Usage               Structured, repetitive       Ad hoc (+ reporting)
Unit of work        Short, simple transaction    Complex query
Access              Read/write                   Read mostly
Operations          Index/hash on primary key    Lots of scans
# Records accessed  Tens                         Millions
# Users             Thousands                    Hundreds
DB size             100 MB - GB                  100 GB - TB
Metric              Transaction throughput       Query throughput, response time
46
DATA WAREHOUSES
The Data Warehouse
• In order to support OLAP, data is collected from multiple data sources, cleansed and organized in data warehouses

• The data warehouse is a huge repository of enterprise data that will be used for decision making

• After data is loaded in the data warehouse, OLAP cubes are often pre-summarized across dimensions of interest to drastically improve query time
Data Warehouse definition
• A decision support database that is
maintained separately from the organization’s
operational databases.
• A data warehouse is a
• subject-oriented,
• integrated,
• time-varying,
• non-volatile
collection of data that is used primarily in
organizational decision making.
-- W.H. Inmon, Building the Data Warehouse, 1992.
Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing
• Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process

50
Integrated
• Constructed by integrating multiple,
heterogeneous data sources
– relational databases, files, external sources
• Data cleaning and data integration techniques are
applied
– Ensure consistency in naming conventions, keys,
attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is
transformed

51
Time-Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current data, old values
overwritten, deleted or archived
– Data warehouse: provides data from a historical
perspective (e.g., past 5-10 years) for trend
analysis

52
Non-volatile
• A physically separate store of data
transformed from the operational
environment
• Operational update of data does not occur in
the data warehouse environment
– Does not require transaction processing, recovery,
and concurrency control mechanisms
– Requires only two operations in data accessing:
• loading of data and access to data
53
Data Warehouse Architecture
[Architecture diagram]
• Data Sources: operational DBs and other sources (json, csv files)
• Data Storage: Extract, Transform, Load through a staging area, coordinated by a monitor & integrator; data lands in the Data Warehouse (plus metadata and data marts)
• OLAP Engine: serves pre-computed OLAP cubes
• Front-End Tools: query/reporting, OLAP, data mining

Data Sources → Data Storage → OLAP Engine → Front-End Tools


56
Implementation
• Warehouse database server
– Almost always a relational DBMS.
• OLAP Servers (for computing OLAP Cubes)
– Relational OLAP (ROLAP): extended relational DBMS that maps
operations on multidimensional data to standard relational
operations.
– Multidimensional OLAP (MOLAP): special purpose server that
directly implements multidimensional data and operations.
• Clients
– Query and reporting tools.
– Analysis tools.
– Data mining tools.

57
Data Marts

• Smaller warehouses
• Span part of organization
– e.g., marketing (customers, products, sales)
• Do not require enterprise-wide consensus
– But may lead to long term integration
problems

58
ETL: Extract-Transform-Load
• Data is periodically (e.g. every night) pulled from the sources and feeds
the Data Warehouse
– Modern applications stress the need for (near) real-time processing of
updates (will not be covered in this class)
• To update the Data Warehouse with new data, ETL (Extract, Transform,
Load) processes are utilized to extract, validate, cleanse, correct,
transform, and load the data
• Verifying data accuracy to ensure that the data is correct and consistent
• Removing duplicates to eliminate redundant entries
• Filling in or removing incomplete data to ensure that all data points are
complete and consistent
• Standardizing the data to ensure consistency in format and representation.
• High-quality data leads to better business decisions!
• Once the data has been loaded, precomputations are carried out in the
form of data cubes (either complete or partial) to accelerate the
processing of common queries
Data Warehouse Summary
• A data warehouse serves as a centralized
repository housing structured data, primarily
geared towards facilitating analytics and business
intelligence endeavours.
• Data warehouses are highly efficient, due to the
structured nature of the data inside them that
enables efficient SQL-based analytics.
• The enforcement of a consistent schema across
all stored data greatly enhances the usability and
reliability of the stored data.
Related Technologies

Data Warehouse
• Structured repository
• Primarily for analytics & BI
• Optimized for SQL-based analytics
• Expensive ETL processes for data transformation

Data Lake
• Flexible repository (typically cloud based)
• Stores raw data, including structured, semi-structured, and unstructured data formats
• Supports diverse data types and formats (text, video, audio, streams, logs)
• Suitable for exploration, experimentation and ML workflows

Data Lakehouse
• Hybrid approach
• Merges Data Lake flexibility with Data Warehouse structure
• Supports both operational and analytical workloads
Basic Query Pattern
• The analyst selects a subset of dimensions from
the data and computes relevant statistics to
derive insights.
• In SQL this is expressed by grouping records using
the selected attributes and computing aggregate
functions (e.g. sum(), average(), count(), max())
over each group
– “Group by followed by aggregation”
– Additional filtering may be used to restrict the scope
of the query
Example
• “Compute the total revenue (=sum) the
minimum and maximum price for each
combination of customer and store”
Time Customer Store Product Price

⚫ Sales Data: T1 C1 S2 P1 $90


T2 C2 S1 P2 $70
T3 C1 S1 P2 $45
T4 C3 S1 P1 $40
T5 C1 S2 P2 $25
facts
T6 C1 S2 P2 $50
T7 C2 S1 P4 $45
T8 C3 S1 P1 $10

available dimensions measure


63
In SQL: Group By + Aggregation
Select Customer, Store, SUM(Price) as Revenue, MIN(Price) as MinPrice,
MAX(Price) as MaxPrice
From Sales Group by Customer, Store
Time Customer Store Product Price
T1 C1 S2 P1 $90
T2 C2 S1 P2 $70
1. Identify T3 C1 S1 P2 $45
groups: T4 C3 S1 P1 $40

C1,S1 T5 C1 S2 P2 $25

C2,S1
T6 C1 S2 P2 $50
T7 C2 S1 P4 $45
C3,S1 T8 C3 S1 P1 $10
C1,S2
Customer Store Revenue Min Price Max Price
C2 S1 $115 $45 $70

Perform
C1 S1 $45 $45 $45
2.
aggregation
C3 S1 $50 $10 $40
C1 S2 $165 $25 $90
64
Relational Algebra (logical plan)

γStore, Customer, SUM(Price)->Revenue, MIN(Price)->MinPrice, MAX(Price)->MaxPrice

Sales
Map data and aggregates into a high-
dimensional space
• Example: compute total sales volume per
productID and storeID
Total Sales                 ProductID
                      1        2        3        4
StoreID    1        $454       -        -      $925
           2        $468     $800       -        -
           3        $296       -      $240       -
           4        $652       -      $540     $745

All sales of ProductID 1 at storeID 2 are accumulated in the $468 cell; each cell value is the result of the aggregation.
66
Multidimensional Data Model
• A data warehouse is a collection of data points or facts that exist in a multidimensional space.
These data points can represent various entities such as sales, orders, contracts, and so on.
• A fact has
– A set of dimensions with respect to which data is analyzed
• e.g., store, product, date associated with a sale
– A set of measures
• quantity that is analyzed, e.g., sale amount, quantity
• The dimensions create a sparsely populated coordinate system, where not all possible
combinations exist as facts.
– For example, it is unlikely that a customer has visited every single store. Therefore, some
combinations of dimensions may have no corresponding facts or data points.
• Each dimension is associated with a set of attributes that provide additional information
about the data points. These attributes can be used to provide context and details about the
data.
– e.g., owner, city and state of store
• Values of a dimension in a database may be related to one another.
– For example, the "product" dimension may have a hierarchical relationship, where each product
belongs to a category and each category belongs to a larger group. This relationship between values
can be used to create hierarchies or drill-down paths for analysis.

67
Product Hierarchy

[Tree diagram, originally in Greek: Products → {Meats, Dairy, Legumes} → {Pork, Beef, Yogurt, Cheese, Milk, Lentils, Beans, Rice} → specific varieties (e.g. steak, smoked, minced, fillet, graviera, feta, whole, evaporated, giant beans, Carolina rice) → individual product codes such as p187, p96]

p187, p96, …: the product codes for all “feta”-type cheeses

68
More on Attribute Hierarchies
• Values of a dimension may be related
– Hierarchies are most common
• Dependency graph may be:
  – Hierarchy (tree): e.g., city → state → country
  – Lattice: e.g., date → month → year and date → week (of a year) → year

69
Another example
• VIN: Vehicle Identification Number (unique key)
• Model: e.g. Fiesta
• Type: e.g. Compact Car
• Manufacturer: e.g. Ford

[Lattice diagram: VIN → Model, with Model rolling up to both Type and Manufacturer]
Using hierarchies
• When projecting data into a set of dimensions, it is
common to select an appropriate hierarchy level for
each dimension based on the analysis being
performed.
– “Compute total sales per productID”
Vs
– “Compute total sales per product-category”

• In the second query, sales of different productIDs that


all belong to the same category e.g. “Milk” will be
accumulated together in the same “coordinate” (value)
of the category dimension
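As a sketch in SQL, the two queries above differ only in the grouping attribute (the SALES/PRODUCT tables are those of the star schema introduced later in these slides):

  -- Total sales per productID (group directly on the fact table's product key)
  SELECT product_key, SUM(amount)
  FROM SALES
  GROUP BY product_key;

  -- Total sales per product-category (join with the PRODUCT dimension to reach the category level)
  SELECT PRODUCT.category, SUM(amount)
  FROM SALES, PRODUCT
  WHERE SALES.product_key = PRODUCT.product_key
  GROUP BY PRODUCT.category;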
Multidimensional View of selected
hierarchy levels per dimension
• Aggregate sales volume as a function of product (category), time (day-of-week), geography (city)

[3-D cube figure: product-category axis (Milk, Soda, Beer, Bread, Toothpaste, Soap), day-of-week axis (1: Sunday, 2: Monday, …, 7), city axis (NY, SF, LA). All of NY's sales of milk on a Sunday are aggregated in a single cell; example values for one (city, day) column: Milk 20, Soda 45, Beer 18, Bread 22, Toothpaste 07, Soap 06.]
72
Roll-up Operation
• Dimension reduction:
  – e.g., total sales by city by product → total sales by city

                     Product
                1       2       3       4
     NY       $454      -       -     $925
     SF       $468    $485      -     $315
City LA       $296      -     $340      -
     SE       $652      -     $640    $645

     Total Sales by City
     NY   $1379
     SF   $1268
     LA    $636
     SE   $1937

• Navigating the attribute hierarchy: Roll-up
  – e.g., sales by city → total sales by state → total sales by country
  – e.g., total sales by city and year → total sales by state and by year → total sales by country
73
Note
• At this point we are discussing operators that let us “move around” the multidimensional space, creating different “projections” of the warehouse data.
• A ROLLUP operator also exists in the SQL language, but there it works slightly differently, computing multiple group bys with a single query.
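As a sketch, SQL's ROLLUP over the SALES/LOCATION star schema used later in these slides would produce sales totals per (country, city), per country, and a grand total in one statement:

  SELECT LOCATION.country, LOCATION.city, SUM(SALES.amount)
  FROM SALES, LOCATION
  WHERE SALES.location_key = LOCATION.location_key
  GROUP BY ROLLUP (LOCATION.country, LOCATION.city);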
Drill-Down
• Drill-down: Inverse operation of roll-up
– Provides the data set that was aggregated
• e.g., show “base” data for total sales figure of the state of CA

75
Other Operations
• Selection (slice & dice) defines a
subcube
• Project the cube on fewer
dimensions by specifying coordinates
of remaining dimensions
• e.g., sales to customer XXX
• Ranking
• top 3% of cities by average sales

76
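A sketch of a slice in SQL: fix one dimension to a single coordinate and aggregate over the rest (this assumes the fact table carries a customer_key, a hypothetical column; the star schema used elsewhere in these slides keys on time, product and location instead):

  SELECT product_key, time_key, SUM(amount)
  FROM SALES
  WHERE customer_key = 'XXX'        -- hypothetical customer dimension key
  GROUP BY product_key, time_key;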
Warehouse Database Schema
• Relational design should reflect
multidimensional view
• Typical schemas:
– Star Schema
– Snowflake Schema
– Fact Constellation Schema

• Data tables (relations) are of two types: fact


tables and dimension tables
77
The Star Schema (Example 1)

Dimension tables:
  TIME(time_key, day, day_of_the_week, month, quarter, year)
  PRODUCT(product_key, product_name, category, brand, color, supplier_name)
  LOCATION(location_key, store, street_address, city, state, country, region)

Fact table:
  SALES(time_key, product_key, location_key, units, amount)
  – time_key, product_key and location_key are foreign keys to the dimension tables
  – units and amount are the measures
78
SALES Fact Table
  time_key  product_key  location_key  units  amount
  T1        P44          L4              1      12
  T2        P157         L4              3     180
  T2        P6           L1             14    2560
  T3        P25          L3              1       2
  T3        P157         L1              1      60

• A table in the data warehouse that contains facts consisting of
  – Numerical performance measures
  – Foreign keys that tie the fact data to the dimension tables
• Each row records measurements describing a fact
  – Where? When? Who? How much? How many?
• Provides the most detailed view of the data an analyst has access to in the data warehouse
  – this denotes the grain of the design

79
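A sketch of how this fact table could be declared in SQL (the column types and constraint details are assumptions, not from the slides):

  CREATE TABLE SALES (
    time_key      INT REFERENCES TIME(time_key),
    product_key   INT REFERENCES PRODUCT(product_key),
    location_key  INT REFERENCES LOCATION(location_key),
    units         INT,            -- measure
    amount        DECIMAL(10,2)   -- measure
  );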
Dimension Tables

PRODUCT dimension table (keys uniquely identify each product; the category column encodes the product → category hierarchy):
  product_key  product_name  category  brand    color   supplier_name
  P1           i7-8700K      CPU       Intel    black   Jim
  P2           i5-2400       CPU       Intel    black   Jim
  P3           Samsung 830   SSD       Samsung  brown   Ben
  P4           Barracuda     HDD       Seagate  silver  Ben
  P5           MQ01ABD032    HDD       Toshiba  silver  John

• Dimension Tables contain
  – a key column linked to a foreign key in the fact table
  – textual descriptors such as names of products, addresses, etc.
  – attributes that encode dependencies within the dimension (e.g. hierarchies)
• Dimension tables may be wide
• Dimension tables are usually shallow (e.g. a few thousand rows)
80
Advantages of Star Schema
• A single fact table where to look for facts to
analyze
• One table for each dimension
– dimensions are clearly depicted in the schema
• Easy to comprehend (and write queries)

• Loading of data
– dimension tables are relatively static
– data is loaded (append mostly) into fact table(s)
– new indexing opportunities

82
Querying the Star Schema

“Find total sales per product-category in our stores in Europe”

TIME PRODUCT
time_key
product_key
day
product_name
day_of_the_week SALES category
month
time_key brand
quarter
color
year product_key supplier_name
location_key
LOCATION
units_sold location_key
amount store
street_address
city
state
country
region

83
Querying the Star Schema
“Find total sales per product-category in our stores in Europe”

SELECT PRODUCT.category, SUM(SALES.amount)


FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category

Join fact table SALES with dimension tables PRODUCT, LOCATION to


fetch required attributes (category & region in this example)
84
Star Schema Query Processing
TIME PRODUCT
πproduct_key,category
time_key
product_key
day
product_name
day_of_the_week SALES category
month
time_key brand
quarter
color
year product_key supplier_name
location_key LOCATION
units_sold location_key
measures { amount
store
street_address
city
state
country
σregion=“Europe” region
Another Example
Product
Order ProdNo
OrderNo ProdName
OrderType ProdDescr
OrderNotes Fact table Category
Customer OrderNo CategoryDescr
SalespersonID UnitPrice
CustomerNo
CustomerNo QOH
CustomerName Date
CustomerAddress ProdNo
DateKey DateKey
City
CityName Date
Quantity Month
Salesperson
TotalPrice Year
SalespersonID
SalespersonName City
City CityName
Quota State
Country
86
Fact constellation
Supplier
• Multiple fact tables that
share common
Delivery
dimension tables
– Example: Delivery and Time Product
Sales fact tables share
dimension tables Time & Sales
Product
Store Customer

87
Snowflake Schema: represents dimensional
hierarchy by normalization
Product
Order Category
ProdNo
OrderNo CategoryID
ProdName
OrderType CategoryDescr
ProdDescr
OrderNotes Fact table
CategoryID
Customer OrderNo UnitPrice
CustomerNo SalespersonID QOH
CustomerName CustomerNo
Date Month
CustomerAddress DateKey Year
CityName DateKey Month
City Date
ProdNo Year Year
Quantity Week
Salesperson Week
TotalPrice Month
SalespersonID Year
SalespesonName
City City State
Quota CityName StateName
State Country
88
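As a sketch, computing total sales per category on this snowflake schema takes one extra join compared to the star schema, because Category is normalized out of Product (the fact table name FACT is an assumption here, since the slide labels it simply “Fact table”):

  SELECT Category.CategoryDescr, SUM(FACT.TotalPrice)
  FROM FACT, Product, Category
  WHERE FACT.ProdNo = Product.ProdNo
    AND Product.CategoryID = Category.CategoryID
  GROUP BY Category.CategoryDescr;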
Multidimensional Modeling Stages
(adapted from https://www.kimballgroup.com/)
Gather Business Requirements
and Data Realities

Determine the grain of the


data

Choose your dimensions

Choose your facts


Gather Business Requirements and
Data Realities
• Study the underlying business processes
– Understand their objectives based on key
performance indicators (KPIs), compelling
business issues, decision-making processes, and
supporting analytic need
• Identify available data sources (internal and
external)
– Assess their quality and completeness
Grain
• Establishes exactly what a single fact table row
represents
– Different grains must not be mixed in the same
fact table
• Atomic grain refers to the lowest level at
which data is captured by a given business
process
– Safer to start with the atomic grain in order to
cope with unpredictable query workload
Identify the dimensions
• Dimensions provide the “who, what, where,
when, why, and how” context surrounding a
business process event.

• Dimension tables contain descriptive


attributes used by BI applications for filtering
and grouping the facts.
Identify the facts
• A single fact table row has a one-to-one
relationship to a measurement event as
described by the fact table’s grain.

• Facts contain measurements that result from a


business process event.

• Within a fact table, only facts consistent with the


declared grain are allowed.
Indexing Techniques
• Exploiting indexes to reduce scanning of data
is of crucial importance
• ROLAP
– Bitmap Indexes
– Join Indexes
• MOLAP
– Array representation

94
Bitmap Index Example
Base Table Region Index
Cust Region Rating RowID N S E W
C1 N H 1 1 0 0 0
C2 S M 2 0 1 0 0
C3 W L 3 0 0 0 1
C4 W H 4 0 0 0 1
C5 S L 5 0 1 0 0
C6 W L 6 0 0 0 1
C7 N H 7 1 0 0 0

95
Bitmap Index Example
Base Table Region Index
Cust Region Rating RowID N S E W
C1 N H 1 1 0 0 0
C2 S M 2 0 1 0 0
C3 W L 3 0 0 0 1
C4 W H 4 0 0 0 1
C5 S L 5 0 1 0 0
C6 W L 6 0 0 0 1
C7 N H 7 1 0 0 0

Bitmap encodes position of


customer records in the
base table (rows 1,7) that
reside in the North Region
96
Bitmap Index Example
Base Table Region Index Rating Index
Cust Region Rating RowID N S E W RowID H M L
C1 N H 1 1 0 0 0 1 1 0 0
C2 S M 2 0 1 0 0 2 0 1 0
C3 W L 3 0 0 0 1 3 0 0 1
C4 W H 4 0 0 0 1 4 1 0 0
C5 S L 5 0 1 0 0 5 0 0 1
C6 W L 6 0 0 0 1 6 0 0 1
C7 N H 7 1 0 0 0 7 1 0 0

97
Bitmap Index Example
Base Table Region Index Rating Index
Cust Region Rating RowID N S E W RowID H M L
C1 N H 1 1 0 0 0 1 1 0 0
C2 S M 2 0 1 0 0 2 0 1 0
C3 W L 3 0 0 0 1 3 0 0 1
C4 W H 4 0 0 0 1 4 1 0 0
C5 S L 5 0 1 0 0 5 0 0 1
C6 W L 6 0 0 0 1 6 0 0 1
C7 N H 7 1 0 0 0 7 1 0 0

Customers where Region = W and Rating = L

0011010 AND 0010110=0010010 (rows 3,6)


98
Bit Map Index Example 2
Base Table Region Index
Cust Region Rating RowID N S E W
C1 N H 1 1 0 0 0
C2 S M 2 0 1 0 0
C3 W L 3 0 0 0 1
C4 W H 4 0 0 0 1
C5 S L 5 0 1 0 0
C6 W L 6 0 0 0 1
C7 N H 7 1 0 0 0

How many customers in W region?

99
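A sketch of the query this bitmap answers; the engine can evaluate it simply by counting the set bits in the W bitmap, without touching the base table:

  SELECT COUNT(*)
  FROM Customers               -- assumed name of the base table
  WHERE Region = 'W';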
Bitmap Index
• An alternative representation of RID-list
• Comparison, join and aggregation operations are
reduced to bit arithmetic
• Especially advantageous for low-cardinality
domains
– Significant reduction in space and I/O (30:1)
– Have been adapted for higher cardinality domains
– Compression (e.g., run-length encoding) exploited
• Products: Model 204, Redbrick, IQ (Sybase),
Oracle, etc

100
Join Index
• Traditional index maps the value in a column
to a list of rows with that value
• Join index maintains relationships between
attribute value of a dimension and the
matching rows in the fact table
• Join index may span multiple dimensions
(composite join index)

101
Example: Join Indexes
• “Combine” SALE, PRODUCT relations

sale prodId storeId date amt product id name price


p1 c1 1 12 p1 bolt 10
p2 c1 1 11 p2 nut 5
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

joinTb prodId name price storeId date amt


p1 bolt 10 c1 1 12
p2 nut 5 c1 1 11
p1 bolt 10 c3 1 50
p2 nut 5 c2 1 8
p1 bolt 10 c1 2 44
p1 bolt 10 c2 2 4
102
Join Indexes
join index
product id name price jIndex
p1 bolt 10 r1,r3,r5,r6
p2 nut 5 r2,r4

sale rId prodId storeId date amt


r1 p1 c1 1 12
r2 p2 c1 1 11
r3 p1 c3 1 50
r4 p2 c2 1 8
r5 p1 c1 2 44
r6 p1 c2 2 4

103
Example: Compute total sales in
AFRICA
TIME PRODUCT
time_key
product_key
day
product_name
day_of_the_week SALES category
month
time_key brand
quarter
color
year product_key supplier_name
location_key LOCATION
units_sold location_key
store
amount
street_address
city
SELECT SUM(sales.amount)
state
FROM sales, location
country
WHERE sales.location_key=location.location_key
region
AND location.region=“AFRICA”
Join-Index in the Star Schema
• A join index relates the values of the dimensions of a star schema to rows in the fact table.
  – e.g., a join index on region maintains, for each distinct region (Africa, America, Asia, Europe, …), a list of ROW-IDs of the SALES tuples recording sales in that region (in the figure, one region's list points to rows R102, R117, R118, R124)
• Join indices can be implemented as bitmap indexes (next slides)

105
Join Index on Location.Region
implemented as bitmap index

Fact Table Sales Bitmaps for Location.Region


time_key product_key location_key units amount Africa Asia Europe America
T1 P44 L4 1 12 0 0 0 1

T2 P157 L4 3 180 0 0 0 1

T2 P6 L1 14 2560 1 0 0 0

T3 P25 L3 1 2 0 0 1 0

T3 P157 L1 1 60 1 0 0 0

Assuming L1 refers to a store location in Africa, L2 to a store location in Asia etc


This information is stored in the dimension table Location
In SQL
• Join index implemented as bitmap index:
CREATE BITMAP INDEX loc_sales_bit
ON sales(location.region)
FROM sales, location
WHERE sales.location_key = location.location_key;
• The following query uses the index to avoid
computing the join
SELECT SUM(sales.amount)
FROM sales,location
WHERE sales.location_key=location.location_key
AND location.region=“AFRICA”

107
THE DATA CUBE
Aggregation
(on a single group via filtering)
• Sum up amounts for day 1
• In SQL: SELECT sum(amt)
FROM SALE
WHERE day = 1

Assume following fact table:


sale prodId storeId day amt
p1 s1 1 12
p2 s1 1 11
p1 s3 1 50
p2 s2 1 8
81
p1 s1 2 44
p1 s2 2 4

109
Group by & Aggregation
• Sum up amounts by day

SELECT day, sum(amt) FROM SALE


GROUP BY day

sale prodId storeId day amt


p1 s1 1 12
p2 s1 1 11 ans day sum
p1 s3 1 50 1 81
p2 s2 1 8 2 48
p1 s1 2 44
p1 s2 2 4

110
Common operations
• Sum up amounts by day, product
• In SQL: SELECT prodid,day,sum(amt) FROM SALE
GROUP BY prodId, day
sale prodId storeId day amt
p1 c1 1 12 sale prodId day amt
p2 c1 1 11
p1 1 62
p1 c3 1 50
p2 1 19
p2 c2 1 8
p1 c1 2 44 p1 2 48
p1 c2 2 4

rollup

drill-down

111
Recall: Star Schema Example 1
TIME PRODUCT
time_key
product_key
day
product_name
day_of_the_week SALES category
month
time_key brand
quarter
color
year product_key supplier_name
location_key LOCATION
units_sold location_key
{ amount
store
street_address
city
state
country
region
Compute volume of sales
per product_key and store
Store Product_key sum(amount)
1 1 454
Product_key 1 4 925
Sales
1 2 3 4 ALL 2 1 468
2 2 800
1 454 - - 925 1379
3 1 296
2 468 800 - - 1268
Store

3 3 240
3 296 - 240 - 536 4 1 652
4 3 540
4 652 - 540 745 1937
4 4 745

SQL: SELECT LOCATION.store, SALES.product_key, SUM (amount)


FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
GROUP BY SALES.product_key, LOCATION.store

114
Multiple Simultaneous Aggregates

Cross-Tabulation (products/store)
How many queries
to obtain this result?
Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
Sub-totals per store
Store

3 296 - 240 - 536


4 652 - 540 745 1937
ALL 1870 800 780 1670 5120

Total sales
Sub-totals per product_key

115
Multiple Simultaneous Aggregates

Cross-Tabulation (products/store)

Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379 Aggregate sales*
group by (store,product_key)
2 468 800 - - 1268
Store

*See SQL query in previous slides


3 296 - 240 - 536
4 652 - 540 745 1937
ALL 1870 800 780 1670 5120
116
Multiple Simultaneous Aggregates

Cross-Tabulation (products/store)

Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379 Aggregate sales
group by (store)
2 468 800 - - 1268
Store

3 296 - 240 - 536 SQL: SELECT LOCATION.store, SUM (amount)

4 652 - 540 745 1937 FROM SALES, LOCATION


WHERE SALES.location_key=LOCATION.location_key
ALL 1870 800 780 1670 5120 GROUP BY LOCATION.store

117
Multiple Simultaneous Aggregates

Cross-Tabulation (products/store)
Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
Store

3 296 - 240 - 536


4 652 - 540 745 1937
SQL: SELECT SALES.product_key, SUM (amount)
ALL 1870 800 780 1670 5120 FROM SALES
GROUP BY SALES.product_key

Aggregate sales
group by (product_key) 118
Total sales: group by “none”
Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
Store

3 296 - 240 - 536


4 652 - 540 745 1937
ALL 1870 800 780 1670 5120 Total sales

SQL: SELECT SUM (amount)


FROM SALES Notice lack of group by clause

119
Multiple Simultaneous Aggregates

4 Group-bys here:
Cross-Tabulation (products/store) (store,product_key)
(store)
Product_key (product_key)
Sales ()
1 2 3 4 ALL Need to write 4 queries!!!
1 454 - - 925 1379
2 468 800 - - 1268
Sub-totals per store
Store

3 296 - 240 - 536


4 652 - 540 745 1937
ALL 1870 800 780 1670 5120
Total sales
Sub-totals per product_key 120
Multiple Simultaneous Aggregates:
Optimizations?
4 Group-bys here:
Cross-Tabulation (products/store) (store,product_key)
(store)
(product_key)
()
Product_key
Sales
1 2 3 4 ALL Fact Table (raw data)
1 454 - - 925 1379
2 468 800 - - 1268 product_key, store
Store

3 296 - 240 - 536


product_key store
4 652 - 540 745 1937
none
ALL 1870 800 780 1670 5120
121
The Data Cube Operator
(Gray et al)

• All previous aggregates in a single query:


SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
GROUP BY CUBE (SALES.product_key, LOCATION.store)

Challenge: Optimize Cube Computation

122
Relational View of Data Cube
Store Product_key sum(amount)
Product 1 1 454
Sales 1 4 925
1 2 3 4 ALL 2 1 468
1 454 - - 925 1379 2 2 800
2 468 800 - - 1268 3 1 296
Store

3 3 240
3 296 - 240 - 536 4 1 652
4 652 - 540 745 1937 4 3 540
ALL 1870 800 780 1670 5120 4 4 745
1 ALL 1379
2 ALL 1268
SELECT LOCATION.store, SALES.product_key, SUM (amount) 3 ALL 536
FROM SALES, LOCATION 4 ALL 1937
WHERE SALES.location_key=LOCATION.location_key ALL 1 1870
GROUP BY CUBE (SALES.product_key, LOCATION.store)
ALL 2 800
ALL 3 780
ALL 4 1670
ALL ALL 5120
123
Relational View of Data Cube
Store Product_key sum(amount)
1 1 454
1 4 925
2 1 468
2 2 800
group by(store, product_key) 3 1 296
3 3 240
4 1 652
4 3 540
4 4 745
1 ALL 1379
group by(store) 2 ALL 1268
3 ALL 536
4 ALL 1937
ALL 1 1870
ALL 2 800
group by(product_key)
ALL 3 780
ALL 4 1670
group by() ALL ALL 5120
124
Quiz
• SALES(customer,sales_person,store,product,amt)
• Assume the SUM() aggregate function
• What is the meaning of the following data cube
records?
(ALL,’JOHN’,ALL,ALL,5000)
(‘NICK’,ALL,ALL,’BEER’,250)
(ALL,ALL,ALL,’MILK’,70000)
(ALL,ALL,ALL,ALL,250000)

125
Group by (Product, Quarter, Region)
SUM() aggregate function

Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD
PC America
VCR

Region
Europe

1.2M Asia

Total sales of VCRs in


the 4th Qtr in Europe

126
Group by (Product, Quarter, Region)
Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD
PC America
VCR

Region
Europe

Asia
4M

Total sales of PCs in


the 4th Qtr in Asia

127
Group by (Product, Quarter, Region)
Total sales of DVDs in the
1st Qtr in America
Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD 2.2M
PC America
VCR

Region
Europe

Asia

128
Data Cube: Multidimensional View
Total annual sales
Quarter of DVDs in America
1Qtr 2Qtr 3Qtr 4Qtr ALL
DVD
PC America
VCR
ALL
Europe

Region
Asia

ALL

All, All, All 131


How are aggregates computed?
1. Bring all records with same values in the
groupping attributes together
2. Aggregate their measures

• (1) is done via Hashing / Sorting


• (2) depends on the type of function used
– Simple calculations for max, sum, count etc
– Harder for median
132
Example: Sum sales/prodId ?
Raw data (fact table)

sale prodId storeId date amt


p1 s1 1 12
p2 s1 1 11
p1 s3 1 50
p2 s2 1 8
p1 s1 2 44
p1 s2 2 4

133
Step 1: Sort tuples by prodId
Raw data (fact table)
sale prodId storeId date amt
p1 s1 1 12
p2 s1 1 11
p1 s3 1 50
p2 s2 1 8
p1 s1 2 44
p1 s2 2 4

sale prodId storeId date amt


Sort(prodId) p1 s1 1 12
p1 s1 2 44
p1 s2 2 4
p1 s3 1 50
p2 s1 1 11
p2 s2 1 8

134
Step 2: Aggregate records (sum amt)
Sorted Raw data
sale prodId storeId date amt
p1 s1 1 12
p1 s1 2 44 Sales for prodId=1
p1 s2 2 4
p1 s3 1 50
p2 s1 1 11
p2 s2 1 8

Aggregate ans prodId sum


p1 110
p2 19

135
More on aggregate
• Assumed SUM() function
• How much space needed? sale prodId storeId date amt
p1 s1 1 12
• How about AVG()? p1 s1 2 44
p1 s2 2 4
• How about MEDIAN()? p1 s3 1 50
p2 s1 1 11
p2 s2 1 8

136
Aggregate Computation
• Certain functions
(SUM,MIN,MAX,COUNT,AVERAGE, etc) require
small (bounded) space for storing their state
and may be computed on the fly, while
executing the merging phase of the 2-phase
sort algorithm.

• Cost = 3 * B(R), assuming M² ≥ B(R) > M

137
Hashing

[Diagram: a record with a given key is routed by the hash function, key → h(key), to one of the buckets; each bucket is typically one disk block]

138
Example: 2 records/bucket

INSERT:  h(a) = 1,  h(b) = 2,  h(c) = 1,  h(d) = 0,  h(e) = 1

Bucket 0:  d
Bucket 1:  a, c  (e also hashes here and overflows the 2-record bucket)
Bucket 2:  b
Bucket 3:  (empty)

139
How does this work for aggregates?
Hash on prodId Possibly keep
sale prodId storeId date amt
p1 s1 1 12 records sorted
p1 s3 1 50
p1 s1 2 44
within bucket
p1 s2 2 4
p3 s5 1 7

h(prodId) = p7 s2 2 1 Two buckets


prodId mod 2
sale prodId storeId date amt
p2 s1 1 11
p2 s2 1 8
Not the best hash function one
could use

140
Naïve Data Cube Computation
• Fact table:
sale prodId storeId amt
p1 s1 12
p2 s1 11
p1 s3 50
p2 s2 8
p1 s1 44
p1 s2 4

• Compute: SUM(amt) GROUP BY prodId,storeId WITH


CUBE
– 4 group bys contained in this Data Cube:
GROUP BY prodId,storeId      GROUP BY prodId     GROUP BY storeId    GROUP BY ()
prodId  storeId  sum(amt)    prodId  sum(amt)    storeId  sum(amt)   sum(amt)
p1      s1       56          p1      110         s1       67         129
p1      s2        4          p2       19         s2       12
p1      s3       50                              s3       50
p2      s1       11
p2      s2        8
141
Full Data Cube
(from previous example)
prodId storeId sum(amt)
p1 s1 56
p1 s2 4
p1 s3 50
p2 s1 11
p2 s2 8
p1 ALL 110
p2 ALL 19
ALL s1 67
ALL s2 12
ALL s3 50
ALL ALL 129

142
How much does it cost to compute?
• Assume B(SALES)=1 Million Blocks, larger than
available memory
• Our (brute force) strategy: compute each
group by independently
– Compute GROUP BY prodId,storeId
– Compute GROUP BY prodId
– Compute GROUP BY storeId
– Compute GROUP BY none (=total amt)

143
First Group By: prodId,storeId
• In SQL

SELECT prodId,storeId,sum(amt)
FROM SALES
GROUP BY prodId,storeId

• Use sorting: 3*B(SALES) = 3M I/O

144
Second Group By: prodId
• In SQL

SELECT prodId,sum(amt)
FROM SALES
GROUP BY prodId

• Use sorting: 3*B(SALES) = 3M I/O (same)

145
Third Group By: storeId
• In SQL

SELECT storeId,sum(amt)
FROM SALES
GROUP BY storeId

• Use sorting: 3*B(SALES) = 3M I/O (same)

146
Group By (none) = sum(amt)
• SQL:

SELECT sum(amt)
FROM SALES

• Cost ?

147
Recap
• Group By prodId,storeId : 3M I/Os
• Group By prodId : 3M I/Os
• Group By storeId : 3M I/Os
• Group By none : 1M I/Os
– Compute aggregate function over all records, no
sorting necessary

• Total Cost for the Data Cube: 10M I/Os


– Is this a lot?

148
Practice Problem
• Rotation speed 7200rpm
• 128 sectors/track
• 4096 bytes/sector
• 4 sectors/block (16KB page size)
• Sequential I/O: ignore SEEKTIME, gaps, etc

149
Sustained disk speed

• 1 full rotation
– takes 60/7200=8.33ms
– retrieves 1 track = 128 sectors = 32 pages (blocks)
• 10 Million blocks in
8.33/1000 * 10M/32 = 43.5 minutes

• Can we do better?

150
Share sort orders

If sorted on (prodId,storeId) Then, also sorted on (prodId)


prodId storeId date amt prodId storeId date amt
p1 s1 1 12 p1 s1 1 12
p1 s1 2 44 p1 s1 2 44
p1 s2 2 4 p1 s2 2 4
p1 s3 1 50 p1 s3 1 50
p2 s1 1 11 p2 s1 1 11
p2 s2 1 8 p2 s2 1 8

Thus, no need to sort SALES twice!


Two group-bys with a single sort on (prodId, storeId)
Output of 2-phase sort algorithm
Maintain 2 variables output
(one row at a time)
prodId storeId date amt SUM1 SUM2
p1 s1 1 12 12 12
p1 s1 2 44 + 56 56
p1 s2 2 4 4 60 p1,s1,56
p1 s3 1 50 50 110 p1,s2,4
p2 s1 1 11 11 11 p1,s3,50 p1,110
p2 s2 1 8 8 19 p2,s1,11
p2,s2,8 p2,19
EOT (End-Of-Table)

- SUM1 is used for group-by(prodId,storeId), SUM2 for group-by(prodId)


-Each time we see a new (prodId,storeId) combination we report the previous
pair and SUM1 value and initialize SUM1 to the new amt
- Similar logic for SUM2
- Report last combination at EOT 152
Share sort orders for multiple group bys
• Sort SALES on prodId,storeId
– At the merging phase compute prodId storeId date amt
both group by prodId and p1 s1 1 12
prodId,storeId p1 s1 2 44
p1 s2 2 4
– Also compute group by none p1 s3 1 50
• Then compute group by storeId p2 s1 1 11
by sorting SALES on storeId p2 s2 1 8

• Cost = 3B(SALES) + 3B(SALES) =


6M I/Os
– Compared to 10M I/Os
– 40% savings

153
Can we do better?
• Sort SALES on prodId,storeId
  – At the merging phase compute both group by (prodId,storeId) and group by (prodId)
  – Also compute group by none at the same time
• Compute group by (storeId) by sorting the result of group by (prodId,storeId) on storeId
  – Notice that by construction B(gb(prodId,storeId)) ≤ B(SALES)
    • Each tuple in gb(prodId,storeId) is produced by one or more tuples in SALES

gb(prodId,storeId)                        gb(storeId)
prodId  storeId  sum(amt)                 storeId  sum(amt)
p1      s1       56          sort on      s1       67
p1      s2        4          storeId →    s2       12
p1      s3       50                       s3       50
p2      s1       11
p2      s2        8

Cost = 3*B(SALES) + 3*B(gb(prodId,storeId))

154
3D Data Cube Lattice
• Model dependencies among the aggregates
(independently of the method of computation, e.g.
by sorting or otherwise)
most detailed “group by”

product,store,quarter

can be computed from grouby


product,quarter store,quarter product, store
(product,store,quarter) by
summing-up all quarterly sales
quarter product store
gb(product,store) is equivalent
none to gb(store,product)

155
Discussed optimization (sharing sort
orders) on the 3D Data Cube
• Sort SALES on product,store,quarter (also get
gb product,store, gb product and gb none)
• Sort SALES on product,quarter
• Sort SALES on store,quarter (also get gb store)
• Sort SALES on quarter product,store,quarter

product,quarter store,quarter product, store


Cost of new plan
4*3M=12M I/Os
quarter product store
(45% savings)
none

156
Compute from “smallest parent”
vs
“sharing sort orders”
• Consider computation of gb product, quarter
• Previously: Sort SALES on product,quarter
• Alternative: read and sort previously computed gb
product,store,quarter
– This gb will be smaller than SALES
• It may even fit in memory (one-pass sort)
– This gb is partially sorted (common prefix) product,store,quarter

product,quarter store,quarter product, store

quarter product store

none

157
ESTIMATING THE DATA CUBE SIZE
How many group bys in the Data Cube?

• Data Cube Lattice


– N-dimensional data, no hierarchies
– 2^N group bys
(The notation is agnostic to the order of dimensions.)

product,store,quarter

product,quarter store,quarter product, store

quarter product store

none
159
2D Data Cube lattice

• 2-dimensional data (product, store)


22 =4 group bys

product, store

product store

none

160
Let’s add a simple hierarchy
• Assume that products are organized into
categories
• Utilizing this information becomes possible
when we aggregate the sales data.
– Aggregate sales per category
– Aggregate sales per category and store
– But it does not make sense to aggregate sales per
product and category (WHY?)
Compare these two results
Group by (product,category) Group by (product)
product category sum(amt) product sum(amt) sum(amt)
p1 cat1 110 p1 110 56
p2 cat1 19 p2 19 4
p3 cat3 240 p3 240 50
p4 cat2 255 p4 255 11
p5 cat1 75 p5 75 8

Notice that there is no difference in the


computed aggregates, since prodId→category
2D Data Cube lattice with simple hierarchy

product, store category, store

product store category

none

163
2D Data Cube lattice with 2 separate
hierarchies on the product dimension
store, brand, category

product, store category, store brand, store brand, category

product store category brand

none

Notice omission of group by on (product,store,brand,category) ….

165
#of group bys when there is a single
hierarchy per dimension
• N dimensions
• Dimension d_i has a hierarchy of length L_i
• Location: store → city → country, so L_Location = 3
  – If no hierarchy, then L_i = 1
• Number of group bys = (1 + L_1)(1 + L_2) … (1 + L_N)
– No need to memorize this formula! Seek to
understand its derivation instead (next slide)

166
How is the formula derived
• Consider Location dimension with hierarchy
– store → city → country (i.e. L_Location = 3)
• In a group by (aggregate) query I may
– Not consider location at all (e.g. total sales per product)
• Another way to think about this is that +1 stands for ALL
– Consider location information at the store-level
• (e.g. total sales per customer, store)
– Consider location information at the city-level
• (e.g. total sales per product, city)
– Consider location information at the country-level
• (e.g. total sales per sales_person, country)
• There are (1+3) choices regarding that dimension independently on
what other dimensions I select in a gb
– Thus, there are (1 + L_1)(1 + L_2) … (1 + L_N) possible combinations of dimensions in a query

167
Example
• 8 dimensions (typical)
• 3-level hierarchy/dimension
• Number of group bys = (1+3)^8 = 4^8 = 65,536 group bys!
• BUT, how many tuples in the cube?
– Depends on data distribution
– Worst case is uniform
[Figure: a sparse customer × product grid]
168
Upper bound on the size of each group by

• Assume relation R (fact table) has T(R) tuples


• Each dimension has cardinality t_i
• The size of group by (d_1, d_2, …, d_k) is upper bounded by both
  – t_1 * t_2 * … * t_k (the size of the subspace defined by these dimensions)
– T(R) (since records in the group by are produced
by combination of attribute values that appear in
existing facts)

169
Example gb(customer,product)
• Assume I have 1000 customers and 50 products
• Assume uniform distribution (customers buy
products with same probability)
– There can be 1000 x 50 combinations of pairs
(customer, product) in the fact table (sales)
– Thus, 50000 records in gb(customer,product) (at most)
• Each record in this gb is derived from a real sale
– There can not be an aggregated record if there are not
base records in the fact table to support it
• Thus, there can not be more records in the gb
than the number of actual sales in the fact table
170
Example
• Consider R(product,store,quarter,amt) with 1M records
• 10,000 products, 30 stores, 4 quarters
• Let G(x,y) denote the maximum number of records in group
by x,y
– G(product,store,quarter)=min(1M,10000*30*4)=1,000,000
– G(product,store)=min(1M,10000*30)=300,000
– G(product,quarter)=min(1M,10000*4)=40,000
– G(store,quarter)=min(1M,30*4)=120
– G(product)=min(1M,10000)=10,000
– G(store)=min(1M,30)=30
– G(quarter)=min(1M,4)=4
– G(none)=1

– Maximum cube size = 1,350,155 records

171
Quick and Dirty Upper Bound

MAX-SIZE <= 10001 * 31 * 5 = 1,550,155

i.e. (1 + t_1) * (1 + t_2) * (1 + t_3)

(compare with 1,350,155)

This upper bound ignores size of fact table

172
Data Cube: Multidimensional View
Total annual sales
Quarter of DVDs in America
1Qtr 2Qtr 3Qtr 4Qtr ALL
DVD
PC America
VCR
ALL
Europe

Region
Asia

ALL

All, All, All 173


Extended Cube with Hierarchies
• Products are organized in 50 categories
• Additional group bys in extended cube
– +G(category,store,quarter)=min(1M,50*30*4)=6,000
– +G(category,store)=min(1M,50*30)=1,500
– +G(category,quarter)=min(1M,50*4)=200
– +G(category)=min(1M,50)=50

– Maximum ext-cube size = 1,357,905 records

174
Correlated Attributes
• In practice there is some correlation between
different dimensions
• Example 1: each store sells up to 1,000
products (specialized stores)
• Example 2: some products are not sold
through-out the year
– Ice cream, watermelon, snow-chains

175
Solve Example-1
• R(product,store,customer) with 1M records
• 1,000 products, 20 stores, 100 customers
• Each customer buys from only one store
(closest)
– Functional Dependency: customer → store

G(store,customer)=min(1M,1*100)=100

G(product,store,customer)=min(1M,1000*1*100)
=100,000

176
More realistic example
• 100,000 parts
• 20,000 customers
• 2,000 suppliers
• 5 years (=365 *5 days)
• 100 stores
• 1,000 sales persons

• Max-cube size = 738,855,253,876,896,582,426


(tuples)
177
Catch With Data Cube
• …. toooooo many aggregates
• So Data Cube is large!
– And takes time to compute…

178
What to Materialize?
• Data Cube extremely large for many
applications
• Store in warehouse results useful for
common queries
• Example:
– Total sales per product, store
– Max sales per product
– Avg sales per store,day
–…
179
Materialization Factors
• Type/Frequency of Queries
– Examine frequently occurring query patterns
• Query Response Time
– Aim to expedite long-running queries
• Storage cost
– Evaluate disk space required to hold materialized
results
• Update cost
– Materialized results need to be refreshed following
star schema updates
180
MATERIALIZED VIEWS
Preliminaries
• We will consider solutions that
selectively materialize certain groups by
in the Data Cube
– We will be referring to the group bys as
“views”
– When a group by is materialized we will call
it a “materialized view”

182
Views in OLTP databases
Employee(ename, age, dept, address, telno, salary)
• Views are derived tables
– Instance of view is generated on demand by
executing the view query:
create view V as
select ename,age, address,telno
from employee
where employee.dept = “Sales”
• Views have many uses
– Shortcuts for complex queries
– Logical-physical independence
– Hide details from the end-user
– Integration systems

183
Materialized Views (OLAP)
• Sometimes, we may want to compute and store the
content of the view in the database
– Such Views are called materialized
– Queries on the materialized view instance will be much
faster
– Materialized views are now supported by some vendors
• Otherwise, we will be storing their data in regular tables
• This is our extended architecture:
Data Warehouse=
detailed records (star schema) + aggregates (materialized views)

Used to speed up certain queries of interest


Materialized views in OLAP
• Contain derived data
– Can be computed from the star schema
• Populated while updating the data warehouse
– Usually, they contain results of complex aggregate
queries
• Several interesting problems:
– How to select which views to materialize?
– How to compute/refresh these views?
– How to store these views in the relational schema?
– How to use these views at query time?
185
View selection problem
• Set up as an optimization problem
– VDC = set of all group bys (=views) in the Data Cube
– Give a constraint
• Usually space bound B, e.g. materialize up to 100GB from the CUBE
• What else?
– Give an objective
• Minimize cost of answering set of (frequent/interesting) queries Q
• The “View Selection” problem (with space constraint):
minimize over V ⊆ V_DC:  Cost(Q)    such that  Size(V) ≤ B

• Problem is NP-hard

186
View Selection Problem: Heuristic
• Use some notion of benefit per view considering the
interdependencies illustrated in the Data Cube lattice
group by(product,store)
product store sum(amt)
p1 s1 56
product,store,quarter p1 s2 4
p1 s3 50
p2 s1 11
p2 s2 8
product,quarter store,quarter product, store
Regardless of the specific
quarter product store computation method (such as
sorting, hashing, etc.), queries
related to these GROUP BYs
none
can be effectively performed
by leveraging a materialized
view on the grouping
attributes (product, store)
187
A simple greedy algorithm
• Employ a benefit criterion to evaluate and compare the potential advantages of
different views. Select the one with the highest benefit at each step.
• Assume V represents the set of views that have been selected thus far, reflecting
the current state.
– Let v be a candidate view under consideration, which is not currently included in set V.
– Benefit(v) = cost of answering queries using V – cost of answering queries using V U {v}
• Assesses the reduction in the cost associated with answering queries if the candidate view, v, is
materialized
• The utilization of view v may potentially result in a decrease in the cost of certain queries, although it
is also possible that no cost reduction would occur.
• Thus, Benefit(v) ≥ 0
• Simple Greedy algorithm:
– At each iteration, select the view that offers the highest benefit among the available options.
– Re-compute benefits of remaining views
– Update space budget B, set B=B-sizeof(v)
– Remove views that do not fit in new budget B
– Stop if no more space available or no view fits in the remaining space or remaining views
provide no benefit (query cost reduction)

188
Simple Example
• Star schema with three dimensions and one measure
– Product (p), Store location (s), Quarter (q), amount (amt)
– Fact table: SALES(product, store, quarter, amt)
• Assume the following set of queries
– Q = {(p,s),(s,q), (p,q), (p),(s)}
– Notation (1): (s,q) is a query on group by (store,quarter), i.e.
(s,q): SELECT store, quarter, sum(amt)
FROM SALES
GROUP BY store, quarter
– Notation (2): View vstore,quarter is a materialized view containing
the result of the previous query

189
Query computation cost
• For ease of presentation, let us assume that
each query can be computed from the fact
table SALES with the same cost of 100 I/O
(s,q): SELECT store, quarter, sum(amt)
FROM SALES
GROUP BY store, quarter

Cost = 100 I/O

190
Data Cube result size
• Assume each group by in the Data Cube
requires the depicted number of blocks, when
stored as a materialized view
[Figure: Data Cube lattice annotated with materialized-view sizes, in blocks:
(product,store,quarter) = 80; (product,quarter) = 25; (store,quarter) = 13;
(product,store) = 60; (product) = 4; (store) = 3; (quarter) = 1; (none) = 1.
For example, size of vproduct,store = 60 blocks.]
191
Assumption (linear cost model)
• A group by query is computable from an
ancestor materialized view v with Cost=size(v)
Computation of (s,q) from the fact table SALES:
  (s,q): SELECT store, quarter, sum(amt)
         FROM SALES
         GROUP BY store, quarter
  Cost = 100 I/O

Alternative computation of (s,q) from an ancestor view:
  SELECT store, quarter, sum(amt)
  FROM vproduct,store,quarter
  GROUP BY store, quarter
  Cost = 80 I/O (the size of vproduct,store,quarter)

[Figure: the Data Cube lattice with view sizes, as on the previous slide.]
192
View Selection Problem
• Minimize the cost of answering the depicted
queries when available space B=100 blocks
[Figure: the Data Cube lattice with view sizes, in blocks: (product,store,quarter) = 80;
(product,quarter) = 25; (store,quarter) = 13; (product,store) = 60; (product) = 4;
(store) = 3; (quarter) = 1; (none) = 1.]
193
Initial Benefits
(no view is materialized yet, V={})
Group By (Materialized View)   Benefit for Q = {(p,s),(s,q),(p,q),(p),(s)}
p,s,q                          5*(100-80) = 100
p,q                            2*(100-25) = 150
s,q                            2*(100-13) = 174
p,s                            3*(100-60) = 120
p                              100-4 = 96
s                              100-3 = 97
q                              0
none                           0
195
First Iteration
• Materialize view vs,q (highest benefit: 174)
• Update space budget B = 100-13 = 87
• Recompute benefits (next slide)

196
Space = 87
Updated Benefits, V = {vs,q}

Group By (Materialized View)   Benefit for Q = {(p,s),(s,q),(p,q),(p),(s)}
p,s,q                          3*(100-80) = 60
p,q                            2*(100-25) = 150
s,q                            MATERIALIZED
p,s                            2*(100-60) = 80   (careful: (s) is already cheaper via vs,q)
p                              100-4 = 96
s                              13-3 = 10   (careful: (s) now costs 13 via vs,q, not 100)
q                              0
none                           0
197
Second Iteration
• Materialize view vp,q (highest benefit: 150)
– V={vs,q,vp,q}
• Update space budget B = 87-25 = 62
• Update benefits (next slide)

198
Space = 62
Updated Benefits, V = {vs,q,vp,q}

Group By (Materialized View)   Benefit for Q = {(p,s),(s,q),(p,q),(p),(s)}
p,s,q                          not enough space left (size 80 > 62)
p,q                            MATERIALIZED
s,q                            MATERIALIZED
p,s                            100-60 = 40   (careful: only (p,s) benefits; (p) is cheaper via vp,q)
p                              25-4 = 21   (careful: (p) now costs 25 via vp,q)
s                              13-3 = 10   (careful: (s) now costs 13 via vs,q)
q                              0
none                           0
199
Third Iteration
• Materialize view vp,s (highest benefit: 40)
• V={vs,q,vp,q,vp,s}
• Update space budget B = 62-60 = 2
• Update benefits

200
Space = 2
Updated Benefits, V = {vs,q,vp,q,vp,s}

Group By (Materialized View)   Benefit for Q = {(p,s),(s,q),(p,q),(p),(s)}
p,s,q                          not enough space left (size 80 > 2)
p,q                            MATERIALIZED
s,q                            MATERIALIZED
p,s                            MATERIALIZED
p                              not enough space left (size 4 > 2)
s                              not enough space left (size 3 > 2)
q                              0
none                           0
201
Greedy algorithm selection
• Final choice V={vs,q,vp,q,vp,s}
– Uses 25+13+60 = 98 blocks out of the 100 available

[Figure: the Data Cube lattice with view sizes, as on the previous slides.]
202
Considerations
• To account for the varying sizes of views, it is advisable to select
views based on their amortized benefit.
– amortizedBenefit(v) = (cost of answering queries using V – cost of
answering queries using V U {v}) / size(v)
• Or dynamically materialize views while answering user queries!
– DynaMat: A Dynamic View Management System for Data Warehouses.
Y. Kotidis, N. Roussopoulos. In Proceedings of ACM SIGMOD
International Conference on Management of Data, pages 371-382,
Philadelphia, Pennsylvania, June 1999.
– Smart-Views: Decentralized OLAP View Management using
Blockchains.
K. Messanakis, P. Demetrakopoulos, Y. Kotidis. In Proceedings of the
23rd International Conference on Big Data Analytics and Knowledge
Discovery (DaWaK 2021), September 27-30, Linz, Austria, 2021.
Query costs for this selection of
materialized views
• Q = {(p,s),(s,q),(p,q),(p),(s)}
  – Cost(p,s) = 60   (from vp,s)
  – Cost(s,q) = 13   (from vs,q)
  – Cost(p,q) = 25   (from vp,q)
  – Cost(p)   = 25   (roll-up from vp,q)
  – Cost(s)   = ?

[Figure: the Data Cube lattice with view sizes, as before.]
204
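A sketch of how the open question above, query (s), can be answered by rolling up the materialized view on (store, quarter); the table name v_store_quarter(store, quarter, sum_amt) is assumed:

-- reads only the blocks of the small view, not the fact table
select store, sum(sum_amt) as sum_amt
from v_store_quarter
group by store;

Since sum() is distributive, the per-store totals obtained this way are identical to those computed from the fact table, at the cost of scanning only the view.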
Benefit of using Materialized Views
(following the assumptions of this running example)
Q = {(p,s),(s,q), (p,q), (p),(s)}

Using the suggested Materialized Views     Querying the Fact Table
Cost(p,s) = 60                             Cost(p,s) = 100
Cost(s,q) = 13                             Cost(s,q) = 100
Cost(p,q) = 25                             Cost(p,q) = 100
Cost(p)   = 25                             Cost(p)   = 100
Cost(s)   = 13                             Cost(s)   = 100
Total Query Cost = 136                     Total Query Cost = 500

73% savings!!!
205
View Update?

• Assume all views are materialized


• What updates are triggered by a single new
fact (sale) : (p1,s3,q4,100) ?
[Figure: the Data Cube lattice over (product, store, quarter), showing every group-by
from (product,store,quarter) down to (none).]
208
The View Update problem
Materialized View Vsc:                 New sales (table Delta: records to be
                                       appended to the fact table):

  Store  Customer  Price                 Store  Customer  Product  Price
  S1     C2        $700                  S1     C2        P2       $55
  S1     C3        $240                  S1     C2        P3       $15
  S2     C1        $190                  S1     C1        P1       $50
  S2     C3        $450                  S2     C1        P3       $20

How to update this view?
209
Choice 1:
Re-compute from fact table
• First update fact table (append new facts)
• Then re-execute SQL query to obtain view
In SQL:

-- load the new records
insert into Fact select * from Delta;

-- drop and recreate the view table
drop table Vsc;
create table Vsc(store, customer, price);

-- recompute the view from scratch
insert into Vsc
select store, customer, sum(price)
from Fact
group by store, customer;
210
Choice-2: Incremental Updates
• Adding delta tuples means
– Step 1: Update sum() from combinations already in the view
– Step 2: Insert sum() with new coordinates for rest
Materialized View Vsc:                 Delta records:

  Store  Customer  Price                 Store  Customer  Product  Price
  S1     C2        $700                  S1     C2        P2       $55
  S1     C3        $240                  S1     C2        P3       $15
  S2     C1        $190                  S1     C1        P1       $50
  S2     C3        $450                  S2     C1        P3       $20
211
Step 1: Increment existing combinations

update Vsc
set price = Vsc.price + (select sum(price) from Delta
                         where Vsc.store = Delta.store and
                               Vsc.customer = Delta.customer)
where (Vsc.store, Vsc.customer)
      in
      (select store, customer from Delta);

212
Step 2: Add new combinations
insert into Vsc
select store,customer,sum(price)
from Delta where (store,customer) not in
(select store,customer from Vsc)
group by store,customer;

213
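On systems that support the SQL MERGE statement (e.g. Oracle, SQL Server, or PostgreSQL 15 and later), Steps 1 and 2 can be combined into a single statement; a sketch over the same Vsc and Delta tables (the alias D and the column name dprice are illustrative):

merge into Vsc
using (select store, customer, sum(price) as dprice
       from Delta
       group by store, customer) D
on (Vsc.store = D.store and Vsc.customer = D.customer)
when matched then
  update set price = Vsc.price + D.dprice     -- Step 1: increment existing combinations
when not matched then
  insert (store, customer, price)             -- Step 2: add new combinations
  values (D.store, D.customer, D.dprice);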
Choice-2: Alternative
• Idea: add delta records to the view, create a new table to hold
updated records, then rename

-- append the aggregated delta records to the view
insert into Vsc
select store, customer, sum(price) from Delta
group by store, customer;

-- re-aggregate old and delta rows into a new table
create table Vnew(store, customer, price);
insert into Vnew
select store, customer, sum(price) from Vsc
group by store, customer;

-- swap the tables
drop table Vsc;
rename table Vnew to Vsc;   -- rename syntax varies by DBMS (e.g. ALTER TABLE ... RENAME TO)

214
Simple Example
After insertion of deltas:             Final View (after re-aggregation):

  Store  Customer  Price                 Store  Customer  Price
  S1     C2        $700                  S1     C1        $50
  S1     C3        $240                  S1     C2        $770
  S2     C1        $190                  S1     C3        $240
  S2     C3        $450                  S2     C1        $210
  S1     C1        $50                   S2     C3        $450
  S1     C2        $70
  S2     C1        $20
215
Multiple View Update
Assume V2 is a descendant of V1 in the Data Cube lattice
(i.e., V1 can be used to compute V2)

[Figure: views V1 and V2 are derived from the Fact table; a delta of new records is
appended to the Fact table, producing the updated Fact table.]
Scenario 1: Re-compute both views after finishing updating the Fact table
[Figure: v1 and v2 are both re-computed from the updated Fact table.]

Scenario 2: Re-compute v1 from the updated Fact table; then re-compute v2 from v1
[Figure: v1 is re-computed from the updated Fact table; v2 is re-computed from v1.]

Scenario 3: Incrementally update v1 from the delta; then re-compute v2 from v1
[Figure: v1 is updated using only the delta records; v2 is re-computed from v1.]

Scenario 4: Incrementally update both v1 and v2 from the delta
[Figure: both v1 and v2 are updated using only the delta records.]
Consider
• More scenarios?

• Now consider the case of 100 views


PHYSICAL REPRESENTATION OF
MATERIALIZED VIEWS IN THE STAR SCHEMA
Want to create View:
SUM(Quantity), SUM(TotalPrice) per Category, CityName
[Star schema figure:]

Fact table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName,
            Quantity, TotalPrice

Dimension tables:
  Order(OrderNo, OrderDate)
  Customer(CustomerNo, CustomerName, CustomerAddress, City)
  Salesperson(SalespersonID, SalespersonName, City, Quota)
  Product(ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH)
  Date(DateKey, Date, Month, Year)
  City(CityName, State, Country)
224
SQL Query

Select Category, CityName, SUM(Quantity) as Sum_Quantity, SUM(TotalPrice) as Sum_TotalPrice
From Fact, Product
Where Fact.ProdNo = Product.ProdNo
Group by Category, CityName;

225
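A sketch of materializing this result as the new fact table of the next slide; the table name AggFact is assumed, and CREATE TABLE ... AS SELECT syntax varies slightly across DBMSs:

create table AggFact as
select Category, CityName,
       SUM(Quantity)   as Sum_Quantity,
       SUM(TotalPrice) as Sum_TotalPrice
from Fact, Product
where Fact.ProdNo = Product.ProdNo
group by Category, CityName;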
Create New Fact Table (= this view)
[Figure: schema of the new aggregated fact table and its dimensions:]

Aggregated fact table for the new view: Category, CityName, Sum_Quantity, Sum_TotalPrice
Dimension tables:
  Category(Category, CategoryDescr)
  City(CityName, State, Country)

What additional queries can be executed using this view?
226
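For instance, total sales per Country can be computed from this much smaller table by joining it with the City dimension (reusing the assumed name AggFact for the aggregated fact table):

select Country,
       SUM(Sum_Quantity)   as Sum_Quantity,
       SUM(Sum_TotalPrice) as Sum_TotalPrice
from AggFact, City
where AggFact.CityName = City.CityName
group by Country;

Any roll-up of (Category, CityName), such as per CategoryDescr, per State, or per Country, can be answered this way, since SUM is distributive.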
Using Materialized Views through
Selection
• A query can use a view through a selection if
– Each selection condition C on each dimension d in the
query logically implies a condition C’ on dimension d
in the view
• Example: A view has sum(sales) by product and
by year for products introduced after 1991
– OK to use for sum(sales) by product for products
introduced after 1992
– CANNOT use for sum(sales) for products introduced
after 1989

229
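A sketch of the example; the view is assumed to be stored as v_prod_year(product, intro_year, year, sum_sales) and to contain only products introduced after 1991 (all names hypothetical):

-- OK: the query condition (intro_year > 1992) implies the view condition (> 1991),
-- so every tuple the query needs is present in the view
select product, sum(sum_sales) as sum_sales
from v_prod_year
where intro_year > 1992
group by product;

-- NOT OK: asking for products introduced after 1989 would need tuples
-- (products introduced in 1990 or 1991) that the view does not contain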
Using Materialized Views through
Group By (Roll Up)
• The view V may be applicable via roll-up if for
every grouping attribute g of the query Q:
– Q has Group By a1, ..., g, ..., an
– V has Group By a1, ..., h, ..., an
– Attribute g is higher than h in the attribute hierarchy
– Aggregation functions are distributive (sum, count,
max, etc)
• Example: Compute “sum(sales) by category” from
the view “sum(sales) by product”

230
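A sketch of the example; the view is assumed to be stored as v_sales_by_product(ProdNo, sum_sales), and the Product dimension of the star schema maps each product to its category:

-- "sum(sales) by category" rolled up from "sum(sales) by product"
select p.Category, sum(v.sum_sales) as sum_sales
from v_sales_by_product v, Product p
where v.ProdNo = p.ProdNo
group by p.Category;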
Using Views
• Need cost-based optimization to decide which view(s) to use
for answering a query
– Consider a query on (category, state) and three materialized
aggregate views on
1. (product, state)
2. (category, city)
3. (category, country)
– (product, state) and (category, city) are candidate materialized views
to answer the query
[Figure: query on (category, state); the views (product, state) and (category, city) can
answer it, whereas (category, country) cannot.]
232
Note
• The following slides are outside the course syllabus
Data Cube Storage and Indexing
• Several approaches within the relational world
– Cubetrees, QC-trees, Dwarf, CURE
• Main idea: exploit inherent redundancy of
multidimensional aggregates

234
The Dwarf (SIGMOD 2002)
• Data-Driven DAG
– Factors out inter-view redundancies
– 100% accurate (no approximation)
– All views are included
– Indexes for free
– Partial materialization possible
• Look at the Data Cube Records
– Common Prefixes
• high in dense areas
– Common Suffixes
• extremely high in sparse areas

235
Redundancy in the Cube (1)
• Common Prefixes

  Data Cube records sharing a prefix:
    S2,C1,P1,90
    S2,C1,P2,50
    S2,C1,ALL,140

  Fact table:
    Store  Customer  Product  Price
    S1     C2        P2       $70
    S1     C3        P1       $40
    S2     C1        P1       $90
    S2     C1        P2       $50

  Mostly in dense areas:
  ➢ customer C1 buys a lot of products at store S2
  ➢ all these records have the same prefix: S2,C1
236
Redundancy in the Cube (2)
• Common Suffixes

  Data Cube records sharing a suffix:
    S2,C1,P1,90
    S2,ALL,P1,90
    ALL,C1,P1,90

  Fact table:
    Store  Customer  Product  Price
    S1     C2        P2       $70
    S1     C3        P1       $40
    S2     C1        P1       $90
    S2     C1        P2       $50

  Mostly in sparse areas:
  C1 only visits S2 and is the only customer that buys P1, P2 in that store
237
Dwarf Example
[Figure: the Dwarf structure built for the fact table below, a three-level DAG.
Store level: root node (1) with cells S1, S2 (plus ALL).
Customer level: node (2) under S1 with cells C2, C3; node (6) under S2 with cell C1;
node (8) under ALL stores with cells C1, C2, C3.
Product level (leaves holding the aggregates):
node (3) = [P2 $70 | ALL $70], node (4) = [P1 $40 | ALL $40],
node (5) = [P1 $40, P2 $70 | ALL $110], node (7) = [P1 $90, P2 $50 | ALL $140],
node (9) = [P1 $130, P2 $120 | ALL $250].]

Fact table:
  Store  Customer  Product  Price
  S1     C2        P2       $70
  S1     C3        P1       $40
  S2     C1        P1       $90
  S2     C1        P2       $50
238
Dwarf Example
[Figure: the same Dwarf structure as on the previous slide.]

Fact table:                            Group-by Product:
  Store  Customer  Product  Price        Store  Customer  Product  Sum(Price)
  S1     C2        P2       $70          ALL    ALL       P1       $130
  S1     C3        P1       $40          ALL    ALL       P2       $120
  S2     C1        P1       $90
  S2     C1        P2       $50
239
Dwarf Example
[Figure: the same Dwarf structure as on the previous slides.]

Fact table:                            Group-by Store:
  Store  Customer  Product  Price        Store  Customer  Product  Sum(Price)
  S1     C2        P2       $70          S1     ALL       ALL      $110
  S1     C3        P1       $40          S2     ALL       ALL      $140
  S2     C1        P1       $90
  S2     C1        P2       $50
240