12 DataWarehousing
Yannis Kotidis
What is a Database?
• From Wikipedia:
– A database is a structured collection of records or data. A computer database relies upon software to
organize the storage of data. The software models the database structure in what are known as
database models. The model in most common use today is the relational model. Other models such as
the hierarchical model and the network model use a more explicit representation of relationships …
– Database management systems (DBMS) are the software used to organize and maintain the database.
These are categorized according to the database model that they support. The model tends to
determine the query languages that are available to access the database. A great deal of the internal
engineering of a DBMS, however, is independent of the data model, and is concerned with managing
factors such as performance, concurrency, integrity, and recovery from hardware failures. ...
2
Note
3
Basic Database Usage (1): Querying
Statements
Relations Results
(select columns and rows)
A B C D E A D
A D
4
Basic Database Usage (2): Updates
• Banking transaction: transfer 100 euro from
account A to account B
– What can go wrong?
Account Balance
A 275 -100
B 64 +100
5
Issue 1: Partial results
• System failure prior to adding funds to
account B (but after deleting them from A)
Account Balance
A 175 -100
6
Issue 2: No isolation
• For an observer that monitors all funds,
money seems to temporarily disappear (and
reappear again)
Account Balance
B 64 +100
7
Issue 3: lost update
• Two concurrent transactions on account A
– T1: remove 100
– T2: remove 50
Account Balance
T2 T1
Read balance (275) Read balance (275)
Subtract 50 A 275 Subtract 100
Write balance 225 Write balance 175
B 64
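The lost update can be replayed in a few lines. This is a minimal sketch, assuming a plain dictionary stands in for the database and that T1's write happens to land last (the function names are hypothetical, not part of any DBMS API):

```python
def run_interleaved(balance):
    """T1 and T2 both read before either writes (the lost-update schedule)."""
    db = {"A": balance}
    t1_read = db["A"]          # T1: read balance (275)
    t2_read = db["A"]          # T2: read balance (275)
    db["A"] = t2_read - 50     # T2: write 225
    db["A"] = t1_read - 100    # T1: write 175, overwriting T2's update
    return db["A"]


def run_serial(balance):
    """T1 runs to completion, then T2 (one of the two serial orders)."""
    db = {"A": balance}
    db["A"] = db["A"] - 100    # T1
    db["A"] = db["A"] - 50     # T2
    return db["A"]


print(run_interleaved(275))    # 175: T2's withdrawal of 50 is lost
print(run_serial(275))         # 125: both withdrawals are applied
```

Either serial order ends at 125; only the interleaved schedule loses an update.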
8
Programming abstraction: Transactions
• Implement real-world transactions
Commit
Begin Run
Abort
9
Atomicity (A.C.I.D.)
• The "all or nothing" property.
– Programmer needn't worry about partial states persisting.
– Two possible outcomes: the transaction commits or rolls back
(aborts)
Abort
Begin Run
Commit
• Examples:
– T1: Delete person from consultants table, insert person into
employees table
– T2: Transfer funds from account A to account B
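Example T2 can be tried out with Python's built-in sqlite3 module. The sketch below assumes an in-memory database, a hypothetical `account` table, and a simulated failure between the two updates; it is an illustration of atomicity, not a description of how any particular DBMS implements it:

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as one transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            if fail_midway:
                raise RuntimeError("simulated crash after debiting the source")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 275), ("B", 64)])
conn.commit()

aborted = transfer(conn, "A", "B", 100, fail_midway=True)
after_abort = balances(conn)      # unchanged: the debit was rolled back
committed = transfer(conn, "A", "B", 100)
after_commit = balances(conn)     # both updates are visible together
```

After the aborted attempt the balances are still 275 and 64, so the partial state from Issue 1 never persists.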
10
Consistency (A.C.I.D)
• The database should start out "consistent"
(in a legal state) and remain "consistent"
at the end of the transaction.
• The definition of "consistent" is specified to the
system by the database administrator
– integrity constraints
– other notions of consistency must be handled by
the application.
11
Integrity or correctness of data
• Would like data to be “accurate” or
“correct” at all times
EMP:
Name    Age
John    52
Jim     24
Martha  1

CREATE TABLE EMP (
  Name varchar(255) NOT NULL,
  Age int,
  CHECK (Age>=18)
);
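To see the constraint at work, the same CREATE TABLE statement can be run against an in-memory SQLite database from Python. This is a sketch under that assumption (SQLite is used only because it ships with Python; other systems report the violation through their own error types):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMP (
        Name varchar(255) NOT NULL,
        Age  int,
        CHECK (Age >= 18)
    )
""")

accepted, rejected = [], []
for name, age in [("John", 52), ("Jim", 24), ("Martha", 1)]:
    try:
        conn.execute("INSERT INTO EMP (Name, Age) VALUES (?, ?)", (name, age))
        accepted.append(name)
    except sqlite3.IntegrityError:
        # The CHECK predicate is false for this row, so the insert is refused.
        rejected.append(name)

print(accepted, rejected)
```

The row for Martha (Age 1) violates Age >= 18 and never reaches the table.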
12
Integrity/consistency constraints
• Predicates data must satisfy
• Examples:
– age >= 18 and age < 65
– x is key of relation R
– x → y holds in R
– Domain(x) = {Red, Blue, Green}
– no employee should make more than twice the
average salary
13
Isolation (A.C.I.D)
• Each transaction must appear to be executed
as if no other transaction is executing at the
same time.
• Transfer funds from A to B (T1).
• Another teller makes a query on A and B (T2).
• T2 could see funds on A or B but not in both!
– Result may be independent of the time
transactions were submitted
14
Durability (A.C.I.D.)
• Once committed, the transaction's effects
should not disappear.
– Of course, they may be overwritten by subsequent
committed transactions.
15
Implementation
• A, C, and D are mostly guaranteed by recovery
(usually implemented via logging).
• I is mostly guaranteed by concurrency control
(usually implemented via locking).
• Of course, life is not so simple. For example,
recovery typically requires concurrency
control and depends on certain behavior by
the buffer manager…
16
Operational DBs: OLTP systems
• OLTP= On-Line Transaction Processing
– order update: pull up order# XXX and update status
flag to “completed”
UPDATE Orders SET status = 'Completed'
WHERE orderID = 'XXX'
Index on Orders.orderID
orderID=“XXX”
XXX
17
Reconstruction of logical records
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24
Nick Long
σE.name=“Nick Long”(Employees) <102,Nick Long>
Index on Employees.Ename
19
Physical Plan (step b):
INLJ(Employees,Assignments)
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24
EmpID=102
<102,2,24>
Employees Assignments <102,3,8>
Index on Assignments.EmpID
20
Physical Plan (step c):
INLJ(Assignments,Projects)
Employees Projects Assignments
EmpID Ename ProjID Pname EmpID ProjID Hours
101 John Smith 2 Web_TV 101 3 16
102 Nick Long 3 Web_portal 102 2 24
103 Susan Goal 4 Billing 102 3 8
104 John English 104 4 32
105 Alice Web 105 4 24
106 Patricia Kane 106 4 24
ProjID=2
<2,Web_TV>
ProjID=3
Assignments Projects <3,Web_portal>
22
Transactional Systems
• Transactional systems are optimized primarily for the
here and now
23
Analytical queries on a production
system?
• CEO wants to report total sales per store in Athens,
for stores with at least 500 sales
• 3 tables: Sales(custid, productid,storeid,amt)
Stores(storeid, manager,addressid)
Addresses(addressid,number,street,city)
SELECT Stores.storeid, SUM(amt) AS totalSales    -- aggregation
FROM Sales, Stores, Addresses
WHERE Stores.storeid = Sales.storeid              -- joins
  AND Stores.addressid = Addresses.addressid
  AND Addresses.city = 'Athens'                   -- filter
GROUP BY Stores.storeid                           -- group by
HAVING COUNT(*) >= 500                            -- filter on an aggregate
Logical Plan
πstoreId,totalSales
COUNT(*) >= 500
Stores.storeId,SUM(amt)->totalSales, COUNT(*)
Sales(custid, productid,storeid,amt)
26
My employees & their projects
EmpID Ename ProjID Pname City Hours
101 John Smith 3 Web_portal Thessaloniki 16
102 Nick Long 2 Web_TV Athens 24
103 Susan Goal 3 Web_portal Thessaloniki 8
104 John English 4 Billing Athens 32
105 Alice Web 4 Billing Athens 24
106 Patricia Kane 4 Billing Athens 24
27
OLAP:
ONLINE ANALYTICAL PROCESSING
OLAP (Online analytical processing)
• OLAP is the process of creating and
summarizing historical, multidimensional data
– Enhances organizational understanding of data
– Supports informed decision-making through
Decision Support Systems and Business
Intelligence
– Enables users to manipulate and explore data
easily and intuitively
34
Data Analytics Stack
OLAP Data Mining Machine Learning
50
Integrated
• Constructed by integrating multiple,
heterogeneous data sources
– relational databases, files, external sources
• Data cleaning and data integration techniques are
applied
– Ensure consistency in naming conventions, keys,
attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is
transformed
51
Time-Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current data, old values
overwritten, deleted or archived
– Data warehouse: provides data from a historical
perspective (e.g., past 5-10 years) for trend
analysis
52
Non-volatile
• A physically separate store of data
transformed from the operational
environment
• Operational update of data does not occur in
the data warehouse environment
– Does not require transaction processing, recovery,
and concurrency control mechanisms
– Requires only two operations in data accessing:
• loading of data and access to data
53
Data Warehouse Architecture
Metadata
Monitor
& OLAP Cubes
Other Integrator OLAP
Staging Area
sources
json csv
Query/Reporting
Operational Extract
DBs Transform Data Serve
Load Data Mining
Warehouse
Front-End Tools
Data Marts
57
Data Marts
• Smaller warehouses
• Span part of organization
– e.g., marketing (customers, products, sales)
• Do not require enterprise-wide consensus
– But may lead to long term integration
problems
58
ETL: Extract-Transform-Load
• Data is periodically (e.g. every night) pulled from the sources and feeds
the Data Warehouse
– Modern applications stress the need for (near) real-time processing of
updates (will not be covered in this class)
• To update the Data Warehouse with new data, ETL (Extract, Transform,
Load) processes are utilized to extract, validate, cleanse, correct,
transform, and load the data
• Verifying data accuracy to ensure that the data is correct and consistent
• Removing duplicates to eliminate redundant entries
• Filling in or removing incomplete data to ensure that all data points are
complete and consistent
• Standardizing the data to ensure consistency in format and representation.
• High-quality data leads to better business decisions!
• Once the data has been loaded, precomputations are carried out in the
form of data cubes (either complete or partial) to accelerate the
processing of common queries
Data Warehouse Summary
• A data warehouse serves as a centralized
repository housing structured data, primarily
geared towards facilitating analytics and business
intelligence endeavours.
• Data warehouses are highly efficient, due to the
structured nature of the data inside them that
enables efficient SQL-based analytics.
• The enforcement of a consistent schema across
all stored data greatly enhances the usability and
reliability of the stored data.
Related Technologies
Data Warehouse Data Lake Data Lakehouse
C1,S1 T5 C1 S2 P2 $25
C2,S1
T6 C1 S2 P2 $50
T7 C2 S1 P4 $45
C3,S1 T8 C3 S1 P1 $10
C1,S2
Customer Store Revenue Min Price Max Price
C2 S1 $115 $45 $70
Perform
C1 S1 $45 $45 $45
2.
aggregation
C3 S1 $50 $10 $40
C1 S2 $165 $25 $90
64
Relational Algebra (logical plan)
Sales
Map data and aggregates into a high-
dimensional space
• Example: compute total sales volume per
productID and storeID
StoreID
Total ProductID
Sales 1 2 3 4
1 $454 - - $925
468
StoreID
2 $468 $800 - -
3 $296 - $240 -
4 $652 - $540 $745
ProductID
67
Product Hierarchy
PRODUCTS
Steak, Smoked, Minced meat, Fillet, Graviera, Feta, Full-fat, Thick, Thin, Yellow, Carolina
p187 p96
68
More on Attribute Hierarchies
• Values of a dimension may be related
– Hierarchies are most common
• Dependency graph may be: year
– Hierarchy (tree): e.g.,
city → state → country month
week
– Lattice:
date → month → year date
date → week (of a year) → year
69
Another example
• VIN: Vehicle
Identification Number Manufacturer
Type
(unique key)
• Model: e.g. Fiesta
Model
• Type: e.g. Compact
Car
VIN
• Manufacturer: e.g.
Ford
Using hierarchies
• When projecting data into a set of dimensions, it is
common to select an appropriate hierarchy level for
each dimension based on the analysis being
performed.
– “Compute total sales per productID”
Vs
– “Compute total sales per product-category”
NY
Milk 20
Soda 45
Beer 18
Bread 22
Toothpaste 07
Soap 06
1 2 3 4 5 67 1: Sunday
Day-of-week 2: Monday
….
72
Roll-up Operation
• Dimension reduction: Product
1 2 3 4
– e.g., total sales by city by product
NY $454 - - $925
– e.g., total sales by city
SF $468 $485 - $315
City
LA $296 - $340 -
SE $652 - $640 $645
73
Note
• At this point we are discussing operators with
which we "move" around the multidimensional
space, creating different "projections" of the
warehouse data.
• The ROLLUP operator also exists in SQL, but
there it works slightly differently, computing
multiple group bys with a single query.
Drill-Down
• Drill-down: Inverse operation of roll-up
– Provides the data set that was aggregated
• e.g., show “base” data for total sales figure of the state of CA
75
Other Operations
• Selection (slice & dice) defines a
subcube
• Project the cube on fewer
dimensions by specifying coordinates
of remaining dimensions
• e.g., sales to customer XXX
• Ranking
• top 3% of cities by average sales
76
Warehouse Database Schema
• Relational design should reflect
multidimensional view
• Typical schemas:
– Star Schema
– Snowflake Schema
– Fact Constellation Schema
Fact Table
• A table in the data warehouse that contains facts consisting of
– Numerical performance measures
– Foreign keys that tie the fact data to the dimension tables

Foreign keys to dimension tables | Measures
T2   P157   L4                   | 3    180
T2   P6     L1                   | 14   2560
T3   P25    L3                   | 1    2
T3   P157   L1                   | 1    60
79
Dimension Tables
product_key  product_name  category  brand    color  supplier name
P1           i7-8700K      CPU       Intel    black  Jim
P2           i5-2400       CPU       Intel    black  Jim
P3           Samsung 830   SSD       Samsung  brown  Ben
• Loading of data
– dimension tables are relatively static
– data is loaded (append mostly) into fact table(s)
– new indexing opportunities
82
Querying the Star Schema
SALES (fact table): time_key, product_key, location_key, units_sold, amount
TIME: time_key, day, day_of_the_week, month, quarter, year
PRODUCT: product_key, product_name, category, brand, color, supplier_name
LOCATION: location_key, store, street_address, city, state, country, region
83
Querying the Star Schema
“Find total sales per product-category in our stores in Europe”
87
Snowflake Schema: represents dimensional
hierarchy by normalization
Product
Order Category
ProdNo
OrderNo CategoryID
ProdName
OrderType CategoryDescr
ProdDescr
OrderNotes Fact table
CategoryID
Customer OrderNo UnitPrice
CustomerNo SalespersonID QOH
CustomerName CustomerNo
Date Month
CustomerAddress DateKey Year
CityName DateKey Month
City Date
ProdNo Year Year
Quantity Week
Salesperson Week
TotalPrice Month
SalespersonID Year
SalespesonName
City City State
Quota CityName StateName
State Country
88
Multidimensional Modeling Stages
(adapted from https://www.kimballgroup.com/)
Gather Business Requirements
and Data Realities
94
Bitmap Index Example
Base Table Region Index
Cust Region Rating RowID N S E W
C1 N H 1 1 0 0 0
C2 S M 2 0 1 0 0
C3 W L 3 0 0 0 1
C4 W H 4 0 0 0 1
C5 S L 5 0 1 0 0
C6 W L 6 0 0 0 1
C7 N H 7 1 0 0 0
95
Bitmap Index Example
Base Table Region Index
Cust Region Rating RowID N S E W
C1 N H 1 1 0 0 0
C2 S M 2 0 1 0 0
C3 W L 3 0 0 0 1
C4 W H 4 0 0 0 1
C5 S L 5 0 1 0 0
C6 W L 6 0 0 0 1
C7 N H 7 1 0 0 0
97
Bitmap Index Example
Base Table Region Index Rating Index
Cust Region Rating RowID N S E W RowID H M L
C1 N H 1 1 0 0 0 1 1 0 0
C2 S M 2 0 1 0 0 2 0 1 0
C3 W L 3 0 0 0 1 3 0 0 1
C4 W H 4 0 0 0 1 4 1 0 0
C5 S L 5 0 1 0 0 5 0 0 1
C6 W L 6 0 0 0 1 6 0 0 1
C7 N H 7 1 0 0 0 7 1 0 0
99
Bitmap Index
• An alternative representation of RID-list
• Comparison, join and aggregation operations are
reduced to bit arithmetic
• Especially advantageous for low-cardinality
domains
– Significant reduction in space and I/O (30:1)
– Have been adapted for higher cardinality domains
– Compression (e.g., run-length encoding) exploited
• Products: Model 204, Redbrick, IQ (Sybase),
Oracle, etc
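The reduction of predicates to bit arithmetic can be sketched with Python integers used as bit vectors. The code assumes the seven-row base table from the bitmap index example and stores no compression; the helper names are hypothetical:

```python
# Base table from the bitmap index example: (customer, region, rating).
rows = [("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
        ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H")]


def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set iff row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_customers(bitmap):
    """Translate a bit vector back into the customers it selects."""
    return [rows[i][0] for i in range(len(rows)) if (bitmap >> i) & 1]


region_idx = build_bitmap_index(rows, 1)
rating_idx = build_bitmap_index(rows, 2)

# Region = 'W' AND Rating = 'L' becomes a single bitwise AND.
west_and_low = region_idx["W"] & rating_idx["L"]
print(matching_customers(west_and_low))       # ['C3', 'C6']

# COUNT(*) for Region = 'W' is a population count on one bit vector.
west_count = bin(region_idx["W"]).count("1")
print(west_count)                             # 3
```

Neither query touches the base table until the final translation of set bits into rows.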
100
Join Index
• Traditional index maps the value in a column
to a list of rows with that value
• Join index maintains relationships between
attribute value of a dimension and the
matching rows in the fact table
• Join index may span multiple dimensions
(composite join index)
101
Example: Join Indexes
• “Combine” SALE, PRODUCT relations
103
Example: Compute total sales in
AFRICA
[Figure: star schema with fact table SALES and dimension tables TIME, PRODUCT and LOCATION]
SELECT SUM(sales.amount)
FROM sales, location
WHERE sales.location_key = location.location_key
  AND location.region = 'AFRICA'
Join-Index in the Star Schema
• Join index relates the values of the dimensions of a star
schema to rows in the fact table.
– a join index on region maintains for each distinct
region a list of ROW-IDs of the tuples recording the sales
in the region
[Figure: LOCATION regions (Africa, America, Asia, Europe) linked to SALES rows R102, R117 and R118]
105
Join Index on Location.Region
implemented as bitmap index
T2 P157 L4 3 180 0 0 0 1
T2 P6 L1 14 2560 1 0 0 0
T3 P25 L3 1 2 0 0 1 0
T3 P157 L1 1 60 1 0 0 0
107
THE DATA CUBE
Aggregation
(on a single group via filtering)
• Sum up amounts for day 1
• In SQL: SELECT sum(amt)
FROM SALE
WHERE day = 1
109
Group by & Aggregation
• Sum up amounts by day
110
Common operations
• Sum up amounts by day, product
• In SQL: SELECT prodid,day,sum(amt) FROM SALE
GROUP BY prodId, day
sale:
prodId  storeId  day  amt
p1      c1       1    12
p2      c1       1    11
p1      c3       1    50
p2      c2       1    8
p1      c1       2    44
p1      c2       2    4

Result (rollup from sale; drill-down goes the other way):
prodId  day  amt
p1      1    62
p2      1    19
p1      2    48
111
Recall: Star Schema Example 1
[Figure: star schema with fact table SALES(time_key, product_key, location_key, units_sold, amount) and dimension tables TIME, PRODUCT and LOCATION]
Compute volume of sales per product_key and store

Relational result:
Store  Product_key  sum(amount)
1      1            454
1      4            925
2      1            468
2      2            800
3      1            296
3      3            240
4      1            652
4      3            540
4      4            745

Cross-tabulation (rows: Store, columns: Product_key):
Sales  1    2    3    4    ALL
1      454  -    -    925  1379
2      468  800  -    -    1268
3      296  -    240  -    536
4      652  -    540  745  1937
114
Multiple Simultaneous Aggregates
Cross-Tabulation (products/store)
How many queries
to obtain this result?
Product_key
Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
Sub-totals per store
Store
Total sales
Sub-totals per product_key
115
Multiple Simultaneous Aggregates
Cross-Tabulation (products/store)
[Figure: the products/store cross-tabulation, highlighting in turn the cells produced by each aggregate]
– Inner cells: aggregate sales group by (store,product_key)
– ALL column: aggregate sales group by (store)
– ALL row: aggregate sales group by (product_key)
– Bottom-right cell: total sales, group by "none"
Multiple Simultaneous Aggregates
Cross-Tabulation (products/store)
4 Group-bys here:
– (store,product_key)
– (store)
– (product_key)
– ()
Need to write 4 queries!
[Figure: the products/store cross-tabulation with sub-totals per store, sub-totals per product_key and the total]
122
Relational View of Data Cube
SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
GROUP BY CUBE (SALES.product_key, LOCATION.store)

Cross-tabulation (rows: Store, columns: Product):
Sales  1     2    3    4     ALL
1      454   -    -    925   1379
2      468   800  -    -     1268
3      296   -    240  -     536
4      652   -    540  745   1937
ALL    1870  800  780  1670  5120

Relational view:
Store  Product_key  sum(amount)
1      1            454
1      4            925
2      1            468
2      2            800
3      1            296
3      3            240
4      1            652
4      3            540
4      4            745
1      ALL          1379
2      ALL          1268
3      ALL          536
4      ALL          1937
ALL    1            1870
ALL    2            800
ALL    3            780
ALL    4            1670
ALL    ALL          5120
123
Relational View of Data Cube
Store Product_key sum(amount)
1 1 454
1 4 925
2 1 468
2 2 800
group by(store, product_key) 3 1 296
3 3 240
4 1 652
4 3 540
4 4 745
1 ALL 1379
group by(store) 2 ALL 1268
3 ALL 536
4 ALL 1937
ALL 1 1870
ALL 2 800
group by(product_key)
ALL 3 780
ALL 4 1670
group by() ALL ALL 5120
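The relational view above can be reproduced without a CUBE operator by enumerating every subset of the grouping attributes. A minimal sketch, assuming the nine (store, product_key, amount) rows of the example and using the string "ALL" as the placeholder value (the function name `cube` is hypothetical):

```python
from collections import defaultdict
from itertools import combinations

# (store, product_key, amount) rows from the example.
sales = [(1, 1, 454), (1, 4, 925), (2, 1, 468), (2, 2, 800), (3, 1, 296),
         (3, 3, 240), (4, 1, 652), (4, 3, 540), (4, 4, 745)]


def cube(rows, n_dims):
    """Return {group key: SUM(measure)} for every subset of the dimensions."""
    result = defaultdict(int)
    for r in range(n_dims + 1):
        for kept in combinations(range(n_dims), r):
            for row in rows:
                # Dimensions that are not grouped on collapse to "ALL".
                key = tuple(row[d] if d in kept else "ALL"
                            for d in range(n_dims))
                result[key] += row[-1]
    return dict(result)


data_cube = cube(sales, 2)
print(len(data_cube))                 # 18 records: 9 + 4 + 4 + 1
print(data_cube[(1, "ALL")])          # 1379, sub-total for store 1
print(data_cube[("ALL", "ALL")])      # 5120, total sales
```

This is the brute-force strategy: each of the four group bys makes its own pass over the data.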
124
Quiz
• SALES(customer,sales_person,store,product,amt)
• Assume the SUM() aggregate function
• What is the meaning of the following data cube
records?
(ALL,’JOHN’,ALL,ALL,5000)
(‘NICK’,ALL,ALL,’BEER’,250)
(ALL,ALL,ALL,’MILK’,70000)
(ALL,ALL,ALL,ALL,250000)
125
Group by (Product, Quarter, Region)
SUM() aggregate function
Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD
PC America
VCR
Region
Europe
1.2M Asia
126
Group by (Product, Quarter, Region)
Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD
PC America
VCR
Region
Europe
Asia
4M
127
Group by (Product, Quarter, Region)
Total sales of DVDs in the
1st Qtr in America
Quarter
1Qtr 2Qtr 3Qtr 4Qtr
DVD 2.2M
PC America
VCR
Region
Europe
Asia
128
Data Cube: Multidimensional View
Total annual sales
Quarter of DVDs in America
1Qtr 2Qtr 3Qtr 4Qtr ALL
DVD
PC America
VCR
ALL
Europe
Region
Asia
ALL
133
Step 1: Sort tuples by prodId
Raw data (fact table)
sale prodId storeId date amt
p1 s1 1 12
p2 s1 1 11
p1 s3 1 50
p2 s2 1 8
p1 s1 2 44
p1 s2 2 4
134
Step 2: Aggregate records (sum amt)
Sorted Raw data
sale prodId storeId date amt
p1 s1 1 12
p1 s1 2 44 Sales for prodId=1
p1 s2 2 4
p1 s3 1 50
p2 s1 1 11
p2 s2 1 8
135
More on aggregate
• Assumed SUM() function
• How much space needed? sale prodId storeId date amt
p1 s1 1 12
• How about AVG()? p1 s1 2 44
p1 s2 2 4
• How about MEDIAN()? p1 s3 1 50
p2 s1 1 11
p2 s2 1 8
136
Aggregate Computation
• Certain functions
(SUM,MIN,MAX,COUNT,AVERAGE, etc) require
small (bounded) space for storing their state
and may be computed on the fly, while
executing the merging phase of the 2-phase
sort algorithm.
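The bounded-state idea can be sketched as a single pass over sorted input. The code assumes the six-row sale table of the running example and lets an in-memory `sorted()` stand in for the two-phase external sort; only a running SUM and COUNT are kept per open group, and AVG is derived from them:

```python
from itertools import groupby
from operator import itemgetter

# (prodId, storeId, date, amt) rows from the running example.
sale = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]


def aggregate_sorted(rows):
    """One pass over rows sorted by prodId, with O(1) state per open group."""
    out = {}
    by_prod = sorted(rows, key=itemgetter(0))   # stand-in for the external sort
    for prod, group in groupby(by_prod, key=itemgetter(0)):
        total, count = 0, 0                     # SUM and COUNT are the whole state
        for _prod, _store, _date, amt in group:
            total += amt
            count += 1
        out[prod] = (total, total / count)      # AVG = SUM / COUNT
    return out


result = aggregate_sorted(sale)
print(result)   # {'p1': (110, 27.5), 'p2': (19, 9.5)}
```

MEDIAN does not fit this pattern: it needs all values of a group (or a second ordering on amt) before it can produce an answer.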
137
Hashing
key → h(key)
<key>
Buckets
(typically 1
. disk block)
.
.
138
Example: 2 records/bucket
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
h(e) = 1

Bucket 0: d
Bucket 1: a, c (overflow block: e)
Bucket 2: b
Bucket 3: (empty)
139
How does this work for aggregates?
Hash on prodId Possibly keep
sale prodId storeId date amt
p1 s1 1 12 records sorted
p1 s3 1 50
p1 s1 2 44
within bucket
p1 s2 2 4
p3 s5 1 7
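Hash-based aggregation can be sketched as follows. The example assumes everything fits in memory, uses Python's built-in `hash` as h(key), and picks 4 buckets arbitrarily; each bucket keeps one running SUM(amt) per prodId instead of the raw records:

```python
def hash_aggregate(rows, n_buckets=4):
    """Group by prodId with a hash table of running sums."""
    buckets = [dict() for _ in range(n_buckets)]
    for prod, _store, _date, amt in rows:
        bucket = buckets[hash(prod) % n_buckets]    # h(key) picks the bucket
        bucket[prod] = bucket.get(prod, 0) + amt    # update the running SUM
    return {prod: total for bucket in buckets for prod, total in bucket.items()}


# (prodId, storeId, date, amt) rows from the slide.
sale = [("p1", "s1", 1, 12), ("p1", "s3", 1, 50), ("p1", "s1", 2, 44),
        ("p1", "s2", 2, 4), ("p3", "s5", 1, 7)]

totals = hash_aggregate(sale)
print(totals)
```

All records with the same prodId land in the same bucket, so one pass over the input is enough when the table of running sums fits in memory.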
140
Naïve Data Cube Computation
• Fact table:
sale prodId storeId amt
p1 s1 12
p2 s1 11
p1 s3 50
p2 s2 8
p1 s1 44
p1 s2 4
141
Full Data Cube
(from previous example)
prodId storeId sum(amt)
p1 s1 56
p1 s2 4
p1 s3 50
p2 s1 11
p2 s2 8
p1 ALL 110
p2 ALL 19
ALL s1 67
ALL s2 12
ALL s3 50
ALL ALL 129
142
How much does it cost to compute?
• Assume B(SALES)=1 Million Blocks, larger than
available memory
• Our (brute force) strategy: compute each
group by independently
– Compute GROUP BY prodId,storeId
– Compute GROUP BY prodId
– Compute GROUP BY storeId
– Compute GROUP BY none (=total amt)
143
First Group By: prodId,storeId
• In SQL
SELECT prodId,storeId,sum(amt)
FROM SALES
GROUP BY prodId,storeId
144
Second Group By: prodId
• In SQL
SELECT prodId,sum(amt)
FROM SALES
GROUP BY prodId
145
Third Group By: storeId
• In SQL
SELECT storeId,sum(amt)
FROM SALES
GROUP BY storeId
146
Group By (none) = sum(amt)
• SQL:
SELECT sum(amt)
FROM SALES
• Cost ?
147
Recap
• Group By prodId,storeId : 3M I/Os
• Group By prodId : 3M I/Os
• Group By storeId : 3M I/Os
• Group By none : 1M I/Os
– Compute aggregate function over all records, no
sorting necessary
148
Practice Problem
• Rotation speed 7200rpm
• 128 sectors/track
• 4096 bytes/sector
• 4 sectors/block (16KB page size)
• Sequential I/O: ignore SEEKTIME, gaps, etc
149
Sustained disk speed
• 1 full rotation
– takes 60/7200=8.33ms
– retrieves 1 track = 128 sectors = 32 pages (blocks)
• 10 million blocks take
(8.33/1000) s × (10M/32) ≈ 2,604 s ≈ 43.4 minutes
• Can we do better?
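The arithmetic of the practice problem can be checked with a short script. It assumes the 10 million blocks are the total from the recap (three sorted group bys at 3M I/Os each plus one 1M scan) and that writes stream at the same rate as reads; the function name is hypothetical:

```python
ROTATION_RPM = 7200
SECTORS_PER_TRACK = 128
SECTORS_PER_BLOCK = 4


def sequential_io_seconds(n_blocks):
    """Time to stream n_blocks sequentially, ignoring seeks and gaps."""
    seconds_per_rotation = 60 / ROTATION_RPM                       # about 8.33 ms
    blocks_per_rotation = SECTORS_PER_TRACK // SECTORS_PER_BLOCK   # 32 blocks
    return n_blocks / blocks_per_rotation * seconds_per_rotation


# Recap: 3M + 3M + 3M I/Os for the sorted group bys, 1M for group by none.
total_ios = 3 * 3_000_000 + 1_000_000
minutes = sequential_io_seconds(total_ios) / 60
print(total_ios, round(minutes, 1))    # 10000000 43.4
```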
150
Share sort orders
153
Can we do better?
• Sort SALES on prodId,storeId
– At the merging phase compute
both group by (prodId,storeId)
and group by (prodId)
– Also compute group by none at
the same time

gb(prodId,storeId):
prodId  storeId  sum(amt)
p1      s1       56
p1      s2       4
p1      s3       50
p2      s1       11
p2      s2       8
• Compute group by (storeId) by s
product,store,quarter
155
Discussed optimization (sharing sort
orders) on the 3D Data Cube
• Sort SALES on product,store,quarter (also get
gb product,store, gb product and gb none)
• Sort SALES on product,quarter
• Sort SALES on store,quarter (also get gb store)
• Sort SALES on quarter
156
Compute from “smallest parent”
vs
“sharing sort orders”
• Consider computation of gb product, quarter
• Previously: Sort SALES on product,quarter
• Alternative: read and sort previously computed gb
product,store,quarter
– This gb will be smaller than SALES
• It may even fit in memory (one-pass sort)
– This gb is partially sorted (common prefix)
157
ESTIMATING THE DATA CUBE SIZE
How many group bys in the Data Cube?
product,store,quarter
none
159
2D Data Cube lattice
product, store
product store
none
160
Let’s add a simple hierarchy
• Assume that products are organized into
categories
• Utilizing this information becomes possible
when we aggregate the sales data.
– Aggregate sales per category
– Aggregate sales per category and store
– But it does not make sense to aggregate sales per
product and category (WHY?)
Compare these two results
Group by (product,category)      Group by (product)
product  category  sum(amt)      product  sum(amt)
p1       cat1      110           p1       110
p2       cat1      19            p2       19
p3       cat3      240           p3       240
p4       cat2      255           p4       255
p5       cat1      75            p5       75
163
2D Data Cube lattice with 2 separate
hierarchies on the product dimension
store, brand, category
none
165
#of group bys when there is a single
hierarchy per dimension
• N dimensions
• Dimension d_i has a hierarchy of length L_i
• Location: store→city→country
L_Location = 3
– If no hierarchy, then L_i = 1
• Number of group bys = (1+L_1)(1+L_2)…(1+L_N)
– No need to memorize this formula! Seek to
understand its derivation instead (next slide)
166
How is the formula derived
• Consider Location dimension with hierarchy
– store→city→country (i.e. L_Location = 3)
• In a group by (aggregate) query I may
– Not consider location at all (e.g. total sales per product)
• Another way to think about this is that +1 stands for ALL
– Consider location information at the store-level
• (e.g. total sales per customer, store)
– Consider location information at the city-level
• (e.g. total sales per product, city)
– Consider location information at the country-level
• (e.g. total sales per sales_person, country)
• There are (1+3) choices regarding that dimension, independently of
what other dimensions I select in a gb
– Thus, (1+L_1)(1+L_2)…(1+L_N) possible combinations of dimensions in a query
167
Example
• 8 dimensions (typical)
• 3-level hierarchy/dimension
• Number of group bys = 4^8 = 65,536 group bys!
• BUT, how many tuples in the cube?
– Depends on data distribution
– Worst case is uniform
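The count of group bys follows directly from the derivation: each dimension contributes one choice for "absent" (ALL) plus one per hierarchy level. A small sketch, with a hypothetical function name:

```python
from math import prod


def count_group_bys(hierarchy_lengths):
    """(1 + L_1)(1 + L_2)...(1 + L_N) for the given hierarchy lengths L_i."""
    return prod(1 + length for length in hierarchy_lengths)


print(count_group_bys([3] * 8))      # 65536: 8 dimensions, 3 levels each
print(count_group_bys([1, 1, 1]))    # 8: flat cube on product, store, quarter
print(count_group_bys([1, 1]))       # 4: flat cube on product, store
```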
?
product
customer
168
Upper bound on the size of each group by
169
Example gb(customer,product)
• Assume I have 1000 customers and 50 products
• Assume uniform distribution (customers buy
products with same probability)
– There can be 1000 x 50 combinations of pairs
(customer, product) in the fact table (sales)
– Thus, 50000 records in gb(customer,product) (at most)
• Each record in this gb is derived from a real sale
– There cannot be an aggregated record if there are no
base records in the fact table to support it
• Thus, there cannot be more records in the gb
than the number of actual sales in the fact table
170
Example
• Consider R(product,store,quarter,amt) with 1M records
• 10,000 products, 30 stores, 4 quarters
• Let G(x,y) denote the maximum number of records in group
by x,y
– G(product,store,quarter)=min(1M,10000*30*4)=1,000,000
– G(product,store)=min(1M,10000*30)=300,000
– G(product,quarter)=min(1M,10000*4)=40,000
– G(store,quarter)=min(1M,30*4)=120
– G(product)=min(1M,10000)=10,000
– G(store)=min(1M,30)=30
– G(quarter)=min(1M,4)=4
– G(none)=1
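The eight bounds above can be generated mechanically. The sketch assumes independent dimensions (no correlations or functional dependencies) and uses a hypothetical helper that applies min(number of facts, product of distinct counts) to every subset of dimensions:

```python
from itertools import combinations
from math import prod


def max_group_by_sizes(cardinalities, n_facts):
    """Upper bound on the number of records of every group by in the cube."""
    dims = list(cardinalities)
    bounds = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            combos = prod(cardinalities[d] for d in subset)   # 1 for "none"
            bounds[subset] = min(n_facts, combos)
    return bounds


G = max_group_by_sizes({"product": 10_000, "store": 30, "quarter": 4}, 1_000_000)
print(G[("product", "store", "quarter")])   # 1000000
print(G[("store", "quarter")])              # 120
print(sum(G.values()))                      # 1350155 records in the whole cube
```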
171
Quick and Dirty Upper Bound
MAX-SIZE ≤ (1+t_1)(1+t_2)(1+t_3) = 10001 × 31 × 5 = 1,550,155
where t_i is the number of distinct values of dimension i
172
Data Cube: Multidimensional View
Total annual sales
Quarter of DVDs in America
1Qtr 2Qtr 3Qtr 4Qtr ALL
DVD
PC America
VCR
ALL
Europe
Region
Asia
ALL
174
Correlated Attributes
• In practice there is some correlation between
different dimensions
• Example 1: each store sells up to 1,000
products (specialized stores)
• Example 2: some products are not sold
through-out the year
– Ice cream, watermelon, snow-chains
175
Solve Example-1
• R(product,store,customer) with 1M records
• 1,000 products, 20 stores, 100 customers
• Each customer buys from only one store
(closest)
– Functional Dependency: customer → store
G(store,customer)=min(1M,1*100)=100
G(product,store,customer)=min(1M,1000*1*100)
=100,000
176
More realistic example
• 100,000 parts
• 20,000 customers
• 2,000 suppliers
• 5 years (=365 *5 days)
• 100 stores
• 1,000 sales persons
178
What to Materialize?
• Data Cube extremely large for many
applications
• Store in warehouse results useful for
common queries
• Example:
– Total sales per product, store
– Max sales per product
– Avg sales per store,day
–…
179
Materialization Factors
• Type/Frequency of Queries
– Examine frequently occurring query patterns
• Query Response Time
– Aim to expedite long-running queries
• Storage cost
– Evaluate disk space required to hold materialized
results
• Update cost
– Materialized results need to be refreshed following
star schema updates
180
MATERIALIZED VIEWS
Preliminaries
• We will consider solutions that
selectively materialize certain group bys
in the Data Cube
– We will be referring to the group bys as
“views”
– When a group by is materialized we will call
it a “materialized view”
182
Views in OLTP databases
Employee(ename, age, dept, address, telno, salary)
• Views are derived tables
– Instance of view is generated on demand by
executing the view query:
create view V as
select ename, age, address, telno
from employee
where employee.dept = 'Sales'
• Views have many uses
– Shortcuts for complex queries
– Logical-physical independence
– Hide details from the end-user
– Integration systems
183
Materialized Views (OLAP)
• Sometimes, we may want to compute and store the
content of the view in the database
– Such Views are called materialized
– Queries on the materialized view instance will be much
faster
– Materialized views are now supported by some vendors
• Otherwise, we will be storing their data in regular tables
• This is our extended architecture:
Data Warehouse=
detailed records (star schema) + aggregates (materialized views)
• Problem is NP-hard
186
View Selection Problem: Heuristic
• Use some notion of benefit per view considering the
interdependencies illustrated in the Data Cube lattice
[Figure: data cube lattice: product,store,quarter → {product,quarter; store,quarter; product,store} → {quarter; product; store} → none]

group by(product,store)
product  store  sum(amt)
p1       s1     56
p1       s2     4
p1       s3     50
p2       s1     11
p2       s2     8

Regardless of the specific computation method (such as
sorting, hashing, etc.), queries on the GROUP BYs
(product,store), (product), (store) and none
can be answered from a materialized
view on the grouping attributes (product, store)
187
A simple greedy algorithm
• Employ a benefit criterion to evaluate and compare the potential advantages of
different views. Select the one with the highest benefit at each step.
• Assume V represents the set of views that have been selected thus far, reflecting
the current state.
– Let v be a candidate view under consideration, which is not currently included in set V.
– Benefit(v) = cost of answering queries using V – cost of answering queries using V ∪ {v}
• Assesses the reduction in the cost associated with answering queries if the candidate view, v, is
materialized
• The utilization of view v may potentially result in a decrease in the cost of certain queries, although it
is also possible that no cost reduction would occur.
• Thus, Benefit(v) ≥ 0
• Simple Greedy algorithm:
– At each iteration, select the view that offers the highest benefit among the available options.
– Re-compute benefits of remaining views
– Update space budget B, set B=B-sizeof(v)
– Remove views that do not fit in new budget B
– Stop if no more space available or no view fits in the remaining space or remaining views
provide no benefit (query cost reduction)
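The steps above can be sketched in Python. The sketch assumes the linear cost model used in this lecture (a query costs size(v) when answered from the smallest selected view that contains its attributes, otherwise the cost of the fact table) and plugs in the view sizes, query set, fact-table cost of 100 I/O and budget of 100 blocks from the simple example. The function name, the frozenset encoding of views and the tie-break on smaller size are assumptions made for illustration:

```python
def greedy_view_selection(sizes, queries, base_cost, budget):
    """Pick views greedily by benefit until nothing fits or nothing helps.

    sizes:   {view (frozenset of attributes): size in blocks}
    queries: list of frozensets, one per group-by query
    """
    def query_cost(q, views):
        # A view can answer q if it contains all of q's grouping attributes.
        usable = [sizes[v] for v in views if q <= v]
        return min(usable + [base_cost])

    def total_cost(views):
        return sum(query_cost(q, views) for q in queries)

    selected = []
    candidates = {v for v in sizes if sizes[v] <= budget}
    while candidates:
        current = total_cost(selected)
        benefit = {v: current - total_cost(selected + [v]) for v in candidates}
        best = max(candidates, key=lambda v: (benefit[v], -sizes[v]))
        if benefit[best] <= 0:
            break                                   # no view reduces query cost
        selected.append(best)
        budget -= sizes[best]                       # B = B - sizeof(v)
        candidates = {v for v in candidates
                      if v != best and sizes[v] <= budget}
    return selected, total_cost(selected)


fs = frozenset   # views and queries are sets of p(roduct), s(tore), q(uarter)
sizes = {fs("psq"): 80, fs("pq"): 25, fs("sq"): 13, fs("ps"): 60,
         fs("p"): 4, fs("s"): 3, fs("q"): 1, fs(""): 1}
queries = [fs("ps"), fs("sq"), fs("pq"), fs("p"), fs("s")]

chosen, cost = greedy_view_selection(sizes, queries, base_cost=100, budget=100)
print([sorted(v) for v in chosen], cost)
```

With these inputs the loop picks (s,q), then (p,q), then (p,s), leaving 2 blocks of space, and the total query cost drops from 500 to 136 I/Os.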
188
Simple Example
• Star schema with three dimensions and one measure
– Product (p), Store location (s), Quarter (q), amount (amt)
– Fact table: SALES(product, store, quarter, amt)
• Assume the following set of queries
– Q = {(p,s),(s,q), (p,q), (p),(s)}
– Notation (1): (s,q) is a query on group by (store,quarter), i.e.
(s,q): SELECT store, quarter, sum(amt)
FROM SALES
GROUP BY store, quarter
– Notation (2): View vstore,quarter is a materialized view containing
the result of the previous query
189
Query computation cost
• For ease of presentation, let us assume that
each query can be computed from the fact
table SALES with the same cost 100 I/O
(s,q): SELECT store, quarter, sum(amt)
FROM SALES
GROUP BY store, quarter
190
Data Cube result size
• Assume each group by in the Data Cube
requires the depicted number of blocks, when
stored as a materialized view
Group by                 Size (blocks)
product,store,quarter    80
product,quarter          25
store,quarter            13
product,store            60
quarter                  1
product                  4
store                    3
none                     1
e.g., size(v_product,store) = 60 blocks
191
Assumption (linear cost model)
• A group by query is computable from an
ancestor materialized view v with Cost=size(v)
Computation for (s,q) from SALES:
(s,q): SELECT store, quarter, sum(amt)
FROM SALES
GROUP BY store, quarter
Cost = 100 I/O
Alternative computation for (s,q):
SELECT store, quarter, sum(amt)
FROM v_{product,store,quarter}
GROUP BY store, quarter
Cost = 80 I/O
[Figure: Data Cube lattice with view sizes in blocks: product,store,quarter 80; product,quarter 25; store,quarter 13; product,store 60; quarter 1; product 4; store 3; none 1]
View Selection Problem
• Minimize the cost of answering the queries
in Q when the available space is B = 100 blocks
[Figure: Data Cube lattice with view sizes in blocks: product,store,quarter 80; product,quarter 25; store,quarter 13; product,store 60; quarter 1; product 4; store 3; none 1]
Initial Benefits
(no view is materialized yet, V = {})
  Group By (Materialized View)   Benefit for Q = {(p,s),(s,q),(p,q),(p),(s)}
  p,s,q                          (100-80)+(100-80)+(100-80)+(100-80)+(100-80) = 100
  p,q                            2*(100-25) = 150
  s,q                            2*(100-13) = 174
  p,s                            3*(100-60) = 120
  p                              100-4 = 96
  s                              100-3 = 97
  q                              0
  none                           0
First Iteration
• Materialize view v_{s,q} (highest benefit: 174)
• Update space budget B = 100-13 = 87
• Recompute benefits (next slide)
Updated Benefits, V = {v_{s,q}}
• Space = 87
Updated Benefits, V = {v_{s,q}, v_{p,q}}
• Space = 62
Updated Benefits, V = {v_{s,q}, v_{p,q}, v_{p,s}}
• Space = 2
[Figure: Data Cube lattice with view sizes in blocks: product,quarter 25; store,quarter 13; product,store 60; quarter 1; product 4; store 3; none 1]
Considerations
• To account for the varying sizes of views, it is advisable to select
views based on their amortized benefit.
– amortizedBenefit(v) = (cost of answering queries using V – cost of
answering queries using V ∪ {v}) / size(v)
• Or dynamically materialize views while answering user queries!
– DynaMat: A Dynamic View Management System for Data Warehouses.
Y. Kotidis, N. Roussopoulos. In Proceedings of ACM SIGMOD
International Conference on Management of Data, pages 371-382,
Philadelphia, Pennsylvania, June 1999.
– Smart-Views: Decentralized OLAP View Management using
Blockchains.
K. Messanakis, P. Demetrakopoulos, Y. Kotidis. In Proceedings of the
23rd International Conference on Big Data Analytics and Knowledge
Discovery (DaWaK 2021), September 27-30, Linz, Austria, 2021.
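To see how the amortized benefit changes the ranking, the following Python sketch recomputes the initial benefits of the running example (V = {}) and divides each by the view size. It is only an illustration under the assumptions of the example (linear cost model, 100 I/Os per query on the fact table); the helper names are hypothetical.

```python
FACT_COST = 100  # assumed cost of any query on the fact table SALES

# View sizes (blocks) and queries from the running example.
SIZES = {"p,s,q": 80, "p,q": 25, "s,q": 13, "p,s": 60,
         "p": 4, "s": 3, "q": 1, "none": 1}
QUERIES = ["p,s", "s,q", "p,q", "p", "s"]


def attrs(name):
    """Turn a group-by name such as 'p,s' into the set of its attributes."""
    return set() if name == "none" else set(name.split(","))


def query_cost(query, views):
    usable = [SIZES[v] for v in views if attrs(query) <= attrs(v)]
    return min([FACT_COST] + usable)


def benefit(view, selected):
    before = sum(query_cost(q, selected) for q in QUERIES)
    after = sum(query_cost(q, selected + [view]) for q in QUERIES)
    return before - after


def amortized_benefit(view, selected):
    return benefit(view, selected) / SIZES[view]


by_benefit = sorted(SIZES, key=lambda v: benefit(v, []), reverse=True)
by_amortized = sorted(SIZES, key=lambda v: amortized_benefit(v, []), reverse=True)
print("highest benefit:", by_benefit[0])
print("highest amortized benefit:", by_amortized[0])
```

Under these assumptions the plain benefit favors the (s,q) view, while the amortized benefit favors the much smaller (s) view, so the two criteria can lead to different selections for the same space budget.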
Query costs for this selection of
materialized views
• Q = {(p,s),(s,q), (p,q), (p),(s)}
– Cost(p,s) = 60
– Cost(s,q) = 13
– Cost(p,q) = 25
– Cost(p) = 25
– Cost(s) = ?
[Figure: Data Cube lattice with view sizes in blocks: product,store,quarter 80; product,quarter 25; store,quarter 13; product,store 60; quarter 1; product 4; store 3; none 1]
Benefit of using Materialized Views
(following the assumptions of this running example)
Q = {(p,s),(s,q), (p,q), (p),(s)}
The View Update problem
Materialized View: Vsc
  S1 C3 $240
  S2 C1 $190
  S2 C3 $450
Table Deltas (new records to be appended in the fact table):
  S1 C2 P3 $15
  S1 C1 P1 $50
Choice 1:
Re-compute from fact table
• First update fact table (append new facts)
• Then re-execute SQL query to obtain view
In SQL:
[Figure: view Vsc (S1 C3 $240; S2 C1 $190; S2 C3 $450) and delta records (S1 C2 P3 $15; S1 C1 P1 $50; S2 C1 P3 $20)]
Step 1: Increment existing combinations
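-- Add each group's delta total to the matching row of the view
update Vsc
set m = m + (select sum(price)
             from Delta
             where Vsc.store = Delta.store
               and Vsc.customer = Delta.customer)
-- touch only (store, customer) combinations that occur in Delta
where (Vsc.store, Vsc.customer) in
      (select store, customer from Delta);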
Step 2: Add new combinations
-- Insert (store, customer) combinations that appear only in Delta
insert into Vsc
select store, customer, sum(price)
from Delta
where (store, customer) not in
      (select store, customer from Vsc)
group by store, customer;
Choice 2: Alternative
• Idea: append the delta records to the view, aggregate the result into a new
table that holds the updated records, then rename the new table to replace the view
Simple Example
  After insertion of deltas    Final View
  S1 C2 $700                   S1 C1 $50
  S1 C3 $240                   S1 C2 $770
  S2 C1 $190                   S1 C3 $240
  S2 C3 $450                   S2 C1 $210
  S1 C1 $50                    S2 C3 $450
  S1 C2 $70
  S2 C1 $20
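The two update strategies can be compared on the rows of this example with a short Python sketch. It is an illustration only: the view is modeled as a dictionary keyed by (store, customer), the deltas as (store, customer, price) tuples, and both function names are hypothetical. The first function mirrors the two-step incremental update (increment existing combinations, then add new ones); the second mirrors the alternative of appending the deltas and re-aggregating into a new table.

```python
from collections import defaultdict

# View rows and delta rows copied from the example above.
view = {("S1", "C2"): 700, ("S1", "C3"): 240,
        ("S2", "C1"): 190, ("S2", "C3"): 450}
delta = [("S1", "C1", 50), ("S1", "C2", 70), ("S2", "C1", 20)]


def incremental_update(view, delta):
    """Two-step update: increment existing groups, then insert new groups."""
    updated = dict(view)
    delta_sums = defaultdict(int)
    for store, customer, price in delta:
        delta_sums[(store, customer)] += price
    for key, amount in delta_sums.items():
        if key in updated:
            updated[key] += amount   # Step 1: existing combination
        else:
            updated[key] = amount    # Step 2: new combination
    return updated


def append_and_regroup(view, delta):
    """Alternative: append deltas to the view rows, then aggregate into a new table."""
    rows = [(s, c, m) for (s, c), m in view.items()] + list(delta)
    new_view = defaultdict(int)
    for store, customer, amount in rows:
        new_view[(store, customer)] += amount
    return dict(new_view)


final_a = incremental_update(view, delta)
final_b = append_and_regroup(view, delta)
print(sorted(final_a.items()))
```

Both functions produce the same final view for a sum aggregate; which one is cheaper in a real system depends on the sizes of the view and of the delta.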
Multiple View Update
• Assume V2 is a descendant of V1 in the Data Cube lattice (e.g. V1 can be
used to compute V2)
[Figure: Fact, delta, Updated Fact, and views V1 and V2]
Scenario 1: Re-compute views after
finishing updating the Fact table
[Figure: Fact, delta, Updated Fact, views v1 and v2; label: re-compute]
Scenario 2: Re-compute v1 from Fact.
Then, recompute v2 from v1
[Figure: Fact, delta, Updated Fact, views v1 and v2; label: re-compute]
Scenario 3: Incrementally update v1
from delta then recompute v2 from v1
[Figure: Fact, delta, Updated Fact, views v1 and v2; labels: update, re-compute]
Scenario 4: Incrementally update both
v1 and v2 from delta
[Figure: Fact, delta, Updated Fact, views v1 and v2; label: update]
Consider
• More scenarios?
Create New Fact Table (= this view)
• Aggregated Fact table for new View: Category, CityName, Sum_Quantity, Sum_TotalPrice
• Dimension table Category: Category, CategoryDescr
• Dimension table City: CityName, State, Country
Using Materialized Views through
Selection
• A query can use a view through a selection if
– Each selection condition C on each dimension d in the
query logically implies a condition C’ on dimension d
in the view
• Example: A view has sum(sales) by product and
by year for products introduced after 1991
– OK to use for sum(sales) by product for products
introduced after 1992
– CANNOT use for sum(sales) for products introduced
after 1989
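The implication test in this example can be sketched in Python. The sketch assumes, purely for illustration, that every selection condition has the form "attribute greater than a bound" (as in "introduced after 1991"); the function name and the dictionary representation are hypothetical and not part of the slides.

```python
def usable_through_selection(query_bounds, view_bounds):
    """Return True if every view condition is implied by the query's condition.

    Both arguments map a dimension name to an exclusive lower bound,
    e.g. {"introduced": 1991} stands for "introduced after 1991".
    """
    for dim, view_lower in view_bounds.items():
        query_lower = query_bounds.get(dim)
        # A query that does not restrict this dimension, or restricts it less
        # tightly than the view, needs rows that the view has filtered out.
        if query_lower is None or query_lower < view_lower:
            return False
    return True


view = {"introduced": 1991}
print(usable_through_selection({"introduced": 1992}, view))
print(usable_through_selection({"introduced": 1989}, view))
```

The first call corresponds to the "OK to use" case (after 1992 implies after 1991) and the second to the "CANNOT use" case (after 1989 does not imply after 1991).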
Using Materialized Views through
Group By (Roll Up)
• The view V may be applicable via roll-up if for
every grouping attribute g of the query Q:
– Q has Group By a1, ..., g, ..., an
– V has Group By a1, ..., h, ..., an
– Attribute g is at the same level as h or higher in the attribute hierarchy
– Aggregation functions are distributive (sum, count,
max, etc.)
• Example: Compute “sum(sales) by category” from
the view “sum(sales) by product”
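A roll-up of this kind can be sketched in Python as follows. The product names, categories and sales figures are made up for illustration, and `roll_up` is a hypothetical helper; the point is only that a distributive aggregate such as sum can be obtained by summing the finer-grained sums stored in the view.

```python
from collections import defaultdict

# Hypothetical product -> category hierarchy and contents of the view
# "sum(sales) by product".
CATEGORY_OF = {"milk": "dairy", "cheese": "dairy", "apple": "fruit"}
sales_by_product = {"milk": 120, "cheese": 80, "apple": 50}


def roll_up(view, parent_of):
    """Aggregate a view to the next hierarchy level by summing partial sums."""
    result = defaultdict(int)
    for child, total in view.items():
        result[parent_of[child]] += total
    return dict(result)


sales_by_category = roll_up(sales_by_product, CATEGORY_OF)
print(sales_by_category)
```

The same approach works for count and max (summing counts, taking the max of maxima), but not directly for non-distributive functions such as median.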
Using Views
• Need cost-based optimization to decide which view(s) to use
for answering a query
– Consider a query on (category, state) and three materialized
aggregate views on
1. (product, state)
2. (category, city)
3. (category, country)
– Only (product, state) and (category, city) are candidate materialized views
to answer the query; (category, country) is too coarse, since state
cannot be derived from country
[Figure: lattice fragment showing the query (category,state) and the views (product,state), (category,city) and (category,country)]
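A cost-based choice among the candidate views can be sketched in Python. The hierarchy levels follow the example (product below category; city below state below country), but the view sizes are invented for illustration and the function names are hypothetical. The sketch also assumes the linear cost model used earlier, where answering a query from a view costs the size of that view.

```python
# Position of each attribute in its dimension hierarchy (0 = most detailed).
LEVEL = {"product": 0, "category": 1, "city": 0, "state": 1, "country": 2}

# Hypothetical view sizes in blocks; not taken from the slides.
VIEW_SIZES = {
    ("product", "state"): 5000,
    ("category", "city"): 1200,
    ("category", "country"): 40,
}


def can_answer(view, query):
    """A view is usable if it is at least as detailed as the query on every dimension."""
    return all(LEVEL[v] <= LEVEL[q] for v, q in zip(view, query))


def cheapest_view(query, view_sizes):
    candidates = [v for v in view_sizes if can_answer(v, query)]
    if not candidates:
        return None  # fall back to the fact table
    return min(candidates, key=view_sizes.get)


query = ("category", "state")
candidates = [v for v in VIEW_SIZES if can_answer(v, query)]
best = cheapest_view(query, VIEW_SIZES)
print(candidates, best)
```

With these made-up sizes the smallest view, (category, country), is rejected because it is too coarse, and (category, city) is chosen over (product, state) because it is smaller.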
Note
• The following slides are outside the course syllabus
Data Cube Storage and Indexing
• Several approaches within the relational world
– Cubetrees, QC-trees, Dwarf, CURE
• Main idea: exploit inherent redundancy of
multidimensional aggregates
The Dwarf (SIGMOD 2002)
• Data-Driven DAG
– Factors out inter-view redundancies
– 100% accurate (no approximation)
– All views are included
– Indexes for free
– Partial materialization possible
• Look at the Data Cube Records
– Common Prefixes
• high in dense areas
– Common Suffixes
• extremely high in sparse areas
Redundancy in the Cube (1)
• Common Prefixes
– Data Cube records sharing the prefix S2,C1:
  S2,C1,P1,90
  S2,C1,P2,50
  S2,C1,ALL,140
[Figure: fact table with columns Store, Customer, Product, Price]
Redundancy in the Cube (2)
• Common Suffixes
– Data Cube records sharing the suffix P1,90:
  S2,C1,P1,90
  S2,ALL,P1,90
  ALL,C1,P1,90
[Figure: fact table with columns Store, Customer, Product, Price]
Dwarf Example
[Figure: Dwarf DAG with a Store level (node 1: S1, S2), a Customer level (nodes 2, 6, 8) and a Product level (node 3)]
Fact table:
  Store  Customer  Product  Price
  S1     C2        P2       $70
  S1     C3        P1       $40
  S2     C1        P1       $90
  S2     C1        P2       $50
Dwarf Example
[Figure: the Dwarf DAG and fact table of the previous slide]
Group-by Product:
  Store  Customer  Product  Sum(Price)
  ALL    ALL       P1       $130
  ALL    ALL       P2       $120
Dwarf Example
[Figure: the Dwarf DAG and fact table of the previous slide]
Group-by Store:
  Store  Customer  Product  Sum(Price)
  S1     ALL       ALL      $110
  S2     ALL       ALL      $140