Data Warehousing, OLAP
and Data Mining
1
Acknowledgments
A. Balachandran
Anand Deshpande
Sunita Sarawagi
S. Seshadri
2
Overview
Part 1: Data Warehouses
Part 2: OLAP
Part 3: Data Mining
Part 4: Query Processing and Optimization
3
Part 1: Data Warehouses
4
Data, Data everywhere
yet ...
I can’t find the data I need
data is scattered over the network
many versions, subtle differences
I can’t get the data I need
need an expert to get the data
I can’t understand the data I
found
available data poorly documented
I can’t use the data I found
results are unexpected
data needs to be transformed from
one form to other
5
What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety of
different sources made
available to end users in a
way they can understand
and use in a business
context.
[Barry Devlin]
6
Why Data Warehousing?
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
7
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and
can be ad-hoc
Used by managers and end-users to
understand the business and make
judgements
8
Evolution of Decision
Support
60’s: Batch reports
hard to find and analyze information
inflexible and expensive, reprogram every request
70’s: Terminal based DSS and EIS
80’s: Desktop data access and analysis tools
query tools, spreadsheets, GUIs
easy to use, but access only operational db
90’s: Data warehousing with integrated OLAP
engines and tools
9
What are the users
saying...
Data should be integrated
across the enterprise
Summary data had a real
value to the organization
Historical data held the key to
understanding data over time
What-if capabilities are
required
10
Data Warehousing --
It is a process
Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
This enables decisions that
were not previously possible
A decision support database
maintained separately from
the organization’s operational
database
11
Traditional RDBMS used
for OLTP
Database Systems have been used
traditionally for OLTP
clerical data processing tasks
detailed, up to date data
structured repetitive tasks
read/update a few records
isolation, recovery and integrity are critical
We will call these operational systems
12
OLTP vs Data Warehouse
OLTP                                      Warehouse (DSS)
Application oriented                      Subject oriented
Used to run business                      Used to analyze business
Clerical user                             Manager/analyst
Detailed data                             Summarized and refined
Current, up to date                       Snapshot data
Isolated data                             Integrated data
Repetitive access by small transactions   Ad-hoc access using large queries
Read/update access                        Mostly read access (batch update)
13
Data Warehouse
Architecture
[Figure: relational databases, legacy data, and purchased data pass through extraction and cleansing and an optimized loader into the data warehouse engine, which serves analysis and query tools; a metadata repository sits underneath.]
14
From the Data Warehouse
to Data Marts
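[Figure: pyramid running from data (bottom) to information (top). Bottom: organizationally structured data, the data warehouse. Middle: departmentally structured data. Top: individually structured data. The warehouse level holds more history and more normalized, detailed data; the upper levels hold less.]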
15
Users have different views
of Data
Tourists: Browse information harvested by farmers
Farmers: Harvest information from known access paths
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
[Figure labels: OLAP; organizationally structured data]
16
Wal*Mart Case Study
Founded by Sam Walton
One of the largest supermarket chains in
the US
Wal*Mart: 2000+ retail stores
Sam's Clubs: 100+ wholesale stores
This case study is from Felipe Carino’s (NCR Teradata)
presentation made at Stanford Database Seminar
17
Old Retail Paradigm
Wal*Mart
Inventory Management
Merchandise Accounts Payable
Purchasing
Supplier Promotions: National, Region, Store Level
Suppliers
Accept Orders
Promote Products
Provide Special Incentives
Monitor and Track the Incentives
Bill and Collect Receivables
Estimate Retailer Demands
18
New (Just-In-Time) Retail
Paradigm
No more deals
Shelf-Pass Through (POS Application)
One Unit Price
Suppliers paid once a week on ACTUAL items sold
Wal*Mart Manager
Daily Inventory Restock
Suppliers (sometimes SameDay) ship to Wal*Mart
Warehouse-Pass Through
Stock some Large Items
Delivery may come from supplier
Distribution Center
Supplier’s merchandise unloaded directly onto Wal*Mart Trucks
19
Information as a Strategic
Weapon
Daily Summary of all Sales Information
Regional Analysis of all Stores in a logical area
Specific Product Sales
Specific Supplier Sales
Trend Analysis, etc.
Wal*Mart uses information when negotiating
with
Suppliers
Advertisers etc.
20
Schema Design
Database organization
must look like business
must be recognizable by business user
approachable by business user
Must be simple
Schema Types
Star Schema
Fact Constellation Schema
Snowflake schema
21
Star Schema
A single fact table and for each dimension one
dimension table
Does not capture hierarchies directly
[Figure: fact table (date, custno, prodno, cityname, sales) at the center, linked to the dimension tables time, prod, cust, and city.]
22
Dimension Tables
Dimension tables
Define business in terms already familiar to
users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
heavily indexed
typical dimensions
time periods, geographic region (markets, cities),
products, customers, salesperson, etc.
23
Fact Table
Central table
Typical example: individual sales records
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a billion)
Access via dimensions
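A minimal sketch of "access via dimensions", assuming a star schema whose fact table uses the columns named on the Star Schema slide (date, custno, prodno, cityname, sales). The dimension attributes (category, state), the sample rows, and the use of SQLite are illustrative assumptions, not part of the original slides.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one table per dimension.
# Only two of the dimensions are modelled to keep the sketch short.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE fact (date TEXT, custno INTEGER,
                       prodno INTEGER REFERENCES prod(prodno),
                       cityname TEXT REFERENCES city(cityname),
                       sales REAL);
    INSERT INTO prod VALUES (1, 'Cola', 'Drinks'), (2, 'Soap', 'Toiletries');
    INSERT INTO city VALUES ('Pune', 'Maharashtra'), ('Mumbai', 'Maharashtra');
    INSERT INTO fact VALUES
        ('1999-01-05', 10, 1, 'Pune',   100.0),
        ('1999-01-06', 11, 1, 'Mumbai', 150.0),
        ('1999-01-06', 10, 2, 'Pune',    40.0);
""")

# Join the narrow fact rows to the descriptive dimension rows via foreign keys,
# then group by dimension attributes.
rows = conn.execute("""
    SELECT p.category, c.state, SUM(f.sales)
    FROM fact f
    JOIN prod p ON p.prodno = f.prodno
    JOIN city c ON c.cityname = f.cityname
    GROUP BY p.category, c.state
    ORDER BY p.category
""").fetchall()
print(rows)
```

The query never filters on the fact table directly; all selection and grouping goes through dimension attributes, which is why the dimension tables are the ones that get heavily indexed.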
24
Snowflake schema
Represent dimensional hierarchy directly by
normalizing tables.
Easy to maintain and saves storage
[Figure: fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust, and city, with city further normalized into a region table.]
25
Fact Constellation
Multiple fact tables that share many
dimension tables
Booking and Checkout may share many
dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout, with dimension tables Promotion, Hotels, Travel Agents, Room Type, and Customer.]
26
Data Granularity in
Warehouse
Summarized data stored
reduce storage costs
reduce cpu usage
increases performance since smaller number
of records to be processed
design around traditional high level reporting
needs
tradeoff with volume of data to be stored
and detailed usage of data
27
Granularity in Warehouse
Solution is to have dual level of
granularity
Store summary data on disks
95% of DSS processing done against this data
Store detail on tapes
5% of DSS processing against this data
28
Levels of Granularity
Banking Example
Operational, 60 days of activity: account, activity date, amount, teller, location, account bal
Monthly account register, up to 10 years: account, month, # trans, withdrawals, deposits, average bal
Activity detail: account, activity date, amount, account bal
Not all fields need be archived
29
Data Integration Across
Sources
Savings: same data, different name
Loans: different data, same name
Trust: data found here, nowhere else
Credit card: different keys, same data
30
Data Transformation
Operational/source data: sequential, legacy, relational, external
Data transformation: accessing, capturing, extracting, householding, filtering, reconciling, conditioning, loading, validating, scoring
Data transformation is the foundation
for achieving single version of the truth
Major concern for IT
Data warehouse can fail if appropriate
data transformation strategy is not
developed
31
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
mumbai, bombay
Different account numbers generated by different
applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
manual entry leads to mistakes
“in case of a problem use 9999999”
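The problems listed above can be illustrated with a small cleansing sketch. Everything here is an assumption for illustration: the alias table, the sentinel set, the 0.8 similarity threshold, and the function names are hypothetical, and `difflib` similarity is only one crude way to spot spelling variants.

```python
import difflib

# Hypothetical reference data; a real warehouse would maintain these centrally.
CITY_ALIASES = {"bombay": "mumbai"}
SENTINEL_CODES = {"9999999"}   # the "in case of a problem" value from the slide

def canonical_city(name):
    """Map alternative city names onto one canonical spelling."""
    key = name.strip().lower()
    return CITY_ALIASES.get(key, key)

def is_suspect_product_code(code):
    """Flag blank required fields and the sentinel code entered at point of sale."""
    return code is None or code.strip() == "" or code.strip() in SENTINEL_CODES

def likely_same_surname(a, b, threshold=0.8):
    """Crude spelling-variant check based on difflib's similarity ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(canonical_city("Bombay"), is_suspect_product_code("9999999"),
      likely_same_surname("Agarwal", "Agrawal"))
```

A similarity check like this only proposes candidate matches; deciding that two records really describe the same person usually needs further evidence such as address or account history.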
32
Data Transformation
Terms
Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating
33
Data Transformation
Terms
Householding
Identifying all members of a household
(living at the same address)
Ensures only one mail is sent to a household
Can result in substantial savings: 1 million
catalogues at $50 each cost $50 million. A
2% saving would be worth $1 million
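A minimal sketch of householding, assuming customers arrive as (name, address) pairs and that lower-casing and collapsing whitespace is enough to match addresses. Both are simplifying assumptions, and the sample customers are made up. The last lines reproduce the slide's arithmetic.

```python
from collections import defaultdict

def household_mailing_list(customers):
    """Group customers by normalized address and keep one addressee per household."""
    households = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().split())   # assumed normalization rule
        households[key].append(name)
    # One mail piece per household, addressed to the first member seen.
    return {addr: members[0] for addr, members in households.items()}

customers = [
    ("A. Rao", "12 MG Road, Pune"),
    ("S. Rao", "12  mg road, pune"),
    ("K. Shah", "7 Park Street, Calcutta"),
]
mailing = household_mailing_list(customers)
print(len(customers), "customers ->", len(mailing), "mail pieces")

# The slide's figures: 1 million catalogues at $50 each, 2% fewer mailings.
total_cost = 1_000_000 * 50
savings = total_cost * 2 // 100
print(total_cost, savings)
```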
34
Refresh
Propagate updates on source data to the
warehouse
Issues:
when to refresh
how to refresh -- incremental refresh
techniques
35
When to Refresh?
periodically (e.g., every night, every
week) or after significant events
on every update: not warranted unless
warehouse users require current data (e.g.,
up-to-the-minute stock quotes)
refresh policy set by administrator based
on user needs and traffic
possibly different policies for different
sources
36
Refresh techniques
Incremental techniques
detect changes on base tables: replication
servers (e.g., Sybase, Oracle, IBM Data
Propagator)
snapshots (Oracle)
transaction shipping (Sybase)
compute changes to derived and summary
tables
maintain transactional correctness for
incremental load
37
How To Detect Changes
Create a snapshot log table to record ids
of updated rows of source data and
timestamp
Detect changes by:
Defining after row triggers to update
snapshot log when source table changes
Using regular transaction log to detect
changes to source data
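The trigger-based option can be sketched as follows, using SQLite purely as a stand-in for the source database. The table names (`account`, `snapshot_log`) and trigger names are hypothetical; a production system would more likely rely on its own replication or log-reading facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    -- Snapshot log: ids of changed source rows plus a timestamp.
    CREATE TABLE snapshot_log (row_id INTEGER, changed_at TEXT);

    CREATE TRIGGER account_ins AFTER INSERT ON account
    FOR EACH ROW BEGIN
        INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
    END;
    CREATE TRIGGER account_upd AFTER UPDATE ON account
    FOR EACH ROW BEGIN
        INSERT INTO snapshot_log VALUES (NEW.id, datetime('now'));
    END;
""")
conn.execute("INSERT INTO account VALUES (1, 100.0)")
conn.execute("INSERT INTO account VALUES (2, 250.0)")
conn.execute("DELETE FROM snapshot_log")   # pretend the initial load was already refreshed
conn.execute("UPDATE account SET balance = 120.0 WHERE id = 1")

# The next incremental refresh only needs the rows named in the log.
changed_ids = [r[0] for r in conn.execute("SELECT DISTINCT row_id FROM snapshot_log")]
print(changed_ids)
```

Triggers add work to every source transaction, which is one reason reading the regular transaction log is often preferred when the source system allows it.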
38
Querying Data Warehouses
SQL Extensions
Multidimensional modeling of data
OLAP
More on OLAP later …
39
SQL Extensions
Extended family of aggregate functions
rank (top 10 customers)
percentile (top 30% of customers)
median, mode
Object Relational Systems allow addition
of new aggregate functions
Reporting features
running total, cumulative totals
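To make the listed aggregates concrete, here is a plain-Python sketch of what rank, percentile, median, and running total compute. The sales figures are invented, and this is not any vendor's SQL syntax.

```python
from itertools import accumulate
from statistics import median

# Hypothetical sales per customer.
sales = {"A": 500, "B": 120, "C": 900, "D": 300, "E": 700}

# rank: customers ordered by sales, highest first; top-N is a prefix of this list.
ranked = sorted(sales, key=sales.get, reverse=True)
top3 = ranked[:3]

# percentile: the top 40% of customers by sales.
top_40_percent = ranked[: max(1, len(ranked) * 40 // 100)]

# median of the sales values.
mid = median(sales.values())

# running (cumulative) total over customers in name order.
running = list(accumulate(sales[c] for c in sorted(sales)))

print(top3, top_40_percent, mid, running)
```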
40
Reporting Tools
Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist, ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data
41
Decision support tools
[Figure: decision support architecture, bottom to top.
Operational data: Bombay branch (Oracle), Delhi branch (IMS), Calcutta branch (SAS); external data: GIS data, Census data
Merge, clean, summarize
Data warehouse: relational DBMS+ (e.g. Redbrick) holding detailed transactional data
Tools: direct query, reporting tools (Crystal Reports), OLAP (Essbase), mining tools (Intelligent Miner)]
42
Deploying Data
Warehouses
What business information
keeps you in business today?
What business information can
put you out of business
tomorrow?
What business information
should be a mouse click away?
What business conditions are
driving the need for
business information?
43
Cultural Considerations
Not just a technology project
New way of using information
to support daily activities and
decision making
Care must be taken to prepare
organization for change
Must have organizational
backing and support
44
User Training
Users must have a higher level of IT
proficiency than for operational systems
Training to help users analyze data in the
warehouse effectively
45
Warehouse Products
Computer Associates -- CA-Ingres
Hewlett-Packard -- Allbase/SQL
Informix -- Informix, Informix XPS
Microsoft -- SQL Server
Oracle -- Oracle
Red Brick -- Red Brick Warehouse
SAS Institute -- SAS
Software AG -- ADABAS
Sybase -- SQL Server, IQ, MPP
46
Part 2: OLAP
47
Nature of OLAP Analysis
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
Need interactive response to aggregate queries
48
Multi-dimensional Data
Measure - sales (actual, plan, variance)
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product: Product -> Category -> Industry
Region: Office -> City -> Region -> Country
Time: Day -> Month or Week -> Quarter -> Year
[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N), and Month (1 to 7).]
49
Conceptual Model for
OLAP
Numeric measures to be analyzed
e.g. Sales (Rs), sales (volume), budget,
revenue, inventory
Dimensions
other attributes of data, define the space
e.g., store, product, date-of-sale
hierarchies on dimensions
e.g. branch -> city -> state
50
Operations
Rollup: summarize data
e.g., given sales data, summarize sales for
last year by product category and region
Drill down: get more details
e.g., given summarized sales as above, find
breakup of sales by city within each region, or
within the Andhra region
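Rollup and drill down can be sketched as the same aggregation run at different levels of a hierarchy. The detail records, city names, and the `aggregate` helper below are hypothetical; only the Andhra region and the category/region/city levels come from the slide.

```python
from collections import defaultdict

# Hypothetical detail records: (category, region, city, sales).
detail = [
    ("Soft drinks", "Andhra", "Hyderabad", 40),
    ("Soft drinks", "Andhra", "Vizag", 25),
    ("Soft drinks", "Kerala", "Kochi", 30),
    ("Soap", "Andhra", "Hyderabad", 10),
]

def aggregate(records, key):
    """Sum sales grouped by key(record)."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec[3]
    return dict(totals)

# Rollup: summarize sales by product category and region.
by_cat_region = aggregate(detail, lambda r: (r[0], r[1]))

# Drill down: break the Andhra figures up by city.
andhra = [r for r in detail if r[1] == "Andhra"]
andhra_by_city = aggregate(andhra, lambda r: (r[0], r[2]))

print(by_cat_region)
print(andhra_by_city)
```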
51
More Cube Operations
Slice and dice: select and project
e.g.: Sales of soft-drinks in Andhra over the last
quarter
Pivot: change the view of data
View 1 (size by quarter):
        Q1   Q2   Total
L       22   33    55
S       15   44    59
Total   37   77   114
View 2, after pivoting (colour by size):
        L    S    Total
Red     14    7    21
Blue    41   52    93
Total   55   59   114
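A pivot can be read as re-tabulating the same cells along a different pair of dimensions. The slide gives only the two summary tables, so the cell-level values below are hypothetical, chosen so that their marginals match both tables. The `pivot` helper is likewise an illustration.

```python
from collections import defaultdict

# Hypothetical cells (size, quarter, colour, amount) consistent with the slide's totals.
cells = [
    ("L", "Q1", "Red", 6),  ("L", "Q1", "Blue", 16),
    ("L", "Q2", "Red", 8),  ("L", "Q2", "Blue", 25),
    ("S", "Q1", "Red", 3),  ("S", "Q1", "Blue", 12),
    ("S", "Q2", "Red", 4),  ("S", "Q2", "Blue", 40),
]
DIMS = {"size": 0, "quarter": 1, "colour": 2}

def pivot(cells, row_dim, col_dim):
    """Cross-tabulate amount by two chosen dimensions, summing out the rest."""
    table = defaultdict(int)
    for cell in cells:
        table[(cell[DIMS[row_dim]], cell[DIMS[col_dim]])] += cell[3]
    return dict(table)

size_by_quarter = pivot(cells, "size", "quarter")   # first view
colour_by_size = pivot(cells, "colour", "size")     # pivoted view
print(size_by_quarter)
print(colour_by_size)
```

Both views sum to the same grand total of 114, since pivoting changes the presentation and not the underlying data.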
52
More OLAP Operations
Hypothesis driven search: E.g. factors
affecting defaulters
view defaulting rate on age aggregated over other
dimensions
for particular age segment detail along profession
Need interactive response to aggregate queries
=> precompute various aggregates
53
MOLAP vs ROLAP
MOLAP: Multidimensional array OLAP
ROLAP: Relational OLAP
Type Size Colour Amount
Shirt S Blue 10
Shirt L Blue 25
Shirt ALL Blue 35
Shirt S Red 3
Shirt L Red 7
Shirt ALL Red 10
Shirt ALL ALL 45
… … … …
ALL ALL ALL 1290
54
SQL Extensions
Cube operator
group by on all subsets of a set of attributes
(month,city)
redundant scan and sorting of data can be
avoided
Various other non-standard SQL
extensions by vendors
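What the cube operator computes can be sketched in a few lines: one group-by for every subset of the attributes, with the left-out attributes reported as ALL. The rows are the shirt figures from the MOLAP vs ROLAP slide. The `cube` function is an illustrative assumption, not a vendor implementation, and it makes a single pass over the data rather than one scan per group-by.

```python
from collections import defaultdict
from itertools import combinations

# Detail rows (type, size, colour, amount) taken from the ROLAP example table.
rows = [("Shirt", "S", "Blue", 10), ("Shirt", "L", "Blue", 25),
        ("Shirt", "S", "Red", 3),  ("Shirt", "L", "Red", 7)]
N_ATTRS = 3

def cube(rows):
    """Aggregate amount for every subset of attributes in one scan of the rows."""
    subsets = [kept for k in range(N_ATTRS + 1)
               for kept in combinations(range(N_ATTRS), k)]
    result = defaultdict(int)
    for row in rows:                       # single scan of the detail data
        for kept in subsets:               # update all 2^n group-bys at once
            key = tuple(row[i] if i in kept else "ALL" for i in range(N_ATTRS))
            result[key] += row[3]
    return dict(result)

c = cube(rows)
print(c[("Shirt", "ALL", "Blue")], c[("Shirt", "ALL", "ALL")])
```

With only shirts in the input, the (ALL, ALL, ALL) cell equals the shirt total of 45; the slide's 1290 includes product types that are elided there.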
55
OLAP: 3 Tier DSS
Data Warehouse (Database Layer): store atomic data in an industry standard data warehouse.
OLAP Engine (Application Logic Layer): generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
Decision Support Client (Presentation Layer): obtain multi-dimensional reports from the DSS client.
56
Strengths of OLAP
It is a powerful visualization
tool
It provides fast, interactive
response times
It is good for analyzing time
series
It can be useful to find
some clusters and outliers
Many vendors offer OLAP
tools
57
Brief History
Express and System W DSS
Online Analytical Processing: term coined by
E. F. Codd in 1993 in a white paper sponsored by
Arbor Software
Generally synonymous with earlier terms such as Decision
Support, Business Intelligence, Executive Information
System
MOLAP: Multidimensional OLAP (Hyperion (Arbor
Essbase), Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
MicroStrategy DSS Agent)
58
OLAP and Executive
Information Systems
Andyne Computing -- Pablo
Arbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander OLAP
Holistic Systems -- Holos
Information Advantage -- AXSYS, WebOLAP
Informix -- Metacube
MicroStrategy -- DSS Agent
Oracle -- Express
Pilot -- LightShip
Planning Sciences -- Gentium
Platinum Technology -- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS, OLAP++
Speedware -- Media
59
Microsoft OLAP strategy
Plato: OLAP server: powerful, integrating
various operational sources
OLE-DB for OLAP: emerging industry standard
based on MDX --> extension of SQL for OLAP
Pivot-table services: integrate with Office
2000
Every desktop will have OLAP capability.
Client side caching and calculations
Partitioned and virtual cube
Hybrid relational and multidimensional storage
60
Part 3: Data Mining
61
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely
to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are
most likely to leave for a competitor?
Data Mining helps extract such
information
62
Data mining
Process of semi-automatically analyzing
large databases to find interesting and
useful patterns
Overlaps with machine learning, statistics,
artificial intelligence and databases but
more scalable in number of features and
instances
more automated to handle heterogeneous
data
63
Some basic operations
Predictive:
Regression
Classification
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
64
Classification
Given old data about customers and
payments, predict new applicant’s loan
eligibility.
[Figure: records of previous customers (age, salary, profession, location, customer type) train a classifier, which produces decision rules (e.g. Salary > 5 L, Prof. = Exec); the rules are applied to a new applicant’s data to predict good/bad.]
65
Classification methods
Goal: Predict class Ci = f(x1, x2, ..., xn)
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci.
Nearest neighbour
Decision tree classifier: divide decision
space into piecewise constant regions.
Probabilistic/generative models
Neural networks: partition by non-linear
boundaries
66
Decision trees
Tree where internal nodes are simple
decision rules on one or more attributes
and leaf nodes are predicted class labels.
Salary < 1 M
    Prof = teacher
        Good
        Bad
    Age < 30
        Bad
        Good
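The example tree can be written as nested tests. This sketch assumes that the left child of each node is the "yes" branch, that the leaves read left to right, and that "1 M" means one million. The slide does not label its branches, so the direction of each test is a guess made for illustration.

```python
def classify(salary, profession, age):
    """Walk the example tree: internal nodes test one attribute, leaves are class labels.

    Branch directions are assumed, not taken from the slide.
    """
    if salary < 1_000_000:                       # root: Salary < 1 M
        return "Good" if profession == "teacher" else "Bad"
    return "Bad" if age < 30 else "Good"         # right subtree: Age < 30

print(classify(500_000, "teacher", 45))
print(classify(2_000_000, "exec", 25))
```

Each root-to-leaf path corresponds to one axis-parallel region of the attribute space, which is the piecewise constant behaviour mentioned under classification methods.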
67
Pros and Cons of decision
trees
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
• Cons
- Cannot handle complicated relationship between features
- Simple decision boundaries
- Problems with lots of missing data
More information:
http://www.stat.wisc.edu/~limt/treeprogs.html
68
Neural network
Set of nodes connected by directed
weighted edges
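Basic NN unit: inputs x1, x2, x3 with weights w1, w2, w3 and output
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$
A more typical NN: inputs x1, x2, x3 feed a layer of hidden nodes, which feeds the output nodes.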
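The basic unit is a weighted sum of the inputs passed through the sigmoid. A minimal sketch follows; the function names and example weights are made up for illustration.

```python
import math

def sigmoid(y):
    """Logistic function 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

print(unit_output([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))
print(unit_output([1.0, -1.0, 0.5], [2.0, 1.0, 0.0]))
```

A network with hidden nodes applies the same computation layer by layer, feeding each layer's outputs to the next as inputs.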
69
Pros and Cons of Neural
Network
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
• Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing number of nodes
Conclusion: Use neural nets only if decision trees/nearest neighbour fail.
70
Bayesian learning
Assume a probability model on generation
of data.
Apply Bayes’ theorem to find the most likely class as:
$$c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$
Naïve Bayes: assume attributes are conditionally
independent given the class value, so that
$$c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)$$
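A minimal naïve Bayes sketch under the independence assumption above. It estimates p(c_j) and p(a_i | c_j) by simple counting, and drops p(d) because it is the same for every class. The training examples, attribute values, and function names are hypothetical, and no smoothing is applied, so an unseen attribute value drives a class's score to zero.

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attributes_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(Counter)   # (attribute position, class) -> value counts
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            attr_counts[(i, label)][value] += 1
    return class_counts, attr_counts

def predict(model, attrs):
    """Return the class maximizing p(c_j) * product of p(a_i | c_j)."""
    class_counts, attr_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                                  # p(c_j)
        for i, value in enumerate(attrs):
            score *= attr_counts[(i, label)][value] / n    # p(a_i | c_j)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical (profession, salary band) -> loan outcome records.
examples = [
    (("exec", "high"), "good"), (("exec", "low"), "good"),
    (("teacher", "high"), "good"), (("teacher", "low"), "bad"),
    (("exec", "low"), "bad"),
]
model = train(examples)
print(predict(model, ("exec", "high")), predict(model, ("teacher", "low")))
```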
71
Clustering
Unsupervised learning when old data with class
labels is not available, e.g. when introducing a new
product.
Group/cluster existing customers based on time
series of payment history such that similar
customers are in the same cluster.
Key requirement: Need a good measure of
similarity between instances.
Identify micro-markets and develop policies for
each
72
Association rules
Given set T of groups of items
Example: set of item sets purchased
T = {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}
Goal: find all rules on itemsets of
the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of
b given a > user threshold c
Example: Milk --> bread
Purchase of product A --> service B
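Support and confidence as defined above can be computed directly by counting. The transactions are read from the slide's example set T; treating the stray "cereal" as a fourth, single-item transaction is an assumption about the extracted layout, and the function names are illustrative.

```python
# Example transactions; the fourth one is an assumed reading of the slide.
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """Estimated conditional probability of b given a (a must occur at least once)."""
    return support(a | b, transactions) / support(a, transactions)

print(support({"milk", "cereal"}, T))        # support of the rule milk --> cereal
print(confidence({"milk"}, {"cereal"}, T))   # confidence of milk --> cereal
```

A rule a --> b is reported only when both numbers exceed the user thresholds s and c. Practical miners avoid enumerating all itemsets by pruning those whose support is already below s.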
73
Variants
High confidence may not imply high
correlation
Use correlations: find the expected support
and treat large departures from it as
interesting.
See the statistical literature on contingency
tables.
Still too many rules, need to prune...
74
Prevalent ≠ Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining’s payoff is in finding surprising phenomena
[Figure: 1995: "Milk and cereal sell together!"; 1998: "Zzzz... Milk and cereal sell together!"]
75
What makes a rule
surprising?
Does not match prior expectation
Correlation between milk and cereal remains roughly constant over time
Cannot be trivially derived from simpler rules
Milk 10%, cereal 10%
Milk and cereal 10% … surprising
Eggs 10%
Milk, cereal and eggs 0.1% … surprising! Expected 1%
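The "expected" figures follow from assuming independence: the expected support of a combination is the product of the supports of its parts. A small sketch of that arithmetic follows. The ratio of observed to expected support is commonly called lift, a term the slide itself does not use, and the function names are illustrative.

```python
import math

def expected_support(*supports):
    """Support expected for the combination if its parts were independent."""
    return math.prod(supports)

def lift(observed, *supports):
    """Observed support divided by expected support; far from 1 suggests a surprise."""
    return observed / expected_support(*supports)

# Milk 10%, cereal 10%: independence predicts 1%, but 10% is observed.
print(expected_support(0.10, 0.10), lift(0.10, 0.10, 0.10))

# {Milk, cereal} 10%, eggs 10%: independence predicts 1%, but 0.1% is observed.
print(expected_support(0.10, 0.10), lift(0.001, 0.10, 0.10))
```

A lift well above 1 (milk and cereal) and a lift well below 1 (milk, cereal and eggs) are both departures from expectation, which is why the slide marks both as surprising.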
76
Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis
77
Data Mining in Use
The US Government uses Data Mining to track
fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers
78
Why Now?
Data is being produced
Data is being warehoused
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
79
Data Mining works with
Warehouse Data
Data Warehousing provides
the Enterprise with a memory
Data Mining provides the
Enterprise with intelligence
80
Mining market
Around 20 to 30 mining tool vendors
Major players:
Clementine,
IBM’s Intelligent Miner,
SGI’s MineSet,
SAS’s Enterprise Miner.
All pretty much the same set of tools
Many embedded products: fraud detection,
electronic commerce applications
81
OLAP Mining integration
OLAP (On Line Analytical Processing)
Fast interactive exploration of multidim.
aggregates.
Heavy reliance on manual operations for
analysis:
Tedious and error-prone on large
multidimensional data
Ideal platform for vertical integration of mining
but needs to be interactive instead of batch.
82
State of art in mining OLAP
integration
Decision trees [Information discovery, Cognos]
find factors influencing high profits
Clustering [Pilot software]
segment customers to define hierarchy on that
dimension
Time series analysis: [Seagate’s Holos]
Query for various shapes along time: eg. spikes,
outliers etc
Multi-level Associations [Han et al.]
find association between members of dimensions
83
Vertical integration: Mining on
the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net Perceptions, Wisewire
Inventory control: what was a shopper
looking for and could not find
84