Data Warehousing & Data Mining (SE-409)
Lecture-1
Introduction and Background
Huma Ayub
Software Engineering department
University of Engineering and Technology, Taxila
Course Books
– W. H. Inmon, Building the Data Warehouse
(Second Edition), John Wiley & Sons Inc., NY.
– Han, J. and Kamber, M. (2011) Data Mining Concepts and
Techniques, 3rd Edition, Morgan Kaufmann.
– Paulraj Ponniah, Data Warehousing Fundamentals,
John Wiley & Sons Inc., NY.
Summary of course
1. Introduction & Background
2. De-normalization
3. On Line Analytical Processing (OLAP)
4. Dimensional modeling
5. Extract – Transform – Load (ETL)
6. Data Quality Management (DQM)
7. Need for speed (Parallelism, Join and Indexing techniques)
8. Data Mining
Why this course?
• The world is changing (actually changed), either
change or be left behind.
• Missing the opportunities or going in the wrong
direction has prevented us from growing.
• What is the right direction?
• Joining the data, in a knowledge driven
economy.
The need
“Drowning in data and starving
for information”
Knowledge is power, Intelligence
is absolute power!
The need
[Figure: pyramid rising from DATA through INFORMATION, KNOWLEDGE, and INTELLIGENCE to POWER ($)]
Historical overview
1960
Master Files & Reports
1965
Lots of Master files!
1970
Direct Access Memory & DBMS
1975
Online high performance transaction processing
Historical overview
1980
PCs and 4GL Technology (MIS/DSS)
1985 & 1990
Extract programs, extract processing,
The legacy system’s web
Historical overview: Crisis of Credibility
What is the financial health of our company?
[Figure: conflicting answers, ranging from -10% to +10%]
Why a Data Warehouse (DWH)?
• Data recording and storage is growing.
• History is an excellent predictor of the future.
• Gives a total view of the organization.
• Intelligent decision-support is required for
decision-making.
Reason-1: Why a Data Warehouse?
• The size of data sets is going up.
• The cost of data storage is coming down.
– The amount of data the average business collects and
stores is doubling every year.
– Total hardware and software cost to store and
manage 1 Mbyte of data:
• 1990: ~ $15
• 2002: ~ ¢15 (down 100 times)
• By 2007: < ¢1 (down more than 1,500 times from 1990)
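The fall in storage cost can be checked with simple arithmetic. The sketch below uses the per-Mbyte figures quoted above, converted to cents. It treats the 2007 value as exactly 1 cent, which is an assumption, since the slide only gives an upper bound. The resulting factor for 2007 is therefore a lower bound.

```python
# Cost in US cents to store and manage 1 Mbyte, as quoted on the slide.
# The 2007 entry is an upper bound ("less than 1 cent"), assumed to be 1.0 here.
cost_cents = {1990: 1500.0, 2002: 15.0, 2007: 1.0}


def reduction_factor(start_year: int, end_year: int) -> float:
    """Return how many times cheaper storage became between two years."""
    return cost_cents[start_year] / cost_cents[end_year]


print(reduction_factor(1990, 2002))  # 100.0
print(reduction_factor(1990, 2007))  # 1500.0 (at least)
```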
Reason-1: Why a Data Warehouse?
– A Few Examples
• WalMart: 24 TB
• France Telecom: ~ 100 TB
• CERN: Up to 20 PB by 2006
• Stanford Linear Accelerator Center (SLAC): 500 TB
Caution!
A Warehouse of Data
is NOT a
Data Warehouse
Caution!
Size
is NOT
Everything
Reason-2: Why a Data Warehouse?
• Businesses demand Intelligence (BI).
– Complex questions from integrated data.
– “Intelligent Enterprise”
Reason-2: Why a Data Warehouse?
DBMS Approach
List of all items that were sold last
month?
List of all items purchased by AAA?
The total sales of the last month
grouped by branch?
How many sales transactions
occurred during the month of
January?
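Questions of this kind map directly onto plain SQL. The sketch below runs three of them against an in-memory SQLite database. The `sales` table, its columns, and all rows are hypothetical, invented only to make the queries runnable.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("pen", "AAA", "Taxila", "2024-01-05", 10.0),
        ("book", "BBB", "Taxila", "2024-01-17", 50.0),
        ("pen", "AAA", "Lahore", "2024-01-20", 10.0),
        ("bag", "CCC", "Lahore", "2024-02-02", 80.0),
    ],
)

# List of all items purchased by AAA.
items_by_aaa = conn.execute(
    "SELECT DISTINCT item FROM sales WHERE customer = 'AAA'"
).fetchall()

# Total sales for January, grouped by branch.
january_by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31' "
    "GROUP BY branch ORDER BY branch"
).fetchall()

# How many sales transactions occurred during January?
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sale_date LIKE '2024-01-%'"
).fetchone()[0]

print(items_by_aaa)       # [('pen',)]
print(january_by_branch)  # [('Lahore', 10.0), ('Taxila', 60.0)]
print(january_count)      # 3
```

Each query has a known shape and touches a single table, which is what a conventional DBMS is built for.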
Reason-2: Why a Data Warehouse?
Intelligent Enterprise
Which items sell together? Which
items to stock?
Where and how to place the items?
What discounts to offer?
How best to target customers to
increase sales at a branch?
Which customers are most likely to
respond to my next promotional
campaign, and why?
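"Which items sell together?" cannot be answered by looking up a single record. It requires scanning many transactions. One simple way to approach it is to count how often each pair of items appears in the same basket. The baskets below are hypothetical, and pair counting is only one possible technique, not necessarily the one used later in the course.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)  # ('bread', 'butter') 3
```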
Reason-3: Why a Data Warehouse?
• Businesses want much more… (stages of a Data Warehouse)
– What happened?
– Why did it happen?
– What will happen?
– What is happening?
– What do you want to happen?
What is a Data Warehouse?
A complete repository of historical
corporate data extracted from
transaction systems that is available
for ad-hoc access by knowledge
workers.
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers
What is a Data Warehouse?
Transaction System
– Management Information System (MIS)
– Could be typed sheets (NOT transaction system)
Ad-Hoc access
– Does not have a certain access pattern.
– Queries not known in advance.
– Difficult to write SQL in advance.
Knowledge workers
– Typically NOT IT literate (Executives, Analysts, Managers).
– NOT clerical workers.
– Decision makers.
Another View of a DWH
Subject-Oriented
Integrated
Time-Variant
Non-Volatile
What is a Data Warehouse ?
It is a blend of many technologies, the basic
concept being:
Take all data from different operational systems.
If necessary, add relevant data from industry.
Transform all data and bring into a uniform format.
Integrate all data as a single entity.
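The transform-and-integrate steps above can be sketched as follows. The two source systems, their field names, and the encodings are all hypothetical. The point is only that each source gets its own transformation into one uniform format before the data is combined.

```python
# Hypothetical records from two operational systems that encode the same
# facts differently (field names and codes are assumptions).
billing_rows = [{"cust_id": 1, "gender": "M", "balance_pkr": "1,500"}]
crm_rows = [{"customer": "2", "sex": "female", "balance": 2300.0}]

# One agreed encoding for the warehouse.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_billing(row: dict) -> dict:
    """Transform a billing record into the uniform warehouse format."""
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["gender"].lower()],
        "balance": float(row["balance_pkr"].replace(",", "")),
    }


def from_crm(row: dict) -> dict:
    """Transform a CRM record into the uniform warehouse format."""
    return {
        "customer_id": int(row["customer"]),
        "gender": GENDER_CODES[row["sex"].lower()],
        "balance": float(row["balance"]),
    }


# Integrate: one list, one schema, regardless of the source system.
warehouse = [from_billing(r) for r in billing_rows] + [from_crm(r) for r in crm_rows]
print(warehouse)
```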
What is a Data Warehouse ? (Cont…)
It is a blend of many technologies, the basic
concept being:
Store data in a format supporting easy access for
decision support.
Create performance enhancing indices.
Implement performance enhancement joins.
Run ad-hoc queries with low selectivity.
How is it Different?
• Different patterns of hardware utilization
[Figure: hardware utilization (0% to 100%), Operational vs. DWH]
Analogy: Bus Service vs. Train
How is it Different?
• Combines operational and historical data.
Don't do data entry into a DWH; OLTP or ERP systems are the source
systems.
OLTP systems don't keep history; you can't get a balance statement
more than a year old.
A DWH keeps historical data, even of bygone customers. Why?
In the context of a bank, we want to know why a customer left.
What were the events that led to his/her leaving?
The goal is customer retention.
How much history?
• Depends on:
– Industry.
– Cost of storing historical data.
– Economic value of historical data.
How much history?
• Industries and history
– Telecom: calls far outnumber bank transactions, so call
history is kept for 18 months.
– Retailers are interested in analyzing yearly seasonal patterns: 65
weeks.
– Insurance companies want to do actuarial analysis, using
historical data to predict risk: 7 years.
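The retention periods above can be turned into a cutoff date for each industry. This is only a sketch. It assumes 30-day months and 365-day years for simplicity, and the dictionary keys and function name are hypothetical.

```python
from datetime import date, timedelta

# Retention windows quoted above. Months are approximated as 30 days and
# years as 365 days (assumptions made for simplicity).
RETENTION = {
    "telecom": timedelta(days=18 * 30),
    "retail": timedelta(weeks=65),
    "insurance": timedelta(days=7 * 365),
}


def oldest_date_kept(industry: str, today: date) -> date:
    """Return the earliest date whose data is still kept in the warehouse."""
    return today - RETENTION[industry]


today = date(2024, 6, 30)
for industry in RETENTION:
    print(industry, oldest_date_kept(industry, today))
```

A 65-week window is just over a year, so the same season of the previous year is always available for comparison.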
How much history?
Economic value of data
Vs.
Storage cost
Is a Data Warehouse a
complete repository of data?
How is it Different?
• Usually (but not always) periodic or batch
updates rather than real-time.
For an ATM, if the update is not in real-time, there is a lot of real
trouble.
A DWH is for strategic decision making based on historical
data. It won't hurt if the transactions of the last hour/day are
absent.
How is it Different?
Rate of update depends on:
volume of data,
nature of business,
cost of keeping historical data,
benefit of keeping historical data.
How is it Different?
• Starts with a 6x12 availability requirement ... but
7x24 usually becomes the goal.
Decision makers typically don’t work 24 hrs a day and 7
days a week. An ATM (OLTP) system does.
Once decision makers start using the DWH and gaining the
benefits, they start liking it…
They start using the DWH more often, until they want it available
100% of the time.
For a business that spans the globe, 50% of the world may be
sleeping at any one time, but the business is up 100%
of the time.
100% availability is not a minor task; loading strategies,
refresh rates etc. need to be taken into account.
How is it Different?
• Does not follow the traditional development
model
Requirements → Program
Classical SDLC
Requirements gathering
Analysis
Design
Programming
Testing
Integration
Implementation
How is it Different?
• Does not follow the traditional development
model
DWH → Program → Requirements
DWH SDLC (CLDS)
Implement warehouse
Integrate data
Test for bias
Program w.r.t data
Design DSS system
Analyze results
Understand requirement
Data Warehouse Vs. OLTP
OLTP (On Line Transaction Processing)
SELECT tx_date, balance FROM tx_table
WHERE account_ID = 23876;
Data Warehouse Vs. OLTP
DWH
SELECT balance, age, sal, gender
FROM customer_table, tx_table
WHERE age BETWEEN 30 AND 40
AND education = 'graduate'
AND customer_table.CustID = tx_table.Customer_ID;
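To see the two query styles side by side, the sketch below runs both against an in-memory SQLite database. The table and column names follow the slides. Which table each column belongs to, and all of the rows, are assumptions made for illustration.

```python
import sqlite3

# Assumed placement of columns; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (CustID INTEGER, age INTEGER, sal REAL,
                                 gender TEXT, education TEXT);
    CREATE TABLE tx_table (Customer_ID INTEGER, account_ID INTEGER,
                           tx_date TEXT, balance REAL);
    INSERT INTO customer_table VALUES (1, 35, 50000, 'F', 'graduate'),
                                      (2, 52, 70000, 'M', 'graduate'),
                                      (3, 31, 40000, 'M', 'matric');
    INSERT INTO tx_table VALUES (1, 23876, '2024-01-05', 1200.0),
                                (2, 11111, '2024-01-06', 900.0),
                                (3, 22222, '2024-01-07', 300.0);
""")

# OLTP style: one account, one table, a handful of rows.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ?", (23876,)
).fetchall()

# DWH style: join across tables and filter on customer attributes.
dwh_rows = conn.execute("""
    SELECT t.balance, c.age, c.sal, c.gender
    FROM customer_table AS c
    JOIN tx_table AS t ON c.CustID = t.Customer_ID
    WHERE c.age BETWEEN 30 AND 40 AND c.education = 'graduate'
""").fetchall()

print(oltp_rows)  # [('2024-01-05', 1200.0)]
print(dwh_rows)   # [(1200.0, 35, 50000.0, 'F')]
```

On real data the OLTP query still returns a few rows for one account, while the DWH query scans and joins a large share of both tables.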
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)
OLTP | DWH
Operational processing | Information processing
Normal employee is a user | Knowledgeable person is a user
Few rows returned | Many rows returned
May use a single table | Uses multiple tables
100s of records accessed | Millions of records accessed
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)
Attribute | OLTP | DWH
Function | Day-to-day operation | Decision support
Data | Current | Historical
Access | Read/write | Read only
No. of users | 1000s | 100s
DB size | Max GB | > TB
Putting the pieces together
[Figure: multi-tier data warehouse architecture]
Tier 0, Data sources: operational databases, semistructured sources, www data, archived data
Tier 1, Data Warehouse Server: Extract, Transform, Load (ETL) into the data warehouse and data marts, with metadata
Tier 2, OLAP Servers: MOLAP, ROLAP
Tier 3, Clients: query/reporting, analysis, and data mining tools
Users: IT users, business users
Types & Typical Applications of DWH
Types of data warehouse
• Financial
• Telecommunication
• Insurance
• Human Resource
• Global
• Exploratory
Types of data warehouse
Financial
First data warehouse that an organization
builds. This is appealing because:
Nerve center, easy to get attention.
Most organizations start work from the smallest data
set [due to the risk factor and the complexity of larger sets].
Touches all aspects of an organization, with a
common denominator, i.e. money.
Types of data warehouse
Telecommunication
Dominated by the sheer volume of data.
Many ways to accommodate call level detail:
Only a few months of call level detail,
Storing lots of call level detail scattered over different
storage media,
Storing only selective call level detail, etc.
Unfortunately, for many kinds of processing, working at
an aggregate level is simply not possible.
Types of data warehouse
Insurance
Insurance data warehouses are similar to other
data warehouses BUT with a few exceptions.
Stored data is very, very old and is used for actuarial
processing (risk assessment).
A typical business may have changed dramatically over the
last 40-50 years, but not insurance.
In retailing or telecomm there are a few important
dates, but in the insurance environment there are
many dates of many kinds.
Types of data warehouse
Insurance
Insurance data warehouses are similar to other
data warehouses BUT with a few exceptions.
Long operational business cycles, in years.
Processing time in months. Thus the operating
speed is different.
Transactions are not gathered and processed, but
are kept in a kind of “frozen” state.
Hence a unique approach to design and
implementation.
Typical Applications
The impact on an organization's core business is to streamline
operations and maximize profitability.
• Fraud detection.
• Profitability analysis.
• Direct mail/database marketing.
• Credit risk prediction.
• Customer retention modeling.
• Yield management.
• Inventory management.
Typical Applications
Fraud detection
• By observing data usage patterns.
• People have typical purchase patterns.
• Deviation from patterns.
• Certain cities notorious for fraud.
• Certain items bought by stolen cards.
• Similar behavior for stolen phone cards.
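"Deviation from patterns" can be made concrete with a very simple rule: flag a purchase whose amount lies far from the cardholder's usual spending. The sketch below uses a z-score with a threshold of 3 standard deviations. Both the rule and the sample amounts are assumptions for illustration, and real fraud detection uses many more signals (location, item type, timing).

```python
from statistics import mean, stdev


def is_suspicious(history: list[float], new_amount: float,
                  threshold: float = 3.0) -> bool:
    """Flag a purchase that deviates strongly from the cardholder's pattern."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical: any different amount deviates.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold


# Hypothetical purchase history for one cardholder.
history = [40.0, 55.0, 38.0, 60.0, 47.0]
print(is_suspicious(history, 52.0))   # False
print(is_suspicious(history, 900.0))  # True
```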
Typical Applications
Profitability Analysis
• Every bank knows whether it is profitable or not.
• But it doesn't know which customers are profitable.
• Typically more than 50% are NOT profitable.
• The bank doesn't know which ones.
• Balance is not enough, transactional behavior is
the key.
• Restructure products and pricing strategies.
• Life-time profitability models (next 3-5 years).
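Why balance alone is not enough can be shown with a toy calculation. In the sketch below, each customer's profit is the revenue earned from them minus a flat cost per transaction. The cost model and all figures are hypothetical.

```python
# Hypothetical flat cost the bank incurs for each transaction.
COST_PER_TX = 0.50

# Hypothetical customers: revenue is fee and interest income per year.
customers = [
    {"id": "A", "balance": 90000.0, "revenue": 120.0, "transactions": 400},
    {"id": "B", "balance": 5000.0, "revenue": 60.0, "transactions": 20},
    {"id": "C", "balance": 30000.0, "revenue": 45.0, "transactions": 150},
]


def profit(customer: dict) -> float:
    """Revenue minus the cost of serving the customer's transactions."""
    return customer["revenue"] - COST_PER_TX * customer["transactions"]


unprofitable = [c["id"] for c in customers if profit(c) < 0]
print(unprofitable)  # ['A', 'C']
```

Customer A has the highest balance yet loses the bank money because of heavy transaction activity, while the low-balance customer B is profitable.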
Typical Applications
Direct mail marketing
• Targeted marketing.
• Offer a high-bandwidth package, but NOT to all
users.
• Identify likely users from call detail records of web surfing.
• Saves marketing expense, saving pennies.
Typical Applications
Credit risk prediction
• Who should get a loan?
• Customer separation i.e. stable vs. rolling.
• Quantitative decision making, NOT subjective.
• Different interest rates for different customers.
• Do not fund bad customers on the basis of good ones.