Data Warehousing
Naveed Iqbal, Assistant Professor FAST-NU, Islamabad
(Lecture Slides Week # 3)
Data Warehouse: How is it Different?
4. Usually (but not always) periodic or batch updates rather than real-time
The boundary is blurring with active data warehousing.
For an ATM, an update that is not real-time means real trouble.
A DWH supports strategic decision making based on historical data, so it does not hurt if the transactions of the last hour or day are absent.
Rate of update depends on:
  Volume of data
  Nature of business
  Cost of keeping historical data
  Benefit of keeping historical data
Data Warehousing - Fall 2010 2
FAST-NU, Islamabad
Data Warehouse: How is it Different?
5. Starts with 6x12 availability requirement but 7x24 usually becomes the goal.
Decision makers typically don't work 24 hours a day, 7 days a week. An ATM system does.
Once decision makers start using the DWH and reaping its benefits, they start liking it.
They use the DWH more and more often, until they want it available 100% of the time.
For a business operating across the globe, 50% of the world may be sleeping at any one time, but the business is up 100% of the time.
100% availability is not a trivial task; loading strategies, refresh rates etc. must be taken into account.
Data Warehouse: How is it Different?
6. Does not follow the traditional development model
Traditional model: Requirements gathering -> Analysis -> Design -> Programming -> Testing -> Integration -> Implementation
DWH model: Implement warehouse -> Integrate data -> Test for bias / incorrectness -> Program w.r.t. data -> Design DSS system -> Analyze results -> Understand requirements
Data Warehouse: How is it Different?
7. Comparison of response times
OLAP (Online Analytical Processing) queries must be executed in a small number of seconds.
Often requires de-normalization and/or sampling
Complex query scripts and large list selections can generally be executed in a small number of minutes.
Sophisticated algorithms (e.g. clustering for data mining) can generally be executed in a small number of hours, even for hundreds of thousands of customers.
Data Warehouse: How is it Different?
8. Data Warehouse vs. OLTP (Online Transaction Processing)
OLTP:
  SELECT tx_date, balance
  FROM tx_table
  WHERE account_ID = 829;
DWH:
  SELECT balance, age, sal, gender
  FROM customer_table, tx_table
  WHERE age BETWEEN 30 AND 40
    AND education = 'graduate'
    AND customer_table.custID = tx_table.customer_ID;

OLTP                                  DWH
Primary key used                      Primary key NOT used
No concept of primary index           Primary index used
May use a single table                Mostly uses multiple tables
Normally few rows returned            Normally many rows returned
High selectivity of query             Low selectivity of query
Indexing on primary key (unique)      Indexing on primary index (nonunique)
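The two query styles above can be tried out on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module; the column types, the tx_ID column and all sample rows are assumptions made up for illustration and are not part of the lecture.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (
        custID    INTEGER PRIMARY KEY,
        age       INTEGER,
        sal       INTEGER,
        gender    TEXT,
        education TEXT
    );
    CREATE TABLE tx_table (
        tx_ID       INTEGER PRIMARY KEY,
        account_ID  INTEGER,
        customer_ID INTEGER,
        tx_date     TEXT,
        balance     INTEGER
    );
    INSERT INTO customer_table VALUES
        (1, 35, 50000, 'F', 'graduate'),
        (2, 45, 70000, 'M', 'graduate'),
        (3, 32, 40000, 'M', 'graduate'),
        (4, 38, 30000, 'F', 'matric');
    INSERT INTO tx_table VALUES
        (10, 829, 1, '2010-09-01', 1200),
        (11, 829, 1, '2010-09-02',  900),
        (12, 830, 2, '2010-09-01', 5000),
        (13, 831, 3, '2010-09-03',  300),
        (14, 832, 4, '2010-09-03',  700);
""")

# OLTP style: single table, one specific account, few rows returned.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table "
    "WHERE account_ID = ? ORDER BY tx_date",
    (829,),
).fetchall()

# DWH style: join of two tables, filtered on non-key attributes,
# so the result covers every customer matching the profile.
dwh_rows = conn.execute("""
    SELECT t.balance, c.age, c.sal, c.gender
    FROM customer_table AS c
    JOIN tx_table AS t ON c.custID = t.customer_ID
    WHERE c.age BETWEEN 30 AND 40
      AND c.education = 'graduate'
    ORDER BY t.balance
""").fetchall()

print("OLTP rows:", oltp_rows)
print("DWH rows:", dwh_rows)
```

With only five transactions the difference in result size is small; on a real warehouse the second query would typically touch and return far more rows than the first.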
DWH vs. OLTP: Summary
                  DWH                              OLTP
Scope             Application neutral              Application specific
                  Single source of truth           Multiple databases with repetition
                  Evolves over time                Off the shelf application
                  How to improve business          Runs the business
Data Perspective  Historical, detailed data        Operational data
                  Some summary                     No summary
                  Lightly de-normalized            Fully normalized
Queries           Hardly uses PK                   Based on PK
                  No. of returned results in Ks    No. of returned results in 100s
Time Factor       Minutes to hours                 Sub seconds to seconds
                  Typical availability 12x6        Typical availability 24x7
DWH vs. OLTP: Summary
                  DWH                              OLTP
Characteristics   Informational processing         Operational processing
Orientation       Analysis                         Transaction
Users             Knowledge workers                Clerks, DBAs etc.
Function          Decision support                 Day to day operation
DB Design         Subject oriented                 Application oriented
Unit of work      Complex query                    Short, simple transaction
Access            Mostly read                      Read / write
Focus             Information out                  Data in
Priority          High flexibility / autonomy      High performance / availability
Metric            Query throughput                 Transaction throughput
Putting the pieces together
[Figure: four-tier data warehouse architecture]
Data (Tier 0): data sources (operational databases, semistructured sources, www data, archived data), fed through Extract Transform Load (ETL)
Data Warehouse Server (Tier 1): data warehouse, meta data, data marts
OLAP Servers (Tier 2): MOLAP, ROLAP
Clients (Tier 3): query/reporting, analysis and data mining tools, used by IT users and business users
Why is this hard?
Data sources are unstructured & heterogeneous.
Requirements are always changing.
Most computer scientists are trained on OLTP systems; those concepts are not valid for VLDB & DSS.
The scale factor in VLDB implementations is difficult to comprehend.
Performance impacts are often non-linear, O(n) vs. O(n log n), e.g. scanning vs. indexing.
Complex computer/database architectures.
Rapidly changing product characteristics.
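The scanning vs. indexing point can be made concrete with a small sketch. It reads the slide's O(n) as the cost of one full scan and O(n log n) as the one-time cost of building a sorted index, after which each lookup takes O(log n); this reading is an assumption, and the function names and data below are hypothetical.

```python
import bisect


def scan_lookup(rows, key):
    """Linear scan: compares row by row, O(n) per lookup."""
    comparisons = 0
    for k, value in rows:
        comparisons += 1
        if k == key:
            return value, comparisons
    return None, comparisons


def build_index(rows):
    """Build a sorted index once: O(n log n) because of the sort."""
    ordered = sorted(rows)
    keys = [k for k, _ in ordered]
    values = [v for _, v in ordered]
    return keys, values


def index_lookup(keys, values, key):
    """Binary search on the sorted keys: O(log n) per lookup."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return values[pos]
    return None


# Hypothetical table of 100,000 rows stored in descending key order,
# so that key 0 is the worst case for the scan.
rows = [(k, f"row-{k}") for k in range(99_999, -1, -1)]

value, comparisons = scan_lookup(rows, 0)
keys, values = build_index(rows)

print(value, comparisons)             # the scan visits every row
print(index_lookup(keys, values, 0))  # the index needs about 17 probes
```

Whether building the index pays off depends on how many lookups follow it, which is one reason performance at VLDB scale is hard to predict from small tests.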
DWH: High level implementation steps
Phase-I
  Determine user needs
  Determine DBMS server platform
  Determine hardware platform(s)
  Information and data modeling
  Construct metadata repository
Phase-II
  Data acquisition and cleansing
  Data transform, transport and populate
  Determine middleware connectivity
  Prototyping, querying and reporting
  Data Mining
  Online Analytical Processing (OLAP)
Phase-III
  Deployment and System Management
Types of Data Warehouses
Financial
Telecommunication
Insurance
Human Resource
Global
Exploratory
Types of Data Warehouses
Financial
The first Data Warehouse that an organization builds. This is appealing because:
  Nerve center, easy to get attention.
  In most organizations, smallest data set.
  Touches all aspects of an organization with a common denominator, i.e. money.
  Inherent structure of data is directly influenced by the day-to-day activities of financial processing.
Types of Data Warehouses
Telecommunication
Dominated by sheer volume of data.
Many ways to accommodate call level detail:
  Keeping only a few months of call level detail.
  Storing lots of call level detail scattered over different storage media.
  Storing only selective call level detail, etc.
Unfortunately, for many kinds of processing, working at an aggregate level is simply not possible, as finding patterns will be difficult.
Types of Data Warehouses
Insurance
Insurance Data Warehouses are similar to other Data Warehouses, but with a few exceptions:
  They store data that is very old, used for actuarial processing / analysis.
  A typical business may have changed dramatically over the last 40-50 years, but not insurance.
  In retailing or telecom there are a few important dates, but in the insurance environment there are many dates of many kinds.
Types of Data Warehouses
Insurance
Insurance Data Warehouses are similar to other Data Warehouses, but with a few exceptions (contd.):
  Long operational business cycles, in years.
  Processing time in months. Thus the operating speed is different.
  Transactions are not gathered and processed, but are kept in a kind of frozen state.
  Thus a unique approach to design & implementation is needed.
DWH: Typical Applications
Impact on an organization's core business is to streamline and maximize profitability.
  Fraud Detection
  Profitability Analysis
  Direct Mail / Database Marketing
  Credit Risk Prediction
  Customer Retention Modeling
  Yield Management
  Inventory Management
ROI on any one of these applications can justify HW / SW and consultancy costs in most organizations.
DWH: Typical Applications
Fraud Detection
  By observing data usage patterns
  People have typical purchase patterns
  Deviation patterns
  Certain cities notorious for fraud
  Certain items bought by stolen cards
  Similar behavior for stolen cards
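The idea of a typical purchase pattern and a deviation from it can be sketched in a few lines. This is only an illustration, assuming a simple rule (flag a purchase more than three standard deviations away from the customer's own history); the function name, the threshold and the amounts are hypothetical, and real fraud detection uses far richer features such as city and item type.

```python
from statistics import mean, stdev


def flag_unusual(history, new_amount, threshold=3.0):
    """Return True if new_amount deviates from the customer's purchase
    history by more than `threshold` standard deviations.
    The rule and the default threshold are assumptions for illustration."""
    if len(history) < 2:
        return False  # not enough history to define a typical pattern
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold


# Hypothetical past purchase amounts for one card holder.
history = [42.0, 55.0, 38.0, 61.0, 47.0, 50.0]

print(flag_unusual(history, 52.0))   # close to the usual amounts
print(flag_unusual(history, 900.0))  # far outside the usual amounts
```

The point for the DWH is that such a rule needs the customer's historical transactions, which an OLTP system holding only current operational data does not conveniently provide.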
DWH: Typical Applications
Profitability Analysis
  Banks know whether they are profitable or not
  Don't know which customers are profitable
  Typically more than 50% are NOT profitable
  Don't know which ones
  Balance is not enough, transactional behavior is the key
  Restructure products and pricing strategies
  Life time profitability models (next 3-5 years)
DWH: Typical Applications
Direct Mail Marketing
  Targeted marketing
  Offering high bandwidth package NOT to all users
  Know web surfing behavior from call detail records
  Saves marketing expense, saving pennies
  Knowing your customer better
DWH: Typical Applications
Credit Risk Prediction
  Who should get a loan?
  Customer segregation i.e. stable vs. rolling
  Qualitative decision making NOT subjective
  Different interest rates for different customers
  Do not subsidize bad customer on the basis of good
DWH: Typical Applications
Yield Management
  Works for fixed inventory businesses
  The price of an item suddenly goes to zero (e.g. an unsold seat once the flight departs)
  Item prices vary for varying customers
  Examples: Airlines, Hotels etc.
  E.g. price of an air ticket depends on:
    How far in advance was the ticket bought?
    How many vacant seats were available?
    How profitable is the customer?
    Is the ticket one-way or return?
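The four factors listed for an air ticket can be combined in a toy pricing function. This is a sketch only: the function name, every multiplier, and the choice to give profitable customers a discount are hypothetical assumptions, not a formula from the lecture or from any airline.

```python
def ticket_price(base_fare, days_in_advance, vacant_seats, total_seats,
                 customer_is_profitable, is_return):
    """Toy yield-management price. All multipliers are invented
    for illustration and carry no real-world meaning."""
    if total_seats <= 0:
        raise ValueError("total_seats must be positive")

    price = base_fare

    # Factor 1: how far in advance the ticket is bought.
    if days_in_advance < 7:
        price *= 1.5
    elif days_in_advance < 30:
        price *= 1.2

    # Factor 2: how many seats are still vacant (fuller flight, higher price).
    load = 1 - vacant_seats / total_seats
    price *= 1 + 0.5 * load

    # Factor 3: how profitable the customer is (assumed loyalty discount).
    if customer_is_profitable:
        price *= 0.9

    # Factor 4: one-way or return (assumed: return costs less than two one-ways).
    if is_return:
        price *= 1.8

    return round(price, 2)


early = ticket_price(100.0, 60, 50, 100, False, False)
late = ticket_price(100.0, 3, 50, 100, False, False)
print(early, late)
```

In practice the multipliers would be estimated from historical booking data in the warehouse rather than fixed by hand.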
DWH: Recent Applications
Agriculture Systems
  Agriculture related data collected for decades
  Meteorological data consists of 50+ attributes
  Decision making based on expert judgment
  Lack of integration results in underutilization
  What is required, in which amount and when?
DWH: Typical Early Adopters
Financial service / insurance
Retailing and distribution
Telecommunications
Transportation
Government
Common thread:
  Lots of customers and transactions.
DWH: End User Expectations
Point and click access to data
Insulation from DBMS structures
  Want semantic data model, not 3rd normal form
Integration with existing tools: Excel, SAS etc.
Interactive response times for online analysis, but batch time is important as well.