Lecture 01 Tue, Jan 20, 2009 1800 : 2100 FAST NU, Karachi
Course Outline
- Introduction to Data Warehousing and Background
- Dimension Modeling
- Architecture and Infrastructure
- Extract Transform Load
- Data Quality Management
- OLAP
- Implementation Methods of Data Warehouse
- Data Mining Overview
Course Material
- Data Warehousing Fundamentals by Paulraj Ponniah, John Wiley and Sons
- Articles
- Class Notes
Marks Distribution
Objective of the course
- Why exactly does the world need a Data Warehouse?
- How does a Data Warehouse differ from traditional databases and RDBMS?
- Where does OLAP stand in the Data Warehouse picture?
- What are the different Data Warehouse and OLAP models/schemas?
- How is ETL performed?
- What is data cleansing? How is it performed? What are the well-known algorithms?
- Which different Data Warehouse architectures are there? What are their strengths and weaknesses?
What is a Data Warehouse?
The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.
Subject Oriented
Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
Integrated
Single, Enterprise-Wide view.
Time Variant
Every record in the data warehouse has some form of time dimension attached to it.
Non Volatile
Data, once loaded, is not updated or deleted in place; new data is added alongside the old. Every record in the data warehouse is time stamped in one form or another.
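The time-variant and non-volatile properties can be illustrated with a minimal sketch. The table `customer_snapshot`, its columns, and the helper `load_snapshot` are hypothetical names chosen only for this illustration; the point is that each load appends rows stamped with a date and never updates rows in place.

```python
import sqlite3

# Hypothetical warehouse table: every row carries a snapshot date (time-variant).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot ("
    " customer_id INTEGER, city TEXT, snapshot_date TEXT,"
    " PRIMARY KEY (customer_id, snapshot_date))"
)

def load_snapshot(rows, snapshot_date):
    """Append one load; existing rows are never updated (non-volatile)."""
    conn.executemany(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        [(cid, city, snapshot_date) for cid, city in rows],
    )
    conn.commit()

# The customer moves between loads; the history is kept, not overwritten.
load_snapshot([(1, "Karachi")], "2009-01-01")
load_snapshot([(1, "Lahore")], "2009-02-01")

history = conn.execute(
    "SELECT city, snapshot_date FROM customer_snapshot"
    " WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
print(history)
```

An operational system would typically overwrite the customer's city; here both values remain queryable by date.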
Decision Support is a methodology (or a series of methodologies) designed to extract information from data and to use such information as a basis for decision making.
What is a Data Warehouse?
[Figure: legacy data, large-scale data collection, generation, or digitization exercises, and several online operational sources feed the DW servers; together with reporting to the end user, these form the corporate decision support infrastructure.]
Needs for Strategic Information
- Retain the present customer base
- Increase the customer base by 15% over the next 5 years
- Gain market share by 10% in the next 3 years
- Improve product quality levels in the top five product groups
- Enhance customer service level in shipments
- Bring three new products to market in 2 years
- Increase sales by 15% in the Northern Division
Need of a Data Warehouse
- The amount of data the average business collects and stores is doubling each year
- Total hardware and software cost to store and manage 1 MB of data:
  - 1990: ~$15
  - 2002: ~15¢ (down 100 times)
  - 2005: ~1¢ (down 1,500 times)
A Few Examples
- CERN: up to 20 PB by 2006
- Stanford Linear Accelerator Center (SLAC): 500 TB
- France Telecom: ~100 TB
- WalMart: 24 TB
Operational Systems
1. User needs information
2. User requests reports from IT
3. IT places request on backlog
4. IT creates ad hoc queries
5. IT sends requested reports
6. User hopes to find the right answer
7. User needs information again, and the cycle repeats
Operational vs. Informational
| | Operational | Informational |
|---|---|---|
| Data Content | Current values | Archived, derived, summarized |
| Data Structure | Optimized for transactions | Optimized for complex queries |
| Access Frequency | High | Medium to low |
| Access Type | Read, update, delete | Read |
| Usage | Predictable, repetitive | Ad hoc, random, heuristic |
| Response Time | Sub seconds | Several seconds to minutes |
| Users | Large number | Relatively small number |
Data Warehouse
[Figure: three-tier data warehouse architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the Data Warehouse Server (Tier 1), which holds the data warehouse and the data marts. Tier 1 serves the OLAP Servers (Tier 2), e.g., MOLAP and ROLAP, which in turn serve the Clients (Tier 3): analysis, query/reporting, and data mining.]
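The extract, transform, load step between the information sources and the warehouse server can be sketched as follows. This is a minimal illustration under stated assumptions, not a production ETL tool: the two source systems, their field names, and the functions `transform_sales`, `transform_billing`, and `load` are all hypothetical, and a plain list stands in for the warehouse.

```python
from datetime import date

# Hypothetical extracts from two operational sources with different conventions.
sales_system = [{"cust": "A-1", "amount": "120.50", "day": "2009-01-20"}]
billing_system = [{"customer_id": "a-1 ", "total": 80.0, "date": date(2009, 1, 20)}]

def transform_sales(row):
    """Map a sales-system row onto the shared warehouse layout."""
    return {
        "customer_id": row["cust"].strip().upper(),
        "amount": float(row["amount"]),
        "day": date.fromisoformat(row["day"]),
        "source": "sales",
    }

def transform_billing(row):
    """Map a billing-system row onto the same layout."""
    return {
        "customer_id": row["customer_id"].strip().upper(),
        "amount": float(row["total"]),
        "day": row["date"],
        "source": "billing",
    }

warehouse = []  # stands in for the data warehouse server

def load(rows):
    """Append transformed rows to the warehouse."""
    warehouse.extend(rows)

load(transform_sales(r) for r in sales_system)
load(transform_billing(r) for r in billing_system)

# Both sources now share one customer key and one set of types.
print(sorted({r["customer_id"] for r in warehouse}))
```

The transform functions are where the "integrated" property is enforced: differing key formats and data types are reconciled before anything reaches the warehouse.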
Online Transaction Processing (OLTP)
- Also known as operational sources
- Day-to-day handling of transactions that result from enterprise operation
- Examples: airline reservation systems, electronic point of sale systems, automatic teller machines, etc.
- Typically several systems within the same enterprise
- Mostly read and update operations
- Standard, predefined, less complex queries
- Queries based on an individual record or a relatively small number of records (single-hit queries)
- Typically used in tactical management
Decision Support Systems
Decision Support is a methodology (or a series of methodologies) designed to extract information from data and to use such information as a basis for decision making.
- Communication Driven DSS
- Data Driven DSS
- Document Driven DSS
- Knowledge Driven DSS
- Model Driven DSS
Data Driven DSS
Online Analytical Processing (OLAP)
- The goal of OLAP is to support ad hoc querying for the business analyst
- A multidimensional view of data is the foundation of OLAP
- Extends the spreadsheet analysis model to work with warehouse data
- Read-only access
- Semantically enriched to understand business terms (e.g., time, geography)
- Combined with reporting features
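The multidimensional view can be illustrated with a small roll-up over fact rows. This is only a sketch: the sample data, the dimension names, and the function `roll_up` are assumptions made for the example, and a real OLAP server would do far more (hierarchies, pre-aggregation, indexing).

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
facts = [
    (2008, "North", "Widget", 100),
    (2008, "South", "Widget", 150),
    (2008, "North", "Gadget", 200),
    (2009, "North", "Widget", 120),
]
DIMENSIONS = ("year", "region", "product")

def roll_up(rows, keep):
    """Sum sales over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

# The same facts viewed at two levels of detail.
by_year_region = roll_up(facts, ("year", "region"))
by_year = roll_up(facts, ("year",))
print(by_year_region)
print(by_year)
```

Choosing which dimensions to keep is the ad hoc part: the analyst can ask for sales by year, by region, or by any combination without a predefined report.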
OLTP vs. Data Driven DSS
| Trait | OLTP | Data Driven DSS |
|---|---|---|
| User | Sales staff, IT professionals | Knowledge worker |
| Function | Day-to-day operations | Decision support |
| DB Design | Application-oriented (E-R based) | Subject-oriented (star, snowflake) |
| Data | Current, isolated | Historical, consolidated |
| View | Detailed, flat relational | Summarized, multidimensional |
| Usage | Structured, repetitive | Ad hoc |
| Unit of work | Short, simple transaction | Complex query |
| Access | Read/write | Read mostly |
| Operations | Index/hash on primary key | Lots of scans |
| Records accessed | Tens to hundreds | Thousands to millions |
| #Users | Thousands | Hundreds |
| DB size | 100 MB-GB | 100 GB-TB |
| Metric | Transaction throughput | Query throughput, response |
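The contrast in unit of work, access, and operations can be made concrete with two queries against the same data. The following is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table and its contents are hypothetical, and in practice the two workloads would usually run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY,"
    " region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2008, 100.0),
        (2, "South", 2008, 150.0),
        (3, "North", 2009, 120.0),
        (4, "North", 2009, 80.0),
    ],
)

# OLTP style: read and update one record located by its primary key.
conn.execute("UPDATE orders SET amount = 90.0 WHERE order_id = 4")
single_hit = conn.execute(
    "SELECT amount FROM orders WHERE order_id = 4"
).fetchone()

# DSS style: read-only scan that summarizes many records.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders"
    " GROUP BY region, year ORDER BY region, year"
).fetchall()

print(single_hit)
print(summary)
```

The first query touches a single row through the key; the second reads every row and returns only totals, which is why scan speed rather than transaction throughput becomes the relevant metric.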
Data Mining
- Knowledge extraction
  - Verification: OLAP-type analyses, hypothesis testing
  - Discovery: extracting rules or patterns
- Data mining is finding hidden patterns in data
  - Predict which customers will buy new policies
  - Identify behavior patterns of risky customers
  - Identify fraudulent behavior
  - Characterize patient behavior to predict office visits
  - Identify successful medical therapies for different illnesses
Knowledge Discovery in Databases (KDD)
- Non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data
- KDD stages
  1. Problem definition
  2. Data selection
  3. Cleaning
  4. Enrichment
  5. Coding and organization
  6. Data mining
  7. Reporting
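Several of the KDD stages above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, the age bands, and the function names `select`, `clean`, `code`, and `mine` are hypothetical, the enrichment stage is omitted, and simple counting stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer, age, product bought)
raw = [
    ("ali", 34, "policy"),
    ("sara", None, "policy"),   # missing age, dropped during cleaning
    ("omar", 52, "loan"),
    ("zara", 29, "policy"),
    ("omar", 52, "loan"),       # duplicate, dropped during cleaning
]

def select(rows):
    """Data selection: keep only records relevant to the problem."""
    return [r for r in rows if r[2] in ("policy", "loan")]

def clean(rows):
    """Cleaning: drop duplicates and records with missing values."""
    seen, out = set(), []
    for r in rows:
        if None not in r and r not in seen:
            seen.add(r)
            out.append(r)
    return out

def code(rows):
    """Coding: replace raw ages with age bands."""
    return [("under 40" if age < 40 else "40 and over", product)
            for _, age, product in rows]

def mine(rows):
    """Data mining (toy stand-in): count products per age band."""
    return Counter(rows)

# Reporting: print the counts found by the earlier stages.
patterns = mine(code(clean(select(raw))))
for (band, product), n in sorted(patterns.items()):
    print(f"{band}: {product} x{n}")
```

Each stage consumes the output of the previous one, so a problem found late (for example, too many records lost in cleaning) usually means going back to an earlier stage.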
DW and DB: Clarifying Confusions
- Is a DW different from a DB? No.
- The difference is historical, not technical.
- A DW is a DB inside and out.
- DW is to Data Driven DSS what DB is to OLTP.
Brief History of DB Design
- Master file design
- Integrated, subject-oriented design
- Relational design
- Star join design