UNIT - III
Data Warehouse consists of
1. What is Data Warehouse
2. Differences between OLAP and OLTP
3. Multidimensional Data Model
4. Data Warehouse Architecture
1. What is Data Warehouse: A data warehouse is a
subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of
management’s decision-making process. [W. H. (Bill)
Inmon]
A data warehouse is a copy of transaction data
specifically structured for query and analysis. [Ralph
Kimball]
Bill Inmon, who is credited with coining the term "data
warehouse" around 1990, is widely regarded as the father of
data warehousing; Ralph Kimball is known for the
dimensional approach to warehouse design.
Data warehousing is the process of constructing and
using data warehouses.
Level                 Description
IoT                   Internet of Things
Big Data Analytics    Deals with the 3 V's of big data (e.g., Facebook)
Data Mining           Extracting meaningful data
Data Warehouse        Collection of data marts / OLAP
Data Mart             Subset of a DWH
DBMS                  Collection of related information / OLTP
Information           Processed data
Data                  Raw material / facts / images
Fig: Hierarchy of Data warehousing and Data Mining
The term "data warehouse" refers to a special type of
database that acts as the central repository for company
data.
A data warehouse is a relational or multidimensional
database that is designed for query and analysis.
It can be thought of as a database archive that is
segregated from the operational databases, and used
primarily for reporting and data mining purposes.
A data warehouse enables knowledge workers to make
faster and better decisions.
Data Warehouse can be defined as collection of Data
marts.
Data warehousing is a collection of decision support
technologies, aimed at enabling the knowledge worker
to make better decisions.
Requirements of a Data Warehouse system
Efficient cube computation
Better access methods
Efficient query processing
[Figure labels: Data in files, Tape, Mainframe DBMS, Distributed DW with Server]
Characteristics are
1. Subject Oriented: The data gives information about a particular
subject rather than about a company's ongoing operations. A data
warehouse can be used to analyze a particular subject area.
Ex: "Sales" can be a particular subject; other subjects include
customer, product, weather data, and stock market.
2. Integrated: Data that is gathered into the data warehouse from a
variety of sources and merged into a coherent whole. A data
warehouse integrates data from multiple data sources. Data
cleaning and data integration techniques are applied to ensure
consistency in encoding structures, naming conventions, attribute
measures etc.
Ex: A warehouse constructed by integrating multiple, heterogeneous
data sources such as relational databases, flat files, and on-line
transaction records.
Process Oriented (Transactional Storage): Entry, Sales Rep,
Quantity Sold, Date, Customer Name, Product Description, Unit Price,
Mail Address
Subject Oriented (Data Warehouse Storage): Sales, Customers, Products
Fig: Subject Oriented
Integration: Transactional Storage (Appl. A, B, C) -> Data Warehouse Storage
Aspect               Appl. A            Appl. B               Appl. C          Data Warehouse
Encoding             M, F               1, 0                  X, Y             M, F
Unit of Attributes   pipeline cm        pipeline inches       pipeline m       pipeline cm
Physical Attributes  balance dec(13,2)  balance PIC 9(9)V99   balance float    balance dec(13,2)
Naming Conventions   bal-on-hand        current_balance       balance          balance
Data Consistency     date (Julian)      date (yymmdd)         date (absolute)  date (Julian)
Fig: Integrated
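The integration step in the figure can be sketched in code. This is a minimal illustration, not a production routine: the function name `integrate`, the record layout, and the assumption that 1 and X stand for "M" while 0 and Y stand for "F" are all hypothetical, since the figure only shows that the three encodings are unified.

```python
# Assumed meaning of each application's gender codes (not stated in the figure).
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}
# Conversion factors to the warehouse unit (cm): Appl. A uses cm,
# Appl. B uses inches, Appl. C uses metres.
LENGTH_TO_CM = {"A": 1.0, "B": 2.54, "C": 100.0}


def integrate(app, gender, pipeline_length):
    """Convert one source record to the warehouse encoding (M/F, cm)."""
    return {
        "gender": GENDER_MAP[app][str(gender)],
        "pipeline_cm": round(pipeline_length * LENGTH_TO_CM[app], 2),
    }


rows = [
    integrate("A", "F", 120.0),
    integrate("B", 1, 10.0),
    integrate("C", "Y", 1.5),
]
for row in rows:
    print(row)
```

Keeping the mappings in lookup tables means a new source application can be added without changing the conversion logic.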
3. Time-variant: All data in the data warehouse is
identified with a particular time period. Historical data is
kept in a data warehouse.
Ex: An operational database holds current-value data, whereas
data warehouse data provides information from a historical
perspective (e.g., the past 5-10 years).
4. Non-volatile: Data is stable in a data warehouse. More
data is added but data is never removed. Once data is in
the data warehouse, it will not change.
Ex: Only two data operations are needed: the initial loading of
data and access of data. Updates of data are not allowed.
Transactional Storage: Current Data
Data Warehouse Storage: Historical Data, e.g., a chart of Sales (in
lakhs) by Region (East, West, North) for January to March, Year 97,
1st Qtr
Fig: Time Variant
Volatile (Transactional Storage): Insert, Change, Delete, Access;
record-by-record data manipulation
Non-Volatile (Data Warehouse Storage): Load, Access; mass load /
access of data
Fig: Non - Volatile
Need of Data Warehousing
Data warehousing is capable of storing and consolidating
past information.
It provides support for sophisticated multidimensional queries.
A data warehouse uses an update-driven approach rather than a
query-driven approach.
It helps to understand current business trends and to make
forecasting decisions.
It supports reporting and analysis.
It supports knowledge discovery and decision support.
It improves the performance of analysis on data such as stock
market data.
DBMS, OLAP and Data Mining
Task
  DBMS (OLTP):  Extraction of detailed and summary data
  OLAP (DW):    Summaries, trends and forecasts
  Data Mining:  Knowledge discovery of hidden patterns and insights
Type of Result
  DBMS (OLTP):  Information
  OLAP (DW):    Analysis
  Data Mining:  Insight and prediction
Method
  DBMS (OLTP):  Deduction (ask the question, verify with data)
  OLAP (DW):    Multidimensional data modeling, aggregation, statistics
  Data Mining:  Induction (build the model, apply it to new data, get
                the result)
Example question
  DBMS (OLTP):  Who purchased mutual funds in the last 3 years?
  OLAP (DW):    What is the average income of mutual fund buyers by
                region by year?
  Data Mining:  Who will buy a mutual fund in the next 6 months and why?
Benefits of a Data Warehouse
1. A Data Warehouse Delivers Enhanced Business Intelligence.
2. A Data Warehouse Saves Time.
3. A Data Warehouse Enhances Data Quality and Consistency.
4. A Data Warehouse Provides Historical Intelligence
5. A Data Warehouse Generates a High ROI (Return on Investment).
DW Tools
1. Informatica - Power Center
2. IBM - WebSphere DataStage, Cognos Data Manager
3. SAP - Business Objects Data Integrator
4. Microsoft - SQL Server Integration Services
5. Oracle - Data Integrator, Warehouse Builder
6. SAS - Data Integration Studio
7. Ab Initio
Data Warehousing Tools
Tool Category              Products
ETL Tools                  Informatica, Ab Initio, IBM InfoSphere DataStage,
                           Oracle Warehouse Builder, Business Objects Data
                           Integrator etc.
OLAP Server                Oracle Express Server, Oracle Essbase, IBM Cognos,
                           SAP NetWeaver OLAP, Microsoft Analysis Services
OLAP Tools                 Oracle Express Suite, Oracle Essbase, Cognos
                           PowerPlay, Business Objects, MicroStrategy
Data Warehouse             Oracle, Informix, Teradata, DB2
Data Mining & Analytics    SAS Enterprise Miner, IBM Intelligent Miner
APPLICATION AREAS FOR DATA WAREHOUSING
1. Financial Data Analysis
2. Data mining in the retail industry
3. Telecommunication Industry
4. Biological Data Analysis
5. Scientific Applications
6. Intrusion Detection
7. Sales and Marketing
8. Health Care and Insurance
9. e-Commerce
2. Differences between OLTP (Online Transaction Processing) and
OLAP (Online Analytical Processing)
     OLTP                                       OLAP
1.   It is customer oriented                    It is market oriented
2.   Users are clerks, IT professionals         Users are knowledge workers
3.   Function is day-to-day operations          Function is decision support
4.   DB design is application-oriented          DB design is subject-oriented
5.   Data is current, up-to-date, detailed,     Data is historical, summarized,
     flat relational, isolated                  multidimensional, integrated,
                                                consolidated
6.   Usage is repetitive                        Usage is ad hoc
7.   Access is read/write, index/hash on        Access requires lots of scans
     primary key
8.   Unit of work is a short, simple            Unit of work is a complex query
     transaction (uses 3NF)
9.   No. of records accessed is tens            No. of records accessed is
                                                millions (uses 2NF)
10.  No. of users is thousands                  No. of users is hundreds
11.  DB size is 100 MB to GB                    DB size is 100 GB to TB
12.  Low processing time                        High processing time
13.  Metric is transaction throughput           Metric is query throughput,
                                                response time
3 - Multi-Dimensional Data Model
The dimensional model was developed for
implementing data warehouses and data marts.
The multidimensional data model (MDDM) provides both a mechanism
to store data and a way to analyze the business.
Data warehouses and OLAP tools are based on a
multidimensional data model.
The multidimensional data model is typically used for the
design of corporate data warehouses and departmental
data marts.
The multidimensional data model is represented through
data cubes.
The core of the multidimensional model is the data cube, which
consists of a large set of facts (or measures) and a number of
dimensions.
The two primary components of dimensional models (MDDM) are
Dimensions and Facts.
A data cube consists of dimensions and measures, and the
multidimensional view of data is the foundation of OLAP.
A data cube consists of a lattice of cuboids, each corresponding to a
different degree of summarization of the given multidimensional
data.
Dimension Table: It consists of tuples of attributes of the dimension
and has a simple primary key. Dimensions are textual attributes used
to analyze data.
Fact Table: A fact table contains keys to each of the related dimension
tables. Facts are numeric values used to analyze the business.
Types of Facts: These are
1. Additive Facts
2. Semi Additive Facts
3. Non- Additive Facts
1. Additive Facts: Additive facts are facts that can be
summed up through all of the dimensions in the fact table.
Ex: Number of products sold on day 1 = 500
Number of products sold on day 2 = 200
-----------
700 (a meaningful total)
-----------
2. Semi-Additive Facts: These are the facts that can be
summed up for some of the dimensions in the fact table but
not others.
Ex: Balance of company's a/c 1 on day 1 = 5000
Balance of company's a/c 1 on day 2 = 3000
_________
8000 (not meaningful: a balance cannot be summed across
time, although balances can be summed across accounts)
_________
3. Non-Additive Facts: These are the facts that cannot be
summed up for any of the dimensions present in the fact
table.
Ex: Profit margin for day 1 = 30%
Profit margin for day 2 = 80%
(adding the two percentages gives no meaningful value)
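The three kinds of facts can be checked with a few lines of arithmetic. The sketch below reuses the day 1 and day 2 figures from the examples; the second account (`a/c 2`) and the revenue and profit amounts behind the 30% and 80% margins are hypothetical values added only to make the example computable.

```python
# Additive: units sold can be summed across any dimension.
units_sold = {1: 500, 2: 200}
total_units = sum(units_sold.values())

# Semi-additive: balances keyed by (account, day). The a/c 2 rows are hypothetical.
balances = {
    ("a/c 1", 1): 5000,
    ("a/c 1", 2): 3000,
    ("a/c 2", 1): 1000,
    ("a/c 2", 2): 1500,
}
# Summing across accounts for one day is meaningful ...
day2_total = sum(b for (_, day), b in balances.items() if day == 2)
# ... whereas summing a/c 1 across days (5000 + 3000) is not a real balance.

# Non-additive: a margin is a ratio, so it is recomputed from additive parts.
# Revenue and profit figures are hypothetical, chosen to give 30% and 80%.
daily = {
    1: {"revenue": 1000.0, "profit": 300.0},
    2: {"revenue": 500.0, "profit": 400.0},
}
margins = {day: d["profit"] / d["revenue"] for day, d in daily.items()}
overall_margin = (
    sum(d["profit"] for d in daily.values())
    / sum(d["revenue"] for d in daily.values())
)

print(total_units, day2_total, margins, round(overall_margin, 4))
```

The overall margin differs from both the sum and the plain average of the daily margins, which is why such ratios are usually stored as their additive components.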
Types of Dimensions
1. Slowly Changing Dimensions (SCD) – Dimensions that change
slowly over time rather than on a regular, time-based
schedule.
Type 1: Overwrite old value
Type 2: Add new row
Type 3: Add new column
Original record:
Id  Year  Name   City
1   2000  James  New York
New row added when the city changes (Type 2):
Id  Year  Name   City
1   2000  James  New York
1   2004  James  California
Rows tracked with start and end dates:
Id  Start Date        End Date            Name   City
1   1st January 2013  31st December 2013  James  New York
1   1st January 2014  31st December 2014  James  California
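The three SCD types can be sketched as small functions over a list of dictionaries. This is only an illustration using the James record from the tables above; the function names and the `previous_city` column name are hypothetical.

```python
import copy


def scd_type1(rows, key, new_city):
    """Type 1: overwrite the old value, so no history is kept."""
    out = copy.deepcopy(rows)
    for row in out:
        if row["id"] == key:
            row["city"] = new_city
    return out


def scd_type2(rows, key, new_city, year):
    """Type 2: keep the old row and append a new row for the change."""
    out = copy.deepcopy(rows)
    current = [row for row in out if row["id"] == key][-1]
    out.append({**current, "year": year, "city": new_city})
    return out


def scd_type3(rows, key, new_city):
    """Type 3: add a column that holds the previous value."""
    out = copy.deepcopy(rows)
    for row in out:
        if row["id"] == key:
            row["previous_city"] = row["city"]
            row["city"] = new_city
    return out


dim = [{"id": 1, "year": 2000, "name": "James", "city": "New York"}]
t1 = scd_type1(dim, 1, "California")
t2 = scd_type2(dim, 1, "California", 2004)
t3 = scd_type3(dim, 1, "California")
print(t1, t2, t3, sep="\n")
```

Type 2 keeps full history at the cost of extra rows, while Type 3 keeps only one previous value per added column.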
2. Conformed Dimensions – A dimension that has
exactly the same meaning and content when
referred to from different fact tables.
Ex: Products, Time, Location for Sales fact
3. Degenerate Dimensions – A dimension that does not
require a dimension table of its own; its value is
stored directly in the fact table.
Ex: An invoice or order number kept in the fact table
A data warehouse is based on a multidimensional data model, which
views data in the form of a data cube.
A multidimensional model has two types of views:
i. Logical view: Easy understanding for the user, e.g. to formulate
queries or to understand result presentation
ii. Physical view: Storage in computer memory and access methods
(sparse vs. dense).
Dimensions are the entities and attributes; the dimensional modeling
elements are hierarchies and facts.
Hierarchies:
Ex: Mandal->District->State->India->World
Ex: Day->Week->Month->Quarter->Half year->Year
Facts are measures.
Ex: Cost, Revenue, Quantity, No. of Units
Ex: Sales volume as a function of product, month,
and region is represented through a data cube.
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry -> Category -> Product
Location: Region -> Country -> City -> Office
Time: Year -> Quarter -> Month -> Day, and Year -> Week -> Day
The multidimensional data model is covered under the following topics:
1. Tables to Data Cubes
2. Stars, Snowflakes and Fact Constellations
3. Schemas with DMQL
4. Measures
5. Concept Hierarchies
6. OLAP Operations
7. Starnet Query Model
1. Tables to Data Cubes
A Data Warehouse is based on a multidimensional data
model, which views data in the form of a data cube.
When data is grouped or combined together in
multidimensional matrices, the result is called a data cube.
A Data Cube, such as sales, allows data to be modeled and
viewed in multiple dimensions.
– Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
– Fact Table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
Fig: Sales Product shown in Table and Cube
One way to look at a multidimensional data model is to view it as a cube.
The cube in the figure associates sales numbers (units sold)
with the dimensions product type, market, and time, with the
unit variables organized as cells in an array.
As the number of dimensions of a cube increases, the number of
cube cells increases exponentially.
Dimensions are hierarchical in nature, i.e. the time dimension
may contain hierarchies for years, quarters, months, weeks
and days.
Lattice of Cuboids: An n-D base cube is called a base cuboid.
The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid.
The lattice of cuboids forms a data cube.
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier);
(item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier);
(time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Fig : Lattice Cuboid
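The lattice in the figure can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below assumes no concept hierarchies within the dimensions, in which case n dimensions give 2^n cuboids; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# lattice[k] holds every k-D cuboid, i.e. every subset of k dimensions.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("total:", total_cuboids)
```

The empty tuple at level 0 is the apex cuboid ("all"), and the single 4-tuple at level 4 is the base cuboid.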
2. Stars, Snowflakes, and Fact Constellations
The most popular data model for a data warehouse is a
multidimensional model.
Data warehouses are modeled with dimension tables and fact
(measure) tables.
There are three types of schemas:
1. Star schema
2. Snowflake schema
3. Fact constellations
1. Star schema: It is also known as Star Join Schema.
It is the simplest style of data warehouse schema.
It is called a star schema because the entity-relationship diagram
of this schema resembles a star, with points radiating from a
central table.
A star query is a join between a fact table and a number of
dimension tables.
Each dimension table is joined to the fact table using a primary
key to foreign key join, but dimension tables are not joined to each
other.
A typical fact table contains keys and measures.
In a star schema, a dimension table does not have any parent
table.
Each dimension table has a primary key that corresponds exactly
to one of the components of the composite key in the fact table.
The star schema is a subject-oriented model.
Sales Fact Table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Fig: Star Schema
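A star query over such a schema can be tried out with Python's built-in sqlite3 module. The sketch below is a simplified assumption: it keeps only two of the four dimensions and a few columns, renames the tables `time_dim`, `item_dim`, and `sales_fact`, and uses made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO item_dim VALUES (10, 'TV', 'BrandA'), (11, 'Phone', 'BrandB');
INSERT INTO sales_fact VALUES
    (1, 10, 5, 2500.0), (1, 11, 8, 4000.0), (2, 10, 3, 1500.0);
""")

# Star query: the fact table is joined to each dimension table,
# and the dimension tables are never joined to each other.
star_query = """
SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN item_dim AS i ON f.item_key = i.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
"""
result = conn.execute(star_query).fetchall()
for row in result:
    print(row)
```

In a snowflake variant of the same data, reaching an attribute such as a supplier type would need one more join through a separate supplier table.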
Characteristics of Star Schema:
i. Simple structure: The schema is easy to understand.
ii. High query effectiveness: Only a small number of tables need to
be joined.
iii. Relatively long time for loading data into dimension tables:
because of de-normalization and redundant data, the tables
can be large.
iv. Most commonly used in data warehouse
implementations: Widely supported by a large number
of business intelligence tools.
Advantages of the Star Schema Model
1. Provides highly optimized performance for typical star queries.
2. Provides a direct and intuitive mapping between the business
entities being analyzed by end users and the schema design.
2. Snowflake schema: A refinement of the star schema
where some dimensional hierarchy is normalized into
third normal form, forming a set of smaller
dimension tables.
It keeps the same fact table structure as the star schema.
Within a dimension, it has multiple levels with multiple
hierarchies.
From each hierarchy of levels, any one level can be
attached to the fact table.
Mostly the lowest level of the hierarchy is attached to the fact table.
The snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a
query.
The snowflake schema architecture is a more complex
variation of the star schema used in a data warehouse,
because the tables which describe the dimensions are
normalized.
The snowflake schema is represented by a centralized fact
table which is connected to multiple dimensions.
Snowflaking affects only the dimension
tables and not the fact tables.
Benefits of Snowflake schema
1. It is easier to implement a snowflake schema when a
multidimensional model is added to typically normalized
tables.
2. A snowflake schema can reflect the same data as the
database.
Sales Fact Table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Fig: Snowflake Schema
Star vs. Snowflake
     Star Schema                                Snowflake Schema
1.   Has redundant data                         No redundancy, saves storage space
2.   Lower query complexity and                 More complex queries and less
     easy to understand                         easy to understand
3.   Fewer foreign keys and                     More foreign keys and longer
     shorter execution time                     query execution time
4.   One simple query analysis                  Many simple query analysis
5.   Fewer joins                                More joins
6.   Contains single-level dimensions           Contains multiple dimension levels
7.   Used when the dimension table              Used when the dimension table
     contains fewer rows                        is big
8.   Both dimension and fact tables             Dimension tables are normalized
     are de-normalized                          and fact tables are de-normalized
3. Fact constellations:
A fact constellation is a set of fact tables that share some dimension
tables.
Because the multiple fact tables sharing dimension tables can be
viewed as a collection of stars, it is also called a galaxy schema.
A fact constellation combines more than one fact
table with many dimension tables.
Sales Fact Table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location,
to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Fig: Fact Constellation Schema
3. Schemas with DMQL
DMQL is a Data Mining Query Language for relational databases.
Data mining query languages can be designed to support ad hoc
and interactive data mining.
Two language primitives:
i) Cube definition
Syntax: define cube <cube_name> [<dimension_list>]: <measure_list>
Ex: define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
ii) Dimension definition
Syntax: define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Ex: define dimension time as (time_key, day, day_of_week, month,
quarter, year)
The cube definitions for the three schemas in DMQL syntax are given
below; the schemas differ in how their dimensions are defined.
1. Defining Star Schema in DMQL:
define cube sales_star [time, item, branch, location]: dollars_sold =
sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
2. Defining Snowflake Schema in DMQL:
define cube sales_snowflake [time, item, branch, location]: dollars_sold =
sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
3. Defining Fact Constellation in DMQL:
define cube sales_factconstellation [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
4. Measures
Measures of a Data Cube: There are three categories:
1. Distributive: A measure is distributive if the result derived by
applying the function to n aggregate values is the same as that
derived by applying the function on all the data without partitioning.
E.g., count(), sum(), min(), max()
2. Algebraic: A measure is algebraic if it can be computed by an
algebraic function with M arguments (where M is a bounded integer),
each of which is obtained by applying a distributive aggregate function.
E.g., avg(), min_N(), standard_deviation()
3. Holistic: A measure is holistic if there is no constant bound on the
storage size needed to describe a subaggregate.
E.g., median(), mode(), rank()
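The difference between the three categories shows up as soon as the data is split into partitions. The following sketch uses made-up numbers in three hypothetical partitions and compares partition-wise results with results over all the data.

```python
from statistics import median

# Hypothetical data split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partition sums equals the sum of all values.
sum_of_sums = sum(sum(part) for part in partitions)

# Algebraic: avg() is derived from two distributive values, sum and count.
partials = [(sum(part), len(part)) for part in partitions]
avg_from_partials = (
    sum(s for s, _ in partials) / sum(c for _, c in partials)
)

# Holistic: the median of the partition medians is generally not the
# median of all values, so the full data must be available.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(sum_of_sums, avg_from_partials, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to pre-compute in a cube, while holistic measures usually need the detailed data or an approximation.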
5. Concept Hierarchies
Concept hierarchies organize the values of attributes or
dimensions into gradual levels of abstraction and are
useful in mining at multiple levels of abstraction.
A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level, more
general concepts.
Ex: Mandal->District->State->India ->World
Day->Week->Month->Quarter->Half year->Year
all:      all
region:   Europe ... North_America
country:  Germany ... Spain (Europe); Canada ... Mexico (North_America)
city:     Frankfurt ... (Germany); Vancouver ... Toronto (Canada)
office:   L. Chan ... M. Wind
Fig: A Concept Hierarchy: Dimension (location)
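A concept hierarchy like the one in the figure can be represented as a chain of lookup tables, one per level. The sketch below uses the city, country, and region names from the figure; the sales numbers and the helper name `roll_up` are hypothetical.

```python
from collections import defaultdict

# Each mapping takes a lower-level concept to its higher-level concept.
CITY_TO_COUNTRY = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
}
COUNTRY_TO_REGION = {
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
}


def roll_up(values, mapping):
    """Aggregate values one level up the hierarchy described by mapping."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[mapping[member]] += value
    return dict(totals)


# Hypothetical units sold per city.
sales_by_city = {"Frankfurt": 100, "Vancouver": 40, "Toronto": 60}
sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
print(sales_by_country)
print(sales_by_region)
```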
6. OLAP Operations
On-Line Analytical Processing (OLAP) can be performed
in data warehouses/marts using the multidimensional data
model.
An Online Analytical Processing (OLAP) server is based on the
multidimensional data model. It allows managers and
analysts to gain insight into information through fast,
consistent, interactive access to it.
One of the most compelling front-end applications for
OLAP is a PC spreadsheet program.
OLAP operations can be implemented efficiently using the
data cube structure.
Typical OLAP operations include roll-up, drill (down,
across, through), slice-and-dice, and pivot (rotate).
Fig. Typical OLAP Operations
OLAP Operations are
1. Roll-up: It is also called drill-up. It summarizes data by
climbing up a hierarchy or by dimension reduction. Roll-up
takes the current aggregation level of fact values and does
a further aggregation on one or more of the dimensions.
It is equivalent to doing a GROUP BY on this dimension
using the attribute hierarchy.
It decreases the number of dimensions and removes row
headers.
Ex: SELECT [attribute list], SUM([attribute names])
FROM [table list]
WHERE [condition list]
GROUP BY [grouping list];
2. Drill-down: It is also called roll-down. It is the reverse of roll-up:
it goes from a higher-level summary to a lower-level summary or
detailed data, or introduces new dimensions.
It presents data at a lower level of a dimension hierarchy, thereby
viewing data at a more specialized level within a dimension.
It increases the number of dimensions and adds new headers.
3. Slice: It corresponds to project and select.
It performs a selection on one dimension of the given cube, resulting
in a sub-cube.
It reduces the dimensionality of the cube.
It sets one or more dimensions to specific values and keeps a subset
of dimensions for the selected values.
4. Dice: It defines a sub-cube by performing a selection on one
or more dimensions.
It refers to a range select condition on one dimension, or to
a select condition on more than one dimension.
It reduces the number of member values of one or more
dimensions.
5. Pivot: It is also called rotate. It reorients the cube for
visualization, e.g. a 3-D cube to a series of 2-D planes.
It rotates the data axes to view the data from different
perspectives.
It groups data with different dimensions.
6. Drill across: It involves (goes across) more than one fact table.
It accesses more than one fact table, linked by
common dimensions.
It combines cubes that share one or more dimensions.
7. Drill through: It goes through the bottom level of the cube to its
back-end relational tables (using SQL).
8. Cross-tab: Spreadsheet-style row/column aggregates.
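Several of the operations above can be sketched on a tiny cube held in a dictionary. This is an illustrative toy, not how an OLAP server stores data: the cube contents, the dimension order in `DIMS`, and the function names are all assumptions.

```python
from collections import defaultdict

# Hypothetical cube: (quarter, city, item) -> units sold.
DIMS = ("quarter", "city", "item")
cube = {
    ("Q1", "Toronto", "TV"): 10,
    ("Q1", "Toronto", "Phone"): 20,
    ("Q1", "Vancouver", "TV"): 5,
    ("Q2", "Toronto", "TV"): 7,
    ("Q2", "Vancouver", "Phone"): 8,
}


def roll_up(cells, drop):
    """Roll-up by dimension reduction: aggregate away the dimension `drop`."""
    idx = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:idx] + key[idx + 1:]] += value
    return dict(out)


def slice_(cells, dim, value):
    """Slice: select a single value on one dimension."""
    idx = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[idx] == value}


def dice(cells, conditions):
    """Dice: select sets of values on one or more dimensions."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    }


def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table_2d.items()}


by_quarter_city = roll_up(cube, "item")
q1_slice = slice_(cube, "quarter", "Q1")
tv_dice = dice(cube, {"quarter": {"Q1", "Q2"}, "item": {"TV"}})
rotated = pivot(by_quarter_city)
print(by_quarter_city, q1_slice, tv_dice, rotated, sep="\n")
```

Drill-down is the inverse of `roll_up` and therefore needs the more detailed cells to be available; it cannot be derived from the summarized view alone.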
7. Starnet Query Model
The querying of multidimensional databases can be
based on a starnet model.
A starnet model consists of radial lines emanating
from a central point, where each line represents a
concept hierarchy for a dimension. Each abstraction
level in the hierarchy is called a footprint.
These represent the granularities available for use by
OLAP operations such as drill-down and roll-up.
Fig: Modeling business queries: a Starnet Model.
Consider a starnet query model for the AllElectronics data
warehouse.
This starnet consists of four radial lines, representing concept
hierarchies for the dimensions location, customer, item,
and time, respectively.
Each line consists of footprints representing abstraction
levels of the dimension.
A concept hierarchy may involve a single attribute or
several attributes.
Ex: The time line has four footprints: “day,” “month,”
“quarter,” and “year.”
Radial lines (dimensions) and their footprints:
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Customer
Promotion
Each circle is called a footprint
Fig: A Star-Net Query Model
4 - Data Warehouse Architecture
Data warehouse architecture is a description of the elements and
services of the warehouse, with details showing how the
components fit together and how the system will grow over
time.
The DW architecture is a structure that comprises a top tier (front-end
tools), a middle tier (OLAP server), and a bottom tier (DW server).
It is an integrated set of products that enable the extraction and
transformation of operational data to be loaded into a database for
end-user analysis and reporting.
It consists of
1. Steps for the Design and Construction of DW
2. Three-Tier Data Warehouse Architecture
3. Data Warehouse Back-End Tools and Utilities
4. Metadata Repository
5. Types of OLAP Servers
1. Steps for the Design and Construction of Data
Warehouses: The design of a data warehouse follows a business
analysis framework.
There are four views regarding the design of a data warehouse:
1. Top-down view: allows selection of the relevant
information necessary for the data warehouse.
2. Data source view: exposes the information being
captured, stored, and managed by operational systems.
3. Data warehouse view: consists of fact tables and
dimension tables.
4. Business query view: sees the perspectives of data in
the warehouse from the viewpoint of the end user.
Typical Data Warehouse Design Process:
1. Choose a business process to model, e.g., orders,
invoices, etc.
2. Choose the grain (atomic level of data) of the business
process.
3. Choose the dimensions that will apply to each fact table
record.
4. Choose the measures that will populate each fact table
record.
Implementing a Warehouse
Monitoring: Sending data from sources
Integrating: Loading, cleansing, ...
Processing: Query processing, indexing, ...
Managing: Metadata, design, ...
2. Three-Tier Data Warehouse Architecture
The bottom tier is a warehouse database server that is
almost always a relational database system.
Back-end tools and utilities are used to feed data into the
bottom tier from operational databases or other external
sources.
The middle tier is an OLAP server that is typically
implemented using either a relational OLAP (ROLAP)
model or a multidimensional OLAP (MOLAP) model.
The top tier is a front-end client layer, which contains
query and reporting tools, analysis tools, and/or data
mining tools.
Fig: Three-Tier Data Warehouse Architecture
There are three data warehouse models: enterprise
warehouse, data mart, and virtual warehouse.
1. Enterprise Warehouse: Collects all of the information
about subjects spanning the entire organization.
2. Data Mart: A subset of corporate-wide data that is of
value to a specific group of users.
3. Virtual Warehouse: A set of views over operational
databases.
Stages: Define a high-level corporate data model; build Data Marts
and the Enterprise Data Warehouse, with model refinement at each
step; then Distributed Data Marts; then the Multi-Tier Data Warehouse
Fig: Data Warehouse Development: A Recommended Approach
3. Data Warehouse Back-End Tools and Utilities
1. Data Extraction: Get data from multiple, heterogeneous,
and external sources
2. Data Cleaning: Detect errors in the data and rectify them
when possible
3. Data Transformation: Convert data from legacy or host
format to warehouse format
4. Load Data : Sort, summarize, consolidate, compute
views, check integrity, and build indices and partitions
5. Refresh the Data: Propagate the updates from the data
sources to the warehouse
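The five back-end steps can be chained into a small pipeline. The sketch below works on in-memory lists instead of real source systems; the record layout (`customer`, `amount`), the source names, and the summary kept in the warehouse dictionary are hypothetical.

```python
def extract(sources):
    """Data extraction: gather rows from several (here in-memory) sources."""
    return [row for source in sources for row in source]


def clean(rows):
    """Data cleaning: drop rows with a missing amount, trim stray whitespace."""
    return [
        {**row, "customer": row["customer"].strip()}
        for row in rows
        if row.get("amount") is not None
    ]


def transform(rows):
    """Data transformation: convert to the warehouse format (numeric amount)."""
    return [
        {"customer": row["customer"], "amount": float(row["amount"])}
        for row in rows
    ]


def load(warehouse, rows):
    """Load: append the rows, then recompute a summary per customer."""
    warehouse["facts"].extend(rows)
    summary = {}
    for row in warehouse["facts"]:
        summary[row["customer"]] = summary.get(row["customer"], 0.0) + row["amount"]
    warehouse["summary"] = summary
    return warehouse


def refresh(warehouse, new_sources):
    """Refresh: propagate new source rows through the same pipeline."""
    return load(warehouse, transform(clean(extract(new_sources))))


crm = [{"customer": " Ann ", "amount": "10.5"}, {"customer": "Bob", "amount": None}]
flat_file = [{"customer": "Ann", "amount": 4}]

dw = load({"facts": [], "summary": {}}, transform(clean(extract([crm, flat_file]))))
dw = refresh(dw, [[{"customer": "Bob", "amount": "2"}]])
print(dw["summary"])
```

Recomputing the whole summary on each load keeps the sketch short; a real refresh would normally update summaries incrementally.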
4. Metadata Repository: Metadata (data about data) is the
data defining warehouse objects. It includes:
A description of the structure of the data warehouse: schema, views,
dimensions, hierarchies, derived data definitions, data mart
locations and contents.
Operational metadata: data lineage, currency of data, monitoring
information.
The algorithms used for summarization.
The mapping from the operational environment to the data warehouse.
5. OLAP Server Architectures: The types of OLAP servers are:
1. ROLAP: Relational Online Analytical Processing
2. MOLAP: Multidimensional Online Analytical Processing
3. HOLAP: Hybrid Online Analytical Processing
(1) Relational OLAP (ROLAP)
Special schema designs: star, snowflake
Special indexes: bitmap, multi-table join
Proven technology (relational model, DBMS) that tends to
outperform specialized MDDBs, especially on large data sets.
Uses a relational or extended-relational DBMS to store and manage
warehouse data, together with OLAP middleware.
Examples are
Telecommunication startup: call data records (CDRs)
E-commerce site
Credit card company
Products are
IBM DB2, Oracle, Sybase IQ, RedBrick, Informix.
Relational and specialized relational DBMS
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
Fast indexing to pre-computed summarized data
Facts stored in multi-dimensional arrays
Dimensions used to index array
Examples are
Budgeting in a financial department and
Sales analysis.
Products are Pilot, Arbor Essbase, Gentia.
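The array-based storage idea can be illustrated with a two-dimensional array whose positions are found through the dimension members. This is a toy sketch with made-up products, quarters, and figures; real MOLAP engines add compression for sparse arrays and many more dimensions.

```python
# Dimension members map to array positions (the dimensions index the array).
products = ["TV", "Phone"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_index = {name: i for i, name in enumerate(products)}
quarter_index = {name: i for i, name in enumerate(quarters)}

# Facts are stored in a dense array: one cell per (product, quarter).
cells = [[0] * len(quarters) for _ in products]


def store(product, quarter, units):
    """Add units to the cell addressed by the two dimension members."""
    cells[product_index[product]][quarter_index[quarter]] += units


def lookup(product, quarter):
    """Direct access to a cell, with no join or scan."""
    return cells[product_index[product]][quarter_index[quarter]]


# Hypothetical facts.
for fact in [("TV", "Q1", 10), ("TV", "Q3", 4), ("Phone", "Q1", 20), ("Phone", "Q4", 6)]:
    store(*fact)

# Pre-computed summary: total units per product across all quarters.
product_totals = {p: sum(cells[product_index[p]]) for p in products}
print(lookup("TV", "Q3"), lookup("Phone", "Q2"), product_totals)
```

Empty cells such as ("Phone", "Q2") still occupy space in the array, which is the sparse versus dense trade-off mentioned for the physical view.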
(3) Hybrid OLAP (HOLAP)
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools.
Flexibility, e.g., low level: relational, high-level: array
• Examples are
Sales department of a multi-national company
Banks and Financial Service Providers
• Products / Tools are
ORACLE 8i,10g and 11i
ORACLE Express Server
ORACLE Relational Access Manager
ORACLE Express Clients