DW Notes

The document provides an overview of data warehousing concepts, including the distinction between OLTP and OLAP systems, data warehouse design methodologies, and architectural components. It discusses dimensional modeling, fact and dimension tables, surrogate keys, slowly changing dimensions, and the importance of data cubes for analysis. Additionally, it covers advanced modeling concepts, query performance enhancement techniques, and the implications of sparsity in data aggregation.

Data Warehouse:

Module 1: Intro

Software systems can be broadly classified as:

OLTP (Online Transaction Processing): Operational systems required to run the business applications, holding
information about the business processes. Highly transactional, with large volumes of data. Normalized to reduce redundancy:
more tables result from applying 1NF, 2NF, and 3NF, which avoids update, delete, and insert anomalies. Built for faster inserts, updates,
and deletes. As more data comes into the system, data quality is maintained and redundancy is reduced. These are "data in" (data
capture) systems.

OLAP (Online Analytical Processing): Decision support systems holding information about strategies. Subject oriented (each data mart is about a
subject). Built for faster analysis and search; read-mostly systems with few updates except during loading (ETL).
Denormalized into few tables, with integrated data in one place for faster access to the data --> "data out" systems.

Module 2: Data Warehouse Design

 ER modelling is used for OLTP to optimize data updates through normalization, which creates more
tables to reduce redundancy and avoid data anomalies (delete, insert, update).
 ER modelling cannot be used in a DW, as normalization does not support fast data retrieval over large
volumes, because more tables (and joins) are involved.
 Dimensional modelling is used in a DW.
 Fact or Measure = function(Dimensions or factors)
 Facts or measures are numeric and additive. Ex: Sale Amt = f(product, location, time)
 Dimension tables have textual, descriptive, or numerical attributes.
 Fact tables have facts or measures, which are numeric, and also foreign keys from dimension tables, which are
surrogate keys (not intelligent or smart keys).
 All the foreign keys (dimension entries) in the fact table are surrogate keys and together form a composite primary key;
each takes 4 bytes and holds an integer value. They are the new primary keys in the dimension tables.
 Types of facts: additive (the fact is additive across all dimensions); semi-additive (the fact is additive across some dimensions.
Ex: account balance is not additive w.r.t. time; yesterday 10k and today 15k does not mean the balance is 25k. But across the
customer dimension it is additive: if I have 10k and another customer has 5k, the total balance for both is 15k); and non-
additive.
 Star schema: dimension tables are not interconnected, but all are connected through the fact table.
 Snowflake: dimension tables are normalised into sub-tables, and such normalised dimension tables are
inter-connected.
 Design steps (see the SQL sketch after this list):
o Step 1: Identify the business process
o Step 2: Declare the grain (level of detail, ex: per day, per month)
o Step 3: Identify the dimensions
o Step 4: Identify the facts
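
A minimal star schema sketch in SQL following the four design steps above (retail sales at daily grain; all table and column names are illustrative assumptions, not from the source):

-- Dimension table: the surrogate key is the new primary key;
-- the natural key from OLTP is kept as an ordinary attribute.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key (4-byte integer)
    product_id    VARCHAR(20),           -- natural/production key from OLTP
    product_name  VARCHAR(100),
    category      VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    full_date     DATE,
    month_name    VARCHAR(10),
    year_no       INTEGER
);

-- Fact table: numeric, additive measures plus surrogate foreign keys;
-- the foreign keys together form the composite primary key.
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product (product_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    sale_amt      DECIMAL(12,2),         -- additive fact
    units_sold    INTEGER,               -- additive fact
    PRIMARY KEY (product_key, date_key)
);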

Data Cubes:Users of decision support systems often see data in the form of data cubes. The
cube is used to represent data along some measure of interest. Although called a "cube", it can
be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some
attribute in the database and the cells in the data cube represent the measure of interest.
Inmon (top-down): high risk, high reward. Kimball (bottom-up): low risk, low reward.

Database sizing:

FACT TABLE SIZE

 3 years of data

 100 stores

 Daily grain

 60,000 SKUs
 Sparsity = 10% (10% of total products are sold per day, i.e. 6,000 SKUs)

 4 dimension keys (16 bytes)

 4 facts (16 bytes)

Total size = 3 x 365 x 100 x 6,000 x 32 bytes ≈ 21 GB (~20 GB)

Dimensional versus normalized approach for storage of data


 The dimensional approach refers to Ralph Kimball's approach, in which it is stated that the data
warehouse should be modeled using a dimensional model/star schema. The normalized approach, also
called the 3NF model (Third Normal Form), refers to Bill Inmon's approach, in which it is stated that the
data warehouse should be modeled using an E-R model/normalized model.
 In the normalized approach, the data in the data warehouse are stored following, to a degree, database
normalization rules.
 In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for
specific business processes. These data marts can then be integrated to create a comprehensive data
warehouse.
 A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. The data mart is a subset of the data warehouse and is usually oriented to a specific business line
or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts
pertains to a single department.
 Dependent data marts: the DW is implemented first, then data marts are created from it.

 Independent data marts: data marts are created first, and the aggregate of them constitutes the DW.
 A coverage factless table contains all possible combinations. Coverage minus the fact table gives information that is not
part of the fact table; it covers all possibilities (see the query sketch below).
 So factless-fact entries give information about missing events in the fact table.
 It has no measures; it only has foreign keys from dimension tables.
 It contains no measure data; it only tracks events (happened and not happened).
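
A sketch of how a coverage factless table answers "what did not happen", e.g. promoted products that recorded no sale (table and column names are illustrative assumptions):

-- Covered combinations minus actual sales = events that never happened.
SELECT c.product_key, c.store_key, c.date_key
FROM   factless_promotion_coverage c
LEFT JOIN fact_sales s
       ON  s.product_key = c.product_key
       AND s.store_key   = c.store_key
       AND s.date_key    = c.date_key
WHERE  s.product_key IS NULL;   -- in coverage but missing from the fact table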

Topic 5 - Data warehouse architecture:

• Source Systems

• Data Staging Area (a storage area where extracted data is cleaned and transformed; the ETL work area)

• Presentation Servers (target physical machines on which DW data is organized and stored for querying; data marts live here)

• Data Mart/Super Marts

• Data Warehouse

• Operational Data Store

• OLAP (used by end users through a client-server architecture; a front-end tool for the DW that enables easy reporting
capabilities without SQL knowledge)

ODS (Operational Data Store, optional): half operational (volatile, current-valued) & half DSS (subject oriented,
integrated)

• ODS is particularly useful when:

– the ETL process of the main DW delays the availability of data
– only aggregated data is available in the DW

• Update classes:
– Class I – updates of data from operational systems to the ODS are synchronous
– Class II – updates between the operational environment & the ODS occur within a 2-3 hour frame
– Class III – synchronization of updates occurs overnight
– Class IV – updates into the ODS from the DW are unscheduled
• Data in the DW is analyzed and periodically placed in the ODS
• For example – customer profile data:
• Customer name & ID
• Customer volume – high/low
• Customer profitability – high/low
• Customer frequency of activity – very frequent/very infrequent
• Customer likes & dislikes

OLAP
 A data warehouse serves as a repository to store historical data that can be used for
analysis.
 OLAP (Online Analytical Processing) can be used to analyze and evaluate data in a
warehouse.
 The warehouse has data coming from varied sources. OLAP tools help organize data in
the warehouse using multidimensional models.

Topic 6, 7: Case study

 Factless tables
 Conformed dimensions, where dimension tables are reused.

Module 5: Topic 8 and 9: Advanced Modelling concepts


Surrogate keys

 Natural keys, production keys, smart keys, or intelligent keys (previously the primary key
from OLTP).
 Surrogate keys are artificial integer keys. 4-byte integers (up to ~4 billion rows) take less space than smart
keys. The natural primary key is replaced by a surrogate key.
 Advantages: less space + faster joins than using natural keys, as comparison in joins involves
integers + easy handling of changing dimensions.

Slowly Changing Dimensions

 Type 1: Overwrite the old value in the dimension table; no history is maintained. Easy and fast to
implement. Only the latest value is reflected.
 Type 2: Add a new dimension row; history is maintained. The new record has the same natural
key, but a new surrogate key is created for the new row. Effective-date columns can be added (see the SQL sketch after this list).
 Type 3: Add a new column (new attribute) in the same record, so previous and new
values are both listed. Full history is not available; only recent history is maintained + when more
attributes change, too many columns might get added.
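
A minimal Type 2 sketch, assuming an effective-date pair on the dimension and a hypothetical new surrogate key value (all names and values are illustrative):

-- Type 2: close out the currently active row for natural key 'C123' ...
UPDATE dim_customer
SET    end_date = DATE '2024-06-30'
WHERE  customer_id = 'C123'
  AND  end_date IS NULL;        -- the currently active row

-- ... then insert a new row with the same natural key but a NEW surrogate key.
INSERT INTO dim_customer (customer_key, customer_id, city, start_date, end_date)
VALUES (50123, 'C123', 'Pune', DATE '2024-07-01', NULL);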

Generating Surrogate keys:

 First time load: assign new surrogate keys.


 Data refresh:
o First find whether the record exists in the dimension table; if not, assign a new surrogate key.
o If it exists, find whether the record has changed, using checksums of each record. If there is a change,
find the attributes which changed and load the changes according to Type 1, 2, or 3.
o If it is Type 2, assign a new surrogate key (a new row is added).
 Lookup tables can be used in the refresh activity.

Lookup Tables: Each dimension table has a lookup table which holds mappings between surrogate
keys and natural keys. Only the latest surrogate key is stored when multiple rows exist with the same
natural key (previously the primary key from OLTP).

Advantages:

 Makes generation of SKs faster


 Helps the refresh activity in the dimension
 Helps populate the fact table faster (see the SQL sketch below)
 Always contains the latest dimension record (latest SK)
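
A sketch of using the lookup table while populating the fact table: incoming staged rows carry natural keys, which the lookup maps to the latest surrogate key (table names are illustrative assumptions):

-- Resolve natural keys to the latest surrogate keys during the fact load.
INSERT INTO fact_sales (product_key, date_key, sale_amt, units_sold)
SELECT lk.product_key,                  -- latest surrogate key from the lookup
       s.date_key,
       s.sale_amt,
       s.units_sold
FROM   staging_sales s
JOIN   lookup_product lk
       ON lk.product_id = s.product_id; -- natural key from OLTP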

Rapidly Changing Monster Dimensions:

 Problem: Type 2 is not recommended in such cases, as the number of rows will explode.


 Solution: Mini-dimension. The rapidly changing attributes are separated out into a separate mini-dimension
table. The fact table has a foreign key to this new mini-dimension table as well. Both the primary
dimension (from which the mini-dimension is separated) and the mini-dimension are related through the fact or factless table, as in the sketch below.
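
A DDL sketch of the mini-dimension pattern: the rapidly changing attributes move to their own small table, and the fact table carries a foreign key to both the primary dimension and the mini-dimension (illustrative names):

-- Mini-dimension holding only the rapidly changing attributes.
CREATE TABLE dim_customer_demographics (
    demographics_key INTEGER PRIMARY KEY,   -- its own surrogate key
    age_band         VARCHAR(10),
    income_band      VARCHAR(10)
);

-- The fact table relates the primary dimension and the mini-dimension.
CREATE TABLE fact_account_activity (
    customer_key     INTEGER,       -- primary customer dimension
    demographics_key INTEGER,       -- mini-dimension
    date_key         INTEGER,
    txn_amount       DECIMAL(12,2)
);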

Snowflaking & Outriggers:

 Snowflakes are not recommended: dimensions are further normalised.


 Outriggers: permissible snowflakes. Outriggers have special characteristics that make the snowflake model
permissible: limited normalization, applied only to attribute sets which are highly correlated. Outriggers are an
exception and should not be considered a rule; they should not be used unless there is a need.

[Diagrams: (1) Star schema with a mini-dimension: Dimension1 through Dimension4 and the Mini Dimension each connect directly to the Fact table. (2) Snowflake vs. outrigger: in the snowflake, dimension tables are normalised into interconnected sub-tables; in the outrigger, a single permissible sub-table hangs off one dimension.]

 When all dimensions are normalised it is a snowflake; when only some dimensions are normalised, it is called
a starflake.

 A centipede fact table is a normalized fact table. The modeller may decide to normalize the
fact table instead of snowflaking the dimension tables.

Time Dimension:

The time dimension is explicitly added in the DW and is not something coming from the OLTP systems. This avoids deriving time
attributes on the fly, which saves time when accessing or analyzing millions of records.

Conformed dimensions: (conformed = of similar type)

 A conformed dimension can exist as a single dimension table that relates to multiple
fact tables within the same data warehouse, or as identical dimension tables in separate data
marts.
 Dimensions can be reused (created physically) in different data marts or star schemas in three different ways:
identical copies; only some columns/attributes; or only some rows/records.

Role Playing Dimensions

 When the same dimension/attributes are used multiple times within the same fact table, it complicates the join
operation.
 Solution: views are created to separate the roles. Each view is a virtual dimension entity and has its own
SK entry in the fact table like any other dimension table, as sketched below.
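
A sketch of the role-playing solution: one physical date dimension exposed through views, one per role, assuming the fact table carries one date key per role (names are illustrative):

-- One physical dim_date, two virtual role-playing dimensions.
CREATE VIEW dim_order_date AS
SELECT date_key AS order_date_key, full_date AS order_date
FROM   dim_date;

CREATE VIEW dim_ship_date AS
SELECT date_key AS ship_date_key, full_date AS ship_date
FROM   dim_date;

-- The fact table then joins each role unambiguously.
SELECT o.order_date, sh.ship_date, f.sale_amt
FROM   fact_sales f
JOIN   dim_order_date o  ON o.order_date_key = f.order_date_key
JOIN   dim_ship_date  sh ON sh.ship_date_key = f.ship_date_key;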

Multi-valued dimensions: (same cell with multiple values)

Ex: joint accounts in a bank; multiple salespersons involved in one sale.

M7: Query Performance Enhancing Techniques (Concepts of physical design)

There are several strategies to improve query response time in the data warehouse
context: indexing techniques, materialized views, and partitioning of
data.

Description/Plan/Reference and brief description:

RL7.1.1 = Aggregation (pre-calculated summarization of the base fact table through a new star schema with fewer record rows)
RL7.1.2 = Sparsity Failure (the sparsity problem)
RL7.2.1 = Shrunken, Lost, & Collapsed Dimensions
RL7.3.1 = Aggregate Navigator (helps convert aggregate-unaware SQL queries from users into aggregate-aware queries)
RL7.3.2 = Aggregate Navigation Algorithm (tries the smallest aggregate star, then the next smallest, until results are returned)
RL7.4.1 = Partitioning
RL7.4.2 = Partitioning w.r.t. Time
RL7.5.1 = View Materialization
RL7.5.2 = Selection of Views to Materialize
RL7.6.1 = View Maintenance Strategies
RL7.6.2 = Incremental Maintenance Algorithms
RL7.7.1 = Bitmap Indices
RL7.7.2 = Bitmap Compression Strategies
CS7.1.1 = Data Warehouse performance challenges (T1, Ch 18)
CS7.1.2 = Concepts of physical design (T1, Ch 18)

Aggregation (Summarization in separate star schemas)


 Aggregate fact tables are merely summaries of the most granular data at
higher levels along the dimension hierarchies.
An aggregate is a fact table representing a summarization of base-level fact table data.
 Aggregates are precalculated summaries that are stored in the data warehouse to improve query performance.
 Improves query performance by a factor of 100 to even 1000, as the number of record rows decreases with
aggregation.
 Ex: if a fact table contains daily granular sales data, another summary table (another star
schema) is created with a summary of the sales details per month, so the total number of records is reduced in the
second instance. Similarly, if sales per product sub-category are stored in the base fact table, a
summary/aggregate table can be created with sales per category. (See the SQL sketch after this list.)
 Aggregation can be done one-way (with 1 dimension), two-way, or multiway.
 With n dimensions, 2^n aggregates can be created, including the base table (with no aggregation).
 With one-way aggregation the sparsity increases; with two-way the increase is more.
 Sparsity (antonym: density) refers to how much of the fact table is occupied, i.e. the % of possible events that actually occur.
 So with increasing levels of aggregation (2, 3, ... n ways), the sparsity percentage increases, which results in
more records than expected; this causes sparsity failure.
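
A sketch of building a one-way aggregate (the monthly summary of the daily fact from the example above); the aggregate lives in its own table/star (names reuse the earlier illustrative schema):

-- One-way aggregate: daily grain rolled up to monthly grain.
CREATE TABLE agg_sales_monthly AS
SELECT d.year_no,
       d.month_name,
       f.product_key,
       SUM(f.sale_amt)   AS total_sale_amt,
       SUM(f.units_sold) AS total_units
FROM   fact_sales f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP  BY d.year_no, d.month_name, f.product_key;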

Effect of Sparsity on Aggregation

Consider the case of the grocery chain with 300 stores, 40,000 products in each store, but only 4,000 selling in each store in a day. As discussed
earlier, assuming that you keep records for 5 years or 1825 days, the maximum
number of base fact table rows is calculated as follows:
Product = 40,000
Store = 300
Time = 1825
Maximum number of base fact table rows = 40,000 x 300 x 1825 = 21.9 billion (≈ 22 billion)

Because only 4,000 products sell in each store in a day, not all of these 22 billion rows are
occupied. Because of this sparsity, only 10% of the rows are occupied. Therefore, the real
estimate of the number of base table rows is about 2 billion.

Now let us see what happens when you form aggregates. Scrutinize a one-way aggregate:
brand totals by store by day. Calculate the maximum number of rows in this one-way
aggregate:
Brand = 80
Store = 300
Time = 1825
Maximum number of aggregate table rows = 80 x 300 x 1825 = 43,800,000

While creating the one-way aggregate, you will notice that the sparsity for this aggregate
is not 10% as in the case of the base table. This is because when you aggregate by brand,
more of the brand codes will participate in combinations with store and time codes. The sparsity
of the one-way aggregate would be about 50%, resulting in a real estimate of 21,900,000 rows.
If the sparsity had remained at the 10% applicable to the base table, the real estimate of the
number of rows in the aggregate table would be much less.
When you go for higher levels of aggregates, the sparsity percentage moves up and even
reaches 100%.

Experienced data warehousing practitioners have a suggestion. When you form aggregates,
make sure that each aggregate table row summarizes at least 10 rows in the lower
level table. If you increase this to 20 rows or more, it would be really remarkable.

Aggregate Navigator

 Why should aggregates be hidden from end users? If they are not hidden, any change in the aggregates
will force the users to change their queries. To keep the two independent, Aggregate Navigator
middleware software is used.
 When a user sends a query targeting the base table, the AN converts the query to target an
appropriate aggregate.

Aggregate navigator algorithm:

 Start from the smallest aggregate fact table; if the query can be answered, that table is treated as the base
fact table and its aggregated dimensions are used in query processing.
 If it fails, try the next smallest, and so on.
 In the worst case, the base table itself is queried, if none of the aggregate tables satisfies the dimensions in the
query.

Partitioning:
 Fact tables are generally very large. Large tables are not easy to manage. During the load
process, the entire table must be closed to users. Backup and recovery of large tables also
pose difficulties because of their sheer size.
 Partitioning divides large database tables into manageable parts. Partitioning helps load fact
tables faster, supports an incremental backup process, and speeds up retrieval. It can be horizontal or vertical.
 Partitioning w.r.t. time means partitioning into year or time ranges, which lets data retrieval, backup,
loading, and archiving work on time-based partitions.

 Range Partitioning: Each partition is specified by a range of values of the partitioning key (e.g. a trade table can be
partitioned by the year of the trade date); see the sketch below.
 List Partitioning: Each partition is specified by a list of values of the partitioning key (e.g. sales data can be partitioned by
region, in which countries falling under NA, EMEA, etc. are grouped under separate lists).
 Hash Partitioning: A hash algorithm is applied to the partitioning key to determine the partition for a given row.
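
A sketch of range partitioning w.r.t. time, using PostgreSQL-style declarative partitioning (syntax varies by RDBMS; names are illustrative):

-- Fact table partitioned by year of the sale date.
CREATE TABLE fact_sales_p (
    date_key  INTEGER,
    sale_date DATE,
    sale_amt  DECIMAL(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE fact_sales_2023 PARTITION OF fact_sales_p
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE fact_sales_2024 PARTITION OF fact_sales_p
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Loading, backup, and archiving can now target one yearly partition at a time.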

View Materialization:

 Problems with views: when a view is created, just the view definition is saved in the DB; no rows/data are
fetched or exist physically.
 Whenever a query runs on a view, the view definition is executed.
 So the overall query response time on a view is very large on big data sets or in joins.
 A view is a virtual table and is nothing but a saved SQL query.

A materialized view is a database object that contains the results of a query. It contains data which can be
queried, and it can be refreshed. Materialized views are pre-computed important, expensive, and frequently required
results.
Maintenance or Synchronization:

A view maintenance policy is a decision about when a view is refreshed,


independent of whether the refresh is incremental or not. A view can be refreshed
within the same transaction that updates the underlying tables. This
is called immediate view maintenance. The update transaction is slowed
by the refresh step, and the impact of refresh increases with the number of
materialized views that depend on the updated table.
Alternatively, we can defer refreshing the view. Updates are captured in a log
and applied subsequently to the materialized views. There are several deferred
view maintenance policies:
1. Lazy: The materialized view V is refreshed at the time a query is evaluated
using V, if V is not already consistent with its underlying base tables. This
approach slows down queries rather than updates, in contrast to immediate
view maintenance.
2. Periodic: The materialized view is refreshed at fixed intervals (daily or weekly).
3. Forced: The materialized view is refreshed after a certain number of
changes have been made to the underlying tables.
Problems:
In periodic and forced view maintenance, queries see an instance of the
materialized view that is not consistent with the current state of the underlying
tables.

Examples

Create a materialized view of columns from the customer and bookorder tables.
CREATE MATERIALIZED VIEW custorder AS
SELECT custno, custname, ordno, book
FROM customer, bookorder
WHERE customer.custno=bookorder.custno;

Create a materialized view of columns x1 and y1 from the t1 table.


CREATE MATERIALIZED VIEW v1 AS SELECT x1, y1 FROM t1
PRIMARY KEY (x1) UNIQUE HASH (x1) PAGES=100;
Policy                 Current value                     Delay in query response
Immediate              Always                            Less probable
Lazy                   Always                            Highly probable
Periodic               May not be, when not refreshed    Less probable
Forced / Event-based   May not be, when not refreshed    Less probable

Incremental Updates on materialized views:

 A change in the base tables can be propagated to materialized views by retaining the old view and
just applying the delta on the view, incrementally.
 Ex: If V = R join S, and Ir tuples are inserted into R, then the new materialized view can be
obtained as Vnew = Rnew join S.
Since Rnew = R union Ir, Vnew = (R union Ir) join S,
which can be written as Vnew = (R join S) union (Ir join S).
 So Vnew = Vold union (Ir join S); only the delta join needs to be computed (see the SQL sketch below).
 Similarly, when Dr tuples are deleted, Vnew = Vold - (Dr join S).
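
The insert case above written as SQL: the delta table Ir is joined with S and the result is appended to the stored view, instead of recomputing R join S from scratch (table names are illustrative assumptions):

-- Incremental refresh: V_new = V_old UNION (I_r JOIN S).
INSERT INTO v_materialized (r_key, r_val, s_val)
SELECT i.r_key, i.r_val, s.s_val
FROM   i_r AS i                 -- newly inserted R tuples (the delta)
JOIN   s ON s.r_key = i.r_key;  -- join the delta with S only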

Bitmap Indexing: Indexing in general improves query performance, but the downside is that with
large data tables, the index will also be very large. Since a data warehouse has large datasets,
normal indexing techniques may not be efficient. Bitmap indexes are generally used in data
warehouse tables on low-cardinality columns, i.e., columns with few distinct values. Ex: a table
may have 1 million records, but the gender column holds only two values; similarly a state
column. For every distinct value, a bit is used: for 2 distinct values, 2 bits per row are used. If the value is
present, the bit is 1, else 0. For each row of data, a bit vector position exists along
with the rowid. Whichever rows meet the condition yield the rowids which need to be fetched.
If there are multiple bitmap indexes on multiple columns in a table, then based on the query
conditions, bitmap operators (AND/OR) are applied on the bit vectors to get the resultant record rows. A syntax sketch follows below.

In a bitmap index, even if there are 10 million records with 1000 columns, the index on a
column will have 10 million rows but only a few columns, equal to the number of distinct
values of the indexed column + 1.
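
Oracle-style bitmap index creation on low-cardinality columns (bitmap indexes are a vendor feature, not standard SQL; names are illustrative):

-- One bitmap per distinct value; the optimizer can AND/OR bitmaps
-- from several such indexes before fetching rowids.
CREATE BITMAP INDEX idx_cust_gender ON customer (gender);
CREATE BITMAP INDEX idx_cust_state  ON customer (state);

-- A query combining both conditions can be answered by a bitmap AND:
SELECT COUNT(*) FROM customer
WHERE  gender = 'F' AND state = 'KA';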

M6 :OLAP &Multi-Dimension Databases:

OLAP Operations:

Roll up
The roll-up operation (also called drill-up or aggregation) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or by removing one or
more dimensions, i.e. dimension reduction.
Roll Down
The roll-down operation (also called drill down) is the reverse of roll up. It navigates from
less detailed data to more detailed data. It can be realized by either stepping down a
concept hierarchy for a dimension or introducing additional dimensions.
Slicing
Slice performs a selection on one dimension of the given cube, thus resulting in a
subcube.
Dicing
The dice operation defines a subcube by performing a selection on two or more
dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND
(temperature = cool OR temperature = hot) to the original cube yields a smaller, still
two-dimensional, subcube.

 (Data cubes) more organised -> faster DML -> faster retrieval -> faster search operations -> fast
analysis
 When is it appropriate: when a fact is a function of all the dimensions; or else empty cells will
exist.
 Recommendation: EDW -> RDBMS; data marts -> MDDB
 MOLAP -> OLAP systems using MDDBs (Multi-dimensional OLAP)
 ROLAP -> uses an RDBMS (Relational OLAP)
M9: Support for DW in RDBMS:
SQL new operators for DW (sketches of these operators follow below):

 Rollup operator: used for roll-up operations. In a hierarchy, when data shown at the
daily level is converted to the weekly or monthly level, it is called roll-up.
 Cube: operator which creates aggregates over all combinations of the listed dimensions.
 Window queries: help analyse date windows around a particular date. Ex: 5 days
before and after 10th August.
 Top N queries: Ex: top 5 selling products.
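
Sketches of the operators above in standard SQL (table and column names are illustrative; window-frame syntax varies slightly by RDBMS):

-- ROLLUP: daily rows aggregated up the year > month hierarchy.
SELECT year_no, month_name, SUM(sale_amt) AS total_sales
FROM   sales
GROUP  BY ROLLUP (year_no, month_name);

-- CUBE: aggregates over every combination of the listed dimensions.
SELECT year_no, region, SUM(sale_amt) AS total_sales
FROM   sales
GROUP  BY CUBE (year_no, region);

-- Window query: moving sum over 5 days before and after each date.
SELECT sale_date,
       SUM(sale_amt) OVER (ORDER BY sale_date
                           RANGE BETWEEN INTERVAL '5' DAY PRECEDING
                                     AND INTERVAL '5' DAY FOLLOWING) AS window_sum
FROM   sales;

-- Top N query: top 5 selling products.
SELECT product_id, SUM(sale_amt) AS total_sales
FROM   sales
GROUP  BY product_id
ORDER  BY total_sales DESC
FETCH FIRST 5 ROWS ONLY;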
