Data Warehousing-II
(Schema, OLAP operations)
Dr. Pradeep Kumar Mallick
Schema Design
• Schema refers to the structure or organization of a
database.
• It contains a logical description of the entire database,
which includes names and descriptions of tables, records,
views, and indexes.
• While the relational model is used to describe a general database,
data warehouse schemas are more specialized because their
structure is optimized for reporting and analysis.
• An operational database typically uses the relational model, while a data
warehouse uses a Star, Snowflake, or Fact Constellation schema.
• Schema Types
– Star Schema
– Snowflake schema
– Fact Constellation Schema
Dimension Tables
• Every dimension has a set of attributes, which are grouped
together in a dimension table.
• They are essentially a collection of information that can
be referenced to answer meaningful business questions
when used together with fact tables.
• Hold descriptive information about a particular business
perspective
• Define business in terms already familiar to users
– Wide rows with lots of descriptive text
– Small tables (about a million rows)
– Joined to fact table by a foreign key
– heavily indexed
– typical dimensions
• time periods, geographic region (markets, cities), products,
customers, salesperson, etc.
Fact Table
• Central table: Multiple dimension tables are linked to one fact table,
which contains ‘keys’ and ‘measures’. By ‘keys’, we’re referring to the
foreign keys of every associated dimension.
• Keys are used to perform joins with dimension tables to run queries.
‘Measures’ refer to numeric data like price and quantity, which
represents business events or transactions, used to add detail to
dimension data, so that effective reports can be generated.
• Key value is a composite key made up of the primary keys of the
dimensions
• Joined to dimension tables through foreign keys that reference primary
keys in the dimension tables
– Typical example: individual sales records
– mostly raw numeric items
– narrow rows, a few columns at most
– large number of rows (millions to a billion)
– Access via dimensions
Star Schema
• In the star schema, the center of the star holds one fact table, which is surrounded
by a number of associated dimension tables. It is known as a star schema because its
structure resembles a star.
• Each dimension in a star schema is represented by only one dimension table.
• Each dimension table contains a set of attributes.
• The following diagram shows the sales data of a company with respect to the
four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four
dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.
• Each dimension table is joined to the fact table using a foreign key.
Note − Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data
redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian
province of British Columbia, so the entries for these cities repeat the same values for
the attributes province_or_state and country.
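The star layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustrative miniature: the column names follow the attributes named in the text, while the rows, the street values, and the reduction to two dimensions are assumptions made for the example.

```python
import sqlite3

# Hypothetical miniature of the sales star schema (invented rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT
);
CREATE TABLE sales (                       -- fact table: keys plus measures
    item_key     INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO location VALUES
    (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
    (2, '9 Bay Rd',  'Victoria',  'British Columbia', 'Canada');
INSERT INTO item VALUES (10, 'Mobile'), (11, 'Modem');
INSERT INTO sales VALUES
    (10, 1, 500.0, 5), (11, 1, 120.0, 3), (10, 2, 300.0, 3);
""")

# One join per dimension: the fact table's foreign key meets the
# dimension table's primary key.
dollars_by_city = conn.execute("""
    SELECT l.city, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(dollars_by_city)
```

Both location rows repeat 'British Columbia' and 'Canada', which is the redundancy the note refers to.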
Snowflake Schema
• In the snowflake schema, dimensions are stored in multiple dimension tables instead of
a single table per dimension.
• A snowflake schema is an extension of a star schema in which some dimension tables
are normalized, which splits the data into additional tables.
• For example, the item dimension table of the star schema is normalized and split into
two dimension tables, namely the item and supplier tables.
• The item dimension table now contains the attributes item_key, item_name, type, brand,
and supplier_key.
• The supplier_key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
• Advantages:
– The main benefit of the snowflake schema is that it uses less disk space.
– It is easier to implement when a dimension is added to the schema.
• Disadvantages:
– Query performance is reduced because of the multiple tables that must be joined.
– The primary challenge of the snowflake schema is the additional maintenance effort
required by the larger number of lookup tables.
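The item/supplier split described above can be sketched as follows, again with sqlite3. The attribute names come from the text; the rows and the reduced sales fact table are invented for illustration. The point to notice is the second join needed to reach supplier_type.

```python
import sqlite3

# Hypothetical snowflaked item dimension (invented rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_type TEXT
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
CREATE TABLE sales (
    item_key   INTEGER REFERENCES item(item_key),
    units_sold INTEGER
);
INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item VALUES
    (10, 'Mobile', 'phone',   'Acme', 1),
    (11, 'Modem',  'network', 'Acme', 1),
    (12, 'Router', 'network', 'Bolt', 2);
INSERT INTO sales VALUES (10, 5), (11, 3), (12, 4), (10, 2);
""")

# Reaching supplier_type now takes two joins: fact -> item -> supplier.
units_by_supplier_type = conn.execute("""
    SELECT sp.supplier_type, SUM(s.units_sold)
    FROM sales AS s
    JOIN item     AS i  ON s.item_key = i.item_key
    JOIN supplier AS sp ON i.supplier_key = sp.supplier_key
    GROUP BY sp.supplier_type
    ORDER BY sp.supplier_type
""").fetchall()
print(units_by_supplier_type)
```

In the star version, supplier_type would sit directly in the item table and the second join would disappear, at the cost of repeating the supplier type on every item row.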
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, identified by the keys item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars cost and units shipped.
• It is also possible to share dimension tables between fact tables. For example, the time,
item, and location dimension tables are shared between the sales and shipping fact tables.
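A minimal sketch of two fact tables sharing one dimension, using sqlite3. Only the item dimension is modeled; the measure column names (units_sold, units_shipped) and all rows are assumptions made for this example.

```python
import sqlite3

# Hypothetical fact constellation: sales and shipping share the item dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales (
    item_key   INTEGER REFERENCES item(item_key),
    units_sold INTEGER
);
CREATE TABLE shipping (
    item_key      INTEGER REFERENCES item(item_key),
    shipper_key   INTEGER,
    units_shipped INTEGER
);
INSERT INTO item VALUES (10, 'Mobile'), (11, 'Modem');
INSERT INTO sales VALUES (10, 5), (10, 2), (11, 3);
INSERT INTO shipping VALUES (10, 1, 4), (10, 2, 3), (11, 1, 3);
""")

# Each fact table is aggregated on its own before being joined through the
# shared dimension; joining the two fact tables row by row would multiply rows.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.sold, sh.shipped
    FROM item AS i
    JOIN (SELECT item_key, SUM(units_sold) AS sold
          FROM sales GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS shipped
          FROM shipping GROUP BY item_key) AS sh ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
print(sold_vs_shipped)
```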
Star Vs Snowflake Schema: Key Differences
Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using star join query optimization. Tables may be connected with multiple dimensions. | Represented by a centralized fact table that is unlikely to be connected with multiple dimensions.
Multi Dimensional Data Model
• The multidimensional data model is a method for organizing the data in a database
so that its contents are well arranged and assembled.
• OLAP (online analytical processing) and data warehousing use multidimensional
databases. The model is used to show multiple dimensions of the data to users.
• It represents data in the form of data cubes.
• Data cubes allow data to be modeled and viewed from many dimensions and perspectives.
• A data cube is defined by dimensions and facts and is represented by a fact table.
• Facts are numerical measures. The fact table contains the names of the facts (the
measures) as well as keys to each of the related dimension tables.
(Multidimensional Data Representation)
Multidimensional Model...
• Here the sales data is represented as a two-dimensional table. Let us
consider the data according to item, time, and location (such as Kolkata, Delhi,
and Mumbai). Here is the table:
• Conceptually, this data can be represented in three dimensions,
as shown in the image below:
Features of multidimensional data models
• Measures: Measures are numerical data that can be analyzed and compared,
such as sales or revenue. They are typically stored in fact tables in a
multidimensional data model.
• Dimensions: Dimensions are attributes that describe the measures, such as
time, location, or product. They are typically stored in dimension tables in a
multidimensional data model.
• Cubes: Cubes are structures that represent the multidimensional relationships
between measures and dimensions in a data model. They provide a fast and
efficient way to retrieve and analyze data.
• Aggregation: Aggregation is the process of summarizing data across
dimensions and levels of detail. This is a key feature of multidimensional data
models, as it enables users to quickly analyze data at different levels of
granularity.
• Drill-down and roll-up: Drill-down is the process of moving from a higher-level
summary of data to a lower level of detail, while roll-up is the opposite process
of moving from a lower-level detail to a higher-level summary. These features
enable users to explore data in greater detail and gain insights into the
underlying patterns.
• Hierarchies: Hierarchies are a way of organizing dimensions into levels of
detail. For example, a time dimension might be organized into years, quarters,
months, and days. Hierarchies provide a way to navigate the data and perform
drill-down and roll-up operations.
Advantages & Disadvantages
• The following are the advantages of a multidimensional data model:
– A multidimensional data model is easy to handle.
– It is easy to maintain.
– Its performance is better than that of normal databases (e.g. relational
databases).
– It represents data better than traditional databases, because
multidimensional databases offer multiple views and carry
different types of factors.
– It is workable on complex systems and applications, unlike simple
one-dimensional database systems.
– Its compatibility benefits projects that have limited bandwidth for
maintenance staff.
• The following are the disadvantages of a multidimensional data model:
– The multidimensional data model is somewhat complicated in nature and
requires professionals to recognize and examine the data in the database.
– Because of this complexity, the databases are generally
dynamic in design.
– Because multidimensional systems are complex and comprise a large number
of databases, the system is highly vulnerable when a security breach
occurs.
Multidimensional Model...
• Hierarchies consist of levels called dimensional attributes
Multidimensional Model...
• Sales volume as a function of product, month, and region
Data cubes
• Data grouped or combined into multidimensional matrices are called data cubes.
• The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
• A data cube represents the data along some measures of interest.
• It can be 2-dimensional, 3-dimensional, or higher-dimensional.
• It is mainly used for the retrieval of data.
• It consists of categories of data called dimensions and measures.
• Measures and dimensions together represent facts, such as cost (a measure) recorded
by time and location (dimensions).
Measures and Dimensions
3-Dimensional Data
Cuboids Corresponding to the Cube
• 0-D (apex) cuboid: all
• 1-D cuboids: product; date; country
• 2-D cuboids: product, date; product, country; date, country
• 3-D (base) cuboid: product, date, country
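The lattice of cuboids can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below uses the three dimension names from the slide; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ("product", "date", "country")

# A k-D cuboid groups by k of the dimensions; the 0-D (apex) cuboid groups by
# none of them ("all"), and the 3-D (base) cuboid groups by every dimension.
cuboids = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, groupings in cuboids.items():
    print(f"{k}-D cuboids:", groupings)

# n dimensions without hierarchies give 2**n cuboids in total.
total_cuboids = sum(len(groupings) for groupings in cuboids.values())
print(total_cuboids)
```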
OLAP Operations
• Since OLAP servers are based on a multidimensional view of
data, we will discuss OLAP operations on multidimensional
data.
• Here is the list of OLAP operations −
1. Drill Down
2. Roll Up
3. Dice
4. Slice
5. Pivot
Roll up
✔ The roll-up operation is also known as the drill-up or aggregation operation.
✔ Roll-up performs aggregation on a data cube in either of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
✔ In the example, roll-up is performed by climbing up a concept hierarchy for the
dimension location.
✔ Initially the concept hierarchy was "street < city < province < country".
✔ On rolling up, the data is aggregated by ascending the location hierarchy from
the level of city to the level of country.
✔ The data is now grouped by country rather than by city.
✔ When roll-up is performed by dimension reduction, one or more dimensions
are removed from the data cube.
✔ For example, consider a sales data cube having two dimensions, location and
time. Roll-up may be performed by removing the time dimension, resulting
in an aggregation of the total sales by location, rather than by location and
by time.
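Both forms of roll-up can be sketched on a tiny in-memory cube. The cell values, the city-to-country mapping, and the function names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical base cells: (city, quarter) -> units sold.
cube = {
    ("Toronto", "Q1"): 10, ("Toronto", "Q2"): 12,
    ("Vancouver", "Q1"): 7, ("Vancouver", "Q2"): 9,
    ("Chicago", "Q1"): 20, ("Chicago", "Q2"): 18,
}
# One step of the location concept hierarchy (city < country).
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cells):
    """Climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter), units in cells.items():
        out[(city_to_country[city], quarter)] += units
    return dict(out)


def drop_time(cells):
    """Dimension reduction: aggregate away the time dimension."""
    out = defaultdict(int)
    for (location, _quarter), units in cells.items():
        out[location] += units
    return dict(out)


by_country = roll_up_location(cube)
by_city_only = drop_time(cube)
print(by_country)
print(by_city_only)
```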
Drill down
• The drill-down operation (also called roll-down) is the reverse
operation of roll-up.
• It is performed in either of the following ways −
– By stepping down a concept hierarchy for a dimension
– By introducing a new dimension
• It navigates from less detailed data to more detailed data.
• When drill-down is performed by introducing new dimensions, one or
more dimensions are added to the data cube.
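Drill-down needs the more detailed data to be available underneath the summary. The sketch below keeps hypothetical detail records and aggregates them at two levels of the time hierarchy (quarter, then month); the records and the helper name are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical detail records: (quarter, month, units sold).
records = [
    ("Q1", "Jan", 4), ("Q1", "Feb", 6), ("Q1", "Mar", 5),
    ("Q2", "Apr", 7), ("Q2", "May", 3), ("Q2", "Jun", 8),
]


def aggregate(rows, key):
    """Sum the units of all rows that share the same grouping key."""
    out = defaultdict(int)
    for row in rows:
        out[key(row)] += row[2]
    return dict(out)


# Summary view: one cell per quarter.
by_quarter = aggregate(records, key=lambda r: r[0])
# Drill-down: step down the time hierarchy from quarter to month.
by_month = aggregate(records, key=lambda r: (r[0], r[1]))

print(by_quarter)
print(by_month)
```

Rolling the monthly view back up reproduces the quarterly totals, which is why the two operations are described as inverses.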
Slice
• The slice operation performs a selection on one dimension of a given
cube and produces a new sub-cube.
• A slice is a subset of the cube corresponding to a single value for
one or more members of the dimension.
• For example, a slice operation is executed when the user
wants a selection on one dimension of a three-dimensional cube,
resulting in a two-dimensional slice.
• For example, if we make the selection temperature = cool, we will
obtain the following cube:
• Here the slice is performed on the dimension "time" using the
criterion time = "Q1".
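A slice with the criterion time = "Q1" can be sketched on a small cube keyed by (location, time, item). The cell values and the helper name slice_cube are invented for illustration.

```python
# Hypothetical 3-D cube: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q1", "Modem"): 825,
    ("Toronto", "Q2", "Mobile"): 680,
    ("Vancouver", "Q1", "Mobile"): 400,
    ("Vancouver", "Q2", "Modem"): 512,
}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop that axis from the keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }


# Axis 1 is the time dimension; the result is a 2-D (location, item) table.
q1_slice = slice_cube(cube, axis=1, value="Q1")
print(q1_slice)
```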
Dice
• The dice operation defines a sub-cube by performing a selection on two or
more dimensions.
• The dice operation on the cube based on the following selection criteria
involves three dimensions:
– (location = "Toronto" or "Vancouver")
– (time = "Q1" or "Q2")
– (item = "Mobile" or "Modem")
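The three selection criteria above can be applied to a small cube as follows. The cube contents and the helper name dice are assumptions for the example; only the criteria come from the slide.

```python
# Hypothetical 3-D cube: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q1", "Phone"): 14,
    ("Toronto", "Q3", "Modem"): 952,
    ("Vancouver", "Q2", "Modem"): 512,
    ("Chicago", "Q1", "Mobile"): 854,
}


def dice(cells, allowed):
    """Keep cells whose coordinate on every listed axis is in the allowed set."""
    return {
        key: measure
        for key, measure in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


# Axis 0 = location, axis 1 = time, axis 2 = item.
sub_cube = dice(cube, {
    0: {"Toronto", "Vancouver"},
    1: {"Q1", "Q2"},
    2: {"Mobile", "Modem"},
})
print(sub_cube)
```

Unlike a slice, the result keeps all three dimensions; it is simply a smaller cube.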
Pivot
• The pivot operation is also called a rotation.
• Pivot is a visualization operation that rotates the data axes in
view to provide an alternative presentation of the data.
• Performing the pivot operation on the sub-cube obtained after the
slice operation gives a new view of it.
• It may involve swapping the rows and columns or moving one of
the row dimensions into the column dimensions.
• Consider the following diagram, which shows the pivot
operation.
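For a two-dimensional view, the simplest pivot is a swap of rows and columns. The sketch below assumes a hypothetical (location, item) table such as one obtained from a slice; the values are invented.

```python
# Hypothetical 2-D view: (location, item) -> units sold, with locations as rows.
view = {
    ("Toronto", "Mobile"): 605,
    ("Toronto", "Modem"): 825,
    ("Vancouver", "Mobile"): 400,
}

# Pivot: rotate the axes so that items become rows and locations become columns.
# The measures are unchanged; only the presentation differs.
pivoted = {(item, location): units for (location, item), units in view.items()}

for (row, column), units in sorted(pivoted.items()):
    print(row, column, units)
```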
OLAP All operations
OLAP vs OLTP
OLAP | OLTP
Involves historical processing of information. | Involves day-to-day processing.
OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
Useful in analyzing the business. | Useful in running the business.
Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on Entity Relationship Model.
Contains historical data. | Contains current data.
Provides summarized and consolidated data. | Provides primitive and highly detailed data.
Highly flexible. | Provides high performance.
Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
Example-Schema:
Student( studID, name, major ) // dimension table, studID is key
Instructor( instID, dept ) // dimension table, instID is key
Class( classID, univ, region, country ) // dimension table, classID is key
Took( studID, instID, classID, score ) // fact table, foreign key references to dimension tables
1. Find all students who took a class in California from an instructor
not in the student's major department and got a score over 80. Return
the student name, university, and score.
2. Find average scores grouped by student and instructor for courses
taught in Quebec.
3. "Roll up" your result from problem 2 so it's grouping by instructor
only.
4. Find average scores grouped by student major.
5. "Drill down" on your result from problem 4 so it's grouping by
instructor's department as well as student's major.
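One possible way to approach the five problems is sketched below with sqlite3. The table and column names follow the example schema; the rows are invented, and it is assumed that the region column holds values such as 'California' and 'Quebec'. These queries are a sketch, not the only valid answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (studID INTEGER PRIMARY KEY, name TEXT, major TEXT);
CREATE TABLE Instructor (instID INTEGER PRIMARY KEY, dept TEXT);
CREATE TABLE Class (classID INTEGER PRIMARY KEY, univ TEXT, region TEXT, country TEXT);
CREATE TABLE Took (
    studID  INTEGER REFERENCES Student(studID),
    instID  INTEGER REFERENCES Instructor(instID),
    classID INTEGER REFERENCES Class(classID),
    score   INTEGER
);
-- Invented sample rows.
INSERT INTO Student VALUES (1, 'Ann', 'CS'), (2, 'Bob', 'EE');
INSERT INTO Instructor VALUES (1, 'CS'), (2, 'Math');
INSERT INTO Class VALUES
    (1, 'Stanford', 'California', 'USA'),
    (2, 'McGill', 'Quebec', 'Canada');
INSERT INTO Took VALUES
    (1, 1, 1, 90), (1, 2, 1, 85), (2, 2, 1, 70),
    (1, 1, 2, 80), (1, 1, 2, 60), (2, 1, 2, 90), (2, 2, 2, 50);
""")

# 1. California classes, instructor outside the student's major, score over 80.
q1 = conn.execute("""
    SELECT s.name, c.univ, t.score
    FROM Took AS t
    JOIN Student AS s ON t.studID = s.studID
    JOIN Instructor AS i ON t.instID = i.instID
    JOIN Class AS c ON t.classID = c.classID
    WHERE c.region = 'California' AND i.dept <> s.major AND t.score > 80
""").fetchall()

# 2. Average score by student and instructor for classes taught in Quebec.
q2 = conn.execute("""
    SELECT t.studID, t.instID, AVG(t.score)
    FROM Took AS t JOIN Class AS c ON t.classID = c.classID
    WHERE c.region = 'Quebec'
    GROUP BY t.studID, t.instID ORDER BY t.studID, t.instID
""").fetchall()

# 3. Roll up problem 2: drop studID from the grouping. The average is
#    recomputed from the fact rows, not by averaging the averages of problem 2.
q3 = conn.execute("""
    SELECT t.instID, AVG(t.score)
    FROM Took AS t JOIN Class AS c ON t.classID = c.classID
    WHERE c.region = 'Quebec'
    GROUP BY t.instID ORDER BY t.instID
""").fetchall()

# 4. Average score by student major.
q4 = conn.execute("""
    SELECT s.major, AVG(t.score)
    FROM Took AS t JOIN Student AS s ON t.studID = s.studID
    GROUP BY s.major ORDER BY s.major
""").fetchall()

# 5. Drill down problem 4: add the instructor's department to the grouping.
q5 = conn.execute("""
    SELECT s.major, i.dept, AVG(t.score)
    FROM Took AS t
    JOIN Student AS s ON t.studID = s.studID
    JOIN Instructor AS i ON t.instID = i.instID
    GROUP BY s.major, i.dept ORDER BY s.major, i.dept
""").fetchall()

for label, rows in (("1", q1), ("2", q2), ("3", q3), ("4", q4), ("5", q5)):
    print(label, rows)
```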