KCA012: Data Warehousing & Data Mining
UNIT-1
Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from
different operations in the company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in
the following way: "A data warehouse is a subject-oriented, integrated, time-
variant and non-volatile collection of data in support of management's decision
making process".
A data warehouse is constructed by integrating data from multiple
heterogeneous sources.
A data warehouse is a database, which is kept separate from the
organization's operational database.
It possesses consolidated historical data, which helps the organization to
analyze its business.
A data warehouse helps executives to organize, understand, and use their
data to make strategic decisions.
Data warehouse systems help in integrating a diversity of application systems.
A data warehouse system helps in consolidated historical data analysis.
A data warehouse is an information system that contains historical and cumulative
data from single or multiple sources. It simplifies the reporting and analysis
process of the organization and provides a single version of the truth for
decision making and forecasting.
Characteristics of Data warehouse
Subject-Oriented
Integrated
Time-variant
Non-volatile
Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing
operations. These subjects can be product, customers, suppliers, sales,
revenue, etc. A data warehouse does not focus on the ongoing operations,
rather it focuses on modeling and analysis of data for decision making.
Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information
from the historical point of view.
Non-volatile − Non-volatile means that previous data is not erased when new
data is added. A data warehouse is kept separate from the operational
database, and therefore frequent changes in the operational database are not
reflected in the data warehouse.
DATA WAREHOUSE COMPONENTS
The data warehouse is based on an RDBMS server, which is a central information
repository surrounded by some key components that make the entire environment
functional, manageable, and accessible.
There are mainly five components of Data Warehouse:
Data Warehouse Database:
The central database is the foundation of the data warehousing environment. This
database is implemented on RDBMS technology. However, this kind of
implementation is constrained by the fact that a traditional RDBMS is
optimized for transactional database processing and not for data warehousing. For
instance, ad-hoc queries, multi-table joins, and aggregates are resource intensive
and slow down performance.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
The data sourcing, transformation, and migration tools are used for performing all
the conversions, summarizations, and all the changes needed to transform data into
a unified format in the data warehouse. They are also called Extract, Transform
and Load (ETL) Tools.
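As a rough illustration of what an ETL tool does, here is a minimal Python sketch
using only the standard csv and sqlite3 modules. The file name, column names, and
cleaning rules (daily_sales.csv, customer, item, amount) are invented for the
example and are not taken from the text.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw transactional records from a flat-file source.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: clean and unify the data (trim text, standardize case,
        # convert amounts to one numeric format, drop records that cannot be fixed).
        cleaned = []
        for row in rows:
            try:
                cleaned.append({
                    "customer": row["customer"].strip().title(),
                    "item": row["item"].strip().lower(),
                    "amount": round(float(row["amount"]), 2),
                })
            except (KeyError, ValueError):
                continue  # skip malformed records
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: write the unified records into a warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, item TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:customer, :item, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("daily_sales.csv")))

A commercial ETL tool performs the same three steps at a much larger scale, adding
scheduling, logging, and support for many more source and target types.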
Metadata
The name metadata suggests some high-level technological concept, but it is quite
simple: metadata is data about data, and it defines the data warehouse. It is
used for building, maintaining, and managing the data warehouse. Metadata can be
classified into the following categories:
1. Technical metadata: This kind of metadata contains information about the
warehouse that is used by data warehouse designers and administrators.
2. Business metadata: This kind of metadata contains detail that gives end-users
an easy way to understand the information stored in the data warehouse.
Query Tools
One of the primary objectives of data warehousing is to provide information to
businesses for making strategic decisions. Query tools allow users to interact
with the data warehouse system.
These tools fall into four different categories:
1. Query and reporting tools
2. Application Development tools
3. Data mining tools
4. OLAP tools
Data Marts
A data mart is an access layer which is used to get data out to the users. It is
presented as an option to a large data warehouse, as it takes less time and
money to build. However, there is no standard definition of a data mart; it
differs from person to person.
BUILDING A DATA WAREHOUSE
In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area
2. Transforming the transactional data
3. Loading the transformed data into a dimensional database
4. Building pre-calculated summary values to speed up report generation
5. Building (or purchasing) a front-end reporting tool
Extracting Transactional Data:
A large part of building a data warehouse is pulling data from various data
sources and placing it in a central storage area.
Transforming Transactional Data:
An equally important and challenging step after extracting is transforming and
relating the data extracted from multiple sources.
Creating a Dimensional Model:
The third step in building a data warehouse is coming up with a dimensional
model. Most modern transactional systems are built using the relational model.
The relational database is highly normalized; when designing such a system, the
goal is to eliminate data redundancy. A dimensional model, by contrast, is
deliberately denormalized so that analytical queries are simpler and faster.
Loading the Data:
After you've built a dimensional model, it's time to populate it with the data in
the staging database. This step only sounds trivial; it might involve combining
several columns together or splitting one field into several columns.
Generating Pre calculated Summary Values:
The next step is generating the pre-calculated summary values, which are
commonly referred to as aggregations. This step has been tremendously simplified
by tools such as SQL Server Analysis Services (formerly OLAP Services).
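Conceptually, an aggregation is just a pre-computed GROUP BY result stored for
later reuse. The sketch below uses an invented sales(item, quarter, amount) fact
table with made-up numbers; a tool such as Analysis Services manages many such
summaries automatically.

    import sqlite3

    # Illustrative fact table; column names and values are made up for the sketch.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (item TEXT, quarter TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("tv", "Q1", 825.0), ("tv", "Q2", 769.0), ("phone", "Q1", 605.0),
    ])

    # The aggregation is a pre-computed GROUP BY stored as its own table, so reports
    # can read this small summary instead of re-scanning the whole fact table.
    con.execute("""
        CREATE TABLE sales_by_item_quarter AS
        SELECT item, quarter, SUM(amount) AS total_amount, COUNT(*) AS row_count
        FROM sales
        GROUP BY item, quarter
    """)
    print(con.execute("SELECT * FROM sales_by_item_quarter").fetchall())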
Building (or Purchasing) a Front-End Reporting Tool
After you've built the dimensional database and the aggregations you can decide
how sophisticated your reporting tools need to be. If you just need the drill-down
capabilities, and your users have Microsoft Office 2000 on their desktops, the
Pivot Table Service of Microsoft Excel 2000 will do the job.
MAPPING THE DATA WAREHOUSE TO A MULTIPROCESSOR
ARCHITECTURE
The functions of a data warehouse are based on relational database technology,
which can be implemented in a parallel manner. There are two advantages of having
a parallel relational database technology for a data warehouse:
Linear speed-up: the ability to reduce response time proportionally for the same
workload as the number of processors is increased.
Linear scale-up: the ability to provide the same performance on the same
requests as the database size (and the hardware) increases.
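A small worked example with hypothetical timings makes the two terms concrete:

    # Hypothetical elapsed times in seconds.
    t_1cpu_small_job = 400.0   # 1 processor, original workload
    t_4cpu_small_job = 100.0   # 4 processors, same workload
    t_4cpu_big_job = 400.0     # 4 processors, workload four times larger

    speedup = t_1cpu_small_job / t_4cpu_small_job   # 4.0 with 4 CPUs -> linear speed-up
    scaleup = t_1cpu_small_job / t_4cpu_big_job     # 1.0 -> linear scale-up
    print(speedup, scaleup)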
Types of parallelism:
Inter-query parallelism: different server threads or processes handle multiple
requests at the same time.
Intra-query parallelism: this form of parallelism decomposes a serial SQL query
into lower-level operations such as scan, join, and sort. These lower-level
operations are then executed concurrently, in parallel.
Intra-query parallelism can be done in either of two ways:
Horizontal parallelism: the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different sets of data.
Vertical parallelism: this occurs among different tasks. All query components
such as scan, join, and sort are executed in parallel in a pipelined fashion. In
other words, the output from one task becomes the input to another task.
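The pipelined (vertical) idea can be sketched in a few lines of Python using one
thread per operator and a queue between them; the table, rows, and operators here
are toy stand-ins for illustration, not a real query engine.

    import threading
    import queue

    rows = [("tv", 120), ("radio", 30), ("tv", 200), ("phone", 90)]  # toy data
    pipe = queue.Queue(maxsize=2)   # bounded buffer between the two tasks
    DONE = object()                 # sentinel marking the end of the stream

    def scan_task():
        # First operator: scan the (toy) table and push rows downstream.
        for row in rows:
            pipe.put(row)
        pipe.put(DONE)

    def aggregate_task(result):
        # Second operator: consume rows as they arrive and sum amounts per item,
        # so it starts working before the scan has finished.
        while True:
            row = pipe.get()
            if row is DONE:
                break
            item, amount = row
            result[item] = result.get(item, 0) + amount

    totals = {}
    producer = threading.Thread(target=scan_task)
    consumer = threading.Thread(target=aggregate_task, args=(totals,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    print(totals)   # {'tv': 320, 'radio': 30, 'phone': 90}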
Types of DBMS Parallelism
1. Data Partitioning: Data partitioning is the key component for effective
parallel execution of database operations. Data partitioning can be done in two
ways:
Random partitioning: includes random data striping across multiple disks on a
single server. Another option for random partitioning is round-robin
partitioning, in which each record is placed on the next disk assigned to the
database.
Intelligent partitioning: assumes that the DBMS knows where a specific record is
located and does not waste time searching for it across all disks.
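A minimal sketch of the two approaches, with plain Python lists standing in for
disks; the record layout and the choice of four disks are assumptions made only
for illustration.

    NUM_DISKS = 4
    disks = [[] for _ in range(NUM_DISKS)]

    def round_robin_partition(records):
        # Random/round-robin partitioning: records are spread evenly, but a later
        # lookup has to search every disk.
        for i, record in enumerate(records):
            disks[i % NUM_DISKS].append(record)

    def hash_partition(records, key):
        # Intelligent (hash) partitioning: the DBMS can recompute hash(key) to go
        # straight to the disk holding a record instead of searching all disks.
        # (A real DBMS uses a stable hash; Python's hash() is only for the sketch.)
        for record in records:
            disks[hash(record[key]) % NUM_DISKS].append(record)

    records = [{"id": i, "customer": "c" + str(i % 3)} for i in range(10)]
    hash_partition(records, key="customer")   # co-locates all rows of a customer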
2. Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
Shared-memory or shared-everything architecture
Shared-disk architecture
Shared-nothing architecture
2.1 Shared Memory Architecture
Tightly coupled shared-memory systems, illustrated in the following figure, have
the following characteristics:
Multiple PUs share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
This architecture is simple to implement and provides a single system image; a
typical example is an RDBMS implemented on an SMP (symmetric multiprocessor)
machine.
2.2 Shared Disk Architecture
Shared-disk systems are typically loosely coupled. Such systems, illustrated in
the following figure, have the following characteristics:
Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of
the system.
A Distributed Lock Manager (DLM) is required.
2.3 Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems
only one CPU is connected to a given disk. If a table or database is located on that
disk, access depends entirely on the PU which owns it.
Shared nothing systems are concerned with access to disks, not access to memory.
Adding more PUs and disks can improve scale up.
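The shared-nothing principle can be mimicked with worker processes: each worker
owns one partition, aggregates it locally, and only the small partial results are
merged. The partitions and values below are toy data; a real MPP system does this
across separate nodes and disks.

    from multiprocessing import Pool
    from collections import Counter

    partitions = [  # each partition would live on its own node/disk
        [("tv", 120), ("radio", 30)],
        [("tv", 200), ("phone", 90)],
    ]

    def local_aggregate(partition):
        # Each worker touches only the partition it "owns".
        totals = Counter()
        for item, amount in partition:
            totals[item] += amount
        return totals

    if __name__ == "__main__":
        with Pool(processes=len(partitions)) as pool:
            partials = pool.map(local_aggregate, partitions)
        merged = sum(partials, Counter())  # combine the small partial results
        print(dict(merged))                # {'tv': 320, 'radio': 30, 'phone': 90}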
Draw the 3-tier data warehouse architecture. Explain ETL process.
Generally, a data warehouse adopts a three-tier architecture. Following are the
three tiers of the data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use back-end tools and utilities
to feed data into the bottom tier. These back-end tools and utilities perform the
Extract, Clean, Load, and Refresh functions.
Middle Tier − In the middle tier, we have the OLAP server, which can be
implemented in either of the following ways:
By Relational OLAP (ROLAP), which is an extended relational database management
system. ROLAP maps the operations on multidimensional data to standard relational
operations.
By Multidimensional OLAP (MOLAP), which directly implements multidimensional data
and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools
and reporting tools, analysis tools and data mining tools.
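To make the ROLAP mapping mentioned in the middle tier concrete: a
multidimensional request such as "dollars sold by item and quarter" or "slice on
one city" becomes an ordinary GROUP BY or WHERE clause. The sketch below uses
sqlite3 with an invented sales table and made-up figures.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (item TEXT, quarter TEXT, city TEXT, dollars REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("phone", "Q1", "Vancouver", 605.0), ("phone", "Q1", "Toronto", 512.0),
        ("tv",    "Q2", "Vancouver", 769.0), ("tv",    "Q2", "Toronto", 640.0),
    ])

    # Roll-up "dollars sold by item and quarter" becomes an ordinary GROUP BY.
    print(con.execute(
        "SELECT item, quarter, SUM(dollars) FROM sales GROUP BY item, quarter").fetchall())

    # Slicing the cube on city = 'Vancouver' becomes an ordinary WHERE clause.
    print(con.execute(
        "SELECT item, quarter, dollars FROM sales WHERE city = 'Vancouver'").fetchall())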
Difference between Database System and Data Warehouse:
Database System:
A database system is used in the traditional way of storing and retrieving data.
The major task of a database system is to perform query processing. These systems
are generally referred to as online transaction processing (OLTP) systems and are
used for the day-to-day operations of an organization.
Data Warehouse:
A data warehouse is the place where a huge amount of data is stored. It is meant
for users or knowledge workers in the role of data analysis and decision making.
These systems are supposed to organize and present data in different formats and
forms in order to serve the needs of specific users for specific purposes. These
systems are referred to as online analytical processing (OLAP) systems.
Database System | Data Warehouse
It supports operational processes. | It supports analysis and performance reporting.
Capture and maintain the data. | Explore the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated on scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.
Flat relational. | Multidimensional.
MULTIDIMENSIONAL DATA MODEL
The multidimensional data model stores data in the form of a data cube. Data is
most often pictured as a two- or three-dimensional cube, although a cube may have
any number of dimensions.
A data cube allows data to be viewed in multiple dimensions. Dimensions are
entities with respect to which an organization wants to keep records. For
example, in a store's sales records, dimensions allow the store to keep track of
things like monthly sales of items and the branches and locations at which they
were sold. A multidimensional database helps to provide data-related answers to
complex business queries quickly and accurately. Data warehouses and Online
Analytical Processing (OLAP) tools are based on a multidimensional data model.
OLAP in data warehousing enables users to view data from different angles and
dimensions.
The multidimensional data model is a method of ordering data in the database so
that the contents are well arranged and easy to assemble.
Unlike relational databases, which give users access to data in the form of
record-oriented queries, the multidimensional data model allows users to pose
analytical questions associated with market or business trends and to receive
answers to their requests comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional
databases to show multiple dimensions of the data to users.
Working on a Multidimensional Data Model
The following stages should be followed by every project for building a Multi-
Dimensional Data Model:
Stage 1: Assembling data from the client: In the first stage, the correct data is
collected from the client. Software professionals usually explain to the client
the range of data that can be obtained with the selected technology and then
collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the
data is recognized and classified into the sections it belongs to, which makes it
problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage is the basis on
which the design of the system rests. In this stage, the main factors are
recognized according to the user's point of view. These factors are also known as
"dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the
fourth stage, the factors recognized in the previous step are used to identify
their related qualities. These qualities are also known as "attributes" in the
database.
Stage 5: Finding the actuality of factors which are listed previously and their
qualities: In the fifth stage, the facts are separated and differentiated from
the factors collected earlier. These facts play a significant role in the
arrangement of a multidimensional data model.
Stage 6: Building the Schema to place the data, with respect to the information
collected from the steps above: In the sixth stage, on the basis of the data which
was collected previously, a Schema is built.
For Example:
1. Let us take the example of a firm. The revenue cost of a firm can be recognized
on the basis of different factors such as geographical location of firm’s workplace,
products of the firm, advertisements done, time utilized to flourish a product, etc.
Let us take the example of the data of a factory which sells products per quarter in
Bangalore. The data is represented in the table given below:
In the above presentation, the factory's sales for Bangalore are shown with
respect to the time dimension, which is organized into quarters, and the item
dimension, which is sorted according to the kind of item sold. The facts are
represented in rupees (in thousands).
Now, if we desire to view the data of the sales in a three-dimensional table, then
it is represented in the diagram given below.
Let us consider the data according to item, time and location (like Kolkata, Delhi,
and Mumbai). Here is the table:
This data can be represented in the form of three dimensions conceptually, which
is shown in the image below:
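The same idea can be sketched with pandas: grouping the facts by location yields
one 2-D item-by-quarter table per city, and together these slices form the 3-D
cube. The item names and numbers below are made up for illustration only.

    import pandas as pd

    # Illustrative sales facts in thousands of rupees; the values are made up.
    facts = pd.DataFrame([
        ("Kolkata", "Q1", "keyboard", 150), ("Kolkata", "Q1", "mouse", 100),
        ("Kolkata", "Q2", "keyboard", 170), ("Kolkata", "Q2", "mouse", 120),
        ("Delhi",   "Q1", "keyboard", 200), ("Delhi",   "Q1", "mouse", 140),
        ("Delhi",   "Q2", "keyboard", 210), ("Delhi",   "Q2", "mouse", 160),
    ], columns=["location", "quarter", "item", "sales"])

    # The 3-D cube (item x quarter x location) viewed as a series of 2-D tables:
    # each location gives one item-by-quarter slice.
    for location, slice_2d in facts.groupby("location"):
        print(location)
        print(slice_2d.pivot_table(index="item", columns="quarter",
                                   values="sales", aggfunc="sum"))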
Advantages of Multi-Dimensional Data Model
The following are the advantages of a multi-dimensional data model:
A multi-dimensional data model is easy to handle.
It is easy to maintain.
Its performance is better than that of normal databases (e.g. relational
databases).
The representation of data is better than in traditional databases, because
multidimensional databases are multi-viewed and carry different types of factors.
It is workable for complex systems and applications, unlike simple
one-dimensional database systems.
Disadvantages of Multi-Dimensional Data Model
The following are the disadvantages of a Multi-Dimensional Data Model:
The multidimensional data model is slightly complicated in nature, and it
requires professionals to recognize and examine the data in the database.
If the system fails while a multidimensional data model is being worked on, there
is a great effect on the working of the system.
Because it is complicated in nature, such databases generally have to be dynamic
in design.
Data Cube
A data cube enables data to be modeled and viewed in several dimensions. It is
represented by dimensions and facts. In other terms, dimensions are the views or
entities related to which an organization is required to keep records.
Data is grouped or combined into multidimensional matrices called data cubes. The
data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
For example, a relation with the schema sales(part, supplier, customer,
sale-price) can be materialized into a set of eight views, as shown in the
figure, where psc indicates a view consisting of aggregate function values (such
as total sales) computed by grouping the three attributes part, supplier, and
customer; p indicates a view composed of the corresponding aggregate function
values calculated by grouping on part alone; and so on.
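The eight views correspond to the 2^3 subsets of {part, supplier, customer}, so
they can be enumerated mechanically. The sketch below uses a few invented sales
tuples purely to show the grouping logic.

    from itertools import combinations
    from collections import defaultdict

    # Toy facts: (part, supplier, customer, sale_price); the values are illustrative.
    sales = [
        ("cpu", "acme", "alice", 300.0),
        ("cpu", "acme", "bob", 310.0),
        ("disk", "zenith", "alice", 120.0),
    ]
    dims = ("part", "supplier", "customer")

    # Every subset of the three dimensions defines one of the eight views (cuboids).
    for r in range(len(dims) + 1):
        for group_by in combinations(dims, r):
            totals = defaultdict(float)
            for part, supplier, customer, price in sales:
                row = {"part": part, "supplier": supplier, "customer": customer}
                key = tuple(row[d] for d in group_by)
                totals[key] += price
            name = "".join(d[0] for d in group_by) or "apex (no grouping)"
            print(name, dict(totals))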
A data cube is created from a subset of attributes in the database.
The model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A
multidimensional data model is organized around a central theme, such as sales or
transactions.
Example: In the 2-D representation, we will look at the All Electronics sales
data for items sold per quarter in the city of Vancouver. The measure displayed
is dollars sold (in thousands).
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For
example, suppose we would like to view the data according to time and item, as
well as location, for the cities Chicago, New York, Toronto, and Vancouver. The
measure displayed is dollars sold (in thousands). These 3-D data are shown in the
table, represented as a series of 2-D tables.
Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time,
item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure displayed is dollars
sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is
known as the apex cuboid. In this example, this is the total sales, or dollars sold,
summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids
making up a 4-D data cube for the dimensions time, item, location, and supplier.
Each cuboid represents a different degree of summarization.
Question: Explain star, snowflakes and fact constellation schema.
OR
SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL
A schema is a logical description of the entire database. It includes the name
and description of records of all record types, including all associated data
items and aggregates. Much like a database, a data warehouse also requires a
schema to be maintained. A database uses the relational model, while a data
warehouse uses the Star, Snowflake, or Fact Constellation schema.
Star Schema
Each dimension in a star schema is represented with only one-dimension
table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the
four dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of four
dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Star Schema
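A minimal version of this star schema can be written as ordinary SQL (issued here
through sqlite3); the exact column lists are abbreviated assumptions for
illustration, not the full attribute sets of any particular figure.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- One denormalized table per dimension.
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Central fact table holding the keys of all four dimensions plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold INTEGER
    );
    """)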
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike in the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely the item and supplier
tables.
Now the item dimension table contains the attributes item_key, item_name,
type, brand, and supplier-key.
The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
Snowflake Schema
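The snowflaked item dimension described above can likewise be sketched in SQL:
the supplier attributes move to their own table, and the item table keeps only a
supplier_key pointing at it. Column lists are again abbreviated for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Supplier details are normalized out of the item dimension.
    CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY,
        item_name TEXT,
        type TEXT,
        brand TEXT,
        supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
    );
    """)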
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as galaxy
schema.
The following diagram shows two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has the five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
The shipping fact table also contains two measures, namely dollars sold and
units sold.
It is also possible to share dimension tables between fact tables. For
example, time, item, and location dimension tables are shared between the
sales and shipping fact table.
Fact Constellation Schema
Give the difference between star and fact constellation multidimensional data
models.
A star schema has a single fact table linked to its dimension tables, whereas a
fact constellation (galaxy) schema has multiple fact tables that share dimension
tables among them.
Data Warehouse Applications
Data warehouses are widely used in the following fields −
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Data Mart
A data mart is focused on a single functional area of an organization and contains a
subset of data stored in a Data Warehouse.
A data mart is a condensed version of a data warehouse and is designed for use by
a specific department, unit, or set of users in an organization, e.g., Marketing,
Sales, HR, or Finance. It is often controlled by a single department in an
organization.
Types of Data Mart
There are three main types of data mart:
Dependent: A dependent data mart is created by drawing data from an existing
central data warehouse. It is a logical or physical subset of a larger data
warehouse, and it extracts the records it needs from the warehouse. In this
technique, because the data warehouse creates the data mart, there is no need for
data mart integration. It is also known as a top-down approach.
Independent: An independent data mart is created without the use of a central
data warehouse, drawing data directly from operational sources, external sources,
or both. In this approach, because all the data marts are designed independently,
integration of the data marts is required. It is also termed a bottom-up
approach, as the data marts are integrated to develop a data warehouse.
Hybrid: This type of data mart can take data from data warehouses or operational
systems. It allows us to combine input from sources other than a data warehouse.