DATA MINING AND WAREHOUSING UNIT –II
Unit II: Data Warehouse Architecture:
Data warehouse architecture - properties of data warehouse architectures - types of
data warehouse architectures - three-tier data warehouse architecture - ETL
(extract, transform, and load) process - selecting an ETL tool - difference between
ETL and ELT - types of data warehouses - data warehouse modelling - data modelling
life cycle - types of data warehouse models - data warehouse design - data
warehouse implementation - implementation guidelines - metadata - necessity of
metadata in data warehouses - types of metadata - metadata repository - benefits
of metadata repository.
2.1 Data Warehouse Architecture
Data warehouse architecture defines the overall structure of data communication,
processing, and presentation that exists for end-client computing within the
enterprise. Each data warehouse is different, but all are characterized by
standard vital components.
Production applications such as payroll, accounts payable, product purchasing,
and inventory control are designed for online transaction processing (OLTP).
Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support users' ad-hoc data
requirements, an activity recently dubbed online analytical processing
(OLAP). These include applications such as forecasting, profiling, summary
reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational
systems periodically, usually during off-hours. As OLTP data accumulates in
production databases, it is regularly extracted, filtered, and then loaded
into a dedicated warehouse server that is accessible to users. As the
warehouse is populated, it must be restructured: tables are de-normalized,
data is cleansed of errors and redundancies, and new fields and keys are
added to reflect the needs of users for sorting, combining, and summarizing
data.
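As a small illustration of this periodic extract-and-restructure step, the sketch below uses Python's built-in sqlite3 module with hypothetical orders and customers tables; the table and column names are assumptions made purely for illustration.

    import sqlite3

    # Hypothetical OLTP database with normalized tables (names are invented).
    oltp = sqlite3.connect(":memory:")
    oltp.executescript("""
        CREATE TABLE customers(cust_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE orders(order_id INTEGER PRIMARY KEY, cust_id INTEGER,
                            order_date TEXT, amount REAL);
        INSERT INTO customers VALUES (1, 'Asha', 'South'), (2, 'Ravi', 'North');
        INSERT INTO orders VALUES (10, 1, '2024-01-15', 120.0),
                                  (11, 2, '2024-01-15', 75.5);
    """)

    # Separate warehouse database holding one de-normalized (pre-joined) table.
    dw = sqlite3.connect(":memory:")
    dw.execute("""CREATE TABLE sales_fact(order_id INTEGER, order_date TEXT,
                                          customer_name TEXT, region TEXT, amount REAL)""")

    # Off-hours batch job: extract the day's joined rows from OLTP and load them.
    rows = oltp.execute("""
        SELECT o.order_id, o.order_date, c.name, c.region, o.amount
        FROM orders o JOIN customers c ON c.cust_id = o.cust_id
        WHERE o.order_date = '2024-01-15'
    """).fetchall()
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", rows)
    dw.commit()
    print(dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0], "rows loaded")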
Data warehouses and their architectures vary depending upon the specifics of
an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
In data warehousing, an operational system is the term for a system that
processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored,
and every file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in the data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding
and working with particular instances of data easier. For example, author,
date built, date changed, and file size are examples of very basic document
metadata.
Metadata is used to direct a query to the most appropriate data source.
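Both uses just listed can be sketched minimally in Python; the field names, source names, and routing rule below are invented for illustration only.

    # Basic document metadata: author, date built, date changed, file size.
    doc_metadata = {
        "author": "J. Smith",        # hypothetical values
        "date_built": "2024-01-10",
        "date_changed": "2024-02-02",
        "file_size_kb": 412,
    }

    # Metadata that maps subject areas to data sources, used to route a query.
    source_catalog = {"sales": "sales_mart_db", "payroll": "hr_mart_db"}

    def route_query(subject):
        """Return the most appropriate data source for a query on this subject."""
        return source_catalog.get(subject, "enterprise_warehouse_db")

    print(doc_metadata["author"], route_query("sales"), route_query("finance"))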
Lightly and highly summarized data
The area of the data warehouse saves all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
The goals of the summarized information are to speed up query
performance. The summarized record is updated continuously as new
information is loaded into the warehouse.
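A minimal sketch of how lightly and highly summarized data can be pre-computed from detail records (pure Python, with invented sample data):

    from collections import defaultdict

    # Detail records at the lowest level of granularity: (date, product, amount).
    detail = [
        ("2024-01-15", "pen", 10.0), ("2024-01-15", "book", 55.0),
        ("2024-01-16", "pen", 12.0), ("2024-02-01", "book", 60.0),
    ]

    # Lightly summarized: totals per day and product.
    daily = defaultdict(float)
    for date, product, amount in detail:
        daily[(date, product)] += amount

    # Highly summarized: totals per month, derived from the lighter summary.
    monthly = defaultdict(float)
    for (date, product), amount in daily.items():
        monthly[date[:7]] += amount

    print(dict(daily))
    print(dict(monthly))   # {'2024-01': 77.0, '2024-02': 60.0}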
End-User access Tools
The principal purpose of a data warehouse is to provide information to
business managers for strategic decision-making. These users interact with
the warehouse using end-user access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational information before putting it into the
warehouse. This is done in a staging area (a place where data is processed
before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational
data coming from multiple source systems, especially for enterprise data
warehouses where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location to which data from the
source systems is copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups
within our organization. We can do this by adding data marts. A data mart is
a segment of a data warehouse that provides information for reporting and
analysis on a particular section, unit, department, or operation in the
company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
2.2 Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse
system:
1. Separation: Analytical and transactional processing should be kept apart
as much as possible.
2. Scalability: Hardware and software architectures should be simple to
upgrade as the data volume that has to be managed and processed, and the
number of user requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to accommodate new
operations and technologies without redesigning the whole system.
4. Security: Monitoring access is necessary because of the strategic data
stored in the data warehouse.
5. Administerability: Data Warehouse management should not be
complicated.
2.3 Types of Data Warehouse Architectures
2.3.1 Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to
minimize the amount of data stored; to reach this goal, it removes data
redundancies.
The figure shows that the only layer physically available is the source
layer. In this method, data warehouses are virtual. This means that the data
warehouse is implemented as a multidimensional view of operational data
created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional processing.
Analysis queries are submitted to operational data after the middleware
interprets them. In this way, queries affect transactional workloads.
2.3.2 Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight the
separation between physically available sources and the data warehouse, it in
fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data.
   The data are initially stored in corporate relational databases or legacy
   databases, or they may come from information systems outside the
   corporate walls.
2. Data Staging: The data stored in the sources should be extracted,
   cleansed to remove inconsistencies and fill gaps, and integrated to merge
   heterogeneous sources into one standard schema. The so-called Extraction,
   Transformation, and Loading (ETL) tools can combine heterogeneous
   schemata, extract, transform, cleanse, validate, filter, and load source
   data into a data warehouse.
3. Data Warehouse layer: Information is saved to one logically centralized
   repository: the data warehouse. The data warehouse can be accessed
   directly, but it can also be used as a source for creating data marts,
   which partially replicate data warehouse contents and are designed for
   specific enterprise departments. Metadata repositories store information
   on sources, access procedures, data staging, users, data mart schemas,
   and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly
   accessed to issue reports, dynamically analyze information, and simulate
   hypothetical business scenarios. This layer should feature aggregate
   information navigators, complex query optimizers, and user-friendly GUIs.
2.3.3 Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is
almost always an RDBMS. It may include several specialized data marts and
a metadata repository.
Data from operational databases and external sources (such as user profile
data provided by external consultants) are extracted using application
program interfaces called gateways. A gateway is provided by the underlying
DBMS and allows client programs to generate SQL code to be executed at a
server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java
Database Connectivity).
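As a sketch of how a client program might use such a gateway, the snippet below uses the third-party pyodbc package to send SQL through an ODBC data source; the DSN name, credentials, and table are assumptions and must match an ODBC source actually configured on the machine.

    import pyodbc  # third-party ODBC gateway library (pip install pyodbc)

    # 'WarehouseDSN' is a hypothetical ODBC data source name for the warehouse
    # server; the user name and password are placeholders.
    conn = pyodbc.connect("DSN=WarehouseDSN;UID=report_user;PWD=secret")
    cursor = conn.cursor()

    # SQL generated by the client program is executed at the warehouse server.
    cursor.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)
    conn.close()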
A middle-tier which consists of an OLAP server for fast querying of the
data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS
that maps functions on multidimensional data to standard relational
operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server
that directly implements multidimensional data and operations.
A top-tier that contains front-end tools for displaying results provided by
OLAP, as well as additional tools for data mining of the OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:
The metadata repository stores information that defines DW objects. It
includes the following parameters and information for the middle and the
top-tier applications:
1. A description of the DW structure, including the warehouse schema,
dimension, hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of
the stored data, i.e., active, archived or purged, and warehouse
monitoring information, i.e., usage statistics, error reports, audit, etc.
3. System performance data, which includes indices used to improve data
access and retrieval performance.
4. Information about the mapping from operational databases, which describes
the source RDBMSs and their contents, cleaning and transformation rules,
etc.
5. Summarization algorithms, predefined queries and reports, and business
data, which include business terms and definitions, ownership information,
etc.
2.4 ETL (Extract, Transform, and Load) Process
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation
and Loading.
The ETL process requires active inputs from various stakeholders, including
developers, analysts, testers, and top executives, and is technically
challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs
to change as the business changes. ETL is a recurring activity (daily,
weekly, monthly) of a data warehouse system and needs to be agile, automated,
and well documented.
How ETL Works?
ETL consists of three separate phases:
Extraction
Extraction is the operation of extracting information from a source
system for further use in a data warehouse environment. This is the
first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in
ETL.
o The source systems might be complicated and poorly documented,
and thus determining which data needs to be extracted can be
difficult.
o The data has to be extracted periodically so that all changed data is
supplied to the warehouse and it is kept up-to-date, as the sketch after
this list illustrates.
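A minimal sketch of such periodic (incremental) extraction, assuming the source table carries a last_updated timestamp column (an assumption; real systems may rely on logs or triggers instead):

    import sqlite3

    src = sqlite3.connect(":memory:")
    src.executescript("""
        CREATE TABLE orders(order_id INTEGER, amount REAL, last_updated TEXT);
        INSERT INTO orders VALUES (1, 10.0, '2024-01-14 09:00'),
                                  (2, 20.0, '2024-01-15 18:30');
    """)

    watermark = "2024-01-15 00:00"   # timestamp of the previous extraction run

    # Extract only the rows that changed since the last run.
    changed = src.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (watermark,),
    ).fetchall()
    print(changed)                    # only order 2 is extracted this time

    if changed:                       # advance the watermark for the next run
        watermark = max(row[2] for row in changed)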
Cleansing
The cleansing stage is crucial in a data warehouse system because it is
supposed to improve data quality. The primary data cleansing features found
in ETL tools are rectification and homogenization. They use specific
dictionaries to rectify typing mistakes and to recognize synonyms, as well as
rule-based cleansing to enforce domain-specific rules and define appropriate
associations between values.
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date
list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff responding should quickly be able to
find the person in the enterprise database, but this requires that the
caller's name or his/her company name is listed in the database.
If a user appears in the databases with two or more slightly different names or different account
numbers, it becomes difficult to update the customer's information.
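A minimal sketch of rectification and homogenization, using a small synonym dictionary and a very simple duplicate-detection rule (names and rules are invented for illustration):

    # Dictionary used to rectify typing mistakes and recognize synonyms.
    rectify = {"bombay": "Mumbai", "mumbai": "Mumbai", "madras": "Chennai"}

    def clean_city(raw):
        return rectify.get(raw.strip().lower(), raw.strip().title())

    customers = [
        {"name": "R. Sharma", "city": "bombay"},
        {"name": "R Sharma",  "city": "Mumbai"},   # same person, slightly different name
    ]
    for c in customers:
        c["city"] = clean_city(c["city"])

    # Rule-based duplicate detection: same city and same name ignoring punctuation.
    def dedup_key(c):
        return (c["city"], c["name"].replace(".", "").replace(" ", "").lower())

    seen, unique = set(), []
    for c in customers:
        if dedup_key(c) not in seen:
            seen.add(dedup_key(c))
            unique.append(c)
    print(unique)   # one consolidated customer record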
Transformation
Transformation is the core of the reconciliation phase. It converts records
from their operational source format into a particular data warehouse format.
If we implement a three-layer architecture, this phase outputs our reconciled
data layer.
The following points must be rectified in this phase:
o Loose text may hide valuable information. For example, "XYZ PVT Ltd" does
not explicitly show that this is a private limited company.
o Different formats can be used for individual data items. For example, a
date can be saved as a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
o Conversion and normalization that operate on both storage formats
and units of measure to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
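A minimal sketch of these three transformation steps (conversion and normalization, matching, selection), applied to invented source records:

    # Source A stores dates as 'DD/MM/YYYY' and weight in pounds;
    # source B stores ISO dates and weight in kilograms (invented formats).
    source_a = [{"cust": "C1", "dob": "15/01/1990", "weight_lb": 154}]
    source_b = [{"customer_id": "C2", "birth_date": "1985-07-02", "weight_kg": 80}]

    def from_a(rec):
        d, m, y = rec["dob"].split("/")
        return {"customer_id": rec["cust"],                # matching: equivalent fields
                "birth_date": f"{y}-{m}-{d}",              # conversion: one date format
                "weight_kg": round(rec["weight_lb"] * 0.4536, 1)}  # unit normalization

    def from_b(rec):
        # selection: keep only the fields needed in the reconciled layer
        return {k: rec[k] for k in ("customer_id", "birth_date", "weight_kg")}

    reconciled = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
    print(reconciled)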
Loading
The load is the process of writing the data into the target database. During
the load step, it is necessary to ensure that the load is performed correctly
and with as few resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data warehouse data is completely rewritten. This means that the
older data is replaced. Refresh is usually used in combination with static
extraction to populate a data warehouse initially.
2. Update: Only the changes applied to the source information are added to
the data warehouse. An update is typically carried out without deleting or
modifying pre-existing data. This method is used in combination with
incremental extraction to update data warehouses regularly (see the sketch
after this list).
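A minimal sketch contrasting the two loading modes with sqlite3 (table names are invented; the upsert syntax assumes SQLite 3.24 or later):

    import sqlite3

    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE dim_product(product_id INTEGER PRIMARY KEY, name TEXT)")

    # 1. Refresh: the warehouse table is completely rewritten from a full extract.
    full_extract = [(1, "pen"), (2, "book")]
    dw.execute("DELETE FROM dim_product")
    dw.executemany("INSERT INTO dim_product VALUES (?, ?)", full_extract)

    # 2. Update: only changed rows from an incremental extract are applied,
    #    without rewriting pre-existing data (insert-or-update, i.e. upsert).
    incremental_extract = [(2, "notebook"), (3, "pencil")]
    dw.executemany(
        """INSERT INTO dim_product(product_id, name) VALUES (?, ?)
           ON CONFLICT(product_id) DO UPDATE SET name = excluded.name""",
        incremental_extract,
    )
    print(dw.execute("SELECT * FROM dim_product ORDER BY product_id").fetchall())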
2.4.1 Selecting an ETL Tool
Selecting an appropriate ETL tool is an important decision that has to be
made when choosing the components of an ODS or data warehousing application.
ETL tools are required to provide coordinated access to multiple data sources
so that relevant data may be extracted from them. An ETL tool would generally
contain tools for data cleansing, re-organization, transformation,
aggregation, calculation, and automatic loading of data into the target
database.
An ETL tool should provide a simple user interface that allows data cleansing
and data transformation rules to be specified using a point-and-click
approach. When all mappings and transformations have been defined, the
ETL tool should automatically generate the data extract/transformation/load
programs, which typically run in batch mode.
2.4.2 Difference between ETL and ELT
1. Process
   ETL: Data is transferred to the ETL server and moved back to the database;
   high network bandwidth is required.
   ELT: Data remains in the database, except for cross-database loads (e.g.,
   source to target).
2. Transformation
   ETL: Transformations are performed in the ETL server.
   ELT: Transformations are performed in the source or in the target
   database.
3. Code Usage
   ETL: Typically used for source-to-target transfers, compute-intensive
   transformations, and small amounts of data.
   ELT: Typically used for high volumes of data.
4. Time / Maintenance
   ETL: Needs high maintenance, as you need to select the data to load and
   transform.
   ELT: Low maintenance, as the data is always available.
5. Calculations
   ETL: Overwrites an existing column, or the dataset must be appended and
   pushed to the target platform.
   ELT: Easily adds the calculated column to the existing table.
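To contrast the two, the ELT-style sketch below loads the raw extract into the target database first and then performs the transformation inside the database with SQL; the tables and the currency rule are invented for illustration.

    import sqlite3

    target = sqlite3.connect(":memory:")

    # Load: raw, untransformed records go straight into a staging table in the target.
    target.execute("CREATE TABLE stg_orders(order_id INTEGER, amount TEXT, currency TEXT)")
    target.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                       [(1, "100.0", "USD"), (2, "250.0", "INR")])

    # Transform: performed inside the target database engine, not in an ETL server.
    target.executescript("""
        CREATE TABLE fact_orders AS
        SELECT order_id,
               CAST(amount AS REAL) *
               CASE currency WHEN 'INR' THEN 0.012 ELSE 1.0 END AS amount_usd
        FROM stg_orders;
    """)
    print(target.execute("SELECT * FROM fact_orders").fetchall())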
2.5 Types of Data Warehouses
There are different types of data warehouses, which are as follows:
Host-Based Data Warehouses
There are two types of host-based data warehouses which can be
implemented:
o Host-based mainframe warehouses, which reside on high-volume databases,
supported by robust and reliable high-capacity structures such as IBM
System/390, UNISYS, and Data General Sequent systems, and databases such as
Sybase, Oracle, Informix, and DB2.
o Host-based LAN data warehouses, where data delivery can be handled either
centrally or from the workgroup environment. The size of the data warehouse
database depends on the platform.
Data extraction and transformation tools allow the automated extraction and
cleaning of data from production systems. It is not advisable to enable
direct access by query tools to these categories of systems, for the
following reasons:
1. A huge load of complex warehousing queries would possibly have too
much of a harmful impact upon the mission-critical transaction
processing (TP)-oriented application.
2. These TP systems have optimized their database design for transaction
throughput. In general, a database is designed either for optimal query
processing or for optimal transaction processing. A complex business query
requires the joining of many normalized tables, and as a result performance
will usually be poor and the query constructs largely complex.
3. There is no assurance that data in two or more production systems will be
consistent.
Host-Based (MVS) Data Warehouses
Data warehouses that reside on large-volume databases on MVS are host-based
data warehouses. Often the DBMS is DB2, with a huge variety of original
sources of legacy information, including VSAM, DB2, flat files, and the
Information Management System (IMS).
Before embarking on designing, building, and implementing such a warehouse,
some further consideration must be given, because:
1. Such databases generally have very high volumes of data storage.
2. Such warehouses may require support for both MVS and customer-
based report and query facilities.
3. These warehouses have complicated source systems.
4. Such systems need continuous maintenance, since they must also be used for
mission-critical objectives.
To make building such data warehouses successful, the following phases are
generally followed:
1. Unload Phase: It involves selecting and scrubbing the operational data.
2. Transform Phase: For translating the data into an appropriate form and
describing the rules for accessing and storing it.
3. Load Phase: For moving the data directly into DB2 tables or into a
particular file for moving it into another database or non-MVS warehouse.
An integrated metadata repository is central to any data warehouse
environment. Such a facility is required for documenting data sources, data
translation rules, and user access to the warehouse. It provides a dynamic
link between the multiple source databases and the DB2 database of the data
warehouse.
A metadata repository is necessary to design, build, and maintain data
warehouse processes. It should be capable of describing what data exists in
both the operational system and the data warehouse, where the data is
located, the mapping of the operational data to the warehouse fields, and
end-user access techniques. Query, reporting, and maintenance facilities,
such as an MVS-based query and reporting tool for DB2, are another
indispensable part of such a data warehouse.
Host-Based (UNIX) Data Warehouses
Oracle and Informix RDBMSs support the facilities for such data warehouses.
Both of these databases can extract information from MVS-based databases as
well as from a large number of other UNIX-based databases. These warehouses
follow the same stages as the host-based MVS data warehouses. Data from
different network servers can also be used, since file attribute consistency
is frequent across the inter-network.
LAN-Based Workgroup Data Warehouses
A LAN-based workgroup warehouse is an integrated structure for building and
maintaining a data warehouse in a LAN environment. In this warehouse, we can
extract information from a variety of sources and support multiple LAN-based
warehouses. Generally chosen warehouse databases include the DB2 family,
Oracle, Sybase, and Informix. Other databases that can also be used, though
infrequently, are IMS, VSAM, flat files, MVS, and VM.
Designed for the workgroup environment, a LAN-based workgroup warehouse is
optimal for any business organization that wants to build a data warehouse,
often called a data mart. This type of data warehouse generally requires a
minimal initial investment and little technical training.
Data Delivery: With a LAN-based workgroup warehouse, a customer needs minimal
technical knowledge to create and maintain a store of data that is customized
for use at the department, business unit, or workgroup level. A LAN-based
workgroup warehouse ensures the delivery of information from corporate
resources by providing transparent access to the data in the warehouse.
Host-Based Single Stage (LAN) Data Warehouses
Within a LAN-based data warehouse, data delivery can be handled either
centrally or from the workgroup environment, so that business groups can
process their data needs without burdening centralized IT resources, enjoying
the autonomy of their data mart without compromising overall data integrity
and security in the enterprise.
Limitations
Both DBMS and hardware scalability issues generally limit LAN-based
warehousing solutions.
Many LAN-based enterprises have not implemented adequate job scheduling,
recovery management, organized maintenance, and performance monitoring
methods to provide robust warehousing solutions.
Often these warehouses are dependent on other platforms for source records.
Building an environment that has data integrity, recoverability, and security
requires careful design, planning, and implementation. Otherwise,
synchronization of transformations and loads from sources to the server could
cause innumerable problems.
A LAN-based warehouse provides data from many sources while requiring a
minimal initial investment and technical knowledge. A LAN-based warehouse can
also use replication tools for populating and updating the data warehouse.
This type of warehouse can include business views, histories, aggregation,
versioning, and heterogeneous source support, such as
o DB2 Family
o IMS, VSAM, Flat File [MVS and VM]
A single store frequently drives a LAN-based warehouse and provides existing
DSS applications, enabling the business user to locate data in the data
warehouse. The LAN-based warehouse can support business users with complete
data-to-information solutions. The LAN-based warehouse can also share
metadata, with the ability to catalog business data and make it accessible to
anyone who needs it.
Multi-Stage Data Warehouses
Multi-stage refers to staging the data multiple times before it is loaded
into the data warehouse: data is extracted from the source systems to a
staging area first, then loaded into the data warehouse after transformation,
and finally into departmental data marts.
This configuration is well suited to environments where end-clients in
numerous capacities require access both to summarized data for up-to-the-minute
tactical decisions and to summarized, cumulative records for long-term
strategic decisions. Both the Operational Data Store (ODS) and the data
warehouse may reside on host-based or LAN-based databases, depending on
volume and usage requirements. These include DB2, Oracle, Informix, IMS, flat
files, and Sybase.
Usually, the ODS stores only the most up-to-date records, while the data
warehouse stores their historical evolution. At first, the information in
both databases will be very similar; for example, the records for a new
client will look the same. As changes to the client record occur, the ODS
will be refreshed to reflect only the most current data, whereas the data
warehouse will contain both the historical data and the new information. Thus
the volume requirement of the data warehouse will exceed the volume
requirement of the ODS over time; it is not unusual to reach a ratio of 4 to
1 in practice.
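A minimal sketch of this difference, with invented structures: the ODS keeps only the current record per client, while the warehouse appends every historical version.

    ods = {}            # Operational Data Store: one current row per customer
    dw_history = []     # Data warehouse: full history of changes per customer

    def apply_change(cust_id, address, change_date):
        ods[cust_id] = {"address": address}            # overwrite the current value
        dw_history.append({"cust_id": cust_id,         # keep every version
                           "address": address,
                           "valid_from": change_date})

    apply_change(1, "12 Park Street", "2023-05-01")
    apply_change(1, "7 Lake Road", "2024-02-10")       # the customer moved

    print(ods[1])            # only the newest address
    print(len(dw_history))   # 2 rows: the warehouse volume grows over time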
Stationary Data Warehouses
In this type of data warehouse, the data is not moved from the sources, as
shown in the figure:
Instead, the customer is given direct access to the data. For many
organizations, infrequent access, volume issues, or corporate necessities
dictate such an approach. This scheme does generate several problems for the
customer, such as:
o Identifying the location of the information for the users
o Providing clients the ability to query different DBMSs as if they were all
a single DBMS with a single API.
o Impacting performance since the customer will be competing with the
production data stores.
Such a warehouse will need highly specialized and sophisticated 'middleware',
possibly with a single interface to the client. A facility to display the
extracted records to the user before report generation may also be essential.
An integrated metadata repository becomes absolutely essential in this
environment.
Distributed Data Warehouses
The concept of a distributed data warehouse suggests that there are two
types: local enterprise warehouses, which are distributed throughout the
enterprise, and a global warehouse, as shown in the figure:
Characteristics of Local data warehouses
o Activity appears at the local level
o Bulk of the operational processing
o Local site is autonomous
o Each local data warehouse has its unique architecture and contents of
data
o The data is unique and of prime importance to that locality only
o The majority of the data is local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Local warehouses serve different technical communities
o The scope of a local data warehouse is limited to the local site
o Local warehouses also include historical data and are integrated only
within the local site.
Virtual Data Warehouses
A virtual data warehouse is created in the following stages:
1. Installing a set of data access, data dictionary, and process management
facilities.
2. Training the end-clients.
3. Monitoring how the data warehouse facilities will be used.
4. Based upon actual usage, physically creating a data warehouse to provide
the high-frequency results.
This strategy means that end-users are allowed to access operational
databases directly, using whatever tools are available on the data access
network. This method provides ultimate flexibility as well as the minimum
amount of redundant information that must be loaded and maintained. The data
warehouse is a great idea, but it is difficult to build and requires
investment; why not use a cheap and fast approach that eliminates the
transformation phase and the repositories for metadata and another database?
This approach is termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of data:
1. A data dictionary including the definitions of the various databases.
2. A description of the relationship between the data components.
3. A description of how the user will interface with the system.
4. The algorithms and business rules that describe what to do and how to
do it.
Disadvantages
1. Since queries compete with production record transactions, performance can
be degraded.
2. There is no metadata, no summary data, and no individual DSS (Decision
Support System) integration or history. All queries must be repeated, causing
an additional burden on the system.
3. There is no refreshing process, which causes the queries to be very
complex.
2.6 Data Warehouse Modeling
Data warehouse modeling is the process of designing the schemas of the
detailed and summarized information of the data warehouse. The goal of
data warehouse modeling is to develop a schema describing the reality, or at
least a part of it, which the data warehouse is required to support.
Data warehouse modeling is an essential stage of building a data warehouse
for two main reasons. Firstly, through the schema, data warehouse clients
can visualize the relationships among the warehouse data, to use them with
greater ease. Secondly, a well-designed schema allows an effective data
warehouse structure to emerge, to help decrease the cost of implementing
the warehouse and improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in
operational database systems. The primary function of data warehouses is to
support DSS processes. Thus, the objective of data warehouse modeling is to
make the data warehouse efficiently support complex queries on long term
information.
In contrast, data modeling in operational database systems targets
efficiently supporting simple transactions in the database such as retrieving,
inserting, deleting, and changing data. Moreover, data warehouses are
designed for the customer with general information knowledge about the
enterprise, whereas operational database systems are more oriented toward
use by software specialists for creating distinct applications.
Data Warehouse model is illustrated in the given diagram.
The data within the specific warehouse itself has a particular architecture
with the emphasis on various levels of summarization, as shown in figure:
The current detail data is central in importance, as it:
o Reflects the most current happenings, which are commonly the most
interesting.
o Is voluminous, as it is saved at the lowest level of granularity.
o Is almost always saved on disk storage, which is fast to access but
expensive and difficult to manage.
Older detail data is stored in some form of mass storage; it is infrequently
accessed and kept at a level of detail consistent with current detail data.
Lightly summarized data is data extracted from the low level of detail found
at the current detail level and is usually stored on disk storage. When
building the data warehouse, we have to remember over what unit of time the
summarization is done, and also which components or attributes the summarized
data will contain.
Highly summarized data is compact and directly available and can even
be found outside the warehouse.
Metadata is the final component of the data warehouse. It is of a different
dimension, in that it is not data drawn from the operational environment;
rather, it is used as:
o A directory to help the DSS analyst locate the contents of the data
warehouse.
o A guide to the mapping of data as it is transformed from the operational
environment to the data warehouse environment.
o A guide to the algorithms used for summarization between the current detail
data, the lightly summarized data, and the highly summarized data, etc.
2.6.1 Data Modeling Life Cycle
In this section, we define a data modeling life cycle. It is a
straightforward process of transforming the business requirements to fulfill
the goals for storing, maintaining, and accessing the data within IT systems.
The result is a logical and physical data model for an enterprise data
warehouse.
The objective of the data modeling life cycle is primarily the creation of a
storage area for business information. That area comes from the logical and
physical data modeling stages, as shown in Figure:
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between
the different entities.
Characteristics of the conceptual data model
o It contains the essential entities and the relationships among them.
o No attribute is specified.
o No primary key is specified.
We can see that the only information shown via the conceptual data model is
the entities that define the data and the relationships between those
entities. No other detail is shown through the conceptual data model.
Logical Data Model
A logical data model describes the information in as much detail as possible,
without regard to how it will be physically implemented in the database. The
primary objective of logical data modeling is to document the business data
structures, processes, rules, and relationships in a single view - the
logical data model.
Features of a logical data model
o It involves all entities and relationships among them.
o All attributes for each entity are specified.
o The primary key for each entity is stated.
o Referential Integrity is specified (FK Relation).
The steps for designing the logical data model are as follows:
o Specify primary keys for all entities.
o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are listed
Physical Data Model
A physical data model describes how the model will be implemented in the
database. A physical database model shows all table structures, column names,
data types, constraints, primary keys, foreign keys, and relationships
between tables. The purpose of physical data modeling is the mapping of the
logical data model to the physical structures of the RDBMS hosting the data
warehouse. This includes defining physical RDBMS structures, such as tables
and the data types to use when storing the information. It may also include
the definition of new data structures for enhancing query performance.
Characteristics of a physical data model
o Specification of all tables and columns.
o Foreign keys are used to recognize relationships between tables.
The steps for physical data model design are as follows (a small sketch
follows this list):
o Convert entities to tables.
o Convert relationships to foreign keys.
o Convert attributes to columns.
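A minimal sketch of these steps, turning a small logical model (a Customer entity related to an Order entity) into physical tables with sqlite3; the names and data types are illustrative assumptions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Entity -> table, attributes -> columns, primary key stated.
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            region      TEXT
        );

        -- Relationship -> foreign key (referential integrity from the logical model).
        CREATE TABLE customer_order (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            order_date  TEXT,
            amount      REAL
        );
    """)
    print([r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")])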
2.6.2 Types of Data Warehouse Models
Enterprise Warehouse
An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It supports corporate-wide data
integration, usually from one or more operational systems or external data
providers, and it is cross-functional in scope. It generally contains
detailed data as well as summarized data and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes,
UNIX superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a
specific collection of users. The scope is confined to particular selected
subjects. For example, a marketing data mart may restrict its subjects to the
customer, items, and sales. The data contained in the data marts tend to be
summarized.
Data Marts is divided into two parts:
Independent Data Mart: An independent data mart is sourced from data captured
from one or more operational systems or external data providers, or from data
generated locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from
enterprise data warehouses.
Virtual Warehouses
A virtual data warehouse is a set of views over the operational database. For
efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
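A minimal sketch of a virtual warehouse: a summary view defined directly over an operational table, which could later be materialized if actual usage justifies it (the table and view names are invented):

    import sqlite3

    opdb = sqlite3.connect(":memory:")
    opdb.executescript("""
        CREATE TABLE orders(order_id INTEGER, region TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'South', 100.0), (2, 'South', 50.0),
                                  (3, 'North', 75.0);

        -- Virtual warehouse: a view over operational data; no data is copied.
        CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
        FROM orders GROUP BY region;
    """)
    print(opdb.execute("SELECT * FROM v_sales_by_region").fetchall())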
2.7 Data Warehouse Design
A data warehouse is a single data repository where data from multiple sources
is integrated for online business analytical processing (OLAP).
This implies a data warehouse needs to meet the requirements from all the
business stages within the entire organization. Thus, data warehouse design
is a hugely complex, lengthy, and hence error-prone process. Furthermore,
business analytical functions change over time, which results in changes in
the requirements for the systems. Therefore, data warehouse and OLAP
systems are dynamic, and the design process is continuous.
Data warehouse design takes an approach different from view materialization
in industry. It sees data warehouses as database systems with particular
needs, such as answering management-related queries. The target of the design
becomes how the data from multiple data sources should be extracted,
transformed, and loaded (ETL) to be organized in a database as the data
warehouse.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach
Top-down Design Approach
In the "Top-Down" design approach, a data warehouse is described as a
subject-oriented, time-variant, non-volatile and integrated data repository for
the entire enterprise data from different sources are validated, reformatted
and saved in a normalized (up to 3NF) database as the data warehouse. The
data warehouse stores "atomic" information, the data at the lowest level of
granularity, from where dimensional data marts can be built by selecting the
data required for specific business subjects or particular departments. An
approach is a data-driven approach as the information is gathered and
integrated first and then business requirements by subjects for building data
marts are formulated. The advantage of this method is which it supports a
single integrated data source. Thus data marts built from it will have
consistency when they overlap.
Advantages of top-down design
Data Marts are loaded from the data warehouses.
Developing new data mart from the data warehouse is very easy.
Disadvantages of top-down design
This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.
Bottom-Up Design Approach
In the "Bottom-Up" approach, a data warehouse is described as "a copy of
transaction data specifical architecture for query and analysis," term the star
schema. In this approach, a data mart is created first to necessary reporting
and analytical capabilities for particular business processes (or subjects).
Thus it is needed to be a business-driven approach in contrast to Inmon's
data-driven approach.
Data marts contain the lowest-grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the data delivery requirements of
data warehouses. Using this method, to use the set of data marts as the
enterprise data warehouse, the data marts should be built with conformed
dimensions in mind, meaning that common objects are represented the same way
in different data marts. The conformed dimensions connect the data marts to
form a data warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI,
as developing a data mart, a data warehouse for a single subject, takes far
less time and effort than developing an enterprise-wide data warehouse.
Also, the risk of failure is even less. This method is inherently incremental.
This method allows the project team to learn and grow.
Advantages of bottom-up design
Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units.
It simply involves developing new data marts and then integrating them with
other data marts.
Disadvantages of bottom-up design
The locations of the data warehouse and the data marts are reversed in the
bottom-up design approach.
Differentiate between Top-Down Design Approach and
Bottom-Up Design Approach
1. Top-Down: Breaks the vast problem into smaller subproblems.
   Bottom-Up: Solves the essential low-level problems and integrates them
   into a higher one.
2. Top-Down: Inherently architected; not a union of several data marts.
   Bottom-Up: Inherently incremental; the essential data marts can be
   scheduled first.
3. Top-Down: Single, central storage of information about the content.
   Bottom-Up: Information is stored departmentally.
4. Top-Down: Centralized rules and control.
   Bottom-Up: Departmental rules and control.
5. Top-Down: Includes redundant information.
   Bottom-Up: Redundancy can be removed.
6. Top-Down: May see quick results if implemented with iterations.
   Bottom-Up: Less risk of failure, favorable return on investment, and proof
   of techniques.
2.8 Data Warehouse Implementation
There are various steps in implementing a data warehouse, which are as
follows:
1. Requirements analysis and capacity planning: The first process in data
warehousing involves defining enterprise needs, defining architectures,
carrying out capacity planning, and selecting the hardware and software
tools. This step involves consulting senior management as well as the
different stakeholders.
2. Hardware integration: Once the hardware and software have been selected,
they need to be put together by integrating the servers, the storage devices,
and the user software tools.
3. Modeling: Modeling is a significant stage that involves designing the
warehouse schema and views. This may involve using a modeling tool if the
data warehouse is sophisticated.
4. Physical modeling: For the data warehouse to perform efficiently, physical
modeling is needed. This involves designing the physical data warehouse
organization, data placement, data partitioning, deciding on access
techniques, and indexing.
5. Sources: The information for the data warehouse is likely to come from
several data sources. This step involves identifying and connecting the
sources using gateways, ODBC drivers, or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL
phase. The process of designing and implementing the ETL phase may involve
identifying suitable ETL tool vendors and purchasing and implementing the
tools. It may also involve customizing the tool to suit the needs of the
enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon,
testing the tools will be needed, perhaps using a staging area. Once
everything is working adequately, the ETL tools may be used in populating the
warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be helpful, there must be
end-user applications. This step involves designing and implementing the
applications required by the end-users.
9. Roll-out the warehouse and applications: Once the data warehouse has been
populated and the end-client applications tested, the warehouse system and
the applications may be rolled out for the user community to use.
2.8.1 Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally.
Generally, it is recommended that a data mart be created with one particular
project in mind; once it is implemented, several other sections of the
enterprise may also want to implement similar systems. An enterprise data
warehouse can then be implemented in an iterative manner, allowing all data
marts to extract information from the data warehouse.
2. Need a champion: A data warehouse project must have a champion who is
willing to carry out considerable research into the expected costs and
benefits of the project. Data warehousing projects require inputs from many
units in an enterprise and therefore need to be driven by someone who is
capable of interacting with people in the enterprise and can actively
persuade colleagues.
3. Senior management support: A data warehouse project must be fully
supported by senior management. Given the resource-intensive nature of such
projects and the time they can take to implement, a warehouse project calls
for a sustained commitment from senior management.
4. Ensure quality: Only data that has been cleansed and is of a quality
accepted by the organization should be loaded into the data warehouse.
5. Corporate strategy: A data warehouse project must fit with the corporate
strategy and business objectives. The purpose of the project must be defined
before the project begins.
6. Business plan: The financial costs (hardware, software, and peopleware),
expected benefits, and a project plan for a data warehouse project must be
clearly outlined and understood by all stakeholders. Without such
understanding, rumors about expenditure and benefits can become the only
source of information, undermining the project.
7. Training: A data warehouse project must not overlook training
requirements. For a data warehouse project to be successful, the users must
be trained to use the warehouse and to understand its capabilities.
8. Adaptability: The project should build in flexibility so that changes may
be made to the data warehouse if and when required. Like any system, a data
warehouse will need to change as the needs of the enterprise change.
9. Joint management: The project must be handled by both IT and business
professionals in the enterprise. To ensure proper communication with the
stakeholders and that the project targets assisting the enterprise's
business, business professionals must be involved in the project along with
technical professionals.
2.9 What is Metadata?
Metadata is data about data, or documentation about the information which is
required by the users. In data warehousing, metadata is one of the essential
aspects.
Metadata includes the following:
1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of data-warehouse and
end-users views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.
5. Integration and transformation rules used to deliver information to
end-user analytical tools.
6. Subscription information for information delivery to analysis
subscribers.
7. Metrics used to analyze warehouses usage and performance.
8. Security authorizations, access control list, etc.
Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content and find the data
they need.
Several examples of metadata are:
1. A library catalog may be considered metadata. The catalog metadata
consists of several predefined components representing specific attributes
of a resource, and each component can have one or more values. These
components could be the name of the author, the name of the document, the
publisher's name, the publication date, and the categories to which it
belongs.
2. The table of contents and the index in a book may be treated as metadata
for the book.
3. Suppose we say that a data item about a person is 80. This must be
defined by noting that it is the person's weight and that the unit is
kilograms. Therefore, (weight, kilograms) is the metadata about the data
value 80.
4. Another example of metadata is data about the tables and figures in a
report like this book. A table has a name (e.g., the table title), and the
column names of the table may also be treated as metadata. The figures also
have titles or names.
2.9.1 Why is metadata necessary in a data warehouse?
o First, it acts as the glue that links all parts of the data warehouses.
o Next, it provides information about the contents and structures to the
developers.
o Finally, it opens the doors to the end-users and makes the contents
recognizable in their terms.
Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and one process uses parts of the metadata generated by another. In
the data warehouse, metadata assumes a key position and enables communication
among various processes. It acts as a nerve center in the data warehouse.
Figure shows the location of metadata within the data warehouse.
2.9.2 Types of Metadata
Metadata in a data warehouse falls into three major categories:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational
systems of the enterprise. These source systems include different data
structures. The data elements selected for the data warehouse have various
field lengths and data types.
In selecting information from the source systems for the data warehouse, we
split records, combine parts of records from different source files, and deal
with multiple coding schemes and field lengths. When we deliver information
to the end-users, we must be able to tie it back to the source data sets.
Operational metadata contains all of this information about the operational
data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the removal of
data from the source systems, namely, the extraction frequencies, extraction
methods, and business rules for the data extraction. Also, this category of
metadata contains information about all the data transformation that takes
place in the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It
enables the end-users to find data from the data warehouses. The end-user
metadata allows the end-users to use their business terminology and look for
the information in those ways in which they usually think of the business.
Metadata Interchange Initiative
The metadata interchange initiative was launched to bring industry vendors
and users together to address a variety of difficult problems and issues
concerning the exchange, sharing, and management of metadata. The goal of the
metadata interchange standard is to define an extensible mechanism that will
allow vendors to exchange standard metadata as well as carry along
"proprietary" metadata. The founding members agreed on the following initial
goals:
1. Creating a vendor-independent, industry-defined, and maintained
standard access mechanisms and application programming interfaces
(API) for metadata.
2. Enabling users to control and manage the access and manipulation of
metadata in their unique environment through the use of interchange
standards-compliant tools.
3. Enabling users to build tools that meet their needs and allowing them to
adjust those tools' configurations accordingly.
4. Allowing individual tools to satisfy their metadata requirements freely
and efficiently within the context of an interchange model.
5. Describing a simple, clean implementation infrastructure which will
facilitate compliance and speed up adoption by minimizing the amount of
modification required.
6. Creating a procedure and process not only for establishing and maintaining
the interchange standard specification but also for updating and extending it
over time.
Metadata Interchange Standard Framework
The interchange standard metadata model assumes that the metadata itself may
be stored in a storage format of any type: ASCII files, relational tables,
fixed or customized formats, etc.
The framework translates an access request into the standard interchange
format.
Several approaches have been proposed by the metadata interchange coalition:
o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach
In the procedural approach, the communication with the API is built into the
tool. It enables the highest degree of flexibility.
The ASCII batch approach relies instead on an ASCII file format that contains
the information about the various metadata items and the standardized access
requirements that make up the interchange standard metadata model.
The hybrid approach follows a data-driven model.
Components of Metadata Interchange Standard
Frameworks
1) Standard Metadata Model: It refers to the ASCII file format, which is
used to represent metadata that is being exchanged.
2) The standard access framework that describes the minimum number
of API functions.
3) Tool profile, which is provided by each tool vendor.
4) The user configuration is a file explaining the legal interchange paths for
metadata in the user's environment.
2.9.3 Metadata Repository
The metadata itself is housed in and controlled by the metadata repository.
Metadata repository management software can be used to map the source data to
the target database, to integrate and transform the data, to generate code
for data transformation, and to move data to the warehouse.
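A minimal sketch of that last idea: source-to-target mapping metadata held in the repository is used to generate the transformation code (the mapping entries are invented for illustration).

    # Mapping metadata from the repository: target column -> source expression.
    mapping = {
        "target_table": "dw.customer_dim",
        "source_table": "crm.customers",
        "columns": {
            "customer_key": "cust_id",
            "full_name":    "TRIM(first_name) || ' ' || TRIM(last_name)",
            "country_code": "UPPER(country)",
        },
    }

    def generate_load_sql(m):
        """Generate an INSERT ... SELECT statement from the mapping metadata."""
        cols = ", ".join(m["columns"])
        exprs = ",\n       ".join(m["columns"].values())
        return (f"INSERT INTO {m['target_table']} ({cols})\n"
                f"SELECT {exprs}\n"
                f"FROM {m['source_table']};")

    print(generate_load_sql(mapping))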
2.9.4 Benefits of Metadata Repository
1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and
underutilization.
3. It improves organization control, simplifies management, and
accounting of information assets.
4. It increases coordination, understanding, identification, and utilization
of information assets.
5. It enforces CASE development standards with the ability to share and
reuse metadata.
6. It leverages investment in legacy systems and utilizes existing
applications.
7. It provides a relational model for heterogeneous RDBMS to share
information.
8. It provides a useful data administration tool to manage corporate
information assets with the data dictionary.
9. It increases reliability, control, and flexibility of the application
development process.