What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction
data from one or more sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a collection of data that serves the entire organization, not only a
particular group of users.
It is not used for daily operations and transaction processing but for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information
in support of management's decisions."
Characteristics of Data Warehouse
1. Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around
a particular subject, such as customer, product, or sales, instead of the organization's
ongoing operations as a whole. This is done by excluding data that is not useful
for the subject and including all data needed by the users to understand the
subject.
2. Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files,
and online transaction records. Data cleaning and integration must be performed
during data warehousing to ensure consistency in naming conventions, attribute types,
etc., among the different data sources.
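The integration step can be illustrated with a minimal Python sketch. The two source layouts (crm_rows and billing_rows), their field names, and the gender coding are hypothetical and exist only for illustration; the point is that differing names, date formats, and types are mapped onto one common schema.

```python
from datetime import date

# Hypothetical records from two source systems with different conventions.
crm_rows = [{"CustID": "17", "signup": "2023-01-15", "Gender": "F"}]
billing_rows = [{"customer_id": 42, "signup_dt": "15/02/2023", "sex": "female"}]

# One agreed coding for an attribute that the sources encode differently.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def from_crm(row):
    """Map a CRM record (ISO dates, string ids) onto the common schema."""
    return {
        "customer_id": int(row["CustID"]),
        "signup_date": date.fromisoformat(row["signup"]),
        "gender": GENDER_CODES[row["Gender"].lower()],
    }


def from_billing(row):
    """Map a billing record (DD/MM/YYYY dates) onto the same schema."""
    day, month, year = (int(part) for part in row["signup_dt"].split("/"))
    return {
        "customer_id": int(row["customer_id"]),
        "signup_date": date(year, month, day),
        "gender": GENDER_CODES[row["sex"].lower()],
    }


integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

After this step every record has the same attribute names and types, regardless of which source it came from.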
3. Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This
contrasts with a transaction system, where often only the most current data is kept.
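The difference between the two kinds of system can be sketched with Python's built-in sqlite3 module. The table names (oltp_price, dw_price) and the price changes are assumptions made up for this example, not part of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational table: one row per product, overwritten on change.
    CREATE TABLE oltp_price (product TEXT PRIMARY KEY, price REAL);
    -- Warehouse table: one row per product per snapshot date.
    CREATE TABLE dw_price (product TEXT, snapshot_date TEXT, price REAL);
    """
)

# Hypothetical price changes over three months.
changes = [("2024-01-01", 10.0), ("2024-02-01", 11.5), ("2024-03-01", 12.0)]
for snapshot_date, price in changes:
    # The operational system keeps only the latest value.
    conn.execute("INSERT OR REPLACE INTO oltp_price VALUES ('widget', ?)", (price,))
    # The warehouse appends a dated row, so the history is preserved.
    conn.execute("INSERT INTO dw_price VALUES ('widget', ?, ?)", (snapshot_date, price))

current = conn.execute("SELECT price FROM oltp_price").fetchall()
history = conn.execute(
    "SELECT snapshot_date, price FROM dw_price ORDER BY snapshot_date"
).fetchall()
```

Here current holds a single price, while history holds one row per snapshot date and can answer questions about earlier periods.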
4. Non-Volatile
The data warehouse is a physically separate data store, whose contents are transformed from the
source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., record-level update, insert, and delete operations are not performed. It usually
requires only two operations in data accessing: initial loading of data and access to data.
Therefore, the DW does not require transaction processing, recovery, and concurrency
capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means
that once data has entered the warehouse, it should not change.
Architecture of Data Warehouse
A data warehouse architecture defines the overall arrangement of data communication,
processing, and presentation that exists for end-user computing within
the enterprise. Each data warehouse is different, but all are characterized by a set of standard
components.
External Data Sources:
1. An operational system is a term used in data warehousing for a system that
processes the day-to-day transactions of an organization.
2. A flat file system is a system of files in which transactional data is stored, and every
file in the system must have a different name.
Staging Area
Data extracted from the external sources does not follow a particular format,
so it must be validated before it is loaded into the data warehouse. For this purpose, an
ETL tool is recommended.
E (Extract): Data is extracted from the external data sources.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after it has been transformed into the
standard format.
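The three ETL steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not how any specific ETL tool works: the flat-file contents (RAW_CSV), the orders table, and the validation rules are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; in practice this would come from a source system.
RAW_CSV = """order_id,amount,country
1, 19.90 ,us
2,abc,DE
3,5.00,de
"""


def extract(text):
    """Extract: read raw rows from the external source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate rows and convert them to the standard format."""
    clean, rejected = [], []
    for row in rows:
        try:
            record = (
                int(row["order_id"]),
                float(row["amount"]),
                row["country"].strip().upper(),
            )
        except ValueError:
            rejected.append(row)  # invalid rows are set aside, not loaded
        else:
            clean.append(record)
    return clean, rejected


def load(conn, rows):
    """Load: write the standardized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(RAW_CSV))
load(conn, clean_rows)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

The second input row has a non-numeric amount, so it ends up in rejected_rows and only the two valid rows reach the orders table.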
Load Manager
The load manager is also called the front component. It performs all the operations
associated with the extraction and loading of data into the warehouse. These operations
include transformations to prepare the data for entering into the data warehouse.
Data Warehouse
After cleansing, the data is stored in the data warehouse, which acts as the central repository.
It also stores the metadata, while subject-specific subsets of the data are held in the data marts.
Metadata
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
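Basic file metadata of the kind mentioned above can be read with Python's standard library. The sketch below creates a temporary file purely for demonstration; the file contents and the keys of the metadata dictionary are assumptions chosen for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a throwaway document so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("quarterly sales report")
    path = handle.name

# os.stat returns data about the file, not the file's contents.
info = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
}

os.remove(path)  # clean up the temporary file
```

The metadata dictionary describes the document (name, size, modification time) without containing any of its text.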
Summarized Data
This area of the data warehouse holds all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized
record is updated continuously as new information is loaded into the warehouse.
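A summary table that is refreshed on every load can be sketched with sqlite3. The table names (sales, sales_by_region) and the sample rows are hypothetical, and rebuilding the whole summary on each load is a simplification chosen for brevity; a real warehouse would more likely update it incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL);
    CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL, n_sales INTEGER);
    """
)


def load_batch(rows):
    """Load detail rows, then rebuild the summary so it stays current."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("DELETE FROM sales_by_region")
    conn.execute(
        "INSERT INTO sales_by_region "
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"
    )
    conn.commit()


load_batch([("2024-01-05", "north", 100.0), ("2024-01-06", "south", 40.0)])
load_batch([("2024-01-07", "north", 60.0)])

# Reports read the small summary table instead of scanning every sale.
summary = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
```

After the second load the summary already reflects the new sale, so a report on totals per region never has to touch the detail table.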
Warehouse Manager
The warehouse manager performs operations associated with the management of the data
in the warehouse. It performs operations like analysis of data to ensure consistency,
creation of indexes and views, generation of denormalizations and aggregations,
transformation and merging of source data, and archiving and backing up data.
Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization. We can do this by adding data marts.
A data mart is also a part of the storage component. It stores the information of a particular
function of an organization that is handled by a single authority. An organization can have
any number of data marts, depending on its functions. We can also say
that a data mart contains a subset of the data stored in the data warehouse.
A data mart is a segment of a data warehouse that can provide information for
reporting and analysis on a section, unit, department, or operation in the company, e.g.,
sales, payroll, production, etc.
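The idea of a data mart as a departmental subset of the warehouse can be sketched with SQL views in sqlite3. Modeling a mart as a view is an assumption made to keep the example short; in practice a data mart is often a separate physical store. The table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL);
    INSERT INTO warehouse_facts VALUES
        ('sales', 'revenue', 1200.0),
        ('sales', 'orders', 35),
        ('payroll', 'gross_pay', 800.0);
    -- Each data mart exposes only the rows relevant to one department.
    CREATE VIEW sales_mart AS
        SELECT metric, value FROM warehouse_facts WHERE department = 'sales';
    CREATE VIEW payroll_mart AS
        SELECT metric, value FROM warehouse_facts WHERE department = 'payroll';
    """
)

sales_rows = conn.execute("SELECT * FROM sales_mart ORDER BY metric").fetchall()
payroll_rows = conn.execute("SELECT * FROM payroll_mart").fetchall()
```

Users of sales_mart see only sales figures, and users of payroll_mart see only payroll figures, although both draw on the same warehouse table.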
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing
the queries to appropriate tables, it speeds up the query request and response process.
In addition, the query manager is responsible for scheduling the execution of the queries
posted by the user. It presents the data to the user in a form they understand.
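One way to picture the routing step is a lookup that sends each query to the smallest table able to answer it. The sketch below is an illustration only: the CATALOG contents, the row counts, and the route function are hypothetical, and real query managers use far richer metadata.

```python
# Hypothetical catalog: which columns each table can answer, and its row count.
CATALOG = {
    "sales_detail": {
        "columns": {"sale_date", "region", "product", "amount"},
        "rows": 5_000_000,
    },
    "sales_by_region": {"columns": {"region", "amount"}, "rows": 12},
    "sales_by_month": {"columns": {"sale_date", "amount"}, "rows": 36},
}


def route(needed_columns):
    """Return the smallest table that contains every requested column."""
    needed = set(needed_columns)
    candidates = [
        (meta["rows"], name)
        for name, meta in CATALOG.items()
        if needed <= meta["columns"]
    ]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed)}")
    # Tuples compare by row count first, so min() picks the smallest table.
    return min(candidates)[1]


regional_target = route({"region", "amount"})
product_target = route({"product", "amount"})
```

A query that needs only region and amount is sent to the 12-row summary table, while a query that needs product falls back to the detail table, since no summary contains that column.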
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business
managers for strategic decision-making. These users interact with the warehouse
using end-user access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools