Unit 2: Data Warehouse

A data warehouse is a specialized data management system designed to support business intelligence activities, particularly analytics, by storing historical data from various sources for reporting and decision-making. It is characterized by being subject-oriented, integrated, time-variant, and non-volatile, and consists of components such as a central database, ETL tools, metadata, query tools, and data marts. While data warehouses offer advantages like improved data quality and support for data mining, they also present challenges in implementation and maintenance, particularly for large organizations.


DATA WAREHOUSE

Definition: A data warehouse is a type of data management system that is designed
to enable and support business intelligence (BI) activities, especially analytics.
It is a data repository (storage) which stores historical data from single or multiple
sources, used for reporting and analysis for decision-making.

Characteristics of Data warehouse:


According to W. H. Inmon, a data warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data which is used for analysis and reporting for
decision making.
1) Subject oriented: A data warehouse is mainly focused on a subject rather than
on an application; it means we can analyse a particular area.
Ex. If a company wants to analyse "employee data", then "employee" is the subject.
2) Integrated: Various transactional databases are integrated (combined) into a
single data warehouse.
3) Time variant: Historical data is kept in a data warehouse, so we can
retrieve data from three months, six months, etc., whereas OLTP keeps only recent
data.
4) Non-volatile: Once data is stored in the data warehouse it is not changed; it is
only loaded and read, whereas an OLTP system is volatile (records are updated in place).

Need for data warehouse: -


→ For taking quick and effective decisions.
→ Providing necessary information to the user.
→ Creating and managing a data repository.
Advantages of data warehouse: -
 Historical information is provided to analysts, so their tasks can be done better.
 Increases the quality of the data.
 It helps in the data mining process.
 It helps in recovering data after database failures.
 Data is secured.
 Maintains consistency.
Disadvantages of data warehouse / Difficulty in implementation of data warehouse: -
 Construction of a data warehouse for large organizations is a complex task and
it can take time to create.
 Administration of a data warehouse (maintenance, hardware and software
requirements, cost, etc.) is also complex and requires highly technical
persons to maintain.
 Quality control of a data warehouse (cleaning, authentication, correctness, etc.) is also
complex.

Components of Data Warehouse:-


There are mainly five components of a data warehouse.
1) Data Warehouse Database: The central database is the foundation of the data
warehousing environment. This database is implemented on RDBMS technology,
although this kind of implementation is constrained by the fact that a traditional RDBMS
system is optimized for transactional database processing and not for data
warehousing.
2) Sourcing, Acquisition, Clean-up and Transformation Tools (ETL): The data sourcing,
transformation, and migration tools perform all the conversions,
summarizations and all the changes needed to transform data into a unified format in
the data warehouse. They are also called Extract, Transform and Load (ETL) tools.
3) Metadata: Metadata is data about data which defines the data warehouse. It is used
for building, maintaining and managing the data warehouse.
4) Query Tools: One of the primary objectives of data warehousing is to provide
information to the business to make strategic decisions. Query tools allow users to interact
with the data warehouse system.
These tools fall into four different categories:
→ Query and reporting tools.
→ Application development tools.
→ Data mining tools.
→ OLAP tools.
5) Data Marts: A data mart is an access layer which is used to get data out to the users.
It is presented as an option for a large-size data warehouse as it takes less time and
money to build.
ETL Process [Extract, Transform, Load]:-
• The process of moving data from traditional databases to the data
warehouse is called the ETL process.
• Transactional databases cannot answer complex questions, so we use ETL.

• ETL provides a method of moving data from various sources to the data warehouse.

1. Extract [data extraction]: In this step, data is extracted from the multiple sources.
2. Data transformation: The second step of the ETL process is transformation.
After extraction the data is raw and not directly useful; for that reason we
need to do some transformations like:
→ Filtering: loading only particular things into the data warehouse.
→ Cleaning: making the data accurate.
→ Joining: combining multiple columns into one column.
→ Splitting: dividing a single column into multiple columns.
→ Merging: merging the data from multiple sources.
→ Sorting: arranging the data in some order.
Here, the staging area gives an opportunity to validate the extracted data
before it is moved to the data warehouse.
3. Data loading: The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse
daily/weekly/monthly/yearly.
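The three ETL steps can be sketched in plain Python. This is an illustrative toy only: the source rows, field names and cleaning rules are invented, standing in for real transactional sources and a staging area.

```python
# Minimal ETL sketch: extract -> transform (staging) -> load.
# All names and rules here are invented for illustration.

def extract():
    # Stand-in for pulling rows from transactional sources.
    return [
        {"name": " alice ", "dept": "HR", "salary": "50000"},
        {"name": "bob", "dept": "IT", "salary": None},   # incomplete row
        {"name": "carol", "dept": "IT", "salary": "70000"},
    ]

def transform(rows):
    # Staging area: filter out bad rows, clean text, unify types.
    staged = []
    for row in rows:
        if row["salary"] is None:            # filtering: drop incomplete rows
            continue
        staged.append({
            "name": row["name"].strip().title(),   # cleaning
            "dept": row["dept"],
            "salary": int(row["salary"]),          # unified format
        })
    return staged

def load(rows, warehouse):
    # Final step: append validated rows to the warehouse store.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

The bad "bob" row is filtered out during transformation, so only the two cleaned rows reach the warehouse list.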
3-tier architecture of DATA WAREHOUSE
There are three layers:
➔ Bottom tier (Data warehouse tier)
➔ Middle tier (OLAP server)
➔ Top tier (Front-end client tier)

→ The data warehouse is used to collect data from different OLTP systems and external
sources.
→ We cannot send OLTP data directly to the data warehouse; first we apply ETL
(extract, transform, load), and only then is the data loaded into the data warehouse.
➔ Bottom tier / Data warehouse tier:
It receives the data from different sources after applying ETL.

Components:
 Data marts: From the data warehouse we can create a data mart; it is mainly
focused on a single subject's history.
 Metadata: Data about the data which is stored in the data warehouse.
 Monitoring: Monitoring is a complex task of the data warehouse; it
maintains and manages the data.
 Administration: Administration is also a complex task of the data warehouse; for
that, administrators are needed who control the data, provide security,
etc.
➔ Middle tier / OLAP tier:
It acts as an interface between the end user and the data warehouse.
➢ Data is stored or represented in different forms by using an OLAP server.
➢ There are two models of OLAP server:
ROLAP (Relational Online Analytical Processing): used
to store the data in the form of tables.
MOLAP (Multidimensional Online Analytical Processing): used to
store the data in multidimensional format.

➔ Top tier / Front-end client tier:

➢ It is the topmost tier.
➢ In this tier user presentation is important; it is used to display the
output as per user requirements.
• Query/report tools: using these, the user can see the output in the form of tables.
• Analytics tools: using these, the user can see the output in the form of bar charts.
• Data mining tools: using these, the user can see the output in the form of graphs.

Data Mining V/S Data Warehouse

DATA MINING                                        DATA WAREHOUSE
• Data is analysed/retrieved regularly.            • Data is stored periodically.
• Extracts useful patterns/knowledge from a        • Integrates data from various sources.
  huge amount of data.
• It is focused on discovering useful patterns.    • It is focused on storing data.
• Must take place after data warehousing.          • Must take place before data mining.
• Increases throughput (output).                   • Increases system performance.
• It is used to identify the behaviour of the      • It is used to identify the overall behaviour
  particular extracted information.                  of the system.
• It is carried out by business users with the     • It is solely carried out by engineers.
  help of engineers.
• Operations: classification, association,         • Operations: data cleaning, refresh, extract,
  clustering, regression, etc.                       transform, data loading.
OLAP V/S OLTP

Feature             OLAP                                     OLTP
Full form           Online Analytical Processing.            Online Transaction Processing.
Characterization    Smaller number of transactions.          Larger number of transactions.
Statements used     SELECT statements to retrieve data       INSERT, DELETE and UPDATE
                    for analysis.                            statements.
Database type       Denormalized.                            Normalized, so we can insert, delete
                                                             and update easily.
Number of tables    Fewer tables are used, so a smaller      More tables are used, so a greater
                    number of joins are needed.              number of joins are applied.
Number of indices   Unlimited.                               Limited.
Used by             Analysts.                                End users.
Functionality       Analysing and reporting.                 Day-to-day transactions.
Query type          Complex queries over huge data.          Simple queries on few records.
Processing speed    Depends on the amount of data.           Usually fast compared to OLAP.
Storage limit       100 GB to ZB.                            100 MB to GB.

DATA WAREHOUSE MODELS


There are three data warehouse models.
1) Enterprise Warehouse: An enterprise warehouse collects all of the information
about subjects spanning the entire organisation.
It typically contains detailed data as well as summarised data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms. It requires extensive
business modelling and may take years to design and build.
2) Data Mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For
example, a marketing data mart may confine its subjects to customer, item, and sales.
The data contained in data marts tend to be summarised.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based.
Depending on the source of data, data marts can be categorised as independent or
dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.
3) Virtual Warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible summary views
may be materialised.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Applications of Data Warehouse:
1) Banking Industry: In the banking industry, attention is given to risk
management and policy reversal, as well as to analysing consumer data, market trends,
government regulations and reports, and, more importantly, financial decision making.
Most banks use a data warehouse to store bank data and utilize it for market
research, performance analysis of each product, interchange and exchange rates, and to
develop marketing programs.
2) Government and Education: The federal government utilizes the warehouse for
research in compliance, whereas state governments use it for services related to
human resources, like recruitment, and accounting, like payroll management.
The government uses the data warehouse to maintain and analyse tax records and health
policy records and their respective providers; their entire criminal law
database is also connected to the state's data warehouse. Criminal activity is predicted
from the patterns and trends found by analysis of historical data associated with past
criminals.
3) Health Care: One of the most important sectors which utilizes data warehouses is
the health care sector. All of their financial, clinical and employee records are fed to
warehouses, as it helps them to strategize and predict outcomes, track and analyse
their service feedback, generate patient reports, share data with tie-in insurance
companies, medical aid services, etc.
4) Hospitality Industry: A major proportion of this industry is made up of hotel and
restaurant services, car rental services, and holiday home services.
They utilize warehouse services to design and evaluate their advertising and
promotion campaigns, where they target customers based on their feedback and travel
patterns.
5) Manufacturing and Distribution Industry: A manufacturing organisation has to take
several make-or-buy decisions which can influence the future of the sector, which is
why they utilize high-end OLAP tools as a part of data warehouses to predict market
changes, analyse current business trends, detect warning conditions, view marketing
developments and ultimately take better decisions.
They also use them for product shipment records and records of product portfolios, to
identify profitable product lines, and to analyse previous data and customer feedback to
evaluate the weaker product lines and eliminate them.
6) The Retailers: Retailers serve as middlemen between producers and consumers. It
is important for them to maintain records of both parties to ensure their
existence in the market.
7) Telephone Industry: The telephone industry operates over both offline and online
data, burdening them with a lot of historical data which has to be consolidated and
integrated.

Data Cube / Multidimensional Cube / OLAP Cube

A cube is a multidimensional data set, which is used to store the data in
multidimensional format for reporting purposes.

* It is an approach to analysing data from multiple perspectives.

* An OLAP cube can hold data, collect data and calculate over data.

* OLAP cube data is represented in dimensions, and all these dimensions are
combined to create a cube.
FEATURES OF MULTIDIMENSIONAL DATA MODELS
➢ Measures: Measures are numerical data that can be analysed and compared,
such as sales or revenue. They are typically stored in fact tables in a
multidimensional data model.
➢ Dimensions: Dimensions are attributes that describe the measures, such as time,
location, or product. They are typically stored in dimension tables in a
multidimensional data model.
➢ Cubes: Cubes are structures that represent the multidimensional relationships
between measures and dimensions in a data model. They provide a fast and
efficient way to retrieve and analyse data.
➢ Aggregation: Aggregation is the process of summarising data across dimensions
and levels of detail. This is a key feature of multidimensional data models, as it
enables users to quickly analyse data at different levels of granularity.
➢ Drill-down and roll-up: Drill-down is the process of moving from a higher-level
summary of data to a lower level of detail, while roll-up is the opposite process of
moving from a lower-level detail to a higher-level summary. These features enable
users to explore data in greater detail and gain insights into the underlying
patterns.
➢ Hierarchies: Hierarchies are a way of organising dimensions into levels of detail.
For example, a time dimension might be organised into years, quarters, months,
and days. Hierarchies provide a way to navigate the data and perform drill-down
and roll-up operations.
➢ OLAP (Online Analytical Processing): OLAP is a type of multidimensional data
model that supports fast and efficient querying of large datasets. OLAP systems are
designed to handle complex queries and provide fast response times.

WORKING ON A MULTIDIMENSIONAL DATA MODEL


The Multidimensional Data Model works on the basis of pre-decided steps.

The following stages should be followed by every project for building a Multidimensional Data
Model:
→ Stage 1: Assembling data from the client: In the first stage, a Multidimensional Data
Model collects correct data from the client. Mostly, software professionals explain
to the client the range of data which can be gained with the
selected technology, and collect the complete data in detail.
→ Stage 2: Grouping different segments of the system: In the second stage, the
Multidimensional Data Model recognises and classifies all the data into the respective
sections they belong to, which also makes it problem-free to apply step by step.
→ Stage 3: Noticing the different proportions: The third stage is the basis on
which the design of the system rests. In this stage, the main factors are
recognised according to the user's point of view. These factors are also known as
"Dimensions".
→ Stage 4: Preparing the actual-time factors and their respective qualities: In the
fourth stage, the factors which are recognised in the previous step are used further
for identifying the related qualities. These qualities are also known as "attributes"
in the database.
→ Stage 5: Finding the actuality of the previously listed factors and their
qualities: In the fifth stage, a Multidimensional Data Model separates and
differentiates the actuality from the factors which it has collected. These actualities
play a significant role in the arrangement of a Multidimensional Data Model.
→ Stage 6: Building the Schema to place the data, with respect to the information
collected from the steps above: In the sixth stage, on the basis of the data which
was collected previously, a Schema is built.

ADVANTAGES OF MULTIDIMENSIONAL DATA MODEL:


➢ A multidimensional data model is easy to handle.
➢ It is easy to maintain.
➢ Its performance is better than that of normal databases (ex. relational databases).
➢ The representation of data is better than traditional databases. That is
because the multidimensional databases are multi viewed and carry different
types of factors.
➢ It is workable on complex systems and applications, contrary to simple
one-dimensional database systems.
➢ Its compatibility is a benefit for projects having lower
bandwidth for maintenance staff.
DISADVANTAGES OF MULTIDIMENSIONAL DATA MODEL
➢ The Multidimensional Data Model is slightly complicated in nature and
it requires professionals to recognize and examine the data in the database.
➢ When the system caches during the work of a Multidimensional Data Model,
there is a great effect on the working of the system.
➢ It is complicated in nature, due to which the databases are generally dynamic in design.
➢ The path to achieving the end product is complicated most of the time.
➢ As the Multidimensional Data Model has complicated systems, it uses
a large number of databases, due to which the system is very insecure when
there is a security breach.
OLAP (Online Analytical Processing):
OLAP is an approach to answering Multidimensional Analytical (MDA) queries swiftly.
OLAP is part of the broader category of Business Intelligence, which also
encompasses relational databases, report writing and data mining.
An OLAP server is based on the multidimensional data model. It allows managers
and analysts to get an insight into the information through fast, consistent and
interactive access to information.
List of OLAP operations:
→ Roll-up
→ Drill down
→ Slice and dice
→ Pivot (rotate)
1. Roll-up: Roll-up performs aggregation on a data cube in either of the following ways:
 By climbing up a concept hierarchy for a dimension.
 By dimension reduction.
 The following diagram illustrates how roll-up works.

2. Drill down: It is the reverse operation of roll-up. It is performed in either of the following ways:
 By stepping down a concept hierarchy for a dimension.
 By introducing a new dimension.

3. Slice: The slice operation selects one particular dimension from a given cube and
provides a new sub-cube.
4. Dice: Dice selects two or more dimensions from a given cube and provides a new sub-cube.

5. Pivot: The pivot operation is also known as rotation. It rotates the data axes in
view in order to provide an alternative presentation of the data.
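The operations above can be sketched in Python on a tiny cube held as a dictionary mapping (city, quarter, item) tuples to a sales measure. The dimension names and values are invented for illustration.

```python
# Toy OLAP operations on a cube stored as {(city, quarter, item): sales}.
from collections import defaultdict

cube = {
    ("Delhi",  "Q1", "book"): 100, ("Delhi",  "Q2", "book"): 150,
    ("Delhi",  "Q1", "pen"):  40,  ("Mumbai", "Q1", "book"): 200,
    ("Mumbai", "Q2", "pen"):  60,
}

def roll_up(cube, drop_dim):
    # Roll-up by dimension reduction: remove one dimension and aggregate.
    out = defaultdict(int)
    for key, value in cube.items():
        reduced = tuple(v for i, v in enumerate(key) if i != drop_dim)
        out[reduced] += value
    return dict(out)

def slice_op(cube, dim, value):
    # Slice: fix one value on one dimension -> a sub-cube.
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, criteria):
    # Dice: criteria maps dimension index -> set of allowed values,
    # on two or more dimensions.
    return {k: v for k, v in cube.items()
            if all(k[d] in allowed for d, allowed in criteria.items())}

print(roll_up(cube, 1))     # total sales per (city, item), quarters collapsed
print(slice_op(cube, 1, "Q1"))
print(dice(cube, {0: {"Delhi"}, 2: {"book"}}))
```

Rolling up over the quarter dimension sums Delhi's book sales of 100 and 150 into a single 250, exactly the kind of aggregation a concept hierarchy (quarter to year) would produce.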

OLAP Servers: These servers allow the analyst to get information through consistent,
interactive access.
TYPES OF OLAP:-
 ROLAP [Relational Online Analytical Processing]
• ROLAP works directly with relational databases and doesn't require precomputation.
• ROLAP tools can analyse large amounts of data.
• Here information is stored in table form, and new tables are created to store
resultant information.
• Poor query performance because of the huge amount of data.
• Only expert persons can deal with a ROLAP server.
• Data access is slow compared to MOLAP.
• It is not suitable for heavy calculations, as such cases cannot translate well into SQL.

 MOLAP [Multidimensional Online Analytical Processing model]

• It is very easy to use, and data is stored in the form of multidimensional cubes.
• Here all calculations are pre-generated/computed when the cube is created.
Advantages:
• Fast query performance due to the smaller amount of data compared to ROLAP.
• Information retrieval is also fast because all the data is in dimensional format.
• It can perform complex calculations.
Disadvantages:
• MOLAP can handle only a limited amount of data.
• Less scalable.
• It requires an additional database, an MDDB (hypercube), for storing pre-computed
values; for this, additional storage space is required.
 HOLAP [Hybrid Online Analytical Processing model]
• It is a combination of both ROLAP and MOLAP.
• For summary/report purposes HOLAP can take MOLAP for faster
performance.
• For detailed information HOLAP can take ROLAP.

Data Warehouse Schema:

• A schema is an overall structure or design of objects like tables, views, indexes, etc.

The data warehouse is designed using one of these three schemas:
1. Star:

 It is the simplest schema and very easy to understand.

 It has only one fact table, which stores foreign keys and refers to a number
of dimension tables.

 In this schema all dimension tables are "not normalised". It means a
smaller number of tables are used, and therefore a smaller number of joins are used.

 It is suitable for query processing.

 Ex: Consider 4 dimension tables like book, college, employee and students,
and one fact table, that is, University.
NOTE:
→ The fact table is the centre table. It stores quantitative information for analysis.
→ One fact table: it contains measurement metrics (or) facts of a business
process, and is the centre of a star and snowflake schema.
→ In the above star schema, "University" is the table which refers to all other tables.
→ This schema is most suitable for query processing because we can use simple queries.
→ The problem is more data redundancy, because the tables are denormalized.
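The star layout can be sketched with Python's built-in sqlite3 module. The table and column names below are invented for illustration (a small fact table with foreign keys into two denormalized dimension tables), not taken from the notes' example.

```python
# Star schema sketch: one fact table, denormalized dimension tables,
# and a query needing just one join per dimension.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_book    (book_id INTEGER PRIMARY KEY, title TEXT, publisher TEXT);
CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_university (            -- centre table: quantitative facts
    book_id INTEGER REFERENCES dim_book(book_id),
    student_id INTEGER REFERENCES dim_student(student_id),
    copies_issued INTEGER
);
INSERT INTO dim_book VALUES (1, 'DBMS', 'ACME'), (2, 'Networks', 'ACME');
INSERT INTO dim_student VALUES (10, 'Asha', 'Delhi');
INSERT INTO fact_university VALUES (1, 10, 3), (2, 10, 1);
""")

# A simple query: total copies issued per student, one join only.
row = con.execute("""
    SELECT s.name, SUM(f.copies_issued)
    FROM fact_university f
    JOIN dim_student s ON s.student_id = f.student_id
    GROUP BY s.name
""").fetchone()
print(row)   # ('Asha', 4)
```

Note the redundancy the NOTE above warns about: the publisher 'ACME' is repeated in every dim_book row because the dimension is not normalised.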
2. Snowflake schema

 It is similar to the star schema, also having only one "fact" table which
references a number of dimension tables.
 But the difference is that in this schema all "dimension tables are
normalised", and these tables can have multiple levels.
 If the tables are normalised, more tables are used and more joins
are needed in order to get the result.
 Advantage: less redundancy because of normalisation, hence dimension
tables are easy to update and maintain.

Consider the same 4 dimension tables, but this time we take the next level of the book
table.
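The extra dimension level can also be sketched with sqlite3. Again the names are invented: the publisher attribute is normalised out of the book dimension into its own table, so reaching it costs one more join but the publisher name is stored only once.

```python
# Snowflake schema sketch: the book dimension is normalised into a
# further level (publisher), so queries need an extra join.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_publisher (pub_id INTEGER PRIMARY KEY, pub_name TEXT);
CREATE TABLE dim_book (book_id INTEGER PRIMARY KEY, title TEXT,
                       pub_id INTEGER REFERENCES dim_publisher(pub_id));
CREATE TABLE fact_sales (book_id INTEGER, amount INTEGER);
INSERT INTO dim_publisher VALUES (1, 'ACME');
INSERT INTO dim_book VALUES (1, 'DBMS', 1), (2, 'Networks', 1);
INSERT INTO fact_sales VALUES (1, 500), (2, 300);
""")

# Two joins now: fact -> book -> publisher (less redundancy, more joins).
row = con.execute("""
    SELECT p.pub_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_book b      ON b.book_id = f.book_id
    JOIN dim_publisher p ON p.pub_id  = b.pub_id
    GROUP BY p.pub_name
""").fetchone()
print(row)   # ('ACME', 800)
```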

3. Fact Constellation / Galaxy Schema

 In this we can use multiple fact tables that share common dimension tables.
 It is a complex schema due to the maintenance of multiple fact tables.
 Dimension tables are also very large. In this, suppose we have 2 fact
tables, sales and publisher, but one dimension table, book.
Data Pre-Processing in Data Mining
Data pre-processing is a data mining technique which is used to transform the
raw data into a useful and efficient format.
Steps involved in Data Pre-Processing:-
1. Data Cleaning: The data can have many irrelevant and missing parts; to handle
this, data cleaning is done. It involves handling of missing data, noisy data,
etc.
Missing Data: This situation arises when some data is missing in the data set. It can be
handled in various ways. Some of them are:
Ignore the tuples: This approach is suitable only when the
dataset we have is quite large and multiple values are missing within the tuples.
Fill the missing values: There are various ways to do this task. You can choose to
fill the missing values manually, by the attribute mean, (or) by the most probable value.
Noisy Data: It is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc.
It can be handled in the following ways:
➢ Binning method: This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size and then various methods are
performed to complete the task.
➢ Regression: Here data can be made smooth by fitting it into a regression
function. The regression used may be linear (having one independent variable)
(or) multiple (having multiple independent variables).
➢ Clustering: This approach groups similar data into clusters; the outliers may be
undetected, or they will fall outside the clusters.
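The binning method above can be illustrated in a few lines of Python: sort the data, split it into equal-size bins, and smooth by replacing each value with its bin mean. The numbers here are an arbitrary example.

```python
# Equal-size binning with smoothing by bin means (illustrative data).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_ = data[i:i + bin_size]
    mean = sum(bin_) / len(bin_)           # replace each value by its bin mean
    smoothed.extend([round(mean)] * len(bin_))
print(smoothed)   # [7, 7, 7, 19, 19, 19, 25, 25, 25]
```

The noisy values within each bin collapse to one representative value, which is exactly the smoothing effect the method aims for.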
2. Data Transformation: This step is taken in order to transform the data into
appropriate forms suitable for the mining process.
This involves the following ways:
➢ Normalization: It is done in order to scale the data values into a specified range.
➢ Attribute selection: In this strategy, new attributes are constructed from the given
set of attributes to help the data mining process.
➢ Concept hierarchy generation: Here attributes are converted from a lower level to a
higher level in a hierarchy. Ex: the attribute "city" can be converted to
"country".
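Normalization can be shown with min-max scaling, a common way to map values into the range [0, 1] (the values below are an arbitrary example, not from the notes).

```python
# Min-max normalization sketch: scale values into [0, 1].
values = [200, 400, 800, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.25, 0.75, 1.0]
```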
3. Data Reduction: Data mining is a technique that is used to handle huge
amounts of data, and while working with a huge amount of data, analysis becomes harder.
In such cases, in order to get rid of this, we use data reduction techniques. They
aim to increase the storage efficiency and reduce data storage and analysis costs.
The various steps of Data Reduction are:
➢ Data cube aggregation
➢ Attribute subset selection
➢ Numerosity reduction
➢ Dimensionality reduction

 Data cube aggregation: The aggregation operation is applied to data for the
construction of data cubes.
 Attribute subset selection: The highly relevant attributes should be used;
the rest can all be discarded.
 Numerosity reduction: This enables us to store a model of the data instead of the whole data.
 Dimensionality reduction: This reduces the size of data by an encoding
mechanism. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).

4. Data Integration: It involves combining data from several disparate sources, which
are stored using various technologies, and providing a unified view of the data.
→ It merges the data from multiple data stores.
→ It includes multiple databases, data cubes and flat files.
→ Metadata, correlation analysis, data conflict detection and resolution of semantic
heterogeneity contribute towards smooth data integration.

Advantages:-
➢ Independence
➢ Faster query processing
➢ Complex query processing
➢ Advanced data summarization and storage possible
➢ High-volume data processing

Disadvantages:-
➢ Latency
➢ Costlier
