FUNDAMENTALS OF DATA SCIENCE
MODULE -2
Introduction to Data Warehouse:-
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision-making process.
Key Characteristics of a Data Warehouse:-
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example, "sales"
can be a particular subject.
Integrated:
A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older data from a data warehouse.
o This contrasts with a transaction system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Warehouse Design Process:-
A data warehouse can be built using a top-down approach, a bottom-up approach, or
a combination of both.
Top-down approach:-
o The top-down approach starts with the overall design and planning.
o It is useful in cases where the technology is mature and well known, and where the
business problems that must be solved are clear and well understood.
Bottom-up approach:-
o The bottom-up approach starts with experiments and prototypes.
o This is useful in the early stage of business modeling and technology development.
o It allows an organization to move forward at considerably less expense and to
evaluate the benefits of the technology before making significant commitments.
Combination of both:-
o In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
The warehouse design process consists of the following steps:
Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger.
If the business process is organizational and involves multiple complex object collections,
a data warehouse model should be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold (a star-schema sketch of these steps follows).
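The design steps above can be made concrete with a minimal star-schema sketch, written here with Python's built-in sqlite3 module. The table and column names (dim_time, dim_item, fact_sales, dollars_sold, units_sold) are hypothetical, chosen only to match the example dimensions and measures above.

import sqlite3

# In-memory database standing in for the warehouse server.
conn = sqlite3.connect(":memory:")

# Dimension tables: one row per member of each dimension.
conn.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
conn.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT, brand TEXT)")

# Fact table at the chosen grain (here: one row per item per day),
# holding foreign keys to the dimensions plus numeric additive measures.
conn.execute("""
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        item_id INTEGER REFERENCES dim_item(item_id),
        dollars_sold REAL,
        units_sold INTEGER
    )
""")
conn.commit()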
A Three-Tier Data Warehouse Architecture:
Tier-1:
o The bottom tier is a warehouse database server that is almost always a relational
database system.
o Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by
external consultants).
o These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse.
o The data are extracted using application program interfaces known as gateways (a small
sketch of this idea follows this list).
o A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
o Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database
Connectivity).
o This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
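To illustrate the gateway idea, the sketch below uses Python's sqlite3 module as a stand-in for an ODBC/JDBC connection; with a real gateway only the connection call would differ (e.g., a pyodbc or JDBC driver). The table and data are made up.

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a gateway connection
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'east'), (2, 'west')")

# The client program generates SQL; the server executes it and returns rows.
rows = conn.execute(
    "SELECT region, COUNT(*) FROM customers GROUP BY region"
).fetchall()
print(rows)  # e.g. [('east', 1), ('west', 1)]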
Tier-2:
o The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
o A ROLAP model is an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.
o A MOLAP model is a special-purpose server that directly implements multidimensional
data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:
There are three data warehouse models.
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
super servers, or parallel architecture platforms.
It requires extensive business modeling and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users.
The scope is confined to specific selected subjects.
For example, a marketing data mart may confine its subjects to customer, item, and
sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based.
The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years.
However, it may involve complex integration in the long run if its design and planning were
not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or
dependent.
Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
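A minimal sketch of the virtual-warehouse idea, again with sqlite3: the view sales_summary (a hypothetical name) is recomputed from the operational table each time it is queried, unless it is explicitly materialized for efficiency.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, amount REAL)")  # operational table
conn.execute("INSERT INTO orders VALUES ('pen', 2.0), ('pen', 3.0), ('book', 10.0)")

# The "warehouse" is just a set of views over the operational data.
conn.execute("""
    CREATE VIEW sales_summary AS
    SELECT item, SUM(amount) AS total FROM orders GROUP BY item
""")
print(conn.execute("SELECT * FROM sales_summary").fetchall())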
OLAP (Online Analytical Processing):
OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.
OLAP is part of the broader category of business intelligence, which also encompasses
relational databases, report writing, and data mining.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
OLAP consists of three basic analytical operations:
Consolidation (Roll-Up)
Drill-Down
Slicing And Dicing
Consolidation (Roll-Up) :-
o Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions.
o For example, all sales offices are rolled up to the sales department or sales division
to anticipate sales trends.
Drill-Down :-
Drill-down is a technique that allows users to navigate through the details. For instance,
users can view the sales by individual products that make up a region's sales.
Slicing and dicing :-
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data
of the OLAP cube and view (dicing) the slices from different viewpoints.
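The three operations can be sketched on a tiny cube with pandas; the column names (region, office, product, sales) and the data are illustrative only.

import pandas as pd

cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "office":  ["E1", "E2", "W1", "W2"],
    "product": ["pen", "book", "pen", "book"],
    "sales":   [100, 150, 120, 90],
})

roll_up    = cube.groupby("region")["sales"].sum()              # offices rolled up to regions
drill_down = cube.groupby(["region", "office"])["sales"].sum()  # back down to office detail
slice_pen  = cube[cube["product"] == "pen"]                     # slice: fix one dimension
print(roll_up, drill_down, slice_pen, sep="\n\n")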
Types of OLAP:
Relational OLAP (ROLAP):
o ROLAP works directly with relational databases.
o The base data and the dimension tables are stored as relational tables and new tables are
created to hold the aggregated information.
o It depends on a specialized schema design.
o This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality.
o In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause
in the SQL statement (sketched after this list).
o ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
o ROLAP tools feature the ability to ask any question because the methodology is not
limited to the contents of a cube.
o ROLAP also has the ability to drill down to the lowest level of detail in the database.
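The WHERE-clause point above can be shown directly. This is a minimal sketch with sqlite3; the fact table and its contents are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, product TEXT, dollars REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [("East", "pen", 100), ("West", "pen", 120), ("East", "book", 150)])

# Slicing on product = 'pen' is just a WHERE clause; dicing by region is a
# GROUP BY on the same relational tables -- no pre-computed cube is needed.
rows = conn.execute(
    "SELECT region, SUM(dollars) FROM fact_sales "
    "WHERE product = 'pen' GROUP BY region"
).fetchall()
print(rows)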
Multidimensional OLAP (MOLAP):
o MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
o MOLAP stores data in optimized multi-dimensional array storage, rather than
in a relational database.
o Therefore it requires the pre-computation and storage of information in the cube - the
operation known as processing.
o MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
o The data cube contains all the possible answers to a given range of questions.
o MOLAP tools have a very fast response time and the ability to quickly write back data
into the data set.
Hybrid OLAP (HOLAP):
o There is no clear agreement across the industry as to what constitutes Hybrid
OLAP, except that a database will divide data between relational and
specialized storage.
o For example, for some vendors, a HOLAP database will use relational tables
to hold the larger quantities of detailed data, and use specialized storage for
at least some aspects of the smaller quantities of more-aggregate or less-
detailed data.
o HOLAP addresses the shortcomings of MOLAP and ROLAP by combining
the capabilities of both approaches.
o HOLAP tools can utilize both pre-calculated cubes and relational data
sources.
Data Cleaning :-
The data cleaning stage smooths out noise, fills in missing values, removes outliers,
and corrects inconsistencies in the data.
Handling missing values:
i. Ignoring the tuple: Used when the class label is missing. This method is not
very effective when a tuple contains many missing values.
ii. Fill in the missing value manually: This is time-consuming.
iii. Use a global constant to fill in the missing value: Ex: "unknown" or ∞
iv. Use the attribute mean to fill in the missing value
v. Use the attribute mean for all samples belonging to the same class as the given tuple
vi. Use the most probable value to fill in the missing value (e.g., using a decision
tree); strategies (iii)-(v) are sketched after this list.
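A sketch of strategies (iii)-(v) with pandas; the table and its values are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [50.0, None, 30.0, None],
})

# (iii) Global constant (cast to object so a string fill is allowed).
const_fill = df["income"].astype(object).fillna("unknown")

# (iv) Attribute mean over all tuples.
mean_fill = df["income"].fillna(df["income"].mean())

# (v) Attribute mean computed per class.
class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(const_fill, mean_fill, class_fill, sep="\n\n")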
Noisy data:
Noise is a random error or variance in a measured variable.
Regression:
Data smoothing can also be done by regression (linear regression, multiple
linear regression). Here, one attribute is used to predict the value of
another.
Outlier analysis:
Outliers can be detected by clustering: values that fall outside the clusters
are outliers (a small sketch follows).
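A small sketch of both ideas with numpy: a straight line is fitted to noisy data (regression smoothing), and points whose residual is much larger than the rest are flagged as outliers. The data and the threshold are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 20.0, 4.1, 5.0])   # the middle point is noisy

slope, intercept = np.polyfit(x, y, 1)      # least-squares line y ~ slope*x + intercept
smoothed = slope * x + intercept            # regression-smoothed values

# Crude outlier analysis: flag points whose residual is more than
# twice the mean residual.
residual = np.abs(y - smoothed)
outliers = x[residual > 2 * residual.mean()]
print(smoothed, outliers)                   # outliers -> [3.]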
Data Integration :-
Data mining often works on integrated data from multiple repositories.
Careful integration improves the accuracy of data mining results.
Challenges of Data Integration:
a. Entity Identification Problem:
“How can schema and objects from many sources be matched?” This is called
the entity identification problem.
Ex: Cust-id in one table and Cust-no in another table. Metadata helps in
avoiding these problems.
b. Redundancy and correlation analysis:
Redundancy -> repetition.
o Some redundancy can be detected by correlation analysis.
o Given two attributes, correlation analysis tells how strong the relationship
between them is (the chi-square test and the correlation coefficient are examples;
both are sketched below).
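Both tests can be sketched in a few lines; numpy and scipy are assumed to be available, and the numbers are made up (the contingency table follows the classic gender vs. preferred-reading example).

import numpy as np
from scipy.stats import chi2_contingency

# Correlation coefficient for two numeric attributes: a value near +1 or -1
# suggests one attribute is redundant given the other.
price = np.array([10.0, 20.0, 30.0, 40.0])
tax   = np.array([1.0, 2.1, 2.9, 4.2])
print(np.corrcoef(price, tax)[0, 1])

# Chi-square test for two nominal attributes, given their contingency table.
table = np.array([[250, 200],
                  [50, 1000]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value => the attributes are correlated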
Data Reduction :-
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
Data Reduction Strategies:
1) Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.
2) Numerosity reduction:
Replace original data by alternate smaller forms.
Ex: Histograms, Sampling, Data cube aggregation.
3) Data compression:
Reduce the size of the data (dimensionality and numerosity reduction are sketched below).
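Dimensionality reduction (PCA) and numerosity reduction (simple random sampling) can be sketched as follows; scikit-learn is assumed to be available, and the data is random, purely for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 tuples with 5 attributes

# Dimensionality reduction: project 5 attributes onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

# Numerosity reduction: keep a simple random sample of 10 tuples.
sample = X[rng.choice(len(X), size=10, replace=False)]

print(X_reduced.shape, sample.shape)   # (100, 2) (10, 5)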
Data Transformation :-
The data is transformed or consolidated so that the resulting mining process may
be more efficient, and the patterns found may be easier to understand.
Data Transformation Strategies overview:
i. Smoothing:
Performed to remove noise.
Ex: Binning, regression, clustering.
ii. Attribute construction:
New attributes are added to help the mining process.
iii. Aggregation:
Data is summarized or aggregated.
Ex: Sales data is aggregated into monthly & annual sales. This step is used
for constructing a data cube.
iv. Normalization:
Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
v. Data Discretization:
Where raw values are replaced by interval labels or conceptual labels.
Ex: age values can be replaced by interval labels (10-18, 19-50) or conceptual
labels (youth, adult); normalization and discretization are sketched below.
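A short sketch of normalization and discretization with pandas; the ages and cut points follow the example above.

import pandas as pd

age = pd.Series([12, 15, 22, 35, 48])

# Normalization: min-max scaling into the range [-1.0, +1.0].
norm = 2 * (age - age.min()) / (age.max() - age.min()) - 1

# Discretization: replace raw ages by interval or conceptual labels.
intervals = pd.cut(age, bins=[9, 18, 50], labels=["10-18", "19-50"])
concepts  = pd.cut(age, bins=[9, 18, 50], labels=["youth", "adult"])

print(norm, intervals, concepts, sep="\n\n")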
vi. Concept hierarchy generation for nominal data:
Attributes are generalized to higher-level concepts.
Ex: Street is generalized to city or country.