FUNDAMENTALS OF DATA SCIENCE
MODULE -2
Introduction to Data Warehouse:-
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision-making process.
Key Characteristics of a Data Warehouse:-
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example, "sales"
can be a particular subject.
Integrated:
A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older data from a data warehouse.
o This contrasts with a transaction system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Warehouse Design Process:-
A data warehouse can be built using a top-down approach, a bottom-up approach, or
a combination of both.
Top-down approach:-
o The top-down approach starts with the overall design and planning.
o It is useful in cases where the technology is mature and well known, and where the
business problems that must be solved are clear and well understood.
Bottom-up approach:-
o The bottom-up approach starts with experiments and prototypes.
o This is useful in the early stage of business modeling and technology development.
o It allows an organization to move forward at considerably less expense and to
evaluate the benefits of the technology before making significant commitments.
Combination of both:-
o In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
The warehouse design process consists of the following steps:
Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger.
If the business process is organizational and involves multiple complex object collections,
a data warehouse model should be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold (a star-schema sketch of these steps follows).
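The design steps above can be made concrete with a minimal star-schema sketch, written here with Python's built-in sqlite3 module. The table and column names (dim_time, dim_item, fact_sales, dollars_sold, units_sold) are hypothetical, chosen only to match the example dimensions and measures above.

import sqlite3

# In-memory database standing in for the warehouse server.
conn = sqlite3.connect(":memory:")

# Dimension tables: one row per member of each dimension.
conn.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
conn.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT, brand TEXT)")

# Fact table at the chosen grain (here: one row per item per day),
# holding foreign keys to the dimensions plus numeric additive measures.
conn.execute("""
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        item_id INTEGER REFERENCES dim_item(item_id),
        dollars_sold REAL,
        units_sold INTEGER
    )
""")
conn.commit()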
A Three-Tier Data Warehouse Architecture:
Tier-1:
o The bottom tier is a warehouse database server that is almost always a relational
database system.
o Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by
external consultants).
o These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse.
o The data are extracted using application program interfaces known as gateways (a small
sketch of this idea follows this list).
o A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
o Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database
Connectivity).
o This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
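To illustrate the gateway idea, the sketch below uses Python's sqlite3 module as a stand-in for an ODBC/JDBC connection; with a real gateway only the connection call would differ (e.g., a pyodbc or JDBC driver). The table and data are made up.

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a gateway connection
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'east'), (2, 'west')")

# The client program generates SQL; the server executes it and returns rows.
rows = conn.execute(
    "SELECT region, COUNT(*) FROM customers GROUP BY region"
).fetchall()
print(rows)  # e.g. [('east', 1), ('west', 1)]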
Tier-2:
o The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
o A ROLAP model is an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.
o A MOLAP model is a special-purpose server that directly implements multidimensional
data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:
There are three data warehouse models.
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
super servers, or parallel architecture platforms.
It requires extensive business modeling and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users.
The scope is confined to specific selected subjects.
For example, a marketing data mart may confine its subjects to customer, item, and
sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based.
The implementation cycle of a data mart is more likely to be measured in weeks rather than
months or years.
However, it may involve complex integration in the long run if its design and planning were
not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or
dependent.
Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
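A minimal sketch of the virtual-warehouse idea, again with sqlite3: the view sales_summary (a hypothetical name) is recomputed from the operational table each time it is queried, unless it is explicitly materialized for efficiency.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, amount REAL)")  # operational table
conn.execute("INSERT INTO orders VALUES ('pen', 2.0), ('pen', 3.0), ('book', 10.0)")

# The "warehouse" is just a set of views over the operational data.
conn.execute("""
    CREATE VIEW sales_summary AS
    SELECT item, SUM(amount) AS total FROM orders GROUP BY item
""")
print(conn.execute("SELECT * FROM sales_summary").fetchall())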
OLAP (Online Analytical Processing):
OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.
OLAP is part of the broader category of business intelligence, which also encompasses
relational databases, report writing, and data mining.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
OLAP consists of three basic analytical operations:
Consolidation (Roll-Up)
Drill-Down
Slicing And Dicing
Consolidation (Roll-Up) :-
o Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions.
o For example, all sales offices are rolled up to the sales department or sales division
to anticipate sales trends.
Drill-Down :-
Drill-down is a technique that allows users to navigate through the details. For instance,
users can view the sales by individual products that make up a region's sales.
Slicing and dicing :-
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data
of the OLAP cube and view (dicing) the slices from different viewpoints.
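The three operations can be sketched on a tiny cube with pandas; the column names (region, office, product, sales) and the data are illustrative only.

import pandas as pd

cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "office":  ["E1", "E2", "W1", "W2"],
    "product": ["pen", "book", "pen", "book"],
    "sales":   [100, 150, 120, 90],
})

roll_up    = cube.groupby("region")["sales"].sum()              # offices rolled up to regions
drill_down = cube.groupby(["region", "office"])["sales"].sum()  # back down to office detail
slice_pen  = cube[cube["product"] == "pen"]                     # slice: fix one dimension
print(roll_up, drill_down, slice_pen, sep="\n\n")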
Types of OLAP:
Relational OLAP (ROLAP):
o ROLAP works directly with relational databases.
o The base data and the dimension tables are stored as relational tables and new tables are
created to hold the aggregated information.
o It depends on a specialized schema design.
o This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality.
o In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause
in the SQL statement (sketched after this list).
o ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
o ROLAP tools feature the ability to ask any question because the methodology is not
limited to the contents of a cube.
o ROLAP also has the ability to drill down to the lowest level of detail in the database.
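The WHERE-clause point above can be shown directly. This is a minimal sketch with sqlite3; the fact table and its contents are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, product TEXT, dollars REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [("East", "pen", 100), ("West", "pen", 120), ("East", "book", 150)])

# Slicing on product = 'pen' is just a WHERE clause; dicing by region is a
# GROUP BY on the same relational tables -- no pre-computed cube is needed.
rows = conn.execute(
    "SELECT region, SUM(dollars) FROM fact_sales "
    "WHERE product = 'pen' GROUP BY region"
).fetchall()
print(rows)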
Multidimensional OLAP (MOLAP):
o MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
o MOLAP stores data in optimized multi-dimensional array storage, rather than
in a relational database.
o Therefore it requires the pre-computation and storage of information in the cube - the
operation known as processing.
o MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
o The data cube contains all the possible answers to a given range of questions.
o MOLAP tools have a very fast response time and the ability to quickly write back data
into the data set.
Hybrid OLAP (HOLAP):
o There is no clear agreement across the industry as to what constitutes Hybrid
OLAP, except that a database will divide data between relational and
specialized storage.
o For example, for some vendors, a HOLAP database will use relational tables
to hold the larger quantities of detailed data, and use specialized storage for
at least some aspects of the smaller quantities of more-aggregate or less-
detailed data.
o HOLAP addresses the shortcomings of MOLAP and ROLAP by combining
the capabilities of both approaches.
o HOLAP tools can utilize both pre-calculated cubes and relational data
sources.
Data Cleaning :-
The data cleaning stage smooths out noise, fills in missing values, removes outliers,
and corrects inconsistencies in the data.
Handling missing values:
i. Ignoring the tuple: Used when the class label is missing. This method is not
very effective when a tuple contains many missing values.
ii. Fill in the missing value manually: This is time-consuming.
iii. Use a global constant to fill in the missing value: Ex: "unknown" or ∞
iv. Use the attribute mean to fill in the missing value
v. Use the attribute mean for all samples belonging to the same class as the given tuple
vi. Use the most probable value to fill in the missing value (e.g., using a decision
tree); strategies (iii)-(v) are sketched after this list.
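A sketch of strategies (iii)-(v) with pandas; the table and its values are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [50.0, None, 30.0, None],
})

# (iii) Global constant (cast to object so a string fill is allowed).
const_fill = df["income"].astype(object).fillna("unknown")

# (iv) Attribute mean over all tuples.
mean_fill = df["income"].fillna(df["income"].mean())

# (v) Attribute mean computed per class.
class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(const_fill, mean_fill, class_fill, sep="\n\n")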
Noisy data:
Noise is a random error or variance in a measured variable.
Regression:
Data smoothing can also be done by regression (linear regression, multiple
linear regression). Here, one attribute is used to predict the value of
another.
Outlier analysis:
Outliers can be detected by clustering: values that fall outside the clusters
are outliers (a small sketch follows).
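A small sketch of both ideas with numpy: a straight line is fitted to noisy data (regression smoothing), and points whose residual is much larger than the rest are flagged as outliers. The data and the threshold are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 20.0, 4.1, 5.0])   # the middle point is noisy

slope, intercept = np.polyfit(x, y, 1)      # least-squares line y ~ slope*x + intercept
smoothed = slope * x + intercept            # regression-smoothed values

# Crude outlier analysis: flag points whose residual is more than
# twice the mean residual.
residual = np.abs(y - smoothed)
outliers = x[residual > 2 * residual.mean()]
print(smoothed, outliers)                   # outliers -> [3.]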
Data Integration :-
Data mining often works on integrated data from multiple repositories.
Careful integration improves the accuracy of data mining results.
Challenges of Data Integration:
a. Entity Identification Problem:
“How can schema and objects from many sources be matched?” This is called
the entity identification problem.
Ex: Cust-id in one table and Cust-no in another table. Metadata helps in
avoiding these problems.
b. Redundancy and correlation analysis:
Redundancy -> repetition.
o Some redundancy can be detected by correlation analysis.
o Given two attributes, correlation analysis tells how strong the relationship
between them is (the chi-square test and the correlation coefficient are examples;
both are sketched below).
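Both tests can be sketched in a few lines; numpy and scipy are assumed to be available, and the numbers are made up (the contingency table follows the classic gender vs. preferred-reading example).

import numpy as np
from scipy.stats import chi2_contingency

# Correlation coefficient for two numeric attributes: a value near +1 or -1
# suggests one attribute is redundant given the other.
price = np.array([10.0, 20.0, 30.0, 40.0])
tax   = np.array([1.0, 2.1, 2.9, 4.2])
print(np.corrcoef(price, tax)[0, 1])

# Chi-square test for two nominal attributes, given their contingency table.
table = np.array([[250, 200],
                  [50, 1000]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value => the attributes are correlated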
Data Reduction :-
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data.
Data Reduction Strategies:
1) Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.
2) Numerosity reduction:
Replace original data by alternate smaller forms.
Ex: Histograms, Sampling, Data cube aggregation.
3) Data compression:
Reduce the size of the data (dimensionality and numerosity reduction are sketched below).
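Dimensionality reduction (PCA) and numerosity reduction (simple random sampling) can be sketched as follows; scikit-learn is assumed to be available, and the data is random, purely for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 tuples with 5 attributes

# Dimensionality reduction: project 5 attributes onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

# Numerosity reduction: keep a simple random sample of 10 tuples.
sample = X[rng.choice(len(X), size=10, replace=False)]

print(X_reduced.shape, sample.shape)   # (100, 2) (10, 5)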
Data Transformation :-
The data is transformed or consolidated so that the resulting mining process may
be more efficient, and the patterns found may be easier to understand.
Data Transformation Strategies overview:
i. Smoothing:
Performed to remove noise.
Ex: Binning, regression, clustering.
ii. Attribute construction:
New attributes are added to help the mining process.
iii. Aggregation:
Data is summarized or aggregated.
Ex: Sales data is aggregated into monthly & annual sales. This step is used
for constructing a data cube.
iv. Normalization:
Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
v. Data Discretization:
Where raw values are replaced by interval labels or conceptual labels.
Ex: age values can be replaced by interval labels (10-18, 19-50) or conceptual
labels (youth, adult); normalization and discretization are sketched below.
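A short sketch of normalization and discretization with pandas; the ages and cut points follow the example above.

import pandas as pd

age = pd.Series([12, 15, 22, 35, 48])

# Normalization: min-max scaling into the range [-1.0, +1.0].
norm = 2 * (age - age.min()) / (age.max() - age.min()) - 1

# Discretization: replace raw ages by interval or conceptual labels.
intervals = pd.cut(age, bins=[9, 18, 50], labels=["10-18", "19-50"])
concepts  = pd.cut(age, bins=[9, 18, 50], labels=["youth", "adult"])

print(norm, intervals, concepts, sep="\n\n")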
vi. Concept hierarchy generation for nominal data:
Attributes are generalized to higher-level concepts.
Ex: Street is generalized to city or country.