GATE DA Data Warehousing
Instructions:
• Read this study material carefully and make your own handwritten short notes. (Short
notes must not be more than 5-6 pages)
• Revise this material at least 5 times. Once you have prepared your short notes, revise them twice a week.
• If you are not able to understand any topic or require a detailed explanation, please mention it in our discussion forum on the website.
• Let me know if there are any typos or mistakes in the study material. Mail me at [email protected]
1 Data Warehousing
• A data warehouse is a repository (data and metadata) that contains integrated, cleansed,
and reconciled data from disparate sources for decision support applications, with an
emphasis on online analytical processing.
1. Extract: The first stage of the ETL process is to extract data from various sources such
as transactional systems, spreadsheets, and flat files. Data from the source systems,
which may be in different formats (relational databases, NoSQL stores, XML, flat files),
is extracted into a staging area. It is important to extract the data into the staging
area first rather than directly into the data warehouse, because the extracted data
arrives in heterogeneous formats and may be corrupted; loading it directly could damage
the warehouse, and rolling back such a load would be much more difficult. This makes
extraction one of the most important steps of the ETL process.
2. Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. A set of rules or functions is applied to
the extracted data to convert it into a single standard format, for example by cleaning,
filtering, de-duplicating, joining, and aggregating the data.
3. Load: After transformation, the data is loaded into the data warehouse. This step
involves creating the physical data structures and loading the transformed data into
them. Sometimes the warehouse is refreshed very frequently, and sometimes loading is
done at longer but regular intervals; the rate and period of loading depend solely on
the requirements and vary from system to system.
The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse
is accurate, complete, and up-to-date. It also helps to ensure that the data is in the
format required for data mining and reporting.
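To make the three stages concrete, here is a minimal, hypothetical ETL sketch in Python using pandas and SQLite; the file name, column names, and the warehouse table are illustrative assumptions, not part of any particular system.

```python
# Minimal illustrative ETL sketch (hypothetical file/column names).
import sqlite3
import pandas as pd

# 1. Extract: pull raw data from a source file into a staging DataFrame.
staged = pd.read_csv("daily_sales_export.csv")            # staging area

# 2. Transform: clean and standardize into a single format.
staged = staged.dropna(subset=["order_id"])               # drop corrupt rows
staged["order_date"] = pd.to_datetime(staged["order_date"])
staged["amount_usd"] = staged["amount"].round(2)          # unify precision
transformed = staged[["order_id", "order_date", "amount_usd"]]

# 3. Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("fact_sales", conn, if_exists="append", index=False)
```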
Advantages of the ETL process:
• Better data integration: The ETL process helps to integrate data from multiple sources
and systems, making it more accessible and usable.
• Increased data security: The ETL process can help to improve data security by control-
ling access to the data warehouse and ensuring that only authorized users can access
the data.
• Improved scalability: The ETL process can help to improve scalability by providing a
way to manage and analyze large amounts of data.
• Increased automation: ETL tools and technologies can automate and simplify the
ETL process, reducing the time and effort required to load and update data in the
warehouse.
Disadvantages of the ETL process:
• Complexity: The ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
• Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not
be able to handle unstructured data or real-time data streams.
• Limited scalability: The ETL process can be limited in terms of scalability, as it may
not be able to handle very large amounts of data.
• Data privacy concerns: The ETL process can raise concerns about data privacy, as large
amounts of data are collected, stored, and analyzed.
Data warehouse design approaches:
• The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that
must be solved are clear and well understood.
• The bottom-up approach starts with experiments and prototypes. This is useful in the
early stage of business modeling and technology development. It allows an organiza-
tion to move forward at considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
• In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
The warehouse design process typically involves the following steps:
• Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is orga-
nizational and involves multiple complex object collections, a data warehouse model
should be followed. However, if the process is departmental and focuses on the analysis
of one kind of business process, a data mart model should be chosen.
• Choose the grain of the business process. The grain is the fundamental, atomic level
of data to be represented in the fact table for this process, for example, individual
transactions, individual daily snapshots, and so on.
• Choose the dimensions that will apply to each fact table record. Typical dimensions
are time, item, customer, supplier, warehouse, transaction type, and status.
• Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
Data warehouses often adopt a three-tier architecture.
Tier-1:
• The bottom tier is a warehouse database server that is almost always a relational
database system.
• Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by
external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format),
as well as load and refresh functions to update the data warehouse.
• The data are extracted using application program interfaces known as gateways. A
gateway is supported by the underlying DBMS and allows client programs to generate
SQL code to be executed at a server.
• Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
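As one concrete (and hypothetical) illustration of the gateway idea above, the snippet below uses the pyodbc package, a Python ODBC bridge, to send SQL to a source database. The DSN, credentials, table, and column names are placeholders, not part of these notes.

```python
# Hypothetical ODBC example: the DSN, credentials, and table are placeholders.
import pyodbc

# The ODBC gateway lets the client program ship SQL to the server for execution.
conn = pyodbc.connect("DSN=SourceDB;UID=etl_user;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT customer_id, city FROM customer_profile")
for row in cursor.fetchall():
    print(row[0], row[1])   # each row comes back from the source server
conn.close()
```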
Tier-2:
– The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
– A ROLAP model is an extended relational DBMS that maps operations on multidimen-
sional data to standard relational operations.
– A multidimensional OLAP (MOLAP) model, that is, a special-purpose server
that directly implements multidimensional data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data warehouse models:
1. Enterprise warehouse:
• An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
• It provides corporate-wide data integration, usually from one or more operational sys-
tems or external information providers, and is cross-functional in scope.
• It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
2. Data mart:
• A data mart contains a subset of corporate-wide data that is of value to a specific group
of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained
in data marts tend to be summarized.
• Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX-
or Windows-based. The implementation cycle of a data mart is more likely to be
measured in weeks rather than months or years. However, it may involve complex
integration in the long run if its design and planning were not enterprise-wide.
A metadata repository typically contains the following:
• A description of the structure of the data warehouse, which includes the warehouse
schema, views, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents.
• Operational metadata, which includes data lineage (the history of migrated data and
the sequence of transformations applied to it), the currency of data (active, archived,
or purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
• The algorithms used for summarization, which include measure and dimension defi-
nition algorithms, data on granularity, partitions, subject areas, aggregation, summa-
rization, and predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, and transformation rules and defaults, data refresh and purging rules, and security
(user authorization and access control).
• Data related to system performance, which includes indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and scheduling
of refresh, update, and replication cycles.
• Business metadata, which includes business terms and definitions, data ownership
information, and charging policies.
Schemas for multidimensional data models:
1. Star schema: The most common modeling paradigm is the star schema, in which the
data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no redun-
dancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
A star schema for AllElectronics sales is shown in Figure 1. Sales are considered along four
dimensions: time, item, branch, and location. The schema contains a central fact table
for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold. To minimize the size of the fact table, dimension identifiers
(e.g., time key and item key) are system generated identifiers. Notice that in the star
schema, each dimension is represented by only one table, and each table contains a set
of attributes.
Example: Suppose an organization sells products throughout the world. The four major
dimensions are time, item, branch, and location.
Figure 1
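To make the star-schema layout concrete, here is a small pandas sketch with made-up rows: a fact table holding foreign keys and measures, joined to two dimension tables to answer a typical query. The tables and values are illustrative assumptions, not the AllElectronics data.

```python
import pandas as pd

# Dimension tables (sample values only).
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
dim_item = pd.DataFrame({"item_key": [10, 11], "item_name": ["TV", "Laptop"], "brand": ["A", "B"]})

# Fact table: system-generated foreign keys to each dimension plus the numeric measures.
fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "dollars_sold": [500.0, 900.0, 450.0],
    "units_sold": [5, 3, 4],
})

# Typical star-schema query: total dollars sold per quarter per brand.
report = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_item, on="item_key")
          .groupby(["quarter", "brand"])["dollars_sold"].sum())
print(report)
```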
2. Snowflake schema:
The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake. The major difference
between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in the normalized form to reduce redundancies. Such
a table is easy to maintain and saves storage space. However, this space savings is
negligible in comparison to the typical magnitude of the fact table. Furthermore, the
snowflake structure can reduce the effectiveness of browsing, since more joins will be
needed to execute a query. Consequently, the system performance may be adversely
impacted. Hence, although the snowflake schema reduces redundancy, it is not as
popular as the star schema in data warehouse design.
Figure 2
A snowflake schema for AllElectronics sales is given in Figure 2. Here, the sales fact
table is identical to that of the star schema in Figure 1. The main difference between
the two schemas is in the definition of dimension tables. The single dimension table
for an item in the star schema is normalized in the snowflake schema, resulting in
new item and supplier tables. For example, the item dimension table now contains
the attributes item key, item name, brand, type, and supplier key, where supplier key
is linked to the supplier dimension table, containing supplier key and supplier type
information. Similarly, the single dimension table for location in the star schema can
be normalized into two new tables: location and city. The city key in the new location
table links to the city dimension. Notice that, when desirable, further normalization
can be performed on province or state and country in the snowflake schema shown in
Figure 2.
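A minimal pandas sketch of the normalization step just described, using hypothetical columns: the supplier attributes are factored out of a denormalized item dimension into their own table, linked back by supplier_key.

```python
import pandas as pd

# Denormalized item dimension as it would appear in a star schema (sample data).
dim_item_star = pd.DataFrame({
    "item_key": [10, 11, 12],
    "item_name": ["TV", "Laptop", "Tablet"],
    "brand": ["A", "B", "B"],
    "supplier_name": ["Acme", "Globex", "Globex"],
    "supplier_type": ["wholesale", "retail", "retail"],
})

# Snowflaking: factor the supplier attributes out into their own table.
dim_supplier = (dim_item_star[["supplier_name", "supplier_type"]]
                .drop_duplicates()
                .reset_index(drop=True))
dim_supplier["supplier_key"] = dim_supplier.index + 1

# The item table now stores only a foreign key into the supplier table.
dim_item_snow = (dim_item_star
                 .merge(dim_supplier, on=["supplier_name", "supplier_type"])
                 [["item_key", "item_name", "brand", "supplier_key"]])

print(dim_supplier)
print(dim_item_snow)
```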
Figure 3
3. Fact constellation schema: Sophisticated applications may require multiple fact tables
to share dimension tables; this kind of schema can be viewed as a collection of stars and
is therefore also called a galaxy schema.
A fact constellation schema is shown in Figure 3. This schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema
(Figure 1). The shipping table has five dimensions, or keys—item key, time key, shipper
key, from location, and to location— and two measures—dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables.
For example, the dimension tables for time, item, and location are shared between
the sales and shipping fact tables.
Figure 4
OLAP (Online Analytical Processing):
• OLAP is part of the broader category of business intelligence, which also encompasses
relational databases, report writing, and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
Dimensions are the entities concerning which an organization wants to keep records.
Facts are numerical measures: the quantities by which we want to analyze relationships
between dimensions.
The data cube is used by the users of the decision support system to view their data. The
cuboid that holds the lowest level of summarization is called the base cuboid. The 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid.
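A standard counting result for this lattice of cuboids (stated here as a general fact, not derived in these notes): a cube with n dimensions and no concept hierarchies has 2^n cuboids in total, from the base cuboid up to the apex cuboid. If dimension i has L_i levels in its concept hierarchy, the total number of cuboids is

    total cuboids = (L_1 + 1)(L_2 + 1) ... (L_n + 1),

where the extra "+1" accounts for the virtual top level "all". For example, a cube on (time, item, location) with no hierarchies has 2^3 = 8 cuboids.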
OLAP operations:
Five basic analytical operations can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by
moving down in the concept hierarchy of Time dimension (Quarter → Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation
on the OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climb-
ing up in the concept hierarchy of the Location dimension (City → Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting the fol-
lowing dimensions with criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It performs a selection on a single dimension of the OLAP cube, which results
in a new sub-cube. In the cube given in the overview section, Slice is performed on the
dimension Time = “Q1”.
5. Pivot: It is also known as rotation operation as it rotates the current view to get a
new view of the representation. In the sub-cube obtained after the slice operation,
performing pivot operation gives a new view of it.
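The operations above can be imitated on a tiny pandas "cube"; the sample rows and column names below are made up for illustration. Drill-down is simply the reverse of the roll-up shown and would require data at a finer grain, such as Month.

```python
import pandas as pd

# A tiny cube at the grain (Quarter, City, Item) with one measure (sample data).
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "country": ["India", "India", "India", "India"],
    "city": ["Delhi", "Kolkata", "Delhi", "Kolkata"],
    "item": ["Car", "Bus", "Car", "Bus"],
    "units_sold": [120, 80, 150, 60],
})

# Roll up: climb the Location hierarchy (City -> Country) by aggregating.
rollup = cube.groupby(["quarter", "country", "item"])["units_sold"].sum()

# Slice: fix a single dimension value (Time = "Q1").
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = cube[cube["city"].isin(["Delhi", "Kolkata"]) &
            cube["quarter"].isin(["Q1", "Q2"]) &
            cube["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. items as rows and quarters as columns.
pivot = cube.pivot_table(index="item", columns="quarter",
                         values="units_sold", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")
```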
Types of OLAP:
1. Relational OLAP (ROLAP):
• ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables, and new tables are created to hold the
aggregated information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database
to give the appearance of traditional OLAP’s slicing and dicing functionality. In
essence, each action of slicing and dicing is equivalent to adding a "WHERE"
clause in the SQL statement (see the sketch after this list).
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to
the standard relational database and its tables in order to bring back the data
required to answer the question.
• ROLAP tools feature the ability to ask any question because the methodology
is not limited to the contents of a cube. ROLAP also has the ability to drill
down to the lowest level of detail in the database.
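To make the WHERE-clause point concrete, here is a small, hypothetical dice expressed against a relational star schema; the table and column names are illustrative assumptions, not a schema defined in these notes.

```python
# A dice on (city, quarter) in ROLAP is just extra WHERE predicates on the
# relational tables; the table and column names here are hypothetical.
dice_query = """
SELECT i.item_name, SUM(f.units_sold) AS units_sold
FROM sales_fact AS f
JOIN location_dim AS l ON f.location_key = l.location_key
JOIN time_dim     AS t ON f.time_key     = t.time_key
JOIN item_dim     AS i ON f.item_key     = i.item_key
WHERE l.city IN ('Delhi', 'Kolkata')      -- dice on Location
  AND t.quarter IN ('Q1', 'Q2')           -- dice on Time
GROUP BY i.item_name;
"""
print(dice_query)
```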
2. Multidimensional OLAP (MOLAP):
• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores data in an optimized multi-dimensional array storage, rather
than in a relational database. It therefore requires the pre-computation and
storage of information in the cube, an operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.
OLTP vs OLAP:
• OLTP (Online Transaction Processing) systems support day-to-day operations with many
short, concurrent read/write transactions on current, detailed, highly normalized data.
• OLAP systems support analysis and decision making with complex, mostly read-only
queries over historical, summarized, multidimensional data (star or snowflake schemas),
and serve far fewer users than an OLTP system.
2 Data Mining
2.1 What is data mining?
• Data mining refers to extracting or mining knowledge from large amounts of data.
• It is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract information from a data set
and transform it into an understandable structure for further use.
• Data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets.
Common data mining tasks include the following:
• Clustering – is the task of discovering groups and structures in the data that are in
some way or another "similar", without using known structures in the data.
• Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
• Regression – attempts to find a function that models the data with the least error.
The architecture of a typical data mining system includes the following components:
1. Knowledge Base: This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included.
2. Data Mining Engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization, association
and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
3. Pattern Evaluation Module: This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the search toward interesting
patterns. It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining module,
depending on the implementation of the data mining method used.
For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process so as to confine the search
to only the interesting patterns.
4. User interface: This module communicates between users and the data mining sys-
tem, allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory data
mining based on the intermediate data mining results. In addition, this component
allows the user to browse database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different forms.
• Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal val-
ues. Such nonrepresentative samples can seriously affect the model produced
later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.
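A minimal sketch of strategy (a) above, using a simple z-score rule with NumPy; the 3-standard-deviation cutoff is a common convention chosen here for illustration, not a rule from these notes.

```python
import numpy as np

def remove_outliers_zscore(values, threshold=3.0):
    """Drop points lying more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) <= threshold]

data = [10.2, 9.8, 10.5, 10.1, 55.0, 9.9]   # 55.0 looks like a recording error
print(remove_outliers_zscore(data))          # the outlier is removed
```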
Data mining draws on ideas from multiple disciplines:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering (a short binning sketch appears after the
discretization bullet below).
• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
• Generalization of the data, where low-level or primitive (raw) data are replaced
by higher-level concepts through the use of concept hierarchies. For example, cate-
gorical attributes, like streets, can be generalized to higher-level concepts, like city or
country.
• Normalization, where the attribute data are scaled to fall within a small specified
range, such as -1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction (or feature construction), where new attributes are con-
structed and added from the given set of attributes to help the mining process.
Data reduction strategies include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes
or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
• Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Data discretization is a form of
numerosity reduction that is very useful for the automatic generation of concept hi-
erarchies. Discretization and concept hierarchy generation are powerful tools for data
mining, in that they allow the mining of data at multiple levels of abstraction.
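As a small illustration of the binning mentioned under smoothing and discretization above, here is a sketch using pandas with made-up prices: equal-width binning into three intervals, followed by smoothing each value by its bin mean.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Discretization: equal-width binning into 3 intervals.
bins = pd.cut(prices, bins=3)

# Smoothing by bin means: every value is replaced by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```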
Normalization techniques:
1. Min-Max normalization: This technique scales the values of a feature to a specified
range, commonly 0 to 1. This is done by subtracting the minimum value of the feature
from each value and then dividing by the range of the feature.
2. Z-score normalization: This technique scales the values of a feature to have a mean of
0 and a standard deviation of 1. This is done by subtracting the mean of the feature
from each value, and then dividing by the standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing the values
of a feature by a power of 10.
4. Log transformation: This technique applies a logarithmic transformation to the values
of a feature, which compresses a wide range of values and reduces the impact of large
outliers.
5. Root transformation: This technique applies a square root transformation to the values
of a feature. This can be useful for data with a wide range of values, as it can help to
reduce the impact of outliers.
It is important to note that normalization should be applied only to the input features,
not the target variable, and that different normalization techniques may work better for
different types of data and models.
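A compact sketch of techniques 1-3 above with NumPy, using made-up values:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# 1. Min-max normalization to [0, 1]: (x - min) / (max - min).
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z-score normalization: (x - mean) / std.
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by 10**j, where j is the smallest integer
#    such that the largest absolute scaled value is below 1 (here j = 4).
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```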
Sampling allows a large data set to be represented by a much smaller subset of the data:
• The selection process can be based on various criteria, such as random selection, strat-
ified sampling, or cluster sampling.
• Random selection involves selecting data points randomly without any specific criteria.
• Stratified sampling involves dividing the dataset into homogeneous groups, known as
strata, and selecting samples from each stratum.
• Cluster sampling involves dividing the dataset into clusters and selecting entire clusters
as samples.
• Once the sample is selected, statistical techniques can be applied to analyze the sam-
ple and draw conclusions about the larger dataset. The accuracy of the conclusions
depends on the representativeness of the sample and the sampling method used.
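A short sketch of the three selection schemes with pandas; the data frame, its region column (used here both as a stratum and as a cluster label), and the sample sizes are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 13),
    "region": ["North", "South", "East"] * 4,   # stratum / cluster label
    "spend": [120, 80, 200, 90, 60, 150, 300, 110, 95, 70, 180, 130],
})

# Random sampling: pick rows uniformly at random.
random_sample = df.sample(n=4, random_state=0)

# Stratified sampling: take the same fraction from every stratum (region).
stratified = df.groupby("region", group_keys=False).sample(frac=0.5, random_state=0)

# Cluster sampling: randomly choose whole clusters and keep all their rows.
chosen_clusters = pd.Series(df["region"].unique()).sample(n=1, random_state=0)
cluster_sample = df[df["region"].isin(chosen_clusters)]

print(random_sample, stratified, cluster_sample, sep="\n\n")
```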