Data Warehousing & Data Mining - Study Material
Unit – 1
2 Marks
1. Why is Data Valuable?
2. What is Meta Data?
There is also data about data. It is called metadata. For example, people regularly upload
videos on YouTube. The format of the video file (whether it was a high-def file or lower
resolution) is metadata. The information about the time of uploading is metadata. The
account from which it was uploaded is also metadata.
3. What is Database?
• A database is a modelled collection of data that is accessible in many ways. A data model
can be designed to integrate the operational data of the organization.
• The data model abstracts the key entities involved in an action and their relationships. Most
databases today follow the relational data model and its variants.
4. What is ER model?
• E-R model of real world – Entities (objects)
• E.g., customers, accounts, bank branch – Relationships between entities
• E.g., Account A-101 is held by customer Johnson
• Relationship set depositor associates customers with accounts
• Widely used for database design – Database design in E-R model usually converted to
design in the relational model (coming up next) which is used for storage and processing
5. What is DDL?
Data Definition Language (DDL)
• Specification notation for defining the database schema – E.g., create table account (
account-number char (10), balance integer)
• DDL compiler generates a set of tables stored in a data dictionary
• Data dictionary contains metadata (i.e., data about data) – database schema – Data
storage and definition language
• language in which the storage structure and access methods used by the database system are
specified
• Usually, an extension of the data definition language
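As a rough illustration of the DDL example above, the sketch below runs the create-table statement through Python's built-in sqlite3 module (the hyphen in account-number is replaced by an underscore, since hyphens are not valid identifiers in most SQL dialects), and then reads the schema back from SQLite's own data dictionary:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
# DDL: define the schema of the account table
conn.execute("CREATE TABLE account (account_number CHAR(10), balance INTEGER)")
# The schema itself is metadata, stored in SQLite's data dictionary (sqlite_master)
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name, "->", sql)
conn.close()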
6. What is DML?
Data Manipulation Language (DML)
• Language for accessing and manipulating the data organized by the appropriate data
model – DML also known as query language
• Two classes of languages – Procedural – user specifies what data is required and how to get
those data – Nonprocedural – user specifies what data is required without specifying how to
get those data
• SQL is the most widely used query language
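Continuing the same illustrative account table, the sketch below shows a nonprocedural DML request: the query states what data is required (accounts with a balance above 1000), not how to retrieve it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number CHAR(10), balance INTEGER)")
# DML: insert and query the data organized by the relational model
conn.execute("INSERT INTO account VALUES ('A-101', 500)")
conn.execute("INSERT INTO account VALUES ('A-102', 1200)")
# Nonprocedural query: specify WHAT is wanted, not HOW to get it
for account_number, balance in conn.execute(
        "SELECT account_number, balance FROM account WHERE balance > 1000"):
    print(account_number, balance)
conn.close()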
7. What is a Transaction / Transaction Management?
• A transaction is a collection of operations that performs a single logical function in a
database application
• Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes) and
transaction failures.
• Concurrency-control manager controls the interaction among the concurrent transactions,
to ensure the consistency of the database.
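A minimal sketch of this idea, again with sqlite3 and a hypothetical transfer between two accounts: the two updates form one logical function, and either both are committed or the rollback leaves the database in its previous consistent state.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number CHAR(10), balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A-101', 500)")
conn.execute("INSERT INTO account VALUES ('A-102', 1200)")

try:
    # One transaction: a transfer made of two operations
    conn.execute("UPDATE account SET balance = balance - 100 WHERE account_number = 'A-101'")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE account_number = 'A-102'")
    conn.commit()        # both updates become permanent together
except sqlite3.Error:
    conn.rollback()      # on failure, neither update is applied
conn.close()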
8. What is Data Warehouse?
• A data warehouse is an organized store of data from all over the organization, specially
designed to help make management decisions.
• Data can be extracted from operational database to answer a particular set of queries. This
data, combined with other data, can be rolled up to a consistent granularity and uploaded
to a separate data store called the data warehouse.
• Therefore, the data warehouse is a simpler version of the operational database, with the
purpose of addressing reporting and decision-making needs only.
• The data in the warehouse cumulatively grows as more operational data becomes available
and is extracted and appended to the data warehouse. Unlike in the operational database, the
data values in the warehouse are not updated.
9. What is Data Mining?
• Data Mining is the art and science of discovering useful innovative patterns from data.
There is a wide variety of patterns that can be found in the data. There are many techniques,
simple or complex, that help with finding patterns.
10. When to Mine for Data?
• Data mining should be done to solve high-priority, high-value problems.
• Much effort is required to gather data, clean and organize it, mine it with many techniques,
interpret the results, and find the right insight.
• It is important that there be a large expected payoff from finding the insight.
• One should select the right data (and ignore the rest), organize it into a nice and imaginative
framework that brings relevant data together, and then apply data mining techniques to
deduce the right insight.
11. What is Data Pre-processing?
• Data pre-processing is a crucial step in the data analysis process, as it helps to ensure that
the data is clean, accurate, and ready for analysis. Without proper pre-processing, data
may be incomplete, incorrect, or difficult to work with, leading to flawed conclusions and
flawed analyses.
• Data preprocessing is crucial because it helps to ensure that the data is accurate and
consistent, and that it can be easily analysed. Without proper preprocessing, data may be
incomplete, incorrect, or difficult to work with, leading to flawed conclusions and flawed
analyses
12. What is Data Discretization?
• Data discretization refers to converting a large number of data values into a smaller number of intervals or categories, so that the evaluation and management of data become easy.
• For example, age can be grouped into Child (0-14), Young (14-30), Mature (30-55), and Old (>55), and bank accounts can be categorized by balance limits (5k-50k, 50k-500k), etc.
• Discretization is the process through which we can transform continuous variables, models,
or functions into a discrete form. We do this by creating a set of contiguous intervals
(Bins/Buckets) that go across the range of our desired variables, models, or functions
• Continuous data is Measured, while Discrete data is Counted.
• Data binning, data categorization, data bucketing, and data quantization are other names for data discretization.
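A small sketch of the age example above, binning a continuous age value into the named groups (the bin boundaries follow the ranges given in the text):

def age_group(age):
    # Contiguous bins (buckets) over the continuous variable 'age'
    if age < 14:
        return "Child"
    elif age < 30:
        return "Young"
    elif age < 55:
        return "Mature"
    else:
        return "Old"

ages = [3, 17, 29, 42, 61]
print([(a, age_group(a)) for a in ages])
# [(3, 'Child'), (17, 'Young'), (29, 'Young'), (42, 'Mature'), (61, 'Old')]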
6 Marks
1. What are the types of data?
1. Nominal Data: Data could be an unordered collection of values. For example, a retailer
sells shirts of red, blue, and green colors. There is no intrinsic ordering among these color
values. One can hardly argue that any one color is higher or lower than the other. This is
called nominal (means names) data.
2. Ordinal Data: Data could be ordered values like small, medium and large. For
example, the sizes of shirts could be extra-small, small, medium, and large. There is clarity
that medium is bigger than small, and large is bigger than medium. But the differences may
not be equal. This is called ordinal (ordered) data.
3. Discrete/Equal Interval: Another type of data has discrete numeric values defined in a
certain range, with the assumption of equal distance between the values. Customer
satisfaction score may be ranked on a 10-point scale with 1 being lowest and 10 being
highest. This requires the respondent to carefully calibrate the entire range as objectively as
possible and place his own measurement in that scale.
4. Ratio (Continuous Data): The highest level of numeric data is ratio data which can take
on any numeric value. The weights and heights of all employees would be exact numeric
values. The price of a shirt will also take any numeric value. It is called ratio (any fraction)
data.
2. What are the types of Database users?
Users are differentiated by the way they expect to interact with the system
• Application programmers – interact with system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do not fit into the
traditional data processing framework
• Naïve users – invoke one of the permanent application programs that have been written
previously
3. Explain Database VS Data Warehouse?
4. Explain Data Mining Techniques?
Decision Trees: They help classify populations into classes. It is said that 70% of all data
mining work is about classification solutions; and that 70% of all classification work uses
decision trees. Thus, decision trees are the most popular and important data mining technique. There are many popular algorithms for building decision trees. They differ in terms of their mechanisms, and each technique works well for different situations. It is possible to try multiple
decision-tree algorithms on a data set and compare the predictive accuracy of each tree.
Regression: This is a well-understood technique from the field of statistics. The goal is to
find a best fitting curve through the many data points. The best fitting curve is that which
minimizes the (error) distance between the actual data points and the values predicted by the
curve. Regression models can be projected into the future for prediction and forecasting
purposes.
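As a rough sketch of the idea, a simple linear regression line can be fitted with the closed-form least-squares formulas; the data points below are made up purely for illustration.

# Least-squares fit of y = a + b*x (illustrative data)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# The slope b minimizes the squared error between actual and predicted values
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
print(f"best fitting line: y = {a:.2f} + {b:.2f} * x")
print("forecast for x = 6:", round(a + b * 6, 2))   # projecting the curve into the future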
Artificial Neural Networks: Originating in the field of artificial intelligence and machine
learning, ANNs are multi-layer non-linear information processing models that learn from
past data and predict future values. These models predict well, leading to their popularity.
The model's parameters may not be very intuitive. Thus, neural networks are opaque, like a black box. These systems also require a large amount of past data to adequately train the system.
• Cluster analysis: This is an important data mining technique for dividing and conquering
large data sets. The data set is divided into a certain number of clusters, by discerning
similarities and dissimilarities within the data. There is no one right answer for the number of clusters in the data. The user needs to make a decision by looking at how well the number of
clusters chosen fit the data. This is most commonly used for market segmentation. Unlike
decision trees and regression, there is no one right answer for cluster analysis
• Association Rule Mining: Also called Market Basket Analysis when used in retail
industry, these techniques look for associations between data values. An analysis of items frequently found together in a market basket can help cross-sell products and also create product bundles.
5. Explain the types of Data Pre-Processing?
• Data cleaning: Data cleaning involves removing errors or missing values from the data.
This includes identifying and correcting errors, filling in missing values, and standardizing
data formats.
• Data transformation: Data transformation is the process of converting data from one
format to another, such as from a raw data format to a structured format. This is useful when
the data needs to be combined with other data sources or analysed using a specific tool or
software. (Date Format, Year XXXX or Year XX)
• Data normalization: Data normalization is the process of scaling data to a common range, such as 0 to 1. This is useful when the data contains values measured on different scales, ensuring that all values are comparable. There are several methods for normalizing data, including min-max normalization, z-score normalization, and decimal scaling (a small min-max sketch is shown after this list).
• Data aggregation: Data aggregation combines data from multiple sources or levels into a
single, summary value. This is useful when working with large, complex datasets, reducing
the amount of data to be analysed and making it easier to identify trends and patterns.
• Data reduction: Data reduction removes unnecessary or redundant data from a dataset.
This is useful when working with large datasets, reducing the amount of data to be analysed
and making it easier to identify trends and patterns. There are several methods for reducing
data, including dimensionality reduction, data compression, and feature selection.
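A minimal sketch of the min-max normalization mentioned in the data normalization point above; the salary values are purely illustrative.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale every value into the common range [new_min, new_max]
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 45000, 60000, 120000]      # measured on their own scale
print(min_max_normalize(salaries))            # now comparable within [0, 1]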
Unit – 2
2 Marks
1. What is Time Variant?
The data in a DW should grow at daily or other chosen intervals, so that the latest data can be compared with data from earlier periods.
2. List out Characteristics of Data Warehouse?
● Subject oriented
● Integrated
● Time variant
● Non-volatile
3. What are the data sources for Data Warehouse?
1. Operations data
2. Specialized applications
3. External syndicated data
4. What is the ETL (Extract-Transform-Load) process?
1. Data should be extracted from the operational (transactional) database sources, as well as
from other applications, on a regular basis.
2. The extracted data should be aligned together by key fields and integrated into a single
data set. It should be cleansed of any irregularities or missing values. It should be rolled-up
together to the same level of granularity. Desired fields, such as daily sales totals, should be
computed. The entire data should then be brought to the same format as the central table of
DW.
3. This transformed data should then be uploaded into the DW.
• This ETL process should be run at a regular frequency. Daily transaction data can be
extracted from ERPs, transformed, and uploaded to the database the same night. Thus, the
DW is up to date every morning.
• If a DW is needed for near-real-time information access, then the ETL processes would
need to be executed more frequently.
• ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodically updating the DW.
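A highly simplified sketch of one ETL cycle, assuming a hypothetical operational sales table and a warehouse table rolled up to daily granularity; a real script would also handle cleansing, key alignment, and scheduling.

import sqlite3

ops = sqlite3.connect(":memory:")        # stand-in for the operational (transactional) database
ops.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
ops.executemany("INSERT INTO sales VALUES (?, ?)",
                [("2024-01-01", 10.0), ("2024-01-01", 25.0), ("2024-01-02", 7.5)])

dw = sqlite3.connect(":memory:")         # stand-in for the data warehouse
dw.execute("CREATE TABLE daily_sales (sale_date TEXT, total REAL)")

# Extract + Transform: roll transactions up to the DW's granularity (daily totals)
rows = ops.execute("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date").fetchall()
# Load: append to the warehouse; existing warehouse rows are never updated
dw.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)
dw.commit()
print(dw.execute("SELECT * FROM daily_sales").fetchall())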
5. What is a Data Cube?
• A data cube allows data to be modelled and viewed in multiple dimensions. It is defined
by dimensions and facts.
• In general terms, dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
• For example, All Electronics may create a sales data warehouse in order to keep records of
the store’s sales with respect to the dimensions time, item, branch, and location. These
dimensions allow the store to keep track of things like monthly sales of items and the
branches and locations
6. What is Lattice of Cuboids?
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the
given dimensions. The result would form a lattice of cuboids, each showing the data at a
different level of summarization, or group by. The lattice of cuboids is then referred to as a
data cube. The cuboid that holds the lowest level of summarization is called the base cuboid
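As a small illustration, the cuboids of a cube on the four dimensions from the sales example can be enumerated as the subsets of the dimension set: the empty subset is the apex (all) cuboid and the full set is the base cuboid.

from itertools import combinations

dimensions = ["time", "item", "branch", "location"]

# One cuboid per subset of the dimensions: 2**4 = 16 cuboids in the lattice
for k in range(len(dimensions) + 1):
    for subset in combinations(dimensions, k):
        label = ", ".join(subset) if subset else "all (apex cuboid)"
        print(f"{k}-D cuboid: {label}")
# The 4-D cuboid (time, item, branch, location) is the base cuboid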
7. What is Concept Hierarchy?
• Defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values for
location include Vancouver, Toronto, New York, and Chicago.
• Each city, however, can be mapped to the province or state to which it belongs.
• For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
• The provinces and states can in turn be mapped to the country to which they belong, such as
Canada or the USA.
• These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher- level, more general concepts (i.e., countries)
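A tiny sketch of the location hierarchy described above, with the city-to-province/state and province/state-to-country mappings stored as plain dictionaries:

city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York State", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York State": "USA", "Illinois": "USA"}

def roll_up(city):
    # Map a low-level concept (city) to higher-level concepts (state, then country)
    state = city_to_state[city]
    return state, state_to_country[state]

print(roll_up("Vancouver"))   # ('British Columbia', 'Canada')
print(roll_up("Chicago"))     # ('Illinois', 'USA')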
3. Draw the Structure and explain Design Star or Design Snowflake or Fact
Constellation?
4. Explain Star VS Snowflake?
5. Explain Data Warehouse back-end tools and utilities?
Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
• Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources
• Data cleaning, which detects errors in the data and rectifies them when possible
• Data transformation, which converts data from legacy or host format to warehouse format
• Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions
• Refresh, which propagates the updates from the data sources to the warehouse
6. What is the Usage of Data Warehouse?
• There are three kinds of data warehouse applications:
• information processing, analytical processing, and data mining:
• Information processing supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing
is to construct low-cost Web-based accessing tools that are then integrated with Web
browsers.
• Analytical processing supports basic OLAP operations, including slice-and-dice,
drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized
and detailed forms. The major strength of on-line analytical processing over information
processing is the multidimensional data analysis of data warehouse data.
• Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the
mining results using visualization tools.
Unit – 3
2 Marks
1. What is Tuple?
• In mathematics, a tuple is a finite ordered list (sequence) of elements. An n-tuple is a
sequence (or ordered list) of n elements, where n is a non-negative integer. There is only
one 0-tuple, referred to as the empty tuple. An n- tuple may be defined inductively using the
construction of an ordered pair.
• Mathematicians usually write tuples by listing the elements within parentheses “()" and
separated by a comma and a space; for example, (2, 7, 4, 1, 7) denotes a 5-tuple.
2. What is Market Basket Analysis?
A data mining technique that is used to uncover purchase patterns in any retail setting is
known as Market Basket Analysis. In simple terms, market basket analysis in data mining analyses the combinations of products that are bought together.
It is a technique based on the careful study of purchases made by customers in a supermarket, and it identifies the patterns of items frequently purchased together. This analysis can help the company promote deals, offers, and sales.
3. What is Association Rules?
The information that customers who purchase computers also tend to buy antivirus software
at the same time is represented in the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
• Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
• Note: Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold
4. What is Lift?
Lift is a simple correlation measure, given by lift(A, B) = P(A ∪ B) / (P(A) P(B)). The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) P(B), in which case the lift is 1; otherwise, itemsets A and B are dependent and correlated as events (a lift greater than 1 means they are positively correlated, less than 1 negatively correlated). This definition can easily be extended to more than two itemsets.
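A small sketch computing lift for the computer / antivirus example; the individual probabilities for computer and antivirus software are assumed values for illustration, only the 2% joint figure comes from the text.

# Illustrative probabilities (assumed for the sketch)
p_a = 0.10        # P(computer)
p_b = 0.08        # P(antivirus software)
p_a_and_b = 0.02  # P(A U B): both bought together (the 2% support quoted above)

lift = p_a_and_b / (p_a * p_b)
print(f"lift = {lift:.2f}")   # 2.50 > 1, so the two itemsets are positively correlated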
6 Marks
1. Explain Association Rules?
The information that customers who purchase computers also tend to buy antivirus software
at the same time is represented in the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
• Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
• Note: Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold
• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
The set {computer, antivirus software} is a 2-itemset.
• The occurrence frequency of an itemset is the number of transactions that contain the
itemset. This is also known, simply, as the frequency, support count, or count of the itemset.
Example of Association
• Example: To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I= {milk, bread, butter, beer} and a small database containing the items (1
codes presence and 0 absence of an item in a transaction) is shown in the table.
• An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
Important concepts of Association Rule Mining:
Support of a Rule
The support Supp (X) of an itemset (X) is defined as the proportion of transactions in the
data set which contain the itemset.
• In the example database, the itemset {milk, bread, butter} has a support of (1/5) =0.2 since
it occurs in 20% of all transactions (1 out of 5 transactions).
Confidence of a Rule
• Conf (X ⇒ Y) = Supp (X ∪ Y) / Supp (X)
• Here {bread, butter} ⇒ {milk} has a confidence of (0.2/0.2) = 1, i.e.,
100% in the database, which means that for 100% of the transactions containing butter and
bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought
as well).
Strong Association Rules
• Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called Strong Association Rules.
• In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemset: By definition, these rules
must satisfy minimum support and minimum confidence.
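A minimal sketch computing support and confidence for the supermarket example; since the transaction table itself is not reproduced here, five hypothetical transactions are assumed that are consistent with the figures quoted above ({milk, bread, butter} and {bread, butter} each appear in exactly one of five transactions).

# Hypothetical transaction database (consistent with the 0.2 support figures above)
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    # Proportion of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Conf(X => Y) = Supp(X U Y) / Supp(X)
    return support(x | y) / support(x)

print(support({"milk", "bread", "butter"}))        # 0.2
print(confidence({"bread", "butter"}, {"milk"}))   # 1.0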
10 Marks
1. Explain Frequent Pattern Mining?
Can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined: We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold. We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
2. Based on the levels of abstraction involved in the rule set: Some methods for
association rule mining can find rules at differing levels of abstraction.
• For example, suppose that a set of association rules mined includes the following rules
where X is a variable representing a customer:
• buys(X, "computer") ⇒ buys(X, "HP printer") (1)
• buys(X, "laptop computer") ⇒ buys(X, "HP printer") (2)
• In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule: If the items or attributes
in an association rule reference only one dimension, then it is a single-dimensional
association rule.
4. Based on the types of values handled in the rule: If a rule involves associations between
the presence or absence of items, it is a Boolean association rule. If a rule describes
associations between quantitative items or attributes, then it is a quantitative association
rule.
5. Based on the kinds of rules to be mined: Frequent pattern analysis can generate various
kinds of rules and other interesting relationships. Association rule mining can generate a large
number of rules, many of which are redundant or do not indicate correlation relationship
among itemset.
6. Based on the kinds of patterns to be mined: Sequential pattern mining searches for
frequent subsequences in a sequence data set, where a sequence records an ordering of events
Unit – 4
2 Marks
1. What is Rule-based Classifier?
• Classify records by using a collection of “if...then...” rules
• Rule: (Condition) → y
• where
• Condition is a conjunction of attribute tests
• y is the class label
• LHS: rule antecedent or precondition
• RHS: rule consequent
• Rule set: R = (R1 ∨ R2 ∨ ... ∨ Rk)
Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
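A small sketch of applying the two example rules above to a record; the attribute names follow the examples, the rules are checked in order, and a default class is returned when no rule fires.

# Each rule: (condition over a record, class label) -- antecedent (LHS) -> consequent (RHS)
rules = [
    (lambda r: r["Blood Type"] == "Warm" and r["Lay Eggs"] == "Yes", "Birds"),
    (lambda r: r["Taxable Income"] < 50000 and r["Refund"] == "Yes", "Evade=No"),
]

def classify(record, default="Unknown"):
    for condition, label in rules:
        if condition(record):
            return label
    return default

record = {"Blood Type": "Warm", "Lay Eggs": "Yes",
          "Taxable Income": 80000, "Refund": "No"}
print(classify(record))   # Birds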
2. What are the characteristics of Rule-based Classifier?
• Mutually exclusive rules
– Classifier contains mutually exclusive rules if no two rules are triggered by the same
record.
– Every record is covered by at most one rule
• Exhaustive rules – Classifier has exhaustive coverage if it accounts for every possible combination of attribute values.
3. What is a Support Vector Machine?
• A method for the classification of both linear and nonlinear data.
• In a nutshell, an SVM is an algorithm that works as follows. It uses a nonlinear mapping
to transform the original training data into a higher dimension.
• Within this new dimension, it searches for the linear optimal separating hyperplane
(i.e., a “decision boundary” separating the tuples of one class from another).
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane.
• The SVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)
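Training an SVM is beyond a short sketch, but the idea of the separating hyperplane as a decision boundary can be illustrated with assumed weights w and bias b (in practice these are learned from the support vectors): a tuple is classified by which side of the hyperplane w·x + b = 0 it falls on.

# Hypothetical hyperplane parameters (assumed, not learned here)
w = [1.0, -1.0]
b = -0.5

def classify(x):
    # The sign of w . x + b decides which side of the separating hyperplane x lies on
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "+1" if score >= 0 else "-1"

print(classify([3.0, 1.0]))   # +1  (score =  1.5)
print(classify([1.0, 2.5]))   # -1  (score = -2.0)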
4. How does Classification work?
6 Marks
1. Explain Metrics for Evaluating Classifier Performance?
• Positive tuples (tuples of the main class of interest) and negative tuples (all other tuples).
• Given two classes, for example, the positive tuples may be buys computer = yes, while the buys computer = no tuples are the negative tuples.
• True positives (TP): These refer to the positive tuples that were correctly labelled by the
classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labelled
by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labelled as positive (e.g., tuples of class buys computer = no for which the classifier predicted buys computer = yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabelled as negative (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer = no). Let FN be the number of false negatives.
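A minimal sketch computing the usual evaluation measures from the four counts; the confusion-matrix counts below are made up for illustration.

# Illustrative counts from a confusion matrix
TP, TN, FP, FN = 90, 80, 20, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of tuples labelled correctly
precision = TP / (TP + FP)                   # how many predicted positives are truly positive
recall = TP / (TP + FN)                      # how many actual positives were found
error_rate = 1 - accuracy

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} error_rate={error_rate:.2f}")
# accuracy=0.85 precision=0.82 recall=0.90 error_rate=0.15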
2. Explain Bayesian Classification?
Let X be an employee described by department, age, and salary (say, sales department, age 35, salary 40K), and let H be the hypothesis that X belongs to the senior class. Then,
[1] Posterior probability: P(H/X) is the probability that employee X belongs to the senior class given that we know the employee's department, age, and salary.
P(X/H) is the probability that an employee X is from the sales department, aged 35, with a salary of 40K, given that X belongs to the senior class.
[2] Prior probability: P(H) is the probability that any given employee belongs to the senior class, regardless of any other information such as department, age, etc.
P(X) is the probability that an employee is 35 years old, works in the sales department, and has a salary of 40K.
Bayesian classification method:
Let D be the training data set. Each tuple in D is represented by an n-dimensional attribute vector, X = (x1, x2, x3, ..., xn), where x1, x2, x3, ..., xn are the values of attributes A1, A2, A3, ..., An respectively.
Let there be m classes, C1, C2, C3, ..., Cm.
Then, The Bayesian classifier predicts that a given tuple X belongs to the class Ci if and only
if the posterior probability P (Ci / X) is highest. i.e., P (Ci / X) > P (Cj / X) for 1 ≤ j ≤ m, j ≠i
If the class prior probabilities P (Ci) are not known, then it is commonly assumed that the
classes are equally likely, i.e., P(C1) = P(C2) = P(C3) = .... = P(Cm).
Our goal is to maximize P (Ci / X).
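A rough sketch of the naive Bayesian decision rule on the employee example, with made-up prior and conditional probabilities (class-conditional independence is assumed, as in the naive Bayes classifier); the class Ci with the largest P(X|Ci)P(Ci) is predicted.

# Hypothetical estimates taken from a training set D
priors = {"senior": 0.3, "junior": 0.7}                       # P(Ci)
likelihoods = {                                               # P(xk | Ci) for each attribute value
    "senior": {"dept=sales": 0.4, "age=35": 0.5, "salary=40K": 0.6},
    "junior": {"dept=sales": 0.5, "age=35": 0.3, "salary=40K": 0.2},
}

X = ["dept=sales", "age=35", "salary=40K"]                    # the tuple to classify

scores = {}
for c in priors:
    p = priors[c]
    for value in X:
        p *= likelihoods[c][value]                            # naive independence: product of P(xk | Ci)
    scores[c] = p                                             # proportional to P(Ci | X)

print(scores)                                                 # senior ~ 0.036, junior ~ 0.021
print("predicted class:", max(scores, key=scores.get))        # senior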