Data Warehousing & Data Mining - Study Material

Unit – 1
Note: Italic and underlined questions are important.
2 Marks
1. Why is Data Valuable?
2. What is Meta Data?
There is also data about data. It is called metadata. For example, people regularly upload
videos on YouTube. The format of the video file (whether it was a high-def file or lower
resolution) is metadata. The information about the time of uploading is metadata. The
account from which it was uploaded is also metadata.
3. What is Database?
• A database is a modelled collection of data that is accessible in many ways. A data model
can be designed to integrate the operational data of the organization.
• The data model abstracts the key entities involved in an action and their relationships. Most
databases today follow the relational data model and its variants.
4. What is ER model?
• The E-R model models the real world as entities (objects), e.g., customers, accounts, bank
branches, and relationships between entities, e.g., account A-101 is held by customer Johnson.
• A relationship set such as depositor associates customers with accounts.
• The E-R model is widely used for database design; a database design in the E-R model is
usually converted to a design in the relational model, which is used for storage and processing.
5. What is DDL?
Data Definition Language (DDL)
• Specification notation for defining the database schema, e.g.,
create table account (account-number char(10), balance integer)
• The DDL compiler generates a set of tables stored in a data dictionary.
• The data dictionary contains metadata (i.e., data about data), such as the database schema.
• The data storage and definition language is the language in which the storage structure and
access methods used by the database system are specified; it is usually an extension of the
data definition language.
6. What is DML?
Data Manipulation Language (DML)
• Language for accessing and manipulating the data organized by the appropriate data
model; DML is also known as query language.
• Two classes of languages: Procedural – the user specifies what data is required and how to
get it; Non-procedural – the user specifies what data is required without specifying how to
get it.
• SQL is the most widely used query language
7. What is a Transaction / Transaction Management?
• A transaction is a collection of operations that performs a single logical function in a
database application
• Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes) and
transaction failures.
• Concurrency-control manager controls the interaction among the concurrent transactions,
to ensure the consistency of the database.
8. What is Data Warehouse?
• A data warehouse is an organized store of data from all over the organization, specially
designed to help make management decisions.
• Data can be extracted from operational database to answer a particular set of queries. This
data, combined with other data, can be rolled up to a consistent granularity and uploaded
to a separate data store called the data warehouse.
• Therefore, the data warehouse is a simpler version of the operational database, with the
purpose of addressing reporting and decision-making needs only.
• The data in the warehouse cumulatively grows as more operational data becomes available
and is extracted and appended to the data warehouse. Unlike in the operational database, the
data values in the warehouse are not updated.
9. What is Data Mining?
• Data Mining is the art and science of discovering useful innovative patterns from data.
There is a wide variety of patterns that can be found in the data. There are many techniques,
simple or complex, that help with finding patterns.
10. When to Mine for Data?
• Data mining should be done to solve high-priority, high-value problems.
• Much effort is required to gather data, clean and organize it, mine it with many techniques,
interpret the results, and find the right insight.
• It is important that there be a large expected payoff from finding the insight.
• One should select the right data (and ignore the rest), organize it into a nice and imaginative
framework that brings relevant data together, and then apply data mining techniques to
deduce the right insight.
11. What is Data Pre-processing?
• Data pre-processing is a crucial step in the data analysis process, as it helps to ensure that
the data is clean, accurate, consistent, and ready for analysis.
• Without proper pre-processing, data may be incomplete, incorrect, or difficult to work with,
leading to flawed conclusions and flawed analyses.
12. What is Data Discretization?
• Data discretization refers to converting a huge number of data values into smaller ones
so that the evaluation and management of data become easy.
• For example, age can be grouped into age groups, i.e., Child (0-14), Young (14-30), Mature
(30-55), and Old (>55), and bank accounts can be categorized by limits (e.g., 5k-50k, 50k-500k).
• Discretization is the process through which we can transform continuous variables, models,
or functions into a discrete form. We do this by creating a set of contiguous intervals
(Bins/Buckets) that go across the range of our desired variables, models, or functions
• Continuous data is Measured, while Discrete data is Counted.
• Data binning is another name for data discretization, data categorization, data
bucketing, or data quantization.
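As an illustration of binning, here is a minimal sketch (assuming pandas, which the notes do not prescribe) that converts a hypothetical list of ages into the age groups described above:

import pandas as pd

# Hypothetical ages; the bin edges follow the Child/Young/Mature/Old grouping above.
ages = pd.Series([5, 17, 24, 42, 61, 33, 70])
bins = [0, 14, 30, 55, 120]                      # contiguous intervals (bins/buckets)
labels = ["Child", "Young", "Mature", "Old"]
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.value_counts())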
6 Marks
1. What are the types of data?
1. Nominal Data: Data could be an unordered collection of values. For example, a retailer
sells shirts of red, blue, and green colors. There is no intrinsic ordering among these color
values. One can hardly argue that any one color is higher or lower than the other. This is
called nominal (means names) data.
2. Ordinal Data: Data could be ordered values like small, medium and large. For
example, the sizes of shirts could be extra-small, small, medium, and large. There is clarity
that medium is bigger than small, and large is bigger than medium. But the differences may
not be equal. This is called ordinal (ordered) data.
3. Discrete/Equal Interval: Another type of data has discrete numeric values defined in a
certain range, with the assumption of equal distance between the values. Customer
satisfaction score may be ranked on a 10-point scale with 1 being lowest and 10 being
highest. This requires the respondent to carefully calibrate the entire range as objectively as
possible and place his own measurement in that scale.
4. Ratio (Continuous Data): The highest level of numeric data is ratio data which can take
on any numeric value. The weights and heights of all employees would be exact numeric
values. The price of a shirt will also take any numeric value. It is called ratio (any fraction)
data.
2. What are the types of Database users?
Users are differentiated by the way they expect to interact with the system
• Application programmers – interact with system through DML calls
• Sophisticated users – form requests in a database query language
• Specialized users – write specialized database applications that do not fit into the
traditional data processing framework
• Naïve users – invoke one of the permanent application programs that have been written
previously
3. Explain Database VS Data Warehouse?
4. Explain Data Mining Techniques?
Decision Trees: They help classify populations into classes. It is said that 70% of all data
mining work is about classification solutions; and that 70% of all classification work uses
decision trees. Thus, decision trees are the most popular and important data mining technique.
There are many popular algorithms to make decision trees. They differ in terms of their
mechanisms, and each technique works well for different situations. It is possible to try multiple
decision-tree algorithms on a data set and compare the predictive accuracy of each tree.
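For instance, one might try a decision-tree classifier and measure its predictive accuracy as in the minimal sketch below (scikit-learn and its bundled iris data set are assumptions; the notes do not prescribe any tool):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small sample data set and hold out part of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a decision tree and compare its predictions against the held-out labels.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))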
Regression: This is a well-understood technique from the field of statistics. The goal is to
find a best fitting curve through the many data points. The best fitting curve is that which
minimizes the (error) distance between the actual data points and the values predicted by the
curve. Regression models can be projected into the future for prediction and forecasting
purposes.
Artificial Neural Networks: Originating in the field of artificial intelligence and machine
learning, ANNs are multi-layer non-linear information processing models that learn from
past data and predict future values. These models predict well, leading to their popularity.
The model’s parameters may not be very intuitive. Thus, neural networks are opaque like a
black box. These systems also require a large amount of past data to adequately train the
system.
• Cluster analysis: This is an important data mining technique for dividing and conquering
large data sets. The data set is divided into a certain number of clusters, by discerning
similarities and dissimilarities within the data. There is no one right answer for the number of
clusters in the data. The user needs to make a decision by looking at how well the number of
clusters chosen fits the data. This is most commonly used for market segmentation. Unlike
decision trees and regression, there is no one right answer for cluster analysis.
• Association Rule Mining: Also called Market Basket Analysis when used in the retail
industry, these techniques look for associations between data values. An analysis of items
frequently found together in a market basket can help cross-sell products and also create
product bundles.
5. Explain the types of Data Pre-Processing?
• Data cleaning: Data cleaning involves removing errors or missing values from the data.
This includes identifying and correcting errors, filling in missing values, and standardizing
data formats.
• Data transformation: Data transformation is the process of converting data from one
format to another, such as from a raw data format to a structured format. This is useful when
the data needs to be combined with other data sources or analysed using a specific tool or
software. (Date Format, Year XXXX or Year XX)
• Data normalization: Data normalization is the process of scaling data to a common range,
such as 0 to 1. This is useful when the data contains values measured on different scales,
ensuring that all values are comparable. There are several methods for normalizing data,
including min-max normalization, z-score normalization, and decimal scaling.
• Data aggregation: Data aggregation combines data from multiple sources or levels into a
single, summary value. This is useful when working with large, complex datasets, reducing
the amount of data to be analysed and making it easier to identify trends and patterns.
• Data reduction: Data reduction removes unnecessary or redundant data from a dataset.
This is useful when working with large datasets, reducing the amount of data to be analysed
and making it easier to identify trends and patterns. There are several methods for reducing
data, including dimensionality reduction, data compression, and feature selection.
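To make the normalization step above concrete, here is a minimal sketch of min-max and z-score normalization (NumPy and the salary values are assumptions for illustration only):

import numpy as np

# Hypothetical salary values, measured on a much larger scale than, say, age.
salaries = np.array([30000.0, 45000.0, 52000.0, 61000.0, 98000.0])

# Min-max normalization: rescale values into the common range 0 to 1.
min_max = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# Z-score normalization: centre on the mean and scale by the standard deviation.
z_score = (salaries - salaries.mean()) / salaries.std()

print(min_max)
print(z_score)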

Unit – 2
2 Marks
1. What is Time Variant?
The data in DW should grow at daily or other chosen intervals. That allows latest
comparisons over time.
2. List out Characteristics of Data Warehouse?
● Subject oriented
● Integrated
● Time variant
● Non-volatile
3. What are the data sources for Data Warehouse?
1. Operations data
2. Specialized applications
3. External syndicated data

4. What is the Data loading process – ETL Cycle?


The heart of a useful DW is the processes to populate the DW with good quality data. This is
called the Extract-Transform-Load (ETL) cycle.

1. Data should be extracted from the operational (transactional) database sources, as well as
from other applications, on a regular basis.
2. The extracted data should be aligned together by key fields and integrated into a single
data set. It should be cleansed of any irregularities or missing values. It should be rolled-up
together to the same level of granularity. Desired fields, such as daily sales totals, should be
computed. The entire data should then be brought to the same format as the central table of
DW.
3. This transformed data should then be uploaded into the DW.
• This ETL process should be run at a regular frequency. Daily transaction data can be
extracted from ERPs, transformed, and uploaded to the database the same night. Thus, the
DW is up to date every morning.
• If a DW is needed for near-real-time information access, then the ETL processes would
need to be executed more frequently.
• ETL work is usually automated using programming scripts that are written, tested, and then
deployed for periodically updating the DW.
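A minimal sketch of one such ETL pass is given below (pandas and the file names sales.csv and warehouse_daily_sales.csv are hypothetical choices, not part of the notes):

import pandas as pd

# Extract: read the day's transactions exported from the operational system.
transactions = pd.read_csv("sales.csv", parse_dates=["sale_date"])

# Transform: cleanse missing values and roll up to one row per store per day.
transactions = transactions.dropna(subset=["store_id", "amount"])
daily_totals = (transactions
                .groupby(["store_id", transactions["sale_date"].dt.date])["amount"]
                .sum()
                .reset_index(name="daily_sales_total"))

# Load: append the transformed rows to the warehouse table (warehouse values are never updated).
daily_totals.to_csv("warehouse_daily_sales.csv", mode="a", header=False, index=False)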
5. What is a Data Cube?
• A data cube allows data to be modelled and viewed in multiple dimensions. It is defined
by dimensions and facts.
• In general terms, dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
• For example, All Electronics may create a sales data warehouse in order to keep records of
the store’s sales with respect to the dimensions time, item, branch, and location. These
dimensions allow the store to keep track of things like monthly sales of items and the
branches and locations
6. What is Lattice of Cuboids?
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the
given dimensions. The result would form a lattice of cuboids, each showing the data at a
different level of summarization, or group by. The lattice of cuboids is then referred to as a
data cube. The cuboid that holds the lowest level of summarization is called the base cuboid
7. What is Concept Hierarchy?
• Defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts. Consider a concept hierarchy for the dimension location. City values for
location include Vancouver, Toronto, New York, and Chicago.
• Each city, however, can be mapped to the province or state to which it belongs.
• For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
• The provinces and states can in turn be mapped to the country to which they belong, such as
Canada or the USA.
• These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher- level, more general concepts (i.e., countries)
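A tiny sketch of this location hierarchy, expressed as plain Python dictionaries (an illustration only, not a prescribed structure; Toronto/Ontario is an assumed addition):

city_to_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}

def roll_up(city):
    """Map a low-level concept (city) to a higher-level concept (country)."""
    return state_to_country[city_to_state[city]]

print(roll_up("Vancouver"))   # Canada
print(roll_up("Chicago"))     # USA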

8. What is Roll-up and Drill-down model?


• Roll up (drill-up): summarizes data. It is performed by climbing up a hierarchy of a
dimension or by dimension reduction (reducing the cube by one or more dimensions).
• The roll-up operation in the example, based on location (roll up on location), is equivalent to
grouping the data by country.
9. State the formula for Cuboids in a Cube?
For an n-dimensional data cube, the total number of cuboids that can be generated (including
the cuboids generated by climbing up the hierarchies along each dimension) is

Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

where Li is the number of levels associated with dimension i (one is added to Li to include the
virtual top level, all).
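As a worked example (the hierarchy depths are assumed here, since the notes give none): for a cube with three dimensions time, item and location having L1 = 4, L2 = 3 and L3 = 4 levels respectively, the total number of cuboids is (4 + 1) × (3 + 1) × (4 + 1) = 100.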

10. What is Materialization of data cube and its types?


• It is unrealistic to precompute and materialize all of the cuboids that can possibly be
generated for a data cube (or from a base cuboid).
• If there are many cuboids, and these cuboids are large in size, a more reasonable option is
partial materialization, that is, to materialize only some of the possible cuboids that can be
generated
• Materialize every (cuboid) (full materialization), none (no materialization), or some (partial
materialization)
• Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.
New warehouse relations can be defined using SQL expressions.
6 Marks
1. Draw the High level – Data warehouse architecture diagram?

2. Enumerate OLTP VS OLAP?


OLTP stands for online transaction processing; OLAP stands for online analytical processing.

3. Draw the Structure and explain Design Star or Design Snowflake or Fact
Constellation?
4. Explain Star VS Snowflake?
5. Explain Data Warehouse back-end tools and utilities?
Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
• Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources
• Data cleaning, which detects errors in the data and rectifies them when possible
• Data transformation, which converts data from legacy or host format to warehouse format
• Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions
• Refresh, which propagates the updates from the data sources to the warehouse
6. What is the Usage of Data Warehouse?
• There are three kinds of data warehouse applications: information processing, analytical
processing, and data mining.
• Information processing supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing
is to construct low-cost Web-based accessing tools that are then integrated with Web
browsers.
• Analytical processing supports basic OLAP operations, including slice-and-dice,
drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized
and detailed forms. The major strength of on-line analytical processing over information
processing is the multidimensional data analysis of data warehouse data.
• Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the
mining results using visualization tools.
Unit – 3
2 Marks
1. What is Tuple?
• In mathematics, a tuple is a finite ordered list (sequence) of elements. An n-tuple is a
sequence (or ordered list) of n elements, where n is a non-negative integer. There is only
one 0-tuple, referred to as the empty tuple. An n-tuple may be defined inductively using the
construction of an ordered pair.
• Mathematicians usually write tuples by listing the elements within parentheses “()" and
separated by a comma and a space; for example, (2, 7, 4, 1,7) denotes a 5-tuple.
2. What is Market Basket Analysis?
A data mining technique that is used to uncover purchase patterns in any retail setting is
known as Market Basket Analysis. In simple terms, market basket analysis in data
mining analyses the combinations of products that are bought together.
It is a technique that involves the careful study of purchases made by a customer in a
supermarket. The concept identifies the pattern of items frequently purchased together by
customers. This analysis can help the company promote deals, offers, and sales.
3. What is Association Rules?
The information that customers who purchase computers also tend to buy antivirus software
at the same time is represented in the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
• Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
• Note: Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold
4. What is Lift?
Lift is a simple correlation measure. The occurrence of itemset A is independent of the
occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent
and correlated as events. This definition can easily be extended to more than two itemsets.
The lift between the occurrences of A and B is given by
lift(A, B) = P(A ∪ B) / (P(A)P(B))
If the lift value is less than 1, the occurrence of A is negatively correlated with the occurrence
of B; if it is greater than 1, they are positively correlated; if it equals 1, they are independent.

6 marks
1. Explain Association Rules?
The information that customers who purchase computers also tend to buy antivirus software
at the same time is represented in the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
• Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
• A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
• Note: Association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold
• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
The set {computer, antivirus software} is a 2-itemset.
• The occurrence frequency of an itemset is the number of transactions that contain the
itemset. This is also known, simply, as the frequency, support count, or count of the itemset.
Example of Association
• Example: To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I= {milk, bread, butter, beer} and a small database containing the items (1
codes presence and 0 absence of an item in a transaction) is shown in the table.

• An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if
butter and bread are bought, customers also buy milk.
Important concepts of Association Rule Mining:
Support of a Rule
The support Supp (X) of an itemset (X) is defined as the proportion of transactions in the
data set which contain the itemset.
• In the example database, the itemset {milk, bread, butter} has a support of (1/5) =0.2 since
it occurs in 20% of all transactions (1 out of 5 transactions).
Confidence of a Rule
• Conf (X ⇒ Y) = Supp (X ∪ Y) / Supp (X)
• Here {bread, butter} ⇒ {milk} has a confidence of (0.2/0.2) = 1, i.e.,
100% in the database, which means that for 100% of the transactions containing butter and
bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought
as well).
Strong Association Rules
• Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called Strong Association Rules.
• In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules
must satisfy minimum support and minimum confidence.
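The figures quoted above can be recomputed with the minimal sketch below. The five transactions are an assumed example database over I = {milk, bread, butter, beer} (the original table is not reproduced in these notes); they are chosen so that Supp({milk, bread, butter}) = 0.2 and Conf({bread, butter} ⇒ {milk}) = 1, matching the text.

transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, db):
    """Proportion of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Supp(X U Y) / Supp(X) for the rule X => Y."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"milk", "bread", "butter"}, transactions))        # 0.2
print(confidence({"bread", "butter"}, {"milk"}, transactions))   # 1.0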
10 marks
1. Explain Frequent Pattern Mining?
Can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined: We can mine the complete set of
frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a
minimum support threshold. We can also mine constrained frequent itemsets, approximate
frequent itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
2. Based on the levels of abstraction involved in the rule set: Some methods for
association rule mining can find rules at differing levels of abstraction.
• For example, suppose that a set of association rules mined includes the following rules
where X is a variable representing a customer:
• buys(X, "computer") => buys(X, "HP printer") (1)
• buys(X, "laptop computer") => buys(X, "HP printer") (2)
• In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule: If the items or attributes
in an association rule reference only one dimension, then it is a single-dimensional
association rule.
4. Based on the types of values handled in the rule: If a rule involves associations between
the presence or absence of items, it is a Boolean association rule. If a rule describes
associations between quantitative items or attributes, then it is a quantitative association
rule.
5. Based on the kinds of rules to be mined: Frequent pattern analysis can generate various
kinds of rules and other interesting relationships. Association rule mining can generate a large
number of rules, many of which are redundant or do not indicate a correlation relationship
among itemsets.
6. Based on the kinds of patterns to be mined: Sequential pattern mining searches for
frequent subsequences in a sequence data set, where a sequence records an ordering of events
Unit – 4
2 Marks
1. What is Rule-based Classifier?
• Classify records by using a collection of “if...then...” rules
• Rule: (Condition) → y
• where
• Condition is a conjunction of attribute tests
• y is the class label
• LHS: rule antecedent or precondition
• RHS: rule consequent
• Rule set: R = (R1 ∨ R2 ∨ ... ∨ Rk)
Examples of classification rules:
• (Blood Type=Warm) ^ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ^ (Refund=Yes) → Evade=No
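A minimal sketch of how such a rule set could be applied to records (the Mammals rule and the default label are illustrative additions, not taken from the notes):

# Each rule pairs a condition (antecedent) with a class label (consequent).
rules = [
    (lambda r: r["blood_type"] == "warm" and r["lays_eggs"], "Birds"),
    (lambda r: r["blood_type"] == "warm" and not r["lays_eggs"], "Mammals"),
]

def classify(record, default="Unknown"):
    """Return the consequent of the first rule triggered by the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default   # no rule covers the record

print(classify({"blood_type": "warm", "lays_eggs": True}))    # Birds
print(classify({"blood_type": "cold", "lays_eggs": False}))   # Unknown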
2. What are the characteristics of Rule-based Classifier?
• Mutually exclusive rules
– Classifier contains mutually exclusive rules if no two rules are triggered by the same
record.
– Every record is covered by at most one rule
• Exhaustive rules – Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values, i.e., every record is covered by at least one rule.
3. What is Support Vector Machine?
• A method for the classification of both linear and nonlinear data.
• In a nutshell, an SVM is an algorithm that works as follows. It uses a nonlinear mapping
to transform the original training data into a higher dimension.
• Within this new dimension, it searches for the linear optimal separating hyperplane
(i.e., a “decision boundary” separating the tuples of one class from another).
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane.
• The SVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)
4. How does Classification work?
6 marks
1. Explain Metrics for Evaluation Classifier Performance?
• Positive tuples (tuples of the main class of interest) and negative tuples (all other tuples).
• Given two classes, for example, the positive tuples may be buys computer = yes while
the negative tuples are buys computer = no.
• True positives (TP): These refer to the positive tuples that were correctly labelled by the
classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labelled
by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labelled as positive
(e.g., tuples of class buy computer = no for which the classifier predicted buys computer =
yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabelled as negative (e.g.,
tuples of class buys computer = yes for which the classifier predicted buys computer = no).
Let FN be the number of false negatives.
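A minimal sketch (the counts are hypothetical, and the measures shown are the standard ones rather than a list taken from the notes) of how these four quantities combine into common evaluation measures:

def evaluation_measures(tp, tn, fp, fn):
    """Compute common measures from the four counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # how many predicted positives are truly positive
    recall = tp / (tp + fn)         # how many actual positives were found (sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts for a buys computer classifier.
print(evaluation_measures(tp=90, tn=80, fp=20, fn=10))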

2. What are the Evaluation Measures?


10 marks
1. Confusion Matrix – Predicted Class Sum
Unit – 5
2 marks
1. What is Cluster Analysis?
• The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
• A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
• A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
2. List out major Clustering Methods?
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Methods
3. What is Outlier Analysis?
• There exist data objects that do not comply with the general behaviour or model of the
data. Such data objects, which are grossly different from or inconsistent with the remaining
set of data, are called outliers.
• Many data mining algorithms try to minimize the influence of outliers or eliminate them
altogether. This, however, could result in the loss of important hidden information because one
person’s noise could be another person’s signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis are an interesting data mining
task, referred to as outlier mining.
• It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for identifying
the spending behaviour of customers with extremely low or extremely high incomes, or in
medical analysis for finding unusual responses to various medical treatments.
• Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier mining problem
can be viewed as two subproblems: Define what data can be considered as inconsistent in
a given data set, and find an efficient method to mine the outliers so defined
4. List out Types of Outlier Detection?
• Statistical Distribution-Based Outlier Detection
• Distance-Based Outlier Detection
• Density-Based Local Outlier Detection
• Deviation-Based Outlier Detection
6 marks
1. What are the Applications of Cluster Analysis?
• Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
• In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
• In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
• Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
2. Explain major Clustering Methods?
Partitioning Method
• A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
– Each group must contain at least one object, and
– Each object must belong to exactly one group.
• A partitioning method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group to
another. The general criterion of a good partitioning is that objects in the same cluster are
close or related to each other, whereas objects of different clusters are far apart or very
different
Hierarchical Method
• A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
• A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.
– The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to one
another, until all of the groups are merged into one or until a termination condition holds.
– The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in a cluster of its own, or until a termination condition holds.
Density Based Methods
• Clustering methods have been developed based on the notion of density. Their general idea
is to continue growing the given cluster as long as the density in the neighbourhood
exceeds some threshold; that is, for each data point within a given cluster, the
neighbourhood of a given radius has to contain at least a minimum number of points. Such
a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
• DBSCAN and its extension, OPTICS, are typical density- based methods that grow clusters
according to a density- based connectivity analysis. DENCLUE is a method that clusters
objects based on the analysis of the value distributions of density functions.
Grid-Based Methods
• Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
• All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast-processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.
• STING is a typical example of a grid-based method. WaveCluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.
Model-based methods
• Model-based methods hypothesize a model for each of the clusters and find the best fit of
the data to the given model.
• A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
• It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking “noise” or outliers into account and thus yielding robust clustering
methods
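As an illustration of the partitioning approach described above, a minimal k-means sketch (scikit-learn, NumPy, and the synthetic two-cluster data are assumptions; the notes do not name a tool):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D objects forming two loose groups (e.g., income vs. spending score).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=(20, 30), scale=3, size=(50, 2)),
                  rng.normal(loc=(60, 70), scale=3, size=(50, 2))])

# Partition the data into k = 2 clusters by iterative relocation of centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster membership of the first 10 objects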
3. Explain Typical Requirement of Clustering in data mining?
• Scalability: Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions of objects.
Clustering on a sample of a given large data set may lead to biased results. Highly
scalable clustering algorithms are needed.
• Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types
of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data
types.
• Discovery of clusters with arbitrary shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures. Algorithms based on such
distance measures tend to find spherical clusters with similar size and density. However, a
cluster could be of any shape. It is important to develop algorithms that can detect clusters of
arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Many
clustering algorithms require users to input certain parameters in cluster analysis (such as
the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often difficult to determine, especially for data sets containing
high-dimensional objects
• Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may
lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering
algorithms cannot incorporate newly inserted data (i.e., database updates) into existing
clustering structures and, instead, must determine a new clustering from scratch. Some
clustering algorithms are sensitive to the order of input data.
• High dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Many clustering algorithms are good at handling low-dimensional data, involving
only two to three dimensions
• Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the locations for a
given number of new automatic banking machines (ATMs) in a city. To decide upon this, you
may cluster households while considering constraints such as the city’s rivers and highway
networks, and the type and number of customers per cluster. A challenging task is to find
groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications. It is important to study how an application goal may
influence the selection of clustering features and methods.
10 marks
1. Briefly explain Bayes’ Theorem?
Bayes Classification
• Bayesian classifier can predict the probability that a given tuple belongs to a particular
class.
• It is based on Bayes’ theorem.
• Bayesian classifiers have exhibited high accuracy and speed, comparable to decision tree
classifiers.
Bayes’ theorem:
Let X be a data tuple which is described by measurements made on a set of n attributes.
[Note: In Bayesian term X is also called evidence]
Let H be some hypothesis.
● We assume the hypothesis H: the data tuple X belongs to a specified class C.
● Then, for classification problems, we have to find P(H/X) i.e., the probability of
hypothesis H when X is given.
● Example: Suppose we have an employee database consisting of several attributes like
{department, age, salary, etc.}.
● And status is the class label attribute which is either senior or junior.
● Let X is the detail of an employee. i.e., X= (department= sales, age= 35, salary=40K)
● Let the Hypothesis H: employee belongs to senior class.

Then, by Bayes’ theorem, P(H/X) = [P(X/H) × P(H)] / P(X), where:
[1] Posterior probability: P(H/X) is the probability that the employee X will belong
to the senior class given that we know the employee’s department, age and salary.
P(X/H) is the probability that an employee, X, is of the sales department, with age 35 and a
salary of 40K, given that the employee belongs to the senior class.
[2] Prior probability: P(H) is the probability that any given employee will belong to the
senior class regardless of any information like department, age, etc.
P(X) is the probability of an employee whose age is 35 years, department is sales, and
salary is 40K.
Bayesian classification method:
Let D be the training data set. Each tuple in D is represented by an n-dimensional attribute
vector, X = (x1, x2, x3, ..., xn), where x1, x2, x3, ..., xn are the values of attributes A1, A2,
A3, ..., An respectively.
Let there be m classes, C1, C2, C3, ..., Cm.
Then, The Bayesian classifier predicts that a given tuple X belongs to the class Ci if and only
if the posterior probability P (Ci / X) is highest. i.e., P (Ci / X) > P (Cj / X) for 1 ≤ j ≤ m, j ≠i

Our goal is to maximize P (Ci / X). By Bayes’ theorem, P(Ci/X) = [P(X/Ci) × P(Ci)] / P(X).
As P(X) is constant for all classes, only P(X/Ci) × P(Ci) needs to be maximized.
If the class prior probabilities P (Ci) are not known, then it is commonly assumed that the
classes are equally likely, i.e., P(C1) = P(C2) = P(C3) = .... = P(Cm), and we therefore
maximize P(X/Ci).
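A minimal sketch of this classification rule on the employee example is given below. The tiny training set and the naive conditional-independence assumption (multiplying per-attribute probabilities) are illustrative additions, not taken from the notes.

from collections import Counter, defaultdict

# Tiny illustrative training set: (department, age_band, salary_band) -> status.
train = [
    (("sales", "31-35", "40-45K"), "senior"),
    (("sales", "26-30", "25-30K"), "junior"),
    (("systems", "31-35", "45-50K"), "senior"),
    (("sales", "31-35", "40-45K"), "junior"),
    (("systems", "26-30", "45-50K"), "junior"),
]

class_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)      # (class, attribute position) -> value counts
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(label, i)][value] += 1

def posterior_score(features, label):
    """P(Ci) multiplied by the product of P(xk/Ci), assuming attribute independence."""
    score = class_counts[label] / len(train)
    for i, value in enumerate(features):
        score *= feature_counts[(label, i)][value] / class_counts[label]
    return score

# Classify X = (department=sales, age=31-35, salary=40-45K) as the class with the highest score.
x = ("sales", "31-35", "40-45K")
print(max(class_counts, key=lambda c: posterior_score(x, c)))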
