Data Mining PDF
Data Warehouse vs Data Mart
• A Data Warehouse is a large repository of data collected from different sources, whereas a Data Mart is only a subtype of a data warehouse.
• A Data Warehouse is focused on all departments in an organization, whereas a Data Mart focuses on a specific group.
• The Data Warehouse designing process is complicated, whereas a Data Mart is easy to design.
• A Data Warehouse takes a long time for data handling, whereas a Data Mart takes a short time for data handling.
• Data Warehouse size ranges from 100 GB to 1 TB+, whereas Data Mart size is less than 100 GB.
• The Data Warehouse implementation process takes 1 month to 1 year, whereas a Data Mart takes a few months to complete the implementation process.

Online Transaction Processing (OLTP)
• The full form of OLTP is Online Transaction Processing.
• OLTP is an operational system that supports transaction-oriented applications in a 3-tier architecture.
• It administers the day-to-day transactions of an organization.
• OLTP is basically focused on query processing, maintaining data integrity in multi-access environments, as well as effectiveness measured by the total number of transactions per second.

Characteristics of OLTP
• OLTP uses transactions that include small amounts of data.
• Indexed data in the database can be accessed easily.
• OLTP has a large number of users.
• It has fast response times.
• Databases are directly accessible to end-users.
• OLTP uses a fully normalized schema for database consistency.

An OLTP system is an online database changing system. Therefore, it supports database queries such as inserting, updating, and deleting information from the database. Consider a point of sale system of a supermarket; the following are sample queries that this system can process:
• Retrieving the description of a particular product.
• Filtering all products related to a supplier.
• Searching the record of a customer.
• Listing products having a price less than the expected amount.

Online Analytical Processing Server (OLAP)
• Online Analytical Processing Server (OLAP) is based on the multidimensional data model.
• It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations in multidimensional data. Here is the list of OLAP operations:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.

Multidimensional Data Model
• The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
• Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries are complex.
• The multidimensional data model is designed to solve complex queries in real time.
• The multidimensional data model is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes.
• The simplicity of the model is inherent because it defines objects that represent real-world business entities.
• Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of their business are organized into levels and hierarchies.
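As a small illustration of the OLAP operations described above, the sketch below runs roll-up, drill-down, slice, dice, and pivot on a tiny cube held in a pandas DataFrame. This is only a sketch, not a real OLAP server; the dimensions (region, quarter, product) and the sales measure are assumptions made for the example.

import pandas as pd

# A toy "cube" with three dimensions (region, quarter, product) and one measure (sales).
cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q2",   "Q1",   "Q1"],
    "product": ["TV",   "TV",   "TV",   "Phone", "Phone", "Phone"],
    "sales":   [100,    120,    90,     200,    150,    80],
})

# Roll-up: aggregate the measure by dropping a dimension (here 'product').
rollup = cube.groupby(["region", "quarter"])["sales"].sum()

# Drill-down: go back to the finer level by adding the 'product' dimension again.
drilldown = cube.groupby(["region", "quarter", "product"])["sales"].sum()

# Slice: fix one dimension to a single value (quarter = 'Q1') to get a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[(cube["quarter"] == "Q1") & (cube["region"] == "East")]

# Pivot: rotate the axes to get an alternative presentation of the same data.
pivot = cube.pivot_table(index="region", columns="quarter", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")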
Data Models
1. Data Cube
2. Star
3. Snowflake
4. Fact Constellation

1. Data Cube
• When data is grouped or combined in multidimensional matrices, these are called Data Cubes.
• A data cube is created from a subset of attributes in the database.
• Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest.
• Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
• The data cube method is an interesting technique with many applications.
• Data cubes could be sparse in many cases because not every cell in each dimension may have corresponding data in the database.

2. Star
• A Star Schema in a data warehouse is one in which the center of the star can have one fact table and a number of associated dimension tables.
• It is known as a star schema because its structure resembles a star.
• The Star Schema data model is the simplest type of Data Warehouse schema.
• It is also known as Star Join Schema and is optimized for querying large data sets.
• In the following Star Schema example, the fact table is at the center and contains keys to every dimension table, like Dealer_ID, Model_ID, Date_ID, Product_ID, Branch_ID, and other attributes like units sold and revenue.

3. Snowflake
• A Snowflake Schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape.
• A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalized, which splits data into additional tables.
• In the following Snowflake Schema example, Country is further normalized into an individual table.

4. Fact Constellation Schema
• A Fact Constellation means two or more fact tables sharing one or more dimensions. It is also called a Galaxy schema.
• A Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation Schema can be designed with a collection of de-normalized Fact, Shared, and Conformed Dimension tables.
• A Fact Constellation Schema is a sophisticated database design from which it is difficult to summarize information. A Fact Constellation Schema can be implemented between aggregate Fact tables or by decomposing a complex Fact table into independent simplex Fact tables.

1. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Steps Involved in the KDD Process:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

2. Data Integration
Data integration is defined as heterogeneous data from multiple sources combined in a common source (Data Warehouse). Data integration is carried out using Data Migration tools, Data Synchronization tools, and the ETL (Extract, Transform, Load) process.

5. Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy. In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends upon the way in which the operation proceeds. It works on the top-down splitting strategy and the bottom-up merging strategy.
• Binning:
Binning methods smooth a sorted data value by consulting its "neighbourhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing. Figure 2.11 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
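A minimal sketch of the equal-frequency binning and smoothing described above. The first bin reuses the values 4, 8, and 15 from the text; the remaining price values are assumed for illustration.

# Equal-frequency binning with smoothing by bin means and by bin boundaries (sketch).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sorted price data; only 4, 8, 15 come from the text

bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed_by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary.
smoothed_by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)                    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed_by_means)       # first bin becomes [9.0, 9.0, 9.0]
print(smoothed_by_boundaries)  # first bin becomes [4, 4, 15]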
Data Integration
Data integration in data mining refers to the process of combining data from multiple sources into a single, unified view. This can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that may exist between the different sources. The goal of data integration is to make the data more useful and meaningful for the purposes of analysis and decision making. Techniques used in data integration include data warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files. The data integration approaches are formally defined as a triple <G, S, M>, where G stands for the global schema, S stands for the heterogeneous source schemas, and M stands for the mapping between the queries of the source and global schemas. Data integration is the process of combining data from multiple sources into a cohesive and consistent view. This process involves identifying and accessing the different data sources, mapping the data to a common format, and reconciling any inconsistencies or discrepancies between the sources. The goal of data integration is to make it easier to access and analyze data that is spread across multiple systems or platforms, in order to gain a more complete and accurate understanding of the data. Data integration can be challenging due to the variety of data formats, structures, and semantics used by different data sources. Different data sources may use different data types, naming conventions, and schemas, making it difficult to combine the data into a single view. Data integration typically involves a combination of manual and automated processes, including data profiling, data mapping, data transformation, and data reconciliation.

Six stages of data processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as "pre-processing", is the stage at which raw data is cleaned up and organized for the following stage of data processing.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift), and translated into a language that it can understand.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms.
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use.
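A toy end-to-end sketch that maps the six stages onto a few pandas operations; the data values, the output file name, and the simple aggregation standing in for the processing step are all assumptions made for illustration.

import pandas as pd

# 1. Data collection: pull raw data from an available source (in-memory here for the sketch).
raw = pd.DataFrame({"customer": ["a", "b", "b", None], "amount": [10.0, None, 25.0, 5.0]})

# 2. Data preparation: clean and organize the raw data (drop incomplete rows).
prepared = raw.dropna()

# 3. Data input: load the clean data into its destination (a plain table stands in here).
table = prepared.copy()

# 4. Processing: a simple aggregation stands in for the real algorithmic step.
result = table.groupby("customer")["amount"].sum()

# 5. Data output/interpretation: present the result in a readable form.
print(result.to_string())

# 6. Data storage: persist the processed data for later use.
result.to_csv("processed_amounts.csv")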
What is Market Basket Analysis?
Market Basket Analysis is one of the fundamental techniques used by large retailers to uncover the association between items. In other words, it allows retailers to identify the relationship between items which are more frequently bought together. Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional and relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. The discovery of interesting patterns among huge amounts of business transaction records can help in many decision-making processes such as catalogue making, cross marketing, and customer shopping behaviour analysis. For instance, if customers are buying milk, how likely are they to also buy bread on the same shopping list? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.

Support
Suppose we have a dataset of 1000 transactions, and the itemset {milk, bread} appears in 100 of those transactions. The support of the itemset {milk, bread} would be calculated as follows:
Support({milk, bread}) = Number of transactions containing {milk, bread} / Total number of transactions = 100 / 1000 = 10%

Association Rule Mining
Association rules can be thought of as an IF-THEN relationship. Suppose item A is being bought by the customer; then the chance of item B being picked by the customer too under the same Transaction ID is found out.
Antecedent (IF): This is an item/group of items that are typically found in the itemsets or datasets.
Consequent (THEN): This comes along as an item with an antecedent/group of antecedents. But here comes a constraint. Suppose you made a rule about an item; you still have around 9999 items to consider for rule-making. This is where the Apriori Algorithm comes into play. So before we understand the Apriori Algorithm, let's understand the math behind it. There are 3 ways to measure association:
• Support
• Confidence
• Lift
Support: It gives the fraction of transactions which contain items A and B. Basically, support tells us about the frequently bought items or the combinations of items bought frequently. So with this, we can filter out the items that have a low frequency.
Confidence: It tells us how often the items A and B occur together, given the number of times A occurs.
Lift: Lift indicates the strength of a rule over the random occurrence of A and B. It basically tells us the strength of any rule.

Advantages
1. Efficient discovery of patterns: Association rule mining algorithms are efficient at discovering patterns in large datasets, making them useful for tasks such as market basket analysis and recommendation systems.
2. Easy to interpret: The results of association rule mining are easy to understand and interpret, making it possible to explain the patterns found in the data.
3. Can be used in a wide range of applications: Association rule mining can be used in a wide range of applications such as retail, finance, and healthcare, which can help to improve decision-making and increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with many items and transactions, which makes them suitable for big-data scenarios.
Disadvantages
1. Large number of generated rules: Association rule mining can generate a large number of rules, many of which may be irrelevant or uninteresting, which can make it difficult to identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its ability to detect complex relationships between items, and it only considers the co-occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions increases, the number of candidate itemsets also increases, which can make the algorithm computationally expensive.
4. Need to define the minimum support and confidence thresholds: The minimum support and confidence thresholds must be set before the association rule mining process.
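A small sketch of how the support, confidence, and lift measures defined above can be computed for the rule {milk} -> {bread}; the five transactions are made up for the example.

# Support, confidence and lift for the rule {milk} -> {bread} (illustrative sketch).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]
n = len(transactions)

count_milk = sum(1 for t in transactions if "milk" in t)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_both / n                 # fraction of transactions containing both items
confidence = count_both / count_milk     # how often bread appears given milk appears
lift = confidence / (count_bread / n)    # confidence relative to bread's baseline frequency

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.60, confidence=0.75, lift=0.94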
Apriori Algorithm
Apriori says that the probability that item I is not frequent is if:
• P(I) < minimum support threshold, then I is not frequent.
• P(I+A) < minimum support threshold, then I+A is not frequent, where A also belongs to the itemset.
• If an itemset has a value less than minimum support, then all of its supersets will also fall below minimum support, and thus can be ignored. This property is called the Antimonotone property.
The steps followed in the Apriori Algorithm of data mining are:
1. Join Step: This step generates (K+1)-itemsets from K-itemsets by joining each item with itself.
2. Prune Step: This step scans the count of each item in the database. If a candidate item does not meet minimum support, then it is regarded as infrequent and thus it is removed. This step is performed to reduce the size of the candidate itemsets.
Advantages
1. Easy to understand algorithm.
2. Join and Prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum support is kept very low.
2. The entire database needs to be scanned.
Improve Apriori Efficiency
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in iterations. The transactions which do not contain frequent items are marked or removed.
3. Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S. It may be possible to lose a global frequent itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
Applications of the Apriori Algorithm
Some fields where Apriori is used:
1. In the Education field: extracting association rules in data mining of admitted students through characteristics and specialties.
2. In the Medical field: for example, analysis of the patients' database.
3. In Forestry: analysis of the probability and intensity of forest fire with the forest fire data.
4. Apriori is used by many companies, like Amazon in the Recommender System and Google for the auto-complete feature.

Classification and Prediction
The two forms of data analysis that can be used for extracting models describing important classes or to predict future data trends are:
➢ Classification
➢ Prediction
A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyse blood cancer data in order to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A," "treatment B," or "treatment C" for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatments.
Classification predicts categorical class labels. It classifies data (constructs a model) based on a training set and the values in a classifying attribute, and uses it in classifying new data. E.g., we can build a classification model to categorise a bank loan application as either safe or risky.
Prediction models a continuous-valued function, i.e., it predicts unknown or missing data. E.g., a prediction model can be used to predict the expenditure in rupees of potential customers on computer equipment given their income and occupation. Typical applications for classification and prediction include credit approval, target marketing, medical diagnosis, etc.

Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labelled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. A typical decision tree is shown in the Figure. It represents the concept buy_computer, that is, it predicts whether a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce nonbinary trees.

Bayesian Classification
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. A Bayesian classifier can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. The naive Bayesian classifier is a simple Bayesian classifier which has comparable performance with decision tree and selected neural network classifiers. The naïve Bayesian classifier is incremental in nature. Each training example can incrementally increase or decrease the probability that a hypothesis is correct. This means prior knowledge can be combined with observed data.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes, a nonconformist who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of $40,000. Bayes' theorem is
P(H|X) = P(X|H) P(H) / P(X).
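A worked numeric sketch of Bayes' theorem as stated above, where H is the hypothesis that the customer buys a computer and X is the 35-year-old customer with an income of $40,000; the prior, likelihood, and evidence values are assumed purely for illustration.

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)   (all numbers below are assumed)
p_h = 0.6          # prior P(H): fraction of customers who buy a computer
p_x_given_h = 0.2  # likelihood P(X|H): fraction of buyers who are 35 with a $40,000 income
p_x = 0.15         # evidence P(X): fraction of all customers with that profile

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(p_h_given_x)                      # 0.8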
➢ Posterior probability [P(H|X)]: A posterior probability, in Bayesian statistics, is the revised or updated probability of an event occurring after taking into consideration new information. The posterior probability is calculated by updating the prior probability using Bayes' theorem.
➢ Prior probability [P(H)]: Prior probability, in Bayesian statistics, is the probability of an event before new data is collected. This is the best rational assessment of the probability of an outcome based on the current knowledge before an experiment is performed.

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. Rules are easier to understand than large trees. One rule is created for each path from the root to a leaf. The leaf node holds the class prediction, forming the rule consequent. Rules are mutually exclusive. The following rules can be extracted from the buys_computer decision tree.

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = middle-aged THEN buys_computer = yes
IF age = senior AND credit_rating = excellent THEN buys_computer = yes
IF age = senior AND credit_rating = fair THEN buys_computer = no
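These IF-THEN rules translate directly into code. Below is a minimal sketch of such a rule-based classifier; the tuple format (a dictionary of attribute values) is an assumption made for the example.

def buys_computer(tuple_):
    """Rule-based classifier extracted from the buys_computer decision tree above."""
    age = tuple_["age"]
    if age == "young":
        # Young customers: the decision depends on whether they are students.
        return "yes" if tuple_["student"] == "yes" else "no"
    if age == "middle-aged":
        return "yes"
    if age == "senior":
        # Senior customers: the decision depends on their credit rating.
        return "yes" if tuple_["credit_rating"] == "excellent" else "no"
    return "unknown"  # falls outside the extracted rules

print(buys_computer({"age": "young", "student": "yes"}))          # yes
print(buys_computer({"age": "senior", "credit_rating": "fair"}))  # no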
Back Propagation
Backpropagation is an algorithm that backpropagates the errors from the output nodes to the input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in a vast range of neural network applications in data mining, such as character recognition, signature verification, etc. Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors working back from output nodes to input nodes. It is an important mathematical tool for improving the accuracy of predictions in data mining and machine learning.

Support Vector Machine Algorithm
Support Vector Machine, or SVM, is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which there are two different categories that are classified using a decision boundary or hyperplane.
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using SVM.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
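A hedged sketch of the two SVM types using scikit-learn's SVC (assuming scikit-learn is available); the tiny 2-D dataset is made up, and in practice the kernel and its parameters would be tuned.

from sklearn.svm import SVC

# Toy 2-D points with two classes (assumed data, for illustration only).
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# Linear SVM: fits a straight-line (hyperplane) boundary for linearly separable data.
linear_clf = SVC(kernel="linear").fit(X, y)

# Non-linear SVM: an RBF kernel lets the boundary curve for non-linearly separable data.
nonlinear_clf = SVC(kernel="rbf").fit(X, y)

print(linear_clf.predict([[0.5, 0.5], [4.5, 4.5]]))   # expected: [0 1]
print(nonlinear_clf.predict([[0.5, 0.5], [4.5, 4.5]]))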
Lazy Learning
Lazy learning refers to machine learning processes in which generalization of the training data is delayed until a query is made to the system. This type of learning is also known as Instance-based Learning. Lazy classifiers are very useful when working with large datasets that have a few attributes. The main purpose behind using lazy learning is that (as in the K-nearest neighbours algorithm, which is employed in online recommendation engines like those of Netflix and Amazon) the dataset gets updated with new entries on a continuous basis. Due to these continuous updates, the training data becomes outdated in a short period of time, so there really is no time to have an actual training phase of some sort. Instance-based learning, local regression, K-Nearest Neighbours (K-NN), and Lazy Bayesian Rules are some examples of lazy learning.
KNN Algorithm
1. Calculate d(x, xi), i = 1, 2, 3, ..., n, where d denotes the Euclidean distance between the points.
2. Arrange the calculated n Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from the sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points belonging to the ith class among the k points, for k ≥ 0.
6. If ki > kj for all i ≠ j, then put x in class i.
Advantages
➢ The algorithm is simple and easy to implement.
➢ There is no need to build a model, tune several parameters, or make additional assumptions.
➢ The algorithm is versatile. It can be used for classification, regression and search.
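A minimal implementation of the six KNN steps listed above in plain Python; the small labelled dataset and the query points are assumptions made for the example.

import math
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    # Steps 1-2: compute the Euclidean distance to every point and sort the distances.
    distances = sorted(
        (math.dist(query, p), label) for p, label in zip(points, labels)
    )
    # Steps 3-4: take the first k distances and the corresponding k points.
    k_nearest = distances[:k]
    # Steps 5-6: count class membership among the k points and pick the majority class.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((2, 2), points, labels, k=3))   # A
print(knn_classify((6, 5), points, labels, k=3))   # B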
Cluster Analysis
Cluster analysis is a technique used in data mining to discover meaningful groups, or clusters, within a dataset. It is an unsupervised learning method that aims to partition data points into clusters based on their similarities or dissimilarities. The goal of cluster analysis is to find natural groupings in the data, where items within the same cluster are more similar to each other than to those in other clusters. It helps in identifying patterns, similarities, and structures in the data without any prior knowledge or labels.
The process of cluster analysis typically involves the following steps:
1. Data Preparation: Preprocess the dataset by handling missing values, normalizing or standardizing variables, and removing outliers if necessary.
2. Similarity Measurement: Define a distance or similarity measure to determine the similarity between data points. Common measures include Euclidean distance, Manhattan distance, or correlation coefficients.
3. Choosing a Clustering Algorithm: Select an appropriate clustering algorithm based on the nature of the data and the desired outcomes. Popular algorithms include k-means, hierarchical clustering, DBSCAN, and Gaussian mixture models.
• K-means: It partitions the data into k clusters, where k is pre-defined. It minimizes the sum of squared distances between data points and their cluster centroids (a small sketch follows this list).
• Hierarchical Clustering: It builds a hierarchy of clusters by iteratively merging or splitting existing clusters based on similarity or distance measures. It can result in a tree-like structure called a dendrogram.
• DBSCAN: It identifies clusters based on density connectivity. It groups together data points that lie in dense regions and treats points in low-density regions as noise or outliers.
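A short NumPy sketch of the k-means step mentioned in the list above: points are assigned to their nearest centroid and centroids are recomputed as cluster means. The 2-D sample data, k = 2, and the fixed iteration count are assumptions; a real analysis would normally use a library implementation such as scikit-learn's KMeans.

import numpy as np

def kmeans(X, k=2, n_iters=10, seed=0):
    """Very small k-means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 8.2], [7.8, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # two clearly separated groups
print(centroids)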
means: It partitions the data Many algorithms are designed also the low-dimensional exhibited by the majority of
into k clusters, where k is pre- to cluster interval-based space Databases contain data the data. These data points
defined. It minimizes the sum (numerical) data. However, that is noisy, missing, or are often referred to as
of squared distances between applications may require incorrect. Few algorithms are outliers or anomalies. Outlier
data points and their cluster clustering other types of data, sensitive to such data and detection is used in various
centroids. • Hierarchical such as binary, categorical may result in poor quality domains, including fraud
Clustering: It builds a (nominal), and ordinal data, or clusters. detection, network intrusion
hierarchy of clusters by mixtures of these data types. Types of Data/ Variables detection, sensor data
iteratively merging or splitting • Discovery of clusters with Used in Cluster Analysis analysis, and quality control.
existing clusters based on arbitrary shape: Many 1. Interval-Scaled variables
similarity or distance clustering algorithms 2. Binary variables
measures. It can result in a determine clusters based on 3. Nominal variables
tree-like structure called a Euclidean or Manhattan 4. Categorical variable
dendrogram. distance measures. 5. Variables of mixed types
• DBSCAN: It identifies Algorithms based on such
clusters based on density distance measures tend to
connectivity. It groups find spherical clusters with
together data points similar size and density
1. Statistical Methods: Statistical approaches assume that outliers are generated by a different statistical process than the majority of the data. These methods typically involve calculating statistical measures such as the mean, standard deviation, and Z-scores to identify data points that fall outside a specified threshold.
2. Distance-based Methods: Distance-based methods measure the dissimilarity or distance between data points and their neighbours. Outliers are often defined as data points that have a significantly greater distance from their neighbours. One popular distance-based method is the k-nearest neighbours (k-NN) algorithm, where outliers have a large number of distant neighbours.
3. Density-based Methods: Density-based approaches identify outliers as data points that have a significantly low density compared to their neighbouring points. One widely used density-based method is the Local Outlier Factor (LOF) algorithm, which computes the density of a data point based on the density of its neighbours. Outliers are points with significantly lower densities.
4. Clustering-based Methods: Clustering-based methods aim to group similar data points together and identify outliers as data points that do not belong to any cluster or belong to small, sparse clusters. One common clustering algorithm used for outlier detection is the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
It's worth mentioning that the choice of outlier detection method depends on the characteristics of the data, the specific problem domain, and the available resources and expertise. It is often recommended to apply multiple techniques and compare their results to gain a more comprehensive understanding of the outliers in the data.

Outlier Detection Applications
Outlier detection has many applications in various domains. Some of the most common applications are:
• Fraud detection - Outlier detection is widely used in finance and banking to identify fraudulent transactions, such as credit card fraud or money laundering.
• Health monitoring - Outlier detection is used in healthcare to identify rare or unusual medical conditions or anomalies in medical data, such as ECG readings, blood test results, or medical images.
• Quality control - Outlier detection is used in manufacturing and production to identify defective products, such as faulty components or equipment failures.
• Network security - Outlier detection is used in network security to identify unusual or suspicious network traffic patterns, such as intrusion attempts or malware infections.
• Anomaly detection - Outlier detection is used in various domains to identify anomalies or unusual patterns in data, such as environmental monitoring, sensor data analysis, or social media analysis.

Outlier Detection Methods - Supervised, Semi-Supervised, and Unsupervised Methods
Outlier detection methods can be categorized into three main types based on the availability of labelled data: supervised, semi-supervised, and unsupervised methods.
1. Supervised Outlier Detection: Supervised methods require labelled data, where each data point is annotated as either normal or outlier. These methods involve training a classifier on the labelled data to learn the boundary between normal and outlier instances. During testing, the classifier predicts the class labels of new data instances, identifying outliers based on their deviation from the learned normal patterns. Examples of supervised outlier detection methods include Support Vector Machines (SVM), Decision Trees, etc.
2. Semi-Supervised Outlier Detection: Semi-supervised methods make use of both labelled and unlabelled data. In these methods, a portion of the data is labelled as normal or outlier, while the remaining data is left unlabelled. The labelled data is used to train a model, similar to supervised methods, and the unlabelled data is utilized to capture the underlying distribution of the normal instances. Semi-supervised techniques aim to find regions in the data space that contain mostly normal instances and identify outliers outside of these regions. One popular semi-supervised method is the Self-Training algorithm, where the classifier iteratively learns from the labelled and unlabelled data.
3. Unsupervised Outlier Detection: Unsupervised methods do not require any labelled data. These methods focus on identifying anomalies solely based on the inherent patterns and structures within the data. Unsupervised techniques assume that outliers are significantly different from the majority of the data points. They aim to model the normal behaviour of the data and identify instances that do not conform to this model. Popular unsupervised outlier detection methods include clustering algorithms like k-means, density-based methods such as Local Outlier Factor (LOF) and DBSCAN, and statistical approaches like the Z-score and Gaussian distribution modelling.

Z-Score Method
The Z-score of a data point is calculated as the number of standard deviations it falls away from the mean of the dataset. The Z-score represents a point's distance from the mean in terms of the standard deviation. Mathematically, the Z-score for a data point x is calculated as

Z-score = (x − mean) / standard deviation

Using the Z-score, we can define the upper and lower bounds of a dataset. A data point with a Z-score greater than a certain threshold (usually 2.5 or 3) is considered an outlier.
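A small sketch of the Z-score method described above; the data values are made up, the helper function name is only for illustration, and the example call uses the 2.5 threshold mentioned in the text.

import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / std) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]   # 95 is an obvious extreme value
print(zscore_outliers(data, threshold=2.5))    # [95] with the lower 2.5 threshold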
What Is Data Mining?
Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. Companies use data mining software to learn more about their customers. It can help them to develop more effective marketing strategies, increase sales, and decrease costs. Data mining relies on effective data collection, warehousing, and computer processing.

What is a relational database?
A relational database is a collection of information that organizes data in predefined relationships, where data is stored in one or more tables (or "relations") of columns and rows, making it easy to see and understand how different data structures relate to each other.

What is a multimedia database?
A multimedia database is a controlled collection of multimedia data items such as text, images, graphic objects, video and audio. A multimedia database management system (DBMS) provides support for the creation, storage, access, querying and control of a multimedia database.

What is a frequent itemset? Give an example.
A frequent itemset is an itemset that occurs at least a certain number of times (or percentage) in the dataset. For example, if we have a dataset of customer purchases, an itemset could be {bread, cake, orange}, meaning that these three items were bought together by several customers.

What is support?
Support indicates how frequently an item appears in the data. Confidence indicates the number of times the if-then statement is found to be true. A third metric, called lift, can be used to compare observed confidence with expected confidence.

What is an outlier?
Outlier analysis in data mining is the process of identifying and examining data points that significantly differ from the rest of the dataset. An outlier can be defined as a data point that deviates significantly from the normal pattern or behavior of the data.

What do you mean by interestingness?
Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to be reduced.

What is the purpose of data cleaning?
Data cleaning is the process of preparing raw data for analysis by removing bad data, organizing the raw data, and filling in the null values. Ultimately, cleaning data prepares the data for the process of data mining, when the most valuable information can be pulled from the data set.

What is a fact table?
A fact table or fact entity is a table or entity in a star or snowflake schema that stores measures that measure the business, such as sales, cost of goods, or profit. Fact tables and entities aggregate measures, or the numerical data of a business.

What is ROLAP?
ROLAP stands for Relational Online Analytical Processing. ROLAP stores data in columns and rows (also known as relational tables) and retrieves the information on demand through user-submitted queries. A ROLAP database can be accessed through complex SQL queries to calculate information.

What is tree pruning?
Pruning means to change the model by deleting the child nodes of a branch node. The pruned node is regarded as a leaf node. Leaf nodes cannot be pruned. A decision tree consists of a root node, several branch nodes, and several leaf nodes. The root node represents the top of the tree.

What is numerosity reduction?
Numerosity reduction is another methodology of data reduction in data mining, in which the volume of the data is reduced by representing it in a lower format. There are two types of this technique: parametric and non-parametric numerosity reduction.

What is a fact constellation?
A fact constellation is a measure of online analytical processing, which is a collection of multiple fact tables sharing dimension tables, viewed as a collection of stars. It can be seen as an extension of the star schema. A fact constellation schema has multiple fact tables. It is also known as galaxy schema.
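To tie the frequent itemset and support definitions above together, here is a tiny counting sketch; the purchase data and the minimum support count of 2 are made up for the example.

from itertools import combinations
from collections import Counter

# Toy purchase data (assumed) and a minimum support count of 2.
purchases = [
    {"bread", "cake", "orange"},
    {"bread", "cake", "orange", "milk"},
    {"bread", "milk"},
    {"cake", "orange"},
]
min_support_count = 2

# Count every 3-item combination that appears in the transactions.
counts = Counter(
    itemset
    for basket in purchases
    for itemset in combinations(sorted(basket), 3)
)

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support_count}
print(frequent)   # {('bread', 'cake', 'orange'): 2}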