Classification
1
Outlines
2
Supervised learning and unsupervised
learning
◼ Supervised learning (classification)
◼ Supervision: the training data (observations, measurements, …) are labeled, indicating the class of each observation.
◼ New data is classified based on the training set.
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown.
◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
3
The prediction problem: Classification and Numeric Prediction
◼ Classification
◼ Predicts categorical class labels
◼ Numeric prediction
◼ Models continuous-valued functions (mathematical formulas)
◼ Model usage: to classify future or unknown objects
◼ Estimate the model's accuracy: the known label of each test sample is compared with the model's classification result
5
Procedure (1): Build the model
[Figure: the training data is fed to a classification algorithm, which produces a classifier; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
7
Classification
8
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
9
Decision tree
❑ Training set: the buys_computer data set shown above
❑ Resulting tree: [decision tree figure — the root node tests age?, with branches for <=30, 31..40, and >40]
10
Algorithm for decision tree
◼ Simple algorithm (a greedy algorithm)
◼ The tree is built in a top-down, recursive, divide-and-conquer manner.
◼ At the start, all training examples are at the root.
◼ Attributes are categorical (continuous-valued attributes are discretized first).
◼ The examples are recursively partitioned based on the selected attributes.
◼ Test attributes are selected on the basis of a statistical or heuristic measure (e.g., information gain), as in the sketch below.
11
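As a concrete illustration of this greedy, top-down procedure, here is a minimal Python sketch (the dict-based row representation, the helper names, and the use of information gain as the selection measure are illustrative assumptions):

```python
import math
from collections import Counter

# rows: list of dicts such as {"age": "<=30", "income": "high", ...}
# labels: list of class labels, e.g. "yes"/"no"

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on attribute `attr`."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    expected = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

def build_tree(rows, labels, attributes):
    """Top-down, recursive, divide-and-conquer (greedy) tree induction."""
    if len(set(labels)) == 1:            # node is pure -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return tree
```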
Algorithm for decision tree
◼ Example: Use this algorithm to build a decision tree from the following employee data, knowing that “status” is the class label attribute.
12
Attribute selection measures
◼ Problem: How to choose the best attributes for the root node and
its children?
◼ There are two common techniques for attribute selection (Attribute Selection Measure – ASM):
◼ Information Gain
◼ Gini Index
13
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Information Gain (IG) measures the reduction in entropy or uncertainty after a dataset is split on an attribute.
◼ IG quantifies how much information a feature gives us about the class.
◼ Based on the obtained IG values, the algorithm splits the nodes and builds the decision tree.
◼ The decision tree algorithm always tries to maximize the value
of Information Gain, and a node/attribute with the highest
information gain (IG) is extracted first.
15
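In symbols, the measures used in the worked example on the following slides, for an attribute A that splits D into partitions D_1, …, D_v, are:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$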
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Example: Decision tree
16
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Choose the attribute with the highest information gain.
17
Attribute selection measures
Information Gain
Class P: buys_computer = “yes”; Class N: buys_computer = “no”
(Training set: the buys_computer table shown earlier.)

$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

$$\mathrm{Info}_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$$

$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.246$$

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

The term (5/14) I(2,3) means that “age <=30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s.

Similarly, calculate Gain(income), Gain(student), and Gain(credit_rating), and from these select the best attribute as the root node.
18
Attribute selection measures
Information Gain
As on the previous slide, Gain(age) = Info(D) − Info_age(D) = 0.246. Similarly:

$$\mathrm{Gain}(income) = 0.029, \qquad \mathrm{Gain}(student) = 0.151, \qquad \mathrm{Gain}(credit\_rating) = 0.048$$
19
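The gain values above can be checked directly from the buys_computer table; a small self-contained Python check (the tuple encoding of the rows is an assumption for illustration):

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for the 14 training tuples
data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
COLUMNS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    idx = COLUMNS.index(col)
    groups = {}
    for row in data:
        groups.setdefault(row[idx], []).append(row[-1])
    expected = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

for col in COLUMNS:
    print(col, round(gain(col), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (age and student agree with the quoted 0.246 and 0.151 up to intermediate rounding)
```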
Attribute selection measures
Information Gain
◼ The age attribute, having the highest information gain, is selected as the root node of the decision tree.
20
Calculating Information-Gain for
attributes with continuous values
◼ Let A be a continuous-valued attribute
◼ Determine the best split point for A
◼ Sort the values of attribute A in ascending order
◼ The midpoint between each pair of adjacent values is a candidate split point
◼ (ai + ai+1)/2 is the midpoint between the pair (ai, ai+1)
◼ The point with the smallest expected information requirement is chosen as the split point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
21
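A minimal sketch of this procedure in Python, assuming the attribute values and class labels are given as parallel lists (the function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the midpoint with the smallest expected information requirement."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        point = (pairs[i][0] + pairs[i + 1][0]) / 2        # (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= point]       # A <= split-point
        d2 = [lab for v, lab in pairs if v > point]        # A >  split-point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info
```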
Calculating Information-Gain for
attributes with continuous values
22
Gain Ratio for attribute selection (C4.5)
◼ Gain ratio is a modification of information gain that reduces its
bias towards attributes with many values.
23
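The usual C4.5 definition, which the example on the next slide applies, normalizes the gain by the split information:

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v}\frac{|D_j|}{|D|}\log_2\frac{|D_j|}{|D|}, \qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$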
Gain Ratio for attribute selection (C4.5)
◼ Example: Calculate the Gain Ratio value for the income
attribute. An income test divides the data into three partitions:
low, medium, high, and contains 4, 6, and 4 sets of values,
respectively. Calculate the value Gain_Ratio(income) = ?
25
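Using the partition sizes 4, 6, and 4 out of 14 tuples, together with Gain(income) = 0.029 from the earlier slide:

$$\mathrm{SplitInfo}_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} \approx 1.557$$

$$\mathrm{GainRatio}(income) = \frac{0.029}{1.557} \approx 0.019$$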
Gini Index (CART, IBM IntelligentMiner)
◼ Example: Initialize the decision tree using the Gini index. Let D be the training data, in which 9 tuples belong to the class buys_computer = yes and the remaining 5 tuples belong to the class buys_computer = no. A (root) node N is created for the tuples in D.
26
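For reference, the Gini index of a data set with class proportions p_i, evaluated here for D (9 tuples of buys_computer = yes and 5 of no):

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 \approx 0.459$$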
Calculate Gini index
◼ To find the splitting criterion for the tuples in D, we need to calculate the Gini index for each attribute.
◼ Suppose the income attribute splits D into two subsets, D1 = {low, medium} and D2 = {high}, where D1 contains 10 records and D2 the remaining 4 records.
30
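Working this out from the table (D1 = {low, medium} contains 7 ‘yes’ and 3 ‘no’ tuples, D2 = {high} contains 2 ‘yes’ and 2 ‘no’):

$$\mathrm{Gini}_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\,\mathrm{Gini}(D_1) + \frac{4}{14}\,\mathrm{Gini}(D_2)
= \frac{10}{14}\left(1 - \left(\tfrac{7}{10}\right)^2 - \left(\tfrac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\tfrac{2}{4}\right)^2 - \left(\tfrac{2}{4}\right)^2\right) \approx 0.443$$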
Overfitting and Tree Pruning
◼ Overfitting: an induced tree may overfit the training data
◼ Too many branches, some of which may reflect anomalies due to noise or outliers
◼ Approaches to avoid overfitting
◼ Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold.
◼ Problem: it is difficult to choose an appropriate threshold.
32
Improvements to the basic decision tree
33
Classification in Large Databases
◼ Classification - a classic problem widely studied by statisticians
and machine learning researchers
◼ Scalable: Classify datasets with millions of examples and
hundreds of attributes at reasonable speeds
◼ Why are decision trees popular?
◼ Relatively fast learning speed (compared to other classification methods)
◼ Can be converted into simple and understandable
classification rules
◼ Can use SQL queries to access the database
◼ Has classification accuracy comparable to other methods
◼ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
34
Scalability framework for RainForest
◼ Separate the scalability aspects from the criteria that determine
the quality of the tree
◼ Create AVC list: AVC (Attribute, Value, Class Label)
◼ AVC-set (of an attribute X)
◼ The projection of the training data onto attribute X and the class label, in which the counts of the individual class labels are aggregated
◼ AVC-group (of a node n)
◼ The set of AVC-sets of all predictor attributes at node n
35
Rainforest: Training set and AVC set
37
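For example, the AVC-set of the age attribute over the buys_computer training data above collapses the 14 tuples into three value rows with aggregated class counts; a small sketch (the dictionary layout is an illustrative assumption):

```python
from collections import Counter, defaultdict

# (age, buys_computer) projected from the 14 training tuples above
age_and_class = [
    ("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
    ("31..40", "yes"), (">40", "no"),
]

avc_set = defaultdict(Counter)          # attribute value -> class-label counts
for value, label in age_and_class:
    avc_set[value][label] += 1

print(dict(avc_set))
# {'<=30': Counter({'no': 3, 'yes': 2}), '31..40': Counter({'yes': 4}),
#  '>40': Counter({'yes': 3, 'no': 2})}
```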
Decision tree
38
Classification
39
Bayesian Classification
40
Bayesian Classification: Why?
◼ Statistical classifier: Performs probabilistic prediction, i.e.
predicts the probability of class membership
◼ Theoretical basis: Based on Bayes Theorem.
◼ Performance: Simple Bayesian classifier - Bayes naïve, has
comparable performance to decision tree and selected neural
network classifiers.
◼ Incremental: Each training example can gradually
increase/decrease the probability that the hypothesis is correct -
prior knowledge can be combined with observed data
◼ Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
41
Naïve Bayes Classifier
◼ Naive Bayes classification is a probabilistic classifier based on
Bayes' theorem, with the assumption of independence among
features. Despite its simplicity, it often performs surprisingly
well in many real-world applications, especially for text
classification tasks such as spam detection, sentiment analysis,
and document categorization.
42
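Formally, for a tuple X = (x1, …, xn) and class Ci, Bayes' theorem and the naïve independence assumption give:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}, \qquad
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

and the classifier predicts the class Ci that maximizes P(X | Ci) P(Ci).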
Naïve Bayes Classifier: Training data set
Class C1: buys_computer = ‘yes’
Class C2: buys_computer = ‘no’
Training data: the buys_computer table shown earlier (14 tuples).
Example: predict the class label of the following tuple using the naïve Bayesian classifier:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
45
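Working the example out from the 14 training tuples (each conditional probability is a simple per-class count):

$$P(C_1) = P(buys\_computer = yes) = \tfrac{9}{14} \approx 0.643, \qquad P(C_2) = \tfrac{5}{14} \approx 0.357$$

$$P(X \mid C_1) = \tfrac{2}{9}\cdot\tfrac{4}{9}\cdot\tfrac{6}{9}\cdot\tfrac{6}{9} \approx 0.044, \qquad
P(X \mid C_2) = \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5} \approx 0.019$$

$$P(X \mid C_1)P(C_1) \approx 0.028 \;>\; P(X \mid C_2)P(C_2) \approx 0.007
\;\Rightarrow\; X \text{ is classified as } buys\_computer = yes$$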
Naïve Bayes Classifier
◼ Disadvantage
◼ It assumes class conditional independence, hence a loss of accuracy
◼ In practice, dependencies exist among the variables
50
K-Nearest Neighbors (KNN)
◼ The K-nearest neighbors algorithm (KNN) is a very simple yet
powerful machine learning model. It assigns a label to a new
sample based on the labels of its k closest samples in the
training set.
51
K-Nearest Neighbors (KNN)
◼ KNN is a lazy learner: it does not build a model from the
training data, and all of its computation is deferred until the
prediction time.
◼ KNN is also a very flexible model: it can find decision
boundaries of any shape between the classes, and can be used
for both classification and regression problems.
52
K-Nearest Neighbors Classification
◼ The idea of the KNN classification algorithm: given a new
sample, assign it to the class that is most common among its k
nearest neighbors.
◼ The training phase of the algorithm consists of only storing the
training samples and their labels, no model is built from the data.
◼ In the prediction phase, we compute the distances between the
test (query) point x and all the stored samples, find the k samples
with the smallest distances, and assign the majority label of these
samples to x.
53
K-Nearest Neighbors Classification
◼ The pseudocode of this algorithm
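A minimal Python sketch of this procedure, assuming Euclidean distance and an unweighted majority vote as described above:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by the majority label of its k nearest training samples."""
    # 1. Compute the distance from the query to every stored training sample
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Keep the k samples with the smallest distances
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Return the most common label among those neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```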
55
K-Nearest Neighbors Classification
◼ How KNN Works
◼ Step 6: Making predictions:
◼ Classification: the class of the new data point is determined by a majority vote among its k nearest neighbors.
56
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
57
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Guide: Compute the distance from (1, 0, 1) to each one of the
given points. The Euclidean distance between two vectors x =
(x₁, …, xₙ) and y = (y₁, …, yₙ) is defined as:
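In symbols:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$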
58
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Since we are only interested in finding the closest points to (1, 0, 1), we can save some computation by comparing squared Euclidean distances instead.
59
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ From the table above we can see that the three closest points to
(1, 0, 1) are points 3, 5 and 6. The majority class of these points
is 1, therefore (1, 0, 1) will be classified as 1.
60
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.
◼ If k is too small, then the algorithm may be sensitive to noise
(mislabeled samples in the training set), hence it will have high
variance.
◼ If k is too large, the neighborhood of the query point may include
too many points from irrelevant classes, and as a result the model
will have high bias.
◼ In the extreme case, when k = n, then the prediction of the
algorithm will always be the same (the majority of the labels in
the entire training set).
61
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.
63
Use cases of KNN
◼ Recommendation Systems: Recommending items to users based
on the preferences of similar users.
◼ Image Recognition: Classifying images based on the similarity
to known images.
◼ Anomaly Detection: Identifying unusual data points by
comparing them to the rest of the dataset.
◼ Medical Diagnosis: Predicting disease outcomes based on patient similarities.
64
Tips for using KNN
◼ Scaling: Standardize or normalize your features, as KNN is
sensitive to the scale of data.
◼ Choosing K: Use techniques like cross-validation to select the optimal value of K.
◼ Dimensionality Reduction: Apply methods like PCA (Principal
Component Analysis) to reduce dimensionality if you have high-
dimensional data.
65
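The three tips above can be combined in a single scikit-learn pipeline; a hedged sketch (the data `X, y`, the PCA setting, and the candidate K values are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Scaling -> dimensionality reduction -> KNN, with K chosen by cross-validation
pipeline = Pipeline([
    ("scale", StandardScaler()),          # KNN is sensitive to feature scale
    ("pca", PCA(n_components=0.95)),      # keep 95% of the variance (placeholder)
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
# search.fit(X, y)   # X, y: the feature matrix and labels of your data set
# print(search.best_params_, search.best_score_)
```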
Outlines
66
Evaluate and select models
◼ Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
◼ Use a test set of class-labeled tuples, rather than the training set, when evaluating accuracy
◼ Methods for estimating classifier accuracy
◼ Holdout method, random subsampling
◼ Cross-validation
◼ Bootstrap
◼ Compare classifiers:
◼ Confidence interval
◼ Cost-benefit analysis and ROC Curve
67
Classifier evaluation metrics: Confusion
matrix
◼ True positives (TP): positive tuples that were correctly labeled by the classifier.
◼ True negatives (TN): negative tuples that were correctly labeled by the classifier.
◼ False positives (FP): negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicts buys_computer = yes).
◼ False negatives (FN): positive tuples that were incorrectly labeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicts buys_computer = no).
68
Classifier evaluation metrics: Confusion
matrix
69
Classifier evaluation metrics: Confusion
matrix
Confusion Matrix:
Actual class \ Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
71
Classification evaluation metrics:
Accuracy, Error rate, Sensitivity

A \ P   C    ¬C
C       TP   FN   P
¬C      FP   TN   N
        P’   N’   All

◼ Class imbalance problem:
◼ One class may be rare (e.g., fraud)
◼ The negative class makes up the significant majority of the tuples
72
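With the counts in the matrix above (P = TP + FN, N = FP + TN, All = P + N), the usual definitions are:

$$\mathrm{accuracy} = \frac{TP + TN}{All}, \quad
\mathrm{error\ rate} = \frac{FP + FN}{All}, \quad
\mathrm{sensitivity} = \frac{TP}{P}, \quad
\mathrm{specificity} = \frac{TN}{N}$$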
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: the percentage of tuples that the classifier labels as positive that are actually positive
73
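In terms of the confusion-matrix counts:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad
F_\beta = \frac{(1+\beta^2)\,\mathrm{precision}\times\mathrm{recall}}{\beta^2\,\mathrm{precision} + \mathrm{recall}}$$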
Classifier Evaluation Metrics
74
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Holdout method
75
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Holdout method
76
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Cross-validation (k-fold, k = 10 is most common)
77
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Cross-validation (k-fold, k = 10 is most common)
◼ Randomly partition the data into k mutually exclusive subsets (folds); in each round, one fold is used as the test set and the remaining k−1 folds as the training set
◼ Leave-one-out: k-fold with k = the number of tuples, for small data sets
78
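A hedged scikit-learn sketch of the holdout and 10-fold procedures (the data `X, y` and the decision-tree model are placeholders):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Holdout: e.g. 2/3 of the labeled tuples for training, 1/3 held out for testing
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
# acc_holdout = clf.fit(X_train, y_train).score(X_test, y_test)

# 10-fold cross-validation: each tuple is used 9 times for training, once for testing
# scores = cross_val_score(clf, X, y, cv=10)
# print(scores.mean())
```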
Estimating confidence intervals:
Classification model M1 vs. M2
◼ Suppose there are two classifiers, M1 and M2: which model is better?
◼ Use 10-fold cross-validation to obtain their mean error rates
◼ These average error rates are only estimates of the error on the true population of future data instances
◼ What if the difference between the two error rates is just due to
chance?
◼ Use statistical significance testing
79
Estimating confidence intervals: NULL
hypothesis
◼ Perform 10-fold cross-validation
◼ Assume the samples follow a t-distribution with k -1 degrees of
freedom (k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we reject the null hypothesis, then
◼ It is concluded that the difference between M1 and M2 is
statistically significant
◼ Choose the model with a lower error rate
80
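One common form of this test, under the stated assumptions (paired 10-fold cross-validation, with d_i the difference in error rates of M1 and M2 on fold i):

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1 - M_2)/k}}, \qquad
\mathrm{var}(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\big(d_i - \bar{d}\big)^2$$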
Estimation of confidence interval: t-test
◼ The t-distribution is symmetric
◼ Significance level: sig = 0.05 or 5% means we conclude that M1 & M2 differ with 95% confidence
◼ Confidence limits: z = sig/2
82
Estimating confidence intervals: Statistical significance
◼ Are M1 & M2 significantly different?
◼ Compute t and choose a significance level (e.g., sig = 5%)
◼ If the computed t falls in the rejection region, conclude that the difference between M1 & M2 is statistically significant
◼ Otherwise, conclude that any differences are coincidental
83
Model selection: ROC Curves
◼ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
◼ Derived from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ The area under the ROC curve is a measure of model accuracy
◼ Rank the test tuples in descending order: the tuple most likely to belong to the positive class appears at the top of the list
◼ The closer the curve is to the diagonal (i.e. the closer the area is to 0.5), the less accurate the model
◼ The vertical axis represents the true positive rate
◼ The horizontal axis represents the false positive rate
◼ The chart also shows a diagonal line
◼ A model with perfect accuracy has an area of 1.0
84
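Given a classifier that outputs a score or probability for the positive class, the curve and its area can be obtained as in this hedged sketch (the labels and scores below are illustrative placeholders):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative values only: true labels and predicted positive-class probabilities
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # 1.0 = perfect, 0.5 = diagonal
print(list(zip(fpr, tpr)), auc)
```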
Model selection: ROC Curves
85
Model selection: ROC Curves
89
Summary
◼ Classification is a form of data analysis that extracts models that
describe important classes of data
◼ Efficient and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
◼ Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
◼ Stratified k-fold cross-validation is recommended for estimating
accuracy. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
90
Summary
◼ Classification is a form of data analysis that extracts models that describe classes of data. A classifier, or classification model, predicts categorical labels (classes), while numeric prediction models continuous-valued functions. Classification and numeric prediction are the two main types of prediction problems.
◼ Decision tree induction is a top-down recursive tree induction
algorithm that uses an attribute selection metric to select the
attribute to be tested for each non-leaf node in the tree. ID3, C4.5
and CART are algorithms that use different attribute selection
measures. Tree pruning algorithms attempt to improve accuracy
by removing tree branches that reflect noise in the data. Scalable algorithms, such as RainForest, have been proposed for decision tree induction on very large data sets. 91
Summary
◼ Naïve Bayesian Classification is based on Bayes' theorem of
posterior probability. It assumes class conditional independence
— that the influence of an attribute value on a given class is
independent of the values of other attributes.
92
Summary
◼ The confusion matrix can be used to evaluate the quality of the
classifier. For a two-class problem, it indicates true positives,
true negatives, false positives, and false negatives. Metrics to
evaluate the predictive ability of the classifier include accuracy,
sensitivity, specificity, precision, F and Fβ.
◼ Building and evaluating a classifier requires partitioning the
labeled data into a training set and a testing set. Holdout, random
sampling, cross-validation and bootstrapping are typical methods
used for such partitioning.
93
Summary
◼ Significance testing and ROC curves are useful tools for model
selection. Significance testing can be used to evaluate whether
the difference in accuracy between two classifiers is due to
chance. The ROC curve plots the true positive rate (or
sensitivity) against the false positive rate (or 1 - specificity) of
one or more classifiers.
94