
Classification

Concepts and Techniques


(3rd ed.)

1
Outlines

◼ Classification: Basic concepts


◼ Decision tree
◼ Bayes classifier method
◼ K-Nearest Neighbors (KNN)
◼ Evaluate and select models
◼ Techniques to improve classification performance

2
Supervised learning and unsupervised
learning
◼ Supervised learning (classification)
◼ Supervision: Training data (observations, measurements, …) are
labeled to indicate the class of each observation.
◼ New data is classified based on the training set.
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown.
◼ Provide a set of measurements, observations, etc. with the
purpose of establishing the existence of classes or clusters
in the data.

3
The prediction problem: Classification and
Numeric Prediction
◼ Classification
◼ Predicting categorical class labels
◼ Classify data (build a model) based on the training set and the
values (class labels) of a classifying attribute, then use it to
classify new data
◼ Numeric Prediction
◼ Modeling continuous-valued functions, i.e. predicting unknown
or missing values
◼ Some popular applications
◼ Credit/loan approval
◼ Medical diagnosis: is a tumor cancerous or benign?
◼ Fraud detection: is a transaction fraudulent?
◼ Site classification: which category does a site belong to?
4
Classification – 2 step process
◼ Model building: Describes a set of predefined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
identified by the class label attribute
◼ The set of tuples used to build the model is the training set
◼ The model is represented as classification rules, a decision tree, or a
mathematical formula
◼ Model usage: to classify future or unknown objects
◼ Estimate the model's accuracy
◼ The known label of each test sample is compared with the result
predicted by the model
◼ The accuracy rate is the percentage of test samples that are
correctly classified by the model
◼ The test set is independent of the training set
◼ If the accuracy is acceptable, use the model to classify new data

5
Procedure (1): Build the model

Training data → Classification Algorithm → Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Resulting rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
6
Process (2): Using the model in prediction

Testing data → Classifier → classify unseen data

Unseen data: (Jeff, Professor, 4) → Tenured?

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       yes
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
7
Classification

◼ Classification: Basic concepts


◼ Decision tree
◼ Bayes classifier method
◼ Rule-based classification
◼ Evaluate and select models
◼ Techniques to improve classification performance

8
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

9
Decision tree
❑ Training set: the buys_computer data (slide 9)
❑ Resulting tree:

age?
  <=30   → student?        (no → no, yes → yes)
  31..40 → yes
  >40    → credit_rating?  (excellent → no, fair → yes)
10
Algorithm for decision tree
◼ Basic algorithm (greedy algorithm)
◼ The tree is built in a top-down, recursive, divide-and-conquer
manner
◼ At the start, all training examples are at the root
◼ Attributes are categorical (continuous-valued attributes are
discretized first)
◼ The examples are partitioned recursively based on the selected
attributes
◼ Test attributes are selected on the basis of a statistical or
heuristic measure (e.g. information gain)
◼ Conditions for stopping the partitioning
◼ All samples for a given node belong to the same class
◼ There are no attributes left for further partitioning
◼ There are no samples left

11
Algorithm for decision tree
◼ Example: Use the algorithm above to build a decision tree from the
following employee data, where “status” is the class label
attribute.

12
Attribute selection measures
◼ Problem: How to choose the best attributes for the root node and
its children?
◼ There are two common techniques for attribute selection (Attribute
Selection Measure – ASM):
◼ Information Gain
◼ Gini Index

13
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Information Gain (IG) measures the reduction in entropy, or
uncertainty, after a dataset is split on an attribute.
◼ IG quantifies how much information a feature gives us about the
class.
◼ Based on the obtained IG values, the nodes are split and the
decision tree is built.
◼ The decision tree algorithm always tries to maximize Information
Gain: the node/attribute with the highest information gain (IG) is
selected first.

15
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Example: Decision tree

16
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Choose the attribute with the highest information gain

17
Attribute selection measures
Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

The term (5/14)·I(2,3) means that “age <=30” covers 5 of the 14 samples,
with 2 yes’es and 3 no’s.

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, calculate Gain(income), Gain(student), and Gain(credit_rating),
then select the best attribute as the root node. 18
Attribute selection measures
Information Gain
 Class P: buys_computer = “yes”;  Class N: buys_computer = “no”

Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
19
Attribute selection measures
Information Gain
◼ The age attribute has the highest Information Gain value, so it is
selected as the root node of the decision tree.

20
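A minimal Python sketch (not part of the original slides) of the Info(D) and Gain(A) computation above, using the buys_computer training data from slide 9:

```python
import math
from collections import Counter

# The 14-tuple buys_computer training set from slide 9
# (attribute order: age, income, student, credit_rating, class).
data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}  # column indexes

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    labels = [r[-1] for r in rows]
    col = ATTRS[attr]
    info_a = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        info_a += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - info_a

for a in ATTRS:   # age has the highest gain, so it becomes the root
    print(a, round(info_gain(data, a), 3))
# age ≈ 0.25, income ≈ 0.03, student ≈ 0.15, credit_rating ≈ 0.05 (cf. slides 18-19)
```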
Calculating Information-Gain for
attributes with continuous values
◼ Let A be an attribute with continuous value
◼ Determine the best separation point for A
◼ Sort the values of attribute A in ascending order
◼ The point between each pair of adjacent values can be
considered the split point
◼ (ai+ai+1)/2 is the point between the pair (ai , ai+1)
◼ The point with the minimum expected information requirement is
chosen as the split-point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point (a small sketch of
this search follows below)
21
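A rough Python sketch (not from the slides) of choosing the split point for a continuous attribute by minimizing the expected information; the sample ages and labels at the end are hypothetical, just for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """values: continuous attribute A; labels: class labels of the same tuples."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        # Candidate split point: midpoint of each pair of adjacent values.
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= point]
        right = [lab for v, lab in pairs if v > point]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info

# Hypothetical ages and class labels:
print(best_split_point([25, 32, 41, 28, 45, 38], ["no", "yes", "yes", "no", "no", "yes"]))
```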
Calculating Information-Gain for
attributes with continuous values

22
Gain Ratio for attribute selection (C4.5)
◼ Gain ratio is a modification of information gain that reduces its
bias towards attributes with many values.

23
Gain Ratio for attribute selection (C4.5)
◼ Example: Calculate the Gain Ratio value for the income attribute. A
test on income divides the data into three partitions: low, medium,
and high, containing 4, 6, and 4 tuples, respectively. Calculate
Gain_Ratio(income) = ?

SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
Gain(income) = 0.029 (calculated in the previous example)

GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029/1.557 = 0.019
(a small sketch of this computation follows below)
◼ The attribute with the maximum gain ratio is chosen as the
splitting attribute
◼ Exercise: Gain Ratio(student) = ? 24
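A minimal Python sketch (not part of the slides) of the SplitInfo and GainRatio computation for the income example above:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions of A."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

# income splits D into low / medium / high with 4, 6 and 4 tuples.
si = split_info([4, 6, 4])    # ≈ 1.557
gain_income = 0.029           # from the information-gain example on slide 19
print(round(si, 3), round(gain_income / si, 3))   # ≈ 1.557, 0.019
```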
Gini Index (CART, IBM IntelligentMiner)
◼ The Gini index measures the impurity or purity of a dataset; it is
used in algorithms like CART (Classification and Regression
Trees).

25
Gini Index (CART, IBM IntelligentMiner)
◼ Example: Initialize the decision tree using the Gini index. Let D be the
training data, in which 9 tuples belong to the class buys_computer =
yes and the remaining 5 tuples belong to the class buys_computer =
no. A (root) node N is created for the tuples in D.

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459

26
Calculate Gini index
◼ To find the splitting criterion for the tuples in D, we need to
compute the Gini index for each attribute.
◼ Suppose the income attribute divides D into two subsets: D1 = {low,
medium}, containing 10 tuples, and D2 = {high}, containing the
remaining 4 tuples.

Gini_income∈{low,medium}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2) = 0.443

◼ Similarly, Gini_{low,high}(D) = 0.458 and Gini_{medium,high}(D) = 0.450.

◼ As a result, split on {low, medium} and {high}, because this split has
the smallest Gini index (a small sketch of this computation follows below)
27
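A minimal Python sketch (not part of the slides) of the Gini computations above; the per-partition class counts (7 yes / 3 no and 2 yes / 2 no) are taken from the buys_computer data on slide 9:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions in D."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Whole training set: 9 'yes' and 5 'no' tuples.
gini_d = gini([9, 5])                                  # ≈ 0.459

# Binary split on income: D1 = {low, medium} (10 tuples: 7 yes / 3 no),
# D2 = {high} (4 tuples: 2 yes / 2 no).
gini_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(gini_d, 3), round(gini_split, 3))          # ≈ 0.459, 0.443
```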
Comparison of attribute selection measures

◼ The three common attribute selection measures generally give good
results; however:
◼ Information gain:
◼ Biased towards multi-valued attributes
◼ Gain ratio:
◼ Tends to prefer unbalanced splits, in which one partition is
much smaller than the others
◼ Gini index:
◼ Biased towards multi-valued attributes
◼ Has difficulty when the number of classes is large
◼ Tends to favor tests that result in equal-sized partitions 28
Some other attribute selection methods
◼ CHAID: a popular decision tree algorithm, based on the χ2 test for
independence
◼ C-SEP: performs better than information gain and the Gini index in
certain cases
◼ G-statistic: has a value that closely approximates the χ2 distribution
◼ MDL (Minimal Description Length):
◼ The best tree is the one that requires the fewest bits to both (1)
encode the tree and (2) encode the exceptions to the tree
◼ Multivariate splits (partitioning based on combinations of multiple
variables)
◼ CART: finds multivariate splits based on linear combinations of
attributes
◼ Which attribute selection measure is best?
◼ Most give good results; none is significantly superior to the others. 29
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
◼ Too many branches, some of which may reflect anomalies due to
noise or outliers
◼ Poor accuracy for new, unseen samples
◼ Two approaches to avoid overfitting (a small scikit-learn sketch
follows below):
◼ Prepruning: Halt tree construction early; do not split a node if
this would result in the goodness measure falling below a
threshold
◼ Problem: Difficult to choose an appropriate threshold
◼ Postpruning: Remove branches from a "fully grown" tree,
obtaining a sequence of progressively pruned trees
◼ Use a set of data different from the training data to decide
which is the "best pruned tree" 31
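A minimal sketch (assuming scikit-learn is installed; not part of the slides) of how pre-pruning and post-pruning can be experimented with; the dataset and parameter values are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: stop splitting early via depth / minimum-samples thresholds.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_tr, y_tr)

# Post-pruning: grow the tree, then prune with cost-complexity pruning (ccp_alpha > 0).
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))   # accuracy on the held-out test set
```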


Overfitting and Tree Pruning

32
Improvements to the basic decision tree

◼ Allow attributes to have continuous values
◼ Dynamically define new discrete-valued attributes that partition
the continuous attribute values into a set of discrete intervals
◼ Handle missing attribute values
◼ Assign the most common value of the attribute
◼ Assign a probability to each possible value
◼ Attribute construction
◼ Create new attributes based on existing ones
◼ This reduces fragmentation, repetition, and replication

33
Classification in Large Databases
◼ Classification - a classical problem extensively studied by
statisticians and machine learning researchers
◼ Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
◼ Why are decision trees popular?
◼ Relatively fast learning speed (compared to other classification
methods)
◼ Can be converted into simple and understandable classification
rules
◼ Can use SQL queries to access the database
◼ Classification accuracy comparable to other methods
◼ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
34
Scalability framework for RainForest
◼ Separates the scalability aspects from the criteria that determine
the quality of the tree
◼ Builds AVC lists: AVC (Attribute, Value, Class label)
◼ AVC-set (of an attribute X)
◼ Projection of the training data onto the attribute X and the
class label, with the counts of each individual class label
aggregated
◼ AVC-group (of a node n)
◼ Set of AVC-sets of all predictor attributes at node n

35
Rainforest: Training set and AVC set

Training examples: the buys_computer data (slide 9)

AVC-set on age                      AVC-set on income
age      yes   no                   income   yes   no
<=30     2     3                    high     2     2
31..40   4     0                    medium   4     2
>40      3     2                    low      3     1

AVC-set on student                  AVC-set on credit_rating
student  yes   no                   credit_rating   yes   no
yes      6     1                    fair            6     2
no       3     4                    excellent       3     3
36
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
◼ Uses a statistical technique called bootstrapping to create a
number of smaller samples (subsets), each of which fits in
memory
◼ Each subset is used to create a tree, which results in a number
of trees
◼ These trees are examined and used to create a new tree T'
◼ It turns out that T' is very close to the tree that would be
produced using the entire data set together
◼ Advantage: It only requires 2 DB scans

37
Decision tree

38
Classification

◼ Classification: Basic concepts


◼ Decision tree
◼ Bayes classifier method
◼ Rule-based classification
◼ Evaluate and select models
◼ Techniques to improve classification performance

39
Bayesian Classification

40
Bayesian Classification: Why?
◼ Statistical classifier: Performs probabilistic prediction, i.e.
predicts the probability of class membership
◼ Theoretical basis: Based on Bayes' theorem
◼ Performance: A simple Bayesian classifier, naïve Bayes, has
performance comparable to decision tree and selected neural
network classifiers
◼ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct;
prior knowledge can be combined with observed data
◼ Standard: Even when Bayesian methods are computationally
intractable, they provide a standard of optimal decision making
against which other methods can be measured
41
Naïve Bayes Classifier
◼ Naive Bayes classification is a probabilistic classifier based on
Bayes' theorem, with the assumption of independence among
features. Despite its simplicity, it often performs surprisingly
well in many real-world applications, especially for text
classification tasks such as spam detection, sentiment analysis,
and document categorization.

42
Naïve Bayes Classifier: Training data set

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Example: Predict the class label using the naïve Bayesian classifier.
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
45
Naïve Bayes Classifier: Training data set

◼ P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

◼ Calculate P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

◼ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

◼ P(X|Ci) * P(Ci):
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

As a result, X belongs to the class “buys_computer = yes” 46
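A minimal Python sketch (not part of the slides) that reproduces the hand computation above; it returns P(X|Ci)·P(Ci) for each class, which is enough to compare classes since P(X) is a common denominator:

```python
from collections import Counter

# Same 14-tuple training set as slide 45
# (attribute order: age, income, student, credit_rating, class).
rows = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(x, rows):
    """Return P(X|Ci) * P(Ci) for each class Ci."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, nc in class_counts.items():
        score = nc / len(rows)                                # prior P(Ci)
        for k, value in enumerate(x):                         # independence assumption
            nk = sum(1 for r in rows if r[-1] == c and r[k] == value)
            score *= nk / nc                                  # P(x_k | Ci)
        scores[c] = score
    return scores

x = ("<=30", "medium", "yes", "fair")
print(naive_bayes_scores(x, rows))   # yes ≈ 0.028, no ≈ 0.007 -> classify as 'yes'
```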
Naïve Bayes Classifier: Training data set

Practice: Predict the class label using the naïve Bayesian classifier,
with the same training data as on slide 45. Data to be classified:
X = (age = 31…40, income = medium, student = no, credit_rating = fair)
47
Avoid the zero probability problem
◼ Naïve Bayesian prediction requires each conditional probability to be
nonzero; otherwise, the predicted probability will be zero:

P(X|Ci) = Π (k = 1..n) P(xk|Ci)

◼ Example: Suppose a data set has 1000 tuples, with income = low (0),
income = medium (990), and income = high (10)
◼ Use the Laplacian correction
◼ Adding 1 to each case (as sketched below), we have:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
◼ The “corrected” probability estimates are close to their
“uncorrected” counterparts
48
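A tiny Python sketch (not part of the slides) of the Laplacian (add-1) correction applied to the income counts in the example above:

```python
# Observed counts of income values among 1000 tuples.
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())

# Add 1 to each count and 1 per distinct value to the denominator.
smoothed = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```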
Naïve Bayes classification
◼ Advantages
◼ Easy to implement
◼ Gives good results in most cases
◼ Disadvantages
◼ Assumption: class conditional independence, which causes a loss
of accuracy
◼ In practice, dependencies exist among variables
◼ For example: hospital patient profiles - age, family history,
symptoms such as fever and cough - diseases such as lung
cancer and diabetes
◼ Dependencies among these cannot be modeled by a naïve Bayes
classifier
◼ How to deal with these dependencies? Hint: Bayesian belief
networks 49
Outlines

◼ Classification: Basic concepts


◼ Decision tree
◼ Bayes classifier method
◼ K-Nearest Neighbors (KNN)
◼ Evaluate and select models
◼ Techniques to improve classification performance

50
K-Nearest Neighbors (KNN)
◼ The K-nearest neighbors algorithm (KNN) is a very simple yet
powerful machine learning model. It assigns a label to a new
sample based on the labels of its k closest samples in the
training set.

51
K-Nearest Neighbors (KNN)
◼ KNN is a lazy learner: it does not build a model from the
training data, and all of its computation is deferred until the
prediction time.
◼ KNN is also a very flexible model: it can find decision
boundaries of any shape between the classes, and can be used
for both classification and regression problems.

52
K-Nearest Neighbors Classification
◼ The idea of the KNN classification algorithm: given a new
sample, assign it to the class that is most common among its k
nearest neighbors.
◼ The training phase of the algorithm consists of only storing the
training samples and their labels, no model is built from the data.
◼ In the prediction phase, we compute the distances between the
test (query) point x and all the stored samples, find the k samples
with the smallest distances, and assign the majority label of these
samples to x.

53
K-Nearest Neighbors Classification
◼ The pseudocode of this algorithm

◼ Note that since the algorithm relies on distance computations, it


is important to normalize the data set in order to make sure that
all the features have the same scale.
54
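Since the pseudocode figure is not reproduced here, the following is a minimal Python sketch of the same procedure, assuming Euclidean distance and simple majority voting (the sample points at the end are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by the majority label among its k nearest training samples."""
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train)]
    dists.sort(key=lambda d: d[0])                 # sort by distance
    top_k = [label for _, label in dists[:k]]      # k nearest neighbors
    return Counter(top_k).most_common(1)[0][0]     # majority vote

# Tiny illustration with made-up 3-dimensional points:
X_train = [(0, 0, 1), (1, 1, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)]
y_train = [0, 1, 0, 1, 1]
print(knn_predict((1, 0, 1), X_train, y_train, k=3))   # -> 0
```

As the note above says, features should be normalized to a common scale before the distance computation.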
K-Nearest Neighbors Classification
◼ How KNN Works
◼ Step 1: Data Preparation
◼ Step 2: Choosing K: Decide the number of nearest neighbors,
denoted as 𝐾, which is a hyperparameter of the model.
◼ Step 3: Distance Metric: Select a distance metric to measure the
similarity between data points. Common choices are Euclidean
distance, Manhattan distance, or Minkowski distance.
◼ Step 4: Finding Neighbors: For a new data point that you want to
classify or predict, calculate its distance to all data points in the
training set.
◼ Step 5: Sorting Neighbors: Sort the distances and identify the 𝐾
nearest neighbors.

55
K-Nearest Neighbors Classification
◼ How KNN Works
◼ Step 6: Making Predictions:
◼ Classification: The class of the new data point is determined
by the majority class among its K nearest neighbors.
◼ Regression: The value of the new data point is usually the
average of the values of its K nearest neighbors.

56
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.

57
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Guide: Compute the distance from (1, 0, 1) to each one of the
given points. The Euclidean distance between two vectors x =
(x₁, …, xₙ) and y = (y₁, …, yₙ) is defined as
d(x, y) = √((x₁ − y₁)² + … + (xₙ − yₙ)²)

58
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Since we are only interested to find the closest points to (1, 0, 1),
we can save some computations by computing the Euclidean
distance squared

59
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ From the table above we can see that the three closest points to
(1, 0, 1) are points 3, 5 and 6. The majority class of these points
is 1, therefore (1, 0, 1) will be classified as 1.

60
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.
◼ If k is too small, then the algorithm may be sensitive to noise
(mislabeled samples in the training set), hence it will have high
variance.
◼ If k is too large, the neighborhood of the query point may include
too many points from irrelevant classes, and as a result the model
will have high bias.
◼ In the extreme case, when k = n, then the prediction of the
algorithm will always be the same (the majority of the labels in
the entire training set).

61
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.

◼ Solution: Using grid search to find the optimal k


62
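A minimal sketch (assuming scikit-learn; not part of the slides) of using grid search with cross-validation to pick k; the iris dataset and the range of k values are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features first, since KNN is sensitive to feature scale.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 16))}

grid = GridSearchCV(pipe, param_grid, cv=5)     # 5-fold cross-validation per k
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```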
Advantages and disadvantages of KNN
◼ Advantages
◼ Simplicity: Easy to understand and implement.
◼ Non-parametric: Makes no assumptions about the underlying data
distribution.
◼ Versatile: Can be used for both classification and regression.
◼ Disadvantages
◼ Computationally intensive: Can be slow for large datasets, as it
requires distance calculations to all data points.
◼ Memory intensive: Requires storing the entire training dataset.
◼ Curse of dimensionality: Performance can degrade with
high-dimensional data, since distances become less meaningful.

63
Use cases of KNN
◼ Recommendation Systems: Recommending items to users based
on the preferences of similar users.
◼ Image Recognition: Classifying images based on the similarity
to known images.
◼ Anomaly Detection: Identifying unusual data points by
comparing them to the rest of the dataset.
◼ Medical Diagnosis: Predicting disease outcomes based on patient
similarities

64
Tips for using KNN
◼ Scaling: Standardize or normalize your features, as KNN is
sensitive to the scale of data.
◼ Choosing K: Use techniques like cross-validation to select the
optimal value of K.
◼ Dimensionality Reduction: Apply methods like PCA (Principal
Component Analysis) to reduce dimensionality if you have high-
dimensional data.

65
Outlines

◼ Classification: Basic concepts


◼ Decision tree
◼ Bayes classifier method
◼ K-Nearest Neighbors (KNN)
◼ Evaluate and select models
◼ Techniques to improve classification performance

66
Evaluate and select models
◼ Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
◼ Use a test set of class-labeled tuples, rather than the training set,
when assessing accuracy
◼ Methods for estimating classifier accuracy
◼ Holdout method, random subsampling
◼ Cross-validation
◼ Bootstrap
◼ Compare classifiers:
◼ Confidence interval
◼ Cost-benefit analysis and ROC Curve
67
Classifier evaluation metrics: Confusion
matrix
◼ True Positives (TP): Positive tuples that were correctly labeled by
the classifier.
◼ True Negatives (TN): Negative tuples that were correctly labeled by
the classifier.
◼ False Positives (FP): Negative tuples that were incorrectly labeled
as positive (e.g. tuples of class buys_computer = no for which the
classifier predicts buys_computer = yes).
◼ False Negatives (FN): Positive tuples that were incorrectly labeled
as negative (e.g. tuples of class buys_computer = yes for which the
classifier predicts buys_computer = no).

68
Classifier evaluation metrics: Confusion
matrix

69
Classifier evaluation metrics: Confusion
matrix
Confusion Matrix:
Actual class \ Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix


Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

◼ Given m classes, CMi,j in the confusion matrix represents the


number of tuples in class i assigned by the classifier as label j
◼ There are additional rows/columns to represent totals
70
Classification evaluation indexes:
Accuracy, Error rate, Sensitivity

71
Classification evaluation indexes:
Accuracy, Error rate, Sensitivity
A\P    C    ¬C
C      TP   FN   P
¬C     FP   TN   N
       P’   N’   All

◼ Classifier accuracy, or recognition rate: the percentage of test-set
tuples that are correctly classified
Accuracy = (TP + TN)/All
◼ Error rate: 1 − accuracy, or
Error rate = (FP + FN)/All
◼ Class imbalance problem:
◼ One class may be rare, e.g. fraud
◼ Most tuples belong to the negative class and only a few to the
positive class
◼ Sensitivity: True positive recognition rate
Sensitivity = TP/P
◼ Specificity: True negative recognition rate
Specificity = TN/N

72
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: the percentage of tuples that the classifier labels as
positive that are actually positive
Precision = TP/(TP + FP)
◼ Recall (completeness): the percentage of positive tuples that the
classifier labels as positive
Recall = TP/(TP + FN)
◼ A perfect score is 1.0
◼ There is an inverse relationship between precision and recall
◼ F measure (F1 or F-score): the harmonic mean of precision and
recall
F1 = 2 × Precision × Recall / (Precision + Recall)
◼ Fβ: a weighted measure of precision and recall
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

73
Classifier Evaluation Metrics

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

◼ Precision = 90/230 = 39.13%          Recall = 90/300 = 30.00%

74
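A short Python sketch (not part of the slides) that computes the metrics for the cancer confusion matrix above directly from its four counts:

```python
# Counts from the cancer example: TP, FN, FP, TN.
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN                 # actual positives and negatives

accuracy    = (TP + TN) / (P + N)       # 0.965
error_rate  = (FP + FN) / (P + N)       # 0.035
sensitivity = TP / P                    # recall = 0.30
specificity = TN / N                    # ≈ 0.9856
precision   = TP / (TP + FP)            # ≈ 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ≈ 0.34

print(accuracy, error_rate, sensitivity, specificity, precision, f1)
```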
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Holdout method
◼ The data are randomly partitioned into two independent sets
◼ Training set (2/3) for model construction
◼ Test set (1/3) for accuracy estimation
◼ Random subsampling: a variation of holdout
◼ Repeat the holdout k times; accuracy = the average of the
accuracies obtained

76
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Cross-validation (k-fold, k = 10 is most common; a small sketch
follows below)
◼ Randomly partition the data into k mutually exclusive subsets,
each of approximately equal size
◼ In the i-th iteration, use Di as the test set and the remaining
subsets as the training set
◼ Leave-one-out: k folds with k = the number of tuples, for small
data sets
◼ Stratified cross-validation: the folds are stratified so that the
class distribution in each fold is approximately the same as in
the original data

78
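A minimal sketch (assuming scikit-learn; not part of the slides) of stratified 10-fold cross-validation for estimating a classifier's accuracy; the dataset is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class distribution roughly equal in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores.mean(), scores.std())   # average accuracy over the 10 folds
```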
Estimating confidence intervals:
Classification model M1 vs. M2
◼ Suppose there are two classifiers, M1 and M2. Which model is better?
◼ Use 10-fold cross-validation to obtain their mean error rates,
err(M1) and err(M2)
◼ These mean error rates are only estimates of the error on the true
population of future data instances
◼ What if the difference between the two error rates is just due to
chance?
◼ Use statistical significance testing

◼ Obtain confidence limits for error estimates

79
Estimating confidence intervals: NULL
hypothesis
◼ Perform 10-fold cross-validation
◼ Assume the samples follow a t-distribution with k -1 degrees of
freedom (k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we reject null hypothesis, then
◼ It is concluded that the difference between M1 and M2 is
statistically significant
◼ Choose the model with a lower error rate

80
Estimation of confidence interval: t-test

◼ If only one test set is available: pairwise comparison
◼ For the i-th 10-fold cross-validation round, the same cross
partitioning is used to obtain err(M1)i and err(M2)i
◼ Average over the 10 rounds to obtain the mean error rates of
M1 and M2
◼ The t-test computes the t-statistic with k − 1 degrees of freedom
(see the sketch below)
◼ If two test sets are available: use an unpaired t-test, where
k1 and k2 are the numbers of cross-validation samples for M1 and M2


81
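A minimal sketch (assuming SciPy is available; not part of the slides) of the paired t-test over 10 cross-validation rounds; the fold-by-fold error rates below are hypothetical:

```python
from scipy import stats

# Hypothetical error rates of M1 and M2 on each of the 10 folds.
err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.10, 0.11, 0.12]
err_m2 = [0.15, 0.14, 0.16, 0.12, 0.17, 0.15, 0.18, 0.13, 0.14, 0.16]

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired t-test, k - 1 = 9 d.o.f.
print(t_stat, p_value)
# Reject the null hypothesis (M1 and M2 are the same) if p_value < sig = 0.05.
```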
Estimated confidence interval: Table for t-
distribution

◼ The t-distribution is symmetric
◼ Significance level, e.g. sig = 0.05 or 5%,
means M1 and M2 differ with 95%
confidence
◼ Confidence limit: z = sig/2

82
Estimating confidence intervals: Statistical
significance
◼ Are M1 & M2 significantly different?
◼ Calculate significance level (sig = 5%)

◼ t-distribution reference table (t-distribution): Find the t value

corresponding to k-1 degrees of freedom ( 9)


◼ t-distribution is symmetric: usually displays the upper % of

the distribution → look for value for confidence limit z=sig/2


(0.025)
◼ If t > z or t < -z, then the value t is in the rejection region:

◼ Reject the null hypothesis that the average error rates of

M1 & M2 are the same


◼ Conclusion: Statistically significant difference between

M1 & M 2
◼ Otherwise, conclude that any differences are coincidental
83
Model selection: ROC Curves
◼ ROC (Receiver Operating Characteristic) curves: for visual
comparison of classification models
◼ Derived from signal detection theory
◼ Show the trade-off between the true positive rate and the false
positive rate
◼ The area under the ROC curve is a measure of model accuracy
◼ Rank the test tuples in descending order: the tuple most likely to
belong to the positive class appears at the top of the list
◼ The vertical axis represents the true positive rate; the horizontal
axis represents the false positive rate
◼ The chart also shows a diagonal line
◼ A model with perfect accuracy has an area of 1.0; the closer the
curve is to the diagonal (i.e. the closer the area is to 0.5), the
less accurate the model is 84
Model selection: ROC Curves

85
Model selection: ROC Curves

The table of probability values (column 3) returned by the
probabilistic classifier for each of the 10 tuples in the test set,
sorted in descending order of probability. Task: draw the ROC curve
(a small sketch follows below). 86
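A minimal sketch (assuming scikit-learn and matplotlib; not part of the slides) of plotting an ROC curve from classifier scores; since the original table is not reproduced here, the labels and probabilities below are hypothetical:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical actual classes and classifier probabilities, sorted by probability.
y_true  = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30]

fpr, tpr, _ = roc_curve(y_true, y_score)           # false / true positive rates
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")           # diagonal = random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```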
Model selection: ROC Curves

ROC curve for the above data


87
Model selection: ROC Curves

ROC curves of two classification models M1 and M2 88


Issues affecting model selection
◼ Accuracy
◼ Classifier accuracy: predicting class labels
◼ Speed
◼ Time to construct the model (training time)
◼ Time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency for disk-resident databases
◼ Interpretability
◼ The understanding and insight provided by the model
◼ Other measures: e.g. goodness of rules, decision tree size,
compactness of classification rules

89
Summary
◼ Classification is a form of data analysis that extracts models that
describe important classes of data
◼ Efficient and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
◼ Evaluation measures include: accuracy, sensitivity, specificity,
precision, recall, the F measure, and the Fβ measure.
◼ Stratified k-fold cross-validation is recommended for estimating
accuracy. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.

90
Summary
◼ Classification is a form of data analysis that extracts models that
describe classes of data. A classifier, or classification model,
predicts categorical labels (classes). Numeric prediction models
continuous-valued functions. Classification and numeric prediction
are the two main types of prediction problems.
◼ Decision tree induction is a top-down recursive tree induction
algorithm that uses an attribute selection measure to select the
attribute to be tested at each non-leaf node of the tree. ID3, C4.5
and CART are algorithms that use different attribute selection
measures. Tree pruning algorithms attempt to improve accuracy
by removing tree branches that reflect noise in the data. Scalable
algorithms, like RainForest, allow decision trees to be induced
from very large data sets. 91
Summary
◼ Naïve Bayesian classification is based on Bayes' theorem of
posterior probability. It assumes class conditional independence,
i.e. that the effect of an attribute value on a given class is
independent of the values of the other attributes.

92
Summary
◼ The confusion matrix can be used to evaluate the quality of the
classifier. For a two-class problem, it indicates true positives,
true negatives, false positives, and false negatives. Metrics to
evaluate the predictive ability of the classifier include accuracy,
sensitivity, specificity, precision, F and Fβ.
◼ Building and evaluating a classifier requires partitioning the
labeled data into a training set and a testing set. Holdout, random
sampling, cross-validation and bootstrapping are typical methods
used for such partitioning.

93
Summary
◼ Significance testing and ROC curves are useful tools for model
selection. Significance testing can be used to evaluate whether
the difference in accuracy between two classifiers is due to
chance. The ROC curve plots the true positive rate (or
sensitivity) against the false positive rate (or 1 - specificity) of
one or more classifiers.

94
