Classification
1
Outlines
2
Supervised learning and unsupervised
learning
◼ Supervised learning (classification)
◼ Supervision: the training data (observations, measurements, …) are labeled, indicating the class of each observation.
◼ New data is classified based on the training set.
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown.
◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
3
The prediction problem: Classification and Numeric Prediction
◼ Classification
◼ Predicts categorical class labels
◼ Numeric prediction
◼ Models continuous-valued functions (mathematical formulas)
◼ Model usage: to classify future or unknown objects
◼ Estimate the model's accuracy: the known label of each test sample is compared with the model's classification result
5
Procedure (1): Build the model
[Figure: the training data is fed to a classification algorithm, which produces a classifier; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
7
Classification
8
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
9
Decision tree
❑ Training set: the buys_computer data set shown above
❑ Resulting tree: [decision tree figure — the root node tests age?, with branches for <=30, 31..40, and >40]
10
Algorithm for decision tree
◼ Simple algorithm (a greedy algorithm)
◼ The tree is built in a top-down, recursive, divide-and-conquer manner.
◼ At the start, all training examples are at the root.
◼ Attributes are categorical (continuous-valued attributes are discretized first).
◼ The examples are recursively partitioned based on the selected attributes.
◼ Test attributes are selected on the basis of a statistical or heuristic measure (e.g., information gain), as in the sketch below.
11
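As a concrete illustration of this greedy, top-down procedure, here is a minimal Python sketch (the dict-based row representation, the helper names, and the use of information gain as the selection measure are illustrative assumptions):

```python
import math
from collections import Counter

# rows: list of dicts such as {"age": "<=30", "income": "high", ...}
# labels: list of class labels, e.g. "yes"/"no"

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on attribute `attr`."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    expected = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

def build_tree(rows, labels, attributes):
    """Top-down, recursive, divide-and-conquer (greedy) tree induction."""
    if len(set(labels)) == 1:            # node is pure -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return tree
```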
Algorithm for decision tree
◼ Example: Use this algorithm to build a decision tree from the following employee data, knowing that “status” is the class label attribute.
12
Attribute selection measures
◼ Problem: How to choose the best attributes for the root node and
its children?
◼ There are two common techniques for attribute selection (Attribute Selection Measure – ASM):
◼ Information Gain
◼ Gini Index
13
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Information Gain (IG) measures the reduction in entropy or uncertainty after a dataset is split on an attribute.
◼ IG quantifies how much information a feature gives us about the class.
◼ Based on the obtained IG values, the algorithm splits the nodes and builds the decision tree.
◼ The decision tree algorithm always tries to maximize the value
of Information Gain, and a node/attribute with the highest
information gain (IG) is extracted first.
15
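In symbols, the measures used in the worked example on the following slides, for an attribute A that splits D into partitions D_1, …, D_v, are:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$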
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Example: Decision tree
16
Attribute selection measures:
Information Gain (ID3/C4.5)
◼ Choose the attribute with the highest information gain.
17
Attribute selection measures
Information Gain
Class P: buys_computer = “yes”; Class N: buys_computer = “no”
(Training set: the buys_computer table shown earlier.)

$$\mathrm{Info}(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

$$\mathrm{Info}_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$$

$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.246$$

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

The term (5/14) I(2,3) means that “age <=30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s.

Similarly, calculate Gain(income), Gain(student), and Gain(credit_rating), and from these select the best attribute as the root node.
18
Attribute selection measures
Information Gain
As on the previous slide, Gain(age) = Info(D) − Info_age(D) = 0.246. Similarly:

$$\mathrm{Gain}(income) = 0.029, \qquad \mathrm{Gain}(student) = 0.151, \qquad \mathrm{Gain}(credit\_rating) = 0.048$$
19
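The gain values above can be checked directly from the buys_computer table; a small self-contained Python check (the tuple encoding of the rows is an assumption for illustration):

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for the 14 training tuples
data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
COLUMNS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    idx = COLUMNS.index(col)
    groups = {}
    for row in data:
        groups.setdefault(row[idx], []).append(row[-1])
    expected = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

for col in COLUMNS:
    print(col, round(gain(col), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (age and student agree with the quoted 0.246 and 0.151 up to intermediate rounding)
```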
Attribute selection measures
Information Gain
◼ The age attribute, having the highest information gain, is selected as the root node of the decision tree.
20
Calculating Information-Gain for
attributes with continuous values
◼ Let A be a continuous-valued attribute
◼ Determine the best split point for A
◼ Sort the values of attribute A in ascending order
◼ The midpoint between each pair of adjacent values is a candidate split point
◼ (ai + ai+1)/2 is the midpoint between the pair (ai, ai+1)
◼ The point with the smallest expected information requirement is chosen as the split point for A
◼ Split:
◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
21
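A minimal sketch of this procedure in Python, assuming the attribute values and class labels are given as parallel lists (the function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the midpoint with the smallest expected information requirement."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        point = (pairs[i][0] + pairs[i + 1][0]) / 2        # (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= point]       # A <= split-point
        d2 = [lab for v, lab in pairs if v > point]        # A >  split-point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info
```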
Calculating Information-Gain for
attributes with continuous values
22
Gain Ratio for attribute selection (C4.5)
◼ Gain ratio is a modification of information gain that reduces its
bias towards attributes with many values.
23
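The usual C4.5 definition, which the example on the next slide applies, normalizes the gain by the split information:

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v}\frac{|D_j|}{|D|}\log_2\frac{|D_j|}{|D|}, \qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$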
Gain Ratio for attribute selection (C4.5)
◼ Example: Calculate the Gain Ratio value for the income
attribute. An income test divides the data into three partitions:
low, medium, high, and contains 4, 6, and 4 sets of values,
respectively. Calculate the value Gain_Ratio(income) = ?
25
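Using the partition sizes 4, 6, and 4 out of 14 tuples, together with Gain(income) = 0.029 from the earlier slide:

$$\mathrm{SplitInfo}_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} \approx 1.557$$

$$\mathrm{GainRatio}(income) = \frac{0.029}{1.557} \approx 0.019$$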
Gini Index (CART, IBM IntelligentMiner)
◼ Example: Initialize the decision tree using the Gini index. Let D be the training data, in which 9 tuples belong to the class buys_computer = yes and the remaining 5 tuples belong to the class buys_computer = no. A (root) node N is created for the tuples in D.
26
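For reference, the Gini index of a data set with class proportions p_i, evaluated here for D (9 tuples of buys_computer = yes and 5 of no):

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 \approx 0.459$$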
Calculate Gini index
◼ To find the splitting criterion for the tuples in D, we need to calculate the Gini index for each attribute.
◼ Suppose the income attribute splits D into two subsets, D1 = {low, medium} and D2 = {high}, where D1 contains 10 records and D2 the remaining 4 records.
30
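Working this out from the table (D1 = {low, medium} contains 7 ‘yes’ and 3 ‘no’ tuples, D2 = {high} contains 2 ‘yes’ and 2 ‘no’):

$$\mathrm{Gini}_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\,\mathrm{Gini}(D_1) + \frac{4}{14}\,\mathrm{Gini}(D_2)
= \frac{10}{14}\left(1 - \left(\tfrac{7}{10}\right)^2 - \left(\tfrac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\tfrac{2}{4}\right)^2 - \left(\tfrac{2}{4}\right)^2\right) \approx 0.443$$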
Overfitting and Tree Pruning
◼ Overfitting: an induced tree may overfit the training data
◼ Too many branches, some of which may reflect anomalies due to noise or outliers
◼ Approaches to avoid overfitting
◼ Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold.
◼ Problem: it is difficult to choose an appropriate threshold.
32
Improvements to the basic decision tree
33
Classification in Large Databases
◼ Classification - a classic problem widely studied by statisticians
and machine learning researchers
◼ Scalable: Classify datasets with millions of examples and
hundreds of attributes at reasonable speeds
◼ Why are decision trees popular?
◼ Relatively fast learning speed (compared to other classification methods)
◼ Can be converted into simple and understandable
classification rules
◼ Can use SQL queries to access the database
◼ Has classification accuracy comparable to other methods
◼ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
34
Scalability framework for RainForest
◼ Separate the scalability aspects from the criteria that determine
the quality of the tree
◼ Create AVC list: AVC (Attribute, Value, Class Label)
◼ AVC-set (of an attribute X)
◼ The projection of the training data onto attribute X and the class label, in which the counts of the individual class labels are aggregated
◼ AVC-group (of a node n)
◼ The set of AVC-sets of all predictor attributes at node n
35
Rainforest: Training set and AVC set
37
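For example, the AVC-set of the age attribute over the buys_computer training data above collapses the 14 tuples into three value rows with aggregated class counts; a small sketch (the dictionary layout is an illustrative assumption):

```python
from collections import Counter, defaultdict

# (age, buys_computer) projected from the 14 training tuples above
age_and_class = [
    ("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
    ("31..40", "yes"), (">40", "no"),
]

avc_set = defaultdict(Counter)          # attribute value -> class-label counts
for value, label in age_and_class:
    avc_set[value][label] += 1

print(dict(avc_set))
# {'<=30': Counter({'no': 3, 'yes': 2}), '31..40': Counter({'yes': 4}),
#  '>40': Counter({'yes': 3, 'no': 2})}
```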
Decision tree
38
Classification
39
Bayesian Classification
40
Bayesian Classification: Why?
◼ Statistical classifier: Performs probabilistic prediction, i.e.
predicts the probability of class membership
◼ Theoretical basis: Based on Bayes Theorem.
◼ Performance: Simple Bayesian classifier - Bayes naïve, has
comparable performance to decision tree and selected neural
network classifiers.
◼ Incremental: Each training example can gradually
increase/decrease the probability that the hypothesis is correct -
prior knowledge can be combined with observed data
◼ Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
41
Naïve Bayes Classifier
◼ Naive Bayes classification is a probabilistic classifier based on
Bayes' theorem, with the assumption of independence among
features. Despite its simplicity, it often performs surprisingly
well in many real-world applications, especially for text
classification tasks such as spam detection, sentiment analysis,
and document categorization.
42
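Formally, for a tuple X = (x1, …, xn) and class Ci, Bayes' theorem and the naïve independence assumption give:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}, \qquad
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

and the classifier predicts the class Ci that maximizes P(X | Ci) P(Ci).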
Naïve Bayes Classifier: Training data set
Class C1: buys_computer = ‘yes’
Class C2: buys_computer = ‘no’
Training data: the buys_computer table shown earlier (14 tuples).
Example: predict the class label of the following tuple using the naïve Bayesian classifier:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
45
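Working the example out from the 14 training tuples (each conditional probability is a simple per-class count):

$$P(C_1) = P(buys\_computer = yes) = \tfrac{9}{14} \approx 0.643, \qquad P(C_2) = \tfrac{5}{14} \approx 0.357$$

$$P(X \mid C_1) = \tfrac{2}{9}\cdot\tfrac{4}{9}\cdot\tfrac{6}{9}\cdot\tfrac{6}{9} \approx 0.044, \qquad
P(X \mid C_2) = \tfrac{3}{5}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5} \approx 0.019$$

$$P(X \mid C_1)P(C_1) \approx 0.028 \;>\; P(X \mid C_2)P(C_2) \approx 0.007
\;\Rightarrow\; X \text{ is classified as } buys\_computer = yes$$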
Naïve Bayes Classifier
◼ Disadvantage
◼ It assumes class conditional independence, hence a loss of accuracy
◼ In practice, dependencies exist among the variables
50
K-Nearest Neighbors (KNN)
◼ The K-nearest neighbors algorithm (KNN) is a very simple yet
powerful machine learning model. It assigns a label to a new
sample based on the labels of its k closest samples in the
training set.
51
K-Nearest Neighbors (KNN)
◼ KNN is a lazy learner: it does not build a model from the
training data, and all of its computation is deferred until the
prediction time.
◼ KNN is also a very flexible model: it can find decision
boundaries of any shape between the classes, and can be used
for both classification and regression problems.
52
K-Nearest Neighbors Classification
◼ The idea of the KNN classification algorithm: given a new
sample, assign it to the class that is most common among its k
nearest neighbors.
◼ The training phase of the algorithm consists of only storing the
training samples and their labels, no model is built from the data.
◼ In the prediction phase, we compute the distances between the
test (query) point x and all the stored samples, find the k samples
with the smallest distances, and assign the majority label of these
samples to x.
53
K-Nearest Neighbors Classification
◼ The pseudocode of this algorithm
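A minimal Python sketch of this procedure, assuming Euclidean distance and an unweighted majority vote as described above:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by the majority label of its k nearest training samples."""
    # 1. Compute the distance from the query to every stored training sample
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Keep the k samples with the smallest distances
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Return the most common label among those neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```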
55
K-Nearest Neighbors Classification
◼ How KNN Works
◼ Step 6: Making predictions:
◼ Classification: the class of the new data point is determined by a majority vote among its k nearest neighbors.
56
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
57
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Guide: Compute the distance from (1, 0, 1) to each one of the
given points. The Euclidean distance between two vectors x =
(x₁, …, xₙ) and y = (y₁, …, yₙ) is defined as:
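In symbols:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$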
58
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ Since we are only interested in finding the closest points to (1, 0, 1), we can save some computation by comparing squared Euclidean distances instead.
59
K-Nearest Neighbors Classification
◼ Example: Given the following samples, classify the vector (1, 0,
1) using KNN with k = 3 and Euclidean distance.
◼ From the table above we can see that the three closest points to
(1, 0, 1) are points 3, 5 and 6. The majority class of these points
is 1, therefore (1, 0, 1) will be classified as 1.
60
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.
◼ If k is too small, then the algorithm may be sensitive to noise
(mislabeled samples in the training set), hence it will have high
variance.
◼ If k is too large, the neighborhood of the query point may include
too many points from irrelevant classes, and as a result the model
will have high bias.
◼ In the extreme case, when k = n, then the prediction of the
algorithm will always be the same (the majority of the labels in
the entire training set).
61
Choosing the Number of Neighbors
◼ The choice of the number of nearest neighbors k is crucial as it
can have a large impact on the performance of the algorithm.
63
Use cases of KNN
◼ Recommendation Systems: Recommending items to users based
on the preferences of similar users.
◼ Image Recognition: Classifying images based on the similarity
to known images.
◼ Anomaly Detection: Identifying unusual data points by
comparing them to the rest of the dataset.
◼ Medical Diagnosis: Predicting disease outcomes based on patient similarities.
64
Tips for using KNN
◼ Scaling: Standardize or normalize your features, as KNN is
sensitive to the scale of data.
◼ Choosing K: Use techniques like cross-validation to select the optimal value of K.
◼ Dimensionality Reduction: Apply methods like PCA (Principal
Component Analysis) to reduce dimensionality if you have high-
dimensional data.
65
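The three tips above can be combined in a single scikit-learn pipeline; a hedged sketch (the data `X, y`, the PCA setting, and the candidate K values are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Scaling -> dimensionality reduction -> KNN, with K chosen by cross-validation
pipeline = Pipeline([
    ("scale", StandardScaler()),          # KNN is sensitive to feature scale
    ("pca", PCA(n_components=0.95)),      # keep 95% of the variance (placeholder)
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
# search.fit(X, y)   # X, y: the feature matrix and labels of your data set
# print(search.best_params_, search.best_score_)
```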
Outlines
66
Evaluate and select models
◼ Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
◼ Use a test set of class-labeled tuples, rather than the training set, when evaluating accuracy
◼ Methods for estimating classifier accuracy
◼ Holdout method, random subsampling
◼ Cross-validation
◼ Bootstrap
◼ Compare classifiers:
◼ Confidence interval
◼ Cost-benefit analysis and ROC Curve
67
Classifier evaluation metrics: Confusion
matrix
◼ True positives (TP): positive tuples that were correctly labeled by the classifier.
◼ True negatives (TN): negative tuples that were correctly labeled by the classifier.
◼ False positives (FP): negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicts buys_computer = yes).
◼ False negatives (FN): positive tuples that were incorrectly labeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicts buys_computer = no).
68
Classifier evaluation metrics: Confusion
matrix
69
Classifier evaluation metrics: Confusion
matrix
Confusion Matrix:
Actual class \ Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
71
Classification evaluation metrics:
Accuracy, Error rate, Sensitivity

A \ P   C    ¬C
C       TP   FN   P
¬C      FP   TN   N
        P’   N’   All

◼ Class imbalance problem:
◼ One class may be rare (e.g., fraud)
◼ The negative class makes up the significant majority of the tuples
72
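With the counts in the matrix above (P = TP + FN, N = FP + TN, All = P + N), the usual definitions are:

$$\mathrm{accuracy} = \frac{TP + TN}{All}, \quad
\mathrm{error\ rate} = \frac{FP + FN}{All}, \quad
\mathrm{sensitivity} = \frac{TP}{P}, \quad
\mathrm{specificity} = \frac{TN}{N}$$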
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
◼ Precision: the percentage of tuples that the classifier labels as positive that are actually positive
73
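In terms of the confusion-matrix counts:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad
F_\beta = \frac{(1+\beta^2)\,\mathrm{precision}\times\mathrm{recall}}{\beta^2\,\mathrm{precision} + \mathrm{recall}}$$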
Classifier Evaluation Metrics
74
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Holdout method
75
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Holdout method
76
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Cross-validation (k-fold, k = 10 is most common)
77
Evaluating classifier accuracy: Holdout &
Cross-Validation methods
◼ Cross-validation (k-fold, k = 10 is most common)
◼ Randomly partition the data into k mutually exclusive subsets (folds); in each round, one fold is used as the test set and the remaining k−1 folds as the training set
◼ Leave-one-out: k-fold with k = the number of tuples, for small data sets
78
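A hedged scikit-learn sketch of the holdout and 10-fold procedures (the data `X, y` and the decision-tree model are placeholders):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# Holdout: e.g. 2/3 of the labeled tuples for training, 1/3 held out for testing
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
# acc_holdout = clf.fit(X_train, y_train).score(X_test, y_test)

# 10-fold cross-validation: each tuple is used 9 times for training, once for testing
# scores = cross_val_score(clf, X, y, cv=10)
# print(scores.mean())
```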
Estimating confidence intervals:
Classification model M1 vs. M2
◼ Suppose there are two classifiers, M1 and M2: which model is better?
◼ Use 10-fold cross-validation to obtain their mean error rates
◼ These average error rates are only estimates of the error on the true population of future data instances
◼ What if the difference between the two error rates is just due to
chance?
◼ Use statistical significance testing
79
Estimating confidence intervals: NULL
hypothesis
◼ Perform 10-fold cross-validation
◼ Assume the samples follow a t-distribution with k -1 degrees of
freedom (k=10)
◼ Use t-test (or Student’s t-test)
◼ Null Hypothesis: M1 & M2 are the same
◼ If we reject the null hypothesis, then
◼ It is concluded that the difference between M1 and M2 is
statistically significant
◼ Choose the model with a lower error rate
80
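One common form of this test, under the stated assumptions (paired 10-fold cross-validation, with d_i the difference in error rates of M1 and M2 on fold i):

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1 - M_2)/k}}, \qquad
\mathrm{var}(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\big(d_i - \bar{d}\big)^2$$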
Estimation of confidence interval: t-test
◼ The t-distribution is symmetric
◼ Significance level: sig = 0.05 or 5% means we conclude that M1 & M2 differ with 95% confidence
◼ Confidence limits: z = sig/2
82
Estimating confidence intervals: Statistical significance
◼ Are M1 & M2 significantly different?
◼ Compute t and choose a significance level (e.g., sig = 5%)
◼ If the computed t falls in the rejection region, conclude that the difference between M1 & M2 is statistically significant
◼ Otherwise, conclude that any differences are coincidental
83
Model selection: ROC Curves
◼ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
◼ Derived from signal detection theory
◼ Shows the trade-off between the true positive rate and the false positive rate
◼ The area under the ROC curve is a measure of model accuracy
◼ Rank the test tuples in descending order: the tuple most likely to belong to the positive class appears at the top of the list
◼ The closer the curve is to the diagonal (i.e. the closer the area is to 0.5), the less accurate the model
◼ The vertical axis represents the true positive rate
◼ The horizontal axis represents the false positive rate
◼ The chart also shows a diagonal line
◼ A model with perfect accuracy has an area of 1.0
84
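Given a classifier that outputs a score or probability for the positive class, the curve and its area can be obtained as in this hedged sketch (the labels and scores below are illustrative placeholders):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative values only: true labels and predicted positive-class probabilities
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # 1.0 = perfect, 0.5 = diagonal
print(list(zip(fpr, tpr)), auc)
```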
Model selection: ROC Curves
85
Model selection: ROC Curves
89
Summary
◼ Classification is a form of data analysis that extracts models that
describe important classes of data
◼ Efficient and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
◼ Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
◼ Stratified k-fold cross-validation is recommended for estimating
accuracy. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
90
Summary
◼ Classification is a form of data analysis that extracts models that describe classes of data. A classifier, or classification model, predicts categorical labels (classes), while numeric prediction models continuous-valued functions. Classification and numeric prediction are the two main types of prediction problems.
◼ Decision tree induction is a top-down recursive tree induction
algorithm that uses an attribute selection metric to select the
attribute to be tested for each non-leaf node in the tree. ID3, C4.5
and CART are algorithms that use different attribute selection
measures. Tree pruning algorithms attempt to improve accuracy
by removing tree branches that reflect noise in the data. Scalable algorithms, such as RainForest, have been proposed for decision tree induction on very large data sets. 91
Summary
◼ Naïve Bayesian Classification is based on Bayes' theorem of
posterior probability. It assumes class conditional independence
— that the influence of an attribute value on a given class is
independent of the values of other attributes.
92
Summary
◼ The confusion matrix can be used to evaluate the quality of the
classifier. For a two-class problem, it indicates true positives,
true negatives, false positives, and false negatives. Metrics to
evaluate the predictive ability of the classifier include accuracy,
sensitivity, specificity, precision, F and Fβ.
◼ Building and evaluating a classifier requires partitioning the
labeled data into a training set and a testing set. Holdout, random
sampling, cross-validation and bootstrapping are typical methods
used for such partitioning.
93
Summary
◼ Significance testing and ROC curves are useful tools for model
selection. Significance testing can be used to evaluate whether
the difference in accuracy between two classifiers is due to
chance. The ROC curve plots the true positive rate (or
sensitivity) against the false positive rate (or 1 - specificity) of
one or more classifiers.
94