
CSC-325

Artificial Intelligence

Lecture - 8:
Supervised Learning

Dr. Muhammad Tariq Siddique


Department of Computer Sciences
Bahria University, Karachi Campus, Pakistan
[email protected]
Faculty room 14, 2nd Floor, Iqbal Block
Topics to be covered
1 Bayesian Classification

2 Naïve Bayes Classifier

3 K-Nearest Neighbors

4 Classifier Evaluation

5 Tutorial
Bayesian Classification
Section - 1
Bayesian Classifier: Why?
▪ Bayesian classifiers are statistical classifiers:
• They predict class membership probabilities.
• They are based on Bayes' Theorem.
• They exhibit high accuracy and speed when applied to large databases.
▪ The simplest Bayesian classifier is known as the naïve Bayesian classifier.
• It assumes that the effect of an attribute value on a given class is independent of
the values of the other attributes: class conditional independence.
▪ Bayesian belief networks are graphical models that allow the representation
of dependencies among subsets of attributes.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 4


Bayesian Theorem
▪ Let X be a data tuple (X is considered the "evidence").
▪ Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
▪ We want to determine P(H|X), the probability that the hypothesis H holds given the "evidence",
i.e., the observed data tuple X.
▪ P(H|X) is the posterior probability of H conditioned on X.
▪ P(H) is the prior probability of H.
▪ P(X|H) is the posterior probability of X conditioned on H.
▪ P(X) is the prior probability of X.
▪ How are these probabilities related? By Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 5
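As a quick numerical illustration (a sketch in Python, with made-up probabilities that are not part of the lecture), Bayes' theorem can be applied directly:

# A minimal sketch of Bayes' theorem with hypothetical numbers.
def bayes_posterior(p_x_given_h, p_h, p_x):
    """Return P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical example: P(X|H) = 0.8, P(H) = 0.3, P(X) = 0.4
print(bayes_posterior(0.8, 0.3, 0.4))  # 0.6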
Naïve Bayesian Classifier
▪ A simple Bayesian classifier known as the naïve Bayesian classifier
▪ Assumes that the effect of an attribute value on a given class is
independent of the values of the other attributes: class conditional
independence.
▪ It is made to simplify the computation involved and, in this sense, is
considered “naïve”.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 6


Naïve Bayesian Classifier
▪ Let D be a training set of tuples and their associated class labels.
▪ Suppose there are m classes C1, C2, …, Cm.
▪ Given a tuple X = (x1, x2, …, xn), depicting n measurements made on the tuple from n
attributes, the classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X.
▪ The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
▪ That is, we maximize P(Ci|X).
▪ The class Ci for which P(Ci|X) is maximized is called the maximum posteriori
hypothesis.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 7


Naïve Bayes Classifier
• By Bayes' Theorem,
P(Ci|X) = P(X|Ci)P(Ci)/P(X)
• As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized.
• The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• The class Ci for which P(Ci|X) is maximized is called the maximum posteriori
hypothesis.
• The class prior probabilities may be estimated by
P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci in D.
• In order to reduce the computation involved in evaluating P(X|Ci), the naïve assumption of class
conditional independence is made:
P(X|Ci) = Π (k = 1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers.
© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 8
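The counting scheme above can be sketched in a few lines of Python; the helper names and the data encoding are illustrative assumptions, not part of the lecture:

# A minimal sketch of naïve Bayes training and scoring for categorical attributes.
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and the counts needed for P(xk|Ci)."""
    n = len(rows)
    class_counts = Counter(labels)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    cond = defaultdict(Counter)                      # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, class_counts

def score(x, c, priors, cond, class_counts):
    """Return P(X|Ci) * P(Ci) under class-conditional independence."""
    p = priors[c]
    for k, value in enumerate(x):
        p *= cond[(c, k)][value] / class_counts[c]   # P(xk|Ci)
    return p

def predict(x, priors, cond, class_counts):
    """Pick the class that maximizes P(X|Ci) * P(Ci)."""
    return max(priors, key=lambda c: score(x, c, priors, cond, class_counts))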
Naïve Bayesian Classifier: Training Dataset
Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Test data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
X → buys computer?

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 9


Naïve Bayesian Classifier: An Example
▪ Training data: the buys_computer dataset on the previous slide.
▪ X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Conditional probabilities estimated from the training data:

                      buys_computer = Yes   buys_computer = No
age <=30                     2/9                   3/5
age 31…40                    4/9                   0/5
age >40                      3/9                   2/5
income high                  2/9                   2/5
income medium                4/9                   2/5
income low                   3/9                   1/5
student yes                  6/9                   1/5
student no                   3/9                   4/5
credit excellent             3/9                   3/5
credit fair                  6/9                   2/5
prior P(class)               9/14                  5/14

▪ Likelihood of "yes" = 2/9 x 4/9 x 6/9 x 6/9 x 9/14 = 0.02821
▪ Likelihood of "no" = 3/5 x 2/5 x 1/5 x 2/5 x 5/14 = 0.0068
▪ Turn these into probabilities by normalizing the likelihood values:
▪ probability of "yes" = 0.02821 / (0.02821 + 0.0068) = 0.8057
▪ probability of "no" = 0.0068 / (0.02821 + 0.0068) = 0.1942
Since the probability of "yes" is 0.8057, X belongs to the class "buys_computer = yes".
© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 10
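The arithmetic above is easy to verify in Python; this sketch simply re-multiplies the fractions from the table (note that the slide rounds the "no" likelihood to 0.0068, which shifts the normalized values slightly):

# Worked example check for X = (age <= 30, income = medium, student = yes, credit_rating = fair).
like_yes = (2/9) * (4/9) * (6/9) * (6/9) * (9/14)   # P(X|yes) * P(yes)
like_no  = (3/5) * (2/5) * (1/5) * (2/5) * (5/14)   # P(X|no)  * P(no)

p_yes = like_yes / (like_yes + like_no)
p_no  = like_no  / (like_yes + like_no)
print(round(like_yes, 5), round(like_no, 5))   # 0.02822 0.00686
print(round(p_yes, 4), round(p_no, 4))         # 0.8045 0.1955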
Naïve Bayesian Classifier: Text Classification
The NB classifier is used extensively in text categorization. A text document is represented using a bag-of-words
representation, which contains the set of words in the document without specifying their order of occurrence. A tiny
training set containing 4 documents from two classes, namely "Bio" and "CS", is shown in the table below:

Table: A small collection of tiny documents


DocID Words Label
1 Algorithms, Tree, Graph CS
2 Tree, Life, Gene, Algorithms Bio
3 Graphs, NP, Algorithms, Tree CS
4 Protein, Assay, Cell Bio

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 11


Naïve Bayesian Classifier: Text Classification
Here, we assume that P(word | label) is distributed according to a multinomial distribution.
For a new document Xnew containing the words (Algorithms, Tree), the class label is assigned based on
the highest probability of belonging to the class. This probability can be calculated using the estimated
prior and class-conditional probabilities of each word.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 12


Naïve Bayesian Classifier: Text Classification
Now normalization

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 13
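Since the slides' intermediate numbers are not reproduced here, the following is only a sketch of multinomial naïve Bayes on the tiny Bio/CS corpus; lower-casing the words and using add-one (Laplace) smoothing are my assumptions:

# Multinomial naïve Bayes sketch for the tiny "Bio" vs "CS" document collection.
from collections import Counter

docs = {
    "CS":  ["algorithms tree graph", "graphs np algorithms tree"],
    "Bio": ["tree life gene algorithms", "protein assay cell"],
}

vocab = {w for texts in docs.values() for t in texts for w in t.split()}
total_docs = sum(len(t) for t in docs.values())
priors = {c: len(t) / total_docs for c, t in docs.items()}
word_counts = {c: Counter(w for t in texts for w in t.split()) for c, texts in docs.items()}

def score(words, c):
    """Unnormalized P(c) * product of P(w|c), with add-one smoothing."""
    denom = sum(word_counts[c].values()) + len(vocab)
    p = priors[c]
    for w in words:
        p *= (word_counts[c][w] + 1) / denom
    return p

x_new = ["algorithms", "tree"]
scores = {c: score(x_new, c) for c in docs}
z = sum(scores.values())
print({c: round(s / z, 3) for c, s in scores.items()})  # e.g. {'CS': 0.692, 'Bio': 0.308}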


Naïve Bayesian Classifier: Exercise 1
Suppose you are given a dataset. Apply the Naïve Bayes classifier, showing the complete steps, to assign a class label
to the given instance.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 14


Naïve Bayesian Classifier: Exercise 2
The human resources department at company B is seeking a higher efficiency in selecting suitable employees.
To do so, the naïve Bayes technique was employed to classify whether a new job candidate is
"employable" or "not employable".
The goal is to classify a job candidate for employment. This candidate has the following predictors:
experience over 5 years, his major is Engineering, and his IELTS score is "medium".

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 15


Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted
probability will be zero:

P(X|Ci) = Π (k = 1 to n) P(xk|Ci)

• Ex. Suppose a dataset with 1000 tuples, where income = low (0), income = medium (990), and income = high
(10).
• Use the Laplacian correction (or Laplacian estimator):
– Add 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 16
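A one-line dictionary comprehension (a sketch of the add-one idea, not from the slides) reproduces the corrected estimates for the income example:

# Laplacian (add-one) correction for the income attribute with 1000 tuples.
counts = {"low": 0, "medium": 990, "high": 10}
V = len(counts)            # number of distinct attribute values (3)
N = sum(counts.values())   # 1000 training tuples

smoothed = {v: (c + 1) / (N + V) for v, c in counts.items()}
print(smoothed)            # 1/1003, 991/1003, 11/1003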


Naïve Bayes Classifier: Comments
▪ Advantages
• Easy to implement
• Good results obtained in most of the cases
▪ Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables, e.g.:
❑ Hospital patients' profile: age, family history, etc.
❑ Symptoms: fever, cough, etc.
❑ Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by the Naïve Bayes Classifier
▪ How to deal with these dependencies? Bayesian Belief Networks

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 17


K-Nearest Neighbor (kNN)
Section - 2
K-Nearest Neighbour (KNN)
• K-Nearest Neighbors is one of the simplest supervised machine learning
algorithms used for classification.
• It classifies a data point based on its neighbors’ classifications.
• It stores all available cases and classifies new cases based on similar
features.
• K in KNN is a parameter that refers to the number of nearest neighbors in
the majority voting process.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 19


kNN: Example
Consider a dataset that contains two variables: height (cm) & weight (kg). Each point is classified as
normal or underweight.
Training data:

Weight (kg)   Height (cm)   Class
51            167           Underweight
62            182           Normal
69            176           Normal
64            173           Normal
65            172           Normal
56            174           Underweight
58            169           Normal
57            173           Normal
55            170           Normal

Now, we have a new data point, and we need to determine its class.

Test data: Weight = 57 kg, Height = 170 cm, Class = ?


© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 20
kNN: Example (Solution)
Query point P = (57, 170)

Weight (kg)   Height (cm)   Class         Euclidean Distance   Rank
51            167           Underweight   6.7                  5
62            182           Normal        13.0                 8
69            176           Normal        13.4                 9
64            173           Normal        7.6                  6
65            172           Normal        8.2                  7
56            174           Underweight   4.1                  4
58            169           Normal        1.4                  1
57            173           Normal        3.0                  3
55            170           Normal        2.0                  2

With K = 3, the three nearest neighbors (ranks 1–3) are all classified as Normal, so by majority vote the
KNN algorithm assigns the data point (57, 170) the class Normal.
© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 21
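The same result can be obtained with a short Python sketch (not from the slides) that computes the distances, sorts them, and takes a majority vote over the K = 3 nearest neighbors:

# KNN sketch for the weight/height example and the query point (57, 170).
import math
from collections import Counter

train = [
    (51, 167, "Underweight"), (62, 182, "Normal"), (69, 176, "Normal"),
    (64, 173, "Normal"), (65, 172, "Normal"), (56, 174, "Underweight"),
    (58, 169, "Normal"), (57, 173, "Normal"), (55, 170, "Normal"),
]

def knn_predict(query, train, k=3):
    # Sort training points by Euclidean distance to the query point.
    dists = sorted((math.dist(query, (w, h)), label) for w, h, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((57, 170), train, k=3))   # Normal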
Model Evaluation
Section - 3
Model Evaluation: Confusion Matrix
                     Predicted: Yes           Predicted: No
Actual: Yes          True Positive (TP)       False Negative (FN)
Actual: No           False Positive (FP)      True Negative (TN)

True Positives (TP) - These are the correctly predicted positive values which means that the
value of actual class is positive, and the value of predicted class is also
positive.
True Negatives (TN) - These are the correctly predicted negative values which means that the
value of actual class is negative, and value of predicted class is also
negative.
False Positives (FP) – When actual class is negative and predicted class is positive.
False Negatives (FN) – When actual class is positive but predicted class is negative.

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 23


Model Evaluation: Confusion Matrix
▪ Accuracy is simply the ratio of correctly predicted observations to the total observations.
• Accuracy = (TP+TN)/(TP+FP+FN+TN)
▪ Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations. High precision relates to a low false positive rate.
• Precision = TP/(TP+FP)
▪ Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
• Recall = TP/(TP+FN)
▪ F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false
positives and false negatives into account.
▪ F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy
works best if false positives and false negatives have similar cost.
• F1 Score = 2*(Recall * Precision) / (Recall + Precision)

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 24
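These four formulas translate directly into a small helper function (a sketch, not from the slides); the counts used below are the TP/FP/FN/TN values from the student example on the next slide:

# Compute accuracy, precision, recall and F1 from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(12, 2, 1, 5))   # (0.85, 0.857..., 0.923..., 0.888...)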


How can we tabulate the results in a confusion matrix?

Student ID   Actual Class   Predicted Class   Outcome
1            P              P                 TP
2            P              P                 TP
3            P              P                 TP
4            N              N                 TN
5            N              P                 FP
6            P              P                 TP
7            P              P                 TP
8            N              N                 TN
9            P              P                 TP
10           P              P                 TP
11           N              P                 FP
12           P              N                 FN
13           N              N                 TN
14           P              P                 TP
15           P              P                 TP
16           N              N                 TN
17           P              P                 TP
18           P              P                 TP
19           N              N                 TN
20           P              P                 TP

Confusion matrix:
              Predicted P   Predicted N
Actual P      12            1
Actual N      2             5

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (12 + 5) / (12 + 5 + 2 + 1) = 17/20 = 0.85
Precision = TP / (TP + FP) = 12 / (12 + 2) = 12/14 = 0.857
Recall = TP / (TP + FN) = 12 / (12 + 1) = 12/13 = 0.92
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * 0.857 * 0.92 / (0.857 + 0.92) = 0.89

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 25
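If scikit-learn is available, the same confusion matrix and scores can be checked directly from the actual/predicted label lists (a sketch; installing scikit-learn is assumed):

# Cross-check of the student example with scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

actual    = ["P","P","P","N","N","P","P","N","P","P","N","P","N","P","P","N","P","P","N","P"]
predicted = ["P","P","P","N","P","P","P","N","P","P","P","N","N","P","P","N","P","P","N","P"]

print(confusion_matrix(actual, predicted, labels=["P", "N"]))   # [[12 1] [2 5]]
print(accuracy_score(actual, predicted))                        # 0.85
print(precision_score(actual, predicted, pos_label="P"))        # 0.857...
print(recall_score(actual, predicted, pos_label="P"))           # 0.923...
print(f1_score(actual, predicted, pos_label="P"))               # 0.888...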


Classifier Accuracy Measures: Another Example
Confusion matrix (actual vs. predicted counts):

classes                      Predicted: buy_computer = yes   Predicted: buy_computer = no   total
Actual: buy_computer = yes   6954 (TP)                       46 (FN)                        7000
Actual: buy_computer = no    412 (FP)                        2588 (TN)                      3000
total                        7366                            2634                           10000

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
           = (6954 + 2588) / 10000 = 0.9542
• Precision = TP / (TP + FP)
            = 6954 / 7366 = 0.9441
• Recall = TP / (TP + FN)
         = 6954 / 7000 = 0.9934
• F1-score = 2 * (Precision * Recall) / (Precision + Recall)
           = 2 * 0.9441 * 0.9934 / (0.9441 + 0.9934)
           = 0.9681
© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 26
Classifier Accuracy Measures: Multi-class Example
Q1: compute the accuracy score
Q2: compute F-Measure (F1) of class b

Solution:

accuracy = (10+4+10+6)/(30+4+4+2) = 30/40 = 0.75

precision(b) = 4/(4+4+2) = 0.4 [precision = p]

recall(b) = 4/(4+4) = 0.5 [recall = r]

F1(b) = 2*(p*r)/(p+r) = 2*(0.2)/(0.9) = 0.44

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 27


Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves:
used for visual comparison of classification models
• Originated from signal detection theory
• Show the trade-off between the true positive
rate and the false positive rate
• The area under the ROC curve is a measure
of the accuracy of the model
• The vertical axis represents the true positive rate
• The horizontal axis represents the false positive rate
• The plot also shows a diagonal line (a random classifier)
• A model with perfect accuracy will have an
area of 1.0

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 28


How to Plot ROC Curve?
• An ROC (receiver operating
characteristic) curve is a
graph showing the
performance of a
classification model at all
classification thresholds.
• This curve plots two
parameters:

• True Positive Rate: TPR = TP / (TP + FN)

• False Positive Rate: FPR = FP / (FP + TN)

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 29
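In practice the curve is produced from predicted scores at many thresholds; the sketch below (placeholder data, and it assumes scikit-learn and matplotlib are installed) shows one common way to do it:

# ROC curve sketch using scikit-learn; y_true / y_score are made-up illustration data.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]                          # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3, 0.75, 0.55]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR and TPR at each threshold
print("Area under the curve =", auc(fpr, tpr))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")    # the diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()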


Tutorial
Section - 4
TUTORIAL 08
1. Why do we need statistical classifiers?
2. What is Bayes' Theorem? How is it related to Naïve Bayes?
3. Derive the Naïve Bayes classifier from Bayes' Theorem.
4. What are the limitations of Naïve Bayes?
5. Explain the steps of the k-nearest neighbors algorithm.
6. How is the accuracy of a classifier measured?
7. What is the role of the ROC curve in model selection?

© Dr. Tariq 2024 Department of Computer Sciences | Bahria University 31


Artificial Intelligence CSC-325

Lecture – 8: Supervised Learning

Thank You. Any Questions?

Dr. Muhammad Tariq Siddique


Department of Computer Sciences
Bahria University, Karachi Campus, Pakistan
[email protected]
Faculty room 14, 2nd Floor, Iqbal Block
