Classification
I) Instance-based methods:
1) Nearest neighbour
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear Models:
1) Perceptron
2) Support Vector Machine
IV) Decision Models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest
Data sets
• Each training data point is represented as a pair (x, y):
• where x is a description of the object: a feature vector
• where y is a label (assumed binary for now: y ∈ {+1, -1})
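A minimal sketch of this representation, assuming numeric features stored as NumPy arrays (the values are illustrative):

```python
import numpy as np

# Each training example is a pair (x, y): a feature vector x and a binary label y in {+1, -1}.
X_train = np.array([
    [1.0, 2.0],   # feature vector of the first object
    [0.5, 1.5],   # feature vector of the second object
    [3.0, 0.5],   # feature vector of the third object
])
y_train = np.array([+1, +1, -1])  # one label per row of X_train
```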
Nearest neighbour classifier
• Remember (store) all the training data
• Choose a distance function
• When a test data point arrives, find its nearest neighbour (the closest point) in the training data
• Assign the label of that nearest neighbour to the test data point (see the sketch below)
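A minimal sketch of the 1-nearest-neighbour rule described above, assuming Euclidean distance and NumPy arrays (the function name and toy data are illustrative):

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_test):
    """Return the label of the training point closest to x_test (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_test, axis=1)  # distance to every stored point
    return y_train[np.argmin(distances)]                  # copy the closest point's label

# Reusing the toy data from the previous sketch.
X_train = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.5]])
y_train = np.array([+1, +1, -1])
print(nearest_neighbour_predict(X_train, y_train, np.array([2.5, 1.0])))   # -> -1
```

Note that the classifier does no work at training time; all the cost is paid at prediction time, when distances to every stored point are computed.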
Nearest neighbour classification boundary
Disadvantage:
Overfitting: the classifier memorises all the training points.
K-NN Classifier
• Find the k nearest neighbours
• Have them vote
• Has a smoothing effect
• This is especially good when there is noise
in the class labels.
As K increases:
• The classification boundary becomes smoother
• The training error can increase (what happens if K = N?)
Select K by cross-validation (see the sketch below).
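A sketch of K-NN with majority voting, with K chosen by leave-one-out cross-validation; the helper names and toy data are illustrative, not from the source:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def select_k_by_loocv(X_train, y_train, candidate_ks):
    """Pick the K with the fewest leave-one-out cross-validation errors."""
    def loo_errors(k):
        errors = 0
        for i in range(len(X_train)):
            X_rest = np.delete(X_train, i, axis=0)   # hold out point i
            y_rest = np.delete(y_train, i)
            if knn_predict(X_rest, y_rest, X_train[i], k) != y_train[i]:
                errors += 1
        return errors
    return min(candidate_ks, key=loo_errors)

# Illustrative toy data: two noisy clusters.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.9],
                    [1.0, 1.0], [0.9, 1.1], [1.0, 0.0]])
y_train = np.array([+1, +1, +1, -1, -1, -1])
best_k = select_k_by_loocv(X_train, y_train, candidate_ks=[1, 3, 5])
print(best_k, knn_predict(X_train, y_train, np.array([0.15, 0.2]), best_k))
```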
Probabilistic Model
Class C1 = Green, class C2 = Red
Decision rule using only the prior probabilities:
Label = +1 (class C1) if P(C1) > P(C2)
Label = -1 (class C2) if P(C2) > P(C1)
P(error) = min{ P(C1), P(C2) }
This rule does not use any feature information (a sketch follows below).
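A minimal sketch of this prior-only rule (the prior values are illustrative assumptions):

```python
# Prior-only decision rule: with no feature information, predict the class
# with the larger prior; the best achievable error is then min{P(C1), P(C2)}.
p_c1, p_c2 = 0.7, 0.3            # illustrative priors for C1 (Green) and C2 (Red)
label = +1 if p_c1 > p_c2 else -1
p_error = min(p_c1, p_c2)        # = 0.3 here
print(label, p_error)
```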
Using x information
Figure: the class-conditional densities p(x|C1) (= p(x|y = +1)) and p(x|C2) (= p(x|y = -1)), with a new test point x_t to be classified.
Bayes Classifier
Classification on the posterior:
Label = C1 if P(C1 | x_t) > P(C2 | x_t), i.e. if p(x_t | C1) P(C1) > p(x_t | C2) P(C2)
Label = C2 if P(C2 | x_t) > P(C1 | x_t), i.e. if p(x_t | C2) P(C2) > p(x_t | C1) P(C1)
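A sketch of the posterior rule, assuming, purely for illustration, one-dimensional Gaussian class-conditional densities with known parameters (nothing here is prescribed by the source):

```python
from scipy.stats import norm

# Illustrative (assumed) class-conditional densities and priors.
p_c1, p_c2 = 0.5, 0.5
density_c1 = norm(loc=0.0, scale=1.0)   # p(x | C1)
density_c2 = norm(loc=2.0, scale=1.0)   # p(x | C2)

def bayes_classify(x_t):
    """Label = C1 if p(x_t|C1) P(C1) > p(x_t|C2) P(C2), else C2."""
    score_c1 = density_c1.pdf(x_t) * p_c1
    score_c2 = density_c2.pdf(x_t) * p_c2
    return "C1" if score_c1 > score_c2 else "C2"

print(bayes_classify(0.4))   # closer to the C1 mean -> C1
print(bayes_classify(1.8))   # closer to the C2 mean -> C2
```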
Risk or generalisation loss
Expected cost (= the average cost over many examples).
For a single point: P(error | x) = min{ P(C1 | x), P(C2 | x) }
Overall (integrating the joint): P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx
The Bayes decision boundary is optimal: it minimises the average probability of error, i.e. the risk.
Figure: the joint terms p(x|C1)P(C1) and p(x|C2)P(C2) plotted against x; x is assigned to C1 where p(x|C1)P(C1) > p(x|C2)P(C2) and to C2 where p(x|C2)P(C2) > p(x|C1)P(C1), and the area under the smaller of the two curves gives P(error).
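Continuing the illustrative Gaussian assumption above, the Bayes error can be checked numerically by integrating the pointwise minimum of the two joint terms:

```python
from scipy.stats import norm
from scipy.integrate import quad

# Same illustrative densities and priors as in the previous sketch.
p_c1, p_c2 = 0.5, 0.5
density_c1 = norm(loc=0.0, scale=1.0)
density_c2 = norm(loc=2.0, scale=1.0)

# P(error) = integral over x of min{ p(x|C1) P(C1), p(x|C2) P(C2) } dx:
# at each x the Bayes rule errs exactly on the class with the smaller joint term.
bayes_error, _ = quad(
    lambda x: min(density_c1.pdf(x) * p_c1, density_c2.pdf(x) * p_c2),
    -10.0, 12.0,   # wide enough to cover essentially all of the probability mass
)
print(bayes_error)   # roughly 0.159 for this pair of unit-variance Gaussians
```

For these two unit-variance Gaussians with means 0 and 2 and equal priors, the integral comes out near 0.16, and no other decision boundary can do better.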
Naive Bayes Classifier
Assumption: the features are conditionally independent given the class
FLU   chills   headache   fever
N     Y        M          Y
Y     Y        N          N
Y     Y        S          Y
Y     N        M          Y
N     N        N          N
Y     N        S          Y
N     N        S          N
Y     Y        M          Y
?     Y        M          N      (new point to classify)
Estimated probabilities from the table:
P(Ch=Y | Flu=Y) = 3/5    P(Ch=N | Flu=Y) = 2/5
P(Ch=Y | Flu=N) = 1/3    P(Ch=N | Flu=N) = 2/3
P(Fe=Y | Flu=Y) = 4/5    P(Fe=N | Flu=Y) = 1/5
P(Fe=Y | Flu=N) = 1/3    P(Fe=N | Flu=N) = 2/3
P(Ha=M | Flu=Y) = 2/5    P(Ha=N | Flu=Y) = 1/5    P(Ha=S | Flu=Y) = 2/5
P(Ha=M | Flu=N) = 1/3    P(Ha=N | Flu=N) = 1/3    P(Ha=S | Flu=N) = 1/3
Priors: P(Flu=Y) = 5/8,  P(Flu=N) = 3/8

Classifying the new point (Ch=Y, Ha=M, Fe=N), with z = P(Ch=Y, Ha=M, Fe=N):

P(Flu=Y | Ch=Y, Ha=M, Fe=N)
  = P(Ch=Y, Ha=M, Fe=N | Flu=Y) P(Flu=Y) / z
  = P(Ch=Y | Flu=Y) P(Ha=M | Flu=Y) P(Fe=N | Flu=Y) P(Flu=Y) / z
  = (3/5)(2/5)(1/5)(5/8) / z = 3/(100z)

P(Flu=N | Ch=Y, Ha=M, Fe=N)
  = P(Ch=Y, Ha=M, Fe=N | Flu=N) P(Flu=N) / z
  = P(Ch=Y | Flu=N) P(Ha=M | Flu=N) P(Fe=N | Flu=N) P(Flu=N) / z
  = (1/3)(1/3)(2/3)(3/8) / z = 1/(36z)

Since 3/100 > 1/36, P(Flu=Y | Ch=Y, Ha=M, Fe=N) > P(Flu=N | Ch=Y, Ha=M, Fe=N), so the new point is classified as Flu = Y.
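A short sketch that reproduces this calculation by counting directly from the table (the variable names are illustrative):

```python
# Training rows from the table above, as (flu, chills, headache, fever).
rows = [
    ("N", "Y", "M", "Y"), ("Y", "Y", "N", "N"), ("Y", "Y", "S", "Y"),
    ("Y", "N", "M", "Y"), ("N", "N", "N", "N"), ("Y", "N", "S", "Y"),
    ("N", "N", "S", "N"), ("Y", "Y", "M", "Y"),
]
query = ("Y", "M", "N")   # new point: chills=Y, headache=M, fever=N

def score(flu_value):
    """Unnormalised posterior P(features | Flu) P(Flu), with conditionally independent features."""
    matching = [r for r in rows if r[0] == flu_value]
    prior = len(matching) / len(rows)                       # P(Flu = flu_value)
    likelihood = 1.0
    for feature_index, feature_value in enumerate(query, start=1):
        count = sum(1 for r in matching if r[feature_index] == feature_value)
        likelihood *= count / len(matching)                 # e.g. P(Ch=Y | Flu=Y) = 3/5
    return prior * likelihood

print(score("Y"))   # 3/100  = 0.030
print(score("N"))   # 1/36  ~= 0.028  -> the larger score gives Flu = Y
```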