Pattern Classification
03. Pattern Classification Methods
AbdElMoniem Bayoumi, PhD
Spring 2023
Acknowledgment
• These slides were created based on the lecture notes of Prof. Dr. Amir Atiya
Minimum Distance Classifier
• Choose a center or a representative pattern from each class → V(k), where k is the class index
[Figure: class centers V(1), V(2), V(3) for classes C1, C2, C3 in the (X1, X2) feature plane]
Minimum Distance Classifier
• Given a pattern 𝑋 that we would like to
classify
[Figure: the three class centers V(1), V(2), V(3) in the (X1, X2) plane, together with the pattern X to be classified]
Minimum Distance Classifier
• Compute the distance from 𝑋 to each
center 𝑉(𝑘):
$d^2(k) = \sum_{i=1}^{N} [V_i(k) - X_i]^2 \equiv \|V(k) - X\|^2$
[Figure: the distances d(k) from X to each of the centers V(1), V(2), V(3) in the (X1, X2) plane]
Recap: Euclidean Distance
• 2D: for points (x1, x2) and (y1, y2):
$d^2 = (y_1 - x_1)^2 + (y_2 - x_2)^2$
• N dimensions:
$d^2(X, Y) = \sum_{i=1}^{N} (Y_i - X_i)^2$
Minimum Distance Classifier
• Find $\hat{k}$ corresponding to the minimum distance:
$\hat{k} = \arg\min_{1 \le k \le K} d(k)$
• Then our classification of X is class $C_{\hat{k}}$
• 𝑋 is classified as belonging to the class
corresponding to the nearest class center
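To make the rule concrete, here is a minimal Python/NumPy sketch of the minimum distance classifier; the function and variable names (min_distance_classify, centers) are illustrative and not from the slides:

```python
import numpy as np

def min_distance_classify(x, centers):
    """Classify x as the class whose center V(k) is nearest.

    x       : feature vector of shape (N,)
    centers : array of shape (K, N); row k holds the center V(k)
    returns : index k-hat of the predicted class
    """
    # d^2(k) = ||V(k) - x||^2, computed for all K centers at once
    d2 = np.sum((centers - x) ** 2, axis=1)
    return int(np.argmin(d2))
```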
Class Center Estimation
• Let $X(m) \in C_1$; the center V(1) is estimated as:
$V(1) = \frac{1}{M_1} \sum_{m=1}^{M_1} X(m)$
where, M1 is the number of training patterns from
class C1
• This corresponds to component-wise
averaging
$V_i(1) = \frac{1}{M_1} \sum_{m=1}^{M_1} X_i(m)$
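A short sketch of this component-wise averaging, assuming the training patterns are stacked row-wise in a NumPy array (all names hypothetical):

```python
import numpy as np

def class_centers(X_train, y_train, num_classes):
    """Estimate each center V(k) as the mean of the patterns of class k.

    X_train : array of shape (M, N), one training pattern X(m) per row
    y_train : array of shape (M,) of class indices 0..K-1
    returns : array of shape (K, N); row k is V(k)
    """
    # V_i(k) = (1/M_k) * sum of X_i(m) over the M_k patterns in class k
    return np.stack([X_train[y_train == k].mean(axis=0)
                     for k in range(num_classes)])
```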
Minimum Distance Classifier
• Too simple to solve difficult problems
[Figure: elongated class regions C1 and C2 with centers V(1) and V(2); the linear decision boundary separates f(X) > 0 from f(X) < 0, and a pattern X falling on the f(X) < 0 side will be classified as C2]
Nearest Neighbor Classifier
• The class of the nearest pattern to 𝑋
determines its classification
[Figure: training patterns of classes C1 and C2 in the (X1, X2) plane, with a pattern X to classify]
Nearest Neighbor Classifier
• Compute the distance between pattern 𝑋
and each pattern 𝑋(𝑚) in the training set
$d(m) = \|X - X(m)\|^2$
• The class of the pattern m that corresponds
to the minimum distance is chosen as the
classification of 𝑋
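A minimal sketch of the nearest neighbor rule in the same NumPy style (names illustrative):

```python
import numpy as np

def nearest_neighbor_classify(x, X_train, y_train):
    """Classify x with the label of the closest training pattern.

    X_train : array of shape (M, N) of training patterns X(m)
    y_train : array of shape (M,) of their class labels
    """
    # d(m) = ||x - X(m)||^2 for every training pattern m
    d = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argmin(d)]
```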
Nearest Neighbor Classifier
• The advantage of the nearest neighbor
classifier is its simplicity
• However, a rogue (outlier) pattern can negatively affect the classification
[Figure: a lone pattern from one class lying inside the other class's region pulls the nearest-neighbor decision for X the wrong way]
Nearest Neighbor Classifier
• Also, for patterns with large overlaps
between the classes, the overlapping
patterns can negatively affect performance
[Figure: classes C1 and C2 with a large overlap in the (X1, X2) plane]
K-Nearest Neighbor Classifier
• To alleviate the problems of the NN classifier, we use the k-nearest neighbor (KNN) classifier
• Take the k-nearest points to point 𝑋
• Choose the classification of 𝑋 as the class
most often represented in these k points
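A sketch of the KNN rule under the same assumed data layout; the default k = 5 mirrors the example on the next slide:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Majority vote among the k nearest training patterns of x."""
    d = np.sum((X_train - x) ** 2, axis=1)  # squared distance to each pattern
    nearest = np.argsort(d)[:k]             # indices of the k closest patterns
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]        # class most often represented
```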
K-Nearest Neighbor Classifier
• Take k = 5
[Figure: the 5 nearest neighbors of X, with class C2 in the majority]
• One can see that C2 is the majority → classify X as C2
• The KNN rule is less dependent on rogue patterns than the nearest-neighbor classification rule
K-Nearest Neighbor Classifier
• The k-nearest neighbors could be a bit far
away from 𝑋
[Figure: with k = 10, some of the nearest neighbors of X lie far away from it]
• This means the vote may use information that is not relevant to the considered point X
Bayes Classification Rule
• Recall: histogram for feature x from class
C1 (e.g., letter ‘A’)
[Figure: histograms over the feature x; one bar counts the training patterns of letter 'A' having x = 3, another counts the training patterns of letter 'I' having x = 10]
Bayes Classification Rule
P(x|class Ci) ≡ class conditional probability function
≡ probability density of feature x, given
that x comes from class Ci
[Figure: the class-conditional densities P(x|C1) and P(x|C2) plotted over the feature x]
Bayes Classification Rule
• If $X = (X_1, X_2, \dots, X_N)^T$ is a feature vector, then:
$P(X|C_i) = P(X_1, X_2, \dots, X_N \mid C_i)$
[Figure: a 2-D example, i.e., two features X1 and X2]
Bayes Classification Rule
• Given a pattern 𝑋 (with unknown class) that
we wish to classify:
– Compute $P(C_1|X), P(C_2|X), \dots, P(C_K|X)$
– Find the k giving the maximum $P(C_k|X)$
• This is our classification according to the
Bayes classification rule
• We classify the data point (pattern) as
belonging to the most likely class
Bayes Classification Rule
• To compute $P(C_i|X)$, we use Bayes rule:
$P(C_i|X) = \frac{P(C_i, X)}{P(X)} = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
Bayes Rule:
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
Bayes Classification Rule
• To compute $P(C_i|X)$, we use Bayes rule:
$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
• $P(X|C_i)$ ≡ class-conditional density (defined before)
• $P(C_i)$ ≡ probability of class Ci before or without observing the features X
≡ a priori probability of class Ci
Bayes Classification Rule
• The a priori probabilities represent the
frequencies of the classes irrespective of the
observed features
• For example in OCR, the a priori probabilities
are taken as the frequency or fraction of
occurrence of the different letters in a typical
text
– For the letters E & A → P(Ci) will be higher
– For the letters Q & X → P(Ci) will be low because they are infrequent
Bayes Classification Rule
• Find $C_k$ giving the maximum $P(C_k|X)$:
$P(C_k|X) = \frac{P(X|C_k)\,P(C_k)}{P(X)}$
– $P(C_k|X)$ ≡ posterior probability
– $P(C_k)$ ≡ a priori probability
– $P(X|C_k)$ ≡ class-conditional density
• $P(X) = \sum_{i=1}^{K} P(X, C_i) = \sum_{i=1}^{K} P(X|C_i)\,P(C_i)$
Recap: Marginalization
• Discrete case:
$P(A) = \sum_{j=1}^{J} P(A, B = B_j)$
• Continuous case (law of total probability):
$P(x) = \int_{-\infty}^{\infty} P(x, y)\,dy$
• So:
$P(X) = \sum_{i=1}^{K} P(X, C_i) = \sum_{i=1}^{K} P(X|C_i)\,P(C_i)$
(the first equality is marginalization, the second is Bayes rule)
Bayes Classification Rule
$P(C_k|X) = \frac{P(X|C_k)\,P(C_k)}{\sum_{i=1}^{K} P(X|C_i)\,P(C_i)}$
• In practice, we do not need to compute P(X) because it is a common factor in all the expressions for $P(C_k|X)$
• Hence, it does not affect which term ends up being the maximum
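As an illustration of the resulting decision rule, the sketch below assumes 1-D Gaussian class-conditional densities with known parameters; the densities, their parameters, and all names are invented for the example, not taken from the slides:

```python
import numpy as np
from scipy.stats import norm

def bayes_classify(x, means, stds, priors):
    """Pick the class k maximizing P(x|C_k) * P(C_k).

    means, stds : per-class parameters of the assumed Gaussian P(x|C_k)
    priors      : a priori probabilities P(C_k)
    """
    # P(x) is a common factor, so comparing P(x|C_k) * P(C_k) suffices
    scores = [norm.pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)]
    return int(np.argmax(scores))
```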
Bayes Classification Rule
• Classify X to the class corresponding to $\max_k P(X|C_k)\,P(C_k)$
[Figure: 1-D example with the two curves P(x|C1)P(C1) and P(x|C2)P(C2) plotted over the feature x]
• For x = 5, P(x|C1)P(C1) has a higher value than P(x|C2)P(C2) → classify as C1
Classification Accuracy
$P(\text{correct classification} \mid X) = \max_{1 \le i \le K} P(C_i|X)$
• Example: 3-class case:
– $P(C_1|X) = 0.6$, $P(C_2|X) = 0.3$, $P(C_3|X) = 0.1$
– You classify X as C1 → it has the highest $P(C_i|X)$
– The probability that your classifier is correct equals the probability that X actually belongs to the chosen class (which is 0.6)
Classification Accuracy
• The overall P(correct) is:
$P(\text{correct}) = \int P(\text{correct}, X)\,dX$   (marginal probability)
$= \int P(\text{correct} \mid X)\,P(X)\,dX$   (Bayes rule)
$= \int \max_k \frac{P(X|C_k)\,P(C_k)}{P(X)}\,P(X)\,dX$
$= \int \max_k P(X|C_k)\,P(C_k)\,dX$
Classification Accuracy
• The overall P(correct) is:
$P(\text{correct}) = \int \max_k P(X|C_k)\,P(C_k)\,dX$
[Figure: 1-D example with the curves P(x|C1)P(C1), P(x|C2)P(C2), and P(x|C3)P(C3); their upper envelope is $\max_i P(x|C_i)P(C_i)$, and P(correct) equals the sum of the areas under this envelope]
Classification Accuracy
$P(\text{correct}) = \int \max_k P(X|C_k)\,P(C_k)\,dX$
$P(\text{error}) = 1 - P(\text{correct})$
• We can compute P(error) directly only for the 2-class case!
[Figure: 1-D two-class example; the area under the lower of the two curves P(x|Ci)P(Ci) equals P(error)]
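A numerical sketch of these two quantities for a hypothetical 1-D two-class problem with Gaussian class-conditional densities; every parameter below is invented for illustration:

```python
import numpy as np
from scipy.stats import norm

priors = [0.5, 0.5]                    # a priori probabilities P(C_i)
means, stds = [3.0, 10.0], [1.5, 1.5]  # assumed parameters of P(x|C_i)

x = np.linspace(-5.0, 20.0, 10000)
weighted = np.stack([p * norm.pdf(x, m, s)
                     for p, m, s in zip(priors, means, stds)])

# P(correct) = integral of the upper envelope max_i P(x|C_i) P(C_i);
# in the 2-class case, P(error) is also the area under the lower curve
p_correct = np.trapz(weighted.max(axis=0), x)
p_error = np.trapz(weighted.min(axis=0), x)   # direct, 2-class case only
print(p_correct, 1.0 - p_correct, p_error)    # the last two should match
```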