Figure 1. Two Gaussian density functions where they are equal at the point x∗.

As we have P(A, B) = P(A|B) P(B), we can say:

P(error) = P(x > x∗ | x ∈ C1) P(x ∈ C1) + P(x < x∗ | x ∈ C2) P(x ∈ C2),    (5)

which we want to minimize:

minimize_{x∗}  P(error),    (6)

We take the derivative and set it to zero for the sake of minimization:

∂P(error)/∂x∗ = −f1(x∗) π1 + f2(x∗) π2 = 0
⟹ f1(x∗) π1 = f2(x∗) π2.    (12)

Another way to obtain this expression is to equate the posterior probabilities in order to obtain the equation of the boundary of the classes:

P(x ∈ C1 | X = x) = P(x ∈ C2 | X = x).    (13)

According to Bayes' rule, the posterior is:

P(x ∈ C1 | X = x) = P(X = x | x ∈ C1) P(x ∈ C1) / P(X = x)
= f1(x) π1 / Σ_{k=1}^{|C|} P(X = x | x ∈ Ck) πk,    (14)

and likewise for the second class; hence, Eq. (13) becomes:

f1(x) π1 / Σ_{i=1}^{|C|} P(X = x | x ∈ Ci) πi = f2(x) π2 / Σ_{i=1}^{|C|} P(X = x | x ∈ Ci) πi
⟹ f1(x) π1 = f2(x) π2.    (15)

Now let us think of the data as multivariate with dimensionality d. The PDF of the multivariate Gaussian distribution, x ∼ N(µ, Σ), is:

f(x; µ, Σ) = (1 / √((2π)^d |Σ|)) exp(−(x − µ)^⊤ Σ^{-1} (x − µ) / 2).    (16)

Therefore, Eq. (15) becomes:

(1 / √((2π)^d |Σ1|)) exp(−(x − µ1)^⊤ Σ1^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ2|)) exp(−(x − µ2)^⊤ Σ2^{-1} (x − µ2) / 2) π2,    (17)

where the distributions of the first and second class are N(µ1, Σ1) and N(µ2, Σ2), respectively.

3. Linear Discriminant Analysis for Binary Classification

In Linear Discriminant Analysis (LDA), we assume that the two classes have equal covariance matrices:

Σ1 = Σ2 = Σ.    (18)
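Before deriving the decision boundary, here is a quick numerical illustration of Eqs. (16)–(18). This sketch is not from the paper; it assumes NumPy and SciPy, with made-up means, a shared covariance matrix (as in Eq. (18)), and priors, and it evaluates the two scaled likelihoods of Eq. (17) at a query point to see which side is larger.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for two 2-dimensional Gaussian classes (Eq. 16),
# with a shared covariance matrix as assumed by LDA in Eq. (18).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
pi1, pi2 = 0.6, 0.4  # assumed priors

x = np.array([1.0, 0.5])  # a query point

# The two sides of Eq. (17): f1(x)*pi1 and f2(x)*pi2.
g1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * pi1
g2 = multivariate_normal.pdf(x, mean=mu2, cov=Sigma) * pi2

# The decision boundary is where the two sides are equal; otherwise the
# larger scaled likelihood determines the estimated class.
print("f1(x)*pi1 =", g1, "  f2(x)*pi2 =", g2)
print("estimated class:", 1 if g1 > g2 else 2)
```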
Therefore, Eq. (17) becomes:

(1 / √((2π)^d |Σ|)) exp(−(x − µ1)^⊤ Σ^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ|)) exp(−(x − µ2)^⊤ Σ^{-1} (x − µ2) / 2) π2,

⟹ exp(−(x − µ1)^⊤ Σ^{-1} (x − µ1) / 2) π1 = exp(−(x − µ2)^⊤ Σ^{-1} (x − µ2) / 2) π2,

(a)⟹ −(1/2)(x − µ1)^⊤ Σ^{-1} (x − µ1) + ln(π1) = −(1/2)(x − µ2)^⊤ Σ^{-1} (x − µ2) + ln(π2),

where (a) takes the natural logarithm of both sides of the equation.
We can simplify this term as:

(x − µ1)^⊤ Σ^{-1} (x − µ1) = (x^⊤ − µ1^⊤) Σ^{-1} (x − µ1)
= x^⊤ Σ^{-1} x − x^⊤ Σ^{-1} µ1 − µ1^⊤ Σ^{-1} x + µ1^⊤ Σ^{-1} µ1
(a)= x^⊤ Σ^{-1} x + µ1^⊤ Σ^{-1} µ1 − 2 µ1^⊤ Σ^{-1} x,    (19)

where (a) is because x^⊤ Σ^{-1} µ1 = µ1^⊤ Σ^{-1} x as it is a scalar, and Σ^{-1} is symmetric so Σ^{-⊤} = Σ^{-1}. Thus, we have:

−(1/2) x^⊤ Σ^{-1} x − (1/2) µ1^⊤ Σ^{-1} µ1 + µ1^⊤ Σ^{-1} x + ln(π1)
= −(1/2) x^⊤ Σ^{-1} x − (1/2) µ2^⊤ Σ^{-1} µ2 + µ2^⊤ Σ^{-1} x + ln(π2).

Therefore, if we multiply both sides of the equation by 2, we have:

2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) + 2 ln(π2 / π1) = 0,    (20)

which is the equation of a line in the form a^⊤ x + b = 0. Therefore, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary of classification is a line. Because of the linearity of the decision boundary which discriminates the two classes, this method is named linear discriminant analysis.
For obtaining Eq. (20), we brought the expressions to the side corresponding to the second class; therefore, if we take δ(x) : R^d → R to be the left-hand-side expression (function) in Eq. (20):

δ(x) := 2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) + 2 ln(π2 / π1),    (21)

the class of an instance x is estimated as:

Ĉ(x) = 1 if δ(x) < 0, and Ĉ(x) = 2 if δ(x) > 0.    (22)

If the priors of the two classes are equal, i.e., π1 = π2, Eq. (20) becomes:

2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) = 0,    (23)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

4. Quadratic Discriminant Analysis for Binary Classification

In Quadratic Discriminant Analysis (QDA), we relax the assumption of equality of the covariance matrices:

Σ1 ≠ Σ2,    (24)

which means the covariances are not necessarily equal (if they are actually equal, the decision boundary will be linear and QDA reduces to LDA).
Therefore, Eq. (17) becomes:

(1 / √((2π)^d |Σ1|)) exp(−(x − µ1)^⊤ Σ1^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ2|)) exp(−(x − µ2)^⊤ Σ2^{-1} (x − µ2) / 2) π2,

(a)⟹ −(d/2) ln(2π) − (1/2) ln(|Σ1|) − (1/2)(x − µ1)^⊤ Σ1^{-1} (x − µ1) + ln(π1)
= −(d/2) ln(2π) − (1/2) ln(|Σ2|) − (1/2)(x − µ2)^⊤ Σ2^{-1} (x − µ2) + ln(π2),

where (a) takes the natural logarithm of both sides of the equation. According to Eq. (19), we have:

−(1/2) ln(|Σ1|) − (1/2) x^⊤ Σ1^{-1} x − (1/2) µ1^⊤ Σ1^{-1} µ1 + µ1^⊤ Σ1^{-1} x + ln(π1)
= −(1/2) ln(|Σ2|) − (1/2) x^⊤ Σ2^{-1} x − (1/2) µ2^⊤ Σ2^{-1} µ2 + µ2^⊤ Σ2^{-1} x + ln(π2).

Therefore, if we multiply both sides of the equation by 2, we have:

x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) + 2 ln(π2 / π1) = 0,    (25)
which is in the quadratic form x^⊤ A x + b^⊤ x + c = 0. Therefore, if we consider Gaussian distributions for the two classes, the decision boundary of classification is quadratic. Because of the quadratic decision boundary which discriminates the two classes, this method is named quadratic discriminant analysis.
For obtaining Eq. (25), we brought the expressions to the side corresponding to the second class; therefore, if we take δ(x) : R^d → R to be the left-hand-side expression (function) in Eq. (25):

δ(x) := x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) + 2 ln(π2 / π1),    (26)

the class of an instance x is estimated as in Eq. (22).
If the priors of the two classes are equal, i.e., π1 = π2, Eq. (25) becomes:

x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) = 0,    (27)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

5. LDA and QDA for Multi-class Classification

Now we consider multiple classes, which can be more than two, indexed by k ∈ {1, . . . , |C|}. Recall Eq. (12) or (15), where we are using the scaled posterior, i.e., fk(x) πk. According to Eq. (16), we have:

fk(x) πk = (1 / √((2π)^d |Σk|)) exp(−(x − µk)^⊤ Σk^{-1} (x − µk) / 2) πk.

Taking the natural logarithm gives:

ln(fk(x) πk) = −(d/2) ln(2π) − (1/2) ln(|Σk|) − (1/2)(x − µk)^⊤ Σk^{-1} (x − µk) + ln(πk).

We drop the constant term −(d/2) ln(2π), which is the same for all classes (note that this term appears as a multiplicative factor before taking the logarithm). Thus, the scaled posterior of the k-th class becomes:

δk(x) := −(1/2) ln(|Σk|) − (1/2)(x − µk)^⊤ Σk^{-1} (x − µk) + ln(πk).    (28)

In QDA, the class of the instance x is estimated as:

Ĉ(x) = arg max_k δk(x),    (29)

because it maximizes the posterior of that class. In this expression, δk(x) is Eq. (28).
In LDA, we assume that the covariance matrices of the classes are equal:

Σ1 = · · · = Σ|C| = Σ.    (30)

Therefore, Eq. (28) becomes:

δk(x) = −(1/2) ln(|Σ|) − (1/2)(x − µk)^⊤ Σ^{-1} (x − µk) + ln(πk)
= −(1/2) ln(|Σ|) − (1/2) x^⊤ Σ^{-1} x − (1/2) µk^⊤ Σ^{-1} µk + µk^⊤ Σ^{-1} x + ln(πk).

We drop the constant terms −(1/2) ln(|Σ|) and −(1/2) x^⊤ Σ^{-1} x, which are the same for all classes (note that, before taking the logarithm, the former appears as the multiplicative factor |Σ|^{-1/2} and the latter inside an exponential factor). Thus, the scaled posterior of the k-th class becomes:

δk(x) := µk^⊤ Σ^{-1} x − (1/2) µk^⊤ Σ^{-1} µk + ln(πk).    (31)

In LDA, the class of the instance x is determined by Eq. (29), where δk(x) is Eq. (31), because it maximizes the posterior of that class.
In conclusion, QDA and LDA deal with maximizing the posterior of the classes but work with the likelihoods (class conditionals) and priors.

6. Estimation of Parameters in LDA and QDA

In LDA and QDA, we have several parameters which are required in order to calculate the posteriors. These parameters are the means and the covariance matrices of the classes and the priors of the classes.
The priors of the classes are very tricky to calculate. It is somewhat a chicken-and-egg problem because we want to know the class probabilities (priors) to estimate the class of an instance, but we do not have the priors and should estimate them. Usually, the prior of the k-th class is estimated according to the sample size of the k-th class:

π̂k = nk / n,    (32)

where nk and n are the number of training instances in the k-th class and in total, respectively. This estimation considers a Bernoulli distribution for choosing every instance out of the overall training set to be in the k-th class.
The mean of the k-th class can be estimated using Maximum Likelihood Estimation (MLE), or the Method of Moments (MOM), for the mean of a Gaussian distribution:

R^d ∋ µ̂k = (1 / nk) Σ_{i=1}^{n} xi I(C(xi) = k),    (33)
and the covariance matrix of the k-th class can be estimated as:

R^{d×d} ∋ Σ̂k = (1 / nk) Σ_{i=1}^{n} (xi − µ̂k)(xi − µ̂k)^⊤ I(C(xi) = k),    (34)

where I(·) denotes the indicator function.
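The following sketch is not from the paper; it is a minimal illustration, assuming NumPy and a toy labeled dataset X, y, of how Eqs. (28), (29), and (31)–(34) fit together: it estimates the priors, means, and covariance matrices from data and then classifies a query point with the QDA and LDA discriminants.

```python
import numpy as np

def estimate_parameters(X, y):
    """Estimate priors (Eq. 32), means (Eq. 33), and covariances (Eq. 34) per class."""
    classes = np.unique(y)
    priors, means, covs = {}, {}, {}
    n = len(y)
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n              # pi_hat_k = n_k / n
        means[k] = Xk.mean(axis=0)           # mu_hat_k
        diff = Xk - means[k]
        covs[k] = diff.T @ diff / len(Xk)    # Sigma_hat_k (MLE version)
    return priors, means, covs

def delta_qda(x, prior, mean, cov):
    """Scaled log-posterior of Eq. (28)."""
    d = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * d @ np.linalg.inv(cov) @ d
            + np.log(prior))

def delta_lda(x, prior, mean, shared_cov_inv):
    """Scaled log-posterior of Eq. (31) with a shared covariance."""
    return mean @ shared_cov_inv @ x - 0.5 * mean @ shared_cov_inv @ mean + np.log(prior)

# Toy data: two hypothetical Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 2], 1.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

priors, means, covs = estimate_parameters(X, y)
x_query = np.array([1.5, 1.0])

# QDA: pick the class maximizing Eq. (28), as in Eq. (29).
qda_class = max(priors, key=lambda k: delta_qda(x_query, priors[k], means[k], covs[k]))

# LDA: use one shared covariance (here simply the average of the class
# covariances) and maximize Eq. (31).
shared_cov_inv = np.linalg.inv(sum(covs.values()) / len(covs))
lda_class = max(priors, key=lambda k: delta_lda(x_query, priors[k], means[k], shared_cov_inv))

print("QDA class:", qda_class, " LDA class:", lda_class)
```

In practice, the shared covariance for LDA is usually a sample-size-weighted (pooled) average of the class covariances; the plain average above is only for brevity.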
As the next step, consider a more general case where the covariance matrices are not equal, as we have in QDA. We apply Singular Value Decomposition (SVD) to the covariance matrix of the k-th class:

Σk = Uk Λk Uk^⊤,

where the left and right matrices of singular vectors are equal because the covariance matrix is symmetric. Therefore:

Σk^{-1} = Uk Λk^{-1} Uk^⊤,

where Uk^{-1} = Uk^⊤ because it is an orthogonal matrix. Therefore, we can simplify the following term:

(x − µk)^⊤ Σk^{-1} (x − µk) = (x − µk)^⊤ Uk Λk^{-1} Uk^⊤ (x − µk)
= (Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1} (Uk^⊤ x − Uk^⊤ µk).

The Λk^{-1} can be decomposed as Λk^{-1} = Λk^{-1/2} Λk^{-1/2}. Therefore:

(Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1} (Uk^⊤ x − Uk^⊤ µk)
= (Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1/2} Λk^{-1/2} (Uk^⊤ x − Uk^⊤ µk)
(a)= (Λk^{-1/2} Uk^⊤ x − Λk^{-1/2} Uk^⊤ µk)^⊤ (Λk^{-1/2} Uk^⊤ x − Λk^{-1/2} Uk^⊤ µk),

where (a) is because Λk^{-⊤/2} = Λk^{-1/2}, as it is diagonal. We define the following transformation:

φk : x ↦ Λk^{-1/2} Uk^⊤ x,    (42)

which also results in the transformation of the mean: φk : µ ↦ Λk^{-1/2} Uk^⊤ µ. Therefore, Eq. (28) can be restated as:

δk(x) = −(1/2) ln(|Σk|) − (1/2) (φk(x) − φk(µk))^⊤ (φk(x) − φk(µk)) + ln(πk).    (43)

Ignoring the terms −(1/2) ln(|Σk|) and ln(πk), we can see that the transformation has changed the covariance matrix of the class to the identity matrix. Therefore, QDA (and also LDA) can be seen as a simple comparison of distances from the means of the classes after applying a transformation to the data of every class. In other words, we are learning the metric using the SVD of the covariance matrix of every class. Thus, LDA and QDA can be seen as metric learning (Yang & Jin, 2006; Kulis, 2013) in a perspective. Note that in metric learning, a valid distance metric is defined as (Yang & Jin, 2006):

d_A²(x, µk) := ||x − µk||_A² = (x − µk)^⊤ A (x − µk),    (44)

where A is a positive semi-definite matrix, i.e., A ⪰ 0. In QDA, we are also using (x − µk)^⊤ Σk^{-1} (x − µk). The covariance matrix is positive semi-definite according to the characteristics of covariance matrices. Moreover, according to the characteristics of a positive semi-definite matrix, the inverse of a positive semi-definite matrix is positive semi-definite, so Σk^{-1} ⪰ 0. Therefore, QDA is using metric learning (and, as will be discussed in the next section, it can be seen as a manifold learning method, too).
It is also noteworthy that QDA and LDA can be seen as using the Mahalanobis distance (McLachlan, 1999; De Maesschalck et al., 2000), which is also a metric:

d²(x, µ) := (x − µ)^⊤ Σ^{-1} (x − µ),    (45)

where Σ is the covariance matrix of the cloud of data whose mean is µ. The intuition of the Mahalanobis distance is that if we have several data clouds (e.g., classes), the distance from the class with larger variance should be scaled down because that class is taking more of the space, so it is more probable to happen. The scaling down shows in the inverse of the covariance matrix. Comparing (x − µk)^⊤ Σk^{-1} (x − µk) in QDA or LDA with Eq. (45) shows that QDA and LDA are, in a way, using the Mahalanobis distance.

8. LDA ≡ FDA?

In the previous section, we saw that LDA and QDA can be seen as metric learning. We know that metric learning can be seen as a family of manifold learning methods. We briefly explain the reason for this assertion: as A ⪰ 0, we can decompose it as A = U U^⊤. Therefore, Eq. (44) becomes:

||x − µk||_A² = (x − µk)^⊤ U U^⊤ (x − µk) = (U^⊤ x − U^⊤ µk)^⊤ (U^⊤ x − U^⊤ µk),

which means that metric learning can be seen as a comparison of simple Euclidean distances after the transformation φ : x ↦ U^⊤ x, which is a projection onto a subspace with projection matrix U. Thus, metric learning is a manifold learning approach. This gives a hint that Fisher Discriminant Analysis (FDA) (Fisher, 1936; Welling, 2005), which is a manifold learning approach (Tharwat et al., 2017), might have a connection to LDA; especially because the names FDA and LDA are often used interchangeably in the literature. Actually, other names of FDA are Fisher LDA (FLDA) and even LDA.
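To make the metric-learning view concrete, here is a minimal sketch, not from the paper, assuming NumPy and an arbitrary, hypothetical class covariance and mean. It implements the transformation φk of Eq. (42) through an eigendecomposition of Σk and checks numerically that the metric distance (x − µk)^⊤ Σk^{-1} (x − µk) equals the plain squared Euclidean distance between φk(x) and φk(µk).

```python
import numpy as np

# A hypothetical class covariance, mean, and query point.
Sigma_k = np.array([[4.0, 1.2],
                    [1.2, 2.0]])
mu_k = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

# Sigma_k = U_k Lambda_k U_k^T (eigendecomposition; it coincides with the SVD
# here because the covariance matrix is symmetric positive definite).
eigvals, U_k = np.linalg.eigh(Sigma_k)
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

# The transformation phi_k of Eq. (42), applied to the point and to the mean.
phi = lambda v: Lambda_inv_sqrt @ U_k.T @ v

mahalanobis_sq = (x - mu_k) @ np.linalg.inv(Sigma_k) @ (x - mu_k)
euclidean_sq_after = np.sum((phi(x) - phi(mu_k)) ** 2)

print(mahalanobis_sq, euclidean_sq_after)  # the two values agree up to rounding
```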
We know that if we project (transform) the data of a class using a projection vector u ∈ R^d, i.e.:

x ↦ u^⊤ x,    (46)

for all data instances of the class, the mean and the covariance matrix of the class are transformed as:

µ ↦ u^⊤ µ,    (47)
Σ ↦ u^⊤ Σ u,    (48)

because of the characteristics of mean and variance.
The Fisher criterion (Xu & Lu, 2006) is the ratio of the between-class variance, σb², and the within-class variance, σw²:

f := σb² / σw² = (u^⊤ µ2 − u^⊤ µ1)² / (u^⊤ Σ2 u + u^⊤ Σ1 u) = (u^⊤ (µ2 − µ1))² / (u^⊤ (Σ2 + Σ1) u).    (49)

The FDA direction maximizes this criterion; writing the maximization as a Lagrangian L (with the denominator fixed to a constant) and taking the derivative, we have:

∂L/∂u = 2 (µ2 − µ1)(µ2 − µ1)^⊤ u − 2 λ (Σ2 + Σ1) u = 0
⟹ (µ2 − µ1)(µ2 − µ1)^⊤ u = λ (Σ2 + Σ1) u,

which is a generalized eigenvalue problem ((µ2 − µ1)(µ2 − µ1)^⊤, Σ2 + Σ1) according to (Ghojogh et al., 2019b). The projection vector is the eigenvector of (Σ2 + Σ1)^{-1} (µ2 − µ1)(µ2 − µ1)^⊤; therefore, we can say:

u ∝ (Σ2 + Σ1)^{-1} (µ2 − µ1).

In LDA, the equality of the covariance matrices is assumed. Thus, according to Eq. (18), we can say:

u ∝ (2Σ)^{-1} (µ2 − µ1) ∝ Σ^{-1} (µ2 − µ1).    (52)

According to Eq. (46), we have:

u^⊤ x ∝ (Σ^{-1} (µ2 − µ1))^⊤ x.    (53)

Comparing Eq. (53) with Eq. (23) shows that LDA and FDA are equivalent up to a scaling factor (µ1 − µ2)^⊤ Σ^{-1} (µ1 − µ2) (note that this term is multiplied as an exponential factor before taking the logarithm to obtain Eq. (23), so this term is a scaling factor). Hence, we can say:

LDA ≡ FDA.    (54)

In other words, FDA projects onto a subspace. On the other hand, according to Section 7, LDA can be seen as metric learning with a subspace where the Euclidean distance is used after projecting onto that subspace. The two subspaces of FDA and LDA are the same subspace. It should be noted that in manifold (subspace) learning, the scale does not matter because all the distances scale similarly.
Note that LDA assumes one (and not several) Gaussian for every class, and so does FDA. That is why FDA faces problems for multi-modal data (Sugiyama, 2007).

9. Relation to Logistic Regression

In logistic regression, the posterior is modeled directly as:

P(C(x) | X = x) = (exp(β^⊤ x′) / (1 + exp(β^⊤ x′)))^{C(x)} (1 / (1 + exp(β^⊤ x′)))^{1 − C(x)},    (55)

where C(x) ∈ {0, 1} for the two classes. Logistic regression considers the coefficient β as the parameter to be optimized and uses Newton's method (Boyd & Vandenberghe, 2004) for the optimization. Therefore, in summary, logistic regression makes an assumption on the posterior, while LDA and QDA make assumptions on the likelihood and prior.

10. Relation to Bayes Optimal Classifier and Gaussian Naive Bayes

The Bayes classifier maximizes the posteriors of the classes (Murphy, 2012):

Ĉ(x) = arg max_k P(x ∈ Ck | X = x).    (56)
According to Eq. (14) and Bayes' rule, we have:

P(x ∈ Ck | X = x) = P(X = x | x ∈ Ck) πk / P(X = x),    (57)

where the denominator of the posterior (the marginal), which is:

P(X = x) = Σ_{r=1}^{|C|} P(X = x | x ∈ Cr) πr,

is ignored because it does not depend on the classes C1 to C|C|.
According to Eq. (57), the posterior can be written in terms of the likelihood and the prior; therefore, Eq. (56) can be restated as:

Ĉ(x) = arg max_k πk P(X = x | x ∈ Ck).    (58)

Note that the Bayes classifier does not make any assumption on the posterior, prior, and likelihood, unlike LDA and QDA, which assume a uni-modal Gaussian distribution for the likelihood (and we may assume a Bernoulli distribution for the prior in LDA and QDA according to Eq. (32)). Therefore, we can say the difference between Bayes and QDA is in the assumption of a uni-modal Gaussian distribution for the likelihood (class conditional); hence, if the likelihoods are already uni-modal Gaussian, the Bayes classifier reduces to QDA. Likewise, the difference between Bayes and LDA is in the assumption of a Gaussian distribution for the likelihood (class conditional) and the equality of the covariance matrices of the classes; thus, if the likelihoods are already Gaussian and the covariance matrices are already equal, the Bayes classifier reduces to LDA.
It is noteworthy that the Bayes classifier is an optimal classifier because it can be seen as an ensemble of hypotheses (models) in the hypothesis (model) space, and no other ensemble of hypotheses can outperform it (see Chapter 6, Page 175 in (Mitchell, 1997)). In the literature, it is referred to as the Bayes optimal classifier. To better formulate the explained statements, the Bayes optimal classifier estimates the class as:

Ĉ(x) = arg max_{Ck ∈ C} Σ_{hj ∈ H} P(Ck | hj) P(hj | D),    (59)

where C := {C1, . . . , C|C|}, D := {xi}_{i=1}^{n} is the training set, hj is a hypothesis for estimating the class of instances, and H is the hypothesis space including all possible hypotheses.
According to Bayes' rule, similar to what we had for Eq. (57), we have:

P(hj | D) ∝ P(D | hj) P(hj).

Therefore, Eq. (59) becomes (Mitchell, 1997):

Ĉ(x) = arg max_{Ck ∈ C} Σ_{hj ∈ H} P(Ck | hj) P(D | hj) P(hj).    (60)

In conclusion, the Bayes classifier is optimal. Therefore, if the likelihoods of the classes are Gaussian, QDA is an optimal classifier, and if the likelihoods are Gaussian and the covariance matrices are equal, LDA is an optimal classifier. Often, the distributions in natural life are Gaussian; especially, because of the central limit theorem (Hazewinkel, 2001), the summation of independent and identically distributed (iid) variables is Gaussian, and signals usually add in the real world. This explains why LDA and QDA are very effective classifiers in machine learning. We also saw that FDA is equivalent to LDA. Thus, the reason for the effectiveness of the powerful FDA classifier becomes clear. We have seen the very successful performance of FDA and LDA in different applications, such as face recognition (Belhumeur et al., 1997; Etemad & Chellappa, 1997; Zhao et al., 1999), action recognition (Ghojogh et al., 2017; Mokari et al., 2018), and EEG classification (Malekmohammadi et al., 2019).
Implementing the Bayes classifier is difficult in practice, so we approximate it by naive Bayes (Zhang, 2004). If xj denotes the j-th dimension (feature) of x = [x1, . . . , xd]^⊤, Eq. (58) is restated as:

Ĉ(x) = arg max_k πk P(x1, x2, . . . , xd | x ∈ Ck).    (61)

The term P(x1, x2, . . . , xd | x ∈ Ck) is very difficult to compute as the features are possibly correlated. Naive Bayes relaxes this possibility and naively assumes that the features are conditionally independent (⊥⊥) when they are conditioned on the class:

P(x1, x2, . . . , xd | x ∈ Ck) ≈ P(x1 | Ck) P(x2 | Ck) · · · P(xd | Ck) = Π_{j=1}^{d} P(xj | Ck).

Therefore, Eq. (61) becomes:

Ĉ(x) = arg max_k πk Π_{j=1}^{d} P(xj | Ck).    (62)

In Gaussian naive Bayes, a univariate Gaussian distribution is assumed for the likelihood (class conditional) of every feature:

P(xj | Ck) = (1 / √(2π σk²)) exp(−(xj − µk)² / (2 σk²)),    (63)
where the mean and unbiased variance are estimated as:

R ∋ µ̂k = (1 / nk) Σ_{i=1}^{n} xi,j I(C(xi) = k),    (64)

R ∋ σ̂k² = (1 / (nk − 1)) Σ_{i=1}^{n} (xi,j − µ̂k)² I(C(xi) = k),    (65)

where xi,j denotes the j-th feature of the i-th training instance. The prior can again be estimated using Eq. (32).
According to Eqs. (62) and (63), Gaussian naive Bayes is equivalent to QDA where the covariance matrices are diagonal, i.e., the off-diagonal entries of the covariance matrices are ignored. Therefore, we can say that QDA is more powerful than Gaussian naive Bayes because Gaussian naive Bayes is a simplified version of QDA. Moreover, it is obvious that Gaussian naive Bayes and QDA are equivalent for one-dimensional data. Compared to LDA, Gaussian naive Bayes is equivalent to LDA if the covariance matrices are diagonal and they are all equal, i.e., σ1² = · · · = σ|C|²; therefore, LDA and Gaussian naive Bayes have their own assumptions, one on the off-diagonal of the covariance matrices and the other one on the equality of the covariance matrices. As Gaussian naive Bayes has some level of optimality (Zhang, 2004), it becomes clear why LDA and QDA are such effective classifiers.

11. Relation to Likelihood Ratio Test

Consider two hypotheses for estimating some parameter, a null hypothesis H0 and an alternative hypothesis HA. The probability P(reject H0 | H0) is called the type 1 error, false positive error, or false alarm error. The probability P(accept H0 | HA) is called the type 2 error or false negative error. The P(reject H0 | H0) is also called the significance level, while 1 − P(accept H0 | HA) = P(reject H0 | HA) is called the power.
If L(θA) and L(θ0) are the likelihoods (probabilities) for the alternative and null hypotheses, the likelihood ratio is:

Λ = L(θA) / L(θ0) = f(x; θA) / f(x; θ0).    (66)

The Likelihood Ratio Test (LRT) (Casella & Berger, 2002) rejects H0 in favor of HA if the likelihood ratio is greater than a threshold, i.e., Λ ≥ t. The LRT is a very effective statistical test because, according to the Neyman-Pearson lemma (Neyman & Pearson, 1933), it has the largest power among all statistical tests with the same significance level.
If the sample size is large, n → ∞, and θA is estimated using MLE, the logarithm of the likelihood ratio asymptotically has a χ² distribution under the null hypothesis (White, 1984; Casella & Berger, 2002):

2 ln(Λ) ∼ χ²_{(df)}  under H0,    (67)

where the degree of freedom of the χ² distribution is df := dim(HA) − dim(H0) and dim(·) is the number of unspecified parameters in the hypothesis.
There is a connection between LDA or QDA and the LRT (Lachenbruch & Goldstein, 1979). Recall Eq. (12) or (15), which can be restated as:

f2(x) π2 / (f1(x) π1) = 1,    (68)

which is for the decision boundary. Eq. (22) dealt with the difference of f2(x) π2 and f1(x) π1; however, here we are dealing with their ratio. Recall Fig. 1 where, if we move x∗ to the right and left, the ratio f2(x∗) π2 / f1(x∗) π1 decreases and increases, respectively, because the probabilities of the first and second class happening change. In other words, moving x∗ changes the significance level and the power. Therefore, Eq. (68) can be used to have a statistical test where the posteriors are used in the ratio, as we also used posteriors in LDA and QDA. The null/alternative hypothesis can be considered to be the mean and covariance of the first/second class. In other words, the two hypotheses say that the point belongs to a specific class. Hence, if the ratio is larger than a value t, the instance x is estimated to belong to the second class; otherwise, the first class is chosen. According to Eq. (16), Eq. (68) becomes:

(|Σ2|)^{-1/2} exp(−(1/2)(x − µ2)^⊤ Σ2^{-1} (x − µ2)) π2 / ((|Σ1|)^{-1/2} exp(−(1/2)(x − µ1)^⊤ Σ1^{-1} (x − µ1)) π1) ≥ t,    (69)

for QDA. In LDA, the covariance matrices are equal, so:

exp(−(1/2)(x − µ2)^⊤ Σ^{-1} (x − µ2)) π2 / (exp(−(1/2)(x − µ1)^⊤ Σ^{-1} (x − µ1)) π1) ≥ t.    (70)

As can be seen, changing the priors impacts the ratio, as expected. Moreover, the value of t can be chosen according to the desired significance level in the χ² distribution using the χ² table. Eqs. (69) and (70) show the relation of LDA and QDA with the LRT. As the LRT has the largest power (Neyman & Pearson, 1933), the effectiveness of LDA and QDA in classification is explained from a hypothesis testing point of view.

12. Simulations

In this section, we report some simulations which make the concepts of the tutorial clearer by illustration.

12.1. Experiments with Equal Class Sample Sizes

We created a synthetic dataset of three classes, each of which is a two-dimensional Gaussian distribution. The means and covariance matrices of the three Gaussians from
Figure 3. The synthetic dataset: (a) three classes each with size 200, (b) two classes each with size 200, (c) three classes each with size
10, (d) two classes each with size 10, (e) three classes with sizes 200, 100, and 10, (f) two classes with sizes 200 and 10, and (g) two
classes with sizes 400 and 200 where the larger class has two modes.
which the class samples were randomly drawn are:

µ1 = [−4, 4]^⊤,  µ2 = [3, −3]^⊤,  µ3 = [−3, 3]^⊤,

Σ1 = [10, 1; 1, 5],  Σ2 = [3, 0; 0, 4],  Σ3 = [6, 1.5; 1.5, 4].

The three classes are shown in Fig. 3-a, where each has sample size 200. Experiments were performed on the three classes. We also performed experiments on two of the three classes to test binary classification. The two classes are shown in Fig. 3-b. The LDA, QDA, naive Bayes, and
Figure 4. Experiments with equal class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
Bayes classifications of the two and three classes are shown in Fig. 4. For both binary and ternary classification with LDA and QDA, we used Eqs. (31) and (28), respectively, with Eq. (29). We also estimated the mean and covariance using Eqs. (33), (35), and (36). For Gaussian naive Bayes, we used Eqs. (62) and (63) and estimated the parameters using Eqs. (64) and (65). For the Bayes classifier, we used Eq. (58) with Eq. (63), but we do not estimate the mean and variance; rather, in order to use the exact likelihoods in Eq. (58), we use the exact mean and covariance matrices
Figure 5. Experiments with small class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
of the distributions which we sampled from. We, however, estimated the priors. The priors were estimated using Eq. (32) for all the classifiers.
As can be seen in Fig. 4, the space is partitioned into two/three parts, and this validates the assertion that LDA and QDA can be considered as metric learning methods, as discussed in Section 7. As expected, the boundaries of LDA and QDA are linear and curvy (quadratic), respectively. The results of QDA, Gaussian naive Bayes, and Bayes are very similar, although they have slight differences. This is because the classes are already Gaussian, so if the estimates of the means and covariance matrices are accurate enough, QDA and Bayes are equivalent. The classes are Gaussians, and the off-diagonal elements of the covariance
Figure 6. Experiments with different class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
matrices are also small compared to the diagonal; therefore, naive Bayes also behaves similarly.

12.2. Experiments with Small Class Sample Sizes

According to Monte-Carlo approximation (Robert & Casella, 2013), the estimates in Eqs. (33), (35), (64) and (65) are more accurate if the sample size goes to infinity, i.e., n → ∞. Therefore, if the sample size is small, we expect more difference between the QDA and Bayes classifiers. We made a synthetic dataset with three or two classes with the same mentioned means and covariance matrices. The sample size of every class was 10. Figures 3-c and 3-d show these datasets. The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are
Figure 7. Experiments with multi-modal data: (a) LDA, (b) QDA, (c) Gaussian naive Bayes, and (d) Bayes.
shown in Fig. 5. As can be seen, now the results of QDA, Gaussian naive Bayes, and Bayes are different, for the reason explained.

12.3. Experiments with Different Class Sample Sizes

According to Eq. (32), used in Eqs. (28), (31), (58), and (62), the prior of a class changes with the sample size of the class. In order to see the effect of sample size, we made a synthetic dataset with different class sizes, i.e., 200, 100, and 10, shown in Figs. 3-e and 3-f. We used the same mentioned means and covariance matrices. The results are shown in Fig. 6. As can be seen, the class with the small sample size has covered a small portion of the space in discrimination, which is expected because its prior is small according to Eq. (32); therefore, its posterior is small. On the other hand, the class with the large sample size has covered a larger portion because of a larger prior.

12.4. Experiments with Multi-Modal Data

As mentioned in Section 8, LDA and QDA assume a uni-modal Gaussian distribution for every class, and thus FDA or LDA faces problems for multi-modal data (Sugiyama, 2007). For testing this, we made a synthetic dataset with two classes, one with sample size 400 having two modes of Gaussians and the other with sample size 200 having one mode. We again used the same mentioned means and covariance matrices. The dataset is shown in Fig. 3-g.
The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 7. The mean and covariance matrix of the larger class, although it has two modes, were estimated using Eqs. (33), (35), (64) and (65) in LDA, QDA, and Gaussian naive Bayes. However, for the likelihood used in the Bayes classifier, i.e., in Eq. (58), we need to know the exact multi-modal distribution. Therefore, we fit a mixture of two Gaussians (Ghojogh et al., 2019a) to the data of the larger class:

P(X = x | x ∈ Ck) = Σ_{k=1}^{2} wk f(x; µk, Σk),    (71)

where f(x; µk, Σk) is Eq. (16) and we the fitted parame-
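As an illustration of Eq. (71), the following sketch is not from the paper; it assumes scikit-learn and hypothetical two-mode data for the larger class, fits a two-component Gaussian mixture, and evaluates its density, which then plays the role of the class-conditional likelihood in the Bayes rule of Eq. (58).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical two-mode data for the larger class (Eq. 71).
rng = np.random.default_rng(1)
X_large = np.vstack([rng.normal([-4, 4], 1.5, size=(200, 2)),
                     rng.normal([3, -3], 1.5, size=(200, 2))])

# Fit a mixture of two Gaussians; this estimates the weights w_k,
# means mu_k, and covariances Sigma_k of Eq. (71).
gmm = GaussianMixture(n_components=2).fit(X_large)

# The mixture density P(X = x | x in C_k) of Eq. (71) at a query point;
# score_samples returns the log-density, so we exponentiate it.
x_query = np.array([[0.0, 0.0]])
likelihood = np.exp(gmm.score_samples(x_query))[0]

# Multiplied by the class prior, this plays the role of
# pi_k * P(X = x | x in C_k) in the Bayes classifier of Eq. (58).
print(likelihood)
```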
Kleinbaum, David G, Dietz, K, Gail, M, Klein, Mitchel, and Klein, Mitchell. Logistic regression. Springer, 2002.

Kulis, Brian. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

Lachenbruch, Peter A and Goldstein, M. Discriminant analysis. Biometrics, pp. 69–85, 1979.

Li, Yongmin, Gong, Shaogang, and Liddell, Heather. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21(13-14):1077–1086, 2003.

Lu, Juwei, Plataniotis, Konstantinos N, and Venetsanopoulos, Anastasios N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1):117–126, 2003.

Malekmohammadi, Alireza, Mohammadzade, Hoda, Chamanzar, Alireza, Shabany, Mahdi, and Ghojogh, Benyamin. An efficient hardware implementation for a motor imagery brain computer interface system. Scientia Iranica, 26:72–94, 2019.

McLachlan, Goeffrey J. Mahalanobis distance. Resonance, 4(6):20–26, 1999.

Mitchell, Thomas. Machine learning. McGraw Hill Higher Education, 1997.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3d skeleton data using body states. Scientia Iranica, 2018.

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT Press, 2012.

Neyman, Jerzy and Pearson, Egon Sharpe. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

Robert, Christian and Casella, George. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Strange, Harry and Zwiggelaar, Reyer. Open Problems in Spectral Dimensionality Reduction. Springer, 2014.

Sugiyama, Masashi. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8(May):1027–1061, 2007.

Tharwat, Alaa, Gaber, Tarek, Ibrahim, Abdelhameed, and Hassanien, Aboul Ella. Linear discriminant analysis: A detailed tutorial. AI Communications, 30(2):169–190, 2017.

Welling, Max. Fisher linear discriminant analysis. Technical report, University of Toronto, Toronto, Ontario, Canada, 2005.

White, Halbert. Asymptotic theory for econometricians. Academic Press, 1984.

Xu, Yong and Lu, Guangming. Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, volume 2, pp. 1671–1676. IEEE, 2006.

Yang, Liu and Jin, Rong. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

Zhang, Harry. The optimality of naive Bayes. In American Association for Artificial Intelligence (AAAI), 2004.

Zhao, Wenyi, Chellappa, Rama, and Phillips, P Jonathon. Subspace linear discriminant analysis for face recognition. Citeseer, 1999.