Figure 1. Two Gaussian density functions where they are equal at the point x∗.

As we have P(A, B) = P(A|B) P(B), we can say:

P(error) = P(x > x∗ | x ∈ C1) P(x ∈ C1) + P(x < x∗ | x ∈ C2) P(x ∈ C2),    (5)

which we want to minimize:

minimize_{x∗}  P(error),    (6)

We take the derivative and set it to zero for the sake of minimization:

∂P(error)/∂x∗ = −f1(x∗) π1 + f2(x∗) π2 = 0
⟹ f1(x∗) π1 = f2(x∗) π2.    (12)

Another way to obtain this expression is to equate the posterior probabilities in order to obtain the equation of the boundary of the classes:

P(x ∈ C1 | X = x) = P(x ∈ C2 | X = x).    (13)

According to Bayes' rule, the posterior is:

P(x ∈ C1 | X = x) = P(X = x | x ∈ C1) P(x ∈ C1) / P(X = x)
= f1(x) π1 / Σ_{k=1}^{|C|} P(X = x | x ∈ Ck) πk,    (14)

and likewise for the second class; hence, Eq. (13) becomes:

f1(x) π1 / Σ_{i=1}^{|C|} P(X = x | x ∈ Ci) πi = f2(x) π2 / Σ_{i=1}^{|C|} P(X = x | x ∈ Ci) πi
⟹ f1(x) π1 = f2(x) π2.    (15)

Now let us think of the data as multivariate with dimensionality d. The PDF of the multivariate Gaussian distribution, x ∼ N(µ, Σ), is:

f(x; µ, Σ) = (1 / √((2π)^d |Σ|)) exp(−(x − µ)^⊤ Σ^{-1} (x − µ) / 2).    (16)

Therefore, Eq. (15) becomes:

(1 / √((2π)^d |Σ1|)) exp(−(x − µ1)^⊤ Σ1^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ2|)) exp(−(x − µ2)^⊤ Σ2^{-1} (x − µ2) / 2) π2,    (17)

where the distributions of the first and second class are N(µ1, Σ1) and N(µ2, Σ2), respectively.

3. Linear Discriminant Analysis for Binary Classification

In Linear Discriminant Analysis (LDA), we assume that the two classes have equal covariance matrices:

Σ1 = Σ2 = Σ.    (18)
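Before deriving the decision boundary, here is a quick numerical illustration of Eqs. (16)–(18). This sketch is not from the paper; it assumes NumPy and SciPy, with made-up means, a shared covariance matrix (as in Eq. (18)), and priors, and it evaluates the two scaled likelihoods of Eq. (17) at a query point to see which side is larger.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for two 2-dimensional Gaussian classes (Eq. 16),
# with a shared covariance matrix as assumed by LDA in Eq. (18).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
pi1, pi2 = 0.6, 0.4  # assumed priors

x = np.array([1.0, 0.5])  # a query point

# The two sides of Eq. (17): f1(x)*pi1 and f2(x)*pi2.
g1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * pi1
g2 = multivariate_normal.pdf(x, mean=mu2, cov=Sigma) * pi2

# The decision boundary is where the two sides are equal; otherwise the
# larger scaled likelihood determines the estimated class.
print("f1(x)*pi1 =", g1, "  f2(x)*pi2 =", g2)
print("estimated class:", 1 if g1 > g2 else 2)
```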
Therefore, Eq. (17) becomes:

(1 / √((2π)^d |Σ|)) exp(−(x − µ1)^⊤ Σ^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ|)) exp(−(x − µ2)^⊤ Σ^{-1} (x − µ2) / 2) π2,

⟹ exp(−(x − µ1)^⊤ Σ^{-1} (x − µ1) / 2) π1 = exp(−(x − µ2)^⊤ Σ^{-1} (x − µ2) / 2) π2,

(a)⟹ −(1/2)(x − µ1)^⊤ Σ^{-1} (x − µ1) + ln(π1) = −(1/2)(x − µ2)^⊤ Σ^{-1} (x − µ2) + ln(π2),

where (a) takes the natural logarithm of both sides of the equation.
We can simplify this term as:

(x − µ1)^⊤ Σ^{-1} (x − µ1) = (x^⊤ − µ1^⊤) Σ^{-1} (x − µ1)
= x^⊤ Σ^{-1} x − x^⊤ Σ^{-1} µ1 − µ1^⊤ Σ^{-1} x + µ1^⊤ Σ^{-1} µ1
(a)= x^⊤ Σ^{-1} x + µ1^⊤ Σ^{-1} µ1 − 2 µ1^⊤ Σ^{-1} x,    (19)

where (a) is because x^⊤ Σ^{-1} µ1 = µ1^⊤ Σ^{-1} x as it is a scalar, and Σ^{-1} is symmetric so Σ^{-⊤} = Σ^{-1}. Thus, we have:

−(1/2) x^⊤ Σ^{-1} x − (1/2) µ1^⊤ Σ^{-1} µ1 + µ1^⊤ Σ^{-1} x + ln(π1)
= −(1/2) x^⊤ Σ^{-1} x − (1/2) µ2^⊤ Σ^{-1} µ2 + µ2^⊤ Σ^{-1} x + ln(π2).

Therefore, if we multiply both sides of the equation by 2, we have:

2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) + 2 ln(π2 / π1) = 0,    (20)

which is the equation of a line in the form a^⊤ x + b = 0. Therefore, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary of classification is a line. Because of the linearity of the decision boundary which discriminates the two classes, this method is named linear discriminant analysis.
For obtaining Eq. (20), we brought the expressions to the side corresponding to the second class; therefore, if we take δ(x) : R^d → R to be the left-hand-side expression (function) in Eq. (20):

δ(x) := 2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) + 2 ln(π2 / π1),    (21)

the class of an instance x is estimated as:

Ĉ(x) = 1 if δ(x) < 0, and Ĉ(x) = 2 if δ(x) > 0.    (22)

If the priors of the two classes are equal, i.e., π1 = π2, Eq. (20) becomes:

2 (Σ^{-1} (µ2 − µ1))^⊤ x + (µ1 − µ2)^⊤ Σ^{-1} (µ1 + µ2) = 0,    (23)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

4. Quadratic Discriminant Analysis for Binary Classification

In Quadratic Discriminant Analysis (QDA), we relax the assumption of equality of the covariance matrices:

Σ1 ≠ Σ2,    (24)

which means the covariances are not necessarily equal (if they are actually equal, the decision boundary will be linear and QDA reduces to LDA).
Therefore, Eq. (17) becomes:

(1 / √((2π)^d |Σ1|)) exp(−(x − µ1)^⊤ Σ1^{-1} (x − µ1) / 2) π1 = (1 / √((2π)^d |Σ2|)) exp(−(x − µ2)^⊤ Σ2^{-1} (x − µ2) / 2) π2,

(a)⟹ −(d/2) ln(2π) − (1/2) ln(|Σ1|) − (1/2)(x − µ1)^⊤ Σ1^{-1} (x − µ1) + ln(π1)
= −(d/2) ln(2π) − (1/2) ln(|Σ2|) − (1/2)(x − µ2)^⊤ Σ2^{-1} (x − µ2) + ln(π2),

where (a) takes the natural logarithm of both sides of the equation. According to Eq. (19), we have:

−(1/2) ln(|Σ1|) − (1/2) x^⊤ Σ1^{-1} x − (1/2) µ1^⊤ Σ1^{-1} µ1 + µ1^⊤ Σ1^{-1} x + ln(π1)
= −(1/2) ln(|Σ2|) − (1/2) x^⊤ Σ2^{-1} x − (1/2) µ2^⊤ Σ2^{-1} µ2 + µ2^⊤ Σ2^{-1} x + ln(π2).

Therefore, if we multiply both sides of the equation by 2, we have:

x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) + 2 ln(π2 / π1) = 0,    (25)
which is in the quadratic form x^⊤ A x + b^⊤ x + c = 0. Therefore, if we consider Gaussian distributions for the two classes, the decision boundary of classification is quadratic. Because of the quadratic decision boundary which discriminates the two classes, this method is named quadratic discriminant analysis.
For obtaining Eq. (25), we brought the expressions to the side corresponding to the second class; therefore, if we take δ(x) : R^d → R to be the left-hand-side expression (function) in Eq. (25):

δ(x) := x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) + 2 ln(π2 / π1),    (26)

the class of an instance x is estimated as in Eq. (22).
If the priors of the two classes are equal, i.e., π1 = π2, Eq. (25) becomes:

x^⊤ (Σ1^{-1} − Σ2^{-1}) x + 2 (Σ2^{-1} µ2 − Σ1^{-1} µ1)^⊤ x + (µ1^⊤ Σ1^{-1} µ1 − µ2^⊤ Σ2^{-1} µ2) + ln(|Σ1| / |Σ2|) = 0,    (27)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

5. LDA and QDA for Multi-class Classification

Now we consider multiple classes, which can be more than two, indexed by k ∈ {1, . . . , |C|}. Recall Eq. (12) or (15), where we are using the scaled posterior, i.e., fk(x) πk. According to Eq. (16), we have:

fk(x) πk = (1 / √((2π)^d |Σk|)) exp(−(x − µk)^⊤ Σk^{-1} (x − µk) / 2) πk.

Taking the natural logarithm gives:

ln(fk(x) πk) = −(d/2) ln(2π) − (1/2) ln(|Σk|) − (1/2)(x − µk)^⊤ Σk^{-1} (x − µk) + ln(πk).

We drop the constant term −(d/2) ln(2π), which is the same for all classes (note that this term appears as a multiplicative factor before taking the logarithm). Thus, the scaled posterior of the k-th class becomes:

δk(x) := −(1/2) ln(|Σk|) − (1/2)(x − µk)^⊤ Σk^{-1} (x − µk) + ln(πk).    (28)

In QDA, the class of the instance x is estimated as:

Ĉ(x) = arg max_k δk(x),    (29)

because it maximizes the posterior of that class. In this expression, δk(x) is Eq. (28).
In LDA, we assume that the covariance matrices of the classes are equal:

Σ1 = · · · = Σ|C| = Σ.    (30)

Therefore, Eq. (28) becomes:

δk(x) = −(1/2) ln(|Σ|) − (1/2)(x − µk)^⊤ Σ^{-1} (x − µk) + ln(πk)
= −(1/2) ln(|Σ|) − (1/2) x^⊤ Σ^{-1} x − (1/2) µk^⊤ Σ^{-1} µk + µk^⊤ Σ^{-1} x + ln(πk).

We drop the constant terms −(1/2) ln(|Σ|) and −(1/2) x^⊤ Σ^{-1} x, which are the same for all classes (note that, before taking the logarithm, the former appears as the multiplicative factor |Σ|^{-1/2} and the latter inside an exponential factor). Thus, the scaled posterior of the k-th class becomes:

δk(x) := µk^⊤ Σ^{-1} x − (1/2) µk^⊤ Σ^{-1} µk + ln(πk).    (31)

In LDA, the class of the instance x is determined by Eq. (29), where δk(x) is Eq. (31), because it maximizes the posterior of that class.
In conclusion, QDA and LDA deal with maximizing the posterior of the classes but work with the likelihoods (class conditionals) and priors.

6. Estimation of Parameters in LDA and QDA

In LDA and QDA, we have several parameters which are required in order to calculate the posteriors. These parameters are the means and the covariance matrices of the classes and the priors of the classes.
The priors of the classes are very tricky to calculate. It is somewhat a chicken-and-egg problem because we want to know the class probabilities (priors) to estimate the class of an instance, but we do not have the priors and should estimate them. Usually, the prior of the k-th class is estimated according to the sample size of the k-th class:

π̂k = nk / n,    (32)

where nk and n are the number of training instances in the k-th class and in total, respectively. This estimation considers a Bernoulli distribution for choosing every instance out of the overall training set to be in the k-th class.
The mean of the k-th class can be estimated using Maximum Likelihood Estimation (MLE), or the Method of Moments (MOM), for the mean of a Gaussian distribution:

R^d ∋ µ̂k = (1 / nk) Σ_{i=1}^{n} xi I(C(xi) = k),    (33)
and the covariance matrix of the k-th class can be estimated as:

R^{d×d} ∋ Σ̂k = (1 / nk) Σ_{i=1}^{n} (xi − µ̂k)(xi − µ̂k)^⊤ I(C(xi) = k),    (34)

where I(·) denotes the indicator function.
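The following sketch is not from the paper; it is a minimal illustration, assuming NumPy and a toy labeled dataset X, y, of how Eqs. (28), (29), and (31)–(34) fit together: it estimates the priors, means, and covariance matrices from data and then classifies a query point with the QDA and LDA discriminants.

```python
import numpy as np

def estimate_parameters(X, y):
    """Estimate priors (Eq. 32), means (Eq. 33), and covariances (Eq. 34) per class."""
    classes = np.unique(y)
    priors, means, covs = {}, {}, {}
    n = len(y)
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n              # pi_hat_k = n_k / n
        means[k] = Xk.mean(axis=0)           # mu_hat_k
        diff = Xk - means[k]
        covs[k] = diff.T @ diff / len(Xk)    # Sigma_hat_k (MLE version)
    return priors, means, covs

def delta_qda(x, prior, mean, cov):
    """Scaled log-posterior of Eq. (28)."""
    d = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * d @ np.linalg.inv(cov) @ d
            + np.log(prior))

def delta_lda(x, prior, mean, shared_cov_inv):
    """Scaled log-posterior of Eq. (31) with a shared covariance."""
    return mean @ shared_cov_inv @ x - 0.5 * mean @ shared_cov_inv @ mean + np.log(prior)

# Toy data: two hypothetical Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 2], 1.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

priors, means, covs = estimate_parameters(X, y)
x_query = np.array([1.5, 1.0])

# QDA: pick the class maximizing Eq. (28), as in Eq. (29).
qda_class = max(priors, key=lambda k: delta_qda(x_query, priors[k], means[k], covs[k]))

# LDA: use one shared covariance (here simply the average of the class
# covariances) and maximize Eq. (31).
shared_cov_inv = np.linalg.inv(sum(covs.values()) / len(covs))
lda_class = max(priors, key=lambda k: delta_lda(x_query, priors[k], means[k], shared_cov_inv))

print("QDA class:", qda_class, " LDA class:", lda_class)
```

In practice, the shared covariance for LDA is usually a sample-size-weighted (pooled) average of the class covariances; the plain average above is only for brevity.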
As the next step, consider a more general case where the covariance matrices are not equal, as we have in QDA. We apply Singular Value Decomposition (SVD) to the covariance matrix of the k-th class:

Σk = Uk Λk Uk^⊤,

where the left and right matrices of singular vectors are equal because the covariance matrix is symmetric. Therefore:

Σk^{-1} = Uk Λk^{-1} Uk^⊤,

where Uk^{-1} = Uk^⊤ because it is an orthogonal matrix. Therefore, we can simplify the following term:

(x − µk)^⊤ Σk^{-1} (x − µk) = (x − µk)^⊤ Uk Λk^{-1} Uk^⊤ (x − µk)
= (Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1} (Uk^⊤ x − Uk^⊤ µk).

The Λk^{-1} can be decomposed as Λk^{-1} = Λk^{-1/2} Λk^{-1/2}. Therefore:

(Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1} (Uk^⊤ x − Uk^⊤ µk)
= (Uk^⊤ x − Uk^⊤ µk)^⊤ Λk^{-1/2} Λk^{-1/2} (Uk^⊤ x − Uk^⊤ µk)
(a)= (Λk^{-1/2} Uk^⊤ x − Λk^{-1/2} Uk^⊤ µk)^⊤ (Λk^{-1/2} Uk^⊤ x − Λk^{-1/2} Uk^⊤ µk),

where (a) is because Λk^{-⊤/2} = Λk^{-1/2}, as it is diagonal. We define the following transformation:

φk : x ↦ Λk^{-1/2} Uk^⊤ x,    (42)

which also results in the transformation of the mean: φk : µ ↦ Λk^{-1/2} Uk^⊤ µ. Therefore, Eq. (28) can be restated as:

δk(x) = −(1/2) ln(|Σk|) − (1/2) (φk(x) − φk(µk))^⊤ (φk(x) − φk(µk)) + ln(πk).    (43)

Ignoring the terms −(1/2) ln(|Σk|) and ln(πk), we can see that the transformation has changed the covariance matrix of the class to the identity matrix. Therefore, QDA (and also LDA) can be seen as a simple comparison of distances from the means of the classes after applying a transformation to the data of every class. In other words, we are learning the metric using the SVD of the covariance matrix of every class. Thus, LDA and QDA can be seen as metric learning (Yang & Jin, 2006; Kulis, 2013) in a perspective. Note that in metric learning, a valid distance metric is defined as (Yang & Jin, 2006):

d_A²(x, µk) := ||x − µk||_A² = (x − µk)^⊤ A (x − µk),    (44)

where A is a positive semi-definite matrix, i.e., A ⪰ 0. In QDA, we are also using (x − µk)^⊤ Σk^{-1} (x − µk). The covariance matrix is positive semi-definite according to the characteristics of covariance matrices. Moreover, according to the characteristics of a positive semi-definite matrix, the inverse of a positive semi-definite matrix is positive semi-definite, so Σk^{-1} ⪰ 0. Therefore, QDA is using metric learning (and, as will be discussed in the next section, it can be seen as a manifold learning method, too).
It is also noteworthy that QDA and LDA can be seen as using the Mahalanobis distance (McLachlan, 1999; De Maesschalck et al., 2000), which is also a metric:

d²(x, µ) := (x − µ)^⊤ Σ^{-1} (x − µ),    (45)

where Σ is the covariance matrix of the cloud of data whose mean is µ. The intuition of the Mahalanobis distance is that if we have several data clouds (e.g., classes), the distance from the class with larger variance should be scaled down because that class is taking more of the space, so it is more probable to happen. The scaling down shows in the inverse of the covariance matrix. Comparing (x − µk)^⊤ Σk^{-1} (x − µk) in QDA or LDA with Eq. (45) shows that QDA and LDA are, in a way, using the Mahalanobis distance.

8. LDA ≡ FDA?

In the previous section, we saw that LDA and QDA can be seen as metric learning. We know that metric learning can be seen as a family of manifold learning methods. We briefly explain the reason for this assertion: as A ⪰ 0, we can decompose it as A = U U^⊤. Therefore, Eq. (44) becomes:

||x − µk||_A² = (x − µk)^⊤ U U^⊤ (x − µk) = (U^⊤ x − U^⊤ µk)^⊤ (U^⊤ x − U^⊤ µk),

which means that metric learning can be seen as a comparison of simple Euclidean distances after the transformation φ : x ↦ U^⊤ x, which is a projection onto a subspace with projection matrix U. Thus, metric learning is a manifold learning approach. This gives a hint that Fisher Discriminant Analysis (FDA) (Fisher, 1936; Welling, 2005), which is a manifold learning approach (Tharwat et al., 2017), might have a connection to LDA; especially because the names FDA and LDA are often used interchangeably in the literature. Actually, other names of FDA are Fisher LDA (FLDA) and even LDA.
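To make the metric-learning view concrete, here is a minimal sketch, not from the paper, assuming NumPy and an arbitrary, hypothetical class covariance and mean. It implements the transformation φk of Eq. (42) through an eigendecomposition of Σk and checks numerically that the metric distance (x − µk)^⊤ Σk^{-1} (x − µk) equals the plain squared Euclidean distance between φk(x) and φk(µk).

```python
import numpy as np

# A hypothetical class covariance, mean, and query point.
Sigma_k = np.array([[4.0, 1.2],
                    [1.2, 2.0]])
mu_k = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

# Sigma_k = U_k Lambda_k U_k^T (eigendecomposition; it coincides with the SVD
# here because the covariance matrix is symmetric positive definite).
eigvals, U_k = np.linalg.eigh(Sigma_k)
Lambda_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

# The transformation phi_k of Eq. (42), applied to the point and to the mean.
phi = lambda v: Lambda_inv_sqrt @ U_k.T @ v

mahalanobis_sq = (x - mu_k) @ np.linalg.inv(Sigma_k) @ (x - mu_k)
euclidean_sq_after = np.sum((phi(x) - phi(mu_k)) ** 2)

print(mahalanobis_sq, euclidean_sq_after)  # the two values agree up to rounding
```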
We know that if we project (transform) the data of a class using a projection vector u ∈ R^d, i.e.:

x ↦ u^⊤ x,    (46)

for all data instances of the class, the mean and the covariance matrix of the class are transformed as:

µ ↦ u^⊤ µ,    (47)
Σ ↦ u^⊤ Σ u,    (48)

because of the characteristics of mean and variance.
The Fisher criterion (Xu & Lu, 2006) is the ratio of the between-class variance, σb², and the within-class variance, σw²:

f := σb² / σw² = (u^⊤ µ2 − u^⊤ µ1)² / (u^⊤ Σ2 u + u^⊤ Σ1 u) = (u^⊤ (µ2 − µ1))² / (u^⊤ (Σ2 + Σ1) u).    (49)

The FDA direction maximizes this criterion; writing the maximization as a Lagrangian L (with the denominator fixed to a constant) and taking the derivative, we have:

∂L/∂u = 2 (µ2 − µ1)(µ2 − µ1)^⊤ u − 2 λ (Σ2 + Σ1) u = 0
⟹ (µ2 − µ1)(µ2 − µ1)^⊤ u = λ (Σ2 + Σ1) u,

which is a generalized eigenvalue problem ((µ2 − µ1)(µ2 − µ1)^⊤, Σ2 + Σ1) according to (Ghojogh et al., 2019b). The projection vector is the eigenvector of (Σ2 + Σ1)^{-1} (µ2 − µ1)(µ2 − µ1)^⊤; therefore, we can say:

u ∝ (Σ2 + Σ1)^{-1} (µ2 − µ1).

In LDA, the equality of the covariance matrices is assumed. Thus, according to Eq. (18), we can say:

u ∝ (2Σ)^{-1} (µ2 − µ1) ∝ Σ^{-1} (µ2 − µ1).    (52)

According to Eq. (46), we have:

u^⊤ x ∝ (Σ^{-1} (µ2 − µ1))^⊤ x.    (53)

Comparing Eq. (53) with Eq. (23) shows that LDA and FDA are equivalent up to a scaling factor (µ1 − µ2)^⊤ Σ^{-1} (µ1 − µ2) (note that this term is multiplied as an exponential factor before taking the logarithm to obtain Eq. (23), so this term is a scaling factor). Hence, we can say:

LDA ≡ FDA.    (54)

In other words, FDA projects onto a subspace. On the other hand, according to Section 7, LDA can be seen as metric learning with a subspace where the Euclidean distance is used after projecting onto that subspace. The two subspaces of FDA and LDA are the same subspace. It should be noted that in manifold (subspace) learning, the scale does not matter because all the distances scale similarly.
Note that LDA assumes one (and not several) Gaussian for every class, and so does FDA. That is why FDA faces problems for multi-modal data (Sugiyama, 2007).

9. Relation to Logistic Regression

In logistic regression, the posterior is modeled directly as:

P(C(x) | X = x) = (exp(β^⊤ x′) / (1 + exp(β^⊤ x′)))^{C(x)} (1 / (1 + exp(β^⊤ x′)))^{1 − C(x)},    (55)

where C(x) ∈ {0, 1} for the two classes. Logistic regression considers the coefficient β as the parameter to be optimized and uses Newton's method (Boyd & Vandenberghe, 2004) for the optimization. Therefore, in summary, logistic regression makes an assumption on the posterior, while LDA and QDA make assumptions on the likelihood and prior.

10. Relation to Bayes Optimal Classifier and Gaussian Naive Bayes

The Bayes classifier maximizes the posteriors of the classes (Murphy, 2012):

Ĉ(x) = arg max_k P(x ∈ Ck | X = x).    (56)
According to Eq. (14) and Bayes' rule, we have:

P(x ∈ Ck | X = x) = P(X = x | x ∈ Ck) πk / P(X = x),    (57)

where the denominator of the posterior (the marginal), which is:

P(X = x) = Σ_{r=1}^{|C|} P(X = x | x ∈ Cr) πr,

is ignored because it does not depend on the classes C1 to C|C|.
According to Eq. (57), the posterior can be written in terms of the likelihood and the prior; therefore, Eq. (56) can be restated as:

Ĉ(x) = arg max_k πk P(X = x | x ∈ Ck).    (58)

Note that the Bayes classifier does not make any assumption on the posterior, prior, and likelihood, unlike LDA and QDA, which assume a uni-modal Gaussian distribution for the likelihood (and we may assume a Bernoulli distribution for the prior in LDA and QDA according to Eq. (32)). Therefore, we can say the difference between Bayes and QDA is in the assumption of a uni-modal Gaussian distribution for the likelihood (class conditional); hence, if the likelihoods are already uni-modal Gaussian, the Bayes classifier reduces to QDA. Likewise, the difference between Bayes and LDA is in the assumption of a Gaussian distribution for the likelihood (class conditional) and the equality of the covariance matrices of the classes; thus, if the likelihoods are already Gaussian and the covariance matrices are already equal, the Bayes classifier reduces to LDA.
It is noteworthy that the Bayes classifier is an optimal classifier because it can be seen as an ensemble of hypotheses (models) in the hypothesis (model) space, and no other ensemble of hypotheses can outperform it (see Chapter 6, Page 175 in (Mitchell, 1997)). In the literature, it is referred to as the Bayes optimal classifier. To better formulate the explained statements, the Bayes optimal classifier estimates the class as:

Ĉ(x) = arg max_{Ck ∈ C} Σ_{hj ∈ H} P(Ck | hj) P(hj | D),    (59)

where C := {C1, . . . , C|C|}, D := {xi}_{i=1}^{n} is the training set, hj is a hypothesis for estimating the class of instances, and H is the hypothesis space including all possible hypotheses.
According to Bayes' rule, similar to what we had for Eq. (57), we have:

P(hj | D) ∝ P(D | hj) P(hj).

Therefore, Eq. (59) becomes (Mitchell, 1997):

Ĉ(x) = arg max_{Ck ∈ C} Σ_{hj ∈ H} P(Ck | hj) P(D | hj) P(hj).    (60)

In conclusion, the Bayes classifier is optimal. Therefore, if the likelihoods of the classes are Gaussian, QDA is an optimal classifier, and if the likelihoods are Gaussian and the covariance matrices are equal, LDA is an optimal classifier. Often, the distributions in natural life are Gaussian; especially, because of the central limit theorem (Hazewinkel, 2001), the summation of independent and identically distributed (iid) variables is Gaussian, and signals usually add in the real world. This explains why LDA and QDA are very effective classifiers in machine learning. We also saw that FDA is equivalent to LDA. Thus, the reason for the effectiveness of the powerful FDA classifier becomes clear. We have seen the very successful performance of FDA and LDA in different applications, such as face recognition (Belhumeur et al., 1997; Etemad & Chellappa, 1997; Zhao et al., 1999), action recognition (Ghojogh et al., 2017; Mokari et al., 2018), and EEG classification (Malekmohammadi et al., 2019).
Implementing the Bayes classifier is difficult in practice, so we approximate it by naive Bayes (Zhang, 2004). If xj denotes the j-th dimension (feature) of x = [x1, . . . , xd]^⊤, Eq. (58) is restated as:

Ĉ(x) = arg max_k πk P(x1, x2, . . . , xd | x ∈ Ck).    (61)

The term P(x1, x2, . . . , xd | x ∈ Ck) is very difficult to compute as the features are possibly correlated. Naive Bayes relaxes this possibility and naively assumes that the features are conditionally independent (⊥⊥) when they are conditioned on the class:

P(x1, x2, . . . , xd | x ∈ Ck) ≈ P(x1 | Ck) P(x2 | Ck) · · · P(xd | Ck) = Π_{j=1}^{d} P(xj | Ck).

Therefore, Eq. (61) becomes:

Ĉ(x) = arg max_k πk Π_{j=1}^{d} P(xj | Ck).    (62)

In Gaussian naive Bayes, a univariate Gaussian distribution is assumed for the likelihood (class conditional) of every feature:

P(xj | Ck) = (1 / √(2π σk²)) exp(−(xj − µk)² / (2 σk²)),    (63)
where the mean and unbiased variance are estimated as:

R ∋ µ̂k = (1 / nk) Σ_{i=1}^{n} xi,j I(C(xi) = k),    (64)

R ∋ σ̂k² = (1 / (nk − 1)) Σ_{i=1}^{n} (xi,j − µ̂k)² I(C(xi) = k),    (65)

where xi,j denotes the j-th feature of the i-th training instance. The prior can again be estimated using Eq. (32).
According to Eqs. (62) and (63), Gaussian naive Bayes is equivalent to QDA where the covariance matrices are diagonal, i.e., the off-diagonal entries of the covariance matrices are ignored. Therefore, we can say that QDA is more powerful than Gaussian naive Bayes because Gaussian naive Bayes is a simplified version of QDA. Moreover, it is obvious that Gaussian naive Bayes and QDA are equivalent for one-dimensional data. Compared to LDA, Gaussian naive Bayes is equivalent to LDA if the covariance matrices are diagonal and they are all equal, i.e., σ1² = · · · = σ|C|²; therefore, LDA and Gaussian naive Bayes have their own assumptions, one on the off-diagonal of the covariance matrices and the other one on the equality of the covariance matrices. As Gaussian naive Bayes has some level of optimality (Zhang, 2004), it becomes clear why LDA and QDA are such effective classifiers.

11. Relation to Likelihood Ratio Test

Consider two hypotheses for estimating some parameter, a null hypothesis H0 and an alternative hypothesis HA. The probability P(reject H0 | H0) is called the type 1 error, false positive error, or false alarm error. The probability P(accept H0 | HA) is called the type 2 error or false negative error. The P(reject H0 | H0) is also called the significance level, while 1 − P(accept H0 | HA) = P(reject H0 | HA) is called the power.
If L(θA) and L(θ0) are the likelihoods (probabilities) for the alternative and null hypotheses, the likelihood ratio is:

Λ = L(θA) / L(θ0) = f(x; θA) / f(x; θ0).    (66)

The Likelihood Ratio Test (LRT) (Casella & Berger, 2002) rejects H0 in favor of HA if the likelihood ratio is greater than a threshold, i.e., Λ ≥ t. The LRT is a very effective statistical test because, according to the Neyman-Pearson lemma (Neyman & Pearson, 1933), it has the largest power among all statistical tests with the same significance level.
If the sample size is large, n → ∞, and θA is estimated using MLE, the logarithm of the likelihood ratio asymptotically has a χ² distribution under the null hypothesis (White, 1984; Casella & Berger, 2002):

2 ln(Λ) ∼ χ²_{(df)}  under H0,    (67)

where the degree of freedom of the χ² distribution is df := dim(HA) − dim(H0) and dim(·) is the number of unspecified parameters in the hypothesis.
There is a connection between LDA or QDA and the LRT (Lachenbruch & Goldstein, 1979). Recall Eq. (12) or (15), which can be restated as:

f2(x) π2 / (f1(x) π1) = 1,    (68)

which is for the decision boundary. Eq. (22) dealt with the difference of f2(x) π2 and f1(x) π1; however, here we are dealing with their ratio. Recall Fig. 1 where, if we move x∗ to the right and left, the ratio f2(x∗) π2 / f1(x∗) π1 decreases and increases, respectively, because the probabilities of the first and second class happening change. In other words, moving x∗ changes the significance level and the power. Therefore, Eq. (68) can be used to have a statistical test where the posteriors are used in the ratio, as we also used posteriors in LDA and QDA. The null/alternative hypothesis can be considered to be the mean and covariance of the first/second class. In other words, the two hypotheses say that the point belongs to a specific class. Hence, if the ratio is larger than a value t, the instance x is estimated to belong to the second class; otherwise, the first class is chosen. According to Eq. (16), Eq. (68) becomes:

(|Σ2|)^{-1/2} exp(−(1/2)(x − µ2)^⊤ Σ2^{-1} (x − µ2)) π2 / ((|Σ1|)^{-1/2} exp(−(1/2)(x − µ1)^⊤ Σ1^{-1} (x − µ1)) π1) ≥ t,    (69)

for QDA. In LDA, the covariance matrices are equal, so:

exp(−(1/2)(x − µ2)^⊤ Σ^{-1} (x − µ2)) π2 / (exp(−(1/2)(x − µ1)^⊤ Σ^{-1} (x − µ1)) π1) ≥ t.    (70)

As can be seen, changing the priors impacts the ratio, as expected. Moreover, the value of t can be chosen according to the desired significance level in the χ² distribution using the χ² table. Eqs. (69) and (70) show the relation of LDA and QDA with the LRT. As the LRT has the largest power (Neyman & Pearson, 1933), the effectiveness of LDA and QDA in classification is explained from a hypothesis testing point of view.

12. Simulations

In this section, we report some simulations which make the concepts of the tutorial clearer by illustration.

12.1. Experiments with Equal Class Sample Sizes

We created a synthetic dataset of three classes, each of which is a two-dimensional Gaussian distribution. The means and covariance matrices of the three Gaussians from
Figure 3. The synthetic dataset: (a) three classes each with size 200, (b) two classes each with size 200, (c) three classes each with size
10, (d) two classes each with size 10, (e) three classes with sizes 200, 100, and 10, (f) two classes with sizes 200 and 10, and (g) two
classes with sizes 400 and 200 where the larger class has two modes.
which the class samples were randomly drawn are:

µ1 = [−4, 4]^⊤,  µ2 = [3, −3]^⊤,  µ3 = [−3, 3]^⊤,

Σ1 = [10, 1; 1, 5],  Σ2 = [3, 0; 0, 4],  Σ3 = [6, 1.5; 1.5, 4].

The three classes are shown in Fig. 3-a, where each has sample size 200. Experiments were performed on the three classes. We also performed experiments on two of the three classes to test binary classification. The two classes are shown in Fig. 3-b. The LDA, QDA, naive Bayes, and
Figure 4. Experiments with equal class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
Bayes classifications of the two and three classes are shown in Fig. 4. For both binary and ternary classification with LDA and QDA, we used Eqs. (31) and (28), respectively, with Eq. (29). We also estimated the mean and covariance using Eqs. (33), (35), and (36). For Gaussian naive Bayes, we used Eqs. (62) and (63) and estimated the parameters using Eqs. (64) and (65). For the Bayes classifier, we used Eq. (58) with Eq. (63), but we do not estimate the mean and variance; rather, in order to use the exact likelihoods in Eq. (58), we use the exact mean and covariance matrices
Figure 5. Experiments with small class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
of the distributions which we sampled from. We, however, estimated the priors. The priors were estimated using Eq. (32) for all the classifiers.
As can be seen in Fig. 4, the space is partitioned into two/three parts, and this validates the assertion that LDA and QDA can be considered as metric learning methods, as discussed in Section 7. As expected, the boundaries of LDA and QDA are linear and curvy (quadratic), respectively. The results of QDA, Gaussian naive Bayes, and Bayes are very similar, although they have slight differences. This is because the classes are already Gaussian, so if the estimates of the means and covariance matrices are accurate enough, QDA and Bayes are equivalent. The classes are Gaussians, and the off-diagonal elements of the covariance
Figure 6. Experiments with different class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for
two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes,
and (h) Bayes for three classes.
matrices are also small compared to the diagonal; therefore, naive Bayes also behaves similarly.

12.2. Experiments with Small Class Sample Sizes

According to Monte-Carlo approximation (Robert & Casella, 2013), the estimates in Eqs. (33), (35), (64) and (65) are more accurate if the sample size goes to infinity, i.e., n → ∞. Therefore, if the sample size is small, we expect more difference between the QDA and Bayes classifiers. We made a synthetic dataset with three or two classes with the same mentioned means and covariance matrices. The sample size of every class was 10. Figures 3-c and 3-d show these datasets. The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are
Figure 7. Experiments with multi-modal data: (a) LDA, (b) QDA, (c) Gaussian naive Bayes, and (d) Bayes.
shown in Fig. 5. As can be seen, now the results of QDA, Gaussian naive Bayes, and Bayes are different, for the reason explained.

12.3. Experiments with Different Class Sample Sizes

According to Eq. (32), used in Eqs. (28), (31), (58), and (62), the prior of a class changes with the sample size of the class. In order to see the effect of sample size, we made a synthetic dataset with different class sizes, i.e., 200, 100, and 10, shown in Figs. 3-e and 3-f. We used the same mentioned means and covariance matrices. The results are shown in Fig. 6. As can be seen, the class with the small sample size has covered a small portion of the space in discrimination, which is expected because its prior is small according to Eq. (32); therefore, its posterior is small. On the other hand, the class with the large sample size has covered a larger portion because of a larger prior.

12.4. Experiments with Multi-Modal Data

As mentioned in Section 8, LDA and QDA assume a uni-modal Gaussian distribution for every class, and thus FDA or LDA faces problems for multi-modal data (Sugiyama, 2007). For testing this, we made a synthetic dataset with two classes, one with sample size 400 having two modes of Gaussians and the other with sample size 200 having one mode. We again used the same mentioned means and covariance matrices. The dataset is shown in Fig. 3-g.
The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 7. The mean and covariance matrix of the larger class, although it has two modes, were estimated using Eqs. (33), (35), (64) and (65) in LDA, QDA, and Gaussian naive Bayes. However, for the likelihood used in the Bayes classifier, i.e., in Eq. (58), we need to know the exact multi-modal distribution. Therefore, we fit a mixture of two Gaussians (Ghojogh et al., 2019a) to the data of the larger class:

P(X = x | x ∈ Ck) = Σ_{k=1}^{2} wk f(x; µk, Σk),    (71)

where f(x; µk, Σk) is Eq. (16) and we the fitted parame-
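As an illustration of Eq. (71), the following sketch is not from the paper; it assumes scikit-learn and hypothetical two-mode data for the larger class, fits a two-component Gaussian mixture, and evaluates its density, which then plays the role of the class-conditional likelihood in the Bayes rule of Eq. (58).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical two-mode data for the larger class (Eq. 71).
rng = np.random.default_rng(1)
X_large = np.vstack([rng.normal([-4, 4], 1.5, size=(200, 2)),
                     rng.normal([3, -3], 1.5, size=(200, 2))])

# Fit a mixture of two Gaussians; this estimates the weights w_k,
# means mu_k, and covariances Sigma_k of Eq. (71).
gmm = GaussianMixture(n_components=2).fit(X_large)

# The mixture density P(X = x | x in C_k) of Eq. (71) at a query point;
# score_samples returns the log-density, so we exponentiate it.
x_query = np.array([[0.0, 0.0]])
likelihood = np.exp(gmm.score_samples(x_query))[0]

# Multiplied by the class prior, this plays the role of
# pi_k * P(X = x | x in C_k) in the Bayes classifier of Eq. (58).
print(likelihood)
```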
Kleinbaum, David G, Dietz, K, Gail, M, Klein, Mitchel, and Klein, Mitchell. Logistic regression. Springer, 2002.

Kulis, Brian. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

Lachenbruch, Peter A and Goldstein, M. Discriminant analysis. Biometrics, pp. 69–85, 1979.

Li, Yongmin, Gong, Shaogang, and Liddell, Heather. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21(13-14):1077–1086, 2003.

Lu, Juwei, Plataniotis, Konstantinos N, and Venetsanopoulos, Anastasios N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1):117–126, 2003.

Malekmohammadi, Alireza, Mohammadzade, Hoda, Chamanzar, Alireza, Shabany, Mahdi, and Ghojogh, Benyamin. An efficient hardware implementation for a motor imagery brain computer interface system. Scientia Iranica, 26:72–94, 2019.

McLachlan, Goeffrey J. Mahalanobis distance. Resonance, 4(6):20–26, 1999.

Mitchell, Thomas. Machine learning. McGraw Hill Higher Education, 1997.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3d skeleton data using body states. Scientia Iranica, 2018.

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT Press, 2012.

Neyman, Jerzy and Pearson, Egon Sharpe. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

Robert, Christian and Casella, George. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Strange, Harry and Zwiggelaar, Reyer. Open Problems in Spectral Dimensionality Reduction. Springer, 2014.

Sugiyama, Masashi. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8(May):1027–1061, 2007.

Tharwat, Alaa, Gaber, Tarek, Ibrahim, Abdelhameed, and Hassanien, Aboul Ella. Linear discriminant analysis: A detailed tutorial. AI Communications, 30(2):169–190, 2017.

Welling, Max. Fisher linear discriminant analysis. Technical report, University of Toronto, Toronto, Ontario, Canada, 2005.

White, Halbert. Asymptotic theory for econometricians. Academic Press, 1984.

Xu, Yong and Lu, Guangming. Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, volume 2, pp. 1671–1676. IEEE, 2006.

Yang, Liu and Jin, Rong. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

Zhang, Harry. The optimality of naive Bayes. In American Association for Artificial Intelligence (AAAI), 2004.

Zhao, Wenyi, Chellappa, Rama, and Phillips, P Jonathon. Subspace linear discriminant analysis for face recognition. Citeseer, 1999.