Linear Discriminant Analysis (LDA)
Module No. 2: Training Models / Regression and Classification
Linear Regression, Multivariate Regression, Subset Selection, Shrinkage Methods, Principal Component Regression, Partial Least Squares, Linear Classification, Logistic Regression, LDA, K-Nearest Neighbor Learning.
Linear Discriminant Analysis
• Linear discriminant analysis (LDA) is an approach used in supervised
machine learning to solve multi-class classification problems.
• LDA separates multiple classes with multiple features through data
dimensionality reduction.
• Linear discriminant analysis is also known as normal discriminant analysis (NDA) or discriminant function analysis (DFA).
• LDA works by identifying a linear combination of features that separates or characterizes two or more classes of objects or events (a usage sketch follows below).
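As a quick illustration, the sketch below applies scikit-learn's LinearDiscriminantAnalysis to a small two-class dataset, using it both as a classifier and as a supervised dimensionality-reduction step; the toy data simply reuses the worked example that appears later in this module, and the variable names are choices made for this sketch.

# Minimal sketch: LDA as a classifier and as supervised dimensionality reduction.
# Assumes scikit-learn is available; the data is the two-class example used later in the slides.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4],       # class 0 samples
              [9, 10], [6, 8], [9, 5], [8, 7], [10, 8]],    # class 1 samples
             dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)  # project onto a single discriminant axis
z = lda.fit_transform(X, y)                       # 1-D projection of every sample

print("projected samples:", z.ravel())
print("predicted labels:", lda.predict(X))

The direction found here should agree, up to sign and scaling, with the Fisher direction derived by hand in the slides that follow.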
LDA Objective
• The objective of LDA is to perform dimensionality reduction …
– So what, PCA does this …
• However, we want to preserve as much of the class
discriminatory information as possible.
– OK, that’s new, let’s delve deeper …
Recall … PCA
• In PCA, the main idea is to re-express the available dataset so as to extract the relevant information by reducing redundancy and minimizing noise.
• We didn’t care whether this dataset represents features from one or more classes, i.e. the discrimination power was not taken into consideration while we were talking about PCA.
• In PCA, we had a dataset matrix X of dimensions m×n, whose n columns are the data samples and whose rows correspond to the m features (each sample is an m-dimensional data vector).
• We first subtracted the mean to obtain a zero-mean dataset, then we computed the covariance matrix $S_x = XX^T$.
• Eigenvalues and eigenvectors were then computed for $S_x$. The new basis vectors are the eigenvectors with the highest eigenvalues, where the number of retained vectors was our choice.
• Thus, using the new basis, we can project the dataset onto a lower-dimensional space with a more powerful data representation (see the sketch below).
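For concreteness, here is a minimal sketch of the PCA recap above in NumPy, assuming the m×n layout described (one sample per column); the random data matrix is a placeholder, not part of the slides.

# Minimal PCA sketch following the recap above: zero-mean the data,
# form S_x = X X^T, and take the top eigenvectors as the new basis.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))           # m = 5 features, n = 100 samples (one sample per column)

X = X - X.mean(axis=1, keepdims=True)   # subtract the mean of each feature (zero-mean dataset)
S_x = X @ X.T                           # covariance matrix (up to a constant factor)

eigvals, eigvecs = np.linalg.eigh(S_x)  # S_x is symmetric; eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # strongest directions first
W = eigvecs[:, order[:2]]               # keep, say, the two eigenvectors with highest eigenvalues

Y = W.T @ X                             # project the dataset onto the lower-dimensional basis
print(Y.shape)                          # (2, 100)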
PCA vs LDA
LDA … Two Classes
• In order to find a good projection vector, we need to define a measure
of separation between the projections.
• The mean vector of each class in x and y feature space is:
$\mu_i = \dfrac{1}{N_i}\sum_{x \in \omega_i} x$  and  $\tilde{\mu}_i = \dfrac{1}{N_i}\sum_{y \in \omega_i} y = \dfrac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$
– i.e. projecting x to y will lead to projecting the mean of x to the mean of y.
• We could then choose the distance between the projected means as our
objective function
$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$
LDA … Two Classes
• However, the distance between the projected means is not a very good
measure since it does not take into account the standard deviation within
the classes.
[Figure: two candidate projection axes; one has a larger distance between the projected means, the other yields better class separability]
LDA … Two Classes
• The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class variability, or the so-called scatter.
• For each class we define the scatter, an equivalent of the variance, as the sum of squared differences between the projected samples and their class mean:
$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$
• $\tilde{s}_i^2$ measures the variability within class $\omega_i$ after projecting it onto the y-space.
• Thus $\tilde{s}_1^2 + \tilde{s}_2^2$ measures the variability within the two classes at hand after projection; hence it is called the within-class scatter of the projected samples.
LDA … Two Classes
• The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function (the distance between the projected means normalized by the within-class scatter of the projected samples):
$J(w) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
• Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
LDA … Two Classes
• In order to find the optimum projection w*, we need to express
J(w) as an explicit function of w.
• We will define measures of the scatter in the multivariate feature space x, denoted scatter matrices:
$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$
$S_W = S_1 + S_2$
• Where $S_i$ is the scatter matrix of class $\omega_i$ (proportional to its covariance matrix), and $S_W$ is called the within-class scatter matrix (a small numeric sketch follows below).
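As a small sketch of these definitions (the helper name scatter is just a label chosen here), using the two-class data from the worked example later in this module:

# Sketch: within-class scatter matrices built directly from their definitions above.
import numpy as np

X1 = np.array([[4., 2.], [2., 4.], [2., 3.], [3., 6.], [4., 4.]])    # class omega_1, one sample per row
X2 = np.array([[9., 10.], [6., 8.], [9., 5.], [8., 7.], [10., 8.]])  # class omega_2

def scatter(Xc):
    # S_i = sum over the class samples of (x - mu_i)(x - mu_i)^T
    mu = Xc.mean(axis=0)
    D = Xc - mu
    return D.T @ D

S1, S2 = scatter(X1), scatter(X2)
S_W = S1 + S2          # within-class scatter matrix
print(S_W)

Note that this computes the raw scatter (a sum); the worked example below reports sample covariances, i.e. these matrices divided by N_i - 1 = 4, which only rescales S_W and does not change the LDA direction.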
LDA … Two Classes
• Now, the scatter of the projection y can be expressed as a function of the scatter matrices in feature space x:
$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$
$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S_W w = \tilde{S}_W$
• Where $\tilde{S}_W$ is the within-class scatter of the projected samples y.
LDA … Two Classes
• Similarly, the difference between the projected means (in y-space) can be expressed in
terms of the means in the original feature space (x-space).
$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
• The matrix $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter of the original samples/feature vectors, while $\tilde{S}_B = w^T S_B w$ is the between-class scatter of the projected samples y.
• Since SB is the outer product of two vectors, its rank is at most one.
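A quick numeric check of this rank remark, reusing the class means from the worked example later in this module:

# The between-class scatter is an outer product of one vector with itself,
# so its rank is at most one.
import numpy as np

d = np.array([3.0, 3.8]) - np.array([8.4, 7.6])   # mu_1 - mu_2 from the worked example
S_B = np.outer(d, d)
print(np.linalg.matrix_rank(S_B))                 # prints 1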
LDA … Two Classes
• We can finally express the Fisher criterion in terms of SW and SB
as:
$J(w) = \dfrac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \dfrac{w^T S_B w}{w^T S_W w}$
• Hence J(w) is a measure of the difference between the class means (encoded in the between-class scatter matrix) normalized by a measure of the within-class scatter.
LDA … Two Classes
• To find the maximum of J(w), we differentiate and equate to zero:
$\dfrac{d}{dw} J(w) = \dfrac{d}{dw}\left[\dfrac{w^T S_B w}{w^T S_W w}\right] = 0$
$\Rightarrow (w^T S_W w)\,\dfrac{d(w^T S_B w)}{dw} - (w^T S_B w)\,\dfrac{d(w^T S_W w)}{dw} = 0$
$\Rightarrow (w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0$
• Dividing by $2\, w^T S_W w$:
$\left(\dfrac{w^T S_W w}{w^T S_W w}\right) S_B w - \left(\dfrac{w^T S_B w}{w^T S_W w}\right) S_W w = 0$
$S_B w - J(w)\, S_W w = 0$
$S_W^{-1} S_B w = J(w)\, w$
LDA … Two Classes
• Solving the generalized eigenvalue problem
$S_W^{-1} S_B w = \lambda w$, where $\lambda = J(w)$ is a scalar,
yields
$w^* = \arg\max_w \dfrac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)$
• This is known as Fisher’s Linear Discriminant, although it is not a discriminant
but rather a specific choice of direction for the projection of the data down
to one dimension.
• Using the same notation as PCA, the solution will be the eigenvector(s) of $S_X = S_W^{-1} S_B$ (a minimal implementation sketch follows below).
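Putting the two-class result together, here is a minimal sketch of Fisher's discriminant via the closed form w* = S_W^{-1}(mu_1 - mu_2); the function name fisher_lda and the one-sample-per-row input layout are choices made for this sketch, not part of the slides.

# Sketch of the two-class Fisher discriminant: w* = S_W^{-1} (mu_1 - mu_2).
import numpy as np

def fisher_lda(X1, X2):
    # X1, X2: arrays of shape (N_i, m), one sample per row.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # within-class scatter of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # within-class scatter of class 2
    S_W = S1 + S2
    w = np.linalg.solve(S_W, mu1 - mu2)   # S_W^{-1} (mu_1 - mu_2) without forming the inverse
    return w / np.linalg.norm(w)          # only the direction matters; scale does not change J(w)

# Projecting a data matrix X (one sample per row) onto the discriminant axis: y = X @ w.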
LDA … Two Classes - Example
• Compute the Linear Discriminant projection for the following two-dimensional dataset:
– Samples for class ω1: X1 = (x1, x2) = {(4,2), (2,4), (2,3), (3,6), (4,4)}
– Samples for class ω2: X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
[Scatter plot of the two classes in the (x1, x2) plane]
LDA … Two Classes - Example
• The class means are:
$\mu_1 = \dfrac{1}{N_1}\sum_{x \in \omega_1} x = \dfrac{1}{5}\begin{bmatrix} 4+2+2+3+4 \\ 2+4+3+6+4 \end{bmatrix} = \begin{bmatrix} 3 \\ 3.8 \end{bmatrix}$
$\mu_2 = \dfrac{1}{N_2}\sum_{x \in \omega_2} x = \dfrac{1}{5}\begin{bmatrix} 9+6+9+8+10 \\ 10+8+5+7+8 \end{bmatrix} = \begin{bmatrix} 8.4 \\ 7.6 \end{bmatrix}$
LDA … Two Classes - Example
• Covariance matrix of the first class (the sample covariance, i.e. the scatter divided by $N_1 - 1 = 4$; this rescaling does not affect the LDA direction):
$S_1 = \dfrac{1}{N_1 - 1}\sum_{x \in \omega_1} (x - \mu_1)(x - \mu_1)^T = \begin{bmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{bmatrix}$
LDA … Two Classes - Example
• Covariance matrix of the second class:
$S_2 = \dfrac{1}{N_2 - 1}\sum_{x \in \omega_2} (x - \mu_2)(x - \mu_2)^T = \begin{bmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{bmatrix}$
LDA … Two Classes - Example
• Within-class scatter matrix:
$S_W = S_1 + S_2 = \begin{bmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{bmatrix} + \begin{bmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{bmatrix} = \begin{bmatrix} 3.3 & -0.3 \\ -0.3 & 5.5 \end{bmatrix}$
LDA … Two Classes - Example
• Between-class scatter matrix:
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{bmatrix} -5.4 \\ -3.8 \end{bmatrix}\begin{bmatrix} -5.4 & -3.8 \end{bmatrix} = \begin{bmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{bmatrix}$
LDA … Two Classes - Example
• The LDA projection is then obtained as the solution of the generalized eigenvalue problem:
$S_W^{-1} S_B w^* = \lambda w^*$
LDA … Two Classes - Example
• The optimal projection is the one that gives the maximum $\lambda = J(w)$ (a numeric check follows below).
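As a numeric check of the worked example, the sketch below solves the eigenvalue problem with NumPy; the printed values are approximate and computed from the data above.

# Numeric check of the worked example: build S_W and S_B from the data,
# then solve S_W^{-1} S_B w = lambda w.
import numpy as np

X1 = np.array([[4., 2.], [2., 4.], [2., 3.], [3., 6.], [4., 4.]])
X2 = np.array([[9., 10.], [6., 8.], [9., 5.], [8., 7.], [10., 8.]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)             # sample covariance (divides by N - 1), as in the slides
S2 = np.cov(X2, rowvar=False)
S_W = S1 + S2
S_B = np.outer(mu1 - mu2, mu1 - mu2)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
k = np.argmax(eigvals.real)               # S_B has rank 1, so only one eigenvalue is non-zero
lam, w = eigvals.real[k], eigvecs.real[:, k]

print(S_W)   # approximately [[3.3, -0.3], [-0.3, 5.5]]
print(lam)   # lambda = J(w*) is approximately 12.2
print(w)     # direction approximately +/- [0.91, 0.42]

The closed form w* = S_W^{-1}(mu_1 - mu_2) gives the same direction up to sign and scale.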
LDA - Projection