Lecture 9
Anisotropic Gaussians and GDA

KASHIF JAVED
EED, UET, Lahore

Readings:
▪ https://people.eecs.berkeley.edu/~jrs/189/
Anisotropic Normal Distribution
• Recall from our last lecture the probability density function of the
  multivariate normal distribution in its full generality. x and μ are d-vectors.

• Normal PDF: f(x) = n(q(x))

• n(q) = (1/√((2π)^d |Σ|)) exp(−q/2),    q(x) = (x − μ)^T Σ^−1 (x − μ)

• n: ℝ → ℝ, exponential;    q: ℝ^d → ℝ, quadratic
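For concreteness, here is a minimal NumPy sketch that evaluates this density directly from the formula above; the function name and example values are illustrative, not from the slides.

```python
import numpy as np

def anisotropic_gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate normal density f(x) = n(q(x))."""
    d = len(mu)
    diff = x - mu
    # q(x) = (x - mu)^T Sigma^{-1} (x - mu), the squared Mahalanobis distance
    q = diff @ np.linalg.solve(Sigma, diff)
    # n(q) = exp(-q/2) / sqrt((2*pi)^d * |Sigma|)
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-q / 2) / norm_const

# Example: an anisotropic Gaussian in 2D
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
print(anisotropic_gaussian_pdf(np.array([1.0, 0.5]), mu, Sigma))
```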
Anisotropic Normal Distribution
• The covariance matrix Σ, its symmetric square root, and its inverse all play
  roles in our intuition about the multivariate normal distribution.

• Consider their eigendecompositions.

• Σ = V Γ V^T
  ↑ the eigenvalues of Σ are the variances along the eigenvectors, Γ_ii = σ_i²
Anisotropic Normal Distribution
• Σ^1/2 = V Γ^1/2 V^T maps spheres to ellipsoids
  ↑ the eigenvalues of Σ^1/2 are the Gaussian widths / ellipsoid radii / standard
  deviations, Γ_ii = σ_i

[Figure: the unit sphere is mapped to an ellipsoid with radii σ_1 and σ_2 along the eigenvector directions.]
Anisotropic Normal Distribution

• Σ^−1 = V Γ^−1 V^T is the precision matrix (metric tensor)
  ↑ the quadratic form of Σ^−1 defines the isocontours
The isocontours of the multivariate normal distribution are the same as the
isocontours of the quadratic form of the precision matrix Σ^−1.
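A small NumPy sketch of these three eigendecomposition-based matrices; the example covariance is illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition Sigma = V Gamma V^T (eigh: Sigma is symmetric)
gamma, V = np.linalg.eigh(Sigma)

# Symmetric square root: maps the unit sphere to the ellipsoidal isocontour
Sigma_sqrt = V @ np.diag(np.sqrt(gamma)) @ V.T

# Precision matrix: its quadratic form defines the isocontours
Sigma_inv = V @ np.diag(1.0 / gamma) @ V.T

print(np.allclose(Sigma_sqrt @ Sigma_sqrt, Sigma))   # True
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))  # True
```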
Maximum Likelihood Estimation for
Anisotropic Gaussians
• Given sample points X_1, X_2, . . . , X_n and labels y_1, … , y_n, find the
  best-fit Gaussians.

• Once again, we want to fit the Gaussian that maximizes the likelihood of
  generating the sample points in a specified class.

• This time we won’t derive the maximum-likelihood Gaussian.
Maximum Likelihood Estimation for
Anisotropic Gaussians
• For QDA:

  Σ̂_C = (1/n_C) ∑_{i: y_i = C} (X_i − μ̂_C)(X_i − μ̂_C)^T  ⇐ conditional covariance

  (each (X_i − μ̂_C)(X_i − μ̂_C)^T is an outer product matrix, d × d)

  where n_C is the number of points in class C.

• Prior π̂_C and mean μ̂_C: same as before. X_i is a column vector.
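For concreteness, a minimal NumPy sketch of these per-class estimates; the function name and interface are illustrative, not from the slides.

```python
import numpy as np

def qda_mle(X, y):
    """Per-class MLE estimates for QDA: prior, mean, conditional covariance.

    X: (n, d) design matrix, y: (n,) class labels. Illustrative sketch only.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                        # points of class c
        n_c = len(Xc)
        pi_c = n_c / n                        # prior
        mu_c = Xc.mean(axis=0)                # class mean
        centered = Xc - mu_c
        Sigma_c = centered.T @ centered / n_c # conditional covariance (divide by n_c)
        params[c] = (pi_c, mu_c, Sigma_c)
    return params
```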
Maximum Likelihood Estimation for
Anisotropic Gaussians

[Figure: maximum likelihood estimation takes these sample points and outputs this Gaussian.]
Maximum Likelihood Estimation for
Anisotropic Gaussians
• Σ̂_C is positive semidefinite, but not always positive definite (it is possible
  to get a singular matrix)!

• If there are any zero eigenvalues, the standard version of QDA just doesn’t
  work.

• We can try to fix it by eliminating the zero-variance dimensions
  (eigenvectors).
Maximum Likelihood Estimation for
Anisotropic Gaussians
• X = [ 2  4 ]
      [ 3  5 ]
      [ 1  3 ]

• Σ̂_C = ?
Maximum Likelihood Estimation for
Anisotropic Gaussians
• The following data belong to class C:

• X_C = [ 2  4 ]
        [ 3  5 ]
        [ 1  3 ]

• μ_C = [ 2  4 ]

• X_mean−centered = [  0   0 ]
                    [  1   1 ]
                    [ −1  −1 ]
Maximum Likelihood Estimation for
Anisotropic Gaussians
• X_mean−centered^T = [ 0  1  −1 ]
                      [ 0  1  −1 ]

• Σ̂_C = (1/3) ( x_{:,1} x_{:,1}^T + x_{:,2} x_{:,2}^T + x_{:,3} x_{:,3}^T )

• Σ̂_C = [ 0.6667  0.6667 ]
        [ 0.6667  0.6667 ]
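The same computation in NumPy, as a quick check of the arithmetic above:

```python
import numpy as np

# The class-C sample points from the example (rows are points)
X_C = np.array([[2.0, 4.0],
                [3.0, 5.0],
                [1.0, 3.0]])

mu_C = X_C.mean(axis=0)          # [2, 4]
centered = X_C - mu_C            # [[0, 0], [1, 1], [-1, -1]]
Sigma_C = centered.T @ centered / len(X_C)

print(Sigma_C)                   # [[0.6667, 0.6667], [0.6667, 0.6667]]
```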
Maximum Likelihood Estimation for
Anisotropic Gaussians
• X = [ 1  4 ]
      [ 1  5 ]
      [ 1  3 ]

• Σ̂_C = ?
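Here the first feature never varies within the class, so Σ̂_C = [[0, 0], [0, 0.6667]], which is singular; this is exactly the zero-eigenvalue situation flagged earlier. A quick NumPy check:

```python
import numpy as np

X_C = np.array([[1.0, 4.0],
                [1.0, 5.0],
                [1.0, 3.0]])

centered = X_C - X_C.mean(axis=0)
Sigma_C = centered.T @ centered / len(X_C)

print(Sigma_C)                      # [[0, 0], [0, 0.6667]]
print(np.linalg.eigvalsh(Sigma_C))  # one eigenvalue is 0: Sigma_C is singular
```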
Maximum Likelihood Estimation for
Anisotropic Gaussians
• For LDA:

  Σ̂ = (1/n) ∑_C ∑_{i: y_i = C} (X_i − μ̂_C)(X_i − μ̂_C)^T  ⇐ pooled within-class covariance matrix
17
Maximum Likelihood Estimation for
Anisotropic Gaussians
• For LDA:

1
^
•Σ = σ𝐶 σ𝑖:𝑦𝑖 =𝐶(𝑋𝑖 − 𝜇𝐶^ ) (𝑋𝑖 − 𝜇𝐶^ )𝑇 ⇐ pooled within-class covariance matrix
𝑛

KASHIF JAVED
EED, UET, Lahore

18
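For illustration, a minimal NumPy sketch of the pooled within-class covariance; the function name and structure are ours, not from the slides.

```python
import numpy as np

def lda_pooled_covariance(X, y):
    """Pooled within-class covariance for LDA (illustrative sketch).

    Each class is centered on its own mean, but all classes share one covariance.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    Sigma = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)     # subtract the class mean, not the global mean
        Sigma += centered.T @ centered      # accumulate outer products
    return Sigma / n                        # divide by the total number of points
```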
QDA and LDA
• Let’s revisit QDA and LDA and see what has changed now that we know about
  anisotropic Gaussians.

• The short answer is “not much has changed, but the graphs look cooler.”

• By the way, capital X once again represents a random variable.
Quadratic Discriminant Analysis (QDA)
• Choosing the class C that maximizes f(X = x | Y = C) π_C is equivalent to
  maximizing the quadratic discriminant fn

  Q_C(x) = ln( (√(2π))^d f_C(x) π_C )
         = −(1/2) (x − μ_C)^T Σ_C^−1 (x − μ_C) − (1/2) ln |Σ_C| + ln π_C

• This works for any number of classes. In a multi-class problem, you just
  pick the class with the greatest quadratic discriminant for x.
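A minimal sketch of the quadratic discriminant and the resulting multi-class rule, assuming per-class parameters like those returned by the qda_mle sketch earlier; names are illustrative.

```python
import numpy as np

def qda_discriminant(x, pi_c, mu_c, Sigma_c):
    """Quadratic discriminant Q_C(x); pick the class with the largest value."""
    diff = x - mu_c
    quad = diff @ np.linalg.solve(Sigma_c, diff)   # (x - mu_C)^T Sigma_C^{-1} (x - mu_C)
    _, logdet = np.linalg.slogdet(Sigma_c)         # ln |Sigma_C|, computed stably
    return -0.5 * quad - 0.5 * logdet + np.log(pi_c)

def qda_predict(x, params):
    """params: {class: (pi_c, mu_c, Sigma_c)}, e.g. from the qda_mle sketch above."""
    return max(params, key=lambda c: qda_discriminant(x, *params[c]))
```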
Quadratic Discriminant Analysis (QDA)
• 2 classes: the decision fn Q_C(x) − Q_D(x) is quadratic, but may be indefinite
  ⇒ the Bayes decision boundary is a quadric

• Posterior is P(Y = C | X = x) = s(Q_C(x) − Q_D(x)), where s(·) is the logistic fn
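Continuing the sketch above, the two-class posterior is just the logistic function applied to the difference of discriminants:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def qda_posterior_2class(x, params_C, params_D):
    """P(Y=C | X=x) = s(Q_C(x) - Q_D(x)); reuses qda_discriminant from the sketch above."""
    return logistic(qda_discriminant(x, *params_C) - qda_discriminant(x, *params_D))
```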
Quadratic Discriminant Analysis (QDA)

[Figure: the decision boundary is a hyperbola. At left, two anisotropic Gaussians. Center left, the
difference Q_C − Q_D. After applying the logistic function to this difference, we obtain the posterior
probabilities at right, which tell us the probability our prediction is correct.]
Quadratic Discriminant Analysis (QDA)

Observe that we can see the decision boundary in both contour plots: it is Q_C − Q_D = 0 and
s(Q_C − Q_D) = 0.5. We don’t need to apply the logistic function to find the decision boundary,
but we do need to apply it if we want the posterior probabilities.
Linear Discriminant Analysis
• One Σ̂ for all classes.

• Once again, the quadratic terms cancel each other out, so the decision
  function is linear and the decision boundary is a hyperplane:

  Q_C(x) − Q_D(x) = (μ_C − μ_D)^T Σ^−1 x − (μ_C^T Σ^−1 μ_C − μ_D^T Σ^−1 μ_D)/2 + ln π_C − ln π_D

                  =  w^T x + α
Linear Discriminant Analysis
• Choose the class C that maximizes the linear discriminant fn

  μ_C^T Σ^−1 x − (μ_C^T Σ^−1 μ_C)/2 + ln π_C

• This works for any number of classes.
Linear Discriminant Analysis
• Choose the class C that maximizes the linear discriminant fn

  μ_C^T Σ^−1 x − (μ_C^T Σ^−1 μ_C)/2 + ln π_C

• 2-class case: the decision boundary is w^T x + α = 0,
  and the posterior is P(Y = C | X = x) = s(w^T x + α)

• We use a linear solver to efficiently compute μ_C^T Σ^−1 just once, so the
  classifier can evaluate test points quickly.
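A minimal sketch of this precomputation, reusing the lda_pooled_covariance sketch from earlier; the names and structure are ours.

```python
import numpy as np

def lda_fit(X, y):
    """Precompute the linear discriminant (w_C, alpha_C) for each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    Sigma = lda_pooled_covariance(X, y)     # pooled covariance from the earlier sketch
    discriminants = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        w_c = np.linalg.solve(Sigma, mu_c)  # Sigma^{-1} mu_C via a linear solver, once per class
        alpha_c = -0.5 * mu_c @ w_c + np.log(len(Xc) / n)
        discriminants[c] = (w_c, alpha_c)
    return discriminants

def lda_predict(x, discriminants):
    # Evaluating a test point is just one dot product per class
    return max(discriminants, key=lambda c: discriminants[c][0] @ x + discriminants[c][1])
```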
Linear Discriminant Analysis

[Figure: in LDA, the decision boundary is always a hyperplane. Note that Mathematica messed up the
top left plot a bit; there should be no red in the left corner, nor blue in the right corner.]
ESL, Figure 4.11

An example of LDA with messy data. The points are not sampled from perfect
Gaussians, but LDA still works reasonably well.
Linear Discriminant Analysis
• LDA is often interpreted as projecting the points onto the normal vector w
  and cutting the projected line in half at the decision boundary.

[Figure: points projected onto the normal vector w; the decision boundary is the hyperplane orthogonal to w.]
QDA vs. LDA
• For 2 classes,

  ▪ LDA has d + 1 parameters (w, α);

  ▪ QDA has d(d + 3)/2 + 1 parameters;

  ▪ QDA is more likely to overfit; the danger is much bigger when the
    dimension d is large. For example, with d = 100, LDA fits 101 parameters
    while QDA fits 5,151.
ISL, Figure 4.9

Synthetically generated data. In these examples, the Bayes optimal decision
boundary is purple (and dashed), the QDA decision boundary is green, and the
LDA decision boundary is black (and dotted).
ISL, Figure 4.9

When the optimal boundary is linear, as at left, LDA gives a more stable fit,
whereas QDA may overfit. When the optimal boundary is curved, as at right,
QDA often gives a better fit.
Remarks
• With added features, LDA can give nonlinear boundaries, and QDA can give
  nonquadratic ones.

• We don’t get the true optimal Bayes classifier:

  ▪ we estimate the distributions from finite data
  ▪ real-world data are not perfectly Gaussian
Remarks
• LDA & QDA are the best methods in practice for many applications.

• In the STATLOG project, either LDA or QDA were among the top three
  classifiers for 10 out of 22 datasets; but that’s not because all those
  datasets are Gaussian.

• LDA & QDA work well when the data can only support simple decision
  boundaries such as linear or quadratic, because Gaussian models provide
  stable estimates.
Some Terms
• Let X be the n × d design matrix of sample points.

• Each row i of X is a sample point X_i^T.

• X_i is a column vector to match the standard convention for multivariate
  distributions like the Gaussian, but X_i^T is a row of X.
Some Terms
• X = [ 4  1 ]
      [ 2  3 ]
      [ 5  4 ]
      [ 1  0 ]
Some Terms
• Centering X: subtract μ^T from each row of X, giving Ẋ.

• μ^T is the mean of all the rows of X.

• Now the mean of all the rows of Ẋ is zero.
Some Terms
• X = [ 4  1 ]
      [ 2  3 ]
      [ 5  4 ]
      [ 1  0 ]

• μ = [ 3  2 ]

• Ẋ = [  1  −1 ]
      [ −1   1 ]
      [  2   2 ]
      [ −2  −2 ]
Some Terms
• The sample covariance matrix for R is

  Var(R) = (1/n) Ẋ^T Ẋ

• This is the simplest way to remember how to compute a covariance matrix
  for QDA.

• Imagine you have a design matrix X_C that contains only the sample points
  of class C; then you have

  Σ̂_C = (1/n_C) Ẋ_C^T Ẋ_C
Some Terms
• Var(R) = [ 2.5  1.5 ]
           [ 1.5  2.5 ]
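The centering and covariance computation for this example, as a quick NumPy check:

```python
import numpy as np

X = np.array([[4.0, 1.0],
              [2.0, 3.0],
              [5.0, 4.0],
              [1.0, 0.0]])

X_dot = X - X.mean(axis=0)          # centering: subtract mu^T from each row
Var_R = X_dot.T @ X_dot / len(X)    # sample covariance matrix

print(Var_R)                        # [[2.5, 1.5], [1.5, 2.5]]
```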
Some Terms
• When we have points from an anisotropic Gaussian distribution, it is
  sometimes useful to perform a linear transformation that maps them to an
  axis-aligned distribution, or maybe even to an isotropic distribution.

• Decorrelating Ẋ: apply the rotation Z = ẊV, where Var(R) = VΛV^T.

• This rotates the sample points into the eigenvector coordinate system.
Some Terms
• Var(R) = [ −1/√2  1/√2 ] [ 1  0 ] [ −1/√2  1/√2 ]^T
           [  1/√2  1/√2 ] [ 0  4 ] [  1/√2  1/√2 ]

• Z = ẊV = [ −√2    0  ]
           [  √2    0  ]
           [   0   2√2 ]
           [   0  −2√2 ]
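A NumPy sketch of decorrelation for the same example; note that eigenvector signs, and hence the signs of Z's columns, may differ from the slide.

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
X_dot = X - X.mean(axis=0)                 # centered design matrix
Var_R = X_dot.T @ X_dot / len(X)           # [[2.5, 1.5], [1.5, 2.5]]

lam, V = np.linalg.eigh(Var_R)             # Var(R) = V diag(lam) V^T, lam = [1, 4]
Z = X_dot @ V                              # decorrelate: rotate into eigenvector coordinates
                                           # (eigenvector signs are arbitrary up to a flip)
print(np.round(Z, 4))
print(np.round(Z.T @ Z / len(Z), 4))       # Var(Z) = Lambda = diag(1, 4)
```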
Some Terms
• Then Var(Z) = Λ.

• Z has a diagonal covariance matrix. If X_i ~ 𝒩(μ, Σ), then approximately,
  Z_i ~ 𝒩(0, Λ).

• Proof:

  Var(Z) = (1/n) Z^T Z = (1/n) V^T Ẋ^T Ẋ V = V^T Var(R) V = V^T (VΛV^T) V = Λ
Some Terms
• Var(Z) = [ 1  0 ]
           [ 0  4 ]
Some Terms
• Sphering Ẋ: apply the transform W = Ẋ Var(R)^−1/2.

• Recall that Σ^−1/2 maps ellipsoids to spheres.

• Ẋ = [  1  −1 ]        W = [  1  −1 ]
      [ −1   1 ]            [ −1   1 ]
      [  2   2 ]            [  1   1 ]
      [ −2  −2 ]            [ −1  −1 ]
Some Terms
• Whitening X: centering + sphering, X → W.

• Then W has covariance matrix I.

• If X_i ~ 𝒩(μ, Σ), then approximately, W_i ~ 𝒩(0, I).

• cov(W) = [ 1  0 ]
           [ 0  1 ]
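A NumPy sketch of sphering and whitening for the same example, checking that cov(W) = I:

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
X_dot = X - X.mean(axis=0)                        # centering
Var_R = X_dot.T @ X_dot / len(X)

lam, V = np.linalg.eigh(Var_R)
Var_R_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T   # symmetric inverse square root of Var(R)

W = X_dot @ Var_R_inv_sqrt                        # sphering the centered points = whitening X
print(np.round(W, 4))                             # [[1, -1], [-1, 1], [1, 1], [-1, -1]]
print(np.round(W.T @ W / len(W), 4))              # identity: cov(W) = I
```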
Some Terms
• Whitening input data is often used with other machine learning algorithms,
  like SVMs and neural networks.

• The idea is that some features may be much bigger than others, for instance
  because they’re measured in different units.

• SVMs penalize violations by large features more heavily than they penalize
  small features.
Some Terms
• Whitening the data before you run an SVM puts the features on an equal
  basis.

• One nice thing about discriminant analysis is that whitening is built in.
