Lecture 9
Anisotropic Gaussians and GDA
Readings:
▪ https://people.eecs.berkeley.edu/~jrs/189/
Anisotropic Normal Distribution
• Recall from our last lecture the probability density function of the
multivariate normal distribution in its full generality. 𝑥 and µ are 𝑑-vectors
• Normal PDF: $f(x) = n(q(x))$
• $n(q) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}} \exp(-q/2), \qquad q(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$
• $n : \mathbb{R} \to \mathbb{R}$ is the exponential part; $q : \mathbb{R}^d \to \mathbb{R}$ is the quadratic part.
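A minimal numpy sketch (not from the slides; the function and the example mean/covariance are illustrative assumptions) of evaluating $f(x) = n(q(x))$:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Anisotropic normal PDF f(x) = n(q(x)) for a d-vector x."""
    d = len(mu)
    diff = x - mu
    q = diff @ np.linalg.solve(Sigma, diff)       # q(x) = (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-q / 2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Example with made-up parameters:
x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf(x, mu, Sigma))
```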
Anisotropic Normal Distribution
• The covariance matrix Σ and its symmetric square root and its inverse all
play roles in our intuition about the multivariate normal distribution.
• Consider their eigendecompositions.
• $\Sigma = V \Gamma V^T$, where the eigenvalues of $\Sigma$ are the variances along the eigenvectors: $\Gamma_{ii} = \sigma_i^2$
Anisotropic Normal Distribution
• $\Sigma^{1/2} = V \Gamma^{1/2} V^T$ maps spheres to ellipsoids; the eigenvalues of $\Sigma^{1/2}$ are the Gaussian widths / ellipsoid radii / standard deviations: $(\Gamma^{1/2})_{ii} = \sigma_i$
[Figure: an ellipsoidal isocontour of the Gaussian, with radii $\sigma_1$ and $\sigma_2$ along the eigenvector directions.]
Anisotropic Normal Distribution
• $\Sigma^{-1} = V \Gamma^{-1} V^T$ is the precision matrix (metric tensor); the quadratic form of $\Sigma^{-1}$ defines the contours.
[Figure: the isocontours of the multivariate normal distribution are the same as the isocontours of the quadratic form of the precision matrix $\Sigma^{-1}$.]
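The eigendecomposition view above is easy to check numerically. A hedged sketch (the covariance matrix here is made up for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])           # example covariance (assumption)

# Sigma = V Gamma V^T; the eigenvalues are the variances sigma_i^2
gamma, V = np.linalg.eigh(Sigma)

# Sigma^{1/2} = V Gamma^{1/2} V^T maps spheres to ellipsoids;
# its eigenvalues are the ellipsoid radii / standard deviations sigma_i
Sigma_sqrt = V @ np.diag(np.sqrt(gamma)) @ V.T

# Sigma^{-1} = V Gamma^{-1} V^T is the precision matrix, whose quadratic
# form defines the isocontours
Sigma_inv = V @ np.diag(1.0 / gamma) @ V.T

print(np.allclose(Sigma_sqrt @ Sigma_sqrt, Sigma))    # True
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))   # True
```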
Maximum Likelihood Estimation for
Anisotropic Gaussians
• Given sample points $X_1, X_2, \ldots, X_n$ and labels $y_1, \ldots, y_n$, find the best-fit Gaussians.
• Once again, we want to fit the Gaussian that maximizes the likelihood of
generating the sample points in a specified class.
• This time we won’t derive the maximum-likelihood Gaussian
Maximum Likelihood Estimation for
Anisotropic Gaussians
• For QDA:
$\hat{\Sigma}_C = \frac{1}{n_C} \sum_{i : y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^T$ ⇐ conditional covariance
Each term $(X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^T$ is an outer product matrix, $d \times d$; $n_C$ is the number of points in class $C$.
The prior $\hat{\pi}_C$ and the mean $\hat{\mu}_C$ are the same as before. $X_i$ is a column vector.
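As a hedged sketch (the function name and the variables X, y are illustrative assumptions, not part of the slides), the per-class QDA estimates can be computed like this:

```python
import numpy as np

def qda_estimates(X, y):
    """Per-class prior, mean, and conditional covariance (QDA maximum likelihood)."""
    n = len(y)
    params = {}
    for C in np.unique(y):
        Xc = X[y == C]                              # rows are the X_i^T with y_i = C
        n_C = len(Xc)
        mu_C = Xc.mean(axis=0)
        centered = Xc - mu_C
        Sigma_C = centered.T @ centered / n_C       # (1/n_C) sum of outer products, d x d
        pi_C = n_C / n                              # prior, estimated as before
        params[C] = (pi_C, mu_C, Sigma_C)
    return params
```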
Maximum Likelihood Estimation for
Anisotropic Gaussians
[Figure: maximum likelihood estimation takes these points and outputs this Gaussian.]
Maximum Likelihood Estimation for
Anisotropic Gaussians
• $\hat{\Sigma}_C$ is positive semidefinite, but not always positive definite (it is possible to have a singular matrix)!
• If there are some zero eigenvalues, the standard version of QDA just doesn't work.
• We can try to fix it by eliminating the zero-variance dimensions (eigenvectors).
Maximum Likelihood Estimation for
Anisotropic Gaussians
• $X = \begin{bmatrix} 2 & 4 \\ 3 & 5 \\ 1 & 3 \end{bmatrix}$
• $\hat{\Sigma}_C = ?$
Maximum Likelihood Estimation for
Anisotropic Gaussians
• The following data belong to class $C$:
• $X_C = \begin{bmatrix} 2 & 4 \\ 3 & 5 \\ 1 & 3 \end{bmatrix}$
• $\mu_C = \begin{bmatrix} 2 & 4 \end{bmatrix}$
• $X_{\text{mean-centered}} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ -1 & -1 \end{bmatrix}$
Maximum Likelihood Estimation for
Anisotropic Gaussians
• $X_{\text{mean-centered}}^T = \begin{bmatrix} 0 & 1 & -1 \\ 0 & 1 & -1 \end{bmatrix}$
• $\hat{\Sigma}_C = \frac{1}{3}\left( x_{:,1}\, x_{:,1}^T + x_{:,2}\, x_{:,2}^T + x_{:,3}\, x_{:,3}^T \right)$, where $x_{:,i}$ is the $i$-th column of $X_{\text{mean-centered}}^T$
• $\hat{\Sigma}_C = \begin{bmatrix} 0.6667 & 0.6667 \\ 0.6667 & 0.6667 \end{bmatrix}$
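A quick numpy check of the worked example above (a verification sketch, not part of the slides):

```python
import numpy as np

X_C = np.array([[2.0, 4.0], [3.0, 5.0], [1.0, 3.0]])
mu_C = X_C.mean(axis=0)                     # [2, 4]
centered = X_C - mu_C                       # rows: [0,0], [1,1], [-1,-1]
Sigma_C = centered.T @ centered / len(X_C)
print(Sigma_C)                              # [[0.6667, 0.6667], [0.6667, 0.6667]]
```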
Maximum Likelihood Estimation for
Anisotropic Gaussians
• $X = \begin{bmatrix} 1 & 4 \\ 1 & 5 \\ 1 & 3 \end{bmatrix}$
• $\hat{\Sigma}_C = ?$
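The same check for this second example (a sketch): here the first feature is constant across the class, so one eigenvalue of $\hat{\Sigma}_C$ is zero and the matrix is singular, which is exactly the case where standard QDA breaks down.

```python
import numpy as np

X_C = np.array([[1.0, 4.0], [1.0, 5.0], [1.0, 3.0]])
centered = X_C - X_C.mean(axis=0)           # first column is all zeros
Sigma_C = centered.T @ centered / len(X_C)
print(Sigma_C)                              # [[0, 0], [0, 0.6667]]
print(np.linalg.eigvalsh(Sigma_C))          # [0, 0.6667]: singular covariance
```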
Maximum Likelihood Estimation for
Anisotropic Gaussians
• For LDA:
• $\hat{\Sigma} = \frac{1}{n} \sum_{C} \sum_{i : y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^T$ ⇐ pooled within-class covariance matrix
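A hedged sketch of the pooled estimate (again, the names are illustrative): each class is centered at its own mean, but a single covariance is averaged over all $n$ points.

```python
import numpy as np

def lda_pooled_covariance(X, y):
    """Pooled within-class covariance: (1/n) * sum over classes of class scatter."""
    n, d = X.shape
    Sigma = np.zeros((d, d))
    for C in np.unique(y):
        Xc = X[y == C]
        centered = Xc - Xc.mean(axis=0)     # center each class at its own mean
        Sigma += centered.T @ centered      # within-class sum of outer products
    return Sigma / n                        # divide by the total number of points
```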
QDA and LDA
• Let's revisit QDA and LDA and see what has changed now that we know about anisotropic Gaussians.
• The short answer is “not much has changed, but the graphs look cooler.”
• By the way, capital 𝑋 once again represents a random variable.
Quadratic Discriminant Analysis (QDA)
• Choosing C that maximizes 𝑓(𝑋 = 𝑥|𝑌 = 𝐶) 𝜋𝐶 is equivalent to
maximizing the quadratic discriminant fn
$Q_C(x) = \ln\left( (\sqrt{2\pi})^d f_C(x)\, \pi_C \right) = -\frac{1}{2}(x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C) - \frac{1}{2} \ln |\Sigma_C| + \ln \pi_C$
• This works for any number of classes. In a multi-class problem, you just
pick the class with the greatest quadratic discriminant for 𝑥
Quadratic Discriminant Analysis (QDA)
• 2 classes: the decision function $Q_C(x) - Q_D(x)$ is quadratic, but may be indefinite ⇒ the Bayes decision boundary is a quadric.
• The posterior is $P(Y = C \mid X = x) = s(Q_C(x) - Q_D(x))$, where $s(\cdot)$ is the logistic function.
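Putting the last two slides together, a hedged sketch (all names are illustrative; `params` is assumed to map a class label to $(\hat{\pi}_C, \hat{\mu}_C, \hat{\Sigma}_C)$, e.g. as produced by the `qda_estimates` sketch above):

```python
import numpy as np

def quadratic_discriminant(x, pi_C, mu_C, Sigma_C):
    """Q_C(x) = -1/2 (x-mu_C)^T Sigma_C^{-1} (x-mu_C) - 1/2 ln|Sigma_C| + ln pi_C."""
    diff = x - mu_C
    quad = diff @ np.linalg.solve(Sigma_C, diff)
    _, logdet = np.linalg.slogdet(Sigma_C)
    return -0.5 * quad - 0.5 * logdet + np.log(pi_C)

def qda_predict(x, params):
    """Pick the class with the greatest quadratic discriminant."""
    return max(params, key=lambda C: quadratic_discriminant(x, *params[C]))

def qda_posterior(x, params, C, D):
    """2-class posterior P(Y=C | X=x) = s(Q_C(x) - Q_D(x)), s = logistic."""
    t = quadratic_discriminant(x, *params[C]) - quadratic_discriminant(x, *params[D])
    return 1.0 / (1.0 + np.exp(-t))
```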
Quadratic Discriminant Analysis (QDA)
[Figure: The decision boundary is a hyperbola. At left, two anisotropic Gaussians. Center left, the difference $Q_C - Q_D$. After applying the logistic function to this difference we obtain the posterior probabilities at right, which tell us the probability our prediction is correct.]
Quadratic Discriminant Analysis (QDA)
[Figure: Observe that we can see the decision boundary in both contour plots: it is $Q_C - Q_D = 0$ and $s(Q_C - Q_D) = 0.5$. We don't need to apply the logistic function to find the decision boundary, but we do need to apply it if we want the posterior probabilities.]
Linear Discriminant Analysis
• One $\hat{\Sigma}$ for all classes.
• Once again, the quadratic terms cancel each other out, so the decision function is linear and the decision boundary is a hyperplane:
$Q_C(x) - Q_D(x) = \underbrace{(\mu_C - \mu_D)^T \Sigma^{-1} x - \frac{\mu_C^T \Sigma^{-1} \mu_C - \mu_D^T \Sigma^{-1} \mu_D}{2} + \ln \pi_C - \ln \pi_D}_{w^T x + \alpha}$
Linear Discriminant Analysis
• Choose the class $C$ that maximizes the linear discriminant function
$\mu_C^T \Sigma^{-1} x - \frac{\mu_C^T \Sigma^{-1} \mu_C}{2} + \ln \pi_C$
• this works for any number of classes
Linear Discriminant Analysis
• Choose the class $C$ that maximizes the linear discriminant function
$\mu_C^T \Sigma^{-1} x - \frac{\mu_C^T \Sigma^{-1} \mu_C}{2} + \ln \pi_C$
• 2-class case: the decision boundary is $w^T x + \alpha = 0$; the posterior is $P(Y = C \mid X = x) = s(w^T x + \alpha)$
• We use a linear solver to efficiently compute $\mu_C^T \Sigma^{-1}$ just once, so the classifier can evaluate test points quickly.
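A hedged sketch of this classifier (names are illustrative; `params` is assumed to map a class label to $(\hat{\pi}_C, \hat{\mu}_C)$). Solving $\Sigma w_C = \mu_C$ once per class gives $w_C = \Sigma^{-1}\mu_C$, so scoring a test point costs only a dot product:

```python
import numpy as np

def lda_weights(params, Sigma):
    """params: class label -> (pi_C, mu_C); Sigma: pooled within-class covariance.
    Returns per-class (w_C, alpha_C) for the linear discriminant w_C^T x + alpha_C."""
    weights = {}
    for C, (pi_C, mu_C) in params.items():
        w_C = np.linalg.solve(Sigma, mu_C)              # Sigma^{-1} mu_C, computed once
        alpha_C = -0.5 * mu_C @ w_C + np.log(pi_C)      # -mu_C^T Sigma^{-1} mu_C / 2 + ln pi_C
        weights[C] = (w_C, alpha_C)
    return weights

def lda_predict(x, weights):
    """Choose the class with the greatest linear discriminant value."""
    return max(weights, key=lambda C: weights[C][0] @ x + weights[C][1])
```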
Linear Discriminant Analysis
[Figure: In LDA, the decision boundary is always a hyperplane. Note that Mathematica messed up the top left plot a bit; there should be no red in the left corner, nor blue in the right corner.]
ESL, Figure 4.11
[Figure: An example of LDA with messy data. The points are not sampled from perfect Gaussians, but LDA still works reasonably well.]
Linear Discriminant Analysis
• LDA is often interpreted as projecting the points onto the normal vector $w$ and cutting the projected line in half at the decision boundary.
[Figure: the normal vector $w$ and the decision boundary.]
QDA vs. LDA
• For 2 classes,
▪ LDA has $d + 1$ parameters ($w$, $\alpha$);
▪ QDA has $\frac{d(d+3)}{2} + 1$ parameters;
▪ QDA is more likely to overfit; the danger is much bigger when the dimension $d$ is large.
ISL, Figure 4.9
[Figure: Synthetically generated data. In these examples, the Bayes optimal decision boundary is purple (and dashed), the QDA decision boundary is green, and the LDA decision boundary is black (and dotted).]
ISL, Figure 4.9
[Figure: When the optimal boundary is linear, as at left, LDA gives a more stable fit whereas QDA may overfit. When the optimal boundary is curved, as at right, QDA often gives you a better fit.]
Remarks
• With added features, LDA can give nonlinear boundaries; QDA
nonquadratic.
• We don't get the true optimum Bayes classifier:
▪ we estimate the distributions from finite data
▪ real-world data is not perfectly Gaussian
Remarks
• LDA & QDA are the best methods in practice for many applications.
• In the STATLOG project, either LDA or QDA was among the top three classifiers for 10 out of 22 datasets, but that's not because all those datasets are Gaussian.
• LDA & QDA work well when the data can only support simple decision
boundaries such as linear or quadratic, because Gaussian models provide
stable estimates.
Some Terms
• Let $X$ be the $n \times d$ design matrix of sample points.
• Each row $i$ of $X$ is a sample point $X_i^T$.
• $X_i$ is a column vector to match the standard convention for multivariate distributions like the Gaussian, but $X_i^T$ is a row of $X$.
Some Terms
• $X = \begin{bmatrix} 4 & 1 \\ 2 & 3 \\ 5 & 4 \\ 1 & 0 \end{bmatrix}$
Some Terms
• Centering $X$: subtract $\mu^T$ from each row of $X$; $X \to \dot{X}$.
• $\mu^T$ is the mean of all the rows of $X$.
• Now the mean of all the rows of $\dot{X}$ is zero.
Some Terms
• $X = \begin{bmatrix} 4 & 1 \\ 2 & 3 \\ 5 & 4 \\ 1 & 0 \end{bmatrix}$
• $\mu = \begin{bmatrix} 3 & 2 \end{bmatrix}$
• $\dot{X} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \\ 2 & 2 \\ -2 & -2 \end{bmatrix}$
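A quick check of the centering example (verification sketch):

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
mu = X.mean(axis=0)            # [3, 2]
X_dot = X - mu                 # subtract mu^T from each row
print(X_dot)                   # rows: [1,-1], [-1,1], [2,2], [-2,-2]
print(X_dot.mean(axis=0))      # [0, 0]
```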
Some Terms
• The sample covariance matrix for $R$ is
$\mathrm{Var}(R) = \frac{1}{n} \dot{X}^T \dot{X}$
• This is the simplest way to remember how to compute a covariance matrix for QDA.
• Imagine you have a design matrix $X_C$ that contains only the sample points of class $C$; then you have
$\hat{\Sigma}_C = \frac{1}{n_C} \dot{X}_C^T \dot{X}_C$
Some Terms
• $\mathrm{Var}(R) = \begin{bmatrix} 2.5 & 1.5 \\ 1.5 & 2.5 \end{bmatrix}$
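Continuing the same example (a verification sketch):

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
X_dot = X - X.mean(axis=0)
Var_R = X_dot.T @ X_dot / len(X)    # (1/n) X_dot^T X_dot
print(Var_R)                        # [[2.5, 1.5], [1.5, 2.5]]
```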
Some Terms
• When we have points from an anisotropic Gaussian distribution,
sometimes it’s useful to perform a linear transformation that maps them to
an axis-aligned distribution, or maybe even to an isotropic distribution.
• Decorrelating $\dot{X}$: apply the rotation $Z = \dot{X} V$, where $\mathrm{Var}(R) = V \Lambda V^T$.
• This rotates the sample points into the eigenvector coordinate system.
Some Terms
• $\mathrm{Var}(R) = \begin{bmatrix} -1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \begin{bmatrix} -1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix}^T$
• $Z = \dot{X} V = \begin{bmatrix} -\sqrt{2} & 0 \\ \sqrt{2} & 0 \\ 0 & 2\sqrt{2} \\ 0 & -2\sqrt{2} \end{bmatrix}$
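Checking the decorrelation step with numpy (a sketch; note that numpy may return eigenvectors with flipped signs, which only flips the signs of the corresponding columns of $Z$):

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
X_dot = X - X.mean(axis=0)
Var_R = X_dot.T @ X_dot / len(X)
lam, V = np.linalg.eigh(Var_R)      # Lambda = diag(1, 4); eigenvectors in the columns of V
Z = X_dot @ V                       # rotate into the eigenvector coordinate system
print(Z)                            # rows approx. [-sqrt(2),0], [sqrt(2),0], [0,2*sqrt(2)], [0,-2*sqrt(2)] (up to sign)
print(Z.T @ Z / len(X))             # Var(Z) = Lambda (diagonal)
```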
Some Terms
• Then $\mathrm{Var}(Z) = \Lambda$, i.e., $Z$ has a diagonal covariance. If $X_i \sim \mathcal{N}(\mu, \Sigma)$, then, approximately, $Z_i \sim \mathcal{N}(0, \Lambda)$.
• Proof: $\mathrm{Var}(Z) = \frac{1}{n} Z^T Z = \frac{1}{n} V^T \dot{X}^T \dot{X} V = V^T\, \mathrm{Var}(R)\, V = V^T V \Lambda V^T V = \Lambda$
Some Terms
• $\mathrm{Var}(Z) = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}$
Some Terms
• Sphering $\dot{X}$: apply the transform $W = \dot{X}\,\mathrm{Var}(R)^{-1/2}$.
• Recall that $\Sigma^{-1/2}$ maps ellipsoids to spheres.
• $\dot{X} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \\ 2 & 2 \\ -2 & -2 \end{bmatrix} \;\Rightarrow\; W = \begin{bmatrix} 1 & -1 \\ -1 & 1 \\ 1 & 1 \\ -1 & -1 \end{bmatrix}$
Some Terms
• Whitening $X$: centering + sphering, $X \to W$.
• Then $W$ has covariance matrix $I$.
• If $X_i \sim \mathcal{N}(\mu, \Sigma)$, then, approximately, $W_i \sim \mathcal{N}(0, I)$.
• For the example above, $\mathrm{cov}(W) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
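And the sphering/whitening step (a sketch), using the symmetric square root $\mathrm{Var}(R)^{-1/2} = V \Lambda^{-1/2} V^T$:

```python
import numpy as np

X = np.array([[4.0, 1.0], [2.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
X_dot = X - X.mean(axis=0)                        # centering
Var_R = X_dot.T @ X_dot / len(X)
lam, V = np.linalg.eigh(Var_R)
inv_sqrt = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T  # Var(R)^{-1/2}
W = X_dot @ inv_sqrt                              # sphering; centering + sphering = whitening
print(W)                                          # rows: [1,-1], [-1,1], [1,1], [-1,-1]
print(W.T @ W / len(X))                           # identity covariance
```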
Some Terms
• Whitening input data is often used with other machine learning algorithms,
like SVMs and neural networks.
• The idea is that some features may be much bigger than others—for
instance, because they’re measured in different units
• SVMs penalize violations by large features more heavily than they
penalize small features.
Some Terms
• Whitening the data before you run an SVM puts the features on an equal
basis
• One nice thing about discriminant analysis is that whitening is built in.