Machine Learning CS 4641
Gaussian Mixture Model
Nakul Gopalan
Georgia Tech
Some of the slides are based on slides from Jiawei Han, Chao Zhang, Mahdi Roozbahani, and Barnabás Póczos.
Outline
• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
Recap
Conditional probabilities:
$p(A, B) = p(A \mid B)\, p(B) = p(B \mid A)\, p(A)$
Bayes rule:
$p(A \mid B) = \dfrac{p(A, B)}{p(B)} = \dfrac{p(B \mid A)\, p(A)}{p(B)}$
$p(A = 1) = \sum_{i=1}^{K} p(A = 1, B_i) = \sum_{i=1}^{K} p(A = 1 \mid B_i)\, p(B_i)$
|               | Tomorrow = Rainy | Tomorrow = Cold | P(Today)        |
| Today = Rainy | 4/9              | 2/9             | 4/9 + 2/9 = 2/3 |
| Today = Cold  | 2/9              | 1/9             | 2/9 + 1/9 = 1/3 |
| P(Tomorrow)   | 4/9 + 2/9 = 2/3  | 2/9 + 1/9 = 1/3 |                 |
P(Tomorrow = Rainy) = P(Tomorrow = Rainy, Today = Rainy) + P(Tomorrow = Rainy, Today = Cold) = 4/9 + 2/9 = 2/3
Hard Clustering Can Be Difficult
• Hard Clustering: K-Means, Hierarchical Clustering, DBSCAN
Towards Soft Clustering
Outline
• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
Gaussian Distribution
1-d Gaussian
$\mathcal{N}(x \mid \mu, \sigma) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
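As a quick check of this density, a minimal sketch that evaluates it with NumPy (the helper name and test values are just for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma): exp(-(x-mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-4, 4, 9)
print(gaussian_pdf(x, mu=0.0, sigma=1.0))  # peaks at x = mu with value 1/sqrt(2*pi)
```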
Mixture Models
• Formally, a mixture model is a weighted sum of a number of pdfs, where the
weights are determined by a distribution $\pi$:
$p(x) = \pi_0 f_0(x) + \pi_1 f_1(x) + \cdots + \pi_K f_K(x)$, with $\pi_k \ge 0$ and $\sum_k \pi_k = 1$
What is f in GMM?
[Figure: three component densities $f_0(x)$, $f_1(x)$, $f_2(x)$, each a pdf over $x$, combined with mixture weights $\pi_0, \pi_1, \pi_2$ into a single mixture density.]
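A minimal sketch of such a mixture density with three Gaussian components (the weights, means, and standard deviations below are made-up values):

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.5, 0.2])    # pi_0, pi_1, pi_2 (sum to 1)
means   = np.array([-2.0, 0.0, 3.0])   # mu_k
stds    = np.array([0.5, 1.0, 0.8])    # sigma_k

def mixture_pdf(x):
    """p(x) = sum_k pi_k * f_k(x), with f_k a Gaussian density."""
    return sum(w * norm.pdf(x, loc=m, scale=s) for w, m, s in zip(weights, means, stds))

x = np.linspace(-6, 7, 2000)
print(np.trapz(mixture_pdf(x), x))  # close to 1: the mixture is itself a pdf
```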
Why is 𝑝(𝑥) a pdf?
Why GMM?
It defines a new pdf from which we can generate random variables: it is
a generative model.
It clusters the data using one Gaussian distribution per component,
so it gives us the opportunity to infer cluster membership. Soft assignment!!
Some notes:
Is the sum of a bunch of Gaussian densities itself a Gaussian? In general, no:
a weighted sum of Gaussian pdfs can be multimodal.
p(x) is a probability density function; it is also called a marginal
distribution function.
p(x) is the density of selecting a data point from the pdf that is
created by the mixture model. Also, we know that the area under
a density function is equal to 1.
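To see why the mixture $p(x)$ is a valid pdf, note that each component density integrates to 1 and the weights sum to 1:

```latex
\int p(x)\,dx \;=\; \int \sum_{k} \pi_k f_k(x)\, dx
             \;=\; \sum_{k} \pi_k \int f_k(x)\, dx
             \;=\; \sum_{k} \pi_k \;=\; 1 ,
\qquad \text{since } \pi_k \ge 0 \text{ and } \sum_k \pi_k = 1 .
```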
Mixture Models are Generative
• Generative simply means dealing with the joint probability $p(x, z)$
$p(x) = \pi_0 f_0(x) + \pi_1 f_1(x) + \cdots + \pi_K f_K(x)$
Let's say $f(\cdot)$ is a Gaussian distribution:
$p(x) = \pi_0 \mathcal{N}(x \mid \mu_0, \sigma_0) + \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1) + \cdots + \pi_K \mathcal{N}(x \mid \mu_K, \sigma_K)$
$p(x) = \sum_k \mathcal{N}(x \mid \mu_k, \sigma_k)\, \pi_k$
$p(x) = \sum_k p(x \mid z_k)\, p(z_k)$, where $z_k$ denotes component $k$
$p(x) = \sum_k p(x, z_k)$
GMM with graphical model concept
[Graphical model: the mixing weights $\pi_k$ and component parameters $\theta = (\mu_k, \Sigma_k)$ generate the latent indicator $Z_n$, which generates the observation $X_n$; a plate repeats this over the $N$ data points.]
$z_{nk}$ is the latent variable, in a 1-of-K representation:
$p(z_n \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$
$p(x_n \mid z_n, \pi, \mu, \Sigma) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$
Given $z$, $\pi$, $\mu$, and $\Sigma$, this is the probability of $x$ in component $k$.
What is soft assignment?
[Figure: a mixture of three components with weights $\pi_0, \pi_1, \pi_2$ over $x$.]
What is the probability of a datapoint $x$ under each component?
How many components do we have here? 3
How many probability distributions? 3
What do the 3 probabilities for each datapoint sum to? 1
How to calculate the probability of datapoints in the
first component (inferring)?
$p(x) = \pi_0 \mathcal{N}(x \mid \mu_0, \sigma_0) + \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1) + \pi_2 \mathcal{N}(x \mid \mu_2, \sigma_2)$
Let's calculate the responsibility of the first component, relative to the rest, for one point x.
Let's call that $\tau_0$:
$\tau_0 = \dfrac{\pi_0\, \mathcal{N}(x \mid \mu_0, \sigma_0)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \sigma_0) + \pi_1\, \mathcal{N}(x \mid \mu_1, \sigma_1) + \pi_2\, \mathcal{N}(x \mid \mu_2, \sigma_2)}$
$\tau_0 = \dfrac{p(x \mid z_0)\, p(z_0)}{p(x \mid z_0)\, p(z_0) + p(x \mid z_1)\, p(z_1) + p(x \mid z_2)\, p(z_2)}$
$\tau_0 = \dfrac{p(x, z_0)}{\sum_{k=0}^{2} p(x, z_k)} = \dfrac{p(x, z_0)}{p(x)} = p(z_0 \mid x)$
Given a datapoint x, this is the probability that it belongs to component 0.
If I have 100 datapoints and 3 components, what is the size of $\tau$? 100 × 3
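A minimal sketch of this responsibility matrix for 100 points and 3 components (the data and parameters are placeholders; scipy.stats.norm supplies the 1-d Gaussian densities):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100)                # 100 datapoints (placeholder data)

weights = np.array([0.3, 0.5, 0.2])     # pi_k
means   = np.array([-2.0, 0.0, 3.0])    # mu_k
stds    = np.array([0.5, 1.0, 0.8])     # sigma_k

# Joint terms p(x, z_k) = pi_k * N(x | mu_k, sigma_k), one row per point: shape (100, 3)
joint = weights * norm.pdf(x[:, None], loc=means, scale=stds)

# Responsibilities tau[n, k] = p(z_k | x_n); each row sums to 1
tau = joint / joint.sum(axis=1, keepdims=True)
print(tau.shape)                        # (100, 3)
```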
Inferring Cluster Membership
• We have representations of the joint $p(x, z_{nk} \mid \theta)$ and the
marginal $p(x \mid \theta)$
• The conditional $p(z_{nk} \mid x, \theta)$ can be derived using Bayes rule:
The responsibility that a mixture component takes for explaining an
observation x.
Mixtures of Gaussians
The probability of picking a mixture component (a Gaussian model): $p(z) = \pi_i$
AND
the probability of picking data from that specific mixture component: $p(x \mid z)$
$z$ is latent: we observe $x$, but $z$ is hidden
$p(x, z) = p(x \mid z)\, p(z)$ ➔ generative model, joint distribution
$p(x, z_k) = \mathcal{N}(x \mid \mu_k, \sigma_k)\, \pi_k$
What are GMM parameters?
Mean $\mu_k$, variance $\sigma_k$, and size (mixing weight) $\pi_k$
Marginal probability distribution
$p(x \mid \theta) = \sum_k p(x, z_k \mid \theta) = \sum_k p(x \mid z_k, \theta)\, p(z_k \mid \theta) = \sum_k \mathcal{N}(x \mid \mu_k, \sigma_k)\, \pi_k$, where $f_k(x) = \mathcal{N}(x \mid \mu_k, \sigma_k)$
$p(z_k \mid \theta) = \pi_k$: select a mixture component with probability $\pi_k$
$p(x \mid z_k, \theta) = \mathcal{N}(x \mid \mu_k, \sigma_k)$: sample from that component's Gaussian
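A minimal sketch of this two-step generative (ancestral) sampling process, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.5, 0.2])    # pi_k
means   = np.array([-2.0, 0.0, 3.0])   # mu_k
stds    = np.array([0.5, 1.0, 0.8])    # sigma_k

def sample_gmm(n):
    """First pick z ~ Categorical(pi), then draw x ~ N(mu_z, sigma_z^2)."""
    z = rng.choice(len(weights), size=n, p=weights)   # select a component with probability pi_k
    x = rng.normal(loc=means[z], scale=stds[z])       # sample from that component's Gaussian
    return x, z

x, z = sample_gmm(5)
print(z, x)
```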
How about a GMM for a multimodal distribution?
Gaussian Mixture Model
Why have a “latent variable”?
• A variable can be unobserved (latent) because:
it is an imaginary quantity meant to provide a simplified and
abstract view of the data generation process.
- e.g., speech recognition models, mixture models (soft clustering)…
it is a real-world object and/or phenomenon, but difficult or impossible
to measure
- e.g., the temperature of a star, causes of a disease, evolutionary ancestors …
it is a real-world object and/or phenomenon, but sometimes wasn't
measured, e.g., because of faulty sensors.
• Discrete latent variables can be used to partition/cluster data
into sub-groups.
• Continuous latent variables (factors) can be used for
dimensionality reduction (factor analysis, etc).
Latent variable representation
$p(x \mid \theta) = \sum_k p(x, z_{nk} \mid \theta) = \sum_k p(z_{nk} \mid \theta)\, p(x \mid z_{nk}, \theta) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
$p(z_{nk} \mid \theta) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$
$p(x \mid z_{nk}, \theta) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_{nk}}$
Why have the latent variable?
The distribution that we can model using a mixture of Gaussian components is much
more expressive than what we could have modeled using a single component.
Well, we don’t know $\pi_k$, $\mu_k$, $\Sigma_k$.
What should we do?
We use a method called “Maximum Likelihood Estimation” (MLE)
to solve the problem.
$p(x) = p(x \mid \theta) = \sum_k p(x, z_k \mid \theta) = \sum_k p(z_k \mid \theta)\, p(x \mid z_k, \theta) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
Let’s identify a likelihood function. Why?
Because we use the likelihood function to optimize the probabilistic model
parameters!
$\arg\max_{\theta}\; p(x \mid \theta) = p(x \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} p(x_n \mid \theta) = \prod_{n=1}^{N} \sum_{k} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
$\ln p(x) = \ln p(x \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big[ \sum_{k} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big]$
• As usual: Identify a likelihood function
• And set partials to zero…
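A minimal numerical sketch of evaluating this log-likelihood, using a log-sum-exp for stability (the data and parameters below are placeholders):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(size=100)                 # placeholder data
weights = np.array([0.3, 0.5, 0.2])      # pi_k
means   = np.array([-2.0, 0.0, 3.0])     # mu_k
stds    = np.array([0.5, 1.0, 0.8])      # sigma_k

# log pi_k + log N(x_n | mu_k, sigma_k) for every point n and component k: shape (N, K)
log_terms = np.log(weights) + norm.logpdf(x[:, None], loc=means, scale=stds)

# ln p(x) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k)
log_likelihood = logsumexp(log_terms, axis=1).sum()
print(log_likelihood)
```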
Maximum Likelihood of a GMM
• Optimization of means.
Maximum Likelihood of a GMM
• Optimization of covariance
Maximum Likelihood of a GMM
• Optimization of mixing term
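For reference, setting these partial derivatives to zero gives the standard stationary-point conditions in terms of the responsibilities $\gamma(z_{nk})$ (the usual results, e.g. Bishop Ch. 9, summarized here rather than copied from the slides):

```latex
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad
\pi_k = \frac{N_k}{N}
```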
MLE of a GMM
Not a closed form solution!!
The responsibility $\gamma(z_{nk})$ (our $\tau$) is not known exactly.
What next?
Outline
• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
EM for GMMs
• E-step: Evaluate the Responsibilities
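In Bishop's $\gamma$ notation, the responsibility evaluated in this step is exactly the quantity derived earlier, written for every point $n$ and component $k$:

```latex
\gamma(z_{nk}) \;=\; p(z_{nk} = 1 \mid x_n)
  \;=\; \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
             {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
```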
EM for GMMs
• M-Step: Re-estimate Parameters
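A minimal sketch of one full EM loop for a 1-d GMM, alternating the E-step and M-step above (toy data and two components; everything here is illustrative, not the slides' example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 60), rng.normal(3, 1.0, 40)])   # toy 1-d data

# Initial guesses for pi_k, mu_k, sigma_k (K = 2 components)
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k)
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, K)
    gamma = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the soft counts N_k
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print(pi, mu, sigma)   # should roughly recover weights ~[0.6, 0.4] and means ~[-2, 3]
```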
Expectation Maximization
• Expectation Maximization (EM) is a general algorithm to deal with
hidden variables.
• Two steps:
E-Step: Fill in hidden values using inference
M-Step: Apply the standard MLE method to estimate parameters
• EM always converges to a local maximum of the likelihood.
EM for Gaussian Mixture Model: Example
[Figure sequence: successive EM iterations on an example dataset, showing the responsibilities and the fitted Gaussian components converging.]
Demo
• Demo link: https://lukapopijac.github.io/gaussian-mixture-model/
EM Algorithm for GMM (matrix form)
Book: C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
EM for GMMs
• M-Step: Re-estimate Parameters
EM Algorithm for GMM (matrix form)
[Matrix-form E-step and M-step update equations, written in terms of the responsibilities $\gamma(z_{nk})$; see the Bishop reference below.]
Book: C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Relationship to K-means
• K-means makes hard decisions.
Each data point gets assigned to a single cluster.
• GMM/EM makes soft decisions.
Each data point can yield a posterior p(z|x)
• K-means is a special case of EM: hard assignments with shared, spherical covariances shrinking to zero.
General form of EM
• Given a joint distribution over observed and latent variables: $p(x, z \mid \theta)$
• Want to maximize the likelihood: $p(x \mid \theta)$
1. Initialize parameters: $\theta^{old}$
2. E-Step: Evaluate $p(z \mid x, \theta^{old})$
3. M-Step: Re-estimate parameters (based on the expectation of the complete-
data log likelihood)
$\theta^{new} = \arg\max_{\theta}\ \mathbb{E}_{p(z \mid x, \theta^{old})}\big[\ln p(x, z \mid \theta)\big]$
4. Check for convergence of params or likelihood
Maximizing the lower bound
$\mathcal{L}(q, \theta) = \sum_{z} q(z) \ln p(x, z \mid \theta) - \sum_{z} q(z) \ln q(z)$
will lead to maximizing the log likelihood $\ln p(x \mid \theta)$.
The first term is the expected complete-data log likelihood and the
second term, which does not depend on $\theta$, is the entropy of $q$.
Thus, in the M-step, maximizing with respect to $\theta$
for fixed $q$ we only need to consider the first term:
$\theta^{new} = \arg\max_{\theta} \sum_{z} q(z) \ln p(x, z \mid \theta)$
EM for Gaussian Mixture Model: Example
covariance_type="diag" or "spherical" or "full"
Source: Python Data Science Handbook by Jake VanderPlas
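A minimal scikit-learn sketch showing how this option is set when fitting a GMM (the blob data below is made up; GaussianMixture, fit, predict_proba, means_, and weights_ are standard scikit-learn API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder 2-d data: two well-separated blobs
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))])

# covariance_type controls each component's shape:
# "spherical" = one variance, "diag" = axis-aligned ellipse, "full" = arbitrary ellipse
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)               # estimated pi_k
print(gmm.means_)                 # estimated mu_k
print(gmm.predict_proba(X[:3]))   # soft assignments (responsibilities)
```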
Silhouette Coefficient
[Figure: a point $X_i$ with its within-cluster distance $\mu_{in}(X_i)$ and its distances $\mu_{out1}(X_i)$, $\mu_{out2}(X_i)$ to the two other clusters.]
$\mu_{out}^{min}(X_i) = \min\{\mu_{out2}(X_i), \mu_{out1}(X_i)\}$
Silhouette Coefficient
The Silhouette Coefficient for clustering C:
An SC close to 1 implies a good clustering (points are close to their own
clusters but far from other clusters)
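The coefficient can be written in the same $\mu_{in}$ / $\mu_{out}$ notation; take the form below as the standard definition rather than a verbatim copy of the slide:

```latex
s(X_i) = \frac{\mu_{out}^{\min}(X_i) - \mu_{in}(X_i)}
              {\max\{\mu_{out}^{\min}(X_i),\ \mu_{in}(X_i)\}},
\qquad
SC(C) = \frac{1}{N} \sum_{i=1}^{N} s(X_i)
```

In scikit-learn, sklearn.metrics.silhouette_score(X, labels) computes this average directly.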
Take-Home Messages
• The generative process of a Gaussian Mixture Model
• Inferring cluster membership based on a learned GMM
• The general idea of Expectation-Maximization
• Expectation-Maximization for GMM
• Silhouette Coefficient