Clustering and
Gaussian Mixture Model
Dr. Sayak Roychowdhury
Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Application of K-means Clustering
• Image segmentation and compression
• The goal of segmentation is to partition an image into regions each of which has a
reasonably homogeneous visual appearance or which corresponds to objects or
parts of objects
• Each pixel in an image is a point in a 3-dimensional space comprising the intensities
of the RGB channels
• After running K-means to convergence for a particular value of K, the image can be re-drawn by replacing each pixel vector with the {R, G, B} intensity triplet of the centre 𝜇𝑘 to which that pixel has been assigned
• Data compression: K-means can be used for lossy data compression
• Each data point is approximated by its nearest cluster centre 𝜇𝑘
• This framework is often called vector quantization, and the vectors 𝜇𝑘 are called code-book vectors (a short code sketch follows below)
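As an illustration of the vector-quantization idea above, here is a minimal sketch of K-means image compression. It assumes scikit-learn and NumPy are available and that `image` is an (H, W, 3) RGB array; the function name and these choices are illustrative, not part of the original slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_image(image: np.ndarray, K: int) -> np.ndarray:
    """Replace each pixel by the RGB triplet of its nearest K-means centre."""
    h, w, c = image.shape                      # expect an (H, W, 3) RGB array
    pixels = image.reshape(-1, c).astype(float)

    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
    codebook = km.cluster_centers_             # the K "code-book vectors" mu_k
    labels = km.labels_                        # index of the nearest centre per pixel

    # Each pixel is approximated by its code-book vector (lossy compression).
    return codebook[labels].reshape(h, w, c).astype(image.dtype)
```

Only the K code-book vectors and one cluster index per pixel need to be stored, which is where the (lossy) compression comes from.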
Image Segmentation with K-means
[Figure: image segmentation with K-means.]
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Gaussian Distribution
• Univariate Gaussian Distribution:
• f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
• Multivariate Gaussian Distribution:
f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
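As a quick numerical check of the density above, the following sketch evaluates the multivariate Gaussian both via scipy.stats and directly from the formula; the library choice and variable names are assumptions made for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                 # mean vector (p = 2 here)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])            # covariance matrix (symmetric, positive definite)
x = np.array([1.0, -0.5])

# Density from scipy: f(x | mu, Sigma)
pdf_value = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# The same value computed directly from the formula, as a check.
p = len(mu)
diff = x - mu
direct = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
print(pdf_value, direct)                  # the two numbers agree
```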
Gaussian Mixture
[Figure: data points generated from 3 Gaussian distributions, and the clustering obtained from the estimated posterior probabilities of the clusters using a GMM.]
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Maximum Likelihood for Parameter Estimation
\ln f_k(x \mid \mu_k, \Sigma_k) = -\frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{p}{2} \ln(2\pi)
Differentiating and equating to 0:
\hat{\mu}_k = \frac{\sum_{g_i = k} x_i}{N_k}
\hat{\Sigma}_k = \frac{\sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{N_k}
where N_k is the number of data points in the k-th cluster and g_i denotes the cluster to which x_i is assigned
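A minimal sketch of these closed-form estimates, assuming the data are in a NumPy array `X` of shape (N, p) and the hard cluster assignments g_i are in an integer array `g` (the names are hypothetical):

```python
import numpy as np

def mle_per_cluster(X: np.ndarray, g: np.ndarray, K: int):
    """Maximum-likelihood mean and covariance for each cluster, given hard labels g."""
    means, covs = [], []
    for k in range(K):
        Xk = X[g == k]                      # points assigned to cluster k
        Nk = len(Xk)
        mu_k = Xk.mean(axis=0)              # mu_hat_k = sum_{g_i=k} x_i / N_k
        diff = Xk - mu_k
        Sigma_k = diff.T @ diff / Nk        # Sigma_hat_k = sum (x_i - mu)(x_i - mu)^T / N_k
        means.append(mu_k)
        covs.append(Sigma_k)
    return np.array(means), np.array(covs)
```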
Gaussian Mixture
• Linear superposition of Gaussians:
f(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
• Normalization and positivity of the weights (mixing coefficients):
0 \le w_k \le 1, \qquad \sum_{k=1}^{K} w_k = 1
• Log-likelihood:
\ln f(X \mid \mu, \Sigma, W) = \sum_{i=1}^{N} \ln f(x_i) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
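The log-likelihood above can be evaluated directly. A small sketch, assuming the mixture parameters are held in arrays `w` (mixing coefficients), `mus` (means) and `Sigmas` (covariances), with scipy.stats supplying the Gaussian densities (all of these are illustrative choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mus, Sigmas):
    """ln f(X) = sum_i ln sum_k w_k N(x_i | mu_k, Sigma_k)."""
    K = len(w)
    # dens[i, k] = N(x_i | mu_k, Sigma_k)
    dens = np.column_stack([
        multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X) for k in range(K)
    ])
    return np.sum(np.log(dens @ w))
```

In practice a log-sum-exp formulation is preferred to avoid numerical underflow when the individual densities are very small.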
Responsibilities
• The mixing coefficients can be thought of as prior probabilities of the components
• For a given value of x, the posterior probabilities of the components, also called “responsibilities”, can be calculated
• Using Bayes' rule:
\gamma_k(x) = f(k \mid x) = \frac{f(x \mid k)\, f(k)}{f(x)} = \frac{w_k f_k(x)}{\sum_l w_l f_l(x)} = \frac{w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)}
where w_k = N_k / N
\gamma_k(x) is the posterior probability (i.e. the expected value) of the latent indicator variable z_k associated with component k.
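A minimal sketch of the responsibility computation, using the same hypothetical `w`, `mus`, `Sigmas` arrays as in the previous sketch; this is exactly the E-step of the EM algorithm described next.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, w, mus, Sigmas):
    """gamma[i, k] = w_k N(x_i | mu_k, Sigma_k) / sum_l w_l N(x_i | mu_l, Sigma_l)."""
    K = len(w)
    weighted = np.column_stack([
        w[k] * multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X) for k in range(K)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```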
Expectation Maximization (EM) Algorithm
• The EM algorithm is an iterative optimization technique
• Expectation step (E-step): for the given parameter values, compute the expected values of the latent variables
• Maximization step (M-step): update the parameters of the model based on the computed expected values of the latent variables
Expectation Maximization (EM) Algorithm
• Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the means, the covariances and the mixing coefficients
• Initialize 𝜇𝑘, Σ𝑘 and the mixing coefficients 𝑤𝑘, and evaluate the initial log-likelihood
• Expectation step: Evaluate responsibilities using current parameter
values:
\gamma_k(x) = \frac{w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)}
Expectation Maximization (EM) Algorithm
• Maximization step: Reestimate the parameters using current
responsibilities:
\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
• The mean 𝜇𝑘 for the kth Gaussian component is obtained by taking a
weighted mean of all of the points in the data set, in which the
weighting factor for data point 𝒙𝒏 is given by the posterior probability
𝛾 𝑧𝑛𝑘 that component k was responsible for generating 𝒙𝒏 .
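As a sketch, this weighted-mean update can be written compactly with a responsibility matrix `gamma` of shape (N, K), as returned by the earlier `responsibilities` helper (both names are hypothetical):

```python
import numpy as np

def update_means(X, gamma):
    """mu_k_new = (1/N_k) * sum_n gamma[n, k] * x_n, with N_k = sum_n gamma[n, k]."""
    Nk = gamma.sum(axis=0)                  # effective number of points per component
    return (gamma.T @ X) / Nk[:, None]      # array of shape (K, p)
```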
Expectation Maximization (EM) Algorithm
• Setting derivative of ln 𝑓(𝑋|𝜇, Σ, 𝑊) equal to 0 w.r.t. Σ𝑘
• \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
• Finally, maximize \ln f(X \mid \mu, \Sigma, W) with respect to w_k subject to the constraint
\sum_{k=1}^{K} w_k = 1
• This can be achieved using a Lagrange multiplier, maximizing
\ln f(X \mid \mu, \Sigma, W) + \lambda \left( \sum_{k=1}^{K} w_k - 1 \right)
resulting in w_k^{new} = \frac{N_k}{N}, where N_k = \sum_{n=1}^{N} \gamma(z_{nk})
• Evaluate the log-likelihood \ln f(X \mid \mu, \Sigma, W) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
• Iterate through the E-step and M-step until the log-likelihood (or the parameters) converges.
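Putting the pieces together, here is a compact sketch of one full EM pass, reusing the hypothetical `responsibilities` helper defined earlier; the update formulas are the ones given above.

```python
import numpy as np

def em_step(X, w, mus, Sigmas):
    """One EM iteration for a Gaussian mixture: returns updated parameters."""
    N, p = X.shape
    gamma = responsibilities(X, w, mus, Sigmas)      # E-step (see the sketch above)
    Nk = gamma.sum(axis=0)                           # N_k = sum_n gamma(z_nk)

    mus_new = (gamma.T @ X) / Nk[:, None]            # weighted means
    Sigmas_new = np.empty((len(w), p, p))
    for k in range(len(w)):
        diff = X - mus_new[k]
        Sigmas_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    w_new = Nk / N                                   # updated mixing coefficients

    return w_new, mus_new, Sigmas_new
```

Iterating `em_step` and monitoring the log-likelihood until its change falls below a tolerance reproduces the procedure described above.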
Expectation Maximization (EM)
[Figure: illustration of EM iterations on a Gaussian mixture.]
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
EM Algorithm
• Since K-means is faster, it is common to run the K-means algorithm to
find a suitable initialization for a Gaussian mixture model that is
subsequently adapted using EM.
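One possible way to realize this initialization, sketched with scikit-learn's KMeans: the hard K-means assignments supply starting means, covariances and mixing coefficients for EM (the library choice, the function name and the small ridge added to each covariance are assumptions of this sketch).

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K):
    """Initialize GMM parameters (w, mus, Sigmas) from a K-means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    N, p = X.shape
    w = np.array([np.mean(labels == k) for k in range(K)])                 # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])        # cluster means
    # Per-cluster covariances, with a small ridge to keep them invertible.
    Sigmas = np.array([np.cov(X[labels == k], rowvar=False) + 1e-6 * np.eye(p)
                       for k in range(K)])
    return w, mus, Sigmas
```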