Probabilistic Machine Learning
Lecture 5: Expectation maximization
Pekka Marttinen
Aalto University
February, 2025
Lecture 5 overview
Gaussian mixture models (GMMs), recap
EM algorithm
EM for Gaussian mixture models
Suggested reading: Bishop, Pattern Recognition and Machine Learning:
p. 110-113 (Section 2.3.9): Mixtures of Gaussians
p. 430-443: EM for Gaussian mixtures
simple_example.pdf
GMMs, latent variable representation
Introduce latent variables z_n = (z_n1, . . . , z_nK) which specify the
component k of observation x_n:
    z_n = (0, . . . , 0, 1, 0, . . . , 0)^T,   with the 1 in the k-th element.
Define
    p(z_n) = ∏_{k=1}^K π_k^{z_nk}   and   p(x_n | z_n) = ∏_{k=1}^K N(x_n | µ_k, Σ_k)^{z_nk}.
Then the marginal distribution p(x_n) is a GMM:
    p(x_n) = ∑_{k=1}^K π_k N(x_n | µ_k, Σ_k).
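A small NumPy sketch of sampling from this latent-variable representation (ancestral
sampling: draw z_n first, then x_n given z_n; the parameter values below are made up
for illustration and are not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.5, 0.3, 0.2])                          # mixing coefficients pi_k
    mu = np.array([[-2.0, 0.0], [0.0, 2.0], [3.0, -1.0]])   # component means mu_k
    Sigma = np.stack([np.eye(2)] * 3)                       # component covariances Sigma_k

    # Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_{z_n}, Sigma_{z_n})
    N = 500
    z = rng.choice(3, size=N, p=pi)                         # component indicator of each x_n
    X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    # Marginally, the rows of X follow the GMM p(x) = sum_k pi_k N(x | mu_k, Sigma_k)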
GMM: responsibilities, complete data
Posterior probability (responsibility) p(z_nk = 1 | x_n) that observation
x_n was generated by component k:
    γ(z_nk) ≡ p(z_nk = 1 | x_n) = π_k N(x_n | µ_k, Σ_k) / ∑_{j=1}^K π_j N(x_n | µ_j, Σ_j)
Complete data: latent variables z and data x together: (x, z)
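As a sketch, the responsibilities can be computed in a couple of NumPy lines
(reusing pi, mu, Sigma and the samples X from the snippet above):

    from scipy.stats import multivariate_normal

    # gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(3)])
    gamma = dens / dens.sum(axis=1, keepdims=True)   # each row sums to one: soft assignments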
Idea of the EM algorithm (1/2)
Let X denote the observed data and θ the model parameters. The goal
in maximum likelihood is to find θ̂:
    θ̂ = arg max_θ { log p(X | θ) }
If the model contains latent variables Z, the log-likelihood is given by
    log p(X | θ) = log { ∑_Z p(X, Z | θ) },
which may be difficult to maximize analytically
Possible solutions: 1) numerical optimization, 2) the EM algorithm
(expectation-maximization)
Idea of the EM algorithm (2/2)
X: observed data, Z: unobserved latent variables
{X, Z}: complete data, X: incomplete data
In the EM algorithm, we assume that the complete data log-likelihood
    log p(X, Z | θ)
is easy to maximize.
Problem: Z is not observed
Solution: maximize
    Q(θ, θ^old) ≡ E_{Z|X,θ^old}[ log p(X, Z | θ) ] = ∑_Z p(Z | X, θ^old) log p(X, Z | θ),
where p(Z | X, θ^old) is the posterior distribution of the latent variables
computed using the current parameter estimate θ^old
Illustration of the EM algorithm for GMMs
Pekka Marttinen (Aalto University) Probabilistic Machine Learning February, 2025 7 / 16
EM algorithm in detail
Goal: maximize log p(X | θ) w.r.t. θ
1 Initialize θ^old
2 E-step: Evaluate p(Z | X, θ^old), and then compute
    Q(θ, θ^old) = E_{Z|X,θ^old}[ log p(X, Z | θ) ] = ∑_Z p(Z | X, θ^old) log p(X, Z | θ)
3 M-step: Evaluate θ^new using
    θ^new = arg max_θ Q(θ, θ^old).
  Set θ^old ← θ^new
4 Repeat the E and M steps until convergence
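As a generic skeleton (a sketch only; the e_step and m_step callbacks would be supplied
for a concrete model, e.g. the GMM updates later in the lecture):

    import numpy as np

    def em(X, theta_init, e_step, m_step, n_iter=100, tol=1e-6):
        """Generic EM loop (sketch): e_step computes the posterior over the latent
        variables under theta_old; m_step maximizes Q(theta, theta_old)."""
        theta_old = np.asarray(theta_init, dtype=float)
        for _ in range(n_iter):
            posterior = e_step(X, theta_old)          # E-step: p(Z | X, theta_old)
            theta_new = np.asarray(m_step(X, posterior), dtype=float)   # M-step: arg max_theta Q
            if np.max(np.abs(theta_new - theta_old)) < tol:             # simple convergence check
                return theta_new
            theta_old = theta_new
        return theta_old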
Why EM works
Figure: 11.16 in Murphy (2012)
As a function of θ, Q(θ, θ^old) is, up to an additive constant, a lower bound
on the log-likelihood log p(X | θ) (see Bishop, Ch. 9.4).
EM iterates between 1) updating the lower bound (E-step) and 2)
maximizing the lower bound (M-step).
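A sketch of the decomposition behind this (following Bishop, Ch. 9.4): for any
distribution q(Z) over the latent variables,

    log p(X | θ) = L(q, θ) + KL(q ‖ p(Z | X, θ)),   with   L(q, θ) = ∑_Z q(Z) log [ p(X, Z | θ) / q(Z) ]

and KL(·‖·) ≥ 0, so L(q, θ) is a lower bound on log p(X | θ). The E-step sets
q(Z) = p(Z | X, θ^old), which makes the bound tight at θ = θ^old, and then
L(q, θ) = Q(θ, θ^old) + const (the entropy of q). The M-step maximizes this bound
over θ, so the log-likelihood cannot decrease from one iteration to the next.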
EM algorithm, comments
In general, Z does not have to be discrete; just replace the
summation in Q(θ, θ^old) by integration.
The EM algorithm can be used to compute the MAP (maximum a
posteriori) estimate by maximizing Q(θ, θ^old) + log p(θ) in the M-step.
In general, the EM algorithm is applicable whenever the observed data X can
be augmented into complete data {X, Z} such that log p(X, Z | θ) is
easy to maximize; Z does not have to be latent variables but can
represent, for example, unobserved values of missing or censored
observations.
EM algorithm, simple example
Consider N independent observations x = (x_1, . . . , x_N) from a
two-component mixture of univariate Gaussians
    p(x_n | θ) = (1/2) N(x_n | 0, 1) + (1/2) N(x_n | θ, 1).   (1)
One unknown parameter, θ, the mean of the second component.
Goal: estimate
    θ̂ = arg max_θ { log p(x | θ) }.
simple_example.pdf
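A sketch of EM for this example: the E-step computes γ_n, the responsibility of the
second component for x_n, and the M-step sets θ^new = ∑_n γ_n x_n / ∑_n γ_n, the
responsibility-weighted mean. The data-generating values and initialization below are
made up for illustration:

    import numpy as np
    from scipy.stats import norm

    def em_simple(x, theta=1.0, n_iter=50):
        """EM for p(x_n | theta) = 0.5 N(x_n | 0, 1) + 0.5 N(x_n | theta, 1)."""
        for _ in range(n_iter):
            # E-step: responsibility of the second component for each observation
            num = 0.5 * norm.pdf(x, loc=theta, scale=1.0)
            gamma = num / (0.5 * norm.pdf(x, loc=0.0, scale=1.0) + num)
            # M-step: maximize Q(theta, theta_old) -> responsibility-weighted mean
            theta = np.sum(gamma * x) / np.sum(gamma)
        return theta

    # Simulate data from model (1) with true theta = 3 and run EM
    rng = np.random.default_rng(1)
    x = np.where(rng.random(1000) < 0.5, rng.normal(0.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000))
    print(em_simple(x))   # should end up near the ML estimate, close to 3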
EM algorithm for GMMs
    p(x) = ∑_{k=1}^K π_k N(x | µ_k, Σ_k)
1 Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k. Repeat until
convergence:
2 E-step: Evaluate the responsibilities using the current parameter values
    γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / ∑_{j=1}^K π_j N(x_n | µ_j, Σ_j)
3 M-step: Re-estimate the parameters using the current responsibilities
    µ_k^new = (1/N_k) ∑_{n=1}^N γ(z_nk) x_n
    Σ_k^new = (1/N_k) ∑_{n=1}^N γ(z_nk) (x_n − µ_k^new)(x_n − µ_k^new)^T
    π_k^new = N_k / N,   where N_k = ∑_{n=1}^N γ(z_nk)
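A minimal NumPy sketch of these updates (assuming data X of shape (N, D); names such as
em_gmm are illustrative, not from the lecture):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        """EM for a Gaussian mixture model (illustrative sketch)."""
        N, D = X.shape
        rng = np.random.default_rng(seed)
        # Initialization: random data points as means, identity covariances, uniform weights
        mu = X[rng.choice(N, size=K, replace=False)]
        Sigma = np.stack([np.eye(D) for _ in range(K)])
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
            dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                    for k in range(K)])
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate the parameters from the responsibilities
            Nk = gamma.sum(axis=0)                     # effective number of points per component
            mu = (gamma.T @ X) / Nk[:, None]           # mu_k^new
            for k in range(K):
                Xc = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # Sigma_k^new
            pi = Nk / N                                # pi_k^new
        return pi, mu, Sigma, gamma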
Derivation of the EM algorithm for GMMs
In the M-step the formulas for µ_k^new and Σ_k^new are obtained by
differentiating the expected complete data log-likelihood Q(θ, θ^old)
with respect to the particular parameters, and setting the derivatives
to zero.
The formula for π_k^new can be derived by maximizing Q(θ, θ^old) under
the constraint ∑_{k=1}^K π_k = 1. This can be done using Lagrange
multipliers, as sketched below.
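A sketch of that constrained maximization: the only terms of Q(θ, θ^old) involving the
mixing coefficients are ∑_n ∑_k γ(z_nk) log π_k, so with a Lagrange multiplier λ for the
constraint,

    ∂/∂π_k [ ∑_n ∑_j γ(z_nj) log π_j + λ (∑_j π_j − 1) ] = N_k/π_k + λ = 0   ⇒   π_k = −N_k/λ,

and summing over k with ∑_k π_k = 1 and ∑_k N_k = N gives λ = −N, hence π_k^new = N_k/N.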
EM for GMM, caveats
EM converges to a local optimum. In fact, ML estimation for
GMMs is not well-defined due to singularities: if σ_k → 0 for a
component k with a single data point, the likelihood goes to infinity
(see figure). Remedy: a prior on σ_k.
Label switching: non-identifiability due to the fact that the cluster labels
can be switched while the likelihood remains the same.
In practice it is recommended to initialize EM for the GMM with
k-means, e.g. as in the sketch below.
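A rough k-means initialization in NumPy (a sketch assuming every cluster keeps at
least a few points; the helper name kmeans_init is illustrative):

    import numpy as np

    def kmeans_init(X, K, n_iter=20, seed=0):
        """A few Lloyd iterations to get initial pi, mu, Sigma for the GMM EM."""
        N, D = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(N, size=K, replace=False)]       # random data points as initial centres
        for _ in range(n_iter):
            dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)                   # hard assignment to the nearest centre
            mu = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
        pi = np.bincount(labels, minlength=K) / N          # cluster proportions as initial weights
        Sigma = np.stack([np.cov(X[labels == k].T) + 1e-6 * np.eye(D)   # within-cluster covariances
                          for k in range(K)])
        return pi, mu, Sigma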
GMM vs. k-means
"Why use GMMs and not just k-means?"
from Wikipedia
1 Clusters can be of different sizes and shapes
2 Probabilistic assignment of data items to clusters
3 Possibility to include prior knowledge (structure of the model/prior
distributions on the parameters)
Important points
ML-estimation of GMMs can be done using numerical optimization or
the EM algorithm.
The main idea of the EM algorithm is to maximize the expectation of
the complete data log-likelihood, where the expectation is computed
with respect to the current posterior distributions (responsibilities) of
the latent variables.