Expectation-Maximization Algorithm and Applications
Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006
List of Concepts
Maximum-Likelihood Estimation (MLE)
Expectation-Maximization (EM)
Conditional Probability
Mixture Modeling
Gaussian Mixture Models (GMMs)
String edit-distance
Forward-backward algorithms
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
One-Slide MLE Review
Say I give you a coin with $P(\text{heads}) = \theta$
But I don't tell you the value of $\theta$
Now say I let you flip the coin n times
You get h heads and n-h tails
What is the natural estimate of $\theta$?
This is $\hat{\theta} = h/n$
More formally, the likelihood of $\theta$ is governed by a binomial distribution:
$P(h \text{ heads in } n \text{ flips} \mid \theta) = \binom{n}{h}\, \theta^{h} (1-\theta)^{n-h}$
Can prove $\hat{\theta} = h/n$ is the maximum-likelihood estimate of $\theta$: differentiate with respect to $\theta$, set equal to 0
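As a quick numerical sanity check of this claim, here is a small Python sketch (the flip counts n and h are made-up values) that evaluates the binomial log-likelihood on a grid of $\theta$ values and confirms the maximum sits at h/n:

    import numpy as np

    # Made-up example: n flips, h heads
    n, h = 100, 37

    # Binomial log-likelihood, up to the constant binomial coefficient:
    # log L(theta) = h*log(theta) + (n - h)*log(1 - theta)
    thetas = np.linspace(0.001, 0.999, 999)
    log_lik = h * np.log(thetas) + (n - h) * np.log(1.0 - thetas)

    print(thetas[np.argmax(log_lik)])  # ~0.37, i.e. h / n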
EM Motivation
So, to solve any ML-type problem, we analytically maximize the likelihood function?
Seems to work for 1D Bernoulli (coin toss)
Also works for 1D Gaussian (find $\mu$, $\sigma^2$)
Not quite
Distribution may not be well-behaved, or may have too many parameters
Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (1M parameters)
Direct maximization is not feasible
Solution: introduce hidden variables to
Simplify the likelihood function (more common)
Account for actual missing data
Hidden and Observed Variables
Observed variables: directly measurable from the data, e.g.
The waveform values of a speech recording
Is it raining today?
Did the smoke alarm go off?
Hidden variables: influence the data, but not trivial to measure
The phonemes that produce a given speech recording
P(rain today | rain yesterday)
Is the smoke alarm malfunctioning?
Expectation-Maximization
Model dependent random variables:
Observed variable x
Unobserved (hidden) variable y that generates x
Assume probability distributions: $p(x, y \mid \theta)$ and $p(y \mid x, \theta)$
$\theta$ represents the set of all parameters of the distribution
Repeat until convergence
E-step: Compute expectation of $Q(\theta, \theta^{-}) = E\left[\log p(x, y \mid \theta) \mid x, \theta^{-}\right]$
($\theta^{-}$, $\theta$: old, new distribution parameters)
M-step: Find $\theta$ that maximizes $Q(\theta, \theta^{-})$
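As a rough illustration of this loop (not tied to any particular model), a generic EM driver might look like the Python sketch below; e_step, m_step, and log_lik are hypothetical callables that a concrete model would supply:

    def em(x, theta0, e_step, m_step, log_lik, tol=1e-6, max_iter=200):
        """Schematic EM driver: alternate E- and M-steps until the
        observed-data log-likelihood stops improving."""
        theta, prev = theta0, float("-inf")
        for _ in range(max_iter):
            posterior = e_step(x, theta)   # E-step: p(y | x, theta) / expected statistics
            theta = m_step(x, posterior)   # M-step: parameters maximizing Q(., theta)
            cur = log_lik(x, theta)        # by the EM theorem, this never decreases
            if cur - prev < tol:
                break
            prev = cur
        return theta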
Conditional Expectation Review
Let X, Y be r.v.s drawn from the distributions P(x) and P(y)
Conditional distribution given by: $P(y \mid x) = \dfrac{P(x, y)}{P(x)}$
Then $E[Y] = \sum_{y} y\, P(y)$
For a function h(Y): $E[h(Y)] = \sum_{y} h(y)\, P(y)$
Given a particular value of X (X = x): $E[h(Y) \mid X = x] = \sum_{y} h(y)\, P(y \mid x)$
Maximum Likelihood Problem
Want to pick $\theta$ that maximizes the log-likelihood $\log p(x, y \mid \theta)$ of the observed (x) and unobserved (y) variables, given:
Observed variable x
Previous parameters $\theta^{-}$
Conditional expectation of $\log p(x, y \mid \theta)$ given x and $\theta^{-}$ is
$Q(\theta, \theta^{-}) = E\left[\log p(x, y \mid \theta) \mid x, \theta^{-}\right] = \sum_{y} p(y \mid x, \theta^{-})\, \log p(x, y \mid \theta)$
EM Derivation
Lemma (Special case of Jensen's Inequality): Let p(x), q(x) be probability distributions. Then
$\sum_{x} p(x)\, \log p(x) \;\geq\; \sum_{x} p(x)\, \log q(x)$
Proof: rewrite as $\sum_{x} p(x)\, \log \dfrac{p(x)}{q(x)} \;\geq\; 0$
Interpretation: relative entropy is non-negative
EM Derivation
EM Theorem:
If $Q(\theta, \theta^{-}) \geq Q(\theta^{-}, \theta^{-})$ then $p(x \mid \theta) \geq p(x \mid \theta^{-})$
Proof:
By some algebra and the lemma, $\log p(x \mid \theta) - \log p(x \mid \theta^{-}) \;\geq\; Q(\theta, \theta^{-}) - Q(\theta^{-}, \theta^{-})$
So, if this quantity is positive, so is $\log p(x \mid \theta) - \log p(x \mid \theta^{-})$
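The "some algebra" can be spelled out as follows (a sketch using the Q defined earlier and the relative-entropy lemma; the expectation is taken with respect to $p(y \mid x, \theta^{-})$):

    \begin{aligned}
    \log p(x \mid \theta)
      &= \sum_y p(y \mid x, \theta^{-}) \log p(x \mid \theta) \\
      &= \sum_y p(y \mid x, \theta^{-}) \log \frac{p(x, y \mid \theta)}{p(y \mid x, \theta)} \\
      &= Q(\theta, \theta^{-}) - \sum_y p(y \mid x, \theta^{-}) \log p(y \mid x, \theta)
    \end{aligned}

Subtracting the same identity evaluated at $\theta = \theta^{-}$ gives

    \log p(x \mid \theta) - \log p(x \mid \theta^{-})
      = \big[ Q(\theta, \theta^{-}) - Q(\theta^{-}, \theta^{-}) \big]
        + \sum_y p(y \mid x, \theta^{-}) \log \frac{p(y \mid x, \theta^{-})}{p(y \mid x, \theta)}

and the last sum is a relative entropy, hence non-negative by the lemma.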
EM Summary
Repeat until convergence
E-step: Compute expectation of $Q(\theta, \theta^{-}) = E\left[\log p(x, y \mid \theta) \mid x, \theta^{-}\right]$
($\theta^{-}$, $\theta$: old, new distribution parameters)
M-step: Find $\theta$ that maximizes $Q(\theta, \theta^{-})$
EM Theorem:
If $Q(\theta, \theta^{-}) \geq Q(\theta^{-}, \theta^{-})$ then $p(x \mid \theta) \geq p(x \mid \theta^{-})$
Interpretation
As long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x
Actually, it's not necessary to maximize the expectation; we just need to make sure that it increases. This is called Generalized EM
EM Comments
In practice, x is a series of data points $x_1, \ldots, x_n$
To calculate the expectation, we can assume the points are i.i.d. and sum over all of them:
$Q(\theta, \theta^{-}) = \sum_{i=1}^{n} \sum_{y} p(y \mid x_i, \theta^{-})\, \log p(x_i, y \mid \theta)$
Problems with EM?
Local maxima
Need to bootstrap the training process (pick an initial $\theta$)
When is EM most useful?
When the model distributions are easy to maximize (e.g., Gaussian mixture models)
EM is a meta-algorithm; it needs to be adapted to each particular application
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
EM Applications: Mixture Models
Gaussian/normal distribution
Parameters: mean $\mu$ and variance $\sigma^2$
In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions
We can model arbitrary distributions with density mixtures
Density Mixtures
Combine m elementary densities to model a complex data distribution:
$p(x \mid \theta) = \sum_{k=1}^{m} w_k\, N(x; \mu_k, \sigma_k^2)$, with mixture weights $w_k \geq 0$, $\sum_k w_k = 1$
kth Gaussian parametrized by $(w_k, \mu_k, \sigma_k)$
Density Mixtures
Combine m elementary densities to model a complex data distribution
Log-likelihood function of the data x given $\theta$:
$L(\theta) = \log p(x \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{m} w_k\, N(x_i; \mu_k, \sigma_k^2)$
Log of sum is hard to optimize analytically! Instead, introduce hidden variable y
$y_i = k$: $x_i$ generated by Gaussian k
EM formulation: maximize $Q(\theta, \theta^{-}) = E\left[\log p(x, y \mid \theta) \mid x, \theta^{-}\right]$
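To make the "log of sum" concrete, here is a small Python sketch that evaluates the mixture log-likelihood for isotropic Gaussians, using a log-sum-exp over components for numerical stability (the argument names w, mu, var are my own):

    import numpy as np

    def mixture_log_likelihood(X, w, mu, var):
        """log p(X | theta) = sum_i log sum_k w_k N(x_i; mu_k, var_k * I).
        X: (n, d) data, w: (m,) weights, mu: (m, d) means, var: (m,) variances."""
        d = X.shape[1]
        log_terms = np.stack([
            np.log(w[k])
            - 0.5 * (d * np.log(2.0 * np.pi * var[k])
                     + np.sum((X - mu[k]) ** 2, axis=1) / var[k])
            for k in range(len(w))
        ], axis=1)                                  # (n, m): log of w_k N(x_i; mu_k, var_k I)
        return np.logaddexp.reduce(log_terms, axis=1).sum()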
Gaussian Mixture Model EM
Goal: maximize the log-likelihood $L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
n (observed) data points: $x_1, \ldots, x_n$
n (hidden) labels: $y_1, \ldots, y_n$
$y_i = k$: $x_i$ generated by Gaussian k
Several pages of math later, we get:
E-step: compute the likelihood of each label, $p(y_i = k \mid x_i, \theta^{-})$, for every point $x_i$ and Gaussian k
M-step: update $w_k$, $\mu_k$, $\sigma_k$ for each Gaussian k = 1..m
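A compact numpy sketch of these updates for isotropic Gaussians (variable names and the simple initialization are my own choices; a real implementation would also monitor the log-likelihood for convergence):

    import numpy as np

    def isotropic_gaussian_logpdf(X, mu, var):
        """Row-wise log N(x; mu, var * I) for d-dimensional points."""
        d = X.shape[1]
        return -0.5 * (d * np.log(2.0 * np.pi * var)
                       + np.sum((X - mu) ** 2, axis=1) / var)

    def gmm_em(X, m, n_iter=100, seed=0):
        """EM for a mixture of m isotropic Gaussians on data X of shape (n, d)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.full(m, 1.0 / m)                       # mixture weights
        mu = X[rng.choice(n, size=m, replace=False)]  # bootstrap the means from the data
        var = np.full(m, X.var())                     # shared initial variance
        for _ in range(n_iter):
            # E-step: responsibilities gamma[i, k] = p(y_i = k | x_i, theta)
            log_p = np.stack([np.log(w[k]) + isotropic_gaussian_logpdf(X, mu[k], var[k])
                              for k in range(m)], axis=1)
            gamma = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
            # M-step: re-estimate weights, means, and variances from the soft counts
            Nk = gamma.sum(axis=0)                    # expected number of points per Gaussian
            w = Nk / n
            mu = (gamma.T @ X) / Nk[:, None]
            var = np.array([(gamma[:, k] * np.sum((X - mu[k]) ** 2, axis=1)).sum()
                            / (d * Nk[k]) for k in range(m)])
        return w, mu, var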
GMM-EM Discussion
Summary: EM is naturally applicable to training probabilistic models
EM is a generic formulation; you need to do some hairy math to get to an implementation
Problems with GMM-EM?
Local maxima
Need to bootstrap the training process (pick an initial $\theta$)
GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
Hours of fun with GMM-EM
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
String Edit-Distance
Notation: operate on two strings, x and y
Edit-distance: transform one string into another using three edit operations:
Substitution: kitten → bitten, with an associated substitution cost
Insertion: cop → crop, with an associated insertion cost
Deletion: learn → earn, with an associated deletion cost
Can compute the distance efficiently with a recursive dynamic program
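A minimal dynamic-programming sketch of the cost-based edit distance (unit costs by default; parameter names are my own):

    def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
        """Classic Levenshtein-style edit distance via dynamic programming."""
        n, m = len(x), len(y)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i * del_cost                    # delete all of x[:i]
        for j in range(1, m + 1):
            d[0][j] = j * ins_cost                    # insert all of y[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = 0 if x[i - 1] == y[j - 1] else sub_cost
                d[i][j] = min(d[i - 1][j - 1] + match,  # substitution (or free match)
                              d[i - 1][j] + del_cost,   # deletion
                              d[i][j - 1] + ins_cost)   # insertion
        return d[n][m]

    print(edit_distance("kitten", "bitten"))  # 1 (one substitution)
    print(edit_distance("learn", "earn"))     # 1 (one deletion)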
Stochastic String Edit-Distance
Instead of setting costs, model the edit operation sequence as a random process
Edit operations are selected according to a probability distribution
For an edit operation sequence $z_1 \ldots z_n$, view string edit-distance as a memoryless stochastic transducer:
memoryless (Markov): each edit operation is selected independently of the ones before it
stochastic: the random process is governed by a true probability distribution over edit operations
transducer: the operation sequence transforms an input string into an output string
Edit-Distance Transducer
Arc label a:b/0 means input a, output b, and weight 0
(The slide shows a weighted transducer encoding the substitution, insertion, and deletion operations)
Two Distances
Define the yield of an edit sequence $z^n\#$ as the set of all string pairs $\langle x, y \rangle$ such that $z^n\#$ turns x into y
Viterbi edit-distance: negative log-likelihood of the most likely edit sequence that turns x into y
Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y
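In symbols (the yield notation $\nu(\cdot)$ and the edit-operation distribution $\theta$ are my shorthand for the definitions above), the two distances are

    d_V(x, y) = -\log \max_{z^n\# :\ \langle x, y \rangle \in \nu(z^n\#)} p(z^n\# \mid \theta)
    \qquad
    d_S(x, y) = -\log \sum_{z^n\# :\ \langle x, y \rangle \in \nu(z^n\#)} p(z^n\# \mid \theta)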
Evaluating Likelihood
Viterbi: requires the single most likely edit sequence; Stochastic: requires a sum over all edit sequences
Both require calculation over all possible edit sequences, of which there are exponentially many (three edit operations at each step)
However, the memoryless assumption allows us to compute the likelihood efficiently
Use the forward-backward method!
Forward
Evaluation of forward probabilities $\alpha_{ij}$: likelihood of picking an edit sequence that generates the prefix pair $\langle x_1 \ldots x_i,\; y_1 \ldots y_j \rangle$
Memoryless assumption allows efficient recursive computation:
$\alpha_{0,0} = 1, \qquad \alpha_{i,j} = p(x_i, \epsilon)\,\alpha_{i-1,j} + p(\epsilon, y_j)\,\alpha_{i,j-1} + p(x_i, y_j)\,\alpha_{i-1,j-1}$
(deletion, insertion, and substitution terms, with out-of-range terms taken to be 0)
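A direct Python sketch of this recursion (my own parameterization: p_del and p_ins map single symbols, and p_sub maps symbol pairs, to edit-operation probabilities; the stop probability of the full stochastic-transducer model is left out for brevity):

    import numpy as np

    def forward(x, y, p_sub, p_del, p_ins):
        """alpha[i, j] = likelihood of an edit sequence generating the
        prefix pair (x[:i], y[:j]) under a memoryless edit model."""
        n, m = len(x), len(y)
        alpha = np.zeros((n + 1, m + 1))
        alpha[0, 0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0:
                    alpha[i, j] += p_del[x[i - 1]] * alpha[i - 1, j]                # delete x_i
                if j > 0:
                    alpha[i, j] += p_ins[y[j - 1]] * alpha[i, j - 1]                # insert y_j
                if i > 0 and j > 0:
                    alpha[i, j] += p_sub[x[i - 1], y[j - 1]] * alpha[i - 1, j - 1]  # substitute x_i -> y_j
        return alpha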
Backward
Evaluation of backward probabilities $\beta_{ij}$: likelihood of picking an edit sequence that generates the suffix pair $\langle x_{i+1} \ldots x_n,\; y_{j+1} \ldots y_m \rangle$ (n = |x|, m = |y|)
Memoryless assumption allows efficient recursive computation:
$\beta_{n,m} = 1, \qquad \beta_{i,j} = p(x_{i+1}, \epsilon)\,\beta_{i+1,j} + p(\epsilon, y_{j+1})\,\beta_{i,j+1} + p(x_{i+1}, y_{j+1})\,\beta_{i+1,j+1}$
(again with out-of-range terms taken to be 0)
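The mirror-image sketch for the backward pass, under the same assumed parameterization as the forward sketch above:

    import numpy as np

    def backward(x, y, p_sub, p_del, p_ins):
        """beta[i, j] = likelihood of an edit sequence generating the
        suffix pair (x[i:], y[j:]) under a memoryless edit model."""
        n, m = len(x), len(y)
        beta = np.zeros((n + 1, m + 1))
        beta[n, m] = 1.0
        for i in range(n, -1, -1):
            for j in range(m, -1, -1):
                if i < n:
                    beta[i, j] += p_del[x[i]] * beta[i + 1, j]            # delete x_{i+1}
                if j < m:
                    beta[i, j] += p_ins[y[j]] * beta[i, j + 1]            # insert y_{j+1}
                if i < n and j < m:
                    beta[i, j] += p_sub[x[i], y[j]] * beta[i + 1, j + 1]  # substitute x_{i+1} -> y_{j+1}
        return beta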
EM Formulation
Edit operations are selected according to a probability distribution
So, EM has to update this distribution based on the occurrence counts of each operation (similar to the coin-tossing example)
Idea: accumulate expected counts from the forward and backward variables
$\gamma(z)$: expected count of edit operation z
EM Details
$\gamma(z)$: expected count of edit operation z
e.g., for a substitution that rewrites a as b, accumulate over all positions where $x_i = a$ and $y_j = b$:
$\gamma(a, b) = \sum_{i, j:\; x_i = a,\; y_j = b} \dfrac{\alpha_{i-1,j-1}\; p(a, b)\; \beta_{i,j}}{p(x, y \mid \theta)}$
The M-step then renormalizes the expected counts into an updated edit-operation distribution
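Putting the pieces together, a sketch of the count accumulation for one string pair, reusing the forward and backward sketches from the previous slides (and again ignoring the stop probability); the M-step would then renormalize these counts into an updated edit-operation distribution:

    from collections import defaultdict

    def expected_counts(x, y, p_sub, p_del, p_ins):
        """Expected edit-operation counts gamma(z) for one string pair."""
        n, m = len(x), len(y)
        alpha = forward(x, y, p_sub, p_del, p_ins)    # forward sketch above
        beta = backward(x, y, p_sub, p_del, p_ins)    # backward sketch above
        total = alpha[n, m]                           # p(x, y) under the current model
        gamma = defaultdict(float)
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0:            # a deletion of x_i used on the way to (i, j)
                    gamma["del", x[i - 1]] += alpha[i - 1, j] * p_del[x[i - 1]] * beta[i, j] / total
                if j > 0:            # an insertion of y_j
                    gamma["ins", y[j - 1]] += alpha[i, j - 1] * p_ins[y[j - 1]] * beta[i, j] / total
                if i > 0 and j > 0:  # a substitution of x_i by y_j
                    gamma["sub", x[i - 1], y[j - 1]] += (alpha[i - 1, j - 1]
                                                         * p_sub[x[i - 1], y[j - 1]]
                                                         * beta[i, j] / total)
        return gamma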
References
A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
C. F. J. Wu, "On the Convergence Properties of the EM Algorithm," The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
F. Jelinek, Statistical Methods for Speech Recognition, 1997.
M. Collins, "The EM Algorithm," 1997.
J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report TR-97-021, U.C. Berkeley, 1998.
E. S. Ristad and P. N. Yianilos, "Learning string edit distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 1998, pp. 522-532.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, 77(2), 1989, pp. 257-286.
A. D'Souza, "Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians."
M. Mohri, "Edit-Distance of Weighted Automata," in Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
J. Glass, Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
C. Tomasi, "Estimating Gaussian Mixture Densities with EM: A Tutorial," 2004.
Wikipedia.