Latent Variable Models
Stefano Ermon
Stanford University
Lecture 5
Recap of last lecture
1 Autoregressive models:
Chain rule based factorization is fully general
Compact representation via conditional independence and/or neural
parameterizations
2 Autoregressive models Pros:
Easy to evaluate likelihoods
Easy to train
3 Autoregressive models Cons:
Requires an ordering
Generation is sequential
Cannot learn features in an unsupervised way
Plan for today
1 Latent Variable Models
Mixture models
Variational autoencoder
Variational inference and learning
Latent Variable Models: Motivation
1 Lots of variability in images x due to gender, eye color, hair color,
pose, etc. However, unless images are annotated, these factors of
variation are not explicitly available (latent).
2 Idea: explicitly model these factors using latent variables z
Latent Variable Models: Motivation
1 Only shaded variables x are observed in the data (pixel values)
2 Latent variables z correspond to high level features
If z chosen properly, p(x|z) could be much simpler than p(x)
If we had trained this model, then we could identify features via
p(z | x), e.g., p(EyeColor = Blue|x)
3 Challenge: Very difficult to specify these conditionals by hand
Deep Latent Variable Models
Use neural networks to model the conditionals (deep latent variable
models):
1 z ∼ N (0, I )
2 p(x | z) = N (µθ (z), Σθ (z)) where µθ ,Σθ are neural networks
Hope that after training, z will correspond to meaningful latent
factors of variation (features). Unsupervised representation learning.
As before, features can be computed via p(z | x)
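To make this concrete, here is a minimal NumPy sketch of the two-step generative model above, with a small randomly initialized MLP standing in for the trained networks µθ, Σθ (the layer sizes and weights are illustrative assumptions, not the lecture's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 2-dim latent z, 4-dim observation x.
z_dim, h_dim, x_dim = 2, 16, 4

# Random weights stand in for trained parameters theta.
W1, b1 = rng.normal(size=(h_dim, z_dim)), np.zeros(h_dim)
W_mu, b_mu = rng.normal(size=(x_dim, h_dim)), np.zeros(x_dim)
W_logvar, b_logvar = rng.normal(size=(x_dim, h_dim)), np.zeros(x_dim)

def decoder(z):
    """Neural networks mu_theta(z) and the diagonal of Sigma_theta(z)."""
    h = np.tanh(W1 @ z + b1)
    mu = W_mu @ h + b_mu
    var = np.exp(W_logvar @ h + b_logvar)   # positive diagonal covariance
    return mu, var

# 1. z ~ N(0, I)
z = rng.standard_normal(z_dim)
# 2. x | z ~ N(mu_theta(z), Sigma_theta(z))
mu, var = decoder(z)
x = mu + np.sqrt(var) * rng.standard_normal(x_dim)
print(z, x)
```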
Mixture of Gaussians: a Shallow Latent Variable Model
Mixture of Gaussians. Bayes net: z → x.
1 z ∼ Categorical(1, · · · , K )
2 p(x | z = k) = N (µk , Σk )
Generative process (sketched in code below)
1 Pick a mixture component k by sampling z
2 Generate a data point by sampling from that Gaussian
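A minimal sketch of this two-step sampling procedure for a 1-D mixture with K = 3 components (the mixture weights, means, and standard deviations below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (K = 3 components in 1-D).
pi = np.array([0.5, 0.3, 0.2])      # p(z = k)
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample(n):
    # 1. Pick a mixture component by sampling z ~ Categorical(pi)
    z = rng.choice(len(pi), size=n, p=pi)
    # 2. Generate a data point from that Gaussian: x | z = k ~ N(mu_k, sigma_k^2)
    x = mu[z] + sigma[z] * rng.standard_normal(n)
    return x, z

x, z = sample(5)
print(x, z)
```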
Mixture of Gaussians: a Shallow Latent Variable Model
Mixture of Gaussians:
1 z ∼ Categorical(1, · · · , K )
2 p(x | z = k) = N (µk , Σk )
Clustering: The posterior p(z | x) identifies the mixture component (see the sketch below)
Unsupervised learning: We are hoping to learn from unlabeled data
(ill-posed problem)
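The clustering step is just Bayes' rule: p(z = k | x) ∝ p(z = k) N(x; µk, Σk). A sketch reusing the illustrative 1-D mixture parameters from the sampling example above:

```python
import numpy as np
from scipy.stats import norm

# Same illustrative 1-D mixture parameters as in the sampling sketch.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def posterior(x):
    """p(z = k | x) = p(z = k) N(x; mu_k, sigma_k^2) / sum_j p(z = j) N(x; mu_j, sigma_j^2)."""
    joint = pi * norm.pdf(x, loc=mu, scale=sigma)   # p(z = k, x) for each k
    return joint / joint.sum()

print(posterior(-1.5))   # mass concentrates on component 0
print(posterior(2.5))    # mass concentrates on component 2
```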
Unsupervised learning
Shown is the posterior probability that a data point was generated by the
i-th mixture component, P(z = i|x)
Unsupervised learning
Unsupervised clustering of handwritten digits.
Mixture models
Alternative motivation: Combine simple models into a more complex
and expressive one
p(x) = ∑_z p(x, z) = ∑_z p(z) p(x | z) = ∑_{k=1}^{K} p(z = k) N(x; µk, Σk), where each N(x; µk, Σk) is one mixture component
Variational Autoencoder
A mixture of an infinite number of Gaussians:
1 z ∼ N (0, I )
2 p(x | z) = N(µθ(z), Σθ(z)) where µθ, Σθ are neural networks (see the code sketch below):
µθ(z) = σ(Az + c) = (σ(a1 z + c1), σ(a2 z + c2)) = (µ1(z), µ2(z))
Σθ(z) = diag(exp(σ(Bz + d))), i.e., a diagonal matrix with entries exp(σ(b1 z + d1)) and exp(σ(b2 z + d2))
θ = (A, B, c, d)
3 Even though p(x | z) is simple, the marginal p(x) is very
complex/flexible
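A direct NumPy transcription of the one-linear-layer parameterization in item 2 (σ is the logistic sigmoid; the matrices A, B and vectors c, d are random placeholders for learned parameters θ, with z ∈ R and x ∈ R²):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# theta = (A, B, c, d); random placeholders for learned values.
A, c = rng.normal(size=(2, 1)), rng.normal(size=2)
B, d = rng.normal(size=(2, 1)), rng.normal(size=2)

def p_x_given_z(z):
    """Return mu_theta(z) and the diagonal of Sigma_theta(z)."""
    mu = sigmoid(A @ z + c)             # (sigma(a1 z + c1), sigma(a2 z + c2))
    var = np.exp(sigmoid(B @ z + d))    # diag(exp(sigma(Bz + d)))
    return mu, var

# Sampling from the marginal p(x): z ~ N(0, I), then x | z ~ N(mu_theta(z), Sigma_theta(z)).
z = rng.standard_normal(1)
mu, var = p_x_given_z(z)
x = mu + np.sqrt(var) * rng.standard_normal(2)
print(mu, var, x)
```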
Recap
Latent Variable Models
Allow us to define complex models p(x) in terms of simpler building
blocks p(x | z)
Natural for unsupervised learning tasks (clustering, unsupervised
representation learning, etc.)
No free lunch: much more difficult to learn compared to fully observed,
autoregressive models
Marginal Likelihood
Suppose some pixel values are missing at train time (e.g., top half)
Let X denote observed random variables, and Z the unobserved ones (also
called hidden or latent)
Suppose we have a model for the joint distribution (e.g., PixelCNN)
p(X, Z; θ)
What is the probability p(X = x̄; θ) of observing a training data point x̄?
p(X = x̄; θ) = ∑_z p(X = x̄, Z = z; θ) = ∑_z p(x̄, z; θ)
Need to consider all possible ways to complete the image (fill in the missing pixels)
Variational Autoencoder Marginal Likelihood
A mixture of an infinite number of Gaussians:
1 z ∼ N (0, I )
2 p(x | z) = N(µθ(z), Σθ(z)) where µθ, Σθ are neural networks
3 Z are unobserved at train time (also called hidden or latent)
4 Suppose we have a model for the joint distribution. What is the
probability p(X = x̄; θ) of observing a training data point x̄?
p(X = x̄; θ) = ∫ p(X = x̄, Z = z; θ) dz = ∫ p(x̄, z; θ) dz
Partially observed data
Suppose that our joint distribution is
p(X, Z; θ)
We have a dataset D, where for each datapoint the X variables are observed
(e.g., pixel values) and the variables Z are never observed (e.g., cluster or
class id.). D = {x(1) , · · · , x(M) }.
Maximum likelihood learning:
log ∏_{x∈D} p(x; θ) = ∑_{x∈D} log p(x; θ) = ∑_{x∈D} log ∑_z p(x, z; θ)
Evaluating log ∑_z p(x, z; θ) can be intractable. Suppose we have 30 binary
latent features, z ∈ {0, 1}^30: evaluating ∑_z p(x, z; θ) involves a sum with
2^30 terms. For continuous variables, log ∫_z p(x, z; θ) dz is often intractable.
Gradients ∇θ also hard to compute.
Need approximations. One gradient evaluation per training data point
x ∈ D, so approximation needs to be cheap.
First attempt: Naive Monte Carlo
Likelihood function pθ (x) for Partially Observed Data is hard to compute:
pθ(x) = ∑_{all values of z} pθ(x, z) = |Z| ∑_{z∈Z} (1/|Z|) pθ(x, z) = |Z| E_{z∼Uniform(Z)}[pθ(x, z)]
We can think of it as an (intractable) expectation. Monte Carlo to the rescue:
1 Sample z(1) , · · · , z(k) uniformly at random
2 Approximate expectation with sample average
∑_z pθ(x, z) ≈ |Z| (1/k) ∑_{j=1}^{k} pθ(x, z(j))
Works in theory but not in practice. For most z, pθ(x, z) is very low (most
completions don't make sense). Some completions have large pθ(x, z), but we will
never "hit" those likely completions by uniform random sampling. We need a clever
way to select the z(j) to reduce the variance of the estimator.
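A sketch of the naive estimator on a toy problem small enough that the exact sum is available for comparison; the joint pθ(x, z) below is an arbitrary made-up nonnegative function of z ∈ {0, 1}^10 (for a fixed x), used only to exercise the estimator:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

n_latent = 10                                   # z in {0,1}^10, so |Z| = 1024
w = rng.normal(size=n_latent)

def p_joint(z):
    # Made-up stand-in for p_theta(x, z) at a fixed x: peaked on a few "good" completions.
    return np.exp(5.0 * z @ w) / 1e6

Z = np.array(list(product([0, 1], repeat=n_latent)))
exact = sum(p_joint(z) for z in Z)              # sum_z p_theta(x, z)

# Naive Monte Carlo: z^(1..k) ~ Uniform(Z), estimate |Z| * mean_j p_theta(x, z^(j)).
k = 100
samples = Z[rng.integers(len(Z), size=k)]
estimate = len(Z) * np.mean([p_joint(z) for z in samples])

print(exact, estimate)   # high variance: the estimate is usually far from the exact sum
```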
Second attempt: Importance Sampling
Likelihood function pθ (x) for Partially Observed Data is hard to compute:
pθ(x) = ∑_{all possible values of z} pθ(x, z) = ∑_{z∈Z} (q(z)/q(z)) pθ(x, z) = E_{z∼q(z)}[pθ(x, z)/q(z)]
Monte Carlo to the rescue:
1 Sample z(1) , · · · , z(k) from q(z)
2 Approximate expectation with sample average
pθ(x) ≈ (1/k) ∑_{j=1}^{k} pθ(x, z(j))/q(z(j))
What is a good choice for q(z)? Intuitively, we should frequently sample
completions z that are likely given x under pθ(x, z).
3 This is an unbiased estimator of pθ (x)
E_{z(j)∼q(z)}[ (1/k) ∑_{j=1}^{k} pθ(x, z(j))/q(z(j)) ] = pθ(x)
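Continuing the toy setup from the naive Monte Carlo sketch, here is an importance-sampling version with a factorized Bernoulli proposal q(z) tilted toward likely completions (a crude, hand-made approximation of the posterior; all specifics are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n_latent = 10
w = rng.normal(size=n_latent)

def p_joint(z):
    # Same made-up stand-in for p_theta(x, z) as in the naive Monte Carlo sketch.
    return np.exp(5.0 * z @ w) / 1e6

# Proposal q(z) = prod_i Bernoulli(z_i; q_i), tilted toward coordinates with large w_i.
q_probs = 1.0 / (1.0 + np.exp(-2.0 * w))

k = 100
zs = (rng.random((k, n_latent)) < q_probs).astype(float)             # z^(j) ~ q
q_z = np.prod(np.where(zs == 1, q_probs, 1.0 - q_probs), axis=-1)    # q(z^(j))
weights = np.array([p_joint(z) for z in zs]) / q_z                   # p_theta(x, z^(j)) / q(z^(j))

# Unbiased estimate of sum_z p_theta(x, z); typically much closer to the exact
# value than the uniform-sampling estimate with the same number of samples.
print(weights.mean())
```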
Estimating log-likelihoods
Likelihood function pθ (x) for Partially Observed Data is hard to compute:
pθ(x) = ∑_{all possible values of z} pθ(x, z) = ∑_{z∈Z} (q(z)/q(z)) pθ(x, z) = E_{z∼q(z)}[pθ(x, z)/q(z)]
Monte Carlo to the rescue:
1 Sample z(1) , · · · , z(k) from q(z)
2 Approximate expectation with sample average (unbiased estimator):
pθ(x) ≈ (1/k) ∑_{j=1}^{k} pθ(x, z(j))/q(z(j))
Recall that for training, we need the log-likelihood log (pθ (x)). We could estimate
it as:
log(pθ(x)) ≈ log( (1/k) ∑_{j=1}^{k} pθ(x, z(j))/q(z(j)) ), which with a single sample (k = 1) becomes log( pθ(x, z(1))/q(z(1)) )
However, it's clear that E_{z(1)∼q(z)}[ log( pθ(x, z(1))/q(z(1)) ) ] ≠ log E_{z(1)∼q(z)}[ pθ(x, z(1))/q(z(1)) ], so this is a biased estimator of the log-likelihood.
Evidence Lower Bound
Log-Likelihood function for Partially Observed Data is hard to compute:
log( ∑_{z∈Z} pθ(x, z) ) = log( ∑_{z∈Z} (q(z)/q(z)) pθ(x, z) ) = log( E_{z∼q(z)}[pθ(x, z)/q(z)] )
log() is a concave function. log(px + (1 − p)x ′ ) ≥ p log(x) + (1 − p) log(x ′ ).
Idea: use Jensen's inequality (for concave functions):
log( E_{z∼q(z)}[f(z)] ) = log( ∑_z q(z) f(z) ) ≥ ∑_z q(z) log f(z)
Evidence Lower Bound
Log-Likelihood function for Partially Observed Data is hard to compute:
log( ∑_{z∈Z} pθ(x, z) ) = log( ∑_{z∈Z} (q(z)/q(z)) pθ(x, z) ) = log( E_{z∼q(z)}[pθ(x, z)/q(z)] )
log() is a concave function. log(px + (1 − p)x ′ ) ≥ p log(x) + (1 − p) log(x ′ ).
Idea: use Jensen's inequality (for concave functions):
log( E_{z∼q(z)}[f(z)] ) = log( ∑_z q(z) f(z) ) ≥ ∑_z q(z) log f(z) = E_{z∼q(z)}[log f(z)]
Choosing f(z) = pθ(x, z)/q(z):
log( E_{z∼q(z)}[pθ(x, z)/q(z)] ) ≥ E_{z∼q(z)}[ log( pθ(x, z)/q(z) ) ]
Called Evidence Lower Bound (ELBO).
Variational inference
Suppose q(z) is any probability distribution over the hidden variables
Evidence lower bound (ELBO) holds for any q
log p(x; θ) ≥ ∑_z q(z) log( pθ(x, z)/q(z) )
= ∑_z q(z) log pθ(x, z) − ∑_z q(z) log q(z)
= ∑_z q(z) log pθ(x, z) + H(q), where H(q) = −∑_z q(z) log q(z) is the entropy of q
Equality holds if q = p(z|x; θ)
log p(x; θ) = ∑_z q(z) log p(z, x; θ) + H(q)
(Aside: This is what we compute in the E-step of the EM algorithm)
Why is the bound tight?
We derived this lower bound, which holds for any choice of q(z):
log p(x; θ) ≥ ∑_z q(z) log( p(x, z; θ)/q(z) )
If q(z) = p(z|x; θ) the bound becomes:
∑_z p(z|x; θ) log( p(x, z; θ)/p(z|x; θ) ) = ∑_z p(z|x; θ) log( p(z|x; θ) p(x; θ)/p(z|x; θ) )
= ∑_z p(z|x; θ) log p(x; θ)
= log p(x; θ) ∑_z p(z|x; θ)
= log p(x; θ), since ∑_z p(z|x; θ) = 1
Confirms our previous importance sampling intuition: we should
choose likely completions.
What if the posterior p(z|x; θ) is intractable to compute? How loose
is the bound?
Variational inference continued
Suppose q(z) is any probability distribution over the hidden variables.
A little bit of algebra reveals
DKL(q(z) ∥ p(z|x; θ)) = −∑_z q(z) log p(z, x; θ) + log p(x; θ) − H(q) ≥ 0
Rearranging, we re-derived the Evidence lower bound (ELBO)
log p(x; θ) ≥ ∑_z q(z) log p(z, x; θ) + H(q)
Equality holds if q = p(z|x; θ) because DKL(q(z) ∥ p(z|x; θ)) = 0:
log p(x; θ) = ∑_z q(z) log p(z, x; θ) + H(q)
In general, log p(x; θ) = ELBO + DKL(q(z) ∥ p(z|x; θ)). The closer
q(z) is to p(z|x; θ), the closer the ELBO is to the true log-likelihood.
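A small numeric check of this decomposition on a toy discrete model (3 latent values, 5 observable values; the probability tables are made up for illustration), confirming both that ELBO + DKL = log p(x; θ) for an arbitrary q and that the bound is tight when q equals the true posterior:

```python
import numpy as np

# Toy discrete model: z in {0,1,2}, x in {0,...,4}; row k of p_x_given_z is p(x | z = k).
p_z = np.array([0.5, 0.3, 0.2])
p_x_given_z = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                        [0.10, 0.10, 0.60, 0.10, 0.10],
                        [0.05, 0.05, 0.10, 0.20, 0.60]])

x = 2                                      # an observed data point
joint = p_z * p_x_given_z[:, x]            # p(x, z) for each z
log_px = np.log(joint.sum())               # log p(x; theta)
posterior = joint / joint.sum()            # p(z | x; theta)

def elbo(q):
    # sum_z q(z) log p(x, z; theta) + H(q)
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.2, 0.5, 0.3])              # an arbitrary variational distribution
print(log_px, elbo(q) + kl(q, posterior))  # identical: log p(x) = ELBO + KL
print(log_px, elbo(posterior))             # identical: the bound is tight at q = p(z|x)
```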
The Evidence Lower bound
What if the posterior p(z|x; θ) is intractable to compute?
Suppose q(z; ϕ) is a (tractable) probability distribution over the hidden
variables parameterized by ϕ (variational parameters)
For example, a Gaussian with mean and covariance specified by ϕ
q(z; ϕ) = N (ϕ1 , ϕ2 )
Variational inference: pick ϕ so that q(z; ϕ) is as close as possible to
p(z|x; θ). In the figure, the posterior p(z|x; θ) (blue) is better approximated
by N (2, 2) (orange) than N (−4, 0.75) (green)
A variational approximation to the posterior
Assume p(xtop , xbottom ; θ) assigns high probability to images that look like
digits. In this example, we assume z = xtop are unobserved (latent)
Suppose q(xtop ; ϕ) is a (tractable) probability distribution over the hidden
variables (missing pixels in this example) xtop parameterized by ϕ
(variational parameters)
q(xtop; ϕ) = ∏_i (ϕi)^{x_i^top} (1 − ϕi)^{(1 − x_i^top)}, where the product ranges over the unobserved variables (missing pixels) x_i^top
Is ϕi = 0.5 ∀i a good approximation to the posterior p(xtop |xbottom ; θ)? No
Is ϕi = 1 ∀i a good approximation to the posterior p(xtop |xbottom ; θ)? No
Is ϕi ≈ 1 for pixels i corresponding to the top part of digit 9 a good
approximation? Yes
The Evidence Lower bound
log p(x; θ) ≥ ∑_z q(z; ϕ) log p(z, x; θ) + H(q(z; ϕ)) = L(x; θ, ϕ)   (the ELBO)
log p(x; θ) = L(x; θ, ϕ) + DKL(q(z; ϕ) ∥ p(z|x; θ))
The better q(z; ϕ) approximates the posterior p(z|x; θ), the smaller
DKL(q(z; ϕ) ∥ p(z|x; θ)) becomes, and the closer the ELBO gets to
log p(x; θ). Next: jointly optimize over θ and ϕ to maximize the ELBO
over a dataset
Summary
Latent Variable Models Pros:
Easy to build flexible models
Suitable for unsupervised learning
Latent Variable Models Cons:
Hard to evaluate likelihoods
Hard to train via maximum-likelihood
Fundamentally, the challenge is that posterior inference p(z | x) is hard.
Typically requires variational approximations
Alternative: give up on KL-divergence and likelihood (GANs)