Rishabh Indoria

21F3001823 Generative AI July 23, 2025

https://github.com/Chandan-IISc/IITM_GenAI/tree/main

1 Introduction to Deep Generative Modeling


1.1 Broad Recipe
• Data: D = {x1 , x2 , . . . , xn } ∼ iid Px , where Px is unknown.
xi ∈ Rd , where d is the dimensionality of the data.
• Since all the data points are drawn from this unknown distribution, we say X is a random variable with distribution Px .
• Each xi is an instance of a vector-valued random variable of dimension d.


• Goal: Estimate Px and learn to sample from it.
• General Principle of Generative Models
1. Assume a parametric family for Px , denoted by Pθ . Pθ is represented using a deep neural network (the model).
2. Define and estimate a divergence (distance) metric between Pθ and Px .
3. Solve an optimization problem over the parameters of Pθ to minimize the above divergence metric.
• Example: Suppose there is a RV z ∈ Rk with some arbitrary but known distribution, e.g., z ∼ N (0, I).
Suppose gθ (z) : Z → X . Then x̂ = gθ (z) has a different distribution than that of z, and the distribution of x̂ depends on the function gθ (·).
Suppose gθ (z) is a neural network. Denote the density of x̂ = gθ (z) by Pθ (x̂).
Let D(Px ∥Pθ ) denote a divergence measure between Px and Pθ , with D ≥ 0 and D = 0 iff Px = Pθ .

θ∗ = arg minθ D(Px ∥Pθ )

Upon solving the above optimization problem, the distribution Px is implicitly estimated by gθ (z), and one can sample from it using gθ (z):
a sample z ∼ N (0, I), passed through gθ (z), produces a sample from Pθ∗ (x̂) which is close to Px ; hence we end up (approximately) sampling from Px . A minimal sketch of this sampling pipeline follows.
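A minimal sketch, assuming a small PyTorch MLP generator and made-up dimensions (k = 64, d = 784); this is illustrative only, not the course's reference implementation:

```python
import torch
import torch.nn as nn

k, d = 64, 784                                   # latent and data dimensionality (assumed)
g_theta = nn.Sequential(                         # g_theta : Z -> X
    nn.Linear(k, 256), nn.ReLU(),
    nn.Linear(256, d),
)

z = torch.randn(16, k)                           # z ~ N(0, I), a batch of 16 latent samples
x_hat = g_theta(z)                               # x_hat = g_theta(z) ~ P_theta, shape (16, d)
```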
• How to compute the divergence metric without knowing Px and Pθ ?
• What should be the choice of the divergence metric?
• How to choose gθ (z), and in turn Pθ ?
• How to solve the optimization problem of minimizing the divergence metric?

1.2 Variational Divergence Minimization


• f-Divergence: Given two probability distributions with corresponding density functions Px and Pθ , the f-divergence between them is defined as

Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx

where f (u) : R+ → R is convex, lower semi-continuous, and satisfies f (1) = 0, and x ranges over the space on which Px and Pθ are supported.
• Properties of f-divergence

1. Df (·) ≥ 0 for any choice of f (·)


2. Df (Px ∥Pθ ) = 0 iff Px = Pθ
• Examples of f-divergence

1. f (u) = u log u: KL-divergence

   Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx
               = ∫x Pθ (x) (Px (x)/Pθ (x)) log(Px (x)/Pθ (x)) dx
               = ∫x Px (x) log(Px (x)/Pθ (x)) dx = DKL (Px ∥Pθ )

   forward KL = DKL (Px ∥Pθ ) ̸= DKL (Pθ ∥Px ) = reverse KL
2. f (u) = (1/2) [u log u − (u + 1) log((u + 1)/2)]: JS-divergence
3. f (u) = (1/2) |u − 1|: Total variation distance

• Objective: Algorithm to minimize Df between Px and Pθ , without knowing both of them, but having
samples from both.
• Key Idea: Integrals involving density functions can be approximated using samples drawn from the
distribution
Suppose we want to compute the following integral

I = ∫x h(x) Px (x) dx = EPx [h(x)]

where h(x) is a function and Px (x) is the density function. We have samples drawn iid from Px : x1 , x2 , . . . , xn ∼ iid Px .

• Law of Large Numbers: the sample mean converges to the actual expectation,

(1/n) Σi=1..n h(xi ) → EPx [h(x)] as n → ∞,

where xi ∼ iid Px . A numerical sketch of this Monte Carlo estimate is given below.
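For instance (an illustrative choice, not from the notes): with h(x) = x² and x ∼ N (0, 1), the true expectation is 1, and the sample mean approaches it as n grows.

```python
# Monte Carlo estimate of an expectation from iid samples (illustrative example).
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    return x ** 2                       # E[h(x)] = 1 when x ~ N(0, 1)

for n in (10, 1_000, 100_000):
    samples = rng.standard_normal(n)    # x_1, ..., x_n ~ iid P_x = N(0, 1)
    print(n, h(samples).mean())         # sample mean approaches E_Px[h(x)] = 1
```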
• If the f-divergence can be expressed in terms of expectations of functions w.r.t Px and Pθ , then one can
compute and optimize them.
• Conjugate function of a convex function: If f (u) is a convex function, then there exists a conjugate function f ∗ (t) defined as

f ∗ (t) = supu∈dom f {ut − f (u)}

• Properties of the Conjugate
1. f ∗ (t) is also convex.
2. [f ∗ (t)]∗ = f (u), i.e., the conjugate of the conjugate gives back the original function:

f (u) = supt∈dom f ∗ {tu − f ∗ (t)}
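As a concrete check (worked out here, not stated in the notes), the conjugate of the KL generator f (u) = u log u has a closed form:

```latex
f^{*}(t) = \sup_{u > 0} \{\, u t - u \log u \,\}
% stationarity: \frac{d}{du}(ut - u\log u) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{t-1}
f^{*}(t) = e^{t-1} t - e^{t-1}(t - 1) = e^{t-1}
```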

• Expressing Df in terms of expectations over Px and Pθ :

Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx
            = ∫x Pθ (x) supt {t (Px (x)/Pθ (x)) − f ∗ (t)} dx
            ≥ supT∈T ∫x Pθ (x) {T (x) (Px (x)/Pθ (x)) − f ∗ (T (x))} dx
            = supT∈T [∫x Px (x) T (x) dx − ∫x Pθ (x) f ∗ (T (x)) dx]
            = supT∈T [EPx [T (x)] − EPθ [f ∗ (T (x))]]

where T : X → dom f ∗ is a space of functions. The inequality arises because the space of functions T that we optimize over may not contain the optimal T ∗ (x) that solves the inner (pointwise) optimization problem. Hence

Df (Px ∥Pθ ) ≥ supT∈T [EPx [T (x)] − EPθ [f ∗ (T (x))]]

• The best we can do is work with this lower bound: maximize it over T and minimize it over θ, rather than minimizing Df itself:

θ∗ = arg minθ [maxT (EPx [T (x)] − EPθ [f ∗ (T (x))])]

• Optimizing over a space of functions cannot be done analytically.
• In practice, the space of functions T is represented by a neural network Tw (x), where w are the parameters of that network.
• With this, the objective becomes

θ∗ , w∗ = arg minθ maxw (EPx [Tw (x)] − EPθ [f ∗ (Tw (x))]) = arg minθ maxw J(θ, w)

• These alternating minimization and maximization problems are called Saddle Point Optimization.
• Solving these kinds of problems is difficult.
• Any saddle point optimization of this form is also called an Adversarial Problem.
• The θ network is called the Generator Network and the w network is called the Critic/Discriminator Network.
• We have to make sure that the T network maps X → dom f ∗ . In practice, we take Tw (x) = σf (Vw (x)), where σf is an f-divergence-specific activation and Vw (x) : X → R.
• Hence, the final optimization problem becomes

J(θ, w) = EPx [σf (Vw (x))] − EPθ [f ∗ (σf (Vw (x)))]

2 Generative Adversarial Networks


2.1 Introduction and Formulation
• For GANs, the f-divergence is as follows:
f (u) = u log u − (u + 1) log(u + 1) (similar to JSD)
f ∗ (t) = − log(1 − exp(t)), dom f ∗ = R−
σf (v) = − log(1 + e−v )

• We can now write J using the above:

J(θ, w) = EPx [σf (Vw (x))] − EPθ [f ∗ (σf (Vw (x)))]
JGAN (θ, w) = EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]

where Dw (x) = 1/(1 + e−Vw (x) ) is the sigmoid of Vw (x).

• Consider the input D = {x1 , x2 , x3 , . . . , xn } ∼ iid Px .

One gradient ascent step through the Discriminator, treating the Generator as constant:

w∗ = arg maxw EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]
   ≈ arg maxw [(1/B1 ) Σi=1..B1 log(Dw (xi )) + (1/B2 ) Σj=1..B2 log(1 − Dw (x̂j ))]

wt+1 ← wt + α1 ∇w JGAN (θ, w)

One gradient descent step through the Generator, treating the Discriminator as constant:

θ∗ = arg minθ EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]
   ≈ arg minθ [(1/B1 ) Σi=1..B1 log(Dw (xi )) + (1/B2 ) Σj=1..B2 log(1 − Dw (x̂j ))]
   ≈ arg minθ (1/B2 ) Σj=1..B2 log(1 − Dw (gθ (zj ))) (because the first term does not depend on θ)

θt+1 ← θt − α2 ∇θ JGAN (θ, w)

• To train the Discriminator: keep θ constant. From D = {x1 , x2 , . . . , xn }, pass a batch B = {x1 , x2 , . . . , xB1 } ⊂ D through the Discriminator network Dw .
Sample z1 , z2 , . . . , zB2 ∼ N (0, I), pass them through the Generator network gθ (·) with fixed θ, and then pass the resulting x̂ through the Discriminator network.
Once we have these values, compute JGAN and perform one step of gradient ascent.
• To train the Generator: keep w constant. Sample z1 , z2 , . . . , zB2 ∼ N (0, I), pass them through the Generator network gθ (·), and then pass the resulting x̂ through the Discriminator network.
Once we have these values, compute JGAN and perform one step of gradient descent.
• Typically, training alternates between the generator and the discriminator; a minimal sketch of this alternating loop is given below.
• Typically, there is no well-defined stopping criterion; training is stopped based on the quality of the outputs.
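The sketch below is an illustrative PyTorch-style rendering of the alternating updates above; the MLP architectures, dimensions, optimizers, and learning rates are assumptions, not the course's reference code.

```python
import torch
import torch.nn as nn

k, d = 64, 784                                                              # assumed dimensions
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))          # generator g_theta
V = nn.Sequential(nn.Linear(d, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # V_w, so D_w = sigmoid(V_w)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(V.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()                  # -[y log D + (1-y) log(1-D)] evaluated on logits V_w(x)

def gan_step(x_real):                         # x_real: (B, d) minibatch from P_x
    B = x_real.size(0)
    # Discriminator: one ascent step on J_GAN (equivalently, descend the BCE loss), theta fixed.
    x_fake = G(torch.randn(B, k)).detach()    # treat the generator as a constant
    loss_d = bce(V(x_real), torch.ones(B, 1)) + bce(V(x_fake), torch.zeros(B, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: one descent step on E[log(1 - D(g(z)))], w fixed.
    d_fake = torch.sigmoid(V(G(torch.randn(B, k))))
    loss_g = torch.log(1.0 - d_fake + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

In practice the "non-saturating" generator loss −E[log Dw (gθ (z))] is often minimized instead of the log(1 − Dw ) form above, since it gives stronger gradients early in training.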

2.2 Classifier-Guided Generative Sampler


• Suppose there is a binary classifier

Dw (x) = 1 if x ∼ Px , and Dw (x) = 0 if x ∼ Pθ

• Can Dw (x) be used for making Pθ and Px closer?


• Tweak θ (parameters of gθ ) till the classifier ’fails’ to distinguish between the samples of Px and Pθ
• However, failure of the classifier need not imply Px = Pθ
• Hence, the classifier has to be simultaneously tweaked along with the generator
• Suppose Dw : X → [0, 1] represents the likelihood of the sample x coming from Px . The objective is

w∗ = arg maxw EPx [log(Dw (x))]   (maximize the likelihood of x ∼ Px )

The classifier should also maximize the likelihood of x̂ not coming from Px when x̂ ∼ Pθ :

w∗ = arg maxw EPθ [log(1 − Dw (x̂))]   (maximize the likelihood of x̂ ∼ Pθ )

The combined objective for classifier training is

w∗ = arg maxw EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))] = arg maxw J(θ, w)

This is exactly the lower bound we constructed previously.


• The objective for the generator gθ (z) is that the classifier has to ’fail’; invert the optimization of the classifier:

θ∗ = arg minθ J(θ, w)

• Overall, we have the following adversarial optimization:

θ∗ , w∗ = arg minθ maxw [J(θ, w)]

• Note: This classifier interpretation does not generalize across different f -divergences.


• Deep Convolutional GAN (DCGAN)
1. Typically in a GAN, z ∈ Rk and x ∈ Rd with k ≪ d, so the generator progressively increases the size of its input.
2. Up-convolutional (transpose convolutional) layers are used in the generator.
3. Typically used when the data consists of images.
• Conditional GAN (C-GAN)
1. Once the generator is trained, there is no way to control what kind of images the generator will
generate.
2. Data: D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ∼ iid Pxy
3. Objective: Sample from the conditional distribution Px|y (instead of Px )
4. Solution: Estimate Px|y and make Pθ approach Px|y
5. Modify the generator and discriminator to operate on the samples of the conditional distribution.
6. Simply pass an additional input y to the generator.
7. Similarly, the discriminator also takes an additional input y.
8. The objective function becomes the following

J(θ, w) = EPx [log(Dw (x, y))] + EPθ [log(1 − Dw (x̂, y))]
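To make the conditioning in items 6–8 concrete, here is an illustrative sketch (the architectures, dimensions, and use of a learned label embedding are assumptions) of passing the extra input y to both networks:

```python
import torch
import torch.nn as nn

k, d, n_classes, emb = 64, 784, 10, 32            # assumed sizes

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)
        self.net = nn.Sequential(nn.Linear(k + emb, 256), nn.ReLU(), nn.Linear(256, d))
    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))   # x_hat = g_theta(z, y)

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)
        self.net = nn.Sequential(nn.Linear(d + emb, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))   # logit of D_w(x, y)

# Usage: z = torch.randn(8, k); y = torch.randint(0, n_classes, (8,))
# x_hat = CondGenerator()(z, y); score = CondDiscriminator()(x_hat, y)
```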

2.3 Improvisations on Adversarial Learning


• f -divergence minimization leads to unstable training.
• Manifold Hypothesis: Distributions of real data (such as images) lie on a lower-dimensional manifold in the ambient space.
• A manifold can be roughly seen as a lower-dimensional subspace.
• Recall that Px and Pθ are distributions over Rd . Since the real and the generated data lie on lower-dimensional manifolds, the supports of Px and Pθ will, with very high probability, not be aligned.
• It can be shown that a perfect discriminator can be learned when the supports of Px and Pθ are not aligned.
• This implies that GAN training saturates.
• This is also the reason the generator is usually given more training steps than the discriminator.
• If we can find a Dw with 100% accuracy, then Df (Px ∥Pθ ) becomes independent of the generator parameter θ.
• Solution: Use a ’softer’ divergence metric that does not saturate when the supports (manifolds) of Px and Pθ do not align.

• Wasserstein Metric (Optimal Transport): Given two distributions Px and Px̂ ,

W (Px ∥Px̂ ) = minλ∈Π(x,x̂) E(x,x̂)∼λ ∥x − x̂∥2

λ: a joint distribution between Px and Px̂
Π(x, x̂): all possible joint distributions whose marginals are Px and Px̂ , i.e.,

∫x λ(x, x̂) dx = Px̂ ,   ∫x̂ λ(x, x̂) dx̂ = Px

• Suppose Px and Px̂ are two 1-D discrete pmfs. The mass in Px can be redistributed such that Px transforms into Px̂ .
• A redistribution scheme can be represented as a joint distribution between Px and Px̂ .
• Every redistribution scheme is a joint distribution and is called a "transport plan".
• We need a way to quantify the effort involved in the seemingly infinite set of transport plans.

• Suppose some mass is moved from x to x̂:

∥x − x̂∥ : distance of the movement
Π(x, x̂) : amount of mass that was moved
Π(x, x̂) ∥x − x̂∥ : "work done" in moving the mass from x to x̂

Average work done in a transport plan:

∫x,x̂ Π(x, x̂) ∥x − x̂∥ dx dx̂ = EΠ(x,x̂) ∥x − x̂∥

• Given that multiple transport plans (joint distributions) exist between x and x̂, which of them corresponds to the least amount of work?

minλ∈Π(x,x̂) Eλ(x,x̂) ∥x − x̂∥ = W (Px ∥Px̂ )

where Π is the family of joint distributions such that

∫x̂ Π(x, x̂) dx̂ = Px ,   ∫x Π(x, x̂) dx = Px̂

These two marginal conditions make sure that Px is transformed into Px̂ . A small numerical example follows below.
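As a quick numerical illustration (the support points and weights are made up, and SciPy is assumed to be available), the first-order Wasserstein distance between two 1-D discrete pmfs can be checked directly:

```python
# 1-D optimal transport cost between two discrete pmfs (illustrative check).
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.5, 0.5, 0.0, 0.0])   # P_x     : mass at 0 and 1
q = np.array([0.0, 0.0, 0.5, 0.5])   # P_x_hat : the same mass shifted to 2 and 3

# Every unit of mass must travel a distance of 2, so the minimal "work" is 2.
print(wasserstein_distance(support, support, u_weights=p, v_weights=q))  # -> 2.0
```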


• The closer Px and Px̂ are, the smaller W (Px ∥Px̂ ) will be.
• Fact: Unlike f -divergences, the Wasserstein metric does not saturate when the supports of Px and Pθ do not align.
• In generative models, this can be used as θ∗ = arg minθ W (Px ∥Pθ ). How to minimize W ?
• Kantorovich–Rubinstein Duality: The Wasserstein distance between two distributions is given by

W (Px ∥Pθ ) = max∥Tw ∥L ≤1 [Ex∼Px [Tw (x)] − Ex̂∼Pθ [Tw (x̂)]]

∥Tw ∥L ≤ 1: Tw is 1-Lipschitz, i.e., ∥Tw (x1 ) − Tw (x2 )∥ / ∥x1 − x2 ∥ ≤ 1

θ∗ , w∗ = arg minθ max∥Tw ∥L ≤1 [Ex∼Px [Tw (x)] − Ex̂∼Pθ [Tw (x̂)]]

• The method of minimizing the Wasserstein metric is called W-GAN.
• Making a neural network a Lipschitz function is an area of research.
• Practically, one normalizes the weights of Tw such that ∥w∥2 = 1 after each gradient step; this keeps Tw close to 1-Lipschitz (see the sketch below).
• Conclusion: Training a WGAN is more stable than training a naive GAN.
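A simplified, illustrative sketch of one such update (architectures, optimizer, and the exact normalization are assumptions; weight clipping or gradient penalties are common alternatives for the Lipschitz constraint):

```python
import torch
import torch.nn as nn

k, d = 64, 784
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))   # generator g_theta
T = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))   # critic T_w
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_t = torch.optim.RMSprop(T.parameters(), lr=5e-5)

def wgan_step(x_real):
    B = x_real.size(0)
    # Critic: maximize E_Px[T_w(x)] - E_Ptheta[T_w(x_hat)]  (i.e., minimize its negative).
    x_fake = G(torch.randn(B, k)).detach()
    loss_t = -(T(x_real).mean() - T(x_fake).mean())
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
    with torch.no_grad():                 # crude Lipschitz control: renormalize weight matrices
        for p in T.parameters():
            if p.dim() > 1:
                p.div_(p.norm() + 1e-8)
    # Generator: minimize the estimated Wasserstein distance w.r.t. theta.
    loss_g = -T(G(torch.randn(B, k))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```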

2.4 Applications of GAN


• A trained GAN is a function from Z → X .
• Suppose one is interested in inversion of the above function, i.e., given xi ∼ Px , the goal is to find the
corresponding zi
• Inversion is useful for Feature Extraction, if the GAN is trained well, then it has implicitly learned the
distribution or meaningful features. Given a dataset, obtain GAN-inverted vectors and use them as features
for the data.
• Inversion is useful for Data Manipulation/Editing, suppose xi ∼ Px needs to be edited. First, get
zi : xi = gθ∗ (zi ) via inversion, then perform edit zedit = fedit (z), and finally get the output image
xedit = gθ∗ (zedit )
• How to modify a GAN such that the inversion is possible?
• Bi-directional GAN (BiGAN): In addition to the generator and discriminator, there is another function,
called the Encoder or the Invertor network.

Eϕ : X → Z

• In BiGAN, the discriminator Dw is designed to classify between the data tuples of the form (x, Eϕ (x)) and
(gθ (z), z).

LBiGAN (θ, w, ϕ) = Ex∼Px [Eẑ∼Pϕ (z) [log(Dw (x, Eϕ (x)))]] + Ez∼Pz [Ex̂∼Pθ (x̂) [log(1 − Dw (gθ (z), z))]]

• Once a BiGAN is trained, gθ∗ (z) is used for generation and Eϕ∗ (x) is used for inversion.
• It can be shown that the optimum of LBiGAN is achieved when the two joint distributions match,

Pẑx = Pzx̂ , where Pẑx (x, ẑ) = Px (x) Pϕ (ẑ|x) and Pzx̂ (x̂, z) = Pz (z) Pθ (x̂|z)

• GAN Inversion via Latent Regression: Add a simple encoder network Eϕ (x̂) = ẑ and penalize ∥z − ẑ∥22 :

LLat-reg (θ, w, ϕ) = [Ex∼Px log(Dw (x)) + Ex̂∼Pθ log(1 − Dw (x̂))] + λ Ex̂∼Pθ [∥z − Eϕ (x̂)∥22 ]

where λ is a hyperparameter.
• However, it has been found that modifying the discriminator network and solving for the joint distribution
leads to a better inversion quality.
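A bare-bones sketch (shapes and architectures assumed) of how the latent-regression term above is computed alongside the usual GAN loss:

```python
import torch
import torch.nn as nn

k, d, lam = 64, 784, 1.0
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))   # generator g_theta
E = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))   # encoder / invertor E_phi

z = torch.randn(16, k)
x_hat = G(z)
# lambda * E_{x_hat ~ P_theta} ||z - E_phi(x_hat)||_2^2
latent_reg = lam * ((z - E(x_hat)) ** 2).sum(dim=1).mean()
# total objective = J_GAN(theta, w) + latent_reg, optimized jointly over theta, w, phi
```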

2.5 Adversarial Learning for Domain Shift


• Suppose we have a dataset Ds from a source distribution Ps :

Ds = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ∼ iid Ps

• The task is to come up with a classifier for the source distribution.
• At test time, data comes from a target distribution that is different from the source, DT ̸= Ds .
• Classifiers/regressors trained on Ds would fail on DT .
• A standard benchmark for this setting is the PACS dataset.
• Unsupervised Domain Adaptation: We are given Ds sampled from Ps and a target dataset DT sampled from PT ; we have labels for Ds but not for DT .
• Objective: Given Ds and DT , learn features/a classifier that performs well on both Ps and PT .
• Domain Adversarial Networks (DAN): A feature network ϕ(x) takes both source and target samples as input and outputs features fs and fT . A discriminator Dw (f ) takes fs or fT as input and outputs a value in [0, 1]. This is just a GAN that matches Pfs and PfT :

ϕ∗ , w∗ = arg minϕ maxw EPfs [log Dw (fs )] + EPfT [log(1 − Dw (fT ))]

• We are trying to learn the ϕ network such that the feature vectors obtained for the source and target distributions have the same distribution.
• To ensure that the feature vectors are useful for the classification task, we also learn a classifier simultaneously.
• This classifier hψ only takes inputs from the source distribution and predicts source labels:

ϕ∗ , ψ ∗ = arg minϕ,ψ BCE(y, hψ (fs ))

• ϕ therefore has two objectives: the adversarial objective and the classification objective.
• During inference, the same classifier hψ∗ can be used on both the source and target distributions, as long as the features from the ϕ∗ network are used as its input.

3 Generative Modeling via Variational Auto-Encoding
3.1 Latent Variable Models
• Suppose we have data D = {x1 , x2 , . . . , xn } ∼ iid Px
• Suppose Pθ denotes our parametric model
• In a latent variable model, Pθ is defined as

Pθ (x) = Σz Pθ (x, z) or Pθ (x) = ∫z Pθ (x, z) dz

depending on whether z is discrete or continuous

• z: latent/hidden/unobserved random variable
• Typically, z is jointly estimated along with the model parameters θ
• For each xi ∈ D, we assume that there exists a corresponding zi
• Suppose z is discrete, z ∈ {1, 2, . . . , N }, and x ∈ Rd .
For each xi ∈ D, zi |xi represents the latent variable corresponding to xi . Since zi is discrete, each xi is clustered into one of N categories.
• Gaussian Mixture Models and K-means clustering are both examples of discrete latent variable models.
• Suppose z is continuous, i.e., z ∈ Rk and x ∈ Rd with k ≪ d (as in an auto-encoder).
Here, zi |xi represents a feature vector corresponding to the given xi .
• Latent variable models can also be used as generative models.

3.2 General Principle for Learning Latent Variable Models


• Suppose we are given data D = {xi }ni=1 ∼ iid Px
• Let Pθ (x) = ∫z Pθ (x, z) dz be the latent variable model
• Goal: Estimate the model parameters θ, given D

θ∗ = arg minθ [DKL (Px ∥Pθ )]
   = arg minθ ∫x Px (x) log(Px (x)/Pθ (x)) dx
   = arg minθ [∫x Px (x) log Px (x) dx − ∫x Px (x) log Pθ (x) dx]
   = arg minθ [− ∫x Px (x) log Pθ (x) dx]
   = arg maxθ [EPx log Pθ (x)]

log Pθ (x) is known as the log-likelihood function of Pθ (x); hence this procedure is also known as Maximum Likelihood Estimation.

• Jensen’s Inequality: log E(·) ≥ E(log(·))

• Let us denote log Pθ (x) by L(θ), and let q(z|x) denote a density function over the latent variable z.

L(θ) = log Pθ (x)
     = log ∫z Pθ (x, z) dz
     = log ∫z Pθ (x, z) (q(z|x)/q(z|x)) dz
     = log ∫z q(z|x) (Pθ (x, z)/q(z|x)) dz
     = log Eq(z|x) [Pθ (x, z)/q(z|x)]
     ≥ Eq(z|x) [log(Pθ (x, z)/q(z|x))]   (by Jensen’s inequality)

L(θ) ≥ Jθ (q), called the Evidence Lower Bound (ELBO)

• Jθ (q) is a function of both the model parameters θ and the density on z, q(z|x)
• q(z|x) is known as the variational latent posterior
• Maximizing the ELBO over both θ and q approximates maximum likelihood estimation
• The fundamental problem solved in any latent variable generative model is ELBO maximization:

θ∗ , q ∗ = arg maxθ,q Eq(z|x) [log(Pθ (x, z)/q(z|x))]

3.3 Gaussian Mixture Models


• z is discrete, z ∈ {1, 2, . . . , M }

Pθ (x) = Σz Pθ (x, z) = Σj=1..M Pθ (z = j) Pθ (x|z = j)

• In a GMM, Pθ (z = j) = αj and Pθ (x|z = j) = N (x; µj , Σj ), so

Pθ (x) = Σj=1..M αj · N (x; µj , Σj )

• Parameters of the GMM: θ = {α1 , . . . , αM , µ1 , . . . , µM , Σ1 , . . . , ΣM }
• Since x ∈ Rd , we have µj ∈ Rd and Σj ∈ Rd×d , with

0 ≤ αj ≤ 1,   Σj=1..M αj = 1

• Goal: Estimate θ, via ELBO optimization


• The optimization can be solved iteratively by optimizing θ and q, one after the other.

Algorithm 1 Expectation Maximization (EM) algorithm

1: Let θt and qt represent the estimates at iteration t, starting from an arbitrary initialization
2: for t = 1 to T do
3:    qt+1 = arg maxq Jθt (q), with θt held constant
4:    θt+1 = arg maxθ Jθ (qt+1 ), with qt+1 held constant
5: end for

• It can be shown that EM ensures that L(θt+1 ) ≥ L(θt )
• For a GMM, the E-step can be computed analytically:

qt+1 (z|x) = arg maxq Jθt (q) = Pθt (z|x) = Pθt (x|z) Pθt (z) / Pθt (x),

i.e., qt+1 (z = j|x) = αjt N (x; µtj , Σtj ) / Σl=1..M αlt N (x; µtl , Σtl )

• Then θt+1 can be found by simple differentiation (the M-step); a rough sketch of both steps is given below.
• The ELBO can be optimized for a latent variable model using the EM algorithm provided Pθ (z|x) can be computed.
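Rough NumPy sketch of EM for a GMM in the notation above (initialization, regularization, and the lack of a convergence check are simplifying assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, iters=50):
    n, d = X.shape
    alpha = np.full(M, 1.0 / M)                      # mixing weights alpha_j
    mu = X[np.random.choice(n, M, replace=False)]    # random data points as initial means
    Sigma = np.stack([np.eye(d)] * M)                # identity covariances to start
    for _ in range(iters):
        # E-step: responsibilities q_{t+1}(z=j | x_i) = P_theta_t(z=j | x_i)
        r = np.stack([alpha[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(M)], axis=1)    # shape (n, M)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta = {alpha, mu, Sigma} from weighted averages
        Nj = r.sum(axis=0)                           # effective counts per component
        alpha = Nj / n
        mu = (r.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mu[j]
            Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma

# Usage: alpha, mu, Sigma = em_gmm(np.random.randn(500, 2), M=3)
```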

3.4 Variational Auto-Encoders
• Goal of a Neural Latent Variable Model
1. Learn a Latent Variable Model with unknown Pθ (z|x)
2. Enable sampling or generation from Pθ (x|z)
3. Enable posterior Inference, compute or estimate q(z|x)
• Consider our objective function

Jθ (q) = Eq(z|x) log [Pθ (x, z)/q(z|x)]
       = Eq(z|x) log [Pθ (x|z) Pθ (z)/q(z|x)]
       = Eq(z|x) log Pθ (x|z) − Eq(z|x) log [q(z|x)/Pθ (z)]
       = Eq(z|x) log Pθ (x|z) − DKL (q(z|x)∥Pθ (z))

• Represent Pθ (x|z) and q(z|x) via neural networks


• Pθ (x|z): Conditional data likelihood
• q(z|x): Variational Latent posterior density
• How to represent probability distributions via neural networks?
1. Deterministic Representation: We have seen in GANs, the generator outputs samples from the
distribution that it models. Can also be done via classifiers.
2. Probabilistic Representation: The neural network takes input x and outputs the parameters of the distribution that it models, but not samples from it.
• In a variational auto-encoder, the distributions q(z|x) and Pθ (x|z) are represented using probabilistic NNs.
• The neural network that represents the latent posterior is called the Encoder network (qϕ ), and the network
that represents the conditional data likelihood is called the Decoder network (Pθ ).
• There is no direct connection between the Encoder and Decoder
• ϕ and θ denote the weights of the Encoder and Decoder networks, respectively.
• To compute the gradients of Jθ (qϕ ) w.r.t. ϕ, we need ∇ϕ Eqϕ (z|x) log Pθ (x|z), with z ∼ qϕ (z|x).
Consider a more general expectation of some function fϕ (z) taken w.r.t. a distribution qϕ (z):

∇ϕ Eqϕ (z) fϕ (z) = ∇ϕ ∫z qϕ (z) fϕ (z) dz
                 = ∫z ∇ϕ (qϕ (z) fϕ (z)) dz
                 = ∫z (∇ϕ qϕ (z)) fϕ (z) dz + ∫z qϕ (z) (∇ϕ fϕ (z)) dz
                 = (a term that is not an expectation w.r.t. qϕ and cannot be estimated from samples) + Eqϕ (z) [∇ϕ fϕ (z)]

⇒ ∇ϕ Eqϕ (z|x) log Pθ (x|z) cannot be computed by naively sampling z from qϕ (z|x).

• Gradients cannot be back-propagated through the entire pipeline because the sampling procedure is non-differentiable.
• Reparameterization Trick: Express the distribution qϕ (z|x) in terms of another distribution (an auxiliary RV) that is independent of the parameters ϕ.
• Suppose there exists an auxiliary RV ϵ ∼ Pϵ , where Pϵ does not depend on ϕ, and z = g(ϵ) (with g itself depending on ϕ). Then

Eqϕ (z) fϕ (z) = EPϵ fϕ (g(ϵ))   (Law of the Unconscious Statistician, LOTUS)

∇ϕ Eqϕ (z) fϕ (z) = ∇ϕ EPϵ fϕ (g(ϵ)) ≈ (1/M) Σj=1..M ∇ϕ [fϕ (g(ϵj ))]   (Law of Large Numbers)

• qϕ (z|x) = N (z; µϕ (x), Σϕ (x))
Let ϵ ∼ N (0, I), then z = µϕ (x) + Σϕ (x) · ϵ = g(ϵ)
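A tiny illustrative snippet of this reparameterization; the log-variance parameterization of a diagonal Σϕ (x) is a common convention assumed here, not something fixed by the notes:

```python
import torch

def reparameterize(mu, log_var):
    """mu, log_var: encoder outputs of shape (B, k); returns z ~ N(mu, diag(exp(log_var)))."""
    eps = torch.randn_like(mu)                 # auxiliary RV, independent of phi
    return mu + torch.exp(0.5 * log_var) * eps # deterministic in (mu, log_var), so gradients flow to phi
```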
• Inverse CDF method: Suppose z is a RV and Fz (z) denotes its CDF. Then

z = Fz−1 (ϵ), ϵ ∼ Uniform[0, 1], i.e., g(ϵ) = Fz−1 (ϵ) (the inverse CDF)

• From an implementation perspective qϕ (z|x) is assumed to be Gaussian.


• Gaussian choice is just a design choice; one can use any RV as long as they can find an appropriate g
function and the corresponding auxiliary RV.
• We wanted the gradient w.r.t. ϕ:

∇ϕ Eqϕ (z|x) log Pθ (x|z) = ∇ϕ EPϵ log Pθ (x|g(ϵ)) ≈ ∇ϕ [(1/M) Σj=1..M log Pθ (x|g(ϵj ))]

• To compute log Pθ (x|g(ϵj )): parameterize Pθ (x|z) via some known distribution and use the Decoder to output its parameters, e.g.,

Pθ (x|z) = N (x; x̂θ (z), I)
log Pθ (x|z) = −(1/2) ∥x − x̂θ (z)∥22 + const

⇒ ∇ϕ Eqϕ (z|x) log Pθ (x|z) = ∇ϕ EPϵ log Pθ (x|g(ϵ)) ≈ −∇ϕ (1/(2M)) Σj=1..M ∥xi − x̂θ (zj )∥22

where zj = µϕ (xi ) + Σϕ (xi ) · ϵj with ϵ1 , . . . , ϵM ∼ N (0, I). In other words, maximizing the conditional log-likelihood amounts to minimizing the average reconstruction error.

• To compute ∇ϕ DKL (qϕ (z|x)∥Pθ (z)):

Pθ (z): latent prior, typically assumed to be N (0, I)

DKL (qϕ (z|x)∥Pθ (z)) = DKL (N (µϕ (x), Σϕ (x)) ∥ N (0, I))
                      = (1/2) [− log |Σϕ (x)| − K + tr[Σϕ (x)] + ∥µϕ (x)∥22 ]

where K is the dimension of z. It is then straightforward to compute the gradient of the KL term.
• Forward Pass
1. Given an xi ∈ D, pass it through the Encoder to obtain µϕ (xi ) and Σϕ (xi )
2. Sample ϵ1 , . . . , ϵM ∼ N (0, I) outside the NNs
3. Compute z1 , . . . , zM via reparameterization
4. Pass each of z1 , . . . , zM through the Decoder to compute x̂θ (zj )
5. Compute (1/M) Σj=1..M ∥xi − x̂θ (zj )∥22
• Backward Pass (Encoder)
1. Compute the gradient ∇ϕ [(1/M) Σj=1..M ∥xi − x̂θ (zj )∥22 ]
2. Back-propagate through the Decoder, with its parameters θ held fixed
3. Back-propagate through the reparameterization step (which is independent of θ)
4. Back-propagate through the Encoder, and add the gradient of the KL term
• Training the Decoder

∇θ Jθ (qϕ ) = ∇θ [Eqϕ (z|x) log Pθ (x|z)], since ∇θ [DKL (qϕ (z|x)∥P (z))] = 0

• Backward Pass (Decoder)
1. Compute the gradient ∇θ [(1/M) Σj=1..M ∥xi − x̂θ (zj )∥22 ]
2. Back-propagate through the Decoder
A compact training-step sketch combining these pieces is given below.
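Compact illustrative sketch of one such training step (M = 1 Monte Carlo sample, assumed MLP architectures and log-variance parameterization; not the course's reference code). The loss being minimized is the negative ELBO: the squared-error reconstruction term plus the closed-form Gaussian KL term derived above.

```python
import torch
import torch.nn as nn

d, k = 784, 32
encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * k))  # outputs [mu, log_var]
decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))      # outputs x_hat_theta(z)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def vae_step(x):                                   # x: (B, d) minibatch
    mu, log_var = encoder(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)      # reparameterization
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                  # -E_q log P_theta(x|z), up to constants
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1).mean()
    loss = recon + kl                                             # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```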

• Inference with a trained VAE
1. Data Generation or Sampling: Consider the sampling procedure implied by Pθ (x) = ∫z Pθ (x, z) dz:
– Sample znew ∼ P (z) = N (0, I)
– Sample xnew ∼ Pθ (x|znew ), using the Decoder of the VAE
The KL term in the ELBO pushes qϕ (z|x) towards P (z) = N (0, I) for all x. If the VAE is well trained, qϕ (z|x) ≈ N (0, I) ∀x, so sampling from N (0, I) amounts (approximately) to sampling from qϕ (z|x).
x̂θ (znew ) is the output of the Decoder for znew ∼ N (0, I). Either use x̂θ (znew ) ∈ Rd directly as the novel generated data point, or sample xnew ∼ N (x̂θ (znew ), I) as the generated point.
2. Feature/Embedding Extraction or Latent Posterior Inference: Consider the trained Encoder of a VAE. The embedding or feature vector for xtest is given by either ztest = µϕ (xtest ) or ztest ∼ N (µϕ (xtest ), Σϕ (xtest )).
Embeddings can be used to build another VAE in a smaller dimension.
• VAE can be viewed as an auto-encoder with a regularized latent space (Regularized Auto-Encoder)

• Posterior Collapse in a naive VAE: In a VAE, the second term of the ELBO, DKL (qϕ (z|x)∥P (z)), is minimized, so for every x, qϕ (z|x) is forced towards P (z) = N (0, I). The decoder then finds it difficult to differentiate between two input samples xi and xj , since

qϕ (z|xi ) = qϕ (z|xj ) = N (0, I), with zi ∼ qϕ (z|xi ) and zj ∼ qϕ (z|xj )

• β-VAE: Simply weight the KL term by β, i.e., minimize

∥x − x̂θ (z)∥22 (reconstruction error) + β · DKL (qϕ (z|x)∥P (z)) (regularization error)

0 ≤ β ≤ 1, with β a hyperparameter.
Higher β → stronger regularization, which can lead towards posterior collapse but better generation.
Lower β → better reconstructions, but generation suffers because qϕ (z|x) ̸= P (z).
• Vector Quantized VAE (VQ-VAE): Current SOTA that is used in practice. In a VQ-VAE, the latent space is discrete and vector-quantized using a learnable dictionary.
• The Encoder outputs ze (xi ) directly, rather than the parameters of a distribution.
• The Decoder takes in the vector-quantized version of ze , i.e., zq (xi ).
• Suppose there are M latent vectors of dimension k each in the dictionary, L = {z1 , z2 , . . . , zM }, zj ∈ Rk . Then

zq (xi ) = zj ∗ ,  j ∗ = arg minj=1,...,M ∥ze (xi ) − zj ∥22   ← vector quantization

In practice, the latent space is designed as a tensor of latent codes: the encoder outputs ẑe (xi ) ∈ Rk×p , i.e., p vectors, and each column is vector-quantized against the dictionary.
• Objective of VQ-VAE (see the quantization sketch below):

Jθ = ∥xi − x̂θ (ẑq (xi ))∥22 + ∥ze (xi ) − zq (xi )∥22

In addition to the Encoder and Decoder parameters, the latent dictionary is also learned.
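Illustrative nearest-neighbour quantization step of a VQ-VAE (dictionary size, code dimension, and number of codes per input are assumed): each of the p encoder output vectors is replaced by its closest dictionary entry.

```python
import torch

M, k, p = 512, 64, 49                      # dictionary size, code dim, codes per input (assumed)
codebook = torch.randn(M, k)               # learnable latent dictionary L = {z_1, ..., z_M}

def quantize(z_e):                         # z_e: (p, k) encoder output for one input x_i
    dists = torch.cdist(z_e, codebook) ** 2          # squared distances, shape (p, M)
    j_star = dists.argmin(dim=1)                     # index of the nearest code per column
    return codebook[j_star]                          # z_q: (p, k), the quantized latents

# z_q = quantize(torch.randn(p, k)); during training, gradients are usually passed
# through this non-differentiable lookup with a straight-through estimator.
```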
• Embedding Extraction: simply the vector-quantized version of the encoder output for the input xtest .
• Sampling in VQ-VAE: To sample from the decoder of a VQ-VAE, we need to sample from the distribution of ẑq (·). To learn this distribution, a generative model is fit to samples of zq , e.g., a GMM on the quantized encoder outputs corresponding to all input data points. One then samples from this learned distribution of zq and uses the sample as input to the Decoder.
• Due to the cumbersome nature of sampling, VQ-VAE is typically not used for generation or sampling.

4 Denoising Diffusion Probabilistic Models (DDPM)
4.1 Introduction
4.2 Diffusion Models
4.3 Conditional Diffusion Models and Score-based Models

5 Auto-Regressive Models and LLMs


5.1 Introduction
5.2 Models, Sampling, Inference, and Quantization Methods
5.3 Reinforcement Learning based Alignment Methods
