Rishabh Indoria

21F3001823 Generative AI July 23, 2025

https://github.com/Chandan-IISc/IITM_GenAI/tree/main

1 Introduction to Deep Generative Modeling


1.1 Broad Recipe
• Data: D = {x1 , x2 , . . . , xn } ∼ iid Px , where Px is unknown.
xi ∈ Rd , where d is the dimensionality of the data.
• Since all the data points are drawn from this unknown distribution, we say X is a random variable with distribution Px .
• Each xi is an instance of a vector-valued random variable of dimension d.


• Goal: Estimate Px and learn to sample from it.
• General Principle of Generative Models
1. Assume a parametric family for Px , denoted by Pθ . Pθ is represented using a deep neural network (the model).
2. Define and estimate a divergence (distance) metric between Pθ and Px .
3. Solve an optimization problem over the parameters of Pθ to minimize the above divergence metric.
• Example: Suppose there is a RV z ∈ Rk with some arbitrary but known distribution, e.g., z ∼ N (0, I).
Suppose gθ (z) : Z → X . Then x̂ = gθ (z) has a different distribution than that of z, and the distribution of x̂ depends on the function gθ (·).
Suppose gθ (z) is a neural network. Denote the density of x̂ = gθ (z) by Pθ (x̂).
Let D(Px ∥Pθ ) denote a divergence measure between Px and Pθ , with D ≥ 0 and D = 0 iff Px = Pθ .

θ∗ = arg minθ D(Px ∥Pθ )

Upon solving the above optimization problem, the distribution Px is implicitly estimated by gθ (z), and one can sample from it using gθ (z):
a sample z ∼ N (0, I), passed through gθ (z), produces a sample from Pθ∗ (x̂) which is close to Px ; hence we end up (approximately) sampling from Px . A minimal sketch of this sampling pipeline follows.
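A minimal sketch, assuming a small PyTorch MLP generator and made-up dimensions (k = 64, d = 784); this is illustrative only, not the course's reference implementation:

```python
import torch
import torch.nn as nn

k, d = 64, 784                                   # latent and data dimensionality (assumed)
g_theta = nn.Sequential(                         # g_theta : Z -> X
    nn.Linear(k, 256), nn.ReLU(),
    nn.Linear(256, d),
)

z = torch.randn(16, k)                           # z ~ N(0, I), a batch of 16 latent samples
x_hat = g_theta(z)                               # x_hat = g_theta(z) ~ P_theta, shape (16, d)
```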
• How to compute the divergence metric without knowing Px and Pθ ?
• What should be the choice of the divergence metric?
• How to choose gθ (z), and in turn Pθ ?
• How to solve the optimization problem of minimizing the divergence metric?

1.2 Variational Divergence Minimization


• f-Divergence: Given two probability distributions with corresponding density functions Px and Pθ , the f-divergence between them is defined as

Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx

where f (u) : R+ → R is convex, lower semi-continuous, and satisfies f (1) = 0, and x ranges over the space on which Px and Pθ are supported.
• Properties of f-divergence

1. Df (·) ≥ 0 for any choice of f (·)


2. Df (Px ∥Pθ ) = 0 iff Px = Pθ
• Examples of f-divergence

1. f (u) = u log u: KL-divergence

   Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx
               = ∫x Pθ (x) (Px (x)/Pθ (x)) log(Px (x)/Pθ (x)) dx
               = ∫x Px (x) log(Px (x)/Pθ (x)) dx = DKL (Px ∥Pθ )

   forward KL = DKL (Px ∥Pθ ) ̸= DKL (Pθ ∥Px ) = reverse KL
2. f (u) = (1/2) [u log u − (u + 1) log((u + 1)/2)]: JS-divergence
3. f (u) = (1/2) |u − 1|: Total variation distance

• Objective: Algorithm to minimize Df between Px and Pθ , without knowing both of them, but having
samples from both.
• Key Idea: Integrals involving density functions can be approximated using samples drawn from the
distribution
Suppose we want to compute the following integral

I = ∫x h(x) Px (x) dx = EPx [h(x)]

where h(x) is a function and Px (x) is the density function. We have samples drawn iid from Px : x1 , x2 , . . . , xn ∼ iid Px .

• Law of Large Numbers: the sample mean converges to the actual expectation,

(1/n) Σi=1..n h(xi ) → EPx [h(x)] as n → ∞,

where xi ∼ iid Px . A numerical sketch of this Monte Carlo estimate is given below.
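For instance (an illustrative choice, not from the notes): with h(x) = x² and x ∼ N (0, 1), the true expectation is 1, and the sample mean approaches it as n grows.

```python
# Monte Carlo estimate of an expectation from iid samples (illustrative example).
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    return x ** 2                       # E[h(x)] = 1 when x ~ N(0, 1)

for n in (10, 1_000, 100_000):
    samples = rng.standard_normal(n)    # x_1, ..., x_n ~ iid P_x = N(0, 1)
    print(n, h(samples).mean())         # sample mean approaches E_Px[h(x)] = 1
```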
• If the f-divergence can be expressed in terms of expectations of functions w.r.t Px and Pθ , then one can
compute and optimize them.
• Conjugate function of a convex function: If f (u) is a convex function, then there exists a conjugate function f ∗ (t) defined as

f ∗ (t) = supu∈dom f {ut − f (u)}

• Properties of the Conjugate
1. f ∗ (t) is also convex.
2. [f ∗ (t)]∗ = f (u), i.e., the conjugate of the conjugate gives back the original function:

f (u) = supt∈dom f ∗ {tu − f ∗ (t)}
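As a concrete check (worked out here, not stated in the notes), the conjugate of the KL generator f (u) = u log u has a closed form:

```latex
f^{*}(t) = \sup_{u > 0} \{\, u t - u \log u \,\}
% stationarity: \frac{d}{du}(ut - u\log u) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{t-1}
f^{*}(t) = e^{t-1} t - e^{t-1}(t - 1) = e^{t-1}
```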

• Expressing Df in terms of expectations over Px and Pθ :

Df (Px ∥Pθ ) = ∫x Pθ (x) f (Px (x)/Pθ (x)) dx
            = ∫x Pθ (x) supt {t (Px (x)/Pθ (x)) − f ∗ (t)} dx
            ≥ supT∈T ∫x Pθ (x) {T (x) (Px (x)/Pθ (x)) − f ∗ (T (x))} dx
            = supT∈T [∫x Px (x) T (x) dx − ∫x Pθ (x) f ∗ (T (x)) dx]
            = supT∈T [EPx [T (x)] − EPθ [f ∗ (T (x))]]

where T : X → dom f ∗ is a space of functions. The inequality arises because the space of functions T that we optimize over may not contain the optimal T ∗ (x) that solves the inner (pointwise) optimization problem. Hence

Df (Px ∥Pθ ) ≥ supT∈T [EPx [T (x)] − EPθ [f ∗ (T (x))]]

• The best we can do is work with this lower bound: maximize it over T and minimize it over θ, rather than minimizing Df itself:

θ∗ = arg minθ [maxT (EPx [T (x)] − EPθ [f ∗ (T (x))])]

• Optimizing over a space of functions cannot be done analytically.
• In practice, the space of functions T is represented by a neural network Tw (x), where w are the parameters of that network.
• With this, the objective becomes

θ∗ , w∗ = arg minθ maxw (EPx [Tw (x)] − EPθ [f ∗ (Tw (x))]) = arg minθ maxw J(θ, w)

• These alternating minimization and maximization problems are called Saddle Point Optimization.
• Solving these kinds of problems is difficult.
• Any saddle point optimization of this form is also called an Adversarial Problem.
• The θ network is called the Generator Network and the w network is called the Critic/Discriminator Network.
• We have to make sure that the T network maps X → dom f ∗ . In practice, we take Tw (x) = σf (Vw (x)), where σf is an f-divergence-specific activation and Vw (x) : X → R.
• Hence, the final optimization problem becomes

J(θ, w) = EPx [σf (Vw (x))] − EPθ [f ∗ (σf (Vw (x)))]

2 Generative Adversarial Networks


2.1 Introduction and Formulation
• For GANs, the f-divergence is as follows:
f (u) = u log u − (u + 1) log(u + 1) (similar to JSD)
f ∗ (t) = − log(1 − exp(t)), dom f ∗ = R−
σf (v) = − log(1 + e−v )

• We can now write J using the above:

J(θ, w) = EPx [σf (Vw (x))] − EPθ [f ∗ (σf (Vw (x)))]
JGAN (θ, w) = EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]

where Dw (x) = 1/(1 + e−Vw (x) ) is the sigmoid of Vw (x).

• Consider the input D = {x1 , x2 , x3 , . . . , xn } ∼ iid Px .

One gradient ascent step through the Discriminator, treating the Generator as constant:

w∗ = arg maxw EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]
   ≈ arg maxw [(1/B1 ) Σi=1..B1 log(Dw (xi )) + (1/B2 ) Σj=1..B2 log(1 − Dw (x̂j ))]

wt+1 ← wt + α1 ∇w JGAN (θ, w)

One gradient descent step through the Generator, treating the Discriminator as constant:

θ∗ = arg minθ EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))]
   ≈ arg minθ [(1/B1 ) Σi=1..B1 log(Dw (xi )) + (1/B2 ) Σj=1..B2 log(1 − Dw (x̂j ))]
   ≈ arg minθ (1/B2 ) Σj=1..B2 log(1 − Dw (gθ (zj ))) (because the first term does not depend on θ)

θt+1 ← θt − α2 ∇θ JGAN (θ, w)

• To train the Discriminator: keep θ constant. From D = {x1 , x2 , . . . , xn }, pass a batch B = {x1 , x2 , . . . , xB1 } ⊂ D through the Discriminator network Dw .
Sample z1 , z2 , . . . , zB2 ∼ N (0, I), pass them through the Generator network gθ (·) with fixed θ, and then pass the resulting x̂ through the Discriminator network.
Once we have these values, compute JGAN and perform one step of gradient ascent.
• To train the Generator: keep w constant. Sample z1 , z2 , . . . , zB2 ∼ N (0, I), pass them through the Generator network gθ (·), and then pass the resulting x̂ through the Discriminator network.
Once we have these values, compute JGAN and perform one step of gradient descent.
• Typically, training alternates between the generator and the discriminator; a minimal sketch of this alternating loop is given below.
• Typically, there is no well-defined stopping criterion; training is stopped based on the quality of the outputs.
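The sketch below is an illustrative PyTorch-style rendering of the alternating updates above; the MLP architectures, dimensions, optimizers, and learning rates are assumptions, not the course's reference code.

```python
import torch
import torch.nn as nn

k, d = 64, 784                                                              # assumed dimensions
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))          # generator g_theta
V = nn.Sequential(nn.Linear(d, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # V_w, so D_w = sigmoid(V_w)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(V.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()                  # -[y log D + (1-y) log(1-D)] evaluated on logits V_w(x)

def gan_step(x_real):                         # x_real: (B, d) minibatch from P_x
    B = x_real.size(0)
    # Discriminator: one ascent step on J_GAN (equivalently, descend the BCE loss), theta fixed.
    x_fake = G(torch.randn(B, k)).detach()    # treat the generator as a constant
    loss_d = bce(V(x_real), torch.ones(B, 1)) + bce(V(x_fake), torch.zeros(B, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: one descent step on E[log(1 - D(g(z)))], w fixed.
    d_fake = torch.sigmoid(V(G(torch.randn(B, k))))
    loss_g = torch.log(1.0 - d_fake + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

In practice the "non-saturating" generator loss −E[log Dw (gθ (z))] is often minimized instead of the log(1 − Dw ) form above, since it gives stronger gradients early in training.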

2.2 Classifier-Guided Generative Sampler


• Suppose there is a binary classifier

Dw (x) = 1 if x ∼ Px , and Dw (x) = 0 if x ∼ Pθ

• Can Dw (x) be used for making Pθ and Px closer?


• Tweak θ (parameters of gθ ) till the classifier ’fails’ to distinguish between the samples of Px and Pθ
• However, failure of the classifier need not imply Px = Pθ
• Hence, the classifier has to be simultaneously tweaked along with the generator
• Suppose Dw : X → [0, 1] represents the likelihood of the sample x coming from Px . The objective is

w∗ = arg maxw EPx [log(Dw (x))]   (maximize the likelihood of x ∼ Px )

The classifier should also maximize the likelihood of x̂ not coming from Px when x̂ ∼ Pθ :

w∗ = arg maxw EPθ [log(1 − Dw (x̂))]   (maximize the likelihood of x̂ ∼ Pθ )

The combined objective for classifier training is

w∗ = arg maxw EPx [log(Dw (x))] + EPθ [log(1 − Dw (x̂))] = arg maxw J(θ, w)

This is exactly the lower bound we constructed previously.


• The objective for the generator gθ (z) is that the classifier has to ’fail’; invert the optimization of the classifier:

θ∗ = arg minθ J(θ, w)

• Overall, we have the following adversarial optimization:

θ∗ , w∗ = arg minθ maxw [J(θ, w)]

• Note: This classifier interpretation does not generalize across different f -divergences.


• Deep Convolutional GAN (DCGAN)
1. Typically in a GAN, z ∈ Rk and x ∈ Rd with k ≪ d, so the generator progressively increases the size of its input.
2. Up-convolutional (transpose convolutional) layers are used in the generator.
3. Typically used when the data consists of images.
• Conditional GAN (C-GAN)
1. Once the generator is trained, there is no way to control what kind of images the generator will
generate.
2. Data: D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ∼ iid Pxy
3. Objective: Sample from the conditional distribution Px|y (instead of Px )
4. Solution: Estimate Px|y and make Pθ approach Px|y
5. Modify the generator and discriminator to operate on the samples of the conditional distribution.
6. Simply pass an additional input y to the generator.
7. Similarly, the discriminator also takes an additional input y.
8. The objective function becomes the following

J(θ, w) = EPx [log(Dw (x, y))] + EPθ [log(1 − Dw (x̂, y))]
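To make the conditioning in items 6–8 concrete, here is an illustrative sketch (the architectures, dimensions, and use of a learned label embedding are assumptions) of passing the extra input y to both networks:

```python
import torch
import torch.nn as nn

k, d, n_classes, emb = 64, 784, 10, 32            # assumed sizes

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)
        self.net = nn.Sequential(nn.Linear(k + emb, 256), nn.ReLU(), nn.Linear(256, d))
    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))   # x_hat = g_theta(z, y)

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)
        self.net = nn.Sequential(nn.Linear(d + emb, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))   # logit of D_w(x, y)

# Usage: z = torch.randn(8, k); y = torch.randint(0, n_classes, (8,))
# x_hat = CondGenerator()(z, y); score = CondDiscriminator()(x_hat, y)
```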

2.3 Improvisations on Adversarial Learning


• f -divergence minimization leads to unstable training.
• Manifold Hypothesis: Distributions of real data (such as images) lie on a lower-dimensional manifold in the ambient space.
• A manifold can be roughly seen as a lower-dimensional subspace.
• Recall that Px and Pθ are distributions over Rd . Since the real and the generated data lie on lower-dimensional manifolds, the supports of Px and Pθ will, with very high probability, not be aligned.
• It can be shown that a perfect discriminator can be learned when the supports of Px and Pθ are not aligned.
• This implies that GAN training saturates.
• This is also the reason the generator is usually given more training steps than the discriminator.
• If we can find a Dw with 100% accuracy, then Df (Px ∥Pθ ) becomes independent of the generator parameter θ.
• Solution: Use a ’softer’ divergence metric that does not saturate when the supports (manifolds) of Px and Pθ do not align.

• Wasserstein Metric (Optimal Transport): Given two distributions Px and Px̂ ,

W (Px ∥Px̂ ) = minλ∈Π(x,x̂) E(x,x̂)∼λ ∥x − x̂∥2

λ: a joint distribution between Px and Px̂
Π(x, x̂): all possible joint distributions whose marginals are Px and Px̂ , i.e.,

∫x λ(x, x̂) dx = Px̂ ,   ∫x̂ λ(x, x̂) dx̂ = Px

• Suppose Px and Px̂ are two 1-D discrete pmfs. The mass in Px can be redistributed such that Px transforms into Px̂ .
• A redistribution scheme can be represented as a joint distribution between Px and Px̂ .
• Every redistribution scheme is a joint distribution and is called a "transport plan".
• We need a way to quantify the effort involved in the seemingly infinite set of transport plans.

• Suppose some mass is moved from x to x̂:

∥x − x̂∥ : distance of the movement
Π(x, x̂) : amount of mass that was moved
Π(x, x̂) ∥x − x̂∥ : "work done" in moving the mass from x to x̂

Average work done in a transport plan:

∫x,x̂ Π(x, x̂) ∥x − x̂∥ dx dx̂ = EΠ(x,x̂) ∥x − x̂∥

• Given that multiple transport plans (joint distributions) exist between x and x̂, which of them corresponds to the least amount of work?

minλ∈Π(x,x̂) Eλ(x,x̂) ∥x − x̂∥ = W (Px ∥Px̂ )

where Π is the family of joint distributions such that

∫x̂ Π(x, x̂) dx̂ = Px ,   ∫x Π(x, x̂) dx = Px̂

These two marginal conditions make sure that Px is transformed into Px̂ . A small numerical example follows below.
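As a quick numerical illustration (the support points and weights are made up, and SciPy is assumed to be available), the first-order Wasserstein distance between two 1-D discrete pmfs can be checked directly:

```python
# 1-D optimal transport cost between two discrete pmfs (illustrative check).
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.5, 0.5, 0.0, 0.0])   # P_x     : mass at 0 and 1
q = np.array([0.0, 0.0, 0.5, 0.5])   # P_x_hat : the same mass shifted to 2 and 3

# Every unit of mass must travel a distance of 2, so the minimal "work" is 2.
print(wasserstein_distance(support, support, u_weights=p, v_weights=q))  # -> 2.0
```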


• The closer Px and Px̂ are, the smaller W (Px ∥Px̂ ) will be.
• Fact: Unlike f -divergences, the Wasserstein metric does not saturate when the supports of Px and Pθ do not align.
• In generative models, this can be used as θ∗ = arg minθ W (Px ∥Pθ ). How to minimize W ?
• Kantorovich–Rubinstein Duality: The Wasserstein distance between two distributions is given by

W (Px ∥Pθ ) = max∥Tw ∥L ≤1 [Ex∼Px [Tw (x)] − Ex̂∼Pθ [Tw (x̂)]]

∥Tw ∥L ≤ 1: Tw is 1-Lipschitz, i.e., ∥Tw (x1 ) − Tw (x2 )∥ / ∥x1 − x2 ∥ ≤ 1

θ∗ , w∗ = arg minθ max∥Tw ∥L ≤1 [Ex∼Px [Tw (x)] − Ex̂∼Pθ [Tw (x̂)]]

• The method of minimizing the Wasserstein metric is called W-GAN.
• Making a neural network a Lipschitz function is an area of research.
• Practically, one normalizes the weights of Tw such that ∥w∥2 = 1 after each gradient step; this keeps Tw close to 1-Lipschitz (see the sketch below).
• Conclusion: Training a WGAN is more stable than training a naive GAN.
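A simplified, illustrative sketch of one such update (architectures, optimizer, and the exact normalization are assumptions; weight clipping or gradient penalties are common alternatives for the Lipschitz constraint):

```python
import torch
import torch.nn as nn

k, d = 64, 784
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))   # generator g_theta
T = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))   # critic T_w
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_t = torch.optim.RMSprop(T.parameters(), lr=5e-5)

def wgan_step(x_real):
    B = x_real.size(0)
    # Critic: maximize E_Px[T_w(x)] - E_Ptheta[T_w(x_hat)]  (i.e., minimize its negative).
    x_fake = G(torch.randn(B, k)).detach()
    loss_t = -(T(x_real).mean() - T(x_fake).mean())
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
    with torch.no_grad():                 # crude Lipschitz control: renormalize weight matrices
        for p in T.parameters():
            if p.dim() > 1:
                p.div_(p.norm() + 1e-8)
    # Generator: minimize the estimated Wasserstein distance w.r.t. theta.
    loss_g = -T(G(torch.randn(B, k))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```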

2.4 Applications of GAN


• A trained GAN is a function from Z → X .
• Suppose one is interested in inversion of the above function, i.e., given xi ∼ Px , the goal is to find the
corresponding zi
• Inversion is useful for Feature Extraction, if the GAN is trained well, then it has implicitly learned the
distribution or meaningful features. Given a dataset, obtain GAN-inverted vectors and use them as features
for the data.
• Inversion is useful for Data Manipulation/Editing, suppose xi ∼ Px needs to be edited. First, get
zi : xi = gθ∗ (zi ) via inversion, then perform edit zedit = fedit (z), and finally get the output image
xedit = gθ∗ (zedit )
• How to modify a GAN such that the inversion is possible?
• Bi-directional GAN (BiGAN): In addition to the generator and discriminator, there is another function,
called the Encoder or the Invertor network.

Eϕ : X → Z

• In BiGAN, the discriminator Dw is designed to classify between the data tuples of the form (x, Eϕ (x)) and
(gθ (z), z).

LBiGAN (θ, w, ϕ) = Ex∼Px [Eẑ∼Pϕ (z) [log(Dw (x, Eϕ (x)))]] + Ez∼Pz [Ex̂∼Pθ (x̂) [log(1 − Dw (gθ (z), z))]]

• Once a BiGAN is trained, gθ∗ (z) is used for generation and Eϕ∗ (x) is used for inversion.
• It can be shown that the optimum of LBiGAN is achieved when the two joint distributions match,

Pẑx = Pzx̂ , where Pẑx (x, ẑ) = Px (x) Pϕ (ẑ|x) and Pzx̂ (x̂, z) = Pz (z) Pθ (x̂|z)

• GAN Inversion via Latent Regression: Add a simple encoder network Eϕ (x̂) = ẑ and penalize ∥z − ẑ∥22 :

LLat-reg (θ, w, ϕ) = [Ex∼Px log(Dw (x)) + Ex̂∼Pθ log(1 − Dw (x̂))] + λ Ex̂∼Pθ [∥z − Eϕ (x̂)∥22 ]

where λ is a hyperparameter.
• However, it has been found that modifying the discriminator network and solving for the joint distribution
leads to a better inversion quality.
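A bare-bones sketch (shapes and architectures assumed) of how the latent-regression term above is computed alongside the usual GAN loss:

```python
import torch
import torch.nn as nn

k, d, lam = 64, 784, 1.0
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))   # generator g_theta
E = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))   # encoder / invertor E_phi

z = torch.randn(16, k)
x_hat = G(z)
# lambda * E_{x_hat ~ P_theta} ||z - E_phi(x_hat)||_2^2
latent_reg = lam * ((z - E(x_hat)) ** 2).sum(dim=1).mean()
# total objective = J_GAN(theta, w) + latent_reg, optimized jointly over theta, w, phi
```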

2.5 Adversarial Learning for Domain Shift


• Suppose we have a dataset Ds from a source distribution Ps :

Ds = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ∼ iid Ps

• The task is to come up with a classifier for the source distribution.
• At test time, data comes from a target distribution that is different from the source, DT ̸= Ds .
• Classifiers/regressors trained on Ds would fail on DT .
• A standard benchmark for this setting is the PACS dataset.
• Unsupervised Domain Adaptation: We are given Ds sampled from Ps and a target dataset DT sampled from PT ; we have labels for Ds but not for DT .
• Objective: Given Ds and DT , learn features/a classifier that performs well on both Ps and PT .
• Domain Adversarial Networks (DAN): A feature network ϕ(x) takes both source and target samples as input and outputs features fs and fT . A discriminator Dw (f ) takes fs or fT as input and outputs a value in [0, 1]. This is just a GAN that matches Pfs and PfT :

ϕ∗ , w∗ = arg minϕ maxw EPfs [log Dw (fs )] + EPfT [log(1 − Dw (fT ))]

• We are trying to learn the ϕ network such that the feature vectors obtained for the source and target distributions have the same distribution.
• To ensure that the feature vectors are useful for the classification task, we also learn a classifier simultaneously.
• This classifier hψ only takes inputs from the source distribution and predicts source labels:

ϕ∗ , ψ ∗ = arg minϕ,ψ BCE(y, hψ (fs ))

• ϕ therefore has two objectives: the adversarial objective and the classification objective.
• During inference, the same classifier hψ∗ can be used on both the source and target distributions, as long as the features from the ϕ∗ network are used as its input.

3 Generative Modeling via Variational Auto-Encoding
3.1 Latent Variable Models
• Suppose we have data D = {x1 , x2 , . . . , xn } ∼ iid Px
• Suppose Pθ denotes our parametric model
• In a latent variable model, Pθ is defined as

Pθ (x) = Σz Pθ (x, z) or Pθ (x) = ∫z Pθ (x, z) dz

depending on whether z is discrete or continuous

• z: latent/hidden/unobserved random variable
• Typically, z is jointly estimated along with the model parameters θ
• For each xi ∈ D, we assume that there exists a corresponding zi
• Suppose z is discrete, z ∈ {1, 2, . . . , N }, and x ∈ Rd .
For each xi ∈ D, zi |xi represents the latent variable corresponding to xi . Since zi is discrete, each xi is clustered into one of N categories.
• Gaussian Mixture Models and K-means clustering are both examples of discrete latent variable models.
• Suppose z is continuous, i.e., z ∈ Rk and x ∈ Rd with k ≪ d (as in an auto-encoder).
Here, zi |xi represents a feature vector corresponding to the given xi .
• Latent variable models can also be used as generative models.

3.2 General Principle for Learning Latent Variable Models


• Suppose we are given data D = {xi }ni=1 ∼ iid Px
• Let Pθ (x) = ∫z Pθ (x, z) dz be the latent variable model
• Goal: Estimate the model parameters θ, given D

θ∗ = arg minθ [DKL (Px ∥Pθ )]
   = arg minθ ∫x Px (x) log(Px (x)/Pθ (x)) dx
   = arg minθ [∫x Px (x) log Px (x) dx − ∫x Px (x) log Pθ (x) dx]
   = arg minθ [− ∫x Px (x) log Pθ (x) dx]
   = arg maxθ [EPx log Pθ (x)]

log Pθ (x) is known as the log-likelihood function of Pθ (x); hence this procedure is also known as Maximum Likelihood Estimation.

• Jensen’s Inequality: log E(·) ≥ E(log(·))

• Let us denote log Pθ (x) by L(θ), and let q(z|x) denote a density function over the latent variable z.

L(θ) = log Pθ (x)
     = log ∫z Pθ (x, z) dz
     = log ∫z Pθ (x, z) (q(z|x)/q(z|x)) dz
     = log ∫z q(z|x) (Pθ (x, z)/q(z|x)) dz
     = log Eq(z|x) [Pθ (x, z)/q(z|x)]
     ≥ Eq(z|x) [log(Pθ (x, z)/q(z|x))]   (by Jensen’s inequality)

L(θ) ≥ Jθ (q), called the Evidence Lower Bound (ELBO)

• Jθ (q) is a function of both the model parameters θ and the density on z, q(z|x)
• q(z|x) is known as the variational latent posterior
• Maximizing the ELBO over both θ and q approximates maximum likelihood estimation
• The fundamental problem solved in any latent variable generative model is ELBO maximization:

θ∗ , q ∗ = arg maxθ,q Eq(z|x) [log(Pθ (x, z)/q(z|x))]

3.3 Gaussian Mixture Models


• z is discrete, z ∈ {1, 2, . . . , M }

Pθ (x) = Σz Pθ (x, z) = Σj=1..M Pθ (z = j) Pθ (x|z = j)

• In a GMM, Pθ (z = j) = αj and Pθ (x|z = j) = N (x; µj , Σj ), so

Pθ (x) = Σj=1..M αj · N (x; µj , Σj )

• Parameters of the GMM: θ = {α1 , . . . , αM , µ1 , . . . , µM , Σ1 , . . . , ΣM }
• Since x ∈ Rd , we have µj ∈ Rd and Σj ∈ Rd×d , with

0 ≤ αj ≤ 1,   Σj=1..M αj = 1

• Goal: Estimate θ, via ELBO optimization


• The optimization can be solved iteratively by optimizing θ and q, one after the other.

Algorithm 1 Expectation Maximization (EM) algorithm

1: Let θt and qt represent the estimates at iteration t, starting from an arbitrary initialization
2: for t = 1 to T do
3:    qt+1 = arg maxq Jθt (q), with θt held constant
4:    θt+1 = arg maxθ Jθ (qt+1 ), with qt+1 held constant
5: end for

• It can be shown that EM ensures that L(θt+1 ) ≥ L(θt )
• For a GMM, the E-step can be computed analytically:

qt+1 (z|x) = arg maxq Jθt (q) = Pθt (z|x) = Pθt (x|z) Pθt (z) / Pθt (x),

i.e., qt+1 (z = j|x) = αjt N (x; µtj , Σtj ) / Σl=1..M αlt N (x; µtl , Σtl )

• Then θt+1 can be found by simple differentiation (the M-step); a rough sketch of both steps is given below.
• The ELBO can be optimized for a latent variable model using the EM algorithm provided Pθ (z|x) can be computed.
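Rough NumPy sketch of EM for a GMM in the notation above (initialization, regularization, and the lack of a convergence check are simplifying assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, iters=50):
    n, d = X.shape
    alpha = np.full(M, 1.0 / M)                      # mixing weights alpha_j
    mu = X[np.random.choice(n, M, replace=False)]    # random data points as initial means
    Sigma = np.stack([np.eye(d)] * M)                # identity covariances to start
    for _ in range(iters):
        # E-step: responsibilities q_{t+1}(z=j | x_i) = P_theta_t(z=j | x_i)
        r = np.stack([alpha[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(M)], axis=1)    # shape (n, M)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta = {alpha, mu, Sigma} from weighted averages
        Nj = r.sum(axis=0)                           # effective counts per component
        alpha = Nj / n
        mu = (r.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mu[j]
            Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma

# Usage: alpha, mu, Sigma = em_gmm(np.random.randn(500, 2), M=3)
```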

3.4 Variational Auto-Encoders
• Goal of a Neural Latent Variable Model
1. Learn a Latent Variable Model with unknown Pθ (z|x)
2. Enable sampling or generation from Pθ (x|z)
3. Enable posterior Inference, compute or estimate q(z|x)
• Consider our objective function

Jθ (q) = Eq(z|x) log [Pθ (x, z)/q(z|x)]
       = Eq(z|x) log [Pθ (x|z) Pθ (z)/q(z|x)]
       = Eq(z|x) log Pθ (x|z) − Eq(z|x) log [q(z|x)/Pθ (z)]
       = Eq(z|x) log Pθ (x|z) − DKL (q(z|x)∥Pθ (z))

• Represent Pθ (x|z) and q(z|x) via neural networks


• Pθ (x|z): Conditional data likelihood
• q(z|x): Variational Latent posterior density
• How to represent probability distributions via neural networks?
1. Deterministic Representation: We have seen in GANs, the generator outputs samples from the
distribution that it models. Can also be done via classifiers.
2. Probabilistic Representation: The neural network takes input x and outputs the parameters of the distribution that it models, but not samples from it.
• In a variational auto-encoder, the distributions q(z|x) and Pθ (x|z) are represented using probabilistic NNs.
• The neural network that represents the latent posterior is called the Encoder network (qϕ ), and the network
that represents the conditional data likelihood is called the Decoder network (Pθ ).
• There is no direct connection between the Encoder and Decoder
• ϕ and θ denote the weights of the Encoder and Decoder networks, respectively.
• To compute the gradients of Jθ (qϕ ) w.r.t. ϕ, we need ∇ϕ Eqϕ (z|x) log Pθ (x|z), with z ∼ qϕ (z|x).
Consider a more general expectation of some function fϕ (z) taken w.r.t. a distribution qϕ (z):

∇ϕ Eqϕ (z) fϕ (z) = ∇ϕ ∫z qϕ (z) fϕ (z) dz
                 = ∫z ∇ϕ (qϕ (z) fϕ (z)) dz
                 = ∫z (∇ϕ qϕ (z)) fϕ (z) dz + ∫z qϕ (z) (∇ϕ fϕ (z)) dz
                 = (a term that is not an expectation w.r.t. qϕ and cannot be estimated from samples) + Eqϕ (z) [∇ϕ fϕ (z)]

⇒ ∇ϕ Eqϕ (z|x) log Pθ (x|z) cannot be computed by naively sampling z from qϕ (z|x).

• Gradients cannot be back-propagated through the entire pipeline because the sampling procedure is non-differentiable.
• Reparameterization Trick: Express the distribution qϕ (z|x) in terms of another distribution (an auxiliary RV) that is independent of the parameters ϕ.
• Suppose there exists an auxiliary RV ϵ ∼ Pϵ , where Pϵ does not depend on ϕ, and z = g(ϵ) (with g itself depending on ϕ). Then

Eqϕ (z) fϕ (z) = EPϵ fϕ (g(ϵ))   (Law of the Unconscious Statistician, LOTUS)

∇ϕ Eqϕ (z) fϕ (z) = ∇ϕ EPϵ fϕ (g(ϵ)) ≈ (1/M) Σj=1..M ∇ϕ [fϕ (g(ϵj ))]   (Law of Large Numbers)

• qϕ (z|x) = N (z; µϕ (x), Σϕ (x))
Let ϵ ∼ N (0, I), then z = µϕ (x) + Σϕ (x) · ϵ = g(ϵ)
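A tiny illustrative snippet of this reparameterization; the log-variance parameterization of a diagonal Σϕ (x) is a common convention assumed here, not something fixed by the notes:

```python
import torch

def reparameterize(mu, log_var):
    """mu, log_var: encoder outputs of shape (B, k); returns z ~ N(mu, diag(exp(log_var)))."""
    eps = torch.randn_like(mu)                 # auxiliary RV, independent of phi
    return mu + torch.exp(0.5 * log_var) * eps # deterministic in (mu, log_var), so gradients flow to phi
```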
• Inverse CDF method: Suppose z is a RV and Fz (z) denotes its CDF. Then

z = Fz−1 (ϵ), ϵ ∼ Uniform[0, 1], i.e., g(ϵ) = Fz−1 (ϵ) (the inverse CDF)

• From an implementation perspective qϕ (z|x) is assumed to be Gaussian.


• Gaussian choice is just a design choice; one can use any RV as long as they can find an appropriate g
function and the corresponding auxiliary RV.
• We wanted the gradient w.r.t. ϕ:

∇ϕ Eqϕ (z|x) log Pθ (x|z) = ∇ϕ EPϵ log Pθ (x|g(ϵ)) ≈ ∇ϕ [(1/M) Σj=1..M log Pθ (x|g(ϵj ))]

• To compute log Pθ (x|g(ϵj )): parameterize Pθ (x|z) via some known distribution and use the Decoder to output its parameters, e.g.,

Pθ (x|z) = N (x; x̂θ (z), I)
log Pθ (x|z) = −(1/2) ∥x − x̂θ (z)∥22 + const

⇒ ∇ϕ Eqϕ (z|x) log Pθ (x|z) = ∇ϕ EPϵ log Pθ (x|g(ϵ)) ≈ −∇ϕ (1/(2M)) Σj=1..M ∥xi − x̂θ (zj )∥22

where zj = µϕ (xi ) + Σϕ (xi ) · ϵj with ϵ1 , . . . , ϵM ∼ N (0, I). In other words, maximizing the conditional log-likelihood amounts to minimizing the average reconstruction error.

• To compute ∇ϕ DKL (qϕ (z|x)∥Pθ (z)):

Pθ (z): latent prior, typically assumed to be N (0, I)

DKL (qϕ (z|x)∥Pθ (z)) = DKL (N (µϕ (x), Σϕ (x)) ∥ N (0, I))
                      = (1/2) [− log |Σϕ (x)| − K + tr[Σϕ (x)] + ∥µϕ (x)∥22 ]

where K is the dimension of z. It is then straightforward to compute the gradient of the KL term.
• Forward Pass
1. Given an xi ∈ D, pass it through the Encoder to obtain µϕ (xi ) and Σϕ (xi )
2. Sample ϵ1 , . . . , ϵM ∼ N (0, I) outside the NNs
3. Compute z1 , . . . , zM via reparameterization
4. Pass each of z1 , . . . , zM through the Decoder to compute x̂θ (zj )
5. Compute (1/M) Σj=1..M ∥xi − x̂θ (zj )∥22
• Backward Pass (Encoder)
1. Compute the gradient ∇ϕ [(1/M) Σj=1..M ∥xi − x̂θ (zj )∥22 ]
2. Back-propagate through the Decoder, with its parameters θ held fixed
3. Back-propagate through the reparameterization step (which is independent of θ)
4. Back-propagate through the Encoder, and add the gradient of the KL term
• Training the Decoder

∇θ Jθ (qϕ ) = ∇θ [Eqϕ (z|x) log Pθ (x|z)], since ∇θ [DKL (qϕ (z|x)∥P (z))] = 0

• Backward Pass (Decoder)
1. Compute the gradient ∇θ [(1/M) Σj=1..M ∥xi − x̂θ (zj )∥22 ]
2. Back-propagate through the Decoder
A compact training-step sketch combining these pieces is given below.
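Compact illustrative sketch of one such training step (M = 1 Monte Carlo sample, assumed MLP architectures and log-variance parameterization; not the course's reference code). The loss being minimized is the negative ELBO: the squared-error reconstruction term plus the closed-form Gaussian KL term derived above.

```python
import torch
import torch.nn as nn

d, k = 784, 32
encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * k))  # outputs [mu, log_var]
decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))      # outputs x_hat_theta(z)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def vae_step(x):                                   # x: (B, d) minibatch
    mu, log_var = encoder(x).chunk(2, dim=1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)      # reparameterization
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                  # -E_q log P_theta(x|z), up to constants
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1).mean()
    loss = recon + kl                                             # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```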

• Inference with a trained VAE
1. Data Generation or Sampling: Consider the sampling procedure implied by Pθ (x) = ∫z Pθ (x, z) dz:
– Sample znew ∼ P (z) = N (0, I)
– Sample xnew ∼ Pθ (x|znew ), using the Decoder of the VAE
The KL term in the ELBO pushes qϕ (z|x) towards P (z) = N (0, I) for all x. If the VAE is well trained, qϕ (z|x) ≈ N (0, I) ∀x, so sampling from N (0, I) amounts (approximately) to sampling from qϕ (z|x).
x̂θ (znew ) is the output of the Decoder for znew ∼ N (0, I). Either use x̂θ (znew ) ∈ Rd directly as the novel generated data point, or sample xnew ∼ N (x̂θ (znew ), I) as the generated point.
2. Feature/Embedding Extraction or Latent Posterior Inference: Consider the trained Encoder of a VAE. The embedding or feature vector for xtest is given by either ztest = µϕ (xtest ) or ztest ∼ N (µϕ (xtest ), Σϕ (xtest )).
Embeddings can be used to build another VAE in a smaller dimension.
• VAE can be viewed as an auto-encoder with a regularized latent space (Regularized Auto-Encoder)

• Posterior Collapse in a naive VAE: In a VAE, the second term of the ELBO, DKL (qϕ (z|x)∥P (z)), is minimized, so for every x, qϕ (z|x) is forced towards P (z) = N (0, I). The decoder then finds it difficult to differentiate between two input samples xi and xj , since

qϕ (z|xi ) = qϕ (z|xj ) = N (0, I), with zi ∼ qϕ (z|xi ) and zj ∼ qϕ (z|xj )

• β-VAE: Simply weight the KL term by β, i.e., minimize

∥x − x̂θ (z)∥22 (reconstruction error) + β · DKL (qϕ (z|x)∥P (z)) (regularization error)

0 ≤ β ≤ 1, with β a hyperparameter.
Higher β → stronger regularization, which can lead towards posterior collapse but better generation.
Lower β → better reconstructions, but generation suffers because qϕ (z|x) ̸= P (z).
• Vector Quantized VAE (VQ-VAE): Current SOTA that is used in practice. In a VQ-VAE, the latent space is discrete and vector-quantized using a learnable dictionary.
• The Encoder outputs ze (xi ) directly, rather than the parameters of a distribution.
• The Decoder takes in the vector-quantized version of ze , i.e., zq (xi ).
• Suppose there are M latent vectors of dimension k each in the dictionary, L = {z1 , z2 , . . . , zM }, zj ∈ Rk . Then

zq (xi ) = zj ∗ ,  j ∗ = arg minj=1,...,M ∥ze (xi ) − zj ∥22   ← vector quantization

In practice, the latent space is designed as a tensor of latent codes: the encoder outputs ẑe (xi ) ∈ Rk×p , i.e., p vectors, and each column is vector-quantized against the dictionary.
• Objective of VQ-VAE (see the quantization sketch below):

Jθ = ∥xi − x̂θ (ẑq (xi ))∥22 + ∥ze (xi ) − zq (xi )∥22

In addition to the Encoder and Decoder parameters, the latent dictionary is also learned.
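Illustrative nearest-neighbour quantization step of a VQ-VAE (dictionary size, code dimension, and number of codes per input are assumed): each of the p encoder output vectors is replaced by its closest dictionary entry.

```python
import torch

M, k, p = 512, 64, 49                      # dictionary size, code dim, codes per input (assumed)
codebook = torch.randn(M, k)               # learnable latent dictionary L = {z_1, ..., z_M}

def quantize(z_e):                         # z_e: (p, k) encoder output for one input x_i
    dists = torch.cdist(z_e, codebook) ** 2          # squared distances, shape (p, M)
    j_star = dists.argmin(dim=1)                     # index of the nearest code per column
    return codebook[j_star]                          # z_q: (p, k), the quantized latents

# z_q = quantize(torch.randn(p, k)); during training, gradients are usually passed
# through this non-differentiable lookup with a straight-through estimator.
```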
• Embedding Extraction: simply the vector-quantized version of the encoder output for the input xtest .
• Sampling in VQ-VAE: To sample from the decoder of a VQ-VAE, we need to sample from the distribution of ẑq (·). To learn this distribution, a generative model is fit to samples of zq , e.g., a GMM on the quantized encoder outputs corresponding to all input data points. One then samples from this learned distribution of zq and uses the sample as input to the Decoder.
• Due to the cumbersome nature of sampling, VQ-VAE is typically not used for generation or sampling.

4 Denoising Diffusion Probabilistic Models (DDPM)
4.1 Introduction
4.2 Diffusion Models
4.3 Conditional Diffusion Models and Score-based Models

5 Auto-Regressive Models and LLMs


5.1 Introduction
5.2 Models, Sampling, Inference, and Quantization Methods
5.3 Reinforcement Learning based Alignment Methods
