Revision IIT
https://github.com/Chandan-IISc/IITM_GenAI/tree/main
Upon solving the above optimization problem, the distribution Px is implicitly estimated by gθ(z), and one can sample from Px using gθ(z): a sample z ∼ N(0, I), passed through gθ, produces a sample from Pθ*(x̂), which is close to Px, hence we end up (approximately) sampling from Px.
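A minimal sketch of this sampling procedure (assuming a hypothetical trained generator module, here called g_theta, with latent dimension latent_dim; the names are illustrative):

```python
import torch

# Sketch: draw z ~ N(0, I) and push it through the (assumed trained) generator
# to obtain approximate samples from P_x.
def sample_from_generator(g_theta, num_samples, latent_dim):
    z = torch.randn(num_samples, latent_dim)  # z ~ N(0, I)
    with torch.no_grad():
        x_hat = g_theta(z)                    # x_hat ~ P_theta*, close to P_x
    return x_hat
```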
• How to compute the divergence metric without knowing Px and Pθ ?
• What should be the choice of the divergence metric?
1. f(u) = u log u: KL-divergence

   Df(Px∥Pθ) = ∫x Pθ(x) f(Px(x)/Pθ(x)) dx
             = ∫x Pθ(x) (Px(x)/Pθ(x)) log(Px(x)/Pθ(x)) dx
             = ∫x Px(x) log(Px(x)/Pθ(x)) dx = DKL(Px∥Pθ)
forward KL = DKL (Px ∥Pθ ) ̸= DKL (Pθ ∥Px ) = reverse KL
2. f(u) = (1/2)[u log u − (u + 1) log((u + 1)/2)]: JS-divergence
3. f(u) = (1/2)|u − 1|: Total variation distance
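As a quick numerical illustration (a NumPy sketch; the two discrete pmfs p and q are arbitrary stand-ins for Px and Pθ, not part of the notes), all three choices of f plug into the same formula Df(Px∥Pθ) = Σx Pθ(x) f(Px(x)/Pθ(x)):

```python
import numpy as np

# Evaluate D_f(P_x || P_theta) = sum_x P_theta(x) * f(P_x(x) / P_theta(x))
# for the three generator functions above, on two illustrative discrete pmfs.
p = np.array([0.5, 0.3, 0.2])   # stands in for P_x
q = np.array([0.4, 0.4, 0.2])   # stands in for P_theta

def f_div(p, q, f):
    u = p / q
    return np.sum(q * f(u))

kl = f_div(p, q, lambda u: u * np.log(u))                                           # KL
js = f_div(p, q, lambda u: 0.5 * (u * np.log(u) - (u + 1) * np.log((u + 1) / 2)))   # JS
tv = f_div(p, q, lambda u: 0.5 * np.abs(u - 1))                                     # total variation
print(kl, js, tv)
```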
• Objective: An algorithm to minimize Df between Px and Pθ without knowing either density, using only samples from both.
• Key Idea: Integrals involving density functions can be approximated using samples drawn from the
distribution
Suppose we want to compute the following integral

I = ∫x h(x) Px(x) dx

where h(x) is a function and Px(x) the density function. Given samples drawn iid from Px, x1, x2, . . . , xn ∼ iid Px,

I = EPx[h(x)] ≈ (1/n) Σ_{i=1}^n h(xi), where xi ∼ iid Px
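A small NumPy sketch of this Monte Carlo approximation (the choices h(x) = x² and Px = N(0, 1) are assumptions for illustration, so the true value is E[x²] = 1):

```python
import numpy as np

# Monte Carlo estimate of I = ∫ h(x) P_x(x) dx = E_{P_x}[h(x)]
# using samples from P_x only (no access to the density itself).
rng = np.random.default_rng(0)
h = lambda x: x ** 2
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)   # x_1, ..., x_n ~ iid P_x
I_hat = np.mean(h(samples))                              # (1/n) * sum_i h(x_i)
print(I_hat)                                             # close to 1
```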
• If the f-divergence can be expressed in terms of expectations of functions w.r.t Px and Pθ , then one can
compute and optimize them.
• Conjugate function of a convex function: If f(u) is a convex function, then there exists a conjugate function f*(t) defined as follows

f*(t) = sup_{u∈dom f} {ut − f(u)}
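As a worked example of this definition (for the KL case f(u) = u log u): f*(t) = sup_u {ut − u log u}; setting the derivative to zero gives t − log u − 1 = 0, i.e., u = e^{t−1}, so f*(t) = t e^{t−1} − e^{t−1}(t − 1) = e^{t−1}.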
The space of functions T that we are optimizing over may not contain the optimal T*(x), i.e., the solution of the inner optimization problem, hence

Df(Px∥Pθ) ≥ sup_{T∈T} ∫x Pθ(x){(Px(x)/Pθ(x)) T(x) − f*(T(x))} dx
          = sup_{T∈T} [ ∫x Px(x) T(x) dx − ∫x Pθ(x) f*(T(x)) dx ]
• The best we can do is minimize this lower bound on Df rather than minimizing Df itself:

θ*, w* = arg min_θ max_w (Ex∼Px[Tw(x)] − Ex̂∼Pθ[f*(Tw(x̂))]) = arg min_θ max_w J(θ, w)
• These alternating minimization and maximization problems are called Saddle Point Optimization problems
• Solving these kinds of problems is difficult
• Any saddle point optimization problem is also called an Adversarial Problem
• The θ network is called the Generator Network and the w network is called the Critic/Discriminator
Network
• We have to make sure that the T network maps X → dom f*. In practice, we take Tw(x) = σf(Vw(x)), where σf is an f-divergence-specific activation and Vw(x) : X → R
• Hence, the final optimization problem becomes

θ*, w* = arg min_θ max_w Ex∼Px[σf(Vw(x))] − Ex̂∼Pθ[f*(σf(Vw(x̂)))]
• One gradient descent step through the Generator, treating the Discriminator as constant.
• The classifier (Discriminator) should also maximize the likelihood of x̂ not coming from Px when x̂ ∼ Pθ; the two updates alternate, as in the sketch below.
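A minimal PyTorch sketch of one such alternating iteration (G, V, sigma_f, f_star and the optimizers are illustrative names assumed to be defined elsewhere; this is a sketch of the saddle-point updates, not the course's reference implementation):

```python
import torch

# One adversarial iteration for J(theta, w) = E_{P_x}[T_w(x)] - E_{P_theta}[f*(T_w(x_hat))],
# with T_w(x) = sigma_f(V_w(x)).
def adversarial_step(G, V, sigma_f, f_star, x_real, opt_G, opt_V, latent_dim):
    # --- critic/discriminator step: maximize J over w, generator held constant ---
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z).detach()                                  # treat the generator as constant
    J = sigma_f(V(x_real)).mean() - f_star(sigma_f(V(x_fake))).mean()
    opt_V.zero_grad(); (-J).backward(); opt_V.step()        # ascend J by descending -J

    # --- generator step: minimize J over theta, critic held constant ---
    z = torch.randn(x_real.size(0), latent_dim)
    J_gen = -f_star(sigma_f(V(G(z)))).mean()                # the only theta-dependent term of J
    opt_G.zero_grad(); J_gen.backward(); opt_G.step()
```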
1. Typically in a GAN, z ∈ Rk, x ∈ Rd with k ≪ d, so the generator keeps increasing the size of its input, as sketched below.
2. Upconvolutional or Transposed Convolutional layers are used in the generators.
3. Typically used for the case when the data is images.
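An illustrative DCGAN-style generator built from transposed convolutions (all layer sizes are assumptions; only the progressive upsampling pattern matters):

```python
import torch.nn as nn

# z in R^k is progressively upsampled with transposed convolutions until it
# reaches image resolution (here 3 x 64 x 64). Sizes are illustrative.
class Generator(nn.Module):
    def __init__(self, k=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(k, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),    # 4x4
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),  # 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 32x32
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                          # 64x64
        )

    def forward(self, z):                          # z: (batch, k)
        return self.net(z.view(z.size(0), -1, 1, 1))
```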
• Conditional GAN (C-GAN)
1. Once the generator is trained, there is no way to control what kind of images the generator will
generate.
2. Data: D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} ∼ iid Pxy
3. Objective: Sample from the conditional distribution Px|y (instead of Px )
4. Solution: Estimate Px|y and make Pθ approach Px|y
5. Modify the generator and discriminator to operate on the samples of the conditional distribution.
6. Simply pass an additional input y to the generator.
7. Similarly, the discriminator also takes an additional input y.
8. The objective function becomes

θ*, w* = arg min_θ max_w E(x,y)∼Pxy[log Dw(x, y)] + Ez∼Pz, y∼Py[log(1 − Dw(gθ(z, y), y))]
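One common way to realize points 6 and 7 is to embed y and concatenate it with the usual inputs; the sketch below assumes simple fully connected networks and illustrative sizes (not the course's reference architecture):

```python
import torch
import torch.nn as nn

# Conditional GAN sketch: both networks receive the label y, embedded and
# concatenated with their usual input.
class CondGenerator(nn.Module):
    def __init__(self, latent_dim=64, num_classes=10, data_dim=784):
        super().__init__()
        self.embed = nn.Embedding(num_classes, num_classes)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
            nn.Linear(256, data_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self, num_classes=10, data_dim=784):
        super().__init__()
        self.embed = nn.Embedding(num_classes, num_classes)
        self.net = nn.Sequential(
            nn.Linear(data_dim + num_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))
```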
• If we can find a Dw with 100% accuracy, then Df (Px ∥Pθ ) will be independent of the generator parameter θ.
• Solution: Use a 'softer' divergence metric that does not saturate when the manifolds supporting Px and Pθ do not align.
• Suppose Px and Px̂ are two 1-D discrete pmfs. The mass in Px can be redistributed such that Px transforms into Px̂.
• We need a way to quantify the effort involved in the seemingly infinite transport plans.
• Suppose some mass is moved from x to x̂
• Given that multiple transport plans (joint distribution) exist between x and x̂, which of them corresponds
to the least amount of work?
min_{λ∈Π(x,x̂)} Eλ(x,x̂)[∥x − x̂∥] = W(Px∥Px̂)

W(Px∥Pθ) = max_{∥Tw∥L ≤ 1} [Ex∼Px[Tw(x)] − Ex̂∼Pθ[Tw(x̂)]]
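A sketch of one critic update under this objective, using weight clipping to enforce the Lipschitz constraint as in the original WGAN (gradient penalty is a common alternative); T_w, G and the optimizer are assumed to be defined elsewhere:

```python
import torch

# One WGAN critic step: maximize E_{P_x}[T_w(x)] - E_{P_theta}[T_w(x_hat)]
# while keeping T_w (approximately) 1-Lipschitz via weight clipping.
def critic_step(T_w, G, x_real, opt_w, latent_dim, clip=0.01):
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z).detach()
    loss = -(T_w(x_real).mean() - T_w(x_fake).mean())   # ascend the difference via its negative
    opt_w.zero_grad(); loss.backward(); opt_w.step()
    with torch.no_grad():
        for p in T_w.parameters():                      # crude Lipschitz constraint
            p.clamp_(-clip, clip)
```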
• BiGAN introduces an additional Encoder network Eϕ : X → Z.
• In BiGAN, the discriminator Dw is designed to classify between the data tuples of the form (x, Eϕ (x)) and
(gθ (z), z).
LBiGAN(θ, w, ϕ) = Ex∼Px[Eẑ∼Pϕ(ẑ|x)[log Dw(x, Eϕ(x))]] + Ez∼Pz[Ex̂∼Pθ(x̂|z)[log(1 − Dw(gθ(z), z))]]
• Once a Bi-GAN is trained, the gθ∗ (z) is used for generation and Eϕ∗ (x) is used for inversion.
• It can be shown that the optimum of LBiGAN is achieved when the two joint distributions over (data, latent) pairs match,

Pẑx = Pzx̂, where Pẑx(ẑ, x) = Px(x) Pϕ(ẑ|x) and Pzx̂(z, x̂) = Pz(z) Pθ(x̂|z)
• GAN Inversion via Latent Regression: Add a simple encoder network Eϕ(x̂) = ẑ and penalize ∥z − ẑ∥₂²:

LLat-reg = L(θ, w, ϕ) = [Ex∼Px[log Dw(x)] + Ex̂∼Pθ[log(1 − Dw(x̂))]] + λ Ex̂∼Pθ[∥z − Eϕ(x̂)∥₂²], where x̂ = gθ(z)

where λ is a hyperparameter.
• However, it has been found that modifying the discriminator network and solving for the joint distribution
leads to a better inversion quality.
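A sketch of the latent-regression term by itself (G and E_phi are assumed networks; the GAN part of the loss is unchanged):

```python
import torch

# GAN inversion via latent regression: train an encoder E_phi so that
# E_phi(g_theta(z)) ≈ z, using the squared-error term from the objective above.
def latent_regression_loss(G, E_phi, batch_size, latent_dim):
    z = torch.randn(batch_size, latent_dim)
    x_hat = G(z)                                   # x_hat ~ P_theta
    z_hat = E_phi(x_hat)
    return ((z - z_hat) ** 2).sum(dim=1).mean()    # E[ ||z - E_phi(x_hat)||_2^2 ]
```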
• Objective: Given Ds and DT, learn features/a classifier that perform well on both Ps and PT
• Domain Adversarial Networks (DAN): A feature network ϕ(x) takes both source samples xs and target samples xT as input and outputs features fs and fT. A discriminator Dw(f) takes fs and fT as input and outputs a value in [0, 1]. This is just a GAN that matches Pfs and PfT:

ϕ*, w* = arg min_ϕ max_w EPfs[log Dw(fs)] + EPfT[log(1 − Dw(fT))]
• We learn the ϕ network in such a way that the feature vectors obtained for the source and target distributions follow the same distribution.
• To ensure that the feature vectors remain useful for the classification task, we also learn a classifier simultaneously (see the sketch below).
• This classifier hψ takes inputs only from the source domain and predicts the corresponding source labels.
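A sketch of one training iteration (the original domain-adversarial formulation uses a gradient-reversal layer; the alternating updates below are a simple variant with the same intent; phi, D_w with a sigmoid output of shape (batch, 1), h_psi, and the two optimizers are assumed):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
ce = nn.CrossEntropyLoss()

def dan_step(phi, D_w, h_psi, x_src, y_src, x_tgt, opt_feat_cls, opt_disc):
    # 1) Discriminator step: label source features 1, target features 0
    f_s, f_t = phi(x_src).detach(), phi(x_tgt).detach()
    d_loss = bce(D_w(f_s), torch.ones(f_s.size(0), 1)) + \
             bce(D_w(f_t), torch.zeros(f_t.size(0), 1))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Feature extractor + classifier step: make target features look like
    #    source features (fool D_w) while fitting the source labels with h_psi
    adv_loss = bce(D_w(phi(x_tgt)), torch.ones(x_tgt.size(0), 1))
    cls_loss = ce(h_psi(phi(x_src)), y_src)
    opt_feat_cls.zero_grad(); (adv_loss + cls_loss).backward(); opt_feat_cls.step()
```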
3 Generative Modeling via Variational Auto-Encoding
3.1 Latent Variable Models
• Suppose we have data D = {x1 , x2 , . . . , xn } ∼ iid Px
• Suppose Pθ denotes our parametric model. log Pθ(x) is known as the Log-Likelihood function of Pθ(x), and fitting θ* = arg max_θ Σi log Pθ(xi) is hence known as Maximum Likelihood Estimation
• Jensen's Inequality: log E[·] ≥ E[log(·)]
L(θ) = log ∫z Pθ(x, z) (q(z|x)/q(z|x)) dz
     = log ∫z q(z|x) (Pθ(x, z)/q(z|x)) dz
     = log Eq(z|x)[Pθ(x, z)/q(z|x)]
     ≥ Eq(z|x)[log(Pθ(x, z)/q(z|x))]        (by Jensen's inequality)

L(θ) ≥ Jθ(q), called the Evidence Lower Bound (ELBO)
• Jθ (q) is a function of both the model parameters θ and the density on z, q(z|x)
• q(z|x) is known as variational latent posterior
• θ*, q* = arg max_{θ,q} Jθ(q): maximize the ELBO as a tractable surrogate for L(θ)
• The fundamental problem solved in any latent variable generative model is ELBO maximization:

θ*, q* = arg max_{θ,q} Eq(z|x)[log(Pθ(x, z)/q(z|x))]
Pθ(x) = Σz Pθ(x, z) = Σ_{j=1}^M Pθ(z = j) Pθ(x|z = j)
• Then, θ*_{t+1} can be found by simple differentiation
• The ELBO can be optimized for a latent variable model using the EM algorithm, provided Pθ(z|x) can be computed; a sketch follows.
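A NumPy sketch of EM for a 1-D Gaussian mixture, the standard case where Pθ(z|x) has a closed form (the initialization and number of iterations are arbitrary illustrative choices):

```python
import numpy as np

# EM for P_theta(x) = sum_j pi_j N(x; mu_j, var_j) with M components.
def em_gmm_1d(x, M=2, iters=50):
    rng = np.random.default_rng(0)
    pi = np.full(M, 1.0 / M)
    mu = rng.choice(x, M)
    var = np.full(M, np.var(x))
    for _ in range(iters):
        # E-step: responsibilities r_ij = P_theta(z = j | x_i)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates (the "simple differentiation" above)
        Nj = r.sum(axis=0)
        pi = Nj / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return pi, mu, var
```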
3.4 Variational Auto-Encoders
• Goal of a Neural Latent Variable Model
1. Learn a Latent Variable Model with unknown Pθ (z|x)
2. Enable sampling or generation from Pθ (x|z)
3. Enable posterior Inference, compute or estimate q(z|x)
• Consider our objective function
Jθ(q) = Eq(z|x)[log(Pθ(x, z)/q(z|x))]
      = Eq(z|x)[log(Pθ(x|z) Pθ(z)/q(z|x))]
      = Eq(z|x)[log Pθ(x|z)] − Eq(z|x)[log(q(z|x)/Pθ(z))]
      = Eq(z|x)[log Pθ(x|z)] − DKL(q(z|x)∥Pθ(z))
• Gradients cannot be back-propagated through the entire pipeline because the sampling procedure is
non-differentiable.
• Reparameterization Trick: Express the distribution qϕ(z|x) in terms of another distribution (auxiliary RV) which is independent of the parameter ϕ
• Suppose there exists an auxiliary RV ϵ ∼ Pϵ, where Pϵ does not depend on ϕ, and z = gϕ(ϵ). Then

Eqϕ(z)[fϕ(z)] = EPϵ[fϕ(gϕ(ϵ))]   (Law of the unconscious statistician, or LOTUS)
• qϕ(z|x) = N(z; µϕ(x), Σϕ(x)). Let ϵ ∼ N(0, I); then z = µϕ(x) + Σϕ(x)^{1/2} ϵ = gϕ(ϵ)
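A minimal PyTorch sketch of this reparameterization for a diagonal Gaussian posterior (assuming the encoder outputs the mean and the log-variance):

```python
import torch

# z = mu_phi(x) + sigma_phi(x) * eps, with eps ~ N(0, I) independent of phi,
# so gradients w.r.t. phi flow through mu and sigma.
def reparameterize(mu, log_var):
    eps = torch.randn_like(mu)            # eps ~ N(0, I), no dependence on phi
    return mu + torch.exp(0.5 * log_var) * eps
```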
• Inverse CDF method: Suppose z is a RV and Fz(z) denotes its CDF; then z = Fz⁻¹(u) with u ∼ Uniform(0, 1) is another way to reparameterize sampling from z
• To compute log Pθ (x|g(ϵj )): Parameterize Pθ (x|z) via some known distribution and use the Decoder to
output its parameters
• Backward Pass
1. Compute the gradient ∇ϕ (1/M) Σ_{j=1}^M ∥xi − x̂θ(zj)∥₂²
2. Through the Decoder, with fixed parameters θ
3. Through the reparameterization step, zj = gϕ(ϵj)
4. Through the Encoder, adding the gradient of the KL term
• Training the Decoder
• Backward Pass
1. Compute the gradient ∇θ (1/M) Σ_{j=1}^M ∥xi − x̂θ(zj)∥₂²
2. Through the Decoder
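Putting the pieces together, a minimal VAE sketch with a Gaussian posterior and a squared-error reconstruction term (all sizes and architecture choices are illustrative, not the course's reference code):

```python
import torch
import torch.nn as nn

# Encoder outputs (mu, log_var); the reparameterized z goes through the decoder;
# the negative ELBO = reconstruction + KL is backpropagated through both networks.
class VAE(nn.Module):
    def __init__(self, d=784, k=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * k))
        self.dec = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                   # -E_q[log P_theta(x|z)] up to constants
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum(dim=1).mean()  # D_KL(q_phi(z|x) || N(0, I))
    return recon + kl
```

A training step is then `x_hat, mu, log_var = vae(x); loss = vae_loss(x, x_hat, mu, log_var); loss.backward(); opt.step()` (with `opt` a hypothetical optimizer over all parameters), and generation amounts to decoding znew ∼ N(0, I) with `vae.dec`.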
• Inference with a trained VAE
1. Data Generation or Sampling: Consider the following sampling procedure, based on Pθ(x) = ∫z Pθ(x, z)dz
   – Sample a znew ∼ P(z) = N(0, I)
   – Sample a xnew ∼ Pθ(x|znew), using the Decoder of the VAE
   The KL term in the ELBO will ensure that qϕ(z|x) ≈ P(z) = N(0, I) ∀x.
   If the VAE is well trained, qϕ(z|x) ≈ N(0, I) ∀x
   =⇒ sampling from N(0, I) =⇒ sampling from qϕ(z|x)
   x̂θ(znew) is the output of the Decoder for znew ∼ N(0, I).
   Either use x̂θ(znew) as the novel generated data point, x̂θ(znew) ∈ Rd, or draw xnew ∼ N(x̂θ(znew), I) as the generated point.
2. Feature/Embedding Extraction or Latent Posterior Inference: Consider the trained Encoder of a VAE. The embedding or feature vector for xtest is given by either ztest = µϕ(xtest) or ztest ∼ N(µϕ(xtest), Σϕ(xtest)).
   Embeddings can be used to build another VAE in a smaller dimension.
• VAE can be viewed as an auto-encoder with a regularized latent space (Regularized Auto-Encoder)
• Posterior Collapse in a naive VAE: In a VAE, the second term of the ELBO, DKL(qϕ(z|x)∥P(z)), is minimized, so ∀x, qϕ(z|x) is forcefully made the same as P(z) = N(0, I). The Decoder then finds it difficult to differentiate between two input samples xi and xj.
• Weight the KL term by a hyperparameter β, 0 ≤ β ≤ 1 (the β-VAE objective):

Jθ(q) = Eqϕ(z|x)[log Pθ(x|z)] − β DKL(qϕ(z|x)∥P(z))

Higher β → closer to posterior collapse, but better generation
Lower β → reconstructions are better, but generation will suffer because qϕ(z|x) ≠ P(z)
• Vector Quantized VAE: Current SOTA that is used in practice. In Vq-VAE, the latent space is discrete
& vector-quantized using a learnable dictionary.
• The Encoder gives the samples of ze (xi ) directly rather than the parameters of the distribution.
• The Decoder takes in the vector-quantized version of ze , i.e., zq (xi ).
• Suppose there are M latent vectors of dimension k each in the dictionary, L = {z1, z2, . . . , zM}, zi ∈ Rk.
  In practice, the latent space is designed as a tensor over the latent dictionary: the Encoder outputs ẑe(xi) ∈ Rk×p, i.e., p vectors of dimension k, and each column is vector-quantized against the dictionary.
• Objective of Vq-VAE (see the sketch below):

Jθ(q) = ∥x − x̂θ(ẑq)∥₂² + ∥ze(x) − zq(x)∥₂²

In addition to the Encoder and Decoder parameters, the latent dictionary is also learned.
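A sketch of the vector-quantization step (the straight-through gradient trick below is the usual implementation device and is an assumption beyond the simplified objective above):

```python
import torch

# Each encoder output column is replaced by its nearest dictionary vector.
def vector_quantize(z_e, dictionary):
    # z_e: (batch, p, k) encoder outputs; dictionary: (M, k) learnable codebook
    dists = ((z_e.unsqueeze(2) - dictionary[None, None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(dim=-1)                    # (batch, p) nearest code index per column
    z_q = dictionary[idx]                         # (batch, p, k) quantized latents
    commit = ((z_e - z_q.detach()) ** 2).mean()   # second term of the objective (encoder side)
    z_q_st = z_e + (z_q - z_e).detach()           # straight-through: decoder grads reach the encoder
    return z_q_st, idx, commit
```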
• Embedding Extraction: Just the vector quantized version of the output of the encoder given input xtest .
• Sampling in Vq-VAE: To be able to sample from the Decoder of a Vq-VAE, we need to sample from the distribution of ẑq(·). To learn this distribution, a generative model is fit to samples of zq, e.g., a GMM on the (quantized) Encoder outputs for all input data points. Sample from the learned distribution of zq and use it as the input to the Decoder, as in the sketch below.
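A sketch of this two-stage sampling recipe, following the GMM suggestion above (scikit-learn's GaussianMixture stands in for the latent prior; the decoder and the collected latents are assumed to exist):

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

def fit_latent_prior(zq_samples, n_components=10):
    # zq_samples: (N, p * k) flattened quantized latents collected from the Encoder
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(zq_samples)
    return gmm

def sample_from_vqvae(gmm, decoder, num_samples, p, k):
    zq_new, _ = gmm.sample(num_samples)            # draw new latents from the learned prior
    zq_new = zq_new.reshape(num_samples, p, k).astype(np.float32)
    return decoder(torch.from_numpy(zq_new))       # decode into data space
```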
• Due to the cumbersome nature of sampling, Vq-VAE is typically not used for generation or sampling.
4 Denoising Diffusion Probabilistic Models (DDPM)
4.1 Introduction
4.2 Diffusion Models
4.3 Conditional Diffusion Models and Score-based Models