Diffusion models for machine learning
Tran Trong Khiem
AI lab training
2024/08/01
1 Introduction
2 NCSN
3 DDPM
4 References
Introduction
Math for machine learning:
• Complete the foundations chapters in Probabilistic Machine Learning [1].
• Probability and Statistics.
• Linear Algebra.
• Optimization.
Generative AI:
• GANs
• VAEs
• Flow-based models
• Diffusion models
Generative AI
Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions.
1 Likelihood-based models: directly learn the distribution's probability density (or mass) function via (approximate) maximum likelihood (VAEs, EBMs, ...).
• Cons: require strong restrictions on the model architecture to
ensure a tractable normalizing constant for likelihood
computation.
2 Implicit generative models: the probability distribution is implicitly represented by a model of its sampling process (GANs, ...).
• Cons: training is often unstable and can suffer from mode collapse.
Diffusion models introduce another way to represent probability distributions that circumvents several of these limitations.
Diffusion model
The key idea is to model the gradient of the log probability density function, known as the score function.
• Score-based models do not require a tractable normalizing constant and can be learned directly by score matching; they can match or outperform GANs in image generation.
Notation:
• The dataset consists of N i.i.d. samples xi ∈ R^D drawn from an unknown data distribution pdata(x).
• The score of a probability density p(x) is defined as ∇x log p(x).
• The score network sθ : R^D → R^D is trained to approximate the score of pdata(x).
The framework of score-based generative modeling:
1 score matching
2 Langevin dynamics.
Framework of score-based generative modeling
Score matching:
• Train a score network sθ(x) to estimate ∇x log pdata(x) without training a model to estimate pdata(x).
Langevin dynamics:
• Produce samples from a probability density p(x) using only its score function ∇x log p(x).
Figure 1: Score-based generative modeling with score matching + Langevin dynamics.
Score matching for score estimation
Goal: train a score network sθ (x) to estimate ∇x log pdata (x).
The objective minimizes:
  Epdata[ ∥sθ(x) − ∇x log pdata(x)∥₂² ]
which can be shown to be equivalent, up to a constant, to:
  Epdata(x)[ tr(∇x sθ(x)) + (1/2)∥sθ(x)∥₂² ]
Problem: Score matching is not scalable to deep networks and high-
dimensional data due to the computation of tr(∇x sθ (x)).
Solution: There are two popular methods for large-scale score matching.
1 Denoising score matching
2 Sliced score matching
Score matching for score estimation (cont.)
Denoising score matching:
• completely circumvents tr(∇x sθ (x)).
• perturbs the data point x with a prespecified noise distribution qσ(x̃ | x).
• employs score matching to estimate the score of the perturbed data:
  Eqσ(x̃|x)pdata(x)[ ∥sθ(x̃) − ∇x̃ log qσ(x̃ | x)∥₂² ]
• However, s*θ(x) = ∇x log qσ(x) ≈ ∇x log pdata(x) holds only when the noise is small enough that qσ(x) ≈ pdata(x).
Sliced score matching:
• uses random projections to approximate tr(∇x sθ (x)).
• The objective is:
  Epv Epdata[ vᵀ∇x sθ(x) v + (1/2)∥sθ(x)∥₂² ]
• pv is a simple distribution of random vectors.
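To make the denoising variant concrete, here is a minimal PyTorch-style sketch of the denoising score matching loss for the Gaussian kernel qσ(x̃ | x) = N(x̃ | x, σ²I); score_net is an assumed placeholder for any network sθ : R^D → R^D.

import torch

def denoising_score_matching_loss(score_net, x, sigma):
    """Denoising score matching with Gaussian kernel q_sigma(x_tilde | x) = N(x_tilde; x, sigma^2 I).
    The regression target is the analytic score of the kernel: -(x_tilde - x) / sigma^2."""
    noise = torch.randn_like(x)            # epsilon ~ N(0, I)
    x_tilde = x + sigma * noise            # sample x_tilde ~ q_sigma(x_tilde | x)
    target = -(x_tilde - x) / sigma**2     # grad_{x_tilde} log q_sigma(x_tilde | x)
    pred = score_net(x_tilde)              # s_theta(x_tilde)
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()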
Sampling with Langevin dynamics
Goal: produce samples from a probability density p(x) using only the
score function ∇x log p(x).
• Given a fixed step size ϵ > 0, and an initial value x̃0 ∼ π(x)
• π is a prior distribution.
• The Langevin method recursively computes:
  x̃t = x̃t−1 + (ϵ/2) ∇x log p(x̃t−1) + √ϵ zt,   where zt ∼ N(0, I)
• The distribution of x̃T equals p(x) when ϵ → 0 and T → ∞.
• In practice, ϵ is small and T is large.
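A minimal sketch of this sampler (PyTorch-style; score_fn is an assumed stand-in for the learned score sθ or the true score ∇x log p):

import torch

def langevin_sampling(score_fn, x_init, eps=1e-4, n_steps=1000):
    """Unadjusted Langevin dynamics:
    x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t,  z_t ~ N(0, I)."""
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + (eps ** 0.5) * z
    return x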
Challenges of score-based generative modeling
Inaccurate score estimation with score matching:
• In score matching, we minimize:
  Epdata[ ∥sθ(x) − ∇x log pdata(x)∥₂² ] = ∫ pdata(x) ∥sθ(x) − ∇x log pdata(x)∥₂² dx
• Since the squared error is weighted by pdata(x), it is largely ignored in low-density regions where pdata(x) is small.
Figure 2: Estimated scores are only accurate in high-density regions.
How to bypass the inaccurate score estimation in regions of
low data density?
Observation: perturbing data with random Gaussian noise makes the
data distribution more amenable to score-based generative modeling.
• large Gaussian noise has the effect of filling low density regions
in the original distribution.
This intuition is the key idea behind Noise Conditional Score Networks (NCSN):
1 perturbing the data using various levels of noise.
2 simultaneously estimating scores corresponding to all noise levels
by training a single conditional score network.
Noise Conditional Score Networks
Problem: How to choose an appropriate noise scale for the perturbation process?
• Larger noise over-corrupts the data and alters it significantly from
the original distribution.
• Smaller noise, on the other hand, causes less corruption of the original data, but does not cover the low-density regions as well.
Solution: Use multiple scales of noise perturbations simultaneously.
Notation:
• Let σ1 > σ2 > · · · > σL > 0 be a geometrically decreasing sequence of noise levels.
• qσ(x) = ∫ pdata(t) N(x | t, σ²I) dt is the perturbed data distribution.
• sθ(x, σ) is a Noise Conditional Score Network (NCSN).
• Train the model to jointly estimate the scores of all perturbed data distributions:
  ∀σ ∈ {σ1, . . . , σL} : sθ(x, σ) ≈ ∇x log qσ(x)
Learning NCSNs via score matching
Adapt denoising score matching for learning NCSNs.
• choose the noise distribution to be qσ(x̃ | x) = N(x̃ | x, σ²I),
• therefore ∇x̃ log qσ(x̃ | x) = −(x̃ − x)/σ².
• For a given σ, the denoising score matching objective is:
  L(θ; σ) = (1/2) Epdata(x) Ex̃∼N(x, σ²I)[ ∥sθ(x̃, σ) + (x̃ − x)/σ²∥₂² ]
• We combine over all σ ∈ {σ1, . . . , σL} to get one unified objective:
  L(θ; {σi}) = (1/L) Σ_{i=1}^L λ(σi) L(θ; σi)
  where λ(σi) > 0 is a weighting coefficient, commonly chosen as λ(σ) = σ².
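A sketch of this unified objective, assuming a PyTorch score_net taking (x̃, σ) and the λ(σ) = σ² weighting suggested in the NCSN paper:

import torch

def ncsn_loss(score_net, x, sigmas):
    """Average of weighted denoising score matching losses over all noise levels."""
    total = 0.0
    for sigma in sigmas:
        noise = torch.randn_like(x)
        x_tilde = x + sigma * noise
        target = -(x_tilde - x) / sigma**2          # score of the perturbation kernel
        pred = score_net(x_tilde, sigma)            # noise-conditional score s_theta(x_tilde, sigma)
        per_sigma = 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()
        total = total + (sigma ** 2) * per_sigma    # lambda(sigma) * L(theta; sigma)
    return total / len(sigmas)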
NCSN inference via annealed Langevin dynamics
• Sampling uses annealed Langevin dynamics: run Langevin dynamics at each noise level, from the largest σ1 down to the smallest σL, using the samples from each level to initialize the next.
Figure 3: Annealed Langevin dynamics.
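A sketch of this procedure, following Algorithm 1 of the NCSN paper (step size αi = ϵ σi²/σL²); score_net(x, σ) is an assumed noise-conditional score network:

import torch

def annealed_langevin_dynamics(score_net, x_init, sigmas, eps=2e-5, T=100):
    """Annealed Langevin dynamics: for each noise level sigma_1 > ... > sigma_L,
    run T Langevin steps with step size alpha_i = eps * sigma_i^2 / sigma_L^2,
    using the final samples at one level to initialize the next."""
    x = x_init.clone()
    sigma_L = sigmas[-1]
    for sigma in sigmas:                             # anneal from largest to smallest noise
        alpha = eps * (sigma ** 2) / (sigma_L ** 2)
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + (alpha ** 0.5) * z
    return x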
Denoising Diffusion Probabilistic Models
Figure 4: Diffusion model
Forward diffusion process:
• Add a small amount of Gaussian noise to the sample over T steps,
• producing a sequence of noisy samples x1, x2, . . . , xT.
• This converts any complex data distribution into a simple, tractable distribution.
Reverse diffusion process:
• Learn a reversal of the forward diffusion process.
Forward process
Gradually adds Gaussian noise to the data according to a variance schedule β1, . . . , βT:
  q(x1:T | x0) := ∏_{t=1}^T q(xt | xt−1),   q(xt | xt−1) := N(xt; √(1 − βt) xt−1, βt I)
Nice property: We can sample xt at any timestep t in closed form:
  xt = √αt xt−1 + √(1 − αt) ϵt = √ᾱt x0 + √(1 − ᾱt) ϵ
• ϵt, ϵ ∼ N(0, I)
• αt = 1 − βt and ᾱt = ∏_{s=1}^t αs
• Thus we have: q(xt | x0) = N(xt; √ᾱt x0, (1 − ᾱt) I)
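A small sketch of this closed-form sampling, assuming PyTorch, a scalar timestep t, and the linear β schedule used in the DDPM paper:

import torch

# Example variance schedule (linear, beta_1 = 1e-4 to beta_T = 0.02, T = 1000).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s <= t} alpha_s

def q_sample(x0, t, alpha_bar):
    """Sample x_t directly from x_0 via q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I),
    for a scalar integer timestep t."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t]
    return (a_bar ** 0.5) * x0 + ((1.0 - a_bar) ** 0.5) * eps, eps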
Reverse diffusion process
Goal: Learn to reverse the forward process and sample from q(xt−1 |xt ).
• Use pθ (xt−1 |xt ) to approximate q(xt−1 |xt ).
• The reverse conditional probability is tractable when conditioned on x0:
  q(xt−1 | xt, x0) = N(xt−1; μ̃(xt, x0), β̃t I)
• β̃t = ((1 − ᾱt−1)/(1 − ᾱt)) βt
• μ̃(xt, x0) = (√αt (1 − ᾱt−1)/(1 − ᾱt)) xt + (√ᾱt−1 βt/(1 − ᾱt)) x0
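A sketch of these posterior parameters in PyTorch, assuming precomputed 1-D schedule tensors (betas, alphas, alpha_bar) and a scalar integer timestep t:

import torch

def q_posterior(x0, xt, t, betas, alphas, alpha_bar):
    """Mean and variance of the tractable posterior q(x_{t-1} | x_t, x_0);
    alpha_bar_{t-1} is taken as 1 when t == 0."""
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    coef_xt = alphas[t].sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    coef_x0 = a_bar_prev.sqrt() * betas[t] / (1.0 - a_bar_t)
    mean = coef_xt * xt + coef_x0 * x0                       # mu_tilde(x_t, x_0)
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * betas[t]    # beta_tilde_t
    return mean, var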
Reverse diffusion process (cont.)
Training is performed by optimizing the usual variational bound on the negative log likelihood:
  Eq[− log pθ(x0)] ≤ Eq[ − log (pθ(x0:T) / q(x1:T | x0)) ]
  = Eq[ − log p(xT) − Σ_{t=1}^T log (pθ(xt−1 | xt) / q(xt | xt−1)) ] =: L
The loss function can be rewritten as:
  L = Eq[ DKL(q(xT | x0) ∥ p(xT)) + Σ_{t>1} DKL(q(xt−1 | xt, x0) ∥ pθ(xt−1 | xt)) − log pθ(x0 | x1) ]   (1)
Label each component of the variational lower bound separately, LVLB = Σ_{t=0}^T Lt:
• LT = DKL(q(xT | x0) ∥ p(xT))
• Lt = DKL(q(xt | xt+1, x0) ∥ pθ(xt | xt+1)) for 1 ≤ t ≤ T − 1
• L0 = − log pθ(x0 | x1)
Reverse diffusion process (cont.)
The loss term Lt is parameterized and simplified to minimize:
  Lt^simple = Et∼[1,T], x0, ϵt[ ∥ϵt − ϵθ(√ᾱt x0 + √(1 − ᾱt) ϵt, t)∥² ]
Figure 5: Training process.
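A sketch of one training step under this simplified objective, assuming PyTorch and an assumed noise-prediction network eps_net(xt, t):

import torch
import torch.nn.functional as F

def ddpm_training_step(eps_net, x0, alpha_bar):
    """Sample t and epsilon, form x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) epsilon,
    and regress the network output onto the true noise with an MSE loss."""
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)                 # t ~ Uniform over timesteps
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))           # broadcast over data dims
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    eps_pred = eps_net(x_t, t)                                      # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)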
References
1 Weng, Lilian. (Jul 2021). What are diffusion models? Lil'Log.
2 Ho, Jonathan; Jain, Ajay; Abbeel, Pieter. Denoising Diffusion Probabilistic Models.
3 Song, Yang; Ermon, Stefano. Generative Modeling by Estimating Gradients of the Data Distribution.