I took a read through Ho et al.'s paper on denoising diffusion probabilistic models (DDPM). It's a pretty dense
paper, with a 3-paragraph introduction that probably could have been closer to 3 pages
given the amount of detail it leaves out. This post is meant to be those 3 pages.
VAEs and the ELBO, Summarized
In generative modelling, we wish to estimate a data distribution pθ(x). In practice it is difficult
to force a single function to capture everything about the data, such as:
multi-modality
long-range dependencies
complicated geometry
To make modelling easier, we instead introduce latent variables z, so that we can write:
pθ(x)=∫pθ(x∣z)p(z) dz
This is a continuous mixture model, and writing pθ(x) this way lets us express
complicated distributions while keeping pθ(x∣z) relatively simple. (Of course,
in practice pθ(x∣z) is itself complicated, since it is represented by a neural
network.)
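To see how a latent variable turns simple conditionals into a complicated marginal, here is a toy sketch. The model $z \sim N(0,1)$, $x \mid z \sim N(2\tanh(4z), 0.3^2)$ is my own illustrative choice, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative latent-variable model (my own toy choice, not from the paper):
# z ~ N(0, 1),  x | z ~ N(2 * tanh(4z), 0.3^2).
# Each conditional p(x|z) is a simple unimodal Gaussian, but marginalizing
# over z produces a bimodal p(x), since tanh saturates near +/-1.
z = rng.standard_normal(100_000)
x = 2.0 * np.tanh(4.0 * z) + 0.3 * rng.standard_normal(z.shape)

# Histogram of the marginal: mass piles up near x = -2 and x = +2.
hist, edges = np.histogram(x, bins=60, range=(-3, 3), density=True)
print(hist[49], hist[30], hist[10])  # density near +2, near 0, near -2
```

Even though every p(x∣z) is a single Gaussian bump, the histogram of the marginal has two well-separated modes.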
Note that in general, evaluating this integral is intractable, because of the
high-dimensional nature of z. This becomes problematic when trying to do inference
(computing p(z∣x)). Even though we can write
$$p(z\mid x) = \frac{p_\theta(x\mid z)\,p(z)}{p_\theta(x)},$$
computing the denominator is intractable. Fortunately, introducing an approximate posterior distribution qϕ(z∣x)
allows us to lower-bound the log-likelihood, as well as giving us a way to do inference:
$$\log p_\theta(x) = \log \int p_\theta(x, z)\,dz = \log \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\left[\frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] \ge \mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right]$$
The last line is due to Jensen's inequality, and the quantity
$$\mathbb{E}_{z\sim q_\phi(\cdot\mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right]$$
is known as the evidence lower bound (ELBO for short).
So the closer our estimated posterior qϕ is to the true posterior, the smaller the
gap between the true log-likelihood and the ELBO.
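The bound is easy to check numerically. Below is a sketch on a toy model where everything is tractable ($p(z) = N(0,1)$, $p_\theta(x\mid z) = N(z,1)$, so $p(x) = N(0,2)$ and the true posterior is $N(x/2, 1/2)$ — my own choice of toy model, not from the paper). When q equals the true posterior the ELBO matches log p(x) exactly; any other q gives a strictly smaller value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (my own choice, fully tractable): p(z) = N(0, 1), p(x|z) = N(z, 1).
# Then the marginal is p(x) = N(0, 2) and the true posterior is N(x/2, 1/2).
def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(x, q_mean, q_var, n=100_000):
    # Monte Carlo estimate of E_{z~q}[log p(x, z) - log q(z|x)].
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)
    return np.mean(log_normal(z, 0.0, 1.0)          # log p(z)
                   + log_normal(x, z, 1.0)          # log p(x|z)
                   - log_normal(z, q_mean, q_var))  # log q(z|x)

x = 1.3
true_ll = log_normal(x, 0.0, 2.0)   # exact log p(x)
exact_q = elbo(x, x / 2, 0.5)       # q equals the true posterior: bound is tight
rough_q = elbo(x, 0.0, 1.0)         # mismatched q: loose by KL(q || p(z|x))
print(true_ll, exact_q, rough_q)
```

With the exact posterior, the integrand is constant in z (every sample gives log p(x)), which is precisely the "zero gap" case above.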
From our discussion so far, we face three core decisions:
Defining and modelling the distribution of our latents p(z)
Modelling the true posterior distribution pθ(z∣x)
Modelling the estimated posterior distribution qϕ(z∣x)
All about Multivariate Gaussians
Since all the core modelling choices are centered around Gaussians, it's helpful to
review some properties about them. They will come in handy when we derive the training
algorithms outlined in the DDPM paper.
Three Equivalent Definitions
There are three equivalent definitions of a multivariate Gaussian:
A random variable $X = (X_1, X_2, \ldots, X_k)$ is multivariate normal iff any of the following hold:
1. There exist $\mu \in \mathbb{R}^k$ and $A \in \mathbb{R}^{k \times l}$ such that $X = AZ + \mu$, where $Z = (Z_1, \ldots, Z_l)$ with $Z_i \sim N(0,1)$ iid for all $1 \le i \le l$.
2. Every linear combination $Y = \sum_{i=1}^k a_i X_i$ is normally distributed.
3. There exist $\mu \in \mathbb{R}^k$ and a PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$ such that the characteristic function is
$$\phi_X(u) = \mathbb{E}[\exp(i u^T X)] = \exp\!\left(i u^T \mu - \tfrac{1}{2} u^T \Sigma u\right).$$
To prove equivalence, we will show that 1⟹2, 2⟹3, and 3⟹1.
1⟹2:
We expand $Y = a^T X = a^T(AZ + \mu) = \sum_{i=1}^{l} (a^T A_i) Z_i + a^T \mu$, where $A_i$ is the $i$-th column
of $A$. Since the $Z_i$'s are independent standard normals, we can conclude that $Y$ is also normally distributed,
with mean $a^T \mu$ and variance $\sum_{i=1}^{l} (a^T A_i)^2 = a^T A A^T a$.
2⟹3:
Let $Y = u^T X$. From (2), we know that $Y$ is normally distributed. Let $\mu(u), \sigma^2(u)$ denote the mean and
variance of $Y$ (implicitly dependent on $u$). We can relate the characteristic functions of $X$ and $Y$ as follows:
$$\phi_X(u) = \mathbb{E}[\exp(i u^T X)] = \mathbb{E}[\exp(i Y)] = \exp\!\left(i \mu(u) - \tfrac{1}{2}\sigma^2(u)\right),$$
using the characteristic function of a one-dimensional normal in the last step. It remains to show that
$\mu(u) = u^T \mu$ and $\sigma^2(u) = u^T \Sigma u$ for some $\mu \in \mathbb{R}^k$ and PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$.
Define $\mu = (\mu_1, \ldots, \mu_k)$ where $\mu_i = \mathbb{E}[e_i^T X] = \mathbb{E}[X_i]$.
We then have $\mathbb{E}[Y] = \mathbb{E}[u^T X] = \sum_{i=1}^k u_i \mathbb{E}[e_i^T X] = u^T \mu$.
For the variance, $\mathrm{Var}(u^T X) = \sum_{i,j} u_i u_j \mathrm{Cov}(X_i, X_j)$. Define
$\Sigma \in \mathbb{R}^{k \times k}$ by $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$. Since $\Sigma$ is a
covariance matrix, it is symmetric positive semi-definite by construction. Noting that
$\sum_{i,j} u_i u_j \mathrm{Cov}(X_i, X_j) = u^T \Sigma u$, we are done.
3⟹1:
Let $X$ be a random vector with characteristic function
$$\phi_X(u) = \exp\!\left(i u^T \mu - \tfrac{1}{2} u^T \Sigma u\right)$$
for some $\mu \in \mathbb{R}^k$ and PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$.
Since $\Sigma$ is PSD, it can be factored as $\Sigma = AA^T$. Define $Y = AZ + \mu$, where
$Z = (Z_1, \ldots, Z_l)$ with $Z_i \sim N(0,1)$ iid for all $1 \le i \le l$. We will
show that the characteristic function of $Y$ equals that of $X$, completing the proof.
We have:
$$\phi_Y(u) = \mathbb{E}[\exp(i u^T Y)] = \exp(i u^T \mu) \cdot \mathbb{E}[\exp(i (u^T A) Z)]$$
Let $b = A^T u$, so $\mathbb{E}[\exp(i (u^T A) Z)] = \mathbb{E}[\exp(i b^T Z)]$.
Since the $Z_i$'s are iid standard normals, $b^T Z \sim N(0, b^T b)$. Therefore,
$$\mathbb{E}[\exp(i b^T Z)] = \exp\!\left(-\tfrac{1}{2} b^T b\right) = \exp\!\left(-\tfrac{1}{2} u^T \Sigma u\right),$$
so $\phi_Y(u) = \exp\!\left(i u^T \mu - \tfrac{1}{2} u^T \Sigma u\right) = \phi_X(u)$, as required.
We now state the formula for the KL divergence between two Gaussians. Given $p_1 = N(\mu_1, \Sigma_1)$ and
$p_2 = N(\mu_2, \Sigma_2)$ on $\mathbb{R}^k$, the KL divergence between the two distributions is:
$$D_{KL}(p_1 \,\|\, p_2) = \frac{1}{2}\left[\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - k + \log\frac{\det \Sigma_2}{\det \Sigma_1}\right]$$
From the definition of the KL divergence, we take expectations under the distribution $p_1$.
The log-determinant term is constant with respect to $x$, and $\mathbb{E}_{p_1}[(X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1)] = k$.
For the remaining term, write $Y = X - \mu_1$, so that $X - \mu_2 = Y + (\mu_1 - \mu_2)$ and:
$$(X - \mu_2)^T \Sigma_2^{-1} (X - \mu_2) = Y^T \Sigma_2^{-1} Y + 2\, Y^T \Sigma_2^{-1} (\mu_1 - \mu_2) + (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2)$$
The first term is $Y^T \Sigma_2^{-1} Y$ with $Y = X - \mu_1$, so taking expectations gives
$\mathrm{tr}(\Sigma_2^{-1}\Sigma_1)$. Since $Y$ has zero mean, the expected value of the second term is $0$.
Finally, the third term is constant, with value $(\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2)$.
Putting this all together, we have the desired formula. Note that for isotropic Gaussians $N(\mu_1, \beta_1 I)$ and $N(\mu_2, \beta_2 I)$,
the KL divergence simplifies to:
$$D_{KL} = \frac{1}{2\beta_2}\|\mu_1 - \mu_2\|_2^2 + C$$
where $C$ collects terms that do not depend on the means.
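We can verify this simplification numerically. The sketch below implements the general closed-form KL and checks it against the mean-matching term plus an explicit constant (the particular numbers are arbitrary choices):

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    # Closed-form KL(N(mu1, S1) || N(mu2, S2)).
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

k = 4
mu1 = np.array([1.0, 0.0, -1.0, 2.0])
mu2 = np.array([0.5, 0.5, 0.0, 1.0])
b1, b2 = 0.2, 0.3  # isotropic covariances b1*I and b2*I

full = kl_gauss(mu1, b1 * np.eye(k), mu2, b2 * np.eye(k))
# Mean-dependent part plus a constant C that only involves b1, b2, and k:
C = 0.5 * (k * b1 / b2 - k + k * np.log(b2 / b1))
simplified = np.sum((mu1 - mu2) ** 2) / (2 * b2) + C
print(full, simplified)
```

Since C does not involve the means, minimizing the KL over the means is exactly a squared-error problem, which is what the DDPM loss exploits.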
Denoising Diffusion Probabilistic Models
Let's go back to our modelling paradigm. Recall that we need to make three decisions:
Defining and modelling the distribution of our latents p(z)
Modelling the true posterior distribution pθ(z∣x)
Modelling the estimated posterior distribution qϕ(z∣x)
In the DDPM paper, they make the following choices:
z is a sequence of latent variables $x_1, \ldots, x_T$ obtained by an iterative noising process. Generation
runs in the reverse direction: start from pure noise $x_T \sim N(0, I)$ and produce the
sequence $x_{T-1}, \ldots, x_1, x_0$ via a Markov chain with Gaussian transitions. Note that
$x_0 \sim q(x_0)$ is the actual data.
the true posterior is approximated by an iterative denoising process with Gaussian transitions
$p_\theta(x_{t-1} \mid x_t) = N(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$, where $\mu_\theta(\cdot), \Sigma_\theta(\cdot)$ are learned functions.
the approximate posterior $q(x_{1:T} \mid x_0)$ is fixed to a Markov chain with the transition rule
$q(x_t \mid x_{t-1}) = N(\sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$ for $1 \le t \le T$, where $\beta_t$ is a predefined variance schedule.
This process is called the forward process or the diffusion process.
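Composing the forward transitions $q(x_t \mid x_{t-1}) = N(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ gives a convenient closed form, $q(x_t \mid x_0) = N(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t) I)$, where $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s \le t} \alpha_s$. A quick numerical check, using a toy schedule of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)

T = 50
betas = np.linspace(1e-4, 0.2, T)  # toy variance schedule (my own choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = 2.0 * np.ones(200_000)  # many copies of a scalar "data point"

# Iterate the forward kernel q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t).
x = x0.copy()
for t in range(T):
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

# Closed form: q(x_T | x_0) = N(sqrt(alpha_bar_T) x_0, 1 - alpha_bar_T).
print(x.mean(), np.sqrt(alpha_bar[-1]) * 2.0)
print(x.var(), 1.0 - alpha_bar[-1])
```

This closed form is what lets DDPM sample x_t for any t in one shot during training, instead of simulating the whole chain.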
ELBO for Diffusion Models
For training, we wish to maximize the log-likelihood of the data, using the ELBO as
a proxy (equivalently, we minimize the negative ELBO). The negative ELBO decomposes as:
$$L = \mathbb{E}_q\left[ D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t=2}^{T} D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right]$$
Deriving the Simplified Loss Function for Training
In the DDPM paper, they ignore the first and last terms of the loss function and focus solely
on optimizing the $D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))$ terms.
A nice fact is that $q(x_{t-1} \mid x_t, x_0)$ is Gaussian with mean $\tilde\mu_t(x_t, x_0)$ and
covariance $\tilde\beta_t I$, where (writing $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$):
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$$
In the paper, they make the simplification $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, with $\sigma_t^2$ set
to either $\beta_t$ or $\tilde\beta_t$. Since each KL term in the loss is between two Gaussians, we can write
each term (dropping constants) as:
$$\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right]$$
From the relation $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim N(0, I)$, we can
write $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon\right)$
and substitute into $\tilde\mu_t$:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right)$$
What we have here is a regression problem where the learned mean $\mu_\theta$ needs to predict
$\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right)$. Instead of
predicting the mean directly, we predict the noise given a timestep $t$ and $x_t$. From this,
we know that the estimated $x_0$ is
$\frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon_\theta(x_t, t)\right)$,
where $\epsilon_\theta$ is our learned model. Plugging this into the formula for $\tilde\mu_t$, we
have that
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$
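This reduction turns training into noise regression: sample a timestep $t$ and noise $\epsilon$, form $x_t$ in closed form, and fit $\epsilon_\theta$ to $\epsilon$. A minimal sketch of one loss evaluation (the zero-output `eps_theta` is a hypothetical stand-in for the real network, and the schedule is my own toy choice):

```python
import numpy as np

rng = np.random.default_rng(3)

T = 100
betas = np.linspace(1e-4, 0.02, T)  # toy linear schedule (my own choice)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Hypothetical stand-in for the learned noise-prediction network.
    return np.zeros_like(x_t)

def simple_loss(x0):
    # One training step: sample a timestep and noise, form
    # x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps in closed form,
    # then regress the predicted noise onto the true noise.
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

x0 = rng.standard_normal(16)  # a fake data point
loss = simple_loss(x0)
print(loss)
```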
During sampling, we iteratively apply the rule $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$, where $z$ is a
standard Gaussian.
Using the reparameterized noise model, we have that:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
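Putting the sampling rule into a loop gives the paper's sampling algorithm: start from pure noise and denoise step by step, adding no noise at the final step. The sketch below uses a hypothetical stand-in `eps_theta` in place of a trained network, so it only exercises the mechanics:

```python
import numpy as np

rng = np.random.default_rng(4)

T = 100
betas = np.linspace(1e-4, 0.02, T)  # toy linear schedule (my own choice)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_theta(x_t, t):
    # Hypothetical stand-in for a trained noise model.
    return np.zeros_like(x_t)

# Start from pure noise x_T ~ N(0, I) and iteratively apply
# x_{t-1} = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t)
#           + sigma_t * z,   with sigma_t^2 = beta_t and no noise at the last step.
x = rng.standard_normal(16)
for t in reversed(range(T)):
    z = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
    x = mean + np.sqrt(betas[t]) * z

print(x.shape, np.all(np.isfinite(x)))
```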