I recently read Ho et al.'s paper on denoising diffusion probabilistic models (DDPM). It's a pretty dense paper, with a 3-paragraph introduction that probably could have been closer to 3 pages given the amount of detail it leaves out. This post is meant to be those 3 pages.

VAEs and the ELBO, Summarized

In generative models, we wish to estimate a data distribution $p_{\theta}(x)$. In practice, it is difficult to force a single function to capture everything about the data, such as:

  • multi-modality
  • long-range dependencies
  • complicated geometry

To make modelling easier, we instead introduce latent variables $z$, so that we can write:

$$p_{\theta}(x) = \int p_{\theta}(x \mid z)\,p(z)\,dz$$

This is a continuous mixture model: representing $p_{\theta}(x)$ this way allows us to express complicated distributions while keeping $p_{\theta}(x \mid z)$ relatively simple. Of course, in practice $p_{\theta}(x \mid z)$ is also complicated (it is represented by a neural network).

Note that in general, evaluating this integral is intractable because of the high-dimensional nature of $z$. This becomes problematic when trying to do inference (computing $p(z \mid x)$). Even though we can write $p(z \mid x) = \frac{p_{\theta}(x \mid z)p(z)}{p_{\theta}(x)}$, computing the denominator is intractable. Fortunately, introducing an approximate posterior distribution $q_{\phi}(z \mid x)$ both gives us a lower bound on the log-likelihood and provides a way to do inference.

Deriving the ELBO

We can write $\log p_{\theta}(x)$ as follows:

$$\begin{aligned} \log p_{\theta}(x) &= \log \int p_{\theta}(x, z)\,dz \\ &= \log \int p_{\theta}(x, z) \cdot \frac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)}\,dz \\ &= \log \mathbb{E}_{z \sim q_{\phi}(\cdot \mid x)}\left[\frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] \\ &\geq \mathbb{E}_{z \sim q_{\phi}(\cdot \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] \end{aligned}$$

The last line is due to Jensen's inequality, and the quantity $\mathbb{E}_{z \sim q_{\phi}(\cdot \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right]$ is known as the evidence lower bound (ELBO for short).

We can also rewrite the ELBO as follows:

$$\mathbb{E}_{z \sim q_{\phi}(\cdot \mid x)}\left[\log p_{\theta}(x) + \log p_{\theta}(z \mid x) - \log q_{\phi}(z \mid x)\right]$$

Rearranging, we get that:

$$\log p_{\theta}(x) = \text{ELBO} + D_{KL}(q_{\phi}(z \mid x) \,\Vert\, p_{\theta}(z \mid x))$$

So the closer our estimated posterior $q_{\phi}$ matches the true posterior, the smaller the gap between the true log-likelihood and the ELBO.
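To make this identity concrete, here is a small numeric check on a toy conjugate model (my choice for illustration, not from the paper): with $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, the marginal is $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $p(z \mid x) = \mathcal{N}(x/2, 1/2)$, so every piece of the decomposition is available in closed form for any Gaussian $q$:

```python
import numpy as np

# Toy conjugate model: p(z) = N(0, 1), p(x|z) = N(z, 1).
# Then p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
# For any Gaussian q(z|x) = N(m, s^2) we can evaluate both sides of
# log p(x) = ELBO + KL(q || p(z|x)) in closed form and check the identity.

def log_marginal(x):
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s2):
    # ELBO = E_q[log p(x|z)] + E_q[log p(z)] + entropy of q
    e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return e_loglik + e_logprior + entropy

def kl_to_posterior(x, m, s2):
    # KL(N(m, s2) || N(x/2, 1/2)) for scalar Gaussians
    mu2, var2 = x / 2.0, 0.5
    return 0.5 * (np.log(var2 / s2) - 1.0 + s2 / var2 + (m - mu2)**2 / var2)

x, m, s2 = 1.0, 0.3, 0.64
print(log_marginal(x), elbo(x, m, s2) + kl_to_posterior(x, m, s2))
```

The two printed numbers agree for any choice of $q$; the KL term is exactly the slack in Jensen's inequality.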

From our discussion so far, we face three core decisions:

  • Defining and modelling the distribution of our latents $p(z)$
  • Modelling the true posterior distribution $p_{\theta}(z \mid x)$
  • Modelling the estimated posterior distribution $q_{\phi}(z \mid x)$

All about Multivariate Gaussians

Since all the core modelling choices are centered around Gaussians, it's helpful to review some of their properties. They will come in handy when we derive the training algorithm outlined in the DDPM paper.

Three Equivalent Definitions

There are three equivalent definitions of a multivariate Gaussian:

A random variable $X = (X_1, X_2, \ldots, X_k)$ is a multivariate normal iff

  1. There exist $\mu \in \mathbb{R}^k$, $A \in \mathbb{R}^{k \times l}$ such that $X = AZ + \mu$, where $Z = (Z_1, \ldots, Z_l)$ with $Z_i \sim \mathcal{N}(0, 1)$ iid for all $1 \leq i \leq l$.
  2. Every linear combination $Y = \sum_{i=1}^k a_i X_i$ is normally distributed.
  3. There exist $\mu \in \mathbb{R}^k$ and a PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$ such that the characteristic function of $X$ is $\phi_X(u) = \exp\left(iu^T\mu - \frac{1}{2}u^T\Sigma u\right)$.

To prove equivalence, we will show that $1 \implies 2$, $2 \implies 3$, and $3 \implies 1$.

$1 \implies 2$:

We expand $Y = a^TX = a^T(AZ + \mu) = \sum_{i=1}^l (a^TA_i)Z_i + a^T\mu$, where $A_i$ is the $i$-th column of $A$. Since the $Z_i$'s are independent standard normals, $Y$ is also normally distributed, with mean $a^T\mu$ and variance $\sum_{i=1}^l (a^TA_i)^2 = a^TAA^Ta$.
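The variance identity in that last step is easy to sanity-check numerically (the matrix $A$ and vector $a$ below are arbitrary test values):

```python
import numpy as np

# Check that sum_i (a^T A_i)^2 over the columns A_i of A equals
# the quadratic form a^T A A^T a used in the 1 => 2 step.
rng = np.random.default_rng(0)
k, l = 4, 3
A = rng.normal(size=(k, l))
a = rng.normal(size=k)

var_from_columns = sum((a @ A[:, i])**2 for i in range(l))
var_from_matrix = a @ A @ A.T @ a
print(var_from_columns, var_from_matrix)
```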

$2 \implies 3$:

Let $Y = u^TX$. From (2), we know that $Y$ is normally distributed. Let $\mu(u), \sigma^2(u)$ denote the mean and variance of $Y$ (implicitly dependent on $u$). We can relate the characteristic functions of $X$ and $Y$ as follows:

$$\begin{aligned} \phi_X(u) &= \mathbb{E}\left[\exp\left(iu^TX\right)\right] \\ &= \mathbb{E}\left[\exp\left(iY\right)\right] \\ &= \phi_Y(1) \\ &= \exp\left(i\mu(u) - \frac{1}{2}\sigma^2(u)\right) \end{aligned}$$

It remains to show that $\mu(u) = u^T\mu$ and $\sigma^2(u) = u^T\Sigma u$ for some $\mu \in \mathbb{R}^k$ and PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$. Define $\mu = (\mu_1, \ldots, \mu_k)$ where $\mu_i = \mathbb{E}[e_i^TX] = \mathbb{E}[X_i]$. We then have $\mathbb{E}[Y] = \mathbb{E}[u^TX] = \sum_{i=1}^k u_i\mathbb{E}[e_i^TX] = u^T\mu$.

We also see that $\text{Var}(u^TX) = \sum_{i,j} u_i u_j \text{Cov}(X_i, X_j)$. Define $\Sigma \in \mathbb{R}^{k \times k}$ where $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. Since $\Sigma$ is a covariance matrix, it is symmetric positive semidefinite by construction. Noting that $\sum_{i,j} u_i u_j \text{Cov}(X_i, X_j) = u^T\Sigma u$, we are done.

$3 \implies 1$:

Let $X$ be a random vector with characteristic function $\phi_X(u) = \exp\left(iu^T\mu - \frac{1}{2}u^T\Sigma u\right)$ for some $\mu \in \mathbb{R}^k$ and PSD matrix $\Sigma \in \mathbb{R}^{k \times k}$.

Since $\Sigma$ is a PSD matrix, it can be factored as $\Sigma = AA^T$. Define $Y = AZ + \mu$, where $Z = (Z_1, \ldots, Z_l)$ with $Z_i \sim \mathcal{N}(0, 1)$ iid for all $1 \leq i \leq l$. We will show that the characteristic function of $Y$ is equal to that of $X$, completing the proof.

We have:

$$\begin{aligned} \phi_Y(u) &= \mathbb{E}\left[\exp\left(iu^TY\right)\right] \\ &= \exp\left(iu^T\mu\right) \cdot \mathbb{E}\left[\exp\left(i(u^TA)Z\right)\right] \end{aligned}$$

Let $b = A^Tu$. We can rewrite

$$\mathbb{E}\left[\exp\left(i(u^TA)Z\right)\right] = \mathbb{E}\left[\exp\left(ib^TZ\right)\right]$$

Since the $Z_i$'s are iid standard normals, $b^TZ \sim \mathcal{N}(0, b^Tb)$. Therefore,

$$\mathbb{E}\left[\exp\left(ib^TZ\right)\right] = \exp\left(-\frac{1}{2}b^Tb\right) = \exp\left(-\frac{1}{2}u^TAA^Tu\right)$$

We now have:

$$\phi_Y(u) = \exp\left(iu^T\mu\right) \cdot \exp\left(-\frac{1}{2}u^TAA^Tu\right) = \exp\left(iu^T\mu - \frac{1}{2}u^T\Sigma u\right)$$

as desired.

KL Divergence between Multivariate Gaussians

We now state the formula for the KL divergence between two multivariate Gaussians. Given $p_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $p_2 = \mathcal{N}(\mu_2, \Sigma_2)$, the KL divergence between the two distributions is:

$$D_{KL}(p_1 \,\Vert\, p_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - k + \text{Tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_1 - \mu_2)^T\Sigma_2^{-1}(\mu_1 - \mu_2)\right]$$
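As a quick sanity check before deriving it, here is a direct numpy implementation of the formula, verified against the familiar scalar-Gaussian KL in the $k = 1$ case (the test values are arbitrary):

```python
import numpy as np

# Direct implementation of the stated multivariate Gaussian KL formula.
def gaussian_kl(mu1, S1, mu2, S2):
    k = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - k
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d)

# KL of a distribution with itself should be 0.
mu = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_kl(mu, S, mu, S))

# k = 1 case: KL(N(m1,v1) || N(m2,v2)) = log(v2/v1)/2 + (v1 + (m1-m2)^2)/(2 v2) - 1/2
m1, v1, m2, v2 = 0.5, 2.0, -1.0, 0.7
scalar_kl = 0.5 * np.log(v2 / v1) + (v1 + (m1 - m2)**2) / (2 * v2) - 0.5
matrix_kl = gaussian_kl(np.array([m1]), np.array([[v1]]), np.array([m2]), np.array([[v2]]))
print(scalar_kl, matrix_kl)
```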

To show this, start off with the fact that:

$$\log p_1(x) = -\frac{k}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_1| - \frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)$$

and

$$\log p_2(x) = -\frac{k}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_2| - \frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2)$$

Therefore

$$\log\frac{p_1(x)}{p_2(x)} = -\frac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} - \frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2)$$

From the definition of the KL divergence, we now take expectations under the distribution $p_1$. The first term is constant with respect to $x$. For the second term, we have:

$$\begin{aligned} \mathbb{E}_{p_1}\left[-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\right] &= -\frac{1}{2}\mathbb{E}_{p_1}\left[\text{Tr}\left((x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\right)\right] \\ &= -\frac{1}{2}\mathbb{E}_{p_1}\left[\text{Tr}\left(\Sigma_1^{-1}(x - \mu_1)(x - \mu_1)^T\right)\right] \\ &= -\frac{1}{2}\text{Tr}\left(\Sigma_1^{-1}\mathbb{E}_{p_1}\left[(x - \mu_1)(x - \mu_1)^T\right]\right) \\ &= -\frac{1}{2}\text{Tr}\left(\Sigma_1^{-1}\Sigma_1\right) \\ &= -\frac{k}{2} \end{aligned}$$

Now, onto calculating the third term. We introduce the variables $y = x - \mu_1$ and $\delta = \mu_1 - \mu_2$, and rewrite:

$$\begin{aligned} \mathbb{E}_{p_1}\left[\frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2)\right] &= \mathbb{E}_{p_1}\left[\frac{1}{2}(y + \delta)^T\Sigma_2^{-1}(y + \delta)\right] \\ &= \frac{1}{2}\mathbb{E}_{p_1}\left[y^T\Sigma_2^{-1}y + 2y^T\Sigma_2^{-1}\delta + \delta^T\Sigma_2^{-1}\delta\right] \end{aligned}$$

The first term is $(x - \mu_1)^T\Sigma_2^{-1}(x - \mu_1)$, so taking expectations gives $\text{Tr}(\Sigma_2^{-1}\Sigma_1)$ by the same trace trick as above. Since $y$ has zero mean under $p_1$, the expected value of the second term is $0$. Finally, the third term is constant, with value $(\mu_1 - \mu_2)^T\Sigma_2^{-1}(\mu_1 - \mu_2)$.

Putting this all together, we have the desired formula. Note that for isotropic Gaussians $p_1 = \mathcal{N}(\mu_1, \beta_1 I)$ and $p_2 = \mathcal{N}(\mu_2, \beta_2 I)$, the KL divergence simplifies to:

$$D_{KL}(p_1 \,\Vert\, p_2) = \frac{1}{2\beta_2}\lVert \mu_1 - \mu_2 \rVert_2^2 + C$$

where $C$ collects constant terms that do not depend on the means.
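We can confirm this simplification numerically: for isotropic covariances, the general formula should collapse to the squared-norm term plus a constant $C = \frac{1}{2}\left(k\log\frac{\beta_2}{\beta_1} - k + k\frac{\beta_1}{\beta_2}\right)$ depending only on $\beta_1$, $\beta_2$, and $k$ (the concrete values below are arbitrary):

```python
import numpy as np

# General multivariate Gaussian KL, applied to isotropic covariances.
def gaussian_kl(mu1, S1, mu2, S2):
    k = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    d = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - k
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d)

k, b1, b2 = 3, 0.4, 0.9
mu1, mu2 = np.array([1.0, 2.0, -1.0]), np.array([0.5, 0.0, 1.0])

# Constant C from the log-det, trace, and -k terms (mean-independent).
C = 0.5 * (k * np.log(b2 / b1) - k + k * b1 / b2)

full = gaussian_kl(mu1, b1 * np.eye(k), mu2, b2 * np.eye(k))
simplified = np.sum((mu1 - mu2)**2) / (2 * b2) + C
print(full, simplified)
```

This mean-dependence-only structure is exactly why the diffusion loss below turns into a squared-error regression.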

Denoising Diffusion Probabilistic Models

Let's go back to our modelling paradigm. Recall that we need to make three decisions:

  • Defining and modelling the distribution of our latents $p(z)$
  • Modelling the true posterior distribution $p_{\theta}(z \mid x)$
  • Modelling the estimated posterior distribution $q_{\phi}(z \mid x)$

In the DDPM paper, they make the following choices:

  • $z$ is a sequence of latent variables obtained by an iterative denoising process. In particular, we start with pure noise $x_T \sim \mathcal{N}(0, I)$ and model the sequence $x_{T-1}, \ldots, x_1$ via a Markov chain with Gaussian transitions. Note that $x_0 \sim q(x_0)$ is the actual data.
  • the true posterior is approximated by a learned reverse (denoising) process with transitions $x_{t-1} \sim \mathcal{N}(\mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))$, where $\mu_{\theta}(\cdot), \Sigma_{\theta}(\cdot)$ are learned functions.
  • the approximate posterior $q(x_{1:T} \mid x_0)$ is fixed to a Markov chain with the transition rule $x_t \sim \mathcal{N}(\sqrt{1 - \beta_t}\,x_{t-1}, \beta_t I)$ for $1 \leq t \leq T$, where $\beta_t$ is a predefined variance schedule. This process is called the forward process or the diffusion process.
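The forward process is simple enough to sketch directly. Here is a minimal numpy version; the linear $\beta_t$ schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps follows the paper, while everything else is illustrative:

```python
import numpy as np

# Forward (diffusion) process: repeatedly apply the Gaussian transition
# x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear schedule from the DDPM paper

def forward_process(x0, rng):
    x = x0
    for beta in betas:
        eps = rng.normal(size=x0.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

rng = np.random.default_rng(0)
x0 = np.ones(8)  # stand-in for a data point
xT = forward_process(x0, rng)

# After T steps the remaining signal is scaled by sqrt(prod(1 - beta_t)),
# which is tiny, so x_T is approximately standard Gaussian noise.
signal_scale_sq = np.prod(1.0 - betas)
print(signal_scale_sq)  # on the order of 1e-5
```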

ELBO for Diffusion Models

For training, we wish to maximize the log-likelihood of the data, or equivalently to minimize the negative ELBO as a proxy. We write the negative ELBO as follows:

$$\mathbb{E}_q\left[-\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = \mathbb{E}_q\left[-\log p_{\theta}(x_T) - \sum_{t \geq 1}\log\frac{p_{\theta}(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right]$$

We can further rewrite this as:

$$\mathbb{E}_q\left[D_{KL}(q(x_T \mid x_0) \,\Vert\, p_{\theta}(x_T)) + \sum_{t > 1} D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_{\theta}(x_{t-1} \mid x_t)) - \log p_{\theta}(x_0 \mid x_1)\right]$$

Deriving the Simplified Loss Function for Training

In the DDPM paper, they ignore the first and last terms of the loss function and focus solely on optimizing the $D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_{\theta}(x_{t-1} \mid x_t))$ terms. A nice fact is that $q(x_{t-1} \mid x_t, x_0)$ is Gaussian with mean $\tilde{\mu}_t(x_t, x_0)$ and covariance $\tilde{\beta}_t I$, where (writing $\alpha_t = 1 - \beta_t$ and $\overline{\alpha_t} = \prod_{s=1}^t \alpha_s$):

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\overline{\alpha_{t-1}}}\,\beta_t}{1 - \overline{\alpha_t}}x_0 + \frac{\sqrt{\alpha_t}\,(1 - \overline{\alpha_{t-1}})}{1 - \overline{\alpha_t}}x_t$$

and

$$\tilde{\beta}_t = \frac{1 - \overline{\alpha_{t-1}}}{1 - \overline{\alpha_t}}\beta_t$$
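One way to sanity-check the variance formula: completing the square when combining two Gaussians means the posterior precision $1/\tilde{\beta}_t$ should equal the sum of the likelihood precision $\alpha_t/\beta_t$ and the prior precision $1/(1 - \overline{\alpha_{t-1}})$. A quick numeric check (schedule values are arbitrary):

```python
import numpy as np

# Verify 1 / beta_tilde_t = alpha_t / beta_t + 1 / (1 - alpha_bar_{t-1})
# for the closed-form posterior variance beta_tilde_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # any interior timestep works (arrays are 0-indexed here)
beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
precision_sum = alphas[t] / betas[t] + 1.0 / (1.0 - alpha_bars[t - 1])
print(1.0 / beta_tilde, precision_sum)
```

Note also that $\tilde{\beta}_t \leq \beta_t$: conditioning on $x_0$ can only shrink the variance.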

In the paper, they make the simplification that $\Sigma_{\theta}(x_t, t) = \sigma_t^2 I$, with $\sigma_t^2$ set to either $\beta_t$ or $\tilde{\beta}_t$. Since each KL term in the loss is between two Gaussians, we can write each term (dropping constants) as:

$$\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\lVert \tilde{\mu}_t(x_t, x_0) - \mu_{\theta}(x_t, t) \rVert^2\right]$$

From the relation $x_t = \sqrt{\overline{\alpha_t}}x_0 + \sqrt{1 - \overline{\alpha_t}}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (obtained by composing the forward-process transitions), we can write $x_0 = \frac{1}{\sqrt{\overline{\alpha_t}}}\left(x_t - \sqrt{1 - \overline{\alpha_t}}\,\epsilon\right)$ and substitute:

$$\mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\sigma_t^2}\left\lVert \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha_t}}}\epsilon\right) - \mu_{\theta}(x_t, t) \right\rVert^2\right]$$

What we have here is a regression problem where the learned mean $\mu_{\theta}$ needs to predict $\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha_t}}}\epsilon\right)$. Instead of predicting the mean directly, we predict the noise given a timestep $t$ and $x_t$. From this, the estimated $x_0$ is $\frac{1}{\sqrt{\overline{\alpha_t}}}\left(x_t - \sqrt{1 - \overline{\alpha_t}}\,\epsilon_{\theta}(x_t, t)\right)$, where $\epsilon_{\theta}$ is our learned model. Plugging this into the formula for $\tilde{\mu}_t$, we have:

$$\mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \overline{\alpha_t}}}\epsilon_{\theta}(x_t, t)\right)$$

During sampling, we iteratively apply the rule $x_{t-1} = \mu_{\theta}(x_t, t) + \sigma_t z$, where $z$ is a standard Gaussian (with $z = 0$ at the final step).

Using the reparameterized noise model, we have that:

$$\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\lVert \tilde{\mu}_t(x_t, x_0) - \mu_{\theta}(x_t, t) \rVert^2\right] = \mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1 - \overline{\alpha_t})}\lVert \epsilon - \epsilon_{\theta}(x_t, t) \rVert^2\right]$$

The DDPM paper drops the constant in front of the squared-norm.
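The mean-space and noise-space losses really are equal pointwise, not just in expectation, which is easy to verify numerically in the scalar case (all concrete values below, including the pretend model output `eps_theta`, are arbitrary test inputs):

```python
import numpy as np

# Check that the mean-matching KL term equals the reweighted
# noise-matching term, for a single scalar example.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 300
x0, eps, eps_theta = 0.7, -0.4, 0.1  # eps_theta: pretend model output
sigma2 = betas[t]                     # the sigma_t^2 = beta_t choice

# Forward relation: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

mu_tilde = (np.sqrt(alpha_bars[t - 1]) * betas[t] / (1 - alpha_bars[t]) * x0
            + np.sqrt(alphas[t]) * (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * x_t)
mu_theta = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta) / np.sqrt(alphas[t])

lhs = (mu_tilde - mu_theta)**2 / (2 * sigma2)
rhs = betas[t]**2 / (2 * sigma2 * alphas[t] * (1 - alpha_bars[t])) * (eps - eps_theta)**2
print(lhs, rhs)
```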

Here is the pseudocode for training and sampling.

Training and Sampling
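A minimal numpy sketch of the two procedures (Algorithms 1 and 2 in the paper). Here `eps_model` is a placeholder for the learned noise predictor (a U-Net in the paper), and the gradient update is left as a comment since it depends on the autodiff framework:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear schedule from the paper
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
rng = np.random.default_rng(0)

def eps_model(x, t):
    return np.zeros_like(x)  # placeholder for the trained network

def training_step(x0):
    # Algorithm 1: sample t and eps, form x_t, regress eps_model onto eps.
    t = rng.integers(0, T)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    loss = np.mean((eps - eps_model(x_t, t))**2)
    # ... take a gradient step on `loss` w.r.t. the model parameters ...
    return loss

def sample(shape):
    # Algorithm 2: start from pure noise and iteratively denoise.
    x = rng.normal(size=shape)
    for t in range(T - 1, -1, -1):
        z = rng.normal(size=shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t choice
    return x

loss = training_step(np.ones(8))
x = sample((8,))
print(loss, x.shape)
```

In practice the training step runs over minibatches until convergence; everything else above maps line for line onto the derivation in the previous section.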