Evidence Lower Bound (ELBO)

In Bayes’ rule, evidence is another name for the marginal likelihood of the data. It is called “evidence” because it measures how well the data supports the model.

\[p(z|x) = \frac{p(x|z)p(z)}{\color{red}{p(x)}}\]

where \(p(x) = \int p(x,z) dz = \int p(x\vert z)p(z)dz\).

The ultimate goal of many likelihood-based generative models is to maximize this marginal likelihood. However, it is usually intractable because of the integral over all possible values of the latent variable \(z\). The ELBO, a tractable lower bound, is therefore a useful tool for designing training objectives. Since we cannot maximize \(\log p_\phi(x)\) directly, we instead maximize its lower bound.

\[\text{log}p_\phi(x) \geq L_{\text{ELBO}}(\theta, \phi; x)\]

I wrote the above formula for VAEs, where \(\theta\) denotes the encoder parameters and \(\phi\) the decoder parameters. In DDPMs the encoder is fixed, so \(\theta\) doesn’t exist and \(\phi\) parameterizes the learnable decoder (denoiser).
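To make the evidence concrete, here is a minimal sketch on a toy model that is not from the derivation above: a one-dimensional latent \(z \sim \mathcal{N}(0, 1)\) with \(x \vert z \sim \mathcal{N}(z, \sigma^2)\). This model is an assumption chosen because its evidence has a closed form, \(p(x) = \mathcal{N}(x; 0, 1 + \sigma^2)\), so the integral \(\int p(x\vert z)p(z)dz\) can be checked against a Monte Carlo average over prior samples. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Density of a univariate Gaussian N(mean, var) evaluated at x.
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def evidence_monte_carlo(x, noise_var, num_samples, seed=0):
    # p(x) = E_{z ~ p(z)}[p(x|z)], estimated by averaging over prior samples.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)
    return gaussian_pdf(x, z, noise_var).mean()

def evidence_exact(x, noise_var):
    # For this toy model only: x = z + noise, so x ~ N(0, 1 + noise_var).
    return gaussian_pdf(x, 0.0, 1.0 + noise_var)

x_obs, noise_var = 1.5, 0.5
approx = evidence_monte_carlo(x_obs, noise_var, num_samples=200_000)
exact = evidence_exact(x_obs, noise_var)
print(f"Monte Carlo estimate: {approx:.4f}, closed form: {exact:.4f}")
```

In one dimension the plain average works well. With a high-dimensional \(z\) and a neural decoder, almost all prior samples explain \(x\) poorly and the estimate becomes unusable, which is the practical face of the intractability mentioned above.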

Deriving ELBO

Deriving the formula for the ELBO is actually quite simple. I do it here for VAEs. Let \(q_\theta(z|x)\) be the encoder of a VAE, which approximates the true posterior \(p_\phi(z|x)\). For DDPMs, it is replaced by the known forward (noising) process \(p(x_i|x_{i-1})\).

First, we write the marginal as an integral of the joint distribution, then multiply and divide by \(q_\theta\) so that the integral becomes an expectation under \(q_\theta\).

\[\log p_\phi(x) = \log \int p_\phi(x,z) dz = \log \int q_\theta(z|x) \frac{p_\phi(x,z)}{q_\theta(z|x)}dz \\ = \log \mathbb{E}_{z \sim q_\theta(z|x)}\left[\frac{p_\phi(x,z)}{q_\theta(z|x)}\right]\]

Since the logarithm is a concave function, Jensen’s inequality tells us that \(\log \mathbb{E}[Z] \geq \mathbb{E}[\log Z]\) (think of drawing a straight line between two points on a concave curve: the line lies below the curve):

\[\text{log} \mathbb{E}_{z \sim q_\theta(z|x)}[\frac{p_\phi(x,z)}{q_\theta(z|x)}] \geq \mathbb{E}_{z \sim q_\theta(z|x)}[\text{log} \frac{p_\phi(x,z)}{q_\theta(z|x)}]\]
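Jensen’s inequality is easy to check numerically. The sketch below uses a log-normal random variable as a stand-in for the positive ratio inside the expectation; that choice is an assumption made only so that both sides are known (\(\log \mathbb{E}[Z] = 0.5\) and \(\mathbb{E}[\log Z] = 0\) for \(\log Z \sim \mathcal{N}(0, 1)\)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Positive samples standing in for the ratio p(x, z) / q(z|x).
samples = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

log_of_mean = np.log(samples.mean())   # log E[Z]
mean_of_log = np.log(samples).mean()   # E[log Z]
jensen_gap = log_of_mean - mean_of_log

print(f"log E[Z] = {log_of_mean:.3f}")
print(f"E[log Z] = {mean_of_log:.3f}")
print(f"gap      = {jensen_gap:.3f}  (never negative)")
```

The gap is non-negative for any set of positive samples, not just in expectation, because the inequality also holds for the empirical distribution.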

Then, factoring the joint as \(p_\phi(x,z) = p_\phi(x|z)p(z)\) and splitting the logarithm on the right-hand side gives the ELBO:

\[L_{\text{ELBO}} = \mathbb{E}_{z \sim q_\theta(z|x)}[\log p_\phi(x|z)] - D_{KL}(q_\theta(z|x) || p(z))\]

where the KL divergence measures how much one distribution differs from another, here as the expected log-ratio of the two densities under \(q_\theta\):

\[D_{KL}(q_\theta(z|x) || p(z)) = \mathbb{E}_{z \sim q_\theta(z|x)}[\text{log}\frac{q_\theta(z|x)}{p(z)}].\]
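When \(q_\theta(z\vert x)\) is a diagonal Gaussian and \(p(z)\) is a standard normal, this expectation has a well-known closed form. The sketch below assumes exactly that setting (the Gaussian choice and the function names are illustrative assumptions, not something fixed by the definition) and compares the closed form with a Monte Carlo estimate of the expectation as written above.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def kl_to_standard_normal(mu, sigma):
    # Closed form of KL( N(mu, diag(sigma^2)) || N(0, I) ).
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

def kl_monte_carlo(mu, sigma, num_samples=200_000, seed=0):
    # Direct estimate of E_q[log q(z) - log p(z)] using samples z ~ q.
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((num_samples, mu.size))
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2.0 * np.log(sigma) + LOG_2PI, axis=1)
    log_p = -0.5 * np.sum(z**2 + LOG_2PI, axis=1)
    return np.mean(log_q - log_p)

mu, sigma = [0.5, -1.0], [0.8, 1.2]
kl_exact = kl_to_standard_normal(mu, sigma)
kl_estimate = kl_monte_carlo(mu, sigma)
print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_estimate:.4f}")
```

The divergence is zero only when the two distributions coincide, which here means `mu = 0` and `sigma = 1`.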

The first term of the ELBO is the expected log-likelihood of the data given the latent variable, in other words the reconstruction term. The second term, the KL divergence, enters with a minus sign, so maximizing the ELBO pushes the encoder distribution \(q_\theta(z\vert x)\) to be close to the latent prior \(p(z)\), which in many cases is a simple Gaussian.
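Both terms, and the bound itself, can be checked on a toy model where everything is available in closed form. The sketch below assumes a one-dimensional linear-Gaussian model, \(z \sim \mathcal{N}(0,1)\) and \(x\vert z \sim \mathcal{N}(z, \sigma^2)\), with a Gaussian encoder \(q(z\vert x) = \mathcal{N}(m, v)\). The model and all names are assumptions for illustration; a real VAE uses neural networks and sampled estimates instead.

```python
import numpy as np

def log_evidence(x, noise_var):
    # Exact log p(x) for the toy model: x ~ N(0, 1 + noise_var).
    total_var = 1.0 + noise_var
    return -0.5 * np.log(2.0 * np.pi * total_var) - x**2 / (2.0 * total_var)

def elbo(x, noise_var, q_mean, q_var):
    # Reconstruction term E_q[log p(x|z)] with q(z|x) = N(q_mean, q_var).
    recon = (-0.5 * np.log(2.0 * np.pi * noise_var)
             - ((x - q_mean) ** 2 + q_var) / (2.0 * noise_var))
    # Regularization term KL(q(z|x) || p(z)) with p(z) = N(0, 1).
    kl = 0.5 * (q_mean**2 + q_var - 1.0 - np.log(q_var))
    return recon - kl

x_obs, noise_var = 1.5, 0.5

# Exact posterior of the toy model: N(x / (1 + s^2), s^2 / (1 + s^2)).
post_mean = x_obs / (1.0 + noise_var)
post_var = noise_var / (1.0 + noise_var)

log_px = log_evidence(x_obs, noise_var)
elbo_poor = elbo(x_obs, noise_var, q_mean=0.0, q_var=1.0)        # encoder = prior
elbo_best = elbo(x_obs, noise_var, q_mean=post_mean, q_var=post_var)

print(f"log p(x)               = {log_px:.4f}")
print(f"ELBO, encoder = prior  = {elbo_poor:.4f}")
print(f"ELBO, exact posterior  = {elbo_best:.4f}")
```

In this toy model the bound is tight when the encoder equals the exact posterior and strictly looser otherwise, which is why training the encoder tightens the bound.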

ELBO for DDPMs

As in VAEs, the DDPM training objective is to maximize the marginal log-likelihood. There are two notable differences: the encoder is a known distribution, since the “noising” process adds Gaussian noise in small steps, and the marginal is obtained by integrating a joint distribution over all \(T\) noising steps \(x_{1:T}\) instead of a single latent \(z\). Note that in DDPM notation \(x_0\) is the clean data (the \(x\) of the previous section) and \(x_T\) is a noisy sample.

\[\log p_\phi(x_0) = \log \int p_\phi(x_{0:T})dx_{1:T}.\]

In the same way as in VAEs where we inserted \(q_\theta(z\vert x)\), here we insert the known distribution of the forward process \(p(x_{1:T}\vert x_0)\), which is the joint distribution of all the noisy versions of \(x_0\).

\[\log p_\phi(x_0) = \log \int p(x_{1:T}|x_0) \frac{p_\phi(x_{0:T})}{p(x_{1:T}|x_0)}dx_{1:T}\]

Expanding the above and applying Jensen’s inequality in the same way gives the following ELBO with three terms:

\[L_{\text{ELBO}}(x_0; \phi) = - D_{KL}(p(x_T|x_0)||p_{\text{prior}}(x_T)) + \mathbb{E}_{p(x_1|x_0)}[\log p_\phi(x_0|x_1)] - \sum_{i=2}^T \mathbb{E}_{p(x_i|x_0)} [D_{KL}(p(x_{i-1}|x_i, x_0)||p_\phi(x_{i-1}|x_i))]\]

where the first term becomes close to zero if we add enough noise to make \(x_T\) completely noisy, so that \(p(x_T \vert x_0) \approx p_{\text{prior}}(x_T)\). The second term is the reconstruction/denoising term. The last term is the diffusion term, which makes the learned denoiser \(p_\phi(x_{i-1}\vert x_i)\) approximate the true distribution of the denoising process. Note that since the true reverse distribution \(p(x_{i-1}\vert x_i)\) is not tractable, the tractable conditional \(p(x_{i-1}\vert x_i, x_0)\) is used as the target instead, as is commonly done in most diffusion/flow models.
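The claim about the first term can be checked with a short sketch. It assumes the standard variance-preserving DDPM forward process, \(p(x_i\vert x_0) = \mathcal{N}(\sqrt{\bar\alpha_i}x_0, (1-\bar\alpha_i)I)\) with \(\bar\alpha_i = \prod_{j \le i}(1-\beta_j)\), a linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\), and a standard normal prior. None of these specifics are stated above; they are assumptions taken from the usual DDPM setup, and the function names are hypothetical.

```python
import numpy as np

def cumulative_alpha(betas):
    # alpha_bar_i = prod_{j <= i} (1 - beta_j): the signal fraction left after i steps.
    return np.cumprod(1.0 - betas)

def prior_matching_kl(x0, alpha_bar_T):
    # KL( N(sqrt(alpha_bar_T) * x0, (1 - alpha_bar_T) I) || N(0, I) ), summed over dims.
    x0 = np.asarray(x0, float)
    var = 1.0 - alpha_bar_T
    return 0.5 * np.sum(alpha_bar_T * x0**2 + var - 1.0 - np.log(var))

# Assumed linear noise schedule with 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = cumulative_alpha(betas)

x0 = np.array([1.0, -0.5, 2.0])  # a toy "clean" data point
kl_by_steps = {T: prior_matching_kl(x0, alpha_bar[T - 1]) for T in (10, 100, 1000)}

for T, kl in kl_by_steps.items():
    print(f"T = {T:4d}: KL(p(x_T|x_0) || prior) = {kl:.6f}")
```

Under these assumptions the KL shrinks steadily as \(T\) grows, and since this term has no learnable parameters it is typically dropped from the training loss.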

References

Lai, Chieh-Hsin, et al. “The principles of diffusion models.” arXiv preprint arXiv:2510.21890 (2025).