The ultimate purpose of a generative model is to generate new samples that look like real data. To do this, we would ideally need the probability distribution of the data, \(p_{\text{data}}(x)\). However, this is an oracle distribution, which is complex and practically impossible to know. For images, it would be the distribution over all images in the world. Therefore, the goal of a generative model \(p_\phi(x)\) is to approximate it, so that sampling becomes possible.

\[p_\phi(x) \approx p_{\text{data}}(x)\]

Maximum Likelihood Estimation (MLE)

Here I give a short review of MLE. The maximum likelihood estimator tries to find the parameter that maximizes the probability of observing the dataset. The general formula for MLE is:

\[\phi^* = \text{argmax}_\phi\, \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\phi(x)],\]

where the alternative notation \(p(x \vert \phi) = p_\phi(x)\) is also common.

Recall Bayes’ rule for the posterior: \(p(\phi \vert x) = \frac{p(x \vert \phi)p(\phi)}{p(x)}\). MLE ignores the prior over \(\phi\) and maximizes only the likelihood.
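To make the MLE objective concrete, here is a minimal sketch in plain Python (a toy setup of my own, not from the reference): fitting a Gaussian \(p_\phi(x) = \mathcal{N}(\mu, \sigma^2)\) to samples, where the maximizer of the average log-likelihood is known in closed form, namely the sample mean and sample standard deviation.

```python
import math
import random

# Toy dataset standing in for samples from p_data.
random.seed(0)
data = [random.gauss(2.0, 0.5) for _ in range(10_000)]

def avg_log_likelihood(mu, sigma, xs):
    """Estimate E_{x ~ p_data}[log p_phi(x)] on the dataset,
    where p_phi is a Gaussian with parameters (mu, sigma)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in xs) / len(xs)

# Closed-form MLE for the Gaussian: sample mean and sample std.
mu_mle = sum(data) / len(data)
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in data) / len(data))

# Any other parameter choice scores no higher than the MLE.
assert avg_log_likelihood(mu_mle, sigma_mle, data) >= avg_log_likelihood(1.5, 0.5, data)
```

For most interesting models there is no closed form, and \(\phi^*\) is found by gradient ascent on the same average log-likelihood.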

I think a bit of terminology confusion arises here (at least it did for me). In the evidence lower bound (ELBO), the term “evidence” actually refers to the marginal likelihood of the data:

\[p_\phi(x) = \int p_\phi(x,z)\, dz = \int p_\phi(x \vert z) p(z)\, dz.\]

This quantity appears in Bayes’ rule as:

\[p(z | x) = \frac{p_\phi(x|z)p(z)}{p_\phi(x)}\]

where \(p_\phi(x)\) appears as the evidence (aka. marginal likelihood).

But note that this is Bayes’ formula for the latent variable model. \(p_\phi(x)\) has another layer because it is parametrized by \(\phi\); in terms of the parameter, \(p_\phi(x)\) is the data likelihood, as shown above.
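The evidence integral above can be estimated by simple Monte Carlo, \(p_\phi(x) \approx \frac{1}{N}\sum_i p_\phi(x \vert z_i)\) with \(z_i \sim p(z)\). A small sketch with a toy latent-variable model of my own choosing, where the exact marginal is known (\(z \sim \mathcal{N}(0,1)\), \(x \vert z \sim \mathcal{N}(z,1)\), so \(x \sim \mathcal{N}(0,2)\)):

```python
import math
import random

random.seed(0)

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 0.7
# Monte Carlo evidence: average p(x|z) over samples from the prior p(z).
zs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
p_x_mc = sum(gauss_pdf(x, z, 1.0) for z in zs) / len(zs)
# Closed-form marginal for this toy model: x ~ N(0, 2).
p_x_exact = gauss_pdf(x, 0.0, 2.0)
assert abs(p_x_mc - p_x_exact) < 2e-3
```

In realistic latent variable models this naive estimator is far too high-variance, which is exactly why the ELBO is used as a tractable surrogate.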

KL Divergence and MLE

So the generative model’s goal is to approximate the true data distribution. We can build an objective function based on the KL divergence, which compares the two distributions in terms of their information content.

\[D_{KL}(p_{\text{data}} || p_\phi) = \mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x) - \log p_\phi(x)]\]

When we minimize this KL divergence with respect to \(\phi\), the term \(\mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)]\) is a constant, since it does not depend on \(\phi\). Thus, minimizing the KL divergence boils down to maximizing the likelihood:

\[\text{min}_\phi D_{KL}(p_{\text{data}}||p_\phi) \Longleftrightarrow \text{max}_\phi \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\phi(x)]\]
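This equivalence can be checked numerically on a toy example of my own construction: for a Bernoulli \(p_{\text{data}}\), the two objectives differ only by the \(\phi\)-independent entropy of \(p_{\text{data}}\), so scanning a grid of model parameters finds the same optimum for both.

```python
import math

# p_data: Bernoulli with parameter 0.3, written as probabilities over {0, 1}.
p_data = [0.7, 0.3]
phis = [i / 100 for i in range(1, 100)]   # candidate model parameters

def log_lik(phi):
    """E_{x ~ p_data}[log p_phi(x)] for a Bernoulli model p_phi."""
    return p_data[0] * math.log(1 - phi) + p_data[1] * math.log(phi)

def kl(phi):
    """D_KL(p_data || p_phi) = -H(p_data) - E[log p_phi(x)]."""
    return sum(p * math.log(p) for p in p_data) - log_lik(phi)

best_ll = max(phis, key=log_lik)
best_kl = min(phis, key=kl)
assert best_ll == best_kl == 0.3   # both objectives recover the true parameter
```

Since \(D_{KL}\) is just a constant minus the expected log-likelihood here, the argmin and argmax necessarily coincide, which is the content of the equivalence above.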
