As we all know, there has been a rapid evolution of distribution-learning models. For me, it has been, and still is, a challenge to see the connections between them: looking only at their final objectives, they seem to pursue different goals. What helped me get a bird's-eye view is remembering that all of these models are trying to learn the data distribution, and the most basic objective for that is maximizing the likelihood.

The objectives of DDPMs and EBMs look different at first glance, but both ultimately arise from MLE.

For DDPMs, maximizing the likelihood leads to the ELBO. Because the forward and reverse transitions are Gaussian, the ELBO simplifies to an MSE loss on the noise (or mean) prediction at each time step.
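
To make that concrete, here is a minimal sketch of the simplified noise-prediction objective in PyTorch. The names `model` (a network predicting the added noise from the noisy sample and the timestep) and `alpha_bar` (the cumulative product of the noise schedule) are illustrative assumptions, not any particular implementation.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar):
    """Simplified DDPM objective: MSE between the true and predicted noise."""
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                        # noise added by the forward process
    a_bar = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))             # reshape for broadcasting
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps              # closed-form forward sample q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                             # predict eps, penalize squared error
```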

For EBMs, maximizing the likelihood leads to minimizing the energy on data while pushing energy up elsewhere through the log normalizing constant. However, that normalizing constant is intractable, which motivates working with the score function instead: the gradient of the log-density with respect to the input, under which the normalizing constant disappears. The score is sufficient to describe the underlying density, so if you have learned the correct score function, you can recover the correct density.
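
The key fact is that the log normalizer does not depend on the input, so the score of an EBM is just the negative gradient of the energy. Here is a minimal PyTorch sketch, assuming a hypothetical `energy_net` that maps a batch of inputs to one scalar energy per sample:

```python
import torch

def score_from_energy(energy_net, x):
    """Score of an EBM: grad_x log p(x) = -grad_x E(x); log Z drops out."""
    x = x.detach().requires_grad_(True)
    energy = energy_net(x).sum()                # sum over the batch so the gradient matches x's shape
    grad_E, = torch.autograd.grad(energy, x)    # dE/dx for every sample
    return -grad_E                              # the score: -dE/dx
```

Because the score is all you need for sampling and training, it is often more convenient to learn it directly than to learn an energy and differentiate it.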

Once the idea of score matching came about, NCSN used it directly: rather than parameterizing an energy, it trains a network to output the score itself via denoising score matching at multiple noise scales, bypassing the energy function entirely.
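
Below is a minimal sketch of that denoising-score-matching objective, assuming a hypothetical `score_net(x_noisy, sigma)` that approximates the score of the noise-perturbed data distribution and a tensor `sigmas` of noise levels:

```python
import torch

def dsm_loss(score_net, x0, sigmas):
    """Denoising score matching over multiple noise scales (NCSN-style sketch)."""
    B = x0.shape[0]
    idx = torch.randint(0, sigmas.shape[0], (B,), device=x0.device)  # pick a noise level per sample
    sigma = sigmas[idx].view(B, *([1] * (x0.dim() - 1)))             # reshape for broadcasting
    noise = torch.randn_like(x0)
    x_noisy = x0 + sigma * noise                                     # perturb data with Gaussian noise
    # Target score of the perturbation kernel: grad log q_sigma(x_noisy | x0) = -noise / sigma.
    # Multiplying through by sigma yields the sigma^2-weighted objective used in NCSN.
    residual = sigma * score_net(x_noisy, sigma) + noise
    return 0.5 * residual.reshape(B, -1).pow(2).sum(dim=1).mean()
```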