Energy-based models (EBMs) represent the probability density with an energy function, \(E_\phi(x)\). Intuitively, an ideally learned EBM assigns lower energy to higher-probability data.

\[p_\phi(x) = \frac{e^{-E_\phi(x)}}{Z_\phi}, Z_\phi = \int e^{-E_\phi(x)} dx\]
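To make this concrete, here is a minimal PyTorch sketch (the architecture and names are illustrative assumptions, not from the source): the energy is just a scalar-output network, and only the unnormalized density \(e^{-E_\phi(x)}\) is directly available, since computing \(Z_\phi\) would require integrating over the whole data space.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E_phi(x); lower energy should mean higher probability."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # shape: (batch,)

energy = EnergyNet()
x = torch.randn(8, 2)                   # a batch of 2-D points
unnormalized = torch.exp(-energy(x))    # e^{-E_phi(x)}; Z_phi itself is intractable
```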

As usual, we would like to use maximum likelihood to train EBMs:

\[L_{\text{MLE}}(\phi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\text{log} p_\phi(x)] = - \mathbb{E}_{x \sim p_{\text{data}}(x)}[E_\phi(x)] - \text{log} Z_\phi\]

The first term tells us that the goal is to minimize the expected energy of observed data, i.e. maximize their probability. The second term acts as a regularizer that is related to the entropy of the model.

The entropy of the model is: \(H(p_\phi) = -\mathbb{E}_{x \sim p_\phi}[\text{log} p_\phi(x)] = \mathbb{E}_{x \sim p_\phi}[E_\phi(x)] + \text{log} Z_\phi.\)

Since \(Z_\phi\) is the normalizing constant, increasing it pushes the probabilities down across the board, i.e. spreads the probability mass more evenly over the space. By the identity above, this corresponds to a higher entropy, and the practical effect is improved mode coverage.

Challenge in EBMs

Directly optimizing MLE in this way is a challenge, as \(Z_\phi\) is intractable. We can sidestep this by working in gradient space: taking the gradient of the log probability with respect to \(x\) makes the normalizing constant disappear. This is where the score function comes in.

The score function is a general term for the gradient of the log probability with respect to the data. It points in the direction of higher probability and defines a vector field over the data space.

\[s(x) = \nabla_x \text{log} p(x)\]
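For a concrete example, take a Gaussian \(p(x) = N(x; \mu, \sigma^2 I)\). Its score is

\[s(x) = \nabla_x \text{log} p(x) = -\frac{x - \mu}{\sigma^2},\]

a vector field that points from \(x\) back toward the mean \(\mu\), i.e. toward higher probability.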

If we plug the energy-based density into this definition, the normalizing constant drops out:

\[s_\phi(x) = \nabla_x \text{log} p_\phi(x) = - \nabla_x E_\phi(x)\]
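In practice this score can be read off any energy network with automatic differentiation. A minimal sketch using PyTorch autograd (the quadratic energy below is a stand-in so the snippet is self-contained):

```python
import torch

def energy(x):
    # Stand-in energy of a standard Gaussian: E(x) = ||x||^2 / 2
    return 0.5 * (x ** 2).sum(dim=-1)

def score_from_energy(energy_fn, x):
    """s_phi(x) = -grad_x E_phi(x), computed with autograd."""
    x = x.detach().requires_grad_(True)
    total_energy = energy_fn(x).sum()   # summing over the batch keeps per-sample gradients
    (grad_x,) = torch.autograd.grad(total_energy, x, create_graph=True)
    return -grad_x

x = torch.randn(4, 2)
print(score_from_energy(energy, x))     # for this energy the score is simply -x
```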

So, instead of training EBMs to maximize the likelihood, we can train them by matching the model score to the true score function, \(s(x)\).

\[L_{SM}(\phi) = \frac{1}{2}\mathbb{E}_{x\sim p_\text{data}}|| s_\phi(x) - s(x) ||_2^2 \\ = \frac{1}{2}\mathbb{E}_{x\sim p_\text{data}}|| \nabla_x \text{log} p_\phi(x) - \nabla_x \text{log} p_{\text{data}}(x)||_2^2\]

As the true score function cannot be computed (it requires \(p_{\text{data}}\)), Hyvärinen showed that there is an equivalent form, up to a constant independent of \(\phi\), in which the objective depends only on \(s_\phi\):

\[L_{SM}(\phi) = \mathbb{E}_{x\sim p_\text{data}}[ \text{Tr}(\nabla_x s_\phi(x)) + \frac{1}{2} || s_\phi(x) ||_2^2]\]
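A minimal sketch of this objective (illustrative model and names; the exact Jacobian trace costs one backward pass per data dimension, so it is only practical for low-dimensional data):

```python
import torch
import torch.nn as nn

def implicit_sm_loss(score_fn, x):
    """Hyvarinen objective: E[ Tr(grad_x s_phi(x)) + 0.5 * ||s_phi(x)||^2 ]."""
    x = x.detach().requires_grad_(True)
    s = score_fn(x)                                 # (batch, dim)
    loss = 0.5 * (s ** 2).sum(dim=-1)               # 0.5 * ||s_phi(x)||^2
    for i in range(x.shape[-1]):                    # Tr = sum_i d s_i / d x_i
        ds_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        loss = loss + ds_i[:, i]
    return loss.mean()

# score_fn can be -grad_x E_phi(x) from an energy model, or a direct score network:
score_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))
print(implicit_sm_loss(score_net, torch.randn(16, 2)))
```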

Minimizing the second term tells us that the score should be small at highly probable data: high-probability regions should be local maxima of the log-density, where the gradient is zero, so this term is at best 0. The burden then falls on making the first term negative. It is a divergence term, since it is the trace of the second derivative of the log-density. Negative divergence means that around high-probability points the vector field points inwards, i.e. those points are sinks.
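Written out, this divergence is the negative Laplacian of the energy,

\[\text{Tr}(\nabla_x s_\phi(x)) = \sum_i \frac{\partial s_{\phi,i}(x)}{\partial x_i} = -\sum_i \frac{\partial^2 E_\phi(x)}{\partial x_i^2} = -\Delta_x E_\phi(x),\]

so making it negative at the data asks the energy surface to curve upwards away from the data points, which is exactly the sink behaviour described above.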

From energy-based models to score-based models

The Jacobian trace is expensive to compute in high dimensions. This led to the idea of not bothering with energies at all and working directly with scores: can’t we just train a neural network that predicts the score function instead of the energy function?

Yes, but the problem remains that the true score function is intractable. The idea to overcome this is to inject a known amount of noise into the data and work with the “noisy” data distribution and its conditionals.

\[\tilde{x} = x + \epsilon, \quad \epsilon \sim N(0, \sigma^2I)\] \[p_\sigma(\tilde{x}) = \int p_\sigma(\tilde{x}|x) p_{\text{data}}(x)dx\] \[L_{SM}(\phi) = \frac{1}{2}\mathbb{E}_{\tilde{x} \sim p_\sigma}[|| s_\phi(\tilde{x}) - \nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x})||_2^2]\]

Using this conditioning technique, we can avoid the score of the marginal and work with the conditional instead. The two objectives differ only by a constant that does not depend on \(\phi\), so optimizing against the conditional is equivalent to optimizing against the marginal distribution.
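The key identity behind this equivalence (a brief sketch of the standard denoising score matching argument) is that the marginal score is an average of conditional scores, weighted by the posterior over the clean data:

\[\nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x}) = \frac{\int p_{\text{data}}(x)\, \nabla_{\tilde{x}}\, p_\sigma(\tilde{x}|x)\, dx}{p_\sigma(\tilde{x})} = \mathbb{E}_{x \sim p(x | \tilde{x})}\left[\nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x}|x)\right].\]

Expanding both squared norms, the cross terms coincide because of this identity, and the remaining terms do not involve \(\phi\). This gives the denoising score matching objective: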

\[L_{DSM}(\phi) = \frac{1}{2}\mathbb{E}_{\tilde{x} \sim p_\sigma(\cdot | x), x \sim p_\text{data}}[|| s_\phi(\tilde{x}) - \nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x}|x)||_2^2]\]

The conditional is very easy to compute, because we designed the noising process ourselves. For instance, for Gaussian noise: \(p_\sigma(\tilde{x} \vert x) = N(\tilde{x}; x, \sigma^2I)\). The conditional score is then \(\nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x}\vert x) = -\frac{\tilde{x} - x}{\sigma^2}\).
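This follows in one step from the Gaussian log-density:

\[\text{log} p_\sigma(\tilde{x} \vert x) = -\frac{\lVert \tilde{x} - x \rVert_2^2}{2\sigma^2} + \text{const} \quad\Rightarrow\quad \nabla_{\tilde{x}} \text{log} p_\sigma(\tilde{x} \vert x) = -\frac{\tilde{x} - x}{\sigma^2}.\]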

Thus, the objective simplifies in this case to:

\[L_{DSM}(\phi) = \frac{1}{2}\mathbb{E}_{\tilde{x} \sim p_\sigma(\cdot | x), x \sim p_\text{data}}[|| s_\phi(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} ||_2^2].\]
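Putting everything together, here is a minimal denoising score matching training sketch for a single fixed noise level \(\sigma\) (the network, optimizer, and toy data distribution are illustrative placeholders):

```python
import torch
import torch.nn as nn

dim, sigma = 2, 0.1
score_net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(),
                          nn.Linear(128, 128), nn.SiLU(),
                          nn.Linear(128, dim))           # s_phi: R^dim -> R^dim
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

def sample_data(batch):
    # Placeholder data distribution: a mixture of two Gaussians.
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(0, 2, (batch,))] + 0.3 * torch.randn(batch, dim)

for step in range(1000):
    x = sample_data(256)
    x_tilde = x + sigma * torch.randn_like(x)            # x~ = x + eps, eps ~ N(0, sigma^2 I)
    target = -(x_tilde - x) / sigma**2                   # conditional score of p_sigma(x~ | x)
    loss = 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At the optimum, \(s_\phi(\tilde{x})\) approximates the score of the noisy marginal \(p_\sigma(\tilde{x})\), not of the clean data distribution.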

References

  • Lai, Chieh-Hsin, et al. “The principles of diffusion models.” arXiv preprint arXiv:2510.21890 (2025).