The minimum mean squared error (MMSE) estimator is the expectation of the posterior; it is the estimate that minimizes the expected squared error.

Let \(x\) be a clean signal and \(\tilde{x}\) a noisy version of it:

\[\begin{aligned} \tilde{x} &= x + \sigma \epsilon \\ \epsilon &\sim \mathcal{N}(0, I) \end{aligned}\]
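In code, this forward noising step is a single line. A minimal numpy sketch (the toy signal and the noise level \(\sigma\) are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.8                                         # assumed noise level
x = np.sin(np.linspace(0, 2 * np.pi, 100))          # toy clean signal
x_tilde = x + sigma * rng.standard_normal(x.shape)  # x~ = x + σ ε, ε ~ N(0, I)
```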

Then the MMSE denoiser is

\[\hat{x} = \mathbb{E}[x|\tilde{x}] = \int x p(x|\tilde{x}) dx = \int x \frac{p(\tilde{x}|x)p(x)}{p(\tilde{x})} dx\]

From the setup of the problem, we know that \(p(\tilde{x} \vert x) = \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\), and the marginal distribution \(p(\tilde{x})\) is

\[p(\tilde{x}) = \int p(\tilde{x}|x) p(x) dx.\]

As a result, we can rewrite the MMSE denoiser as follows:

\[\mathbb{E}[x|\tilde{x}] = \frac{\int x p(\tilde{x}|x)p(x) dx}{\int p(\tilde{x}|x) p(x) dx}\]
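If we could sample from the prior, this ratio would be directly computable by self-normalized Monte Carlo, weighting each prior sample by the likelihood (the denominator is then a Monte Carlo estimate of the marginal \(m(\tilde{x})\)). A minimal sketch with a toy, fully known Gaussian-mixture prior:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.8  # assumed noise level

# samples from an assumed prior: a two-component Gaussian mixture
n = 200_000
comp = rng.random(n) < 0.3
x = np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(3.0, 1.0, n))

def mmse_denoise(x_tilde):
    # w_i ∝ p(x~ | x_i); the Gaussian normalizer cancels in the ratio
    w = np.exp(-0.5 * ((x_tilde - x) / sigma) ** 2)
    return np.sum(w * x) / np.sum(w)

print(mmse_denoise(2.0))  # ≈ E[x | x~ = 2.0]
```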

In practice, however, the prior \(p(x)\) is unknown and these integrals are intractable. A simple trick lets us rewrite the estimator in terms of the marginal alone, and this trick is what leads to Tweedie’s formula.

First, we call the numerator \(N(\tilde{x}) = \int x p(\tilde{x} \vert x)p(x) dx\) and the denominator \(m(\tilde{x}) = p(\tilde{x}) = \int p(\tilde{x} \vert x) p(x) dx\).

The trick is to differentiate the marginal \(m(\tilde{x}) = p(\tilde{x})\) with respect to \(\tilde{x}\). Since the Gaussian likelihood satisfies \(\frac{d}{d\tilde{x}} p(\tilde{x} \vert x) = \frac{x - \tilde{x}}{\sigma^2} p(\tilde{x} \vert x)\),

\[\begin{aligned} \frac{d}{d\tilde{x}}m(\tilde{x}) &= \frac{1}{\sigma^2} \int (x-\tilde{x})\, p(\tilde{x}|x)\, p(x)\, dx \\ &= \frac{1}{\sigma^2}\left(N(\tilde{x}) - \tilde{x}\, m(\tilde{x})\right). \end{aligned}\]

Then, we can represent the numerator \(N(\tilde{x})\) in terms of this gradient:

\[N(\tilde{x}) = \tilde{x} m(\tilde{x}) + \sigma^2 \frac{d}{d\tilde{x}}m(\tilde{x})\]

Plugging these back into the MMSE formula, we get

\[\begin{aligned} \mathbb{E}[x|\tilde{x}] &= \frac{\tilde{x}\, m(\tilde{x}) + \sigma^2 \frac{d}{d\tilde{x}}m(\tilde{x})}{m(\tilde{x})} \\ &= \tilde{x} + \sigma^2 \frac{d}{d\tilde{x}} \log m(\tilde{x}) \\ &= \tilde{x} + \sigma^2 \frac{d}{d\tilde{x}} \log p(\tilde{x}), \end{aligned}\]

which is Tweedie’s formula.
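As a quick sanity check, take a Gaussian prior \(x \sim \mathcal{N}(\mu, \tau^2)\) (an illustrative assumption, not part of the derivation). The noisy marginal is then \(p(\tilde{x}) = \mathcal{N}(\tilde{x};\, \mu,\, \tau^2 + \sigma^2)\), so

\[\frac{d}{d\tilde{x}} \log p(\tilde{x}) = -\frac{\tilde{x} - \mu}{\tau^2 + \sigma^2} \quad\Rightarrow\quad \mathbb{E}[x|\tilde{x}] = \tilde{x} - \sigma^2 \frac{\tilde{x} - \mu}{\tau^2 + \sigma^2} = \frac{\tau^2 \tilde{x} + \sigma^2 \mu}{\tau^2 + \sigma^2},\]

which is exactly the posterior mean of the conjugate Gaussian model: the noisy observation is shrunk toward the prior mean.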

As \(\frac{d}{d\tilde{x}} \log p(\tilde{x}) = \nabla_{\tilde{x}} \log p(\tilde{x})\) is precisely the score function of the noisy marginal, this formula tells us that we can denoise a noisy signal simply by correcting it with the score function.
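To see this concretely, here is a small numerical check (again with an assumed toy Gaussian-mixture prior): it computes the posterior mean by quadrature and compares it with Tweedie’s formula, using a finite-difference estimate of the score of the marginal.

```python
import numpy as np
from scipy.stats import norm

# assumed two-component Gaussian-mixture prior p(x)
def prior_pdf(x):
    return 0.3 * norm.pdf(x, -2.0, 0.5) + 0.7 * norm.pdf(x, 3.0, 1.0)

sigma = 0.8                        # assumed noise level
xs = np.linspace(-10, 10, 20001)   # quadrature grid over x
dx = xs[1] - xs[0]

def marginal(x_tilde):
    # m(x~) = ∫ N(x~; x, σ²) p(x) dx, approximated by a Riemann sum
    return np.sum(norm.pdf(x_tilde, xs, sigma) * prior_pdf(xs)) * dx

def posterior_mean(x_tilde):
    # E[x | x~] as the ratio of the two integrals
    w = norm.pdf(x_tilde, xs, sigma) * prior_pdf(xs)
    return np.sum(xs * w) / np.sum(w)

def tweedie(x_tilde, h=1e-4):
    # x~ + σ² d/dx~ log m(x~), score by central differences
    score = (np.log(marginal(x_tilde + h)) - np.log(marginal(x_tilde - h))) / (2 * h)
    return x_tilde + sigma**2 * score

for xt in [-3.0, 0.5, 2.0, 4.0]:
    print(f"x~={xt:+.1f}  posterior mean={posterior_mean(xt):+.5f}  Tweedie={tweedie(xt):+.5f}")
```

The two columns agree up to discretization error.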

Efron’s paper discusses the motivation behind the formula. Its running example is estimating the means of many random variables, each from a single noisy observation. If you happen to observe only very extreme values, as occurs under selection bias, these observations are inflated by noise.
Tweedie’s formula corrects this inflation and lets you estimate the debiased mean.
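A quick simulation of this effect (assumed scales: prior \(\mathcal{N}(0, \tau^2)\) with \(\tau = 1\), noise \(\sigma = 1\), so the Tweedie correction reduces to the linear shrinkage from the Gaussian example above):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 1.0, 1.0  # assumed noise and prior scales

# many unknown means, one noisy observation each
mu = rng.normal(0.0, tau, 100_000)
z = mu + sigma * rng.standard_normal(mu.shape)

sel = z > 2.5  # keep only the extreme observations (selection bias)

# Tweedie: z + σ² d/dz log p(z), with the marginal p(z) = N(0, τ² + σ²)
z_hat = z * tau**2 / (tau**2 + sigma**2)

print("naive bias:  ", np.mean(z[sel] - mu[sel]))      # clearly positive
print("Tweedie bias:", np.mean(z_hat[sel] - mu[sel]))  # close to zero
```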

How does Tweedie’s formula pop up in diffusion and flow models?

Tweedie’s formula connects the MMSE denoiser and the score function, and this connection is fundamental to diffusion and flow models. Instead of learning the prior distribution directly, these models learn the conditional score function or conditional vector field, which the noising-and-denoising framework makes tractable.

At each time step of a diffusion or flow model, the denoising step is essentially Tweedie’s formula. Thus, by learning the score function or the vector field, the model can express the denoising process:

\[\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)\]
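A minimal PyTorch sketch of this step, assuming a variance-exploding-style parameterization \(x_t = x_0 + \sigma_t \epsilon\) and a hypothetical score_model (other parameterizations add scaling factors):

```python
import torch

def tweedie_denoise(score_model, x_t, t, sigma_t):
    # E[x_0 | x_t] = x_t + σ_t² ∇_{x_t} log p_t(x_t)
    return x_t + sigma_t**2 * score_model(x_t, t)

# toy check with a known score: if x_0 ~ N(0, I) and x_t = x_0 + σ_t ε,
# then p_t = N(0, (1 + σ_t²) I) and the score is -x_t / (1 + σ_t²)
sigma_t = 0.5
score = lambda x, t: -x / (1.0 + sigma_t**2)
x_t = torch.randn(4)
print(tweedie_denoise(score, x_t, t=None, sigma_t=sigma_t))
# matches the Gaussian posterior mean x_t / (1 + σ_t²)
```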

References

  • Efron, Bradley. “Tweedie’s formula and selection bias.” Journal of the American Statistical Association 106.496 (2011): 1602-1614.
  • Lai, Chieh-Hsin, et al. “The principles of diffusion models.” arXiv preprint arXiv:2510.21890 (2025).