Demystifying and connecting

  • Stationary and non-stationary SDEs
  • Langevin dynamics, MCMC
  • Diffusion models
  • Fokker-Planck Equation

Stationary vs. Non-stationary SDEs

A stationary SDE is one whose dynamics leave a distribution unchanged over time. Starting from \(x_0 \sim p(x)\), it always gives \(x_t \sim p(x)\) for all \(t\).

The Langevin SDE is an example of a stationary SDE, with stationary distribution \(p(x)\).

A non-stationary SDE has a distribution that changes over time. Its purpose is not to sample from a fixed target, but to transform the data distribution into another distribution.

Diffusion model’s forward process is an example of a non-stationary SDE.

It took some time for me to clarify these different roles of SDEs while learning about diffusion and flow models.

Langevin SDE (stationary SDE)

The Langevin SDE is:

\[dx_t = \nabla_x \text{log}p(x_t)dt + \sqrt{2} dw_t,\]

where \(w_t\) is a Brownian motion (also called a Wiener process), a stochastic process whose sample paths are non-differentiable.

Its main purpose is to sample from the target \(p(x)\) by simulating the SDE. Note that the score function is not time-dependent: it is \(\nabla_x \text{log}p(x_t)\), not \(\nabla_x \text{log}p_t(x_t)\). The drift always points toward the same fixed target, which is what makes Langevin dynamics a sampling algorithm and, once discretized, an MCMC sampler. If a distribution is hard to sample from directly, this is the kind of method you would use.
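To make the sampling view concrete, here is a minimal sketch of a discretized Langevin sampler (an Euler-Maruyama scheme with no Metropolis correction). The target \(N(2, 0.5^2)\), the step size, the number of steps, and the function names are all my own choices for illustration, not something fixed by the SDE.

```python
import numpy as np

def score_gaussian(x, mean=2.0, std=0.5):
    """Score (gradient of the log density) of a 1D Gaussian N(mean, std^2)."""
    return -(x - mean) / std**2

def langevin_sample(score, n_particles=5000, n_steps=2000, step=1e-3, seed=0):
    """Euler-Maruyama discretization of dx = score(x) dt + sqrt(2) dw."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_particles)  # arbitrary start, far from the target
    for _ in range(n_steps):
        noise = rng.normal(size=n_particles)
        # drift step along the score, plus noise with variance 2 * step
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

samples = langevin_sample(score_gaussian)
print("sample mean:", samples.mean())  # should be close to 2.0
print("sample std: ", samples.std())   # should be close to 0.5
```

The sampler only ever calls the score, never the density itself, which is why an unnormalized target is enough. Because the step size is finite, the samples follow the target only approximately.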

Why is there the \(\sqrt{2}\) term?

Due to the stationarity property, running the Langevin SDE for long enough gives you (approximately) a sample from the target \(p(x)\). This works only because of the magic coefficient \(\sqrt{2}\), which balances the drift and the noise term: with the drift \(\nabla_x \text{log}p(x)\), any other noise scale would make the process settle on a different distribution.
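A quick way to see the role of \(\sqrt{2}\) is to take the target \(N(0, 1)\), whose score is \(-x\), and replace \(\sqrt{2}\) by a general noise scale \(c\). The SDE \(dx_t = -x_t dt + c\,dw_t\) is an Ornstein-Uhlenbeck process with stationary variance \(c^2/2\), so only \(c = \sqrt{2}\) recovers variance 1. The sketch below checks this by simulation; the function name, step size, and particle count are illustrative assumptions.

```python
import numpy as np

def stationary_variance(noise_scale, n_particles=20000, n_steps=1000,
                        step=0.01, seed=0):
    """Simulate dx = -x dt + noise_scale dw and return the long-run variance.

    The drift -x is the score of the target N(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)
    for _ in range(n_steps):
        noise = rng.normal(size=n_particles)
        x = x - step * x + noise_scale * np.sqrt(step) * noise
    return x.var()

for c in (1.0, np.sqrt(2.0), 2.0):
    # expected to be roughly c^2 / 2, up to discretization and sampling error
    print(f"noise scale {c:.3f}: variance {stationary_variance(c):.3f}")
```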

Note: Mathematically, the Langevin SDE by itself has nothing to do with diffusion models. The connection arises through something called annealed Langevin dynamics.

Diffusion model SDE (non-stationary SDE)

The forward SDE of a diffusion model is:

\[dx_t = -\frac{1}{2} \beta(t)x_tdt + \sqrt{\beta(t)} dw_t,\]

where \(\beta(t)\) is the noise schedule, which describes how fast \(x_t\) is destroyed by adding noise. The first term (drift term) has a negative coefficient, so it pulls every sample toward the origin and drives the mean of the distribution to zero. The variance of the final distribution comes from the second term (noise term). As a result, the distribution of the forward SDE approaches \(N(0, I_d)\) as the accumulated noise \(\int_0^t \beta(s)ds\) grows.
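Here is a small simulation of the forward SDE in one dimension, as a sketch of how the mean and variance behave. I assume a constant schedule \(\beta(t) = 1\) and a made-up bimodal "data" distribution with points at \(-1\) and \(5\); both are illustrative choices, as are the function and variable names.

```python
import numpy as np

def forward_sde(x0, beta=1.0, t_end=10.0, n_steps=1000, seed=0):
    """Euler-Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dw (constant beta)."""
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * noise
    return x

# Hypothetical "data": half the points at -1, half at 5 (mean about 2, variance about 9)
data_rng = np.random.default_rng(1)
x0 = np.where(data_rng.random(20000) < 0.5, -1.0, 5.0)

x_T = forward_sde(x0)
print("data : mean", x0.mean(), "variance", x0.var())
print("final: mean", x_T.mean(), "variance", x_T.var())  # close to 0 and 1
```

Unlike the Langevin sampler, nothing here depends on the data distribution: the same drift and noise are applied to whatever you start from, and the structure of the data is gradually washed out.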

Fokker-Planck Equation

SDEs in general give us a “particle” view. An SDE tells us how each particle moves in a dynamical system, and because of randomness there are many different trajectories a particle can take. But behind every SDE, there is an underlying distribution, be it stationary or non-stationary.

The Fokker-Planck equation gives this “distributional” view. It describes how the distribution changes over time and ignores the details of individual particles. As a result, all the different trajectories generated by an SDE correspond to the same Fokker-Planck equation.

  • SDE : sample path
  • Fokker-Planck : distribution of these sample paths over time

A general SDE is of the form:

\[dx_t = f(x_t,t) dt + g(t) dw_t,\]

where \(f(x_t,t)dt\) is a drift term that tells which direction sample \(x_t\) should go and \(g(t)dw_t\) is a diffusion term that adds perturbation.

Then, the corresponding pdf \(p_t(x)\) satisfies the Fokker-Planck equation:

\[\frac{\partial p_t(x)}{\partial t} = - \nabla \cdot (f(x,t)p_t(x)) + \frac{1}{2} \nabla^2 (g(t)^2p_t(x)).\]

As in the SDE, the first term here can also be interpreted as a drift term. \(\nabla \cdot\) is the divergence operator, which measures the net amount of probability flowing out of a point \(x\) (equivalent to \(\text{Tr}(\nabla (f(x,t)p_t(x)))\)). With the minus sign in front, it tells us whether the density at \(x\) increases or decreases because of the drift. The second term is a diffusion term, which tells how fast or slow the distribution spreads over time. \(\nabla^2\) is the Laplacian operator, which leads \(p_t(x)\) to spread out, and \(g(t)\) tells you how strong the diffusion is.
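The two terms can be looked at separately with a simple finite-difference sketch in one dimension. The grid, the helper name `fokker_planck_rhs`, and the test density \(N(0, 1)\) are assumptions made for illustration. With pure drift \(f = 1\), probability leaves the left side of the bump and arrives on the right; with pure diffusion, the peak goes down and the tails go up. In both cases the total change integrates to zero, since probability is only moved around.

```python
import numpy as np

def fokker_planck_rhs(p, x, f, g):
    """Right-hand side of the 1D Fokker-Planck equation on a uniform grid.

    p: density values on the grid x
    f: drift values on the grid (array or scalar)
    g: scalar diffusion coefficient at the current time
    """
    drift_term = -np.gradient(f * p, x)
    diffusion_term = 0.5 * g**2 * np.gradient(np.gradient(p, x), x)
    return drift_term + diffusion_term

x = np.linspace(-8.0, 8.0, 1601)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # test density N(0, 1)

rhs_drift = fokker_planck_rhs(p, x, f=1.0, g=0.0)      # constant drift to the right
rhs_diffusion = fokker_planck_rhs(p, x, f=0.0, g=1.0)  # pure diffusion

print("drift only, dp/dt at x=-1 and x=+1:",
      np.interp(-1.0, x, rhs_drift), np.interp(1.0, x, rhs_drift))
print("diffusion only, dp/dt at x=0 and x=3:",
      np.interp(0.0, x, rhs_diffusion), np.interp(3.0, x, rhs_diffusion))
print("total change in mass:", rhs_drift.sum() * dx, rhs_diffusion.sum() * dx)
```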

Fokker-Planck to SDEs

For a stationary SDE, \(\frac{\partial p_t(x)}{\partial t}=0\).
We can check this for the Langevin SDE by setting \(f(x,t) = \nabla_x \text{log} p(x)\) and \(g(t)=\sqrt{2}\):

\[\frac{\partial p_t(x)}{\partial t} = - \nabla \cdot (p_t(x) \nabla_x \text{log} p(x)) + \nabla^2 p_t(x) = 0.\]

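The right-hand side vanishes when \(p_t = p\): since \(p(x) \nabla_x \text{log} p(x) = \nabla p(x)\), the first term becomes \(-\nabla \cdot \nabla p(x) = -\nabla^2 p(x)\), which cancels the second term. Thus, once \(p_t(x) = p(x)\), it does not change over time.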
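The cancellation can also be checked numerically. The sketch below evaluates the Langevin Fokker-Planck right-hand side on a 1D grid for a bimodal target, once with \(p_t\) equal to the target and once with \(p_t\) set to a different density. The target, the grid, and the helper names are illustrative assumptions; the residual at the target is not exactly zero only because of finite-difference error.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def langevin_fp_rhs(p_t, p_target, x):
    """-d/dx(p_t * score) + d2/dx2 p_t, with score = d/dx log p_target."""
    score = np.gradient(np.log(p_target), x)
    return -np.gradient(p_t * score, x) + np.gradient(np.gradient(p_t, x), x)

x = np.linspace(-8.0, 8.0, 1601)
p_target = 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)
p_other = normal_pdf(x, 0.0, 1.0)  # some density that is not the target

residual_at_target = np.abs(langevin_fp_rhs(p_target, p_target, x)).max()
residual_off_target = np.abs(langevin_fp_rhs(p_other, p_target, x)).max()

print("max |dp/dt| when p_t is the target:    ", residual_at_target)
print("max |dp/dt| when p_t is not the target:", residual_off_target)
```

The second number being far from zero is the distributional version of "the particles are still moving toward the target".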

For the diffusion forward SDE, \(f(x,t) = -\frac{1}{2} \beta(t)x\) and \(g(t)=\sqrt{\beta(t)}\):

\[\frac{\partial p_t(x)}{\partial t} = \nabla \cdot (\frac{1}{2} \beta(t)x p_t(x)) + \frac{1}{2} \beta(t)\nabla^2p_t(x) \neq 0.\]

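The right-hand side is nonzero in general, so \(p_t(x)\) keeps changing during the diffusion forward process. It vanishes only for \(p_t = N(0, I_d)\), where \(\nabla p_t(x) = -x p_t(x)\) makes the two terms cancel. This is the distribution that \(p_t(x)\) moves toward, becoming more and more Gaussian over time.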
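To watch this happen at the level of densities, the Fokker-Planck equation of the forward SDE can be integrated directly on a 1D grid. This is a rough sketch using an explicit finite-difference scheme with a constant schedule \(\beta(t) = 1\); the scheme, grid, time step, and bimodal starting density are my own illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def evolve_forward_fp(p0, x, beta=1.0, t_end=10.0, dt=1e-3):
    """Explicit Euler integration of
    dp/dt = d/dx(0.5*beta*x*p) + 0.5*beta*d2p/dx2 (constant beta).
    Boundary values are left untouched; the density is negligible there."""
    dx = x[1] - x[0]
    p = p0.copy()
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        flux = x * p
        drift = (flux[2:] - flux[:-2]) / (2.0 * dx)             # d/dx (x p)
        diffusion = (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx**2    # d2p/dx2
        p[1:-1] += dt * 0.5 * beta * (drift + diffusion)
    return p

x = np.linspace(-8.0, 8.0, 321)
p0 = 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)
p_T = evolve_forward_fp(p0, x)
gaussian = normal_pdf(x, 0.0, 1.0)

print("initial max gap to N(0, 1):", np.abs(p0 - gaussian).max())
print("final max gap to N(0, 1):  ", np.abs(p_T - gaussian).max())
```

This gives the same picture as simulating many particles with the forward SDE and histogramming them, just without the sampling noise.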

References

  • Lai, Chieh-Hsin, et al. “The principles of diffusion models.” arXiv preprint arXiv:2510.21890 (2025).