Takeaway

Learn the score (the gradient of the log-density) at multiple noise levels and sample with Langevin dynamics or reverse-SDE solvers.

The problem (before → after)

  • Before: Directly modeling complex densities is hard.
  • After: Estimating the score bypasses normalization; denoising score matching learns ∇_x log p_σ(x).

Mental model first

Imagine walking downhill in fog toward the densest region; the score is a compass pointing to higher probability. Adding a little noise smooths the landscape and stabilizes the compass.

Just-in-time concepts

  • Score: s_θ(x, σ) ≈ ∇_x log p_σ(x).
  • DSM (denoising score matching): Minimize E||s_θ(x + σξ, σ) + ξ/σ||² over data x and noise ξ ~ N(0, I). The target −ξ/σ is the score of the Gaussian perturbation kernel, so training never needs the true data score.
  • Predictor–corrector sampling: Alternate numerical reverse-SDE steps (predictor) with Langevin “correction” steps (corrector).
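The DSM objective above can be sketched in a few lines; `score_net` here is a hypothetical model taking a perturbed batch and a noise level, and the σ² weighting is one common choice of per-level loss weight, not the only one:

```python
import torch

def dsm_loss(score_net, x, sigma):
    # Perturb clean data with Gaussian noise at level sigma.
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    # Score of the perturbation kernel N(x_tilde; x, sigma^2 I) is
    # -(x_tilde - x) / sigma^2 = -noise / sigma.
    target = -noise / sigma
    pred = score_net(x_tilde, sigma)
    # sigma^2 weighting keeps the loss magnitude comparable across levels.
    return (sigma ** 2) * ((pred - target) ** 2).sum(dim=-1).mean()
```

For multi-σ training, sample a noise level per batch from the schedule and average this loss over levels.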

First-pass solution

Train s_θ across a noise schedule; then sample by drawing from the prior noise distribution and integrating the reverse-time SDE with predictor–corrector steps.

Iterative refinement

  1. VE/VP subtypes define different forward SDEs and noise schedules.
  2. Guidance and conditioning extend to conditional generation.
  3. Connections: DDPMs arise as discretizations of reverse SDEs.
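The VE/VP distinction in step 1 comes down to two different forward perturbation kernels; a minimal sketch (function names are illustrative, and the continuous-time SDEs integrate these over t):

```python
import math
import torch

def perturb_ve(x, sigma):
    # VE (variance-exploding) kernel: x_t = x + sigma * z.
    # Total variance grows without bound as sigma increases.
    return x + sigma * torch.randn_like(x)

def perturb_vp(x, alpha_bar):
    # VP (variance-preserving) kernel, as in DDPM:
    # x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * z,
    # keeping total variance near 1 for unit-variance data.
    z = torch.randn_like(x)
    return math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * z
```

The VP kernel is exactly the DDPM forward process, which is why DDPM sampling falls out as a discretization of the VP reverse SDE.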

Code as a byproduct (annealed Langevin step)

import torch

def langevin_step(x, sigma, score, step):
    # One unadjusted Langevin update: drift along the estimated score
    # (toward higher probability) plus injected Gaussian noise.
    noise = torch.randn_like(x)
    grad = score(x, sigma)
    return x + step * grad + (2 * step) ** 0.5 * noise
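A sketch of the full annealed loop around `langevin_step` (the step function is repeated so the snippet runs standalone); the (σ/σ_min)² step scaling keeps the signal-to-noise ratio of each update roughly constant across levels, and `eps` is a tunable base step size:

```python
import torch

def langevin_step(x, sigma, score, step):
    # One Langevin update, as above.
    noise = torch.randn_like(x)
    return x + step * score(x, sigma) + (2 * step) ** 0.5 * noise

def annealed_langevin(score, sigmas, shape, steps_per_sigma=10, eps=2e-5):
    # Start from broad noise at the largest (first) sigma and anneal down.
    x = sigmas[0] * torch.randn(shape)
    for sigma in sigmas:
        # Scale the step with (sigma / sigma_min)^2 so each update's
        # signal-to-noise ratio stays roughly constant across levels.
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_sigma):
            x = langevin_step(x, sigma, score, step)
    return x
```

In practice `sigmas` is a decreasing (often geometric) schedule, and `eps` and `steps_per_sigma` are tuned jointly: more corrector steps allow a smaller base step.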

Principles, not prescriptions

  • Smooth with noise to stabilize learning, then anneal to zero.
  • Accurate scores at high noise are crucial to guide early steps.

Common pitfalls

  • Step sizes that are too small leave truncation bias and mix slowly; too large, and sampling diverges.
  • Score networks must be well-conditioned across sigmas.

Connections and contrasts

  • See also: [/blog/diffusion-models], [/blog/normalizing-flows], [/blog/gans].

Quick checks

  1. Why learn scores? — Avoids computing partition functions.
  2. Why multi-σ training? — High noise covers low-density regions between modes; low noise preserves detail near the data.
  3. How related to DDPM? — Both learn to reverse a noising process.

Further reading

  • Song & Ermon papers (source above)
  • Karras et al., EDM