Score-Based Generative Modeling
Takeaway
Learn the score (the gradient of log density) at multiple noise levels and sample with Langevin dynamics or SDE solvers.
The problem (before → after)
- Before: Directly modeling complex densities is hard.
- After: Estimating the score sidesteps the intractable normalizing constant; denoising score matching learns s_θ(x, σ) ≈ ∇_x log p_σ(x) at each noise level.
Mental model first
Imagine walking through fog toward the densest part of a landscape; the score is a compass pointing toward higher probability. Adding a little noise smooths the landscape and steadies the compass.
Just-in-time concepts
- Score: s_θ(x, σ) ≈ ∇_x log p_σ(x).
- DSM (denoising score matching): perturb data as x̃ = x + σξ with ξ ~ N(0, I), then regress onto the tractable conditional score by minimizing E‖s_θ(x̃, σ) + ξ/σ‖²; up to a θ-independent constant this equals the intractable objective E‖s_θ(x̃, σ) − ∇_x̃ log p_σ(x̃)‖² (a minimal loss sketch follows this list).
- Predictor–corrector (PC) sampling: alternate reverse-SDE predictor steps with Langevin “corrector” steps.
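A minimal sketch of the per-noise-level DSM loss, assuming a score network with the signature score_model(x, sigma) and a scalar sigma; the σ² weighting is one common choice for balancing losses across noise levels.

import torch

def dsm_loss(score_model, x, sigma):
    # Perturb the data and regress onto the conditional score -xi/sigma,
    # i.e. grad log N(x_tilde; x, sigma^2 I) evaluated at x_tilde = x + sigma*xi.
    xi = torch.randn_like(x)
    x_tilde = x + sigma * xi
    target = -xi / sigma
    pred = score_model(x_tilde, sigma)
    # sigma^2 weighting keeps the loss magnitude comparable across noise levels.
    per_example = ((pred - target) ** 2).flatten(1).sum(dim=1)
    return (sigma ** 2) * per_example.mean()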
First-pass solution
Train s_θ across a schedule of noise levels; then sample by drawing pure prior noise and integrating the reverse-time SDE with predictor–corrector (PC) steps.
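For concreteness, here is a sketch of one predictor step for the variance-exploding (VE) SDE in its discrete form (an Euler–Maruyama-style reverse-diffusion update), assuming score(x, sigma) approximates ∇_x log p_σ(x) and sigma_next < sigma; the Langevin corrector appears in the code section below.

import torch

def ve_predictor_step(x, sigma, sigma_next, score):
    # One reverse-diffusion step of the reverse-time VE SDE,
    # moving from noise level sigma down to sigma_next.
    diff = sigma ** 2 - sigma_next ** 2
    noise = torch.randn_like(x)
    return x + diff * score(x, sigma) + diff ** 0.5 * noise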
Iterative refinement
- Variance-exploding (VE) and variance-preserving (VP) formulations define different forward SDEs and noise schedules; their forward perturbation kernels are sketched after this list.
- Guidance (classifier-based or classifier-free) and conditioning inputs extend the framework to conditional generation.
- Connections: DDPMs arise as discretizations of reverse SDEs.
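A sketch of the two forward perturbation kernels; the scalar parameters sigma_t and alpha_bar_t are assumed to come from whichever schedule you choose.

import torch

def perturb_ve(x, sigma_t):
    # VE marginal: x_t ~ N(x, sigma(t)^2 I); the variance grows ("explodes") with t.
    return x + sigma_t * torch.randn_like(x)

def perturb_vp(x, alpha_bar_t):
    # VP marginal: x_t ~ N(sqrt(alpha_bar_t) * x, (1 - alpha_bar_t) I); the total
    # variance stays bounded ("preserved"). This is the same forward kernel DDPM
    # uses, which is one way to see the discretization connection.
    xi = torch.randn_like(x)
    return alpha_bar_t ** 0.5 * x + (1 - alpha_bar_t) ** 0.5 * xi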
Code as a byproduct (annealed Langevin step)
import torch

def langevin_step(x, sigma, score, step):
    # One unadjusted Langevin update at noise level sigma: move along the
    # estimated score, then inject Gaussian noise scaled so the chain targets
    # p_sigma in the small-step limit.
    noise = torch.randn_like(x)
    grad = score(x, sigma)
    return x + step * grad + (2 * step) ** 0.5 * noise
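A usage sketch: anneal from the largest noise level to the smallest, running several corrections per level, with step sizes shrinking in proportion to σ² as Song & Ermon recommend; the eps and n_steps defaults here are illustrative.

def annealed_langevin(x, sigmas, score, eps=2e-5, n_steps=100):
    # sigmas: noise levels sorted from largest to smallest.
    for sigma in sigmas:
        # Step size proportional to sigma^2, measured relative to the smallest sigma.
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(n_steps):
            x = langevin_step(x, sigma, score, step)
    return x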
Principles, not prescriptions
- Smooth with noise to stabilize learning, then anneal toward zero; a common geometric schedule is sketched after this list.
- Accurate scores at high noise are crucial to guide early steps.
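A minimal sketch of such a geometric noise schedule; the endpoints and number of levels here are illustrative, and in practice σ_max is chosen relative to the spread of the data.

import math
import torch

def geometric_sigmas(sigma_max=50.0, sigma_min=0.01, n_levels=10):
    # Log-uniform spacing from the largest sigma down to the smallest,
    # so consecutive levels differ by a constant ratio.
    return torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min), n_levels))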
Common pitfalls
- Langevin step sizes that are too small mix slowly and bias samples toward the initialization; step sizes that are too large make the dynamics diverge.
- The score network must behave well across all σ: since ‖∇_x log p_σ(x)‖ scales roughly like 1/σ, the output should be conditioned or rescaled accordingly (one option is sketched below).
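One simple conditioning strategy, shown as a sketch rather than the only option (others embed σ or log σ as a network input), is to wrap an unconditional network f_θ and divide its output by σ.

import torch.nn as nn

class NoiseScaledScore(nn.Module):
    # s_theta(x, sigma) = f_theta(x) / sigma, so the predicted score magnitude
    # automatically tracks the roughly 1/sigma growth of grad_x log p_sigma(x).
    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, x, sigma):
        return self.net(x) / sigma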
Connections and contrasts
- See also: [/blog/diffusion-models], [/blog/normalizing-flows], [/blog/gans].
Quick checks
- Why learn scores? — Avoids computing partition functions.
- Why multi-σ training? — Large σ gives reliable scores far from the data manifold (good global coverage); small σ recovers fine detail near it.
- How related to DDPM? — Both learn to reverse a noising process.
Further reading
- Song & Ermon papers (source above)
- Karras et al., EDM