Takeaway

Learn the score (the gradient of the log-density) at multiple noise levels and sample with Langevin dynamics or reverse-SDE solvers.

The problem (before → after)

  • Before: Directly modeling complex densities is hard.
  • After: Estimating the score bypasses normalization; denoising score matching learns ∇_x log p_σ(x).

Mental model first

Imagine walking downhill in fog toward the densest region; the score is a compass pointing to higher probability. Adding a little noise smooths the landscape and stabilizes the compass.

Just-in-time concepts

  • Score: s_θ(x, σ) ≈ ∇_x log p_σ(x).
  • DSM (denoising score matching): Minimize E||s_θ(x + σξ, σ) + ξ/σ||² over data x and noise ξ ~ N(0, I). The target −ξ/σ is the score of the Gaussian perturbation kernel, so training never needs the true data score.
  • Predictor–corrector sampling: Alternate numerical reverse-SDE steps (predictor) with Langevin “correction” steps (corrector).
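The DSM objective above can be sketched in a few lines; `score_net` here is a hypothetical model taking a perturbed batch and a noise level, and the σ² weighting is one common choice of per-level loss weight, not the only one:

```python
import torch

def dsm_loss(score_net, x, sigma):
    # Perturb clean data with Gaussian noise at level sigma.
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    # Score of the perturbation kernel N(x_tilde; x, sigma^2 I) is
    # -(x_tilde - x) / sigma^2 = -noise / sigma.
    target = -noise / sigma
    pred = score_net(x_tilde, sigma)
    # sigma^2 weighting keeps the loss magnitude comparable across levels.
    return (sigma ** 2) * ((pred - target) ** 2).sum(dim=-1).mean()
```

For multi-σ training, sample a noise level per batch from the schedule and average this loss over levels.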

First-pass solution

Train s_θ across a noise schedule; then sample by drawing from the prior noise distribution and integrating the reverse-time SDE with predictor–corrector steps.

Iterative refinement

  1. VE/VP subtypes define different forward SDEs and noise schedules.
  2. Guidance and conditioning extend to conditional generation.
  3. Connections: DDPMs arise as discretizations of reverse SDEs.
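The VE/VP distinction in step 1 comes down to two different forward perturbation kernels; a minimal sketch (function names are illustrative, and the continuous-time SDEs integrate these over t):

```python
import math
import torch

def perturb_ve(x, sigma):
    # VE (variance-exploding) kernel: x_t = x + sigma * z.
    # Total variance grows without bound as sigma increases.
    return x + sigma * torch.randn_like(x)

def perturb_vp(x, alpha_bar):
    # VP (variance-preserving) kernel, as in DDPM:
    # x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * z,
    # keeping total variance near 1 for unit-variance data.
    z = torch.randn_like(x)
    return math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * z
```

The VP kernel is exactly the DDPM forward process, which is why DDPM sampling falls out as a discretization of the VP reverse SDE.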

Code as a byproduct (annealed Langevin step)

import torch

def langevin_step(x, sigma, score, step):
    # One unadjusted Langevin update: drift along the estimated score
    # (toward higher probability) plus injected Gaussian noise.
    noise = torch.randn_like(x)
    grad = score(x, sigma)
    return x + step * grad + (2 * step) ** 0.5 * noise
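A sketch of the full annealed loop around `langevin_step` (the step function is repeated so the snippet runs standalone); the (σ/σ_min)² step scaling keeps the signal-to-noise ratio of each update roughly constant across levels, and `eps` is a tunable base step size:

```python
import torch

def langevin_step(x, sigma, score, step):
    # One Langevin update, as above.
    noise = torch.randn_like(x)
    return x + step * score(x, sigma) + (2 * step) ** 0.5 * noise

def annealed_langevin(score, sigmas, shape, steps_per_sigma=10, eps=2e-5):
    # Start from broad noise at the largest (first) sigma and anneal down.
    x = sigmas[0] * torch.randn(shape)
    for sigma in sigmas:
        # Scale the step with (sigma / sigma_min)^2 so each update's
        # signal-to-noise ratio stays roughly constant across levels.
        step = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_sigma):
            x = langevin_step(x, sigma, score, step)
    return x
```

In practice `sigmas` is a decreasing (often geometric) schedule, and `eps` and `steps_per_sigma` are tuned jointly: more corrector steps allow a smaller base step.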

Principles, not prescriptions

  • Smooth with noise to stabilize learning, then anneal to zero.
  • Accurate scores at high noise are crucial to guide early steps.

Common pitfalls

  • Step sizes that are too small leave truncation bias and mix slowly; too large, and sampling diverges.
  • Score networks must be well-conditioned across sigmas.

Connections and contrasts

  • See also: [/blog/diffusion-models], [/blog/normalizing-flows], [/blog/gans].

Quick checks

  1. Why learn scores? — Avoids computing partition functions.
  2. Why multi-σ training? — High noise covers low-density regions between modes; low noise preserves detail near the data.
  3. How related to DDPM? — Both learn to reverse a noising process.

Further reading

  • Song & Ermon papers (source above)
  • Karras et al., EDM