Denoising Diffusion Probabilistic Models (DDPM)
Takeaway
Train a network to reverse a gradual noising process; sample by iteratively denoising from pure noise to data.
The problem (before → after)
- Before: Likelihood-based models struggle with complex, multi-scale data without heavy inductive bias.
- After: Use a simple forward diffusion and learn its reverse; training reduces to denoising or noise prediction.
Mental model first
Picture blurring an image repeatedly until it becomes static; a skilled restorer learns to remove just the right amount of blur at each step to recover the picture.
Just-in-time concepts
- Forward process q(x_t|x_{t−1}) adds Gaussian noise with schedule β_t.
- Reverse process p_θ(x_{t−1}|x_t) is Gaussian with learned mean.
- Noise prediction: Predict the added noise ε rather than x_0 directly; this gives a better-conditioned regression target and stabilizes training.
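A useful consequence of the Gaussian forward process is that q(x_t|x_0) has a closed form, so training can jump straight from x_0 to any x_t without simulating every step. A minimal PyTorch sketch; `forward_diffuse` and the schedule constants are illustrative, not from a specific library:

```python
import torch

def forward_diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form (no loop over steps).

    alpha_bar: 1-D tensor of cumulative products of alpha_s = 1 - beta_s.
    """
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps
    return x_t, eps  # eps doubles as the regression target

# A linear beta schedule as one common choice
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```

Note that as t → T, ᾱ_t → 0 and x_t approaches pure Gaussian noise, which is exactly what sampling starts from.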
First-pass solution
Optimize a mean-squared error between the predicted and true noise under a linear or cosine β schedule; sample with ancestral steps from t = T down to t = 0.
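In the noise-prediction parameterization this objective is a one-line MSE at a randomly drawn timestep. A hedged sketch (`ddpm_loss` is an illustrative name; it assumes `model(x_t, t)` returns an ε estimate with the same shape as `x_t`):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)        # linear schedule, as one option
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, alpha_bar):
    """Simple DDPM objective: MSE between true and predicted noise at a random t."""
    t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],))   # one timestep per sample
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps       # closed-form forward jump
    return torch.mean((model(x_t, t) - eps) ** 2)
```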
Iterative refinement
- Classifier-free guidance trades sample diversity for fidelity and adherence to prompts.
- Fewer steps: Distillation and improved samplers (DDIM, DPM-Solver) accelerate sampling.
- Continuous-time SDEs unify with score-based modeling.
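Of the faster samplers above, DDIM is the simplest to sketch: it uses the current noise prediction to estimate x_0, then jumps directly to an earlier timestep, deterministically when η = 0. A sketch under that assumption; the names are illustrative:

```python
import torch

def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    """One deterministic DDIM step (eta = 0) between arbitrary timesteps.

    a_bar_t, a_bar_prev: cumulative alpha products at the current and target steps.
    """
    # Estimate the clean sample implied by the current noise prediction
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar_t) * eps_hat) / torch.sqrt(a_bar_t)
    # Re-noise it to the (possibly much earlier) target timestep
    return torch.sqrt(a_bar_prev) * x0_hat + torch.sqrt(1.0 - a_bar_prev) * eps_hat
```

Because the target step is arbitrary, a few dozen such jumps can replace the full t = T…0 chain.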
Code as a byproduct (one reverse step)
import torch
def reverse_step(x_t, t, model, alpha_bar_t, beta_t):
    """One ancestral step of p_theta(x_{t-1} | x_t); model predicts the noise eps.

    alpha_bar_t is the cumulative product of alpha_s = 1 - beta_s up to step t.
    """
    eps_hat = model(x_t, t)
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_t)
    if t > 0:
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
    return mean  # the final step (t = 0) is noiseless
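Putting the step into a full loop, a self-contained ancestral sampler might look like the sketch below (it assumes the model predicts ε; `sample` is an illustrative name):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Ancestral sampling from t = T-1 down to 0, assuming model(x, t) predicts eps."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # start from pure noise
    for t in range(betas.shape[0] - 1, -1, -1):
        eps_hat = model(x, t)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean                            # last step adds no noise
    return x
```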
Principles, not prescriptions
- Choose parameterizations that match the noise process for stable training.
- Schedules control trade-offs between learning and sampling cost.
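As one concrete schedule, the cosine schedule of Nichol & Dhariwal keeps ᾱ_t from collapsing too quickly early in the process. A sketch (`cosine_betas` is an illustrative name; the 0.008 offset and 0.999 clamp follow the common formulation):

```python
import math
import torch

def cosine_betas(T, s=0.008):
    """Cosine noise schedule: define alpha_bar via a squared cosine, derive betas."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]                         # normalize so alpha_bar[0] = 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1] # per-step noise from the ratio
    return betas.clamp(max=0.999).float()        # clamp to avoid a degenerate last step
```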
Common pitfalls
- Mismatch between training and sampling parameterizations.
- Under-conditioning in conditional tasks; use guidance or better encoders.
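Guidance in the classifier-free form is a two-pass extrapolation at sampling time. A sketch assuming the model accepts a conditioning argument, with `None` standing in for the dropped/null condition (both names are illustrative):

```python
import torch

def cfg_eps(model, x_t, t, cond, w):
    """Classifier-free guidance: push the prediction from unconditional toward conditional.

    w = 1 recovers the conditional model; w > 1 strengthens conditioning.
    """
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)
```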
Connections and contrasts
- See also: [/blog/score-based-modeling], [/blog/normalizing-flows], [/blog/gans].
Quick checks
- Why predict noise? — Yields a well-conditioned loss tied to the diffusion process.
- How to speed up sampling? — Use DDIM or solver-based samplers.
- Relation to scores? — Reverse dynamics depend on score ∇_x log p_t(x).
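The score connection can be made explicit: under the usual ᾱ_t notation, a trained noise predictor is (up to a known scale) an estimate of the score of the noised marginal, which is what lets the SDE view reuse it directly:

```latex
\nabla_{x}\log p_t(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t,\,t)}{\sqrt{1-\bar\alpha_t}}
```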
Further reading
- Ho et al., Denoising Diffusion Probabilistic Models (DDPM)
- Song et al., Score-Based Generative Modeling through Stochastic Differential Equations
- Song et al., Denoising Diffusion Implicit Models (DDIM)