Denoising Diffusion Probabilistic Models (DDPM)
Takeaway
Train a network to reverse a gradual noising process; sample by iteratively denoising from pure noise to data.
The problem (before → after)
- Before: Likelihood-based models struggle with complex, multi-scale data without heavy inductive bias.
- After: Use a simple forward diffusion and learn its reverse; training reduces to denoising or noise prediction.
Mental model first
Picture blurring an image repeatedly until it becomes static; a skilled restorer learns to remove just the right amount of blur at each step to recover the picture.
Just-in-time concepts
- Forward process q(x_t|x_{t−1}) adds Gaussian noise with schedule β_t.
- Reverse process p_θ(x_{t−1}|x_t) is Gaussian with learned mean.
- Noise prediction: Predict the added noise ε rather than x_0 directly; this gives a better-conditioned regression target and stabilizes training.
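A useful consequence of the Gaussian forward process is that q(x_t|x_0) has a closed form, so training can jump straight from x_0 to any x_t without simulating every step. A minimal PyTorch sketch; `forward_diffuse` and the schedule constants are illustrative, not from a specific library:

```python
import torch

def forward_diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form (no loop over steps).

    alpha_bar: 1-D tensor of cumulative products of alpha_s = 1 - beta_s.
    """
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps
    return x_t, eps  # eps doubles as the regression target

# A linear beta schedule as one common choice
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```

Note that as t → T, ᾱ_t → 0 and x_t approaches pure Gaussian noise, which is exactly what sampling starts from.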
First-pass solution
Optimize a mean-squared error between the predicted and true noise under a linear or cosine β schedule; sample with ancestral steps from t = T down to t = 0.
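In the noise-prediction parameterization this objective is a one-line MSE at a randomly drawn timestep. A hedged sketch (`ddpm_loss` is an illustrative name; it assumes `model(x_t, t)` returns an ε estimate with the same shape as `x_t`):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)        # linear schedule, as one option
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, alpha_bar):
    """Simple DDPM objective: MSE between true and predicted noise at a random t."""
    t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],))   # one timestep per sample
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps       # closed-form forward jump
    return torch.mean((model(x_t, t) - eps) ** 2)
```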
Iterative refinement
- Classifier-free guidance trades sample diversity for fidelity and adherence to prompts.
- Fewer steps: Distillation and improved samplers (DDIM, DPM-Solver) accelerate sampling.
- Continuous-time SDEs unify with score-based modeling.
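Of the faster samplers above, DDIM is the simplest to sketch: it uses the current noise prediction to estimate x_0, then jumps directly to an earlier timestep, deterministically when η = 0. A sketch under that assumption; the names are illustrative:

```python
import torch

def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    """One deterministic DDIM step (eta = 0) between arbitrary timesteps.

    a_bar_t, a_bar_prev: cumulative alpha products at the current and target steps.
    """
    # Estimate the clean sample implied by the current noise prediction
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar_t) * eps_hat) / torch.sqrt(a_bar_t)
    # Re-noise it to the (possibly much earlier) target timestep
    return torch.sqrt(a_bar_prev) * x0_hat + torch.sqrt(1.0 - a_bar_prev) * eps_hat
```

Because the target step is arbitrary, a few dozen such jumps can replace the full t = T…0 chain.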
Code as a byproduct (one reverse step)
import torch
def reverse_step(x_t, t, model, alpha_bar_t, beta_t):
    """One ancestral step of p_theta(x_{t-1} | x_t); model predicts the noise eps.

    alpha_bar_t is the cumulative product of alpha_s = 1 - beta_s up to step t.
    """
    eps_hat = model(x_t, t)
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_t)
    if t > 0:
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
    return mean  # the final step (t = 0) is noiseless
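Putting the step into a full loop, a self-contained ancestral sampler might look like the sketch below (it assumes the model predicts ε; `sample` is an illustrative name):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Ancestral sampling from t = T-1 down to 0, assuming model(x, t) predicts eps."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # start from pure noise
    for t in range(betas.shape[0] - 1, -1, -1):
        eps_hat = model(x, t)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean                            # last step adds no noise
    return x
```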
Principles, not prescriptions
- Choose parameterizations that match the noise process for stable training.
- Schedules control trade-offs between learning and sampling cost.
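As one concrete schedule, the cosine schedule of Nichol & Dhariwal keeps ᾱ_t from collapsing too quickly early in the process. A sketch (`cosine_betas` is an illustrative name; the 0.008 offset and 0.999 clamp follow the common formulation):

```python
import math
import torch

def cosine_betas(T, s=0.008):
    """Cosine noise schedule: define alpha_bar via a squared cosine, derive betas."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]                         # normalize so alpha_bar[0] = 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1] # per-step noise from the ratio
    return betas.clamp(max=0.999).float()        # clamp to avoid a degenerate last step
```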
Common pitfalls
- Mismatch between training and sampling parameterizations.
- Under-conditioning in conditional tasks; use guidance or better encoders.
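Guidance in the classifier-free form is a two-pass extrapolation at sampling time. A sketch assuming the model accepts a conditioning argument, with `None` standing in for the dropped/null condition (both names are illustrative):

```python
import torch

def cfg_eps(model, x_t, t, cond, w):
    """Classifier-free guidance: push the prediction from unconditional toward conditional.

    w = 1 recovers the conditional model; w > 1 strengthens conditioning.
    """
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)
```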
Connections and contrasts
- See also: [/blog/score-based-modeling], [/blog/normalizing-flows], [/blog/gans].
Quick checks
- Why predict noise? — Yields a well-conditioned loss tied to the diffusion process.
- How to speed up sampling? — Use DDIM or solver-based samplers.
- Relation to scores? — Reverse dynamics depend on score ∇_x log p_t(x).
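The score connection can be made explicit: under the usual ᾱ_t notation, a trained noise predictor is (up to a known scale) an estimate of the score of the noised marginal, which is what lets the SDE view reuse it directly:

```latex
\nabla_{x}\log p_t(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t,\,t)}{\sqrt{1-\bar\alpha_t}}
```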
Further reading
- Ho et al., Denoising Diffusion Probabilistic Models (DDPM)
- Song et al., Score-Based Generative Modeling through Stochastic Differential Equations
- Song et al., Denoising Diffusion Implicit Models (DDIM)