Variational Inference and the ELBO
Takeaway
Variational inference turns Bayesian inference into optimization: choose a tractable family q_ϕ(z|x) and fit it to p(z|x) by maximizing the ELBO (equivalently, minimizing KL(q||p)).
The problem (before → after)
- Before: Exact posteriors are intractable; MCMC can be slow.
- After: Pick a parametric family and optimize a lower bound on log evidence; scale with stochastic gradients and amortization.
Mental model first
Imagine squeezing a flexible mold (q) around a complex statue (p). You cannot capture every groove, but with the right shape and pressure you approximate the parts that matter for decisions.
Just-in-time concepts
- Evidence Lower Bound (ELBO): log p(x) ≥ E_q[log p(x,z) − log q(z|x)]; the gap between the two sides is exactly KL(q_ϕ(z|x) || p(z|x)), which is why maximizing the ELBO minimizes that KL.
- Mean-field: q factorizes across latent dimensions; simple but biased.
- Reparameterization trick: write z = g_ϕ(ε, x) with ε ∼ p(ε), so gradients flow through the sample with low variance (a minimal Gaussian sketch follows this list).
- Amortized VI: an inference network maps each x to its variational parameters, so per-data-point inference is a single forward pass (effectively O(1)) rather than a separate optimization.
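To make the reparameterization trick and the ELBO concrete, here is a minimal sketch for a diagonal-Gaussian posterior and a standard-normal prior. The function name and the log_likelihood callback (standing in for the decoder's log p(x|z)) are illustrative assumptions, not part of any particular library.

import torch

def gaussian_reparam_elbo(x, mu, log_sigma, log_likelihood):
    # Single-sample ELBO for q(z|x) = N(mu, diag(sigma^2)) and prior p(z) = N(0, I).
    # `log_likelihood(x, z)` is a hypothetical stand-in for the decoder's log p(x|z).
    eps = torch.randn_like(mu)               # eps ~ N(0, I)
    z = mu + log_sigma.exp() * eps           # pathwise (reparameterized) sample
    # Analytic KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
    return log_likelihood(x, z) - kl         # gradients reach mu and log_sigma through z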
First-pass solution
Choose q_ϕ(z|x) that is easy to sample from and has a tractable density; derive the ELBO; estimate its gradients stochastically over minibatches; update ϕ (and the model parameters θ) with Adam (see the code section below).
Iterative refinement
- Control variates and baselines reduce gradient variance.
- Richer families: normalizing flows, mixture posteriors, hierarchical VI (a single flow layer is sketched after this list).
- Structured VI: Keep dependencies where they matter (e.g., time series, trees).
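As one concrete instance of the "richer families" bullet, below is a sketch of a single planar-flow layer in the spirit of Rezende & Mohamed: it warps a reparameterized base sample and returns the log-determinant correction to subtract from log q. The class name and initialization are illustrative; a practical posterior stacks several layers and constrains u so the map stays invertible.

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    # One planar-flow layer: f(z) = z + u * tanh(w.z + b).
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                          # z: (batch, dim)
        lin = z @ self.w + self.b                  # (batch,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det df/dz| = log|1 + (u.w) * (1 - tanh^2(w.z + b))|
        log_det = torch.log(torch.abs(1 + (self.u @ self.w) * (1 - torch.tanh(lin) ** 2)) + 1e-8)
        return f, log_det                          # log q(f(z)) = log q(z) - log_det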
Code as a byproduct (reparameterized ELBO)
import math
import torch

def elbo(model, guide, x, K=1):
    # Monte Carlo bound on log p(x): the standard ELBO for K = 1 and the tighter
    # importance-weighted (IWAE) bound for K > 1.
    log_w = []
    for _ in range(K):
        z, log_q = guide.sample_with_log_prob(x)  # reparameterized sample and log q_ϕ(z|x)
        log_p = model.log_joint(x, z)             # log p_θ(x, z)
        log_w.append(log_p - log_q)
    # Average the importance weights in log space: logsumexp(log_w) - log K.
    return torch.stack(log_w).logsumexp(0) - math.log(K)
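A minimal usage sketch under assumed interfaces: the Guide and Model classes below are hypothetical stand-ins that expose the sample_with_log_prob and log_joint methods used above (they are not from any library), and the loop maximizes the ELBO with Adam as described in the first-pass solution.

import torch
import torch.nn as nn

class Guide(nn.Module):
    # Hypothetical amortized Gaussian guide: an encoder maps x to (mu, log_sigma).
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Linear(x_dim, 2 * z_dim)

    def sample_with_log_prob(self, x):
        mu, log_sigma = self.net(x).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        z = dist.rsample()                           # reparameterized sample
        return z, dist.log_prob(z).sum(-1)           # z and log q_ϕ(z|x)

class Model(nn.Module):
    # Hypothetical model: standard-normal prior and a unit-variance Gaussian likelihood.
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.decoder = nn.Linear(z_dim, x_dim)

    def log_joint(self, x, z):
        prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        lik = torch.distributions.Normal(self.decoder(z), 1.0).log_prob(x).sum(-1)
        return prior + lik                           # log p_θ(x, z)

x_dim, z_dim = 10, 2
guide, model = Guide(x_dim, z_dim), Model(x_dim, z_dim)
opt = torch.optim.Adam(list(model.parameters()) + list(guide.parameters()), lr=1e-3)
for x in [torch.randn(32, x_dim) for _ in range(100)]:   # toy minibatches
    loss = -elbo(model, guide, x).mean()                 # maximize ELBO = minimize -ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()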
Principles, not prescriptions
- Bias–variance trade-off: richer q reduces bias but may increase variance and cost.
- Optimize what you care about: ELBO sharpens posteriors relevant to predictions.
Common pitfalls
- Posterior collapse when the decoder is too strong; use KL annealing or free bits (a sketch follows this list).
- Ignoring calibration: good ELBO does not always mean calibrated uncertainty.
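To make the annealing/free-bits remedy above concrete, here is a hedged sketch of a VAE-style objective with a linear KL warm-up and a per-dimension free-bits floor. The function name, the warmup_steps and free_bits defaults, and the assumption that an analytic per-dimension KL is available (as in the Gaussian case) are all illustrative choices.

import torch

def annealed_free_bits_loss(recon_log_lik, kl_per_dim, step, warmup_steps=10_000, free_bits=0.5):
    # recon_log_lik: (batch,) values of log p(x|z); kl_per_dim: (batch, z_dim) analytic KL terms.
    beta = min(1.0, step / warmup_steps)           # linear KL warm-up from 0 to 1
    kl = kl_per_dim.clamp(min=free_bits).sum(-1)   # dimensions below the floor get no extra pressure
    return -(recon_log_lik - beta * kl).mean()     # minimize this with the same optimizer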
Connections and contrasts
- See also: [/blog/black-box-vi], [/blog/normalizing-flows], [/blog/attention-is-all-you-need] (amortization via encoders).
Quick checks
- Why maximize ELBO? — It’s a tractable lower bound to log evidence that tightens as q approaches p.
- Why reparameterize? — Lower-variance gradients than score-function estimators.
- When avoid mean-field? — When correlations matter for decisions.
Further reading
- Kingma & Welling, Auto-Encoding Variational Bayes (the VAE); Rezende & Mohamed, Variational Inference with Normalizing Flows
- Source tutorial (above)