Variational Inference and the ELBO
Takeaway
Variational inference turns Bayesian inference into optimization: choose a tractable family q_ϕ(z|x) and fit it to p(z|x) by maximizing the ELBO (equivalently, minimizing KL(q||p)).
The problem (before → after)
- Before: Exact posteriors are intractable; MCMC can be slow.
- After: Pick a parametric family and optimize a lower bound on log evidence; scale with stochastic gradients and amortization.
Mental model first
Imagine squeezing a flexible mold (q) around a complex statue (p). You cannot capture every groove, but with the right shape and pressure you approximate the parts that matter for decisions.
Just-in-time concepts
- Evidence Lower Bound (ELBO): log p(x) ≥ E_q[log p(x,z) − log q(z|x)]; the gap between the two sides is exactly KL(q_ϕ(z|x) || p(z|x)), which is why maximizing the ELBO minimizes that KL.
- Mean-field: q factorizes across latent dimensions; simple but biased.
- Reparameterization trick: write z = g_ϕ(ε, x) with ε ∼ p(ε), so gradients flow through the sample with low variance (a minimal Gaussian sketch follows this list).
- Amortized VI: an inference network maps each x to its variational parameters, so per-data-point inference is a single forward pass (effectively O(1)) rather than a separate optimization.
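To make the reparameterization trick and the ELBO concrete, here is a minimal sketch for a diagonal-Gaussian posterior and a standard-normal prior. The function name and the log_likelihood callback (standing in for the decoder's log p(x|z)) are illustrative assumptions, not part of any particular library.

import torch

def gaussian_reparam_elbo(x, mu, log_sigma, log_likelihood):
    # Single-sample ELBO for q(z|x) = N(mu, diag(sigma^2)) and prior p(z) = N(0, I).
    # `log_likelihood(x, z)` is a hypothetical stand-in for the decoder's log p(x|z).
    eps = torch.randn_like(mu)               # eps ~ N(0, I)
    z = mu + log_sigma.exp() * eps           # pathwise (reparameterized) sample
    # Analytic KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
    return log_likelihood(x, z) - kl         # gradients reach mu and log_sigma through z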
First-pass solution
Choose q_ϕ(z|x) that is easy to sample from and has a tractable density; derive the ELBO; estimate its gradients stochastically over minibatches; update ϕ (and the model parameters θ) with Adam (see the code section below).
Iterative refinement
- Control variates and baselines reduce gradient variance.
- Richer families: normalizing flows, mixture posteriors, hierarchical VI (a single flow layer is sketched after this list).
- Structured VI: Keep dependencies where they matter (e.g., time series, trees).
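As one concrete instance of the "richer families" bullet, below is a sketch of a single planar-flow layer in the spirit of Rezende & Mohamed: it warps a reparameterized base sample and returns the log-determinant correction to subtract from log q. The class name and initialization are illustrative; a practical posterior stacks several layers and constrains u so the map stays invertible.

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    # One planar-flow layer: f(z) = z + u * tanh(w.z + b).
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                          # z: (batch, dim)
        lin = z @ self.w + self.b                  # (batch,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        # log|det df/dz| = log|1 + (u.w) * (1 - tanh^2(w.z + b))|
        log_det = torch.log(torch.abs(1 + (self.u @ self.w) * (1 - torch.tanh(lin) ** 2)) + 1e-8)
        return f, log_det                          # log q(f(z)) = log q(z) - log_det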
Code as a byproduct (reparameterized ELBO)
import math
import torch

def elbo(model, guide, x, K=1):
    # Monte Carlo bound on log p(x): the standard ELBO for K = 1 and the tighter
    # importance-weighted (IWAE) bound for K > 1.
    log_w = []
    for _ in range(K):
        z, log_q = guide.sample_with_log_prob(x)  # reparameterized sample and log q_ϕ(z|x)
        log_p = model.log_joint(x, z)             # log p_θ(x, z)
        log_w.append(log_p - log_q)
    # Average the importance weights in log space: logsumexp(log_w) - log K.
    return torch.stack(log_w).logsumexp(0) - math.log(K)
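A minimal usage sketch under assumed interfaces: the Guide and Model classes below are hypothetical stand-ins that expose the sample_with_log_prob and log_joint methods used above (they are not from any library), and the loop maximizes the ELBO with Adam as described in the first-pass solution.

import torch
import torch.nn as nn

class Guide(nn.Module):
    # Hypothetical amortized Gaussian guide: an encoder maps x to (mu, log_sigma).
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Linear(x_dim, 2 * z_dim)

    def sample_with_log_prob(self, x):
        mu, log_sigma = self.net(x).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        z = dist.rsample()                           # reparameterized sample
        return z, dist.log_prob(z).sum(-1)           # z and log q_ϕ(z|x)

class Model(nn.Module):
    # Hypothetical model: standard-normal prior and a unit-variance Gaussian likelihood.
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.decoder = nn.Linear(z_dim, x_dim)

    def log_joint(self, x, z):
        prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        lik = torch.distributions.Normal(self.decoder(z), 1.0).log_prob(x).sum(-1)
        return prior + lik                           # log p_θ(x, z)

x_dim, z_dim = 10, 2
guide, model = Guide(x_dim, z_dim), Model(x_dim, z_dim)
opt = torch.optim.Adam(list(model.parameters()) + list(guide.parameters()), lr=1e-3)
for x in [torch.randn(32, x_dim) for _ in range(100)]:   # toy minibatches
    loss = -elbo(model, guide, x).mean()                 # maximize ELBO = minimize -ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()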
Principles, not prescriptions
- Bias–variance trade-off: richer q reduces bias but may increase variance and cost.
- Optimize what you care about: ELBO sharpens posteriors relevant to predictions.
Common pitfalls
- Posterior collapse when the decoder is too strong; use KL annealing or free bits (a sketch follows this list).
- Ignoring calibration: good ELBO does not always mean calibrated uncertainty.
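To make the annealing/free-bits remedy above concrete, here is a hedged sketch of a VAE-style objective with a linear KL warm-up and a per-dimension free-bits floor. The function name, the warmup_steps and free_bits defaults, and the assumption that an analytic per-dimension KL is available (as in the Gaussian case) are all illustrative choices.

import torch

def annealed_free_bits_loss(recon_log_lik, kl_per_dim, step, warmup_steps=10_000, free_bits=0.5):
    # recon_log_lik: (batch,) values of log p(x|z); kl_per_dim: (batch, z_dim) analytic KL terms.
    beta = min(1.0, step / warmup_steps)           # linear KL warm-up from 0 to 1
    kl = kl_per_dim.clamp(min=free_bits).sum(-1)   # dimensions below the floor get no extra pressure
    return -(recon_log_lik - beta * kl).mean()     # minimize this with the same optimizer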
Connections and contrasts
- See also: [/blog/black-box-vi], [/blog/normalizing-flows], [/blog/attention-is-all-you-need] (amortization via encoders).
Quick checks
- Why maximize ELBO? — It’s a tractable lower bound to log evidence that tightens as q approaches p.
- Why reparameterize? — Lower-variance gradients than score-function estimators.
- When avoid mean-field? — When correlations matter for decisions.
Further reading
- Kingma & Welling, Auto-Encoding Variational Bayes (the VAE); Rezende & Mohamed, Variational Inference with Normalizing Flows
- Source tutorial (above)