Expectation–Maximization (EM) Algorithm
Takeaway
EM maximizes the likelihood in models with latent variables by alternating between inferring the hidden structure (E-step) and optimizing the parameters given that structure (M-step).
The problem (before → after)
- Before: Latent variables make direct maximization of the log-likelihood hard, because the sum over hidden configurations sits inside the log.
- After: Lower-bound the log-likelihood with a surrogate that is tight at the current parameters, then alternate between tightening and maximizing that bound (derivation sketched below).
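The bound comes from Jensen's inequality: for any distribution q over the latent Z,

$$
\log p(X \mid \theta)
= \log \sum_{Z} q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}
\;\ge\; \sum_{Z} q(Z)\,\log \frac{p(X, Z \mid \theta)}{q(Z)},
$$

with equality exactly when q(Z) = p(Z | X, θ). Setting q to the posterior at θ^t makes the bound tight there (E-step); maximizing the bound over θ is, up to a term constant in θ, maximizing Q(θ|θ^t) (M-step).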
Mental model first
Imagine sorting blurry photos into albums: first, guess which album each photo belongs to (E-step); then, update album covers (parameters) to best fit the assignments (M-step). Repeat until stable.
Just-in-time concepts
- Expected complete-data log-likelihood: Q(θ|θ^t) = E_{Z|X,θ^t}[log p(X,Z|θ)], the objective the M-step maximizes.
- Monotonicity: Each EM step does not decrease the observed-data log-likelihood.
- Running example: the GMM, with closed-form responsibilities and parameter updates (stated below).
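For reference, the standard closed forms for a K-component GMM, writing r_{nk} for the responsibility of component k for point x_n:

$$
r_{nk} = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)},
\qquad
N_k = \sum_{n=1}^{N} r_{nk},
$$

$$
\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k}\sum_{n} r_{nk}\, x_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{n} r_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^\top.
$$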
First-pass solution
Initialize θ; E-step: compute responsibilities; M-step: update parameters; repeat until the log-likelihood stops improving. A minimal loop is sketched below.
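A minimal sketch of that loop for a GMM, wiring together the `estep_gmm` from the code section below and the companion `mstep_gmm` sketched after it; `fit_gmm_em` and `loglik_gmm` are names I introduce here for illustration, not canonical APIs:

```python
import numpy as np
from scipy.stats import multivariate_normal

def loglik_gmm(X, pis, mus, sigmas):
    """Observed-data log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    dens = sum(pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
               for k in range(len(pis)))
    return np.log(dens).sum()

def fit_gmm_em(X, pis, mus, sigmas, max_iter=200, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood gain drops below tol."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        resp = estep_gmm(X, pis, mus, sigmas)   # E-step (code section below)
        pis, mus, sigmas = mstep_gmm(X, resp)   # M-step (sketched below)
        ll = loglik_gmm(X, pis, mus, sigmas)
        if ll - prev_ll < tol:                  # monotonicity makes this test safe
            break
        prev_ll = ll
    return pis, mus, sigmas
```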
Iterative refinement
- Variants: ECM, ECME, generalized EM.
- Initialization via k-means for GMMs; annealing to escape poor local maxima.
- EM as coordinate ascent on the ELBO, with q set to the exact posterior in the E-step (decomposition below).
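The decomposition behind that view, for any q(Z):

$$
\log p(X \mid \theta) = \mathrm{ELBO}(q, \theta) + \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X, \theta)\big),
\qquad
\mathrm{ELBO}(q, \theta) = \mathbb{E}_{q}\!\left[\log \frac{p(X, Z \mid \theta)}{q(Z)}\right].
$$

The E-step maximizes the ELBO over q (driving the KL term to zero); the M-step maximizes it over θ. Both steps are coordinate ascent on the same objective, which is why the observed-data log-likelihood never decreases.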
Code as a byproduct (GMM E-step)
```python
import numpy as np
from scipy.stats import multivariate_normal

def estep_gmm(X, pis, mus, sigmas):
    """E-step: responsibilities resp[n, k] = p(z_n = k | x_n, current params)."""
    N, K = X.shape[0], len(pis)
    resp = np.zeros((N, K))
    for k in range(K):
        # Unnormalized: prior weight times component density at each point.
        resp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    # Normalize rows so each point's responsibilities sum to one.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp
```
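For completeness, a matching M-step implementing the closed-form updates stated earlier; `mstep_gmm` is my name for this sketch:

```python
import numpy as np

def mstep_gmm(X, resp):
    """M-step: closed-form weighted-MLE updates from responsibilities."""
    N, K = resp.shape
    Nk = resp.sum(axis=0)                         # effective count per component
    pis = Nk / N                                  # mixing weights
    mus = [resp[:, k] @ X / Nk[k] for k in range(K)]
    sigmas = []
    for k in range(K):
        d = X - mus[k]                            # centered data, shape (N, dim)
        sigmas.append((resp[:, k, None] * d).T @ d / Nk[k])
    return pis, mus, sigmas
```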
Principles, not prescriptions
- Use problem structure for closed-form E/M steps where possible.
- Monitor the log-likelihood every iteration; guard against singularities in mixtures (a simple covariance floor is sketched below).
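One simple guard, compatible with the `mstep_gmm` sketch above: add a small ridge to each covariance so no component can collapse onto a single point. The 1e-6 floor is an illustrative choice, not a canonical value:

```python
import numpy as np

def regularize_covs(sigmas, eps=1e-6):
    """Add eps to the diagonal so covariances stay positive definite."""
    dim = sigmas[0].shape[0]
    return [S + eps * np.eye(dim) for S in sigmas]
```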
Common pitfalls
- Poor initialization → local maxima; degenerate covariances in GMMs.
- Misinterpreting monotonicity as convergence to the global optimum.
Connections and contrasts
- See also: [/blog/variational-inference], [/blog/black-box-vi].
Quick checks
- Why the monotone increase? — Each step maximizes a lower bound that is tight at the current parameters, so the observed-data log-likelihood cannot decrease.
- Why k-means init for GMMs? — It provides sensible cluster seeds cheaply (see the sketch after this list).
- When to use generalized EM? — When the exact M-step is hard; any update that improves Q suffices.
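A common initialization recipe, sketched here with scikit-learn's `KMeans`; `init_from_kmeans` and the 1e-6 diagonal floor are illustrative choices, not the only ones:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    """Seed GMM parameters from a k-means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pis = np.bincount(labels, minlength=K) / len(X)      # cluster proportions
    mus = [X[labels == k].mean(axis=0) for k in range(K)]
    sigmas = [np.cov(X[labels == k].T) + 1e-6 * np.eye(X.shape[1])
              for k in range(K)]                         # floored empirical covs
    return pis, mus, sigmas
```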
Further reading
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.