Takeaway

EM maximizes likelihoods with latent variables by alternating between inferring hidden structure (E-step) and optimizing parameters given that structure (M-step).

The problem (before → after)

  • Before: Latent variables make direct maximization of log-likelihood hard.
  • After: Construct a lower bound on the log-likelihood that is tight at the current parameters, maximize that bound, and repeat.
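
For any distribution q(Z) over the latent variables, the bound in question is the standard identity (sometimes called the ELBO)

  log p(X|θ) ≥ E_{q(Z)}[log p(X, Z|θ)] − E_{q(Z)}[log q(Z)],

with equality exactly when q(Z) = p(Z|X, θ^t); that choice is what makes the bound tight at the current parameters.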

Mental model first

Imagine sorting blurry photos into albums: first, guess which album each photo belongs to (E-step); then, update album covers (parameters) to best fit the assignments (M-step). Repeat until stable.

Just-in-time concepts

  • Complete-data log-likelihood log p(X, Z|θ) and its conditional expectation Q(θ|θ^t) = E_{Z|X,θ^t}[log p(X, Z|θ)].
  • Monotonicity: Each EM step does not decrease the observed-data log-likelihood.
  • Examples: GMM with closed-form responsibilities and parameter updates.
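
For the GMM example in the last bullet, the standard closed forms are, with N_k = sum_n r_{nk}:

  E-step:  r_{nk} = π_k N(x_n|μ_k, Σ_k) / sum_j π_j N(x_n|μ_j, Σ_j)
  M-step:  π_k = N_k / N,   μ_k = (1/N_k) sum_n r_{nk} x_n,   Σ_k = (1/N_k) sum_n r_{nk} (x_n − μ_k)(x_n − μ_k)^T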

First-pass solution

Initialize θ; in the E-step, compute responsibilities; in the M-step, update parameters; repeat until the observed-data log-likelihood stops improving.
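
A minimal driver loop for this recipe, assuming the estep_gmm function shown under "Code as a byproduct" below and a companion mstep_gmm sketched after it (fit_gmm and mstep_gmm are illustrative names, not from the original post):

import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, pis, mus, sigmas, max_iter=100, tol=1e-6):
    # Alternate E- and M-steps until the observed-data log-likelihood stops improving.
    prev_ll = -np.inf
    for _ in range(max_iter):
        resp = estep_gmm(X, pis, mus, sigmas)        # E-step: responsibilities
        pis, mus, sigmas = mstep_gmm(X, resp)        # M-step: closed-form updates
        # Observed-data log-likelihood for the convergence check.
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(len(pis))
        ])
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:                       # monotone up to numerical error
            break
        prev_ll = ll
    return pis, mus, sigmas, ll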

Iterative refinement

  1. Variants: ECM, ECME, generalized EM.
  2. Initialization via k-means for GMMs (see the sketch after this list); annealing to escape poor local maxima.
  3. EM as coordinate ascent on the ELBO: the E-step sets q to the current posterior, the M-step optimizes θ.
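
As a sketch of the k-means initialization mentioned in item 2, using scikit-learn's KMeans (the ridge constant reg is illustrative, and each cluster is assumed to contain at least two points):

import numpy as np
from sklearn.cluster import KMeans

def init_gmm_kmeans(X, K, reg=1e-6):
    # Seed GMM parameters from a k-means clustering of X.
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    pis = np.array([(labels == k).mean() for k in range(K)])
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Within-cluster covariances, with a small ridge for numerical stability.
    sigmas = np.array([
        np.cov(X[labels == k], rowvar=False) + reg * np.eye(X.shape[1])
        for k in range(K)
    ])
    return pis, mus, sigmas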

Code as a byproduct (GMM E-step)

import numpy as np
from scipy.stats import multivariate_normal

def estep_gmm(X, pis, mus, sigmas):
    # E-step: responsibilities resp[n, k] = p(z_n = k | x_n, current parameters).
    K = len(pis)
    N = X.shape[0]
    resp = np.zeros((N, K))
    for k in range(K):
        # Mixing weight times component density (scipy stands in for the undefined gaussian_pdf helper).
        resp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    # Normalize each row so the responsibilities sum to one over the K components.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp
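
For completeness, a matching M-step sketch under the same conventions, reusing the numpy import above (mstep_gmm is a hypothetical companion to estep_gmm, not part of the original post):

def mstep_gmm(X, resp):
    # Closed-form GMM updates from responsibilities: weights, means, covariances.
    N, K = resp.shape
    Nk = resp.sum(axis=0)                  # effective number of points per component
    pis = Nk / N
    mus = (resp.T @ X) / Nk[:, None]
    sigmas = np.zeros((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mus[k]
        sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas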

Principles, not prescriptions

  • Use problem structure for closed-form E/M steps where possible.
  • Monitor log-likelihood; guard against singularities in mixtures.
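
One common guard for the second bullet, a small ridge added to each covariance after the M-step (the floor min_eig is illustrative, not prescriptive):

import numpy as np

def regularize_covariances(sigmas, min_eig=1e-6):
    # Add just enough of the identity so every covariance stays positive definite.
    sigmas = np.array(sigmas, dtype=float)
    d = sigmas.shape[-1]
    for k in range(len(sigmas)):
        smallest = np.linalg.eigvalsh(sigmas[k]).min()
        if smallest < min_eig:
            sigmas[k] += (min_eig - smallest) * np.eye(d)
    return sigmas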

Common pitfalls

  • Poor initialization → local maxima; degenerate covariances in GMMs.
  • Misinterpreting monotonicity as convergence to global optimum.

Connections and contrasts

  • See also: [/blog/variational-inference], [/blog/black-box-vi].

Quick checks

  1. Why monotone increase? — EM maximizes a tight lower bound each step.
  2. Why k-means init for GMMs? — Provides sensible cluster seeds.
  3. When use generalized EM? — When the exact M-step is hard; any parameter update that increases Q suffices (see the sketch below).
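
To make check 3 concrete, a minimal generalized-EM sketch: instead of maximizing Q exactly, take any update that increases it, here a single finite-difference gradient-ascent step on a caller-supplied Q (function name, learning rate, and step size are illustrative):

import numpy as np

def gem_partial_mstep(Q, theta, lr=1e-2, eps=1e-5):
    # One gradient-ascent step on Q(. | theta_t); any increase in Q preserves EM's monotonicity.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (Q(theta + e) - Q(theta - e)) / (2 * eps)
    return theta + lr * grad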

Further reading

  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (source above)