Takeaway

EM maximizes likelihoods with latent variables by alternating between inferring hidden structure (E-step) and optimizing parameters given that structure (M-step).

The problem (before → after)

  • Before: Latent variables make direct maximization of log-likelihood hard.
  • After: Construct a lower bound on the log-likelihood that is tight at the current parameters, maximize that bound, and repeat.
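
For any distribution q(Z) over the latent variables, the bound in question is the standard identity (sometimes called the ELBO)

  log p(X|θ) ≥ E_{q(Z)}[log p(X, Z|θ)] − E_{q(Z)}[log q(Z)],

with equality exactly when q(Z) = p(Z|X, θ^t); that choice is what makes the bound tight at the current parameters.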

Mental model first

Imagine sorting blurry photos into albums: first, guess which album each photo belongs to (E-step); then, update album covers (parameters) to best fit the assignments (M-step). Repeat until stable.

Just-in-time concepts

  • Complete-data log-likelihood log p(X, Z|θ) and its conditional expectation Q(θ|θ^t) = E_{Z|X,θ^t}[log p(X, Z|θ)].
  • Monotonicity: Each EM step does not decrease the observed-data log-likelihood.
  • Examples: GMM with closed-form responsibilities and parameter updates.
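
For the GMM example in the last bullet, the standard closed forms are, with N_k = sum_n r_{nk}:

  E-step:  r_{nk} = π_k N(x_n|μ_k, Σ_k) / sum_j π_j N(x_n|μ_j, Σ_j)
  M-step:  π_k = N_k / N,   μ_k = (1/N_k) sum_n r_{nk} x_n,   Σ_k = (1/N_k) sum_n r_{nk} (x_n − μ_k)(x_n − μ_k)^T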

First-pass solution

Initialize θ; in the E-step, compute responsibilities; in the M-step, update parameters; repeat until the observed-data log-likelihood stops improving.
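
A minimal driver loop for this recipe, assuming the estep_gmm function shown under "Code as a byproduct" below and a companion mstep_gmm sketched after it (fit_gmm and mstep_gmm are illustrative names, not from the original post):

import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, pis, mus, sigmas, max_iter=100, tol=1e-6):
    # Alternate E- and M-steps until the observed-data log-likelihood stops improving.
    prev_ll = -np.inf
    for _ in range(max_iter):
        resp = estep_gmm(X, pis, mus, sigmas)        # E-step: responsibilities
        pis, mus, sigmas = mstep_gmm(X, resp)        # M-step: closed-form updates
        # Observed-data log-likelihood for the convergence check.
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(len(pis))
        ])
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:                       # monotone up to numerical error
            break
        prev_ll = ll
    return pis, mus, sigmas, ll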

Iterative refinement

  1. Variants: ECM, ECME, generalized EM.
  2. Initialization via k-means for GMMs (see the sketch after this list); annealing to escape poor local maxima.
  3. EM as coordinate ascent on the ELBO: the E-step sets q to the current posterior, the M-step optimizes θ.
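
As a sketch of the k-means initialization mentioned in item 2, using scikit-learn's KMeans (the ridge constant reg is illustrative, and each cluster is assumed to contain at least two points):

import numpy as np
from sklearn.cluster import KMeans

def init_gmm_kmeans(X, K, reg=1e-6):
    # Seed GMM parameters from a k-means clustering of X.
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    pis = np.array([(labels == k).mean() for k in range(K)])
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Within-cluster covariances, with a small ridge for numerical stability.
    sigmas = np.array([
        np.cov(X[labels == k], rowvar=False) + reg * np.eye(X.shape[1])
        for k in range(K)
    ])
    return pis, mus, sigmas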

Code as a byproduct (GMM E-step)

import numpy as np
from scipy.stats import multivariate_normal

def estep_gmm(X, pis, mus, sigmas):
    # E-step: responsibilities resp[n, k] = p(z_n = k | x_n, current parameters).
    K = len(pis)
    N = X.shape[0]
    resp = np.zeros((N, K))
    for k in range(K):
        # Mixing weight times component density (scipy stands in for the undefined gaussian_pdf helper).
        resp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    # Normalize each row so the responsibilities sum to one over the K components.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp
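
For completeness, a matching M-step sketch under the same conventions, reusing the numpy import above (mstep_gmm is a hypothetical companion to estep_gmm, not part of the original post):

def mstep_gmm(X, resp):
    # Closed-form GMM updates from responsibilities: weights, means, covariances.
    N, K = resp.shape
    Nk = resp.sum(axis=0)                  # effective number of points per component
    pis = Nk / N
    mus = (resp.T @ X) / Nk[:, None]
    sigmas = np.zeros((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mus[k]
        sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas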

Principles, not prescriptions

  • Use problem structure for closed-form E/M steps where possible.
  • Monitor log-likelihood; guard against singularities in mixtures.
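
One common guard for the second bullet, a small ridge added to each covariance after the M-step (the floor min_eig is illustrative, not prescriptive):

import numpy as np

def regularize_covariances(sigmas, min_eig=1e-6):
    # Add just enough of the identity so every covariance stays positive definite.
    sigmas = np.array(sigmas, dtype=float)
    d = sigmas.shape[-1]
    for k in range(len(sigmas)):
        smallest = np.linalg.eigvalsh(sigmas[k]).min()
        if smallest < min_eig:
            sigmas[k] += (min_eig - smallest) * np.eye(d)
    return sigmas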

Common pitfalls

  • Poor initialization → local maxima; degenerate covariances in GMMs.
  • Misinterpreting monotonicity as convergence to global optimum.

Connections and contrasts

  • See also: [/blog/variational-inference], [/blog/black-box-vi].

Quick checks

  1. Why monotone increase? — EM maximizes a tight lower bound each step.
  2. Why k-means init for GMMs? — Provides sensible cluster seeds.
  3. When use generalized EM? — When the exact M-step is hard; any parameter update that increases Q suffices (see the sketch below).
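
To make check 3 concrete, a minimal generalized-EM sketch: instead of maximizing Q exactly, take any update that increases it, here a single finite-difference gradient-ascent step on a caller-supplied Q (function name, learning rate, and step size are illustrative):

import numpy as np

def gem_partial_mstep(Q, theta, lr=1e-2, eps=1e-5):
    # One gradient-ascent step on Q(. | theta_t); any increase in Q preserves EM's monotonicity.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (Q(theta + e) - Q(theta - e)) / (2 * eps)
    return theta + lr * grad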

Further reading

  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (source above)