Dirichlet Processes — Bayesian Nonparametrics
Takeaway
The Dirichlet process defines a prior over discrete distributions with countably infinite support, enabling models to grow complexity with data (e.g., infinite mixture models).
The problem (before → after)
- Before: Fixed-K mixture models force a choice of K up front and risk under- or overfitting.
- After: DP mixtures infer the number of clusters from data with principled uncertainty.
Mental model first
Chinese restaurant process: customers (data points) arrive one at a time and either join an occupied table with probability proportional to the number of customers already seated there, or start a new table with probability proportional to the concentration parameter α; tables correspond to clusters.
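A minimal simulation of this seating process, assuming NumPy; the function name `crp_sample` is an illustrative choice, not part of any library:

```python
# Simulate CRP(alpha) seating for a stream of customers.
import numpy as np

def crp_sample(n_customers, alpha, rng=None):
    """Return table assignments and table sizes after seating n_customers."""
    rng = np.random.default_rng() if rng is None else rng
    assignments = [0]                     # the first customer opens table 0
    table_counts = [1]
    for _ in range(1, n_customers):
        # Occupied tables attract new customers in proportion to their size;
        # a new table is opened with weight alpha.
        weights = np.array(table_counts + [alpha], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(table_counts):
            table_counts.append(1)        # new table
        else:
            table_counts[table] += 1
        assignments.append(table)
    return assignments, table_counts
```

Running this with larger α produces more, smaller tables; with small α most customers crowd onto a few tables.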
Just-in-time concepts
- DP(α, G₀): stick-breaking β_k ∼ Beta(1, α), π_k = β_k ∏_{j<k} (1−β_j), atoms θ_k ∼ G₀, giving G = ∑_k π_k δ_{θ_k} (sketched in code after this list).
- Exchangeability and Pólya urn representation.
- Gibbs sampling and collapsed inference.
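A short sketch of the stick-breaking construction above, with an illustrative truncation level and a standard-normal base measure standing in for G₀:

```python
# Truncated stick-breaking draw from (approximately) DP(alpha, G0).
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=None):
    """pi_k = beta_k * prod_{j<k} (1 - beta_j), with beta_k ~ Beta(1, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=2.0, truncation=100, rng=rng)
atoms = rng.standard_normal(100)   # theta_k ~ G0 = N(0, 1), an illustrative base measure
# G ≈ sum_k weights[k] * delta(atoms[k]) — an (almost surely) discrete draw
```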
First-pass solution
Define a DP mixture: assign data points to clusters via the CRP, draw each cluster's parameters from the base measure G₀, alternate sampling of assignments and parameters (Gibbs), and compute predictive densities by averaging over posterior samples.
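A hedged sketch of that loop for 1-D Gaussian data with known variance and a conjugate normal base measure, in the spirit of collapsed Gibbs sampling (Neal 2000); the hyperparameters `sigma2`, `mu0`, `tau0_2` and all names are illustrative assumptions:

```python
# Collapsed Gibbs for a DP mixture of 1-D Gaussians (known variance).
import numpy as np

def predictive_logpdf(x, total, count, sigma2=1.0, mu0=0.0, tau0_2=10.0):
    """Log predictive density of x under a cluster with sufficient stats (total, count)."""
    post_var = 1.0 / (1.0 / tau0_2 + count / sigma2)
    post_mean = post_var * (mu0 / tau0_2 + total / sigma2)
    var = post_var + sigma2
    return -0.5 * (np.log(2 * np.pi * var) + (x - post_mean) ** 2 / var)

def dp_mixture_gibbs(x, alpha=1.0, n_iters=200, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    z = np.zeros(n, dtype=int)                    # start with everyone in one cluster
    sums, counts = {0: float(x.sum())}, {0: n}
    for _ in range(n_iters):
        for i in range(n):
            k = z[i]                              # unseat point i
            sums[k] -= x[i]
            counts[k] -= 1
            if counts[k] == 0:
                del sums[k], counts[k]
            labels = list(counts)
            # Existing clusters weighted by size, a new cluster weighted by alpha,
            # each times its predictive density for x[i].
            log_w = np.array(
                [np.log(counts[c]) + predictive_logpdf(x[i], sums[c], counts[c]) for c in labels]
                + [np.log(alpha) + predictive_logpdf(x[i], 0.0, 0)]
            )
            w = np.exp(log_w - log_w.max())
            choice = int(rng.choice(len(w), p=w / w.sum()))
            k = max(counts, default=-1) + 1 if choice == len(labels) else labels[choice]
            z[i] = k
            sums[k] = sums.get(k, 0.0) + x[i]
            counts[k] = counts.get(k, 0) + 1
    return z
```

Predictive densities for new points can then be averaged over the sampled assignments and cluster statistics.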
Iterative refinement
- Hierarchical DPs share components across groups.
- Truncation and variational inference for scalability.
- Pitman–Yor processes favor power-law cluster sizes.
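As a sketch of the Pitman–Yor point, only the seating rule changes: a discount d is subtracted from each table's weight and added, once per occupied table, to the new-table weight (names and defaults are illustrative; α > −d, 0 ≤ d < 1):

```python
# Two-parameter (Pitman-Yor) seating rule; d > 0 yields power-law table sizes.
import numpy as np

def pitman_yor_sample(n_customers, alpha, d, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    table_counts = [1]                                  # first customer opens a table
    for _ in range(1, n_customers):
        k = len(table_counts)
        weights = np.array([c - d for c in table_counts] + [alpha + k * d])
        table = int(rng.choice(k + 1, p=weights / weights.sum()))
        if table == k:
            table_counts.append(1)                      # open a new table
        else:
            table_counts[table] += 1
    return table_counts
```

With d = 0 this reduces to the ordinary CRP.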
Principles, not prescriptions
- Let data determine complexity; avoid rigid K.
- Use conjugacy for efficient inference when possible.
Common pitfalls
- Label switching and mixing issues in MCMC.
- Sensitivity to α and base measure choices.
Connections and contrasts
- See also: [/blog/variational-inference], [/blog/black-box-vi].
Quick checks
- Why DP mixtures? — Flexible clustering with uncertainty over K.
- What does α control? — The tendency to create new clusters; larger α yields more clusters (see the short check after this list).
- Why discrete draws? — Stick-breaking shows a DP draw is a countable sum of weighted atoms, so it is almost surely discrete, which is what induces clustering.
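A short check of the α question: under the CRP, the expected number of occupied tables after n customers is ∑_{i=1}^{n} α/(α + i − 1), which grows roughly like α log n.

```python
# Expected number of CRP clusters as a function of alpha (illustrative check).
import numpy as np

def expected_clusters(n, alpha):
    i = np.arange(1, n + 1)
    return float(np.sum(alpha / (alpha + i - 1)))

for alpha in (0.5, 2.0, 10.0):
    print(alpha, round(expected_clusters(1000, alpha), 1))   # grows ~ alpha * log(n)
```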
Further reading
- Ferguson (1973); Neal (2000); Blei et al. tutorials