Optimal Transport for Machine Learning
Takeaway
Optimal transport measures the minimal cost of moving the mass of one distribution onto another; the resulting Wasserstein distances provide geometry-aware losses for ML and statistics.
The problem (before → after)
- Before: Divergences like KL blow up or are undefined when supports don’t overlap, and they ignore the geometry of the underlying space.
- After: Wasserstein distances reflect cost over space, remain finite with disjoint supports, and yield stable gradients.
Mental model first
Imagine moving piles of dirt into holes; the cost of moving each chunk of dirt is its mass times the distance it travels. The cheapest overall plan defines the distance between the two shapes.
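A minimal sketch of that picture in one dimension, using SciPy’s 1D Wasserstein helper (the pile locations and weights below are made up for illustration):

```python
from scipy.stats import wasserstein_distance

# Pile P: half its mass at x=0, half at x=1.
# Pile Q: half its mass at x=2, half at x=3.
p_locations, p_weights = [0.0, 1.0], [0.5, 0.5]
q_locations, q_weights = [2.0, 3.0], [0.5, 0.5]

# Cheapest plan moves each half-unit of mass a distance of 2,
# so total work = 0.5*2 + 0.5*2 = 2.0.
w1 = wasserstein_distance(p_locations, q_locations, p_weights, q_weights)
print(w1)  # 2.0
```

Note that the supports of P and Q are disjoint, so a KL-based comparison would be uninformative, while W₁ returns a finite, geometry-aware number.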
Just-in-time concepts
- Monge (a deterministic transport map) vs. Kantorovich (a coupling γ that may split mass) formulations; the Kantorovich relaxation is the one solved in practice.
- Wasserstein-1 (Earth Mover’s distance): W₁(P,Q) = inf_{γ ∈ Π(P,Q)} E_{(x,y)∼γ}[||x−y||], where Π(P,Q) is the set of couplings with marginals P and Q.
- Entropic OT: adds a regularization term −ε H(γ) to the objective (written out below), making it strictly convex and solvable with fast Sinkhorn (matrix-scaling) iterations.
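Written out for the discrete case, under one common convention (a and b are the two histograms, C the cost matrix):

```latex
\min_{\gamma \in \Pi(a,b)} \ \langle C, \gamma \rangle \;-\; \varepsilon H(\gamma),
\qquad
H(\gamma) = -\sum_{i,j} \gamma_{ij}\bigl(\log \gamma_{ij} - 1\bigr),
\qquad
\Pi(a,b) = \{\gamma \ge 0 : \gamma\mathbf{1} = a,\ \gamma^{\top}\mathbf{1} = b\}.
```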
First-pass solution
Compute exact OT with a linear program for small problems, or approximately with Sinkhorn; use dual forms for gradients; apply the resulting distance as a loss for domain alignment, generative modeling, and robustness.
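A minimal Sinkhorn sketch in NumPy; the function name, ε value, and toy data are illustrative, not taken from any particular library:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT via Sinkhorn matrix scaling.

    a, b : histograms that each sum to 1; C : cost matrix; eps : regularization strength.
    Returns the transport plan gamma and the transport cost <C, gamma>.
    For very small eps, switch to log-domain updates to avoid underflow.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):              # alternate scaling to match both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]   # plan = diag(u) K diag(v)
    return gamma, float(np.sum(gamma * C))

# Toy usage: two small point clouds with uniform weights and squared-Euclidean cost.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 2)), rng.normal(loc=1.0, size=(40, 2))
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
gamma, cost = sinkhorn(a, b, C)
print(cost)
```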
Iterative refinement
- Wasserstein barycenters and transport-map estimation.
- Sliced and projected OT for scalability (see the sketch after this list).
- Unbalanced OT relaxes the marginal constraints to handle mass creation and destruction.
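A minimal sketch of sliced W₁ for two equally sized point clouds (function name and defaults are illustrative; each random 1D projection is solved exactly by sorting, which is what makes this cheap):

```python
import numpy as np

def sliced_wasserstein_1(X, Y, n_projections=100, seed=0):
    """Monte Carlo estimate of sliced W1 between two point clouds.

    Assumes X and Y have the same number of samples with uniform weights.
    """
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(n_projections, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # random unit directions
    X_proj, Y_proj = X @ thetas.T, Y @ thetas.T               # shape (n, n_projections)
    # 1D W1 between equal-weight samples = mean absolute difference of sorted values;
    # averaging over directions gives the sliced distance.
    return float(np.mean(np.abs(np.sort(X_proj, axis=0) - np.sort(Y_proj, axis=0))))

# Toy usage: two 64-dimensional clouds that differ by a mean shift.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 64))
Y = rng.normal(loc=0.5, size=(512, 64))
print(sliced_wasserstein_1(X, Y))
```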
Principles, not prescriptions
- Choose a ground cost that reflects the task’s geometry; regularize (e.g., entropically) for speed.
- Duality yields efficient gradients and intuitive potentials (the W₁ dual is shown below).
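For W₁ in particular, the dual takes the Kantorovich–Rubinstein form: optimize over 1-Lipschitz potentials f instead of couplings, which is the objective that WGAN-style critics approximate:

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]
```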
Common pitfalls
- Computational cost grows quickly with sample size, and estimates degrade in high dimensions; approximate (entropic, sliced) judiciously.
- A mis-specified ground cost undermines performance, since the distance inherits its geometry from the cost.
Connections and contrasts
- See also: [/blog/normalizing-flows], [/blog/gans], [/blog/black-box-vi].
Quick checks
- Why OT vs KL? — Finite, geometry-aware distance with disjoint supports.
- What is Sinkhorn? — Entropy-regularized OT solved via matrix scaling.
- Where used? — Domain adaptation, generative modeling, fairness.
Further reading
- Peyré & Cuturi, Computational Optimal Transport (2019); ML survey (source above)