Takeaway

Optimal transport measures the minimum cost of moving one probability distribution onto another; the resulting Wasserstein distances provide geometry-aware losses for ML and statistics.

The problem (before → after)

  • Before: Divergences like KL blow up to infinity when supports don’t overlap and are blind to how far apart the distributions sit in space.
  • After: Wasserstein distances charge for distance moved, stay finite on disjoint supports, and yield stable gradients (a quick numeric check follows).
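A hedged sketch of the contrast in NumPy/SciPy: the histogram-based KL below diverges as soon as the supports stop overlapping, while `scipy.stats.wasserstein_distance` (the 1-D W₁) reports roughly the gap between the two modes. The distribution parameters and bin count are illustrative choices, not from the source.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two point clouds with essentially disjoint supports.
rng = np.random.default_rng(0)
p = rng.normal(0.0, 0.1, size=1000)   # samples near 0
q = rng.normal(5.0, 0.1, size=1000)   # samples near 5

# Histogram-based KL: any bin with p-mass but no q-mass makes it infinite.
bins = np.linspace(-1.0, 6.0, 70)
hp, _ = np.histogram(p, bins=bins)
hq, _ = np.histogram(q, bins=bins)
hp, hq = hp / hp.sum(), hq / hq.sum()
if np.any((hp > 0) & (hq == 0)):
    kl = np.inf
else:
    kl = np.sum(hp[hp > 0] * np.log(hp[hp > 0] / hq[hp > 0]))
print(kl)                         # inf: KL gives no usable signal here

# W1 stays finite and tracks the actual distance between the modes (~5).
print(wasserstein_distance(p, q))
```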

Mental model first

Imagine moving piles of dirt into holes; the cost of a plan is its total work: mass moved times distance moved. The cheapest plan defines the distance between the two shapes (a tiny brute-force example follows).
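To make "cheapest plan" concrete, here is a minimal brute-force sketch; the positions and the helper `plan_cost` are made up for illustration. With unit masses, every transport plan is an assignment of piles to holes, so the OT distance is the cost of the cheapest assignment.

```python
from itertools import permutations

# Hypothetical positions on a line: piles of dirt (unit mass) and holes.
piles = [0.0, 1.0, 4.0]
holes = [1.5, 2.0, 5.0]

def plan_cost(assignment):
    # Work of a plan = sum over moves of mass (here 1) times distance.
    return sum(abs(p - holes[h]) for p, h in zip(piles, assignment))

# Enumerate all pile-to-hole assignments and keep the cheapest.
best = min(permutations(range(len(holes))), key=plan_cost)
print(best, plan_cost(best))      # cheapest plan and its total work
```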

Just-in-time concepts

  • Monge vs Kantorovich formulations: Monge seeks a deterministic map; Kantorovich relaxes to couplings γ with marginals P and Q.
  • Wasserstein-1 (Earth Mover’s): W₁(P,Q) = inf_{γ∈Π(P,Q)} E_{(x,y)∼γ}[||x−y||], the infimum over all couplings of P and Q.
  • Entropic OT: Adds −ε H(γ) to the objective, smoothing the problem so it can be solved by fast Sinkhorn matrix-scaling iterations (sketched below).
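A minimal Sinkhorn sketch in NumPy, assuming histogram inputs and a precomputed cost matrix; the function name and toy data are mine, not the source's. For serious use, log-domain stabilization and a library such as POT are the usual route.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=500):
    """Entropy-regularized OT between histograms a, b with cost matrix C.

    Plain (non-log-domain) Sinkhorn: alternately rescale the Gibbs kernel
    so the plan's row and column sums match the two marginals.
    """
    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # match column marginal
        u = a / (K @ v)                    # match row marginal
    gamma = u[:, None] * K * v[None, :]    # entropic transport plan
    return gamma, np.sum(gamma * C)        # plan and its transport cost

# Toy check: two uniform histograms on point sets shifted by 0.2.
x = np.linspace(0.0, 1.0, 5)[:, None]
y = x + 0.2
C = np.abs(x - y.T)                        # |x − y| cost, so this tracks W1
a = b = np.full(5, 0.2)
gamma, cost = sinkhorn(a, b, C)
print(cost)                                # ≈ 0.2 (blurred slightly by ε)
```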

First-pass solution

Compute OT exactly via a linear program (a small sketch follows) or approximately via Sinkhorn; use dual forms for gradients; apply the result as a loss for domain alignment, generative modeling, and robustness.
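One hedged way to set up the Kantorovich LP with `scipy.optimize.linprog`; the flattening scheme and problem sizes are illustrative, and dedicated solvers (e.g. POT's `ot.emd`) are far faster at scale.

```python
import numpy as np
from scipy.optimize import linprog

# Kantorovich LP: minimize <C, gamma> s.t. row sums = a, column sums = b,
# gamma >= 0, with the plan flattened row-major into one variable vector.
rng = np.random.default_rng(0)
n, m = 4, 5
C = rng.random((n, m))                     # illustrative cost matrix
a = np.full(n, 1 / n)                      # source marginal
b = np.full(m, 1 / m)                      # target marginal

A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1         # row-sum constraints
for j in range(m):
    A_eq[n + j, j::m] = 1                  # column-sum constraints

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
gamma = res.x.reshape(n, m)                # optimal transport plan
print(res.fun)                             # exact OT cost
```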

Iterative refinement

  1. Wasserstein barycenters and transport-map estimation, for averaging and interpolating distributions.
  2. Sliced and projected OT for scalability: reduce to many 1-D problems, each solvable by sorting (see the sketch after this list).
  3. Unbalanced OT relaxes the marginal constraints so mass can be created or destroyed.
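A sliced Wasserstein-1 sketch under the usual assumptions (equal-size point clouds, uniform weights); the function name and Monte Carlo settings are illustrative, not from the source.

```python
import numpy as np

def sliced_w1(X, Y, n_projections=200, seed=0):
    """Monte Carlo sliced W1 between equal-size point clouds X, Y (n × d).

    Project both clouds onto random unit directions; in 1-D, W1 between
    uniform empirical measures is just the mean gap after sorting.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_projections, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    total = 0.0
    for t in theta:
        px, py = np.sort(X @ t), np.sort(Y @ t)
        total += np.mean(np.abs(px - py))  # closed-form 1-D W1
    return total / n_projections

X = np.random.default_rng(1).normal(size=(500, 10))
Y = X + 2.0                                # shift every coordinate by 2
print(sliced_w1(X, Y))                     # grows with the size of the shift
```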

Principles, not prescriptions

  • Choose a cost that reflects the task’s geometry; add regularization when speed matters more than exactness.
  • Duality yields efficient gradients and intuitive potentials (the W₁ dual is stated below).
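For reference, the Kantorovich–Rubinstein dual of W₁, the standard identity behind learned critics:

```latex
% Kantorovich–Rubinstein duality: W1 as a supremum over 1-Lipschitz potentials.
\[
  W_1(P, Q) \;=\; \sup_{\operatorname{Lip}(f) \le 1}
    \; \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]
\]
% Gradients of a W1 loss flow through the optimal potential f, which is
% why dual solutions double as learned critics (e.g. in WGANs).
```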

Common pitfalls

  • Computing OT is expensive in high dimensions (exact solvers scale roughly cubically, and sample complexity degrades); approximate judiciously with entropic or sliced variants.
  • A mis-specified cost function undermines everything built on top of it.

Connections and contrasts

  • See also: [/blog/normalizing-flows], [/blog/gans], [/blog/black-box-vi].

Quick checks

  1. Why OT vs KL? — Finite, geometry-aware distance with disjoint supports.
  2. What is Sinkhorn? — Entropy-regularized OT solved via matrix scaling.
  3. Where used? — Domain adaptation, generative modeling, fairness.

Further reading

  • Peyré & Cuturi, Computational Optimal Transport (2019); ML survey (source above)