Optimal Transport for Machine Learning
Takeaway
Optimal transport measures the minimal cost of moving the mass of one distribution onto another; the resulting Wasserstein distances provide geometry-aware losses for ML and statistics.
The problem (before → after)
- Before: Divergences like KL blow up or are undefined when supports don’t overlap, and they ignore the geometry of the underlying space.
- After: Wasserstein distances reflect cost over space, remain finite with disjoint supports, and yield stable gradients.
Mental model first
Imagine moving piles of dirt into holes; the cost of moving each chunk of dirt is its mass times the distance it travels. The cheapest overall plan defines the distance between the two shapes.
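A minimal sketch of that picture in one dimension, using SciPy’s 1D Wasserstein helper (the pile locations and weights below are made up for illustration):

```python
from scipy.stats import wasserstein_distance

# Pile P: half its mass at x=0, half at x=1.
# Pile Q: half its mass at x=2, half at x=3.
p_locations, p_weights = [0.0, 1.0], [0.5, 0.5]
q_locations, q_weights = [2.0, 3.0], [0.5, 0.5]

# Cheapest plan moves each half-unit of mass a distance of 2,
# so total work = 0.5*2 + 0.5*2 = 2.0.
w1 = wasserstein_distance(p_locations, q_locations, p_weights, q_weights)
print(w1)  # 2.0
```

Note that the supports of P and Q are disjoint, so a KL-based comparison would be uninformative, while W₁ returns a finite, geometry-aware number.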
Just-in-time concepts
- Monge (a deterministic transport map) vs. Kantorovich (a coupling γ that may split mass) formulations; the Kantorovich relaxation is the one solved in practice.
- Wasserstein-1 (Earth Mover’s distance): W₁(P,Q) = inf_{γ ∈ Π(P,Q)} E_{(x,y)∼γ}[||x−y||], where Π(P,Q) is the set of couplings with marginals P and Q.
- Entropic OT: adds a regularization term −ε H(γ) to the objective (written out below), making it strictly convex and solvable with fast Sinkhorn (matrix-scaling) iterations.
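Written out for the discrete case, under one common convention (a and b are the two histograms, C the cost matrix):

```latex
\min_{\gamma \in \Pi(a,b)} \ \langle C, \gamma \rangle \;-\; \varepsilon H(\gamma),
\qquad
H(\gamma) = -\sum_{i,j} \gamma_{ij}\bigl(\log \gamma_{ij} - 1\bigr),
\qquad
\Pi(a,b) = \{\gamma \ge 0 : \gamma\mathbf{1} = a,\ \gamma^{\top}\mathbf{1} = b\}.
```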
First-pass solution
Compute exact OT with a linear program for small problems, or approximately with Sinkhorn; use dual forms for gradients; apply the resulting distance as a loss for domain alignment, generative modeling, and robustness.
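A minimal Sinkhorn sketch in NumPy; the function name, ε value, and toy data are illustrative, not taken from any particular library:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT via Sinkhorn matrix scaling.

    a, b : histograms that each sum to 1; C : cost matrix; eps : regularization strength.
    Returns the transport plan gamma and the transport cost <C, gamma>.
    For very small eps, switch to log-domain updates to avoid underflow.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):              # alternate scaling to match both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]   # plan = diag(u) K diag(v)
    return gamma, float(np.sum(gamma * C))

# Toy usage: two small point clouds with uniform weights and squared-Euclidean cost.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 2)), rng.normal(loc=1.0, size=(40, 2))
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
gamma, cost = sinkhorn(a, b, C)
print(cost)
```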
Iterative refinement
- Wasserstein barycenters and transport-map estimation.
- Sliced and projected OT for scalability (see the sketch after this list).
- Unbalanced OT relaxes the marginal constraints to handle mass creation and destruction.
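A minimal sketch of sliced W₁ for two equally sized point clouds (function name and defaults are illustrative; each random 1D projection is solved exactly by sorting, which is what makes this cheap):

```python
import numpy as np

def sliced_wasserstein_1(X, Y, n_projections=100, seed=0):
    """Monte Carlo estimate of sliced W1 between two point clouds.

    Assumes X and Y have the same number of samples with uniform weights.
    """
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(n_projections, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # random unit directions
    X_proj, Y_proj = X @ thetas.T, Y @ thetas.T               # shape (n, n_projections)
    # 1D W1 between equal-weight samples = mean absolute difference of sorted values;
    # averaging over directions gives the sliced distance.
    return float(np.mean(np.abs(np.sort(X_proj, axis=0) - np.sort(Y_proj, axis=0))))

# Toy usage: two 64-dimensional clouds that differ by a mean shift.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 64))
Y = rng.normal(loc=0.5, size=(512, 64))
print(sliced_wasserstein_1(X, Y))
```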
Principles, not prescriptions
- Choose a ground cost that reflects the task’s geometry; regularize (e.g., entropically) for speed.
- Duality yields efficient gradients and intuitive potentials (the W₁ dual is shown below).
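For W₁ in particular, the dual takes the Kantorovich–Rubinstein form: optimize over 1-Lipschitz potentials f instead of couplings, which is the objective that WGAN-style critics approximate:

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]
```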
Common pitfalls
- Computational cost grows quickly with sample size, and estimates degrade in high dimensions; approximate (entropic, sliced) judiciously.
- A mis-specified ground cost undermines performance, since the distance inherits its geometry from the cost.
Connections and contrasts
- See also: [/blog/normalizing-flows], [/blog/gans], [/blog/black-box-vi].
Quick checks
- Why OT vs KL? — Finite, geometry-aware distance with disjoint supports.
- What is Sinkhorn? — Entropy-regularized OT solved via matrix scaling.
- Where used? — Domain adaptation, generative modeling, fairness.
Further reading
- Peyré & Cuturi, Computational Optimal Transport (2019); ML survey (source above)