Double Machine Learning
Takeaway
Double ML uses orthogonal moments and cross-fitting to estimate low-dimensional causal/structural parameters while controlling bias from high-dimensional nuisance components learned by ML.
The problem (before → after)
- Before: Naively plugging ML estimates of nuisance functions into a standard moment condition passes their regularization and overfitting bias straight into the target parameter, which can then converge more slowly than √n.
- After: Orthogonalize the moment condition so that first-order errors in the nuisance estimates cancel, and cross-fit so each observation is scored with nuisances fit on other folds.
Mental model first
Imagine balancing a scale with noisy weights; by arranging the pans (moments) so that small errors cancel, you still read the true mass (parameter) accurately.
Just-in-time concepts
- Neyman orthogonality: the moment is locally insensitive to the nuisance at the truth, i.e. the directional (Gateaux) derivative ∂ E[m(W; θ₀, η)] / ∂η = 0 at η = η₀ (a worked example follows this list).
- Cross-fitting: split the data into folds; fit the nuisances on the complement of each fold; evaluate the moments only on the held-out fold.
- Asymptotic normality: θ̂ is √n-consistent and asymptotically normal, so standard confidence intervals are valid, provided the nuisance estimators converge fast enough (roughly, the product of their error rates is o(1/√n)).
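Worked example (the standard partially linear model from the DML literature): take Y = θ₀·D + g₀(X) + ε and D = m₀(X) + v, and write ℓ₀(X) = E[Y|X], m₀(X) = E[D|X]. The partialling-out score ψ(W; θ, η) = (Y − ℓ(X) − θ·(D − m(X)))·(D − m(X)), with nuisance η = (ℓ, m), satisfies the condition above: perturbing ℓ or m around (ℓ₀, m₀) changes E[ψ] only at second order, because every first-order term is multiplied by one of the mean-zero residuals ε or v.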
First-pass solution
Estimate the outcome and treatment (propensity) regressions with flexible ML; form a Neyman-orthogonal score from their residuals; solve the empirical moment for θ̂ using cross-fitted scores; compute standard errors from the influence function. A minimal sketch is below.
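The sketch below implements this for the partially linear model from the worked example, assuming scikit-learn is available; `dml_plr`, its argument names, and the random-forest nuisance learners are illustrative choices, not the only ones.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def dml_plr(y, d, X, learner_y=None, learner_d=None, n_folds=5, seed=0):
    """Cross-fitted DML for the partially linear model y = theta*d + g(X) + eps."""
    if learner_y is None:
        learner_y = RandomForestRegressor(random_state=seed)
    if learner_d is None:
        learner_d = RandomForestRegressor(random_state=seed)
    n = len(y)
    y_res, d_res = np.zeros(n), np.zeros(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Fit nuisance regressions E[y|X] and E[d|X] on the training folds only.
        ml_y = clone(learner_y).fit(X[train], y[train])
        ml_d = clone(learner_d).fit(X[train], d[train])
        # Residualize on the held-out fold: this is the cross-fitting step.
        y_res[test] = y[test] - ml_y.predict(X[test])
        d_res[test] = d[test] - ml_d.predict(X[test])
    # Solve the empirical orthogonal moment E[(y_res - theta*d_res)*d_res] = 0.
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    # Standard error from the influence function of the partialling-out score.
    psi = (y_res - theta * d_res) * d_res
    J = np.mean(d_res ** 2)
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / n)
    return theta, se


# Illustrative check on synthetic data with true theta = 1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
d = np.sin(X[:, 0]) + 0.5 * rng.normal(size=2000)
y = 1.0 * d + np.cos(X[:, 1]) + rng.normal(size=2000)
theta_hat, se_hat = dml_plr(y, d, X)
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se_hat:.3f}")
```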
Iterative refinement
- High-dimensional controls: Lasso/boosting/RF for nuisances (see the learner swap sketched after this list).
- Heterogeneous effects via orthogonalization within strata.
- Debiased ML for other targets (ATE, IV, policy value).
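For instance, in the hypothetical `dml_plr` sketch above the nuisance learners are just arguments, so the first refinement is a one-line swap:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

# Sparse, approximately linear confounding: cross-validated Lasso nuisances.
theta_lasso, se_lasso = dml_plr(y, d, X, learner_y=LassoCV(), learner_d=LassoCV())

# Smooth nonlinear confounding: gradient-boosted nuisances.
theta_gbm, se_gbm = dml_plr(
    y, d, X,
    learner_y=GradientBoostingRegressor(),
    learner_d=GradientBoostingRegressor(),
)
```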
Principles, not prescriptions
- Build estimators robust to small nuisance errors.
- Separate fitting and estimating stages to prevent overfitting bias.
Common pitfalls
- Violating overlap/positivity assumptions (a quick diagnostic is sketched after this list).
- Using the same data fold for fitting and scoring.
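For binary-treatment targets such as the ATE, whose orthogonal (AIPW) score divides by the propensity, a hypothetical diagnostic along these lines helps catch the first pitfall; `check_overlap` and its threshold are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def check_overlap(d_binary, X, eps=0.01, seed=0):
    """Flag positivity problems: cross-fitted propensities near 0 or 1 make
    the inverse-propensity terms of the AIPW score explode."""
    clf = RandomForestClassifier(random_state=seed)
    # Out-of-fold estimates of P(D = 1 | X), mirroring the cross-fitting step.
    p_hat = cross_val_predict(clf, X, d_binary, cv=5, method="predict_proba")[:, 1]
    frac_extreme = np.mean((p_hat < eps) | (p_hat > 1 - eps))
    print(f"share with propensity outside [{eps}, {1 - eps}]: {frac_extreme:.1%}")
    return p_hat
```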
Connections and contrasts
- See also: [/blog/causal-trees], [/blog/multi-armed-bandits], [/blog/pac-bayes].
Quick checks
- Why orthogonal moments? — To remove first-order bias from nuisance estimation.
- Why cross-fitting? — To avoid adaptive overfitting when scoring.
- What assumptions? — Overlap/positivity, unconfoundedness for causal targets, smoothness or sparsity giving fast-enough nuisance rates, bounded moments.
Further reading
- Chernozhukov et al. (2018), "Double/Debiased Machine Learning for Treatment and Structural Parameters" (source above)
- Semiparametric efficiency literature