Double Machine Learning
Takeaway
Double ML uses orthogonal moments and cross-fitting to estimate low-dimensional causal/structural parameters while controlling bias from high-dimensional nuisance components learned by ML.
The problem (before → after)
- Before: Naively plugging ML estimates of nuisance functions into a standard moment condition passes their regularization and overfitting bias straight into the target parameter, which can then converge more slowly than √n.
- After: Orthogonalize the moment condition so that first-order errors in the nuisance estimates cancel, and cross-fit so each observation is scored with nuisances fit on other folds.
Mental model first
Imagine balancing a scale with noisy weights; by arranging the pans (moments) so that small errors cancel, you still read the true mass (parameter) accurately.
Just-in-time concepts
- Neyman orthogonality: the moment is locally insensitive to the nuisance at the truth, i.e. the directional (Gateaux) derivative ∂ E[m(W; θ₀, η)] / ∂η = 0 at η = η₀ (a worked example follows this list).
- Cross-fitting: split the data into folds; fit the nuisances on the complement of each fold; evaluate the moments only on the held-out fold.
- Asymptotic normality: θ̂ is √n-consistent and asymptotically normal, so standard confidence intervals are valid, provided the nuisance estimators converge fast enough (roughly, the product of their error rates is o(1/√n)).
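Worked example (the standard partially linear model from the DML literature): take Y = θ₀·D + g₀(X) + ε and D = m₀(X) + v, and write ℓ₀(X) = E[Y|X], m₀(X) = E[D|X]. The partialling-out score ψ(W; θ, η) = (Y − ℓ(X) − θ·(D − m(X)))·(D − m(X)), with nuisance η = (ℓ, m), satisfies the condition above: perturbing ℓ or m around (ℓ₀, m₀) changes E[ψ] only at second order, because every first-order term is multiplied by one of the mean-zero residuals ε or v.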
First-pass solution
Estimate the outcome and treatment (propensity) regressions with flexible ML; form a Neyman-orthogonal score from their residuals; solve the empirical moment for θ̂ using cross-fitted scores; compute standard errors from the influence function. A minimal sketch is below.
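The sketch below implements this for the partially linear model from the worked example, assuming scikit-learn is available; `dml_plr`, its argument names, and the random-forest nuisance learners are illustrative choices, not the only ones.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def dml_plr(y, d, X, learner_y=None, learner_d=None, n_folds=5, seed=0):
    """Cross-fitted DML for the partially linear model y = theta*d + g(X) + eps."""
    if learner_y is None:
        learner_y = RandomForestRegressor(random_state=seed)
    if learner_d is None:
        learner_d = RandomForestRegressor(random_state=seed)
    n = len(y)
    y_res, d_res = np.zeros(n), np.zeros(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Fit nuisance regressions E[y|X] and E[d|X] on the training folds only.
        ml_y = clone(learner_y).fit(X[train], y[train])
        ml_d = clone(learner_d).fit(X[train], d[train])
        # Residualize on the held-out fold: this is the cross-fitting step.
        y_res[test] = y[test] - ml_y.predict(X[test])
        d_res[test] = d[test] - ml_d.predict(X[test])
    # Solve the empirical orthogonal moment E[(y_res - theta*d_res)*d_res] = 0.
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    # Standard error from the influence function of the partialling-out score.
    psi = (y_res - theta * d_res) * d_res
    J = np.mean(d_res ** 2)
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / n)
    return theta, se


# Illustrative check on synthetic data with true theta = 1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
d = np.sin(X[:, 0]) + 0.5 * rng.normal(size=2000)
y = 1.0 * d + np.cos(X[:, 1]) + rng.normal(size=2000)
theta_hat, se_hat = dml_plr(y, d, X)
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se_hat:.3f}")
```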
Iterative refinement
- High-dimensional controls: Lasso/boosting/RF for nuisances (see the learner swap sketched after this list).
- Heterogeneous effects via orthogonalization within strata.
- Debiased ML for other targets (ATE, IV, policy value).
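For instance, in the hypothetical `dml_plr` sketch above the nuisance learners are just arguments, so the first refinement is a one-line swap:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

# Sparse, approximately linear confounding: cross-validated Lasso nuisances.
theta_lasso, se_lasso = dml_plr(y, d, X, learner_y=LassoCV(), learner_d=LassoCV())

# Smooth nonlinear confounding: gradient-boosted nuisances.
theta_gbm, se_gbm = dml_plr(
    y, d, X,
    learner_y=GradientBoostingRegressor(),
    learner_d=GradientBoostingRegressor(),
)
```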
Principles, not prescriptions
- Build estimators robust to small nuisance errors.
- Separate fitting and estimating stages to prevent overfitting bias.
Common pitfalls
- Violating overlap/positivity assumptions (a quick diagnostic is sketched after this list).
- Using the same data fold for fitting and scoring.
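For binary-treatment targets such as the ATE, whose orthogonal (AIPW) score divides by the propensity, a hypothetical diagnostic along these lines helps catch the first pitfall; `check_overlap` and its threshold are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def check_overlap(d_binary, X, eps=0.01, seed=0):
    """Flag positivity problems: cross-fitted propensities near 0 or 1 make
    the inverse-propensity terms of the AIPW score explode."""
    clf = RandomForestClassifier(random_state=seed)
    # Out-of-fold estimates of P(D = 1 | X), mirroring the cross-fitting step.
    p_hat = cross_val_predict(clf, X, d_binary, cv=5, method="predict_proba")[:, 1]
    frac_extreme = np.mean((p_hat < eps) | (p_hat > 1 - eps))
    print(f"share with propensity outside [{eps}, {1 - eps}]: {frac_extreme:.1%}")
    return p_hat
```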
Connections and contrasts
- See also: [/blog/causal-trees], [/blog/multi-armed-bandits], [/blog/pac-bayes].
Quick checks
- Why orthogonal moments? — To remove first-order bias from nuisance estimation.
- Why cross-fitting? — To avoid adaptive overfitting when scoring.
- What assumptions? — Overlap/positivity, unconfoundedness for causal targets, smoothness or sparsity giving fast-enough nuisance rates, bounded moments.
Further reading
- Chernozhukov et al. (2018), "Double/Debiased Machine Learning for Treatment and Structural Parameters" (source above)
- Semiparametric efficiency literature