PAC-Bayes Bounds — Data-Dependent Generalization
Takeaway
PAC-Bayes bounds give data-dependent generalization guarantees for randomized predictors, often far tighter than uniform-convergence bounds, by trading off the posterior's empirical loss against its KL divergence to a data-independent prior.
The problem (before → after)
- Before: Classical uniform-convergence PAC bounds charge for worst-case hypothesis-class complexity and are often vacuous for overparameterized models.
- After: Fix a prior over hypotheses before seeing the training data; learn a posterior that balances empirical fit against complexity, measured as the KL divergence to that prior.
Mental model first
It’s like packing a parachute: the more you deviate from the manufacturer’s instructions (prior), the more careful you must be with checks (empirical performance) to guarantee safety.
Just-in-time concepts
- Posterior Q over hypotheses, learned from the data; prior P over the same hypotheses, fixed independently of the training data.
- Bound (PAC-Bayes-kl): with probability at least 1−δ over the n training samples, L(Q) ≤ kl^{-1}(L̂(Q), (KL(Q||P)+log(2√n/δ))/n), where L̂(Q) is the empirical loss of the Gibbs classifier and kl^{-1}(q, c) is the largest p with kl(q||p) ≤ c (see the numerical sketch after this list).
- Flat minima admit broad posteriors (e.g., wide Gaussians around the trained weights) that stay close to a broad prior, so the KL term, and hence the bound, stays small.
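A minimal numerical sketch of evaluating the kl-inverse bound above, in plain Python; the helper names (kl_bernoulli, kl_inverse, pac_bayes_kl_bound) and the example numbers are illustrative, not taken from any specific paper.

```python
import math

def kl_bernoulli(q, p):
    # Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p).
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, budget, tol=1e-9):
    # Largest p in [q, 1] with kl(q || p) <= budget, found by bisection;
    # valid because kl(q || p) is nondecreasing in p on [q, 1].
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_loss, kl_qp, n, delta):
    # L(Q) <= kl^{-1}( L_hat(Q), (KL(Q||P) + log(2*sqrt(n)/delta)) / n )
    budget = (kl_qp + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse(emp_loss, budget)

# Illustrative numbers only: empirical Gibbs loss 0.05, KL(Q||P) = 500 nats,
# n = 60000 training points, confidence delta = 0.05.
print(pac_bayes_kl_bound(emp_loss=0.05, kl_qp=500.0, n=60000, delta=0.05))
```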
First-pass solution
Pick a prior P before training; optimize a posterior Q (e.g., a factorized Gaussian centered at the learned weights) to minimize a PAC-Bayes objective, typically the empirical loss plus a KL-based penalty; report the resulting bound alongside held-out test error.
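A minimal training sketch under stated assumptions: PyTorch, a factorized Gaussian posterior around the weights, a Gaussian prior frozen at initialization, and a Pinsker-style relaxation of the kl bound as the surrogate objective. StochasticLinear, sigma_0, and the layer sizes are illustrative choices, and cross-entropy stands in for the bounded loss the bound formally requires.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior Q = N(mu, sigma^2)
    over its weights and a prior P = N(mu_0, sigma_0^2 I) frozen at init."""
    def __init__(self, d_in, d_out, sigma_0=0.1):
        super().__init__()
        self.mu = nn.Parameter(0.01 * torch.randn(d_out, d_in))
        # softplus(rho) = sigma, initialized so that sigma == sigma_0.
        self.rho = nn.Parameter(torch.full((d_out, d_in),
                                           math.log(math.exp(sigma_0) - 1.0)))
        self.register_buffer("mu_0", self.mu.detach().clone())
        self.sigma_0 = sigma_0

    def sigma(self):
        return F.softplus(self.rho)

    def forward(self, x):
        # Reparameterized weight sample from Q on every forward pass.
        w = self.mu + self.sigma() * torch.randn_like(self.mu)
        return x @ w.t()

    def kl(self):
        # KL( N(mu, sigma^2) || N(mu_0, sigma_0^2) ), summed over all weights.
        s2, s02 = self.sigma() ** 2, self.sigma_0 ** 2
        return 0.5 * ((s2 + (self.mu - self.mu_0) ** 2) / s02
                      - 1.0 + torch.log(s02 / s2)).sum()

def pac_bayes_objective(model, emp_loss, n, delta=0.05):
    # Pinsker-style relaxation of the kl bound, used as a training surrogate:
    # L_hat(Q) + sqrt( (KL(Q||P) + log(2*sqrt(n)/delta)) / (2n) ).
    kl = sum(m.kl() for m in model.modules() if isinstance(m, StochasticLinear))
    return emp_loss + torch.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# One illustrative optimization step (sizes and data are placeholders).
model = nn.Sequential(StochasticLinear(784, 100), nn.ReLU(), StochasticLinear(100, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
emp_loss = F.cross_entropy(model(x), y)  # surrogate for the bounded loss in the bound
opt.zero_grad()
pac_bayes_objective(model, emp_loss, n=60000).backward()
opt.step()
```

The softplus parameterization keeps sigma positive during optimization; to report a certified bound after training, evaluate the kl-inverse form above on the 0-1 Gibbs loss rather than the cross-entropy surrogate.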
Iterative refinement
- Data-dependent priors via a holdout split or hierarchical schemes (a splitting sketch follows this list).
- Sharper bounds via refined inequalities (e.g., Catoni-style).
- Connections to compression schemes and to SGD-induced posteriors.
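A minimal sketch of the holdout-split scheme for data-dependent priors, assuming Python/PyTorch; split_for_prior and the workflow comments are illustrative placeholders for your own training and bound-evaluation code.

```python
import torch

def split_for_prior(n_total, prior_fraction=0.5, seed=0):
    # Disjoint index sets: one to fit the prior, one to evaluate the bound.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n_total, generator=g)
    cut = int(prior_fraction * n_total)
    return perm[:cut], perm[cut:]

prior_idx, bound_idx = split_for_prior(60000)
# Workflow:
# 1. Fit the prior (e.g., the prior mean mu_0 above) using data[prior_idx] only.
# 2. Train the posterior Q on the full data set; only P must stay independent
#    of the bound sample, the posterior may depend on everything.
# 3. Evaluate L_hat(Q) and the bound on data[bound_idx] only, with
#    n = len(bound_idx) in the bound.
```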
Principles, not prescriptions
- Report the prior, the posterior family, and the KL term; never choose the prior in hindsight from the same data without an explicit correction (union bound or data splitting).
- Favor flat, perturbation-robust solutions: they tolerate broad posteriors and keep the KL term small.
Common pitfalls
- Choosing P after seeing the full training data invalidates the guarantee unless the dependence is explicitly accounted for.
- Ignoring the mismatch between the randomized (Gibbs) predictor the bound covers and the single deterministic model deployed at test time; derandomizing requires an extra step (e.g., a majority-vote or margin argument).
Connections and contrasts
- See also: [/blog/pac-learning], [/blog/adversarial-examples].
Quick checks
- Why KL in the bound? — It measures how far the learned posterior Q moves away from the prior P, i.e., the complexity paid for fitting the data.
- Why randomized predictors? — The bound controls the expected loss of the Gibbs classifier, which draws a hypothesis from Q for each prediction.
- Can we use data-dependent priors? — Yes, with proper splitting or penalties.
Further reading
- McAllester (1999) and Catoni (2007) for the classical bounds; Dziugaite & Roy (2017) for nonvacuous PAC-Bayes bounds on deep networks.