Takeaway

PAC-Bayes bounds provide tight, data-dependent generalization guarantees for randomized predictors via a trade-off between empirical loss and KL divergence to a prior.

The problem (before → after)

  • Before: Classical PAC bounds can be loose for complex models.
  • After: Choose a prior over hypotheses before seeing the training data; learn a posterior that balances empirical fit against complexity, measured by the KL divergence to that prior.

Mental model first

It’s like packing a parachute: the more you deviate from the manufacturer’s instructions (prior), the more careful you must be with checks (empirical performance) to guarantee safety.

Just-in-time concepts

  • Posterior Q over hypotheses; prior P chosen independently of the training data.
  • Bound (Seeger/Maurer form): with probability at least 1−δ, L(Q) ≤ kl^{-1}(L̂(Q), (KL(Q||P) + log(2√n/δ))/n), where L̂(Q) is the empirical Gibbs loss and kl^{-1}(q, c) is the largest p with kl(q||p) ≤ c for the binary KL divergence kl. (A numerical sketch follows this list.)
  • Flat minima admit broad posteriors (large tolerance to perturbing the weights), which keeps KL(Q||P) small against a broad prior.
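To make the kl-form bound concrete, here is a minimal numerical sketch in plain Python (the language is my assumption; the post does not fix one). It inverts the binary KL by bisection to turn an empirical Gibbs loss, a KL term, n, and δ into an upper bound on L(Q). The function names (`binary_kl`, `kl_inverse`, `pac_bayes_kl_bound`) are illustrative, not from any particular library.

```python
import math

def binary_kl(q, p, eps=1e-12):
    """kl(q || p) for Bernoulli parameters q and p."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p in [q_hat, 1] with kl(q_hat || p) <= c, found by bisection."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) <= c:
            lo = mid  # mid still satisfies the constraint; move up
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_loss, kl_qp, n, delta):
    """L(Q) <= kl^{-1}(L_hat(Q), (KL(Q||P) + log(2*sqrt(n)/delta)) / n)."""
    c = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse(emp_loss, c)

# Example: 2% empirical Gibbs error, KL = 5 nats, n = 10,000, delta = 0.05
print(pac_bayes_kl_bound(0.02, 5.0, 10_000, 0.05))
```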

First-pass solution

Pick a prior P; optimize a posterior Q (e.g., a Gaussian centered at the learned weights) to minimize a PAC-Bayes objective; report the resulting bound alongside held-out test error.
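A minimal sketch of this first pass, assuming a diagonal Gaussian posterior Q = N(w, σ_q²I) around the learned weights and a zero-mean Gaussian prior P = N(0, σ_p²I). The surrogate objective below is the Pinsker relaxation of the kl bound above (empirical Gibbs loss plus a square-root complexity term); all function names are illustrative, and the empirical Gibbs loss is assumed to be estimated elsewhere by sampling weights from Q.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(0, sigma_p^2 I) ) in nats."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(var_q / var_p + mu_q ** 2 / var_p - 1.0 + np.log(var_p / var_q))

def pac_bayes_objective(emp_gibbs_loss, mu_q, sigma_q, sigma_p, n, delta):
    """Pinsker-relaxed surrogate: L_hat(Q) + sqrt((KL + log(2*sqrt(n)/delta)) / (2n))."""
    kl = gaussian_kl(mu_q, sigma_q, sigma_p)
    return emp_gibbs_loss + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

# Toy usage: posterior centered at trained weights w, sigma_q = 0.1, prior N(0, 1)
w = 0.05 * np.random.randn(1000)
print(pac_bayes_objective(emp_gibbs_loss=0.03, mu_q=w,
                          sigma_q=np.full_like(w, 0.1),
                          sigma_p=1.0, n=50_000, delta=0.05))
```

In practice one would differentiate this objective with respect to mu_q and sigma_q (e.g., via the reparameterization trick) rather than evaluate it once; the sketch only shows how the pieces fit together.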

Iterative refinement

  1. Data-dependent priors via a holdout split or hierarchical schemes (see the sketch after this list).
  2. Tighter bounds via refined inequalities (e.g., Catoni-style).
  3. Connections to compression and to SGD-induced posteriors.
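For item 1, a hedged sketch of the holdout scheme: split the sample, center the prior on weights trained on the first half only, and evaluate the bound on the held-out half, so the prior stays independent of the data the bound uses. `train` and `evaluate_gibbs_loss` are placeholders for the reader's own training and (sampled) Gibbs-loss routines; the shared-variance Gaussian KL and the Pinsker-relaxed bound are simplifying assumptions, not the only option.

```python
import numpy as np

def split_prior_bound(X, y, train, evaluate_gibbs_loss, sigma=0.1, delta=0.05):
    """Data-dependent prior via a 50/50 split; bound computed on the held-out half."""
    n = len(y)
    idx = np.random.permutation(n)
    prior_idx, bound_idx = idx[: n // 2], idx[n // 2:]

    w_prior = train(X[prior_idx], y[prior_idx])  # prior mean: first half only
    w_post = train(X, y)                         # posterior mean may use all data

    # For Q = N(w_post, sigma^2 I) and P = N(w_prior, sigma^2 I):
    # KL(Q||P) = ||w_post - w_prior||^2 / (2 sigma^2)
    kl = np.sum((w_post - w_prior) ** 2) / (2.0 * sigma ** 2)

    m = len(bound_idx)
    emp_loss = evaluate_gibbs_loss(w_post, sigma, X[bound_idx], y[bound_idx])
    # Pinsker-relaxed bound evaluated on the held-out half (m points, not n)
    return emp_loss + np.sqrt((kl + np.log(2.0 * np.sqrt(m) / delta)) / (2.0 * m))
```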

Principles, not prescriptions

  • Report the prior, the posterior, and the KL term; avoid priors chosen in hindsight unless you pay the corresponding correction.
  • Favor flat, robust solutions.

Common pitfalls

  • Choosing P after seeing the full training data invalidates the guarantee; use a split or pay a correction term.
  • Ignoring the mismatch between the randomized (Gibbs) predictor the bound covers and the deterministic predictor deployed at test time.

Connections and contrasts

  • See also: [/blog/pac-learning], [/blog/adversarial-examples].

Quick checks

  1. Why KL in the bound? — Measures complexity relative to prior.
  2. Why randomized predictors? — The bound controls the expected loss of a Gibbs classifier (a hypothesis drawn from Q); deterministic predictions need an extra derandomization step.
  3. Can we use data-dependent priors? — Yes, with proper splitting or penalties.

Further reading

  • McAllester's and Catoni's foundational PAC-Bayes papers; modern PAC-Bayes analyses of deep networks (see the source linked above).