Takeaway

PAC-Bayes bounds provide tight, data-dependent generalization guarantees for randomized predictors via a trade-off between empirical loss and KL divergence to a prior.

The problem (before → after)

  • Before: Classical PAC bounds can be loose for complex models.
  • After: Choose a prior over hypotheses before seeing the training data; learn a posterior that balances empirical fit against complexity, measured by the KL divergence to that prior.

Mental model first

It’s like packing a parachute: the more you deviate from the manufacturer’s instructions (prior), the more careful you must be with checks (empirical performance) to guarantee safety.

Just-in-time concepts

  • Posterior Q over hypotheses; prior P chosen independently of the training data.
  • Bound (Seeger/Maurer form): with probability at least 1−δ, L(Q) ≤ kl^{-1}(L̂(Q), (KL(Q||P) + log(2√n/δ))/n), where L̂(Q) is the empirical Gibbs loss and kl^{-1}(q, c) is the largest p with kl(q||p) ≤ c for the binary KL divergence kl. (A numerical sketch follows this list.)
  • Flat minima admit broad posteriors (large tolerance to perturbing the weights), which keeps KL(Q||P) small against a broad prior.
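To make the kl-form bound concrete, here is a minimal numerical sketch in plain Python (the language is my assumption; the post does not fix one). It inverts the binary KL by bisection to turn an empirical Gibbs loss, a KL term, n, and δ into an upper bound on L(Q). The function names (`binary_kl`, `kl_inverse`, `pac_bayes_kl_bound`) are illustrative, not from any particular library.

```python
import math

def binary_kl(q, p, eps=1e-12):
    """kl(q || p) for Bernoulli parameters q and p."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p in [q_hat, 1] with kl(q_hat || p) <= c, found by bisection."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) <= c:
            lo = mid  # mid still satisfies the constraint; move up
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_loss, kl_qp, n, delta):
    """L(Q) <= kl^{-1}(L_hat(Q), (KL(Q||P) + log(2*sqrt(n)/delta)) / n)."""
    c = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse(emp_loss, c)

# Example: 2% empirical Gibbs error, KL = 5 nats, n = 10,000, delta = 0.05
print(pac_bayes_kl_bound(0.02, 5.0, 10_000, 0.05))
```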

First-pass solution

Pick a prior P; optimize a posterior Q (e.g., a Gaussian centered at the learned weights) to minimize a PAC-Bayes objective; report the resulting bound alongside held-out test error.
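A minimal sketch of this first pass, assuming a diagonal Gaussian posterior Q = N(w, σ_q²I) around the learned weights and a zero-mean Gaussian prior P = N(0, σ_p²I). The surrogate objective below is the Pinsker relaxation of the kl bound above (empirical Gibbs loss plus a square-root complexity term); all function names are illustrative, and the empirical Gibbs loss is assumed to be estimated elsewhere by sampling weights from Q.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(0, sigma_p^2 I) ) in nats."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(var_q / var_p + mu_q ** 2 / var_p - 1.0 + np.log(var_p / var_q))

def pac_bayes_objective(emp_gibbs_loss, mu_q, sigma_q, sigma_p, n, delta):
    """Pinsker-relaxed surrogate: L_hat(Q) + sqrt((KL + log(2*sqrt(n)/delta)) / (2n))."""
    kl = gaussian_kl(mu_q, sigma_q, sigma_p)
    return emp_gibbs_loss + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

# Toy usage: posterior centered at trained weights w, sigma_q = 0.1, prior N(0, 1)
w = 0.05 * np.random.randn(1000)
print(pac_bayes_objective(emp_gibbs_loss=0.03, mu_q=w,
                          sigma_q=np.full_like(w, 0.1),
                          sigma_p=1.0, n=50_000, delta=0.05))
```

In practice one would differentiate this objective with respect to mu_q and sigma_q (e.g., via the reparameterization trick) rather than evaluate it once; the sketch only shows how the pieces fit together.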

Iterative refinement

  1. Data-dependent priors via a holdout split or hierarchical schemes (see the sketch after this list).
  2. Tighter bounds via refined inequalities (e.g., Catoni-style).
  3. Connections to compression and to SGD-induced posteriors.
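For item 1, a hedged sketch of the holdout scheme: split the sample, center the prior on weights trained on the first half only, and evaluate the bound on the held-out half, so the prior stays independent of the data the bound uses. `train` and `evaluate_gibbs_loss` are placeholders for the reader's own training and (sampled) Gibbs-loss routines; the shared-variance Gaussian KL and the Pinsker-relaxed bound are simplifying assumptions, not the only option.

```python
import numpy as np

def split_prior_bound(X, y, train, evaluate_gibbs_loss, sigma=0.1, delta=0.05):
    """Data-dependent prior via a 50/50 split; bound computed on the held-out half."""
    n = len(y)
    idx = np.random.permutation(n)
    prior_idx, bound_idx = idx[: n // 2], idx[n // 2:]

    w_prior = train(X[prior_idx], y[prior_idx])  # prior mean: first half only
    w_post = train(X, y)                         # posterior mean may use all data

    # For Q = N(w_post, sigma^2 I) and P = N(w_prior, sigma^2 I):
    # KL(Q||P) = ||w_post - w_prior||^2 / (2 sigma^2)
    kl = np.sum((w_post - w_prior) ** 2) / (2.0 * sigma ** 2)

    m = len(bound_idx)
    emp_loss = evaluate_gibbs_loss(w_post, sigma, X[bound_idx], y[bound_idx])
    # Pinsker-relaxed bound evaluated on the held-out half (m points, not n)
    return emp_loss + np.sqrt((kl + np.log(2.0 * np.sqrt(m) / delta)) / (2.0 * m))
```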

Principles, not prescriptions

  • Report the prior, the posterior, and the KL term; avoid priors chosen in hindsight unless you pay the corresponding correction.
  • Favor flat, robust solutions.

Common pitfalls

  • Choosing P after seeing the full training data invalidates the guarantee; use a split or pay a correction term.
  • Ignoring the mismatch between the randomized (Gibbs) predictor the bound covers and the deterministic predictor deployed at test time.

Connections and contrasts

  • See also: [/blog/pac-learning], [/blog/adversarial-examples].

Quick checks

  1. Why KL in the bound? — Measures complexity relative to prior.
  2. Why randomized predictors? — The bound controls the expected loss of a Gibbs classifier (a hypothesis drawn from Q); deterministic predictions need an extra derandomization step.
  3. Can we use data-dependent priors? — Yes, with proper splitting or penalties.

Further reading

  • McAllester's and Catoni's foundational PAC-Bayes papers; modern PAC-Bayes analyses of deep networks (see the source linked above).