Regret Bounds in Multi-Armed Bandits
Takeaway
Bandit algorithms balance exploration and exploitation to minimize cumulative regret; UCB and Thompson sampling achieve logarithmic regret under standard assumptions.
The problem (before → after)
- Before: Greedy strategies get stuck on early lucky arms or miss better options.
- After: Optimism (UCB) or posterior sampling (Thompson) systematically explores while converging to the best arm.
Mental model first
Think of restaurant discovery: try places with high ratings but occasionally sample new ones; over time you settle on the consistently best spot while minimizing bad meals.
Just-in-time concepts
- Regret: R_T = T μ* − ∑_t μ_{A_t}, where μ* is the best arm's mean and A_t is the arm pulled at step t (see the sketch after this list).
- UCB: Choose arm maximizing empirical mean + confidence radius.
- Thompson sampling: Sample parameters from posterior; pick arm with sampled best mean.
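To make the regret definition concrete, here is a minimal sketch (not from the post) that computes cumulative regret when the true arm means are known, as in a simulation; `true_means` and `chosen_arms` are hypothetical inputs.

# Minimal sketch: R_T = sum over pulls of (mu* - mu_{A_t}); only computable in
# simulation, where the true means are known. Inputs are hypothetical.
def cumulative_regret(true_means, chosen_arms):
    mu_star = max(true_means)
    return sum(mu_star - true_means[a] for a in chosen_arms)

# Example: with means [0.8, 0.9], pulling arms [0, 1, 0] incurs 0.1 + 0 + 0.1 = 0.2 regret.
print(cumulative_regret([0.8, 0.9], [0, 1, 0]))  # 0.2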
First-pass solution
Implement UCB1 with a confidence radius ∝ √(2 log t / n_a), where n_a is the number of pulls of arm a. For Bernoulli rewards, Thompson sampling keeps a Beta posterior per arm (sketched below).
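A minimal sketch of one Beta-Bernoulli Thompson sampling step, assuming the caller tracks per-arm success and failure counts (the names here are illustrative):

import random

def thompson_step(successes, failures):
    # Uniform Beta(1, 1) prior -> Beta(1 + s, 1 + f) posterior per arm;
    # sample a mean from each posterior and pull the arm with the largest sample.
    samples = [random.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])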
Iterative refinement
- Nonstationary bandits: Sliding-window UCB (see the sketch after this list).
- Contextual bandits: Condition on features; use linear/GLM models.
- Structured priors and constraints for real-world deployments.
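For the nonstationary case, a sliding-window UCB step might look like the sketch below; this is an illustrative variant, not a tuned implementation, and it assumes the caller keeps a `history` list of (arm, reward) pairs.

import math

def sliding_window_ucb_step(history, n_arms, window=1000):
    # Use only the last `window` pulls so stale rewards stop influencing the choice.
    recent = history[-window:]
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm, reward in recent:
        counts[arm] += 1
        sums[arm] += reward
    # Pull any arm unseen in the window; otherwise apply the UCB1 rule to windowed stats.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    t = len(recent)
    ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(n_arms)]
    return max(range(n_arms), key=lambda i: ucb[i])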
Code as a byproduct (UCB1 step)
import math
def ucb1_step(t, counts, means):
    # Pull each untried arm once (its confidence radius is effectively infinite).
    for i, c in enumerate(counts):
        if c == 0:
            return i
    # Otherwise maximize empirical mean plus the UCB1 radius sqrt(2 ln t / n_a).
    ucb = [m + math.sqrt(2 * math.log(t) / c) for m, c in zip(means, counts)]
    return max(range(len(ucb)), key=lambda i: ucb[i])
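A tiny usage sketch, assuming rewards in [0, 1] and a hypothetical `pull(arm)` environment call:

counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 1001):
    arm = ucb1_step(t, counts, means)
    reward = pull(arm)  # hypothetical environment call returning a reward in [0, 1]
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update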
Principles, not prescriptions
- Quantify uncertainty; optimism or sampling balances exploration.
- Track regret and confidence to guide exploration schedules.
Common pitfalls
- A fixed ε-greedy policy explores at a constant rate and incurs linear regret; decay ε over time (see the sketch after this list) or use a principled method.
- Ignoring delayed feedback and nonstationarity in production.
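For contrast with the fixed-ε pitfall, a decayed schedule (one common fix; the constant is problem-dependent) might look like this sketch:

import random

def epsilon_greedy_step(t, means, c=1.0):
    # Exploration rate shrinks roughly like 1/t, so the cost of exploring stays sublinear.
    eps = min(1.0, c / t)
    if random.random() < eps:
        return random.randrange(len(means))  # explore: uniform random arm
    return max(range(len(means)), key=lambda i: means[i])  # exploit: best empirical mean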
Connections and contrasts
- See also: [/blog/contextual-bandits], [/blog/kelly-criterion] (risk vs growth), [/blog/pac-learning].
Quick checks
- Why logarithmic regret? — Confidence radii shrink like √(log t / n_a), so each suboptimal arm is pulled only O(log T) times.
- When Thompson vs UCB? — Thompson is often simpler and competitive; UCB is deterministic.
- What about contexts? — Move to contextual bandits with linear models (see the LinUCB sketch below).
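One common linear-model choice is LinUCB; below is a minimal sketch of its per-arm scoring rule, assuming the caller maintains the ridge-regression statistics A_a (d×d matrix) and b_a (d-vector) for each arm.

import numpy as np

def linucb_step(contexts, A_list, b_list, alpha=1.0):
    # For each arm: theta_hat = A^{-1} b, score = theta_hat . x + alpha * sqrt(x^T A^{-1} x).
    scores = []
    for x, A, b in zip(contexts, A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b
        scores.append(float(theta @ x + alpha * np.sqrt(x @ A_inv @ x)))
    return int(np.argmax(scores))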
Further reading
- Lai & Robbins (1985); Auer, Cesa-Bianchi & Fischer (2002) on UCB1; Russo & Van Roy on information-directed sampling.