Takeaway

Bandit algorithms balance exploration and exploitation to minimize cumulative regret; UCB and Thompson sampling achieve logarithmic regret under standard assumptions.

The problem (before → after)

  • Before: Greedy strategies get stuck on early lucky arms or miss better options.
  • After: Optimism (UCB) or posterior sampling (Thompson) systematically explores while converging to the best arm.

Mental model first

Think of restaurant discovery: try places with high ratings but occasionally sample new ones; over time you settle on the consistently best spot while minimizing bad meals.

Just-in-time concepts

  • Regret: R_T = T μ* − ∑_{t=1}^{T} μ_{a_t}, where μ* is the best arm's mean and a_t is the arm pulled at round t.
  • UCB: Choose the arm maximizing empirical mean + confidence radius.
  • Thompson sampling: Sample parameters from the posterior; pick the arm with the best sampled mean.

First-pass solution

Implement UCB1 with confidence radius ∝ √(2 log t / n_a), where n_a counts pulls of arm a. For Bernoulli rewards, Thompson sampling keeps a Beta posterior per arm and samples from it each round; a sketch of that step follows.
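A minimal sketch of the Thompson step for Bernoulli rewards, assuming a uniform Beta(1, 1) prior and per-arm success/failure counts (the name thompson_step and its signature are illustrative, not from a library):

import random

def thompson_step(successes, failures):
    # Sample each arm's mean from its Beta(1 + s, 1 + f) posterior
    # and play the arm with the largest sampled mean.
    samples = [random.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

After observing the reward, increment the chosen arm's success or failure count; that one-line update is the entire learning step.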

Iterative refinement

  1. Nonstationary bandits: Sliding-window UCB (a sketch follows this list).
  2. Contextual bandits: Condition on features; use linear/GLM models.
  3. Structured priors and constraints for real-world deployments.
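A rough sketch of the sliding-window idea, assuming rewards in [0, 1], a precomputed list of the most recent (arm, reward) pairs, and an exploration constant c that is a tuning choice (the name swucb_step is illustrative):

import math

def swucb_step(recent, n_arms, c=2.0):
    # recent: (arm, reward) pairs from the last `window` rounds only.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm, reward in recent:
        counts[arm] += 1
        sums[arm] += reward
    for a in range(n_arms):
        if counts[a] == 0:
            return a  # an arm unseen within the window gets priority
    t = len(recent)
    scores = [sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(n_arms)]
    return max(range(n_arms), key=lambda a: scores[a])

Discarding (or discounting) old observations keeps the confidence radii wide enough to re-explore arms whose means may have drifted.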

Code as a byproduct (UCB1 step)

import math

def ucb1_step(t, counts, means):
    # Play each arm once before applying confidence bounds.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    # UCB1 index: empirical mean plus radius √(2 ln t / n_a).
    ucb = [m + math.sqrt(2 * math.log(t) / c) for m, c in zip(means, counts)]
    return max(range(len(ucb)), key=lambda i: ucb[i])
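For example, a short self-contained loop against illustrative Bernoulli arms (the arm means and horizon are made up for the demo):

import random

true_means = [0.3, 0.5, 0.7]                  # illustrative arm means
counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 1001):
    arm = ucb1_step(t, counts, means)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average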

Principles, not prescriptions

  • Quantify uncertainty; optimism or sampling balances exploration.
  • Track regret and confidence to guide exploration schedules.

Common pitfalls

  • Using a fixed ε in ε-greedy leads to linear regret; decay ε (a sketch follows this list) or use a principled method.
  • Ignoring delayed feedback and nonstationarity in production.
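If you do stay with ε-greedy, a schedule of roughly ε_t ∝ 1/t keeps total exploration logarithmic in the horizon; the constants below are tuning choices, not values from the references here:

def epsilon(t, c=5.0, n_arms=3):
    # Decay exploration as ~c·K/t so the expected number of exploratory
    # pulls up to round T grows like c·K·log T rather than linearly.
    return min(1.0, c * n_arms / max(t, 1))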

Connections and contrasts

  • See also: [/blog/contextual-bandits], [/blog/kelly-criterion] (risk vs growth), [/blog/pac-learning].

Quick checks

  1. Why logarithmic regret? — Confidence radii shrink like √(log t / n_a), so each suboptimal arm is pulled only O(log T) times.
  2. When Thompson vs UCB? — Thompson is often simpler and competitive; UCB is deterministic.
  3. What about contexts? — Move to contextual bandits with linear models.

Further reading

  • Lai & Robbins, Auer et al.; Russo & Van Roy on information-directed sampling