Takeaway

Bandit algorithms balance exploration and exploitation to minimize cumulative regret; UCB and Thompson sampling achieve logarithmic regret under standard assumptions.

The problem (before → after)

  • Before: Greedy strategies get stuck on early lucky arms or miss better options.
  • After: Optimism (UCB) or posterior sampling (Thompson) systematically explores while converging to the best arm.

Mental model first

Think of restaurant discovery: try places with high ratings but occasionally sample new ones; over time you settle on the consistently best spot while minimizing bad meals.

Just-in-time concepts

  • Regret: R_T = T μ* − ∑_{t=1}^{T} μ_{a_t}, where μ* is the best arm's mean and a_t is the arm pulled at round t.
  • UCB: Choose the arm maximizing empirical mean + confidence radius.
  • Thompson sampling: Sample parameters from the posterior; pick the arm with the best sampled mean.

First-pass solution

Implement UCB1 with confidence radius ∝ √(2 log t / n_a), where n_a counts pulls of arm a. For Bernoulli rewards, Thompson sampling keeps a Beta posterior per arm and samples from it each round; a sketch of that step follows.
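A minimal sketch of the Thompson step for Bernoulli rewards, assuming a uniform Beta(1, 1) prior and per-arm success/failure counts (the name thompson_step and its signature are illustrative, not from a library):

import random

def thompson_step(successes, failures):
    # Sample each arm's mean from its Beta(1 + s, 1 + f) posterior
    # and play the arm with the largest sampled mean.
    samples = [random.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

After observing the reward, increment the chosen arm's success or failure count; that one-line update is the entire learning step.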

Iterative refinement

  1. Nonstationary bandits: Sliding-window UCB (a sketch follows this list).
  2. Contextual bandits: Condition on features; use linear/GLM models.
  3. Structured priors and constraints for real-world deployments.
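A rough sketch of the sliding-window idea, assuming rewards in [0, 1], a precomputed list of the most recent (arm, reward) pairs, and an exploration constant c that is a tuning choice (the name swucb_step is illustrative):

import math

def swucb_step(recent, n_arms, c=2.0):
    # recent: (arm, reward) pairs from the last `window` rounds only.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm, reward in recent:
        counts[arm] += 1
        sums[arm] += reward
    for a in range(n_arms):
        if counts[a] == 0:
            return a  # an arm unseen within the window gets priority
    t = len(recent)
    scores = [sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(n_arms)]
    return max(range(n_arms), key=lambda a: scores[a])

Discarding (or discounting) old observations keeps the confidence radii wide enough to re-explore arms whose means may have drifted.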

Code as a byproduct (UCB1 step)

import math

def ucb1_step(t, counts, means):
    # Play each arm once before applying confidence bounds.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    # UCB1 index: empirical mean plus radius √(2 ln t / n_a).
    ucb = [m + math.sqrt(2 * math.log(t) / c) for m, c in zip(means, counts)]
    return max(range(len(ucb)), key=lambda i: ucb[i])
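For example, a short self-contained loop against illustrative Bernoulli arms (the arm means and horizon are made up for the demo):

import random

true_means = [0.3, 0.5, 0.7]                  # illustrative arm means
counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 1001):
    arm = ucb1_step(t, counts, means)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # running average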

Principles, not prescriptions

  • Quantify uncertainty; optimism or sampling balances exploration.
  • Track regret and confidence to guide exploration schedules.

Common pitfalls

  • Using a fixed ε in ε-greedy leads to linear regret; decay ε (a sketch follows this list) or use a principled method.
  • Ignoring delayed feedback and nonstationarity in production.
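If you do stay with ε-greedy, a schedule of roughly ε_t ∝ 1/t keeps total exploration logarithmic in the horizon; the constants below are tuning choices, not values from the references here:

def epsilon(t, c=5.0, n_arms=3):
    # Decay exploration as ~c·K/t so the expected number of exploratory
    # pulls up to round T grows like c·K·log T rather than linearly.
    return min(1.0, c * n_arms / max(t, 1))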

Connections and contrasts

  • See also: [/blog/contextual-bandits], [/blog/kelly-criterion] (risk vs growth), [/blog/pac-learning].

Quick checks

  1. Why logarithmic regret? — Confidence radii shrink like √(log t / n_a), so each suboptimal arm is pulled only O(log T) times.
  2. When Thompson vs UCB? — Thompson is often simpler and competitive; UCB is deterministic.
  3. What about contexts? — Move to contextual bandits with linear models.

Further reading

  • Lai & Robbins, Auer et al.; Russo & Van Roy on information-directed sampling