Contextual Bandits with Linear Payoffs
Takeaway
Contextual bandits choose actions based on features; algorithms like LinUCB and Thompson sampling with linear payoff models achieve regret that scales with the feature dimension rather than the number of actions.
The problem (before → after)
- Before: Plain multi-armed bandits ignore context, so every arm must be learned from scratch and learning is slow.
- After: Incorporating features lets information generalize across actions and users, reducing regret.
Mental model first
Like recommending movies: knowledge about a user’s preferences generalizes to similar users and movies via shared features.
Just-in-time concepts
- Linear payoff: r_t = x_{t,a}^T θ* + noise, where x_{t,a} is the feature vector of action a at round t and θ* is an unknown parameter vector.
- LinUCB: Upper confidence bounds built from the ridge-regression estimate and its confidence ellipsoid.
- Thompson sampling: Sample θ from posterior and act greedily.
First-pass solution
Maintain A = λI + ∑_s x_s x_s^T and b = ∑_s r_s x_s over past rounds s; estimate θ̂ = A^{-1} b; choose the action whose features x maximize x^T θ̂ + α √(x^T A^{-1} x).
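A minimal sketch of this selection step under the assumptions above: A and b are the ridge statistics, X stacks one feature row per candidate action, and alpha scales the exploration bonus. The function name linucb_select and these argument names are illustrative, not from the source.

    import numpy as np

    def linucb_select(A, b, X, alpha=1.0):
        # Ridge-regression estimate: theta_hat = A^{-1} b
        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        # Predicted reward plus confidence width for each candidate action
        means = X @ theta_hat
        widths = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
        return int(np.argmax(means + alpha * widths))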
Iterative refinement
- Nonlinear features via kernels or neural networks.
- Delayed feedback and propensity scoring for logged bandits (see the off-policy sketch after this list).
- Safety constraints and fairness-aware exploration.
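For the logged-bandit point above, here is a minimal inverse-propensity-scoring (IPS) sketch of off-policy value estimation, assuming the logs record the behavior policy's propensity for each logged action; the function and argument names are illustrative.

    import numpy as np

    def ips_value(rewards, target_probs, logged_propensities):
        # Reweight logged rewards by how much more (or less) likely the
        # target policy is to take each logged action than the behavior policy.
        w = np.asarray(target_probs) / np.asarray(logged_propensities)
        return float(np.mean(w * np.asarray(rewards)))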
Code as a byproduct (LinUCB update)
    import numpy as np

    def linucb_update(A, b, x, r):
        # Rank-one update of the LinUCB statistics after observing reward r
        # for the chosen action's 1-D feature vector x.
        A += np.outer(x, x)  # A accumulates the outer products x x^T
        b += r * x           # b accumulates the reward-weighted features
        return A, b
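The concepts above also mention Thompson sampling; here is a minimal sketch for the same linear model, assuming a Gaussian posterior N(θ̂, v² A^{-1}) with θ̂ = A^{-1} b. The name thompson_select, the action-feature matrix X, and the scale v are illustrative assumptions.

    import numpy as np

    def thompson_select(A, b, X, v=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # Posterior mean under a Bayesian reading of ridge regression
        theta_hat = np.linalg.solve(A, b)
        # Draw one parameter sample; v scales the exploration level
        theta_tilde = rng.multivariate_normal(theta_hat, v**2 * np.linalg.inv(A))
        # Act greedily with respect to the sampled parameter
        return int(np.argmax(X @ theta_tilde))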
Principles, not prescriptions
- Share information across actions via features to accelerate learning.
- Calibrate uncertainty for reliable exploration.
Common pitfalls
- Poorly scaled features distort the confidence widths √(x^T A^{-1} x), since the ridge penalty λI treats every dimension equally.
- Logged bandit evaluation requires off-policy corrections.
Connections and contrasts
- See also: [/blog/multi-armed-bandits], [/blog/pac-learning], [/blog/pac-bayes].
Quick checks
- Why √(x^T A^{-1} x)? — Up to the noise scale, it is the posterior standard deviation of the predicted reward x^T θ under a Gaussian reading of ridge regression, i.e., how uncertain the estimate is in the direction x.
- When to choose Thompson sampling over LinUCB? — When you want simpler tuning and good empirical performance.
- How to handle delayed rewards? — Buffer pending decisions and apply the update once each reward arrives, correcting for any bias this introduces (see the sketch below).
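A minimal sketch of that buffering idea, reusing linucb_update from above; the event-keyed buffer and function names are illustrative, and any bias correction is left out.

    pending = {}  # decision id -> stored feature vector awaiting its reward

    def record_decision(event_id, x):
        pending[event_id] = x

    def apply_delayed_reward(event_id, r, A, b):
        # Update only once the reward for a buffered decision arrives
        x = pending.pop(event_id, None)
        if x is not None:
            A, b = linucb_update(A, b, x, r)
        return A, b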
Further reading
- Li, Chu, Langford & Schapire (2010), "A Contextual-Bandit Approach to Personalized News Article Recommendation" (LinUCB).
- Agrawal & Goyal (2013), "Thompson Sampling for Contextual Bandits with Linear Payoffs".