Takeaway

Contextual bandits choose actions based on observed features; algorithms such as LinUCB and Thompson sampling with linear reward models achieve regret that scales with the feature dimension rather than the number of actions.

The problem (before → after)

  • Before: Plain bandits ignore context and learn slowly.
  • After: Incorporate features to generalize across actions and users, reducing regret.

Mental model first

Like recommending movies: knowledge about a user’s preferences generalizes to similar users and movies via shared features.

Just-in-time concepts

  • Linear payoff: r_t = x_{t,a}^T θ* + noise.
  • LinUCB: Upper confidence bounds from ridge-regression estimates (equivalently, the Gaussian posterior of Bayesian linear regression).
  • Thompson sampling: Sample θ from posterior and act greedily.
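
A minimal sketch of the Thompson-sampling variant, assuming a single parameter shared across actions, candidate features stacked as rows of X, and a posterior scale v treated as a tuning knob (function and argument names are illustrative):

import numpy as np

def lints_select(A, b, X, v=1.0, rng=None):
    """Linear Thompson sampling: sample theta from the ridge 'posterior'
    N(theta_hat, v^2 * A^{-1}) and act greedily on the sampled model.

    A : (d, d) matrix lambda*I + sum x x^T
    b : (d,)  vector sum r * x
    X : (k, d) candidate feature vectors, one row per action
    v : posterior scale, treated here as a tuning knob
    """
    rng = np.random.default_rng() if rng is None else rng
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    cov = 0.5 * v**2 * (A_inv + A_inv.T)   # symmetrize for numerical safety
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return int(np.argmax(X @ theta_tilde))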

First-pass solution

Maintain A = λI + ∑ x x^T and b = ∑ r x over observed (feature, reward) pairs; estimate θ̂ = A^{-1} b; pick the action whose features x maximize x^T θ̂ + α √(x^T A^{-1} x), as sketched below.
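
A minimal sketch of this selection rule under the same assumptions (shared A and b, candidate features as rows of X; names are illustrative):

import numpy as np

def linucb_select(A, b, X, alpha=1.0):
    """Return the index of the row of X maximizing x^T theta_hat + alpha * sqrt(x^T A^{-1} x)."""
    A_inv = np.linalg.inv(A)                 # A = lambda*I + sum x x^T
    theta_hat = A_inv @ b                    # ridge estimate
    means = X @ theta_hat                    # exploitation term
    widths = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))  # exploration bonus sqrt(x^T A^{-1} x)
    return int(np.argmax(means + alpha * widths))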

Iterative refinement

  1. Nonlinear features via kernels or neural networks.
  2. Delayed feedback and propensity scoring for logged bandits (see the off-policy sketch after this list).
  3. Safety constraints and fairness-aware exploration.
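
For item 2, a hedged sketch of the basic off-policy correction, inverse propensity scoring (IPS), assuming the logging policy's action probabilities were recorded (variable names are illustrative):

import numpy as np

def ips_value(logged_actions, logged_rewards, logged_propensities, policy_actions):
    """IPS estimate of a target policy's average reward from logged bandit data;
    unbiased when the logged propensities are the true logging probabilities."""
    logged_actions = np.asarray(logged_actions)
    rewards = np.asarray(logged_rewards, dtype=float)
    propensities = np.asarray(logged_propensities, dtype=float)
    agrees = (np.asarray(policy_actions) == logged_actions).astype(float)
    return float(np.mean(agrees * rewards / propensities))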

Code as a byproduct (LinUCB update)

import numpy as np

def linucb_update(A, b, x, r):
    """Rank-one update after observing reward r for the chosen features x (1-D array)."""
    A += np.outer(x, x)   # accumulate x x^T into A = lambda*I + sum x x^T
    b += r * x            # accumulate r * x into b
    return A, b
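
A hedged usage sketch that ties this update to the selection rule from the first-pass section; the simulated linear environment and parameter values below are illustrative, not from the text:

d, lam, alpha = 4, 1.0, 1.0
A, b = lam * np.eye(d), np.zeros(d)
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)                      # simulated "true" parameter

for t in range(1000):
    X = rng.normal(size=(10, d))                     # features of 10 candidate actions
    a = linucb_select(A, b, X, alpha)                # selection rule from the first-pass sketch
    r = float(X[a] @ theta_star) + 0.1 * rng.normal()  # noisy linear reward
    A, b = linucb_update(A, b, X[a], r)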

Principles, not prescriptions

  • Share information across actions via features to accelerate learning.
  • Calibrate uncertainty for reliable exploration.

Common pitfalls

  • Poor feature scaling inflates uncertainty estimates (see the standardization sketch after this list).
  • Logged bandit evaluation requires off-policy corrections.
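
For the first pitfall, one common remedy (a sketch, not a prescription) is to standardize features before forming A and b:

import numpy as np

def standardize(X, eps=1e-8):
    """Zero-mean, unit-variance columns so no single raw feature dominates
    the bonus sqrt(x^T A^{-1} x); in deployment, track running statistics instead."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + eps
    return (X - mu) / sigma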

Connections and contrasts

  • See also: [/blog/multi-armed-bandits], [/blog/pac-learning], [/blog/pac-bayes].

Quick checks

  1. Why √(x^T A^{-1} x)? — It is proportional to the posterior standard deviation of the predicted reward x^T θ under the ridge (Bayesian linear) model, so it measures uncertainty in the direction of x.
  2. When should you choose Thompson sampling over LinUCB? — When you want simpler tuning and strong empirical performance.
  3. How do you handle delayed rewards? — Buffer pending observations, apply updates once rewards arrive, and correct for the bias the delay introduces.

Further reading

  • Li, Chu, Langford & Schapire (2010), “A Contextual-Bandit Approach to Personalized News Article Recommendation” (LinUCB).
  • Agrawal & Goyal (2013), “Thompson Sampling for Contextual Bandits with Linear Payoffs”.