Contextual Bandits with Linear Payoffs
Takeaway
Contextual bandits choose actions based on features; algorithms like LinUCB and Thompson sampling with linear payoff models achieve regret that scales with the feature dimension rather than the number of actions.
The problem (before → after)
- Before: Plain multi-armed bandits ignore context, so every arm must be learned from scratch and learning is slow.
- After: Incorporating features lets information generalize across actions and users, reducing regret.
Mental model first
Like recommending movies: knowledge about a user’s preferences generalizes to similar users and movies via shared features.
Just-in-time concepts
- Linear payoff: r_t = x_{t,a}^T θ* + noise, where x_{t,a} is the feature vector of action a at round t and θ* is an unknown parameter vector.
- LinUCB: Upper confidence bounds built from the ridge-regression estimate and its confidence ellipsoid.
- Thompson sampling: Sample θ from posterior and act greedily.
First-pass solution
Maintain A = λI + ∑_s x_s x_s^T and b = ∑_s r_s x_s over past rounds s; estimate θ̂ = A^{-1} b; choose the action whose features x maximize x^T θ̂ + α √(x^T A^{-1} x).
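A minimal sketch of this selection step under the assumptions above: A and b are the ridge statistics, X stacks one feature row per candidate action, and alpha scales the exploration bonus. The function name linucb_select and these argument names are illustrative, not from the source.

    import numpy as np

    def linucb_select(A, b, X, alpha=1.0):
        # Ridge-regression estimate: theta_hat = A^{-1} b
        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        # Predicted reward plus confidence width for each candidate action
        means = X @ theta_hat
        widths = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
        return int(np.argmax(means + alpha * widths))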
Iterative refinement
- Nonlinear features via kernels or neural networks.
- Delayed feedback and propensity scoring for logged bandits (see the off-policy sketch after this list).
- Safety constraints and fairness-aware exploration.
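For the logged-bandit point above, here is a minimal inverse-propensity-scoring (IPS) sketch of off-policy value estimation, assuming the logs record the behavior policy's propensity for each logged action; the function and argument names are illustrative.

    import numpy as np

    def ips_value(rewards, target_probs, logged_propensities):
        # Reweight logged rewards by how much more (or less) likely the
        # target policy is to take each logged action than the behavior policy.
        w = np.asarray(target_probs) / np.asarray(logged_propensities)
        return float(np.mean(w * np.asarray(rewards)))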
Code as a byproduct (LinUCB update)
    import numpy as np

    def linucb_update(A, b, x, r):
        # Rank-one update of the LinUCB statistics after observing reward r
        # for the chosen action's 1-D feature vector x.
        A += np.outer(x, x)  # A accumulates the outer products x x^T
        b += r * x           # b accumulates the reward-weighted features
        return A, b
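The concepts above also mention Thompson sampling; here is a minimal sketch for the same linear model, assuming a Gaussian posterior N(θ̂, v² A^{-1}) with θ̂ = A^{-1} b. The name thompson_select, the action-feature matrix X, and the scale v are illustrative assumptions.

    import numpy as np

    def thompson_select(A, b, X, v=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # Posterior mean under a Bayesian reading of ridge regression
        theta_hat = np.linalg.solve(A, b)
        # Draw one parameter sample; v scales the exploration level
        theta_tilde = rng.multivariate_normal(theta_hat, v**2 * np.linalg.inv(A))
        # Act greedily with respect to the sampled parameter
        return int(np.argmax(X @ theta_tilde))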
Principles, not prescriptions
- Share information across actions via features to accelerate learning.
- Calibrate uncertainty for reliable exploration.
Common pitfalls
- Poorly scaled features distort the confidence widths √(x^T A^{-1} x), since the ridge penalty λI treats every dimension equally.
- Logged bandit evaluation requires off-policy corrections.
Connections and contrasts
- See also: [/blog/multi-armed-bandits], [/blog/pac-learning], [/blog/pac-bayes].
Quick checks
- Why √(x^T A^{-1} x)? — Up to the noise scale, it is the posterior standard deviation of the predicted reward x^T θ under a Gaussian reading of ridge regression, i.e., how uncertain the estimate is in the direction x.
- When to choose Thompson sampling over LinUCB? — When you want simpler tuning and good empirical performance.
- How to handle delayed rewards? — Buffer pending decisions and apply the update once each reward arrives, correcting for any bias this introduces (see the sketch below).
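A minimal sketch of that buffering idea, reusing linucb_update from above; the event-keyed buffer and function names are illustrative, and any bias correction is left out.

    pending = {}  # decision id -> stored feature vector awaiting its reward

    def record_decision(event_id, x):
        pending[event_id] = x

    def apply_delayed_reward(event_id, r, A, b):
        # Update only once the reward for a buffered decision arrives
        x = pending.pop(event_id, None)
        if x is not None:
            A, b = linucb_update(A, b, x, r)
        return A, b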
Further reading
- Li, Chu, Langford & Schapire (2010), "A Contextual-Bandit Approach to Personalized News Article Recommendation" (LinUCB).
- Agrawal & Goyal (2013), "Thompson Sampling for Contextual Bandits with Linear Payoffs".