Adversarial Examples and Robustness
Takeaway
Small, structured perturbations can reliably fool neural networks; robustness requires training objectives and architectures aligned with worst-case perturbations.
The problem (before → after)
- Before: High accuracy suggests models understand data.
- After: Gradient-aligned perturbations reveal brittle decision boundaries; defense requires robust optimization.
Mental model first
Imagine a smooth hillside with hidden cliffs covered by grass. From afar it looks safe, but a small step in the wrong direction drops you off an unseen edge. Adversaries find those edges.
Just-in-time concepts
- Threat model: Norm-bounded perturbations (ℓ_∞, ℓ_2) or semantic attacks.
- Robust training: min_θ E_{(x,y)}[ max_{||δ|| ≤ ε} ℓ(f_θ(x + δ), y) ].
- Attacks: FGSM (sketched just below), PGD (full code later in the post), Carlini–Wagner.
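As a concrete instance of the attacks above, here is a minimal FGSM sketch: a single signed-gradient step of size eps. The names `model`, `x`, and `y` are assumptions for illustration (a classifier, a batch of inputs in [0, 1], and integer labels):

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # One step in the direction of the sign of the input gradient.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()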
First-pass solution
Adversarial training with a PGD inner loop yields strong empirical robustness within the chosen threat model, though not a formal certificate; regularization and data augmentation help.
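A minimal sketch of the outer loop, assuming a classifier `model`, an optimizer, a `loader` of (x, y) batches in [0, 1], and the `pgd` attack sketched in the code section below; all names are illustrative:

import torch.nn.functional as F

def adversarial_train_epoch(model, loader, optimizer, eps, alpha, steps):
    # Outer minimization over θ, inner maximization over the perturbation δ.
    model.train()
    for x, y in loader:
        x_adv = pgd(model, x, y, eps, alpha, steps)  # inner max: worst-case input in the eps-ball
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)      # loss on the adversarial batch only
        loss.backward()
        optimizer.step()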
Iterative refinement
- Certified defenses: Randomized smoothing yields probabilistic ℓ_2 certificates (prediction side sketched after this list).
- Distribution shift: Robust features improve transfer but can hurt clean accuracy.
- Beyond norms: Spatial and physical-world robustness.
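For the randomized-smoothing item above, a minimal sketch of the prediction side only: classify by majority vote under Gaussian noise. The actual ℓ_2 certificate additionally needs a statistical lower bound on the top class's vote frequency, omitted here; `num_classes`, `sigma`, and `n` are illustrative parameters.

import torch
import torch.nn.functional as F

def smoothed_predict(model, x, num_classes, sigma=0.25, n=100):
    # Majority vote of the base classifier over n Gaussian-noised copies of each input.
    counts = torch.zeros(x.size(0), num_classes, device=x.device)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            preds = model(noisy).argmax(dim=1)
            counts += F.one_hot(preds, num_classes).float()
    return counts.argmax(dim=1)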
Code as a byproduct (PGD attack)
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    # L_inf PGD: repeat a signed-gradient step of size alpha, then project back
    # into the eps-ball around x and into the valid pixel range [0, 1].
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]  # gradient w.r.t. the input only
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
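A typical call, assuming an image classifier and inputs scaled to [0, 1]; the eps, alpha, and steps values below are common ℓ_∞ settings for small images, not prescriptions:

x_adv = pgd(model, x, y, eps=8/255, alpha=2/255, steps=10)
robust_acc = (model(x_adv).argmax(dim=1) == y).float().mean()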
Principles, not prescriptions
- Align loss with threat model; optimize for worst-case within constraints.
- Evaluate with strong, adaptive attacks; avoid gradient masking.
Common pitfalls
- Overfitting to weak or fixed attacks; robustness under one threat model (e.g., ℓ_∞) rarely transfers to others.
- Ignoring distribution shift and semantics beyond pixel norms.
Connections and contrasts
- See also: [/blog/attention-is-all-you-need], [/blog/shapley-explanations] (attribution vs robustness), [/blog/gans] (minimax parallels).
Quick checks
- Why does adversarial training work? — It minimizes the worst-case loss within ε-balls around the training points.
- Why are there robustness trade-offs? — Robust features can differ from the features that maximize clean accuracy.
- What breaks many defenses? — Gradient masking; adaptive attacks defeat them.
Further reading
- Goodfellow et al., 2015 (FGSM); Madry et al., 2017 (PGD adversarial training)
- Certified defenses, e.g., randomized smoothing (Cohen et al., 2019)