Takeaway

Small, structured perturbations can reliably fool neural networks; robustness requires training objectives and architectures aligned with worst-case perturbations.

The problem (before → after)

  • Before: High accuracy suggests models understand data.
  • After: Gradient-aligned perturbations reveal brittle decision boundaries; defense requires robust optimization.

Mental model first

Imagine a smooth hillside with hidden cliffs covered by grass. From afar it looks safe, but a small step in the wrong direction drops you off an unseen edge. Adversaries find those edges.

Just-in-time concepts

  • Threat model: Norm-bounded perturbations (ℓ_∞, ℓ_2) or semantic attacks.
  • Robust training: min_θ E[max_{||δ||≤ε} ℓ(f_θ(x+δ), y)].
  • Attacks: FGSM (one signed-gradient step), PGD (iterated FGSM with projection), Carlini–Wagner (optimization-based); a minimal FGSM sketch follows this list.
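
For concreteness, FGSM is the single-step special case of the inner maximization above: one signed-gradient step of size ε, clamped back to the valid input range. A minimal sketch, assuming a classifier that returns logits; the fgsm name and the [0, 1] pixel clamp are illustrative choices, not from the original paper:

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # One signed-gradient step that increases the loss, then clamp to valid pixels.
    x_req = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_req), y)
    grad = torch.autograd.grad(loss, x_req)[0]   # gradient w.r.t. the input only
    return (x + eps * grad.sign()).clamp(0, 1).detach()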

First-pass solution

Adversarial training with PGD inner loops yields strong empirical robustness within the chosen threat model (it does not by itself provide certificates; see the refinements below); regularization and data augmentation help. A minimal training-loop sketch follows.
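
A minimal sketch of the outer training loop, assuming the pgd attack defined in the code section below and hypothetical model, loader, and optimizer objects; the default eps/alpha/steps values are typical ℓ_∞ settings, not prescriptions:

import torch.nn.functional as F

def adversarial_train_epoch(model, loader, optimizer,
                            eps=8/255, alpha=2/255, steps=10):
    # One epoch of PGD adversarial training: the inner loop (pgd) approximates
    # the worst-case perturbation, the outer step minimizes the loss on it.
    # Assumes model and batches live on the same device; some implementations
    # also call model.eval() while generating x_adv to freeze BatchNorm stats.
    model.train()
    for x, y in loader:
        x_adv = pgd(model, x, y, eps, alpha, steps)   # inner maximization
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)       # outer minimization
        loss.backward()
        optimizer.step()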

Iterative refinement

  1. Certified defenses: Randomized smoothing yields probabilistic ℓ_2 certificates (a prediction-side sketch follows this list).
  2. Distribution shift: Robust features improve transfer but can hurt clean accuracy.
  3. Beyond norms: Spatial and physical-world robustness.
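
For item 1, a sketch of the prediction side of randomized smoothing: classify by majority vote over Gaussian-noised copies of the input. The certification step that converts vote counts into an ℓ_2 radius is omitted, and the names here (smoothed_predict, n_samples) are illustrative:

import torch
import torch.nn.functional as F

def smoothed_predict(model, x, sigma, n_samples=100):
    # Majority vote over Gaussian-noised copies of x: the smoothed classifier
    # g(x) = argmax_c P[f(x + noise) = c] admits a probabilistic l_2 certificate
    # whose radius grows with the vote margin (certification step omitted here).
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(x.shape[0], num_classes, device=x.device)
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            preds = model(noisy).argmax(dim=1)             # hard label per noisy sample
            counts += F.one_hot(preds, num_classes).float()
        return counts.argmax(dim=1)                        # majority-vote class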

Code as a byproduct (PGD attack)

import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    # Projected gradient descent under an l_inf ball of radius eps:
    # repeat (signed-gradient ascent step, project into the ball, clamp to valid pixels).
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]  # gradient w.r.t. the input only
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto eps-ball
            x_adv = x_adv.clamp(0, 1)                              # valid pixel range
    return x_adv.detach()
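
Two details in this sketch matter in practice: the input gradient is taken with torch.autograd.grad, so parameter gradients are never accumulated while generating the attack, and each step projects back into the ε-ball before clamping to the valid pixel range. A common (but not universal) choice is alpha ≈ eps/4 with roughly 7–20 steps; more steps give a stronger, slower attack.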

Principles, not prescriptions

  • Align loss with threat model; optimize for worst-case within constraints.
  • Evaluate with strong, adaptive attacks; watch for gradient masking (a quick sanity check is sketched after this list).
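
A rough sketch of such a sanity check, assuming the fgsm and pgd helpers above and an arbitrary threshold: iterative attacks should be at least as strong as single-step ones, and robust accuracy should collapse as ε grows; if neither holds, gradients are likely being masked rather than the model being robust.

import torch

def masking_check(model, x, y, eps, alpha, steps):
    # Red flags for gradient masking: a single-step attack beating a multi-step
    # one, or robust accuracy failing to collapse at very large eps.
    def acc(x_in):
        with torch.no_grad():
            return (model(x_in).argmax(dim=1) == y).float().mean().item()
    acc_fgsm = acc(fgsm(model, x, y, eps))
    acc_pgd = acc(pgd(model, x, y, eps, alpha, steps))
    acc_large = acc(pgd(model, x, y, 10 * eps, 10 * alpha, steps))
    if acc_fgsm < acc_pgd or acc_large > 0.05:   # 0.05 is an arbitrary cutoff
        print("warning: possible gradient masking")
    return acc_fgsm, acc_pgd, acc_large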

Common pitfalls

  • Overfitting to weak attacks; robustness does not transfer across threat models.
  • Ignoring distribution shift and semantics beyond pixel norms.

Connections and contrasts

  • See also: [/blog/attention-is-all-you-need], [/blog/shapley-explanations] (attribution vs robustness), [/blog/gans] (minimax parallels).

Quick checks

  1. Why does adversarial training work? — It minimizes an approximation of the worst-case loss within ε-balls.
  2. Why do robustness trade-offs arise? — The features that survive worst-case perturbations can differ from the features that maximize clean accuracy.
  3. What breaks many defenses? — Gradient masking; adaptive attacks defeat them.

Further reading

  • Goodfellow et al., 2015 (Explaining and Harnessing Adversarial Examples); Madry et al., 2017 (Towards Deep Learning Models Resistant to Adversarial Attacks)
  • Certified defenses literature, e.g., Cohen et al., 2019 (randomized smoothing)