Perceptual Losses for Image Synthesis
Takeaway
Perceptual losses compare deep feature activations rather than pixels, aligning optimization with human perception for tasks like style transfer and super-resolution.
The problem (before → after)
- Before: L2 pixel losses harshly penalize harmless spatial shifts and push models toward blurry outputs (see the sketch after this list).
- After: Feature-based losses reward texture and structural similarity instead.
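To make the "before" concrete, here is a minimal sketch (assuming NumPy, with a random array standing in for high-frequency texture): a one-pixel shift is perceptually invisible, yet under mean-squared error it can score no better than a flat blur.

```python
# Minimal sketch (assumes NumPy): a one-pixel shift of a textured image
# is perceptually identical, yet MSE treats it as a large error.
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))               # stand-in for a textured image
shifted = np.roll(img, shift=1, axis=1)  # shift right by one pixel

mse_shift = np.mean((img - shifted) ** 2)
mse_blur = np.mean((img - img.mean()) ** 2)  # a flat "blurred" guess

print(f"MSE vs. 1-px shift: {mse_shift:.4f}")
print(f"MSE vs. flat mean:  {mse_blur:.4f}")
# For high-frequency content the shifted copy can score no better than a blur,
# which is why L2-trained models hedge toward blurry outputs.
```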
Mental model first
Humans compare images by parts and patterns, not pixel by pixel; using features from a trained network approximates this perceptual comparison.
Just-in-time concepts
- Content vs style losses, both computed from intermediate CNN layers.
- Gram matrices capture texture statistics (see the sketch after this list).
- Feature reconstruction losses align high-level content and semantics.
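A minimal sketch (assuming PyTorch) of the Gram-matrix computation behind a style loss; the normalization by layer size is one common convention, not the only one.

```python
# Minimal sketch (assumes PyTorch): the Gram matrix of a feature map records
# which channels co-activate, i.e. second-order texture statistics that are
# insensitive to where in the image the texture occurs.
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, channels, height, width) activations from one CNN layer."""
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)        # each channel flattened to a vector
    gram = flat @ flat.transpose(1, 2)       # (b, c, c) channel co-activations
    return gram / (c * h * w)                # normalize by layer size

# A style loss then compares Gram matrices of generated and style images, e.g.
# style_loss = F.mse_loss(gram_matrix(feat_generated), gram_matrix(feat_style))
```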
First-pass solution
Extract features with a fixed CNN (e.g., VGG); define content and style losses on those features; then either optimize the image directly or train a feed-forward network to minimize them.
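A minimal sketch of this first pass, assuming PyTorch and torchvision (≥ 0.13 for the weights enum); the choice of relu3_3 as the content layer is an illustrative assumption, not the only valid one.

```python
# Minimal sketch: a frozen VGG-16 provides the feature space; the content
# (feature-reconstruction) loss compares activations at a mid-level layer.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)          # fixed feature extractor, never trained

CONTENT_LAYER = 15                   # relu3_3 in torchvision's indexing (illustrative choice)

def features(x: torch.Tensor, upto: int) -> torch.Tensor:
    """Run x through VGG layers up to and including index `upto`."""
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == upto:
            break
    return x

def content_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Feature-reconstruction loss on ImageNet-normalized RGB batches."""
    return F.mse_loss(features(generated, CONTENT_LAYER),
                      features(target, CONTENT_LAYER))

# Either optimize the generated image directly with this loss (plus a style
# term built from Gram matrices), or use it to train a feed-forward network.
```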
Iterative refinement
- Multi-layer and multi-scale feature comparisons.
- Hybrid losses that combine adversarial and perceptual terms.
- Learned perceptual metrics (LPIPS) that correlate more closely with human judgments than pixel metrics (usage sketch after this list).
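A minimal usage sketch assuming the `lpips` Python package; the random tensors stand in for real images purely for illustration.

```python
# Minimal sketch (assumes the `lpips` pip package): a learned perceptual metric
# scores the distance between two images in a calibrated deep-feature space.
import torch
import lpips

metric = lpips.LPIPS(net='alex')          # AlexNet backbone; 'vgg' is also available
img0 = torch.rand(1, 3, 64, 64) * 2 - 1   # LPIPS expects RGB in [-1, 1]
img1 = torch.rand(1, 3, 64, 64) * 2 - 1

distance = metric(img0, img1)             # lower = more perceptually similar
print(distance.item())
```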
Principles, not prescriptions
- Choose feature layers that reflect the percepts you care about; balance content and style weights for the task.
Common pitfalls
- Overfitting to one network’s biases.
- Matching texture statistics without preserving global structure.
Connections and contrasts
- See also: [/blog/neural-style-transfer], [/blog/rendering-equation].
Quick checks
- Why Gram matrices? — Capture second-order feature correlations.
- Why fixed CNN? — A frozen network provides a stable feature space; letting it train would let the metric drift.
- Why blur with L2? — It penalizes small misalignments harshly, so the minimizer averages plausible outputs into a blur.
Further reading
- Johnson et al., Perceptual Losses for Real-Time Style Transfer and Super-Resolution (2016); Dosovitskiy & Brox, Generating Images with Perceptual Similarity Metrics Based on Deep Networks (2016); Zhang et al., The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS, 2018).