Takeaway

Entropy quantifies uncertainty; channel capacity and coding theorems show how reliably we can communicate over noisy channels.

The problem (before → after)

  • Before: No principled way to measure information or to bound what compression and communication can achieve.
  • After: Entropy H, mutual information I, and capacity C give tight limits on compression and transmission, together with strategies that achieve them.

Mental model first

Information is surprise: messages that are harder to predict carry more bits. Compression removes predictability; error-correcting codes add redundancy to fight noise.
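
A minimal sketch (illustrative only, not from the original text): self-information −log₂ p turns "surprise" into bits, so rarer events carry more information.

    import math

    def surprise_bits(p):
        # Self-information of an event with probability p, measured in bits.
        return -math.log2(p)

    print(surprise_bits(0.5))    # fair coin flip: 1.0 bit
    print(surprise_bits(0.01))   # rare event: ~6.64 bits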

Just-in-time concepts

  • Entropy H(X) = −∑_x p(x) log₂ p(x); mutual information I(X;Y) = H(X) − H(X|Y); KL divergence D(p‖q) = ∑_x p(x) log₂(p(x)/q(x)) (see the sketch after this list).
  • Source coding: a prefix-free code's optimal average length per symbol lies between H and H + 1, so ≈ H.
  • Channel coding: any rate below capacity C admits codes whose error probability vanishes with block length; above C, reliable communication is impossible.
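
A minimal sketch of these three quantities (illustrative, assuming a small discrete joint distribution given as a table):

    import numpy as np

    def entropy(p):
        # H(X) = -sum p log2 p, ignoring zero-probability outcomes.
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def kl_divergence(p, q):
        # D(p || q) = sum p log2(p/q); assumes q > 0 wherever p > 0.
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    def mutual_information(joint):
        # I(X;Y) = D( p(x,y) || p(x)p(y) ), computed from a joint table.
        joint = np.asarray(joint, float)
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        return kl_divergence(joint.ravel(), (px * py).ravel())

    # Example joint distribution of a noisy binary input/output pair.
    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
    print(entropy(joint.sum(axis=1)))   # H(X) = 1.0 bit
    print(mutual_information(joint))    # I(X;Y) ≈ 0.278 bits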

First-pass solution

Design prefix-free codes (e.g., Huffman) whose average length approaches H; use block codes with suitable decoding to approach capacity; measure performance with mutual information and error rates.
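
For the source-coding half, a minimal Huffman sketch (hypothetical example distribution; only codeword lengths are tracked, which is enough to compare against H):

    import heapq
    import math

    def huffman_lengths(probs):
        # Build a Huffman tree and return the codeword length per symbol.
        # Heap items: (probability, tie-breaker, symbols in that subtree).
        heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        lengths = {s: 0 for s in probs}
        tie = len(heap)
        while len(heap) > 1:
            p1, _, syms1 = heapq.heappop(heap)
            p2, _, syms2 = heapq.heappop(heap)
            for s in syms1 + syms2:
                lengths[s] += 1      # each merge adds one bit to these symbols
            heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
            tie += 1
        return lengths

    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    lengths = huffman_lengths(probs)
    avg_len = sum(probs[s] * lengths[s] for s in probs)
    H = -sum(p * math.log2(p) for p in probs.values())
    print(avg_len, H)   # 1.75 1.75

With dyadic probabilities the average length equals H exactly; in general Huffman's average length stays within one bit of H.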

Iterative refinement

  1. Modern codes: LDPC, Turbo, and Polar codes approach capacity efficiently (the repetition-code sketch below shows the crude trade-off they improve on).
  2. Information theory in ML: regularization, representation learning, privacy.
  3. Rate–distortion theory trades fidelity for bitrate: R(D) is the minimum rate achievable at distortion D.
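
A toy simulation (illustrative only; parameters are arbitrary) of rate-1/n repetition codes over a binary symmetric channel shows the basic redundancy-for-reliability trade that capacity-approaching codes make far more efficiently:

    import numpy as np

    rng = np.random.default_rng(0)

    def bsc(bits, p):
        # Binary symmetric channel: each bit flips independently with probability p.
        flips = rng.random(bits.shape) < p
        return bits ^ flips

    def repetition_decode(received, n):
        # Majority vote over each block of n repeated bits.
        blocks = received.reshape(-1, n)
        return (blocks.sum(axis=1) > n // 2).astype(int)

    p = 0.1                       # crossover probability
    msg = rng.integers(0, 2, 100_000)
    for n in (1, 3, 5, 7):        # rate 1/n repetition codes
        coded = np.repeat(msg, n)
        decoded = repetition_decode(bsc(coded, p), n)
        print(n, (decoded != msg).mean())
    # Error rate drops (≈0.1, 0.028, 0.0086, 0.0027) but the rate falls to 1/n;
    # capacity-approaching codes get low error at rates near C = 1 - H2(p) ≈ 0.53.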

Principles, not prescriptions

  • Bits measure uncertainty, not meaning.
  • Trade redundancy against rate to meet reliability goals.

Common pitfalls

  • Confusing mutual information (a measure of statistical dependence) with causation.
  • Applying capacity results to channels that don't match the assumed model.

Connections and contrasts

  • See also: [/blog/differential-privacy], [/blog/kelly-criterion], [/blog/black-box-vi].

Quick checks

  1. Why the minus sign in H? — Since log p(x) ≤ 0 for probabilities, the minus sign makes entropy non-negative, so more uncertain distributions score higher.
  2. What is capacity? — The maximum mutual information over input distributions: C = max_{p(x)} I(X;Y) (verified numerically in the sketch below).
  3. How does the compression limit relate to H? — No lossless code's average length per symbol can beat H.
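
To make the capacity definition concrete, a short sketch (assuming a binary symmetric channel with crossover 0.1, an illustrative choice) maximizes I(X;Y) over input distributions and recovers the closed form 1 − H₂(p):

    import numpy as np

    def h2(x):
        # Binary entropy in bits.
        return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

    def mi_bsc(q, p):
        # I(X;Y) for a binary symmetric channel with P(X=1)=q, crossover p.
        py1 = q * (1 - p) + (1 - q) * p   # P(Y=1)
        return h2(py1) - h2(p)            # H(Y) - H(Y|X)

    p = 0.1
    qs = np.linspace(0.01, 0.99, 999)
    C = max(mi_bsc(q, p) for q in qs)
    print(C, 1 - h2(p))   # both ≈ 0.531; the maximum is at the uniform input q = 0.5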

Further reading

  • Shannon, "A Mathematical Theory of Communication" (1948); Cover & Thomas, Elements of Information Theory.