Information Theory & Entropy

Information theory, pioneered by Claude Shannon in his seminal 1948 paper “A Mathematical Theory of Communication,” provides the mathematical framework for quantifying information, compression, and transmission. This lesson explores the rigorous foundations of entropy and the quantities derived from it.

1. Shannon Entropy

The core measure of information is Shannon Entropy. For a discrete random variable $X$ with alphabet $\mathcal{X}$ and probability mass function $p(x)$, the entropy is defined as:

$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x)$

Usually, $b = 2$ (bits) or $b = e$ (nats). By convention, $0 \log 0 = 0$.
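
For example, a fair coin ($p = 0.5$) has $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit, while a heavily biased coin with $p = 0.9$ has $H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469$ bits: the more predictable the source, the less information each outcome carries.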

Axiomatic Foundations

Shannon showed that $H$ is the unique function (up to a constant factor) satisfying the following axioms:

  1. Continuity: Small changes in the probabilities $p_i$ result in small changes in entropy.
  2. Monotonicity: If all $n$ outcomes are equally likely ($p_i = 1/n$), $H$ should be a monotonically increasing function of $n$.
  3. Recursion/Grouping: The entropy of a choice can be decomposed into weighted sums of sub-choices. If an outcome is split into two, the new entropy is the original plus the weighted entropy of the split.

2. Joint and Conditional Entropy

To handle multiple random variables, we extend the definition to joint and conditional contexts.

Joint Entropy measures the total uncertainty in a pair of variables $(X, Y)$:

$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(x, y)$

Conditional Entropy measures the remaining uncertainty in $Y$ given that $X$ is known:

$H(Y|X) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y|x)$

The Chain Rule for Entropy:

$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$

This implies that $H(Y|X) \le H(Y)$ (conditioning reduces entropy, or at least does not increase it), with equality if and only if $X$ and $Y$ are independent.
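
The chain rule is easy to check numerically. The following sketch (the joint probabilities are illustrative, not taken from the text) computes $H(X)$, $H(X, Y)$, and $H(Y|X)$ from a small joint distribution and confirms that the two routes to $H(Y|X)$ agree:

import numpy as np

def H(p):
    """Shannon entropy (bits) of an array of probabilities; zero entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative 2x3 joint distribution p(x, y): rows index X, columns index Y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)            # marginal distribution of X
H_xy = H(p_xy)                    # joint entropy H(X, Y)
H_x = H(p_x)                      # marginal entropy H(X)
H_y_given_x = H_xy - H_x          # conditional entropy via the chain rule

# Direct definition: H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x_direct = -np.sum(p_xy * np.log2(p_y_given_x))

print(f"H(X,Y)  = {H_xy:.4f} bits")
print(f"H(X)    = {H_x:.4f} bits")
print(f"H(Y|X)  = {H_y_given_x:.4f} bits (chain rule)")
print(f"H(Y|X)  = {H_y_given_x_direct:.4f} bits (direct definition)")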

3. Mutual Information

Mutual information quantifies the amount of information obtained about one random variable through another. It is the reduction in uncertainty of $X$ due to the knowledge of $Y$:

$I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$

In terms of probability distributions:

$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)}$

It is symmetric ($I(X; Y) = I(Y; X)$) and non-negative ($I(X; Y) \ge 0$).
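
The same identities can be verified in code. A minimal sketch, again with an illustrative joint distribution, computes $I(X; Y)$ both as $H(X) + H(Y) - H(X, Y)$ and as the KL divergence between the joint distribution and the product of its marginals:

import numpy as np

def H(p):
    """Shannon entropy (bits) of an array of probabilities; zero entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution p(x, y)
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

# I(X;Y) via the entropy identity
I_entropy = H(p_x) + H(p_y) - H(p_xy)

# I(X;Y) as D_KL( p(x,y) || p(x)p(y) )
indep = np.outer(p_x, p_y)
I_kl = np.sum(p_xy * np.log2(p_xy / indep))

print(f"I(X;Y) = {I_entropy:.4f} bits (entropy identity)")
print(f"I(X;Y) = {I_kl:.4f} bits (KL form)")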

4. Relative Entropy (Kullback-Leibler Divergence)

The Kullback-Leibler (KL) Divergence measures the “distance” between two probability distributions $P$ and $Q$ over the same alphabet:

$D_{KL}(P || Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$

Key Properties

  • Non-negativity: $D_{KL}(P || Q) \ge 0$ (Gibbs’ Inequality), with equality iff $P = Q$.
  • Non-symmetry: In general, $D_{KL}(P || Q) \ne D_{KL}(Q || P)$. Thus, it is not a metric (it also fails the triangle inequality); a numeric example follows this list.
  • Interpretation: $D_{KL}(P || Q)$ represents the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$.
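
As a concrete illustration of the non-symmetry, take the Bernoulli distributions $P = (0.5, 0.5)$ and $Q = (0.25, 0.75)$. Then $D_{KL}(P || Q) = 0.5\log_2\frac{0.5}{0.25} + 0.5\log_2\frac{0.5}{0.75} \approx 0.208$ bits, while $D_{KL}(Q || P) = 0.25\log_2\frac{0.25}{0.5} + 0.75\log_2\frac{0.75}{0.5} \approx 0.189$ bits.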

5. The Source Coding Theorem

Shannon’s first fundamental theorem establishes the absolute limit of data compression. It states that for a source of $n$ i.i.d. random variables $X_1, X_2, \ldots, X_n \sim p(x)$:

  1. As $n \to \infty$, the data can be compressed into $nH(X)$ bits with negligible risk of information loss.
  2. It is impossible to compress the data into fewer than $H(X)$ bits per symbol without losing information.

This defines entropy as the fundamental limit of lossless compression.
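
As a rough numerical illustration (the Bernoulli(0.1) source below is an assumption made for concreteness, not part of the theorem), the entropy bound shows how far below one bit per symbol a binary source can be compressed:

import numpy as np

def binary_entropy(p):
    """Entropy H(X) in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1          # illustrative source: a '1' appears 10% of the time
n = 1_000_000    # number of i.i.d. symbols

H = binary_entropy(p)
print(f"H(X) = {H:.4f} bits/symbol")
print(f"Raw storage:   {n} bits")
print(f"Shannon limit: {n * H:.0f} bits (~{100 * H:.1f}% of the raw size)")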

6. Channel Capacity and Noisy Coding

While source coding deals with compression, Channel Coding deals with reliability over noisy media.

Channel Capacity

The capacity $C$ of a discrete memoryless channel is the maximum mutual information between input $X$ and output $Y$ over all possible input distributions:

$C = \max_{p(x)} I(X; Y)$
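
For the standard example of the binary symmetric channel with crossover probability $p$, the maximizing input distribution is uniform and the capacity reduces to the closed form $C = 1 - H_b(p)$, where $H_b$ is the binary entropy function. A minimal sketch (the crossover probabilities are illustrative):

import numpy as np

def binary_entropy(p):
    """Binary entropy function H_b(p) in bits, elementwise."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    out[mask] = -p[mask] * np.log2(p[mask]) - (1 - p[mask]) * np.log2(1 - p[mask])
    return out

crossover = np.array([0.0, 0.01, 0.1, 0.5])
capacity = 1.0 - binary_entropy(crossover)   # capacity of the binary symmetric channel

for p, c in zip(crossover, capacity):
    print(f"BSC(p={p:.2f}): C = {c:.4f} bits/use")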

Noisy Channel Coding Theorem

Shannon proved that for any rate $R < C$, there exist error-correcting codes such that the probability of error at the receiver can be made arbitrarily small as the block length $n \to \infty$. Conversely, if $R > C$, the error probability is bounded away from zero.

Shannon-Hartley Theorem

For a continuous channel with bandwidth $B$ (Hz), signal power $S$, and additive white Gaussian noise power $N$:

$C = B \log_2\left(1 + \frac{S}{N}\right) \text{ bits/s}$
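
A quick numerical check (the bandwidth and SNR values are illustrative, roughly matching a classic voice-band telephone channel):

import numpy as np

def shannon_hartley(bandwidth_hz, snr_linear):
    """AWGN channel capacity in bits/s: C = B * log2(1 + S/N)."""
    return bandwidth_hz * np.log2(1.0 + snr_linear)

B = 3_000                        # bandwidth in Hz
snr_db = 30                      # signal-to-noise ratio in dB
snr_linear = 10 ** (snr_db / 10)

C = shannon_hartley(B, snr_linear)
print(f"C = {C / 1000:.1f} kbit/s for B = {B} Hz, SNR = {snr_db} dB")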

7. Differential Entropy

For a continuous random variable $X$ with PDF $f(x)$, Differential Entropy is:

$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$

Warning: Unlike discrete entropy, $h(X)$ can be negative and is not invariant under change of variables. For example, if $X \sim \mathrm{Uniform}(0, a)$, then $h(X) = \log a$, which is negative whenever $a < 1$. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$.
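
Both closed forms can be checked numerically. The sketch below uses scipy.stats, which reports differential entropy in nats, so the results are converted to bits; the parameter values are illustrative:

import numpy as np
from scipy.stats import norm, uniform

a = 0.5        # illustrative uniform width (a < 1, so the entropy is negative)
sigma = 2.0    # illustrative Gaussian standard deviation

# scipy returns differential entropy in nats; divide by ln 2 to get bits
h_uniform = float(uniform(loc=0, scale=a).entropy()) / np.log(2)
h_normal = float(norm(loc=0, scale=sigma).entropy()) / np.log(2)

print(f"h(Uniform(0, {a})) = {h_uniform:.4f} bits; formula log2({a}) = {np.log2(a):.4f}")
print(f"h(N(0, {sigma}^2))   = {h_normal:.4f} bits; "
      f"formula 0.5*log2(2*pi*e*sigma^2) = {0.5 * np.log2(2 * np.pi * np.e * sigma**2):.4f}")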

8. Maximum Entropy Principle

The Principle of Maximum Entropy (MaxEnt) states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy, subject to known constraints.

If we only know the mean and variance of a distribution, the MaxEnt distribution is the Normal (Gaussian) Distribution. If we only know the mean of a positive-valued variable, it is the Exponential Distribution.

In statistical mechanics, the Boltzmann distribution is found by maximizing entropy subject to a fixed average energy.
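
One way to see the MaxEnt claim for a known mean and variance is to compare the differential entropies of a few zero-mean, unit-variance distributions; the particular comparison distributions below are an illustrative choice:

import numpy as np
from scipy.stats import norm, laplace, uniform

# All three distributions have mean 0 and variance 1
candidates = {
    "Normal(0, 1)":      norm(loc=0, scale=1),
    "Laplace (var 1)":   laplace(loc=0, scale=1 / np.sqrt(2)),            # variance = 2*b^2
    "Uniform (var 1)":   uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # variance = width^2 / 12
}

for name, dist in candidates.items():
    h_bits = float(dist.entropy()) / np.log(2)   # scipy gives nats; convert to bits
    print(f"{name:18s} h = {h_bits:.4f} bits")

# The Gaussian attains the largest differential entropy, as the MaxEnt principle predicts.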

9. Python Implementation: Entropy and KL Divergence

import numpy as np
from scipy.stats import entropy

def calculate_shannon_entropy(p):
    \"\"\"Calculates Shannon Entropy of a discrete distribution p.\"\"\"
    p = np.array(p)
    # Filter out zero probabilities to avoid log(0)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def calculate_kl_divergence(p, q):
    \"\"\"Calculates D_KL(P || Q) for two discrete distributions.\"\"\"
    p = np.array(p)
    q = np.array(q)
    # Ensure they sum to 1
    p = p / np.sum(p)
    q = q / np.sum(q)
    
    # Using scipy for comparison
    kl_scipy = entropy(p, q, base=2)
    
    # Manual calculation
    # Only sum where p[i] > 0
    mask = p > 0
    kl_manual = np.sum(p[mask] * np.log2(p[mask] / q[mask]))
    
    return kl_manual, kl_scipy

# Example usage
p_dist = [0.2, 0.5, 0.3]
q_dist = [0.1, 0.6, 0.3]

h_p = calculate_shannon_entropy(p_dist)
kl_val, kl_ref = calculate_kl_divergence(p_dist, q_dist)

print(f"Entropy H(P): {h_p:.4f} bits")
print(f"KL Divergence D_KL(P||Q): {kl_val:.4f} bits")

Conceptual Check

Why is Kullback-Leibler Divergence $D_{KL}(P || Q)$ not considered a metric in the mathematical sense?

Conceptual Check

According to the Noisy Channel Coding Theorem, what is the primary requirement to achieve an arbitrarily low bit error rate?

Conceptual Check

Which distribution maximizes differential entropy for a continuous variable restricted to a fixed finite interval [a, b]?

Conceptual Check

What does the identity $I(X; Y) = H(X) + H(Y) - H(X, Y)$ represent?