Information Theory & Entropy

Information theory, pioneered by Claude Shannon in his seminal 1948 paper “A Mathematical Theory of Communication,” provides the mathematical framework for quantifying information, compression, and transmission. This lesson explores the rigorous foundations of entropy and the quantities derived from it.

1. Shannon Entropy

The core measure of information is Shannon Entropy. For a discrete random variable $X$ with alphabet $\mathcal{X}$ and probability mass function $p(x)$, the entropy is defined as:

$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x)$

Usually, $b = 2$ (bits) or $b = e$ (nats). By convention, $0 \log 0 = 0$.
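
For example, a fair coin ($p = 0.5$) has $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit, while a heavily biased coin with $p = 0.9$ has $H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469$ bits: the more predictable the source, the less information each outcome carries.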

Axiomatic Foundations

Shannon showed that $H$ is the unique function (up to a constant factor) satisfying the following axioms:

  1. Continuity: Small changes in the probabilities $p_i$ result in small changes in entropy.
  2. Monotonicity: If all $n$ outcomes are equally likely ($p_i = 1/n$), $H$ should be a monotonically increasing function of $n$.
  3. Recursion/Grouping: The entropy of a choice can be decomposed into weighted sums of sub-choices. If an outcome is split into two, the new entropy is the original plus the weighted entropy of the split.

2. Joint and Conditional Entropy

To handle multiple random variables, we extend the definition to joint and conditional contexts.

Joint Entropy measures the total uncertainty in a pair of variables $(X, Y)$:

$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(x, y)$

Conditional Entropy measures the remaining uncertainty in $Y$ given that $X$ is known:

$H(Y|X) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y|x)$

The Chain Rule for Entropy:

$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$

This implies that $H(Y|X) \le H(Y)$ (conditioning reduces entropy, or at least does not increase it), with equality if and only if $X$ and $Y$ are independent.
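
The chain rule is easy to check numerically. The following sketch (the joint probabilities are illustrative, not taken from the text) computes $H(X)$, $H(X, Y)$, and $H(Y|X)$ from a small joint distribution and confirms that the two routes to $H(Y|X)$ agree:

import numpy as np

def H(p):
    """Shannon entropy (bits) of an array of probabilities; zero entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative 2x3 joint distribution p(x, y): rows index X, columns index Y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)            # marginal distribution of X
H_xy = H(p_xy)                    # joint entropy H(X, Y)
H_x = H(p_x)                      # marginal entropy H(X)
H_y_given_x = H_xy - H_x          # conditional entropy via the chain rule

# Direct definition: H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x_direct = -np.sum(p_xy * np.log2(p_y_given_x))

print(f"H(X,Y)  = {H_xy:.4f} bits")
print(f"H(X)    = {H_x:.4f} bits")
print(f"H(Y|X)  = {H_y_given_x:.4f} bits (chain rule)")
print(f"H(Y|X)  = {H_y_given_x_direct:.4f} bits (direct definition)")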

3. Mutual Information

Mutual information quantifies the amount of information obtained about one random variable through another. It is the reduction in uncertainty of $X$ due to the knowledge of $Y$:

$I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$

In terms of probability distributions:

$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)}$

It is symmetric ($I(X; Y) = I(Y; X)$) and non-negative ($I(X; Y) \ge 0$).
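
The same identities can be verified in code. A minimal sketch, again with an illustrative joint distribution, computes $I(X; Y)$ both as $H(X) + H(Y) - H(X, Y)$ and as the KL divergence between the joint distribution and the product of its marginals:

import numpy as np

def H(p):
    """Shannon entropy (bits) of an array of probabilities; zero entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution p(x, y)
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

# I(X;Y) via the entropy identity
I_entropy = H(p_x) + H(p_y) - H(p_xy)

# I(X;Y) as D_KL( p(x,y) || p(x)p(y) )
indep = np.outer(p_x, p_y)
I_kl = np.sum(p_xy * np.log2(p_xy / indep))

print(f"I(X;Y) = {I_entropy:.4f} bits (entropy identity)")
print(f"I(X;Y) = {I_kl:.4f} bits (KL form)")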

4. Relative Entropy (Kullback-Leibler Divergence)

The Kullback-Leibler (KL) Divergence measures the “distance” between two probability distributions $P$ and $Q$ over the same alphabet:

$D_{KL}(P || Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$

Key Properties

  • Non-negativity: $D_{KL}(P || Q) \ge 0$ (Gibbs’ Inequality), with equality iff $P = Q$.
  • Non-symmetry: In general, $D_{KL}(P || Q) \ne D_{KL}(Q || P)$. Thus, it is not a metric (it also fails the triangle inequality); a numeric example follows this list.
  • Interpretation: $D_{KL}(P || Q)$ represents the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$.
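
As a concrete illustration of the non-symmetry, take the Bernoulli distributions $P = (0.5, 0.5)$ and $Q = (0.25, 0.75)$. Then $D_{KL}(P || Q) = 0.5\log_2\frac{0.5}{0.25} + 0.5\log_2\frac{0.5}{0.75} \approx 0.208$ bits, while $D_{KL}(Q || P) = 0.25\log_2\frac{0.25}{0.5} + 0.75\log_2\frac{0.75}{0.5} \approx 0.189$ bits.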

5. The Source Coding Theorem

Shannon’s first fundamental theorem establishes the absolute limit of data compression. It states that for a source of $n$ i.i.d. random variables $X_1, X_2, \ldots, X_n \sim p(x)$:

  1. As $n \to \infty$, the data can be compressed into $nH(X)$ bits with negligible risk of information loss.
  2. It is impossible to compress the data into fewer than $H(X)$ bits per symbol without losing information.

This defines entropy as the fundamental limit of lossless compression.
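
As a rough numerical illustration (the Bernoulli(0.1) source below is an assumption made for concreteness, not part of the theorem), the entropy bound shows how far below one bit per symbol a binary source can be compressed:

import numpy as np

def binary_entropy(p):
    """Entropy H(X) in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1          # illustrative source: a '1' appears 10% of the time
n = 1_000_000    # number of i.i.d. symbols

H = binary_entropy(p)
print(f"H(X) = {H:.4f} bits/symbol")
print(f"Raw storage:   {n} bits")
print(f"Shannon limit: {n * H:.0f} bits (~{100 * H:.1f}% of the raw size)")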

6. Channel Capacity and Noisy Coding

While source coding deals with compression, Channel Coding deals with reliability over noisy media.

Channel Capacity

The capacity $C$ of a discrete memoryless channel is the maximum mutual information between input $X$ and output $Y$ over all possible input distributions:

$C = \max_{p(x)} I(X; Y)$
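
For the standard example of the binary symmetric channel with crossover probability $p$, the maximizing input distribution is uniform and the capacity reduces to the closed form $C = 1 - H_b(p)$, where $H_b$ is the binary entropy function. A minimal sketch (the crossover probabilities are illustrative):

import numpy as np

def binary_entropy(p):
    """Binary entropy function H_b(p) in bits, elementwise."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    out[mask] = -p[mask] * np.log2(p[mask]) - (1 - p[mask]) * np.log2(1 - p[mask])
    return out

crossover = np.array([0.0, 0.01, 0.1, 0.5])
capacity = 1.0 - binary_entropy(crossover)   # capacity of the binary symmetric channel

for p, c in zip(crossover, capacity):
    print(f"BSC(p={p:.2f}): C = {c:.4f} bits/use")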

Noisy Channel Coding Theorem

Shannon proved that for any rate $R < C$, there exist error-correcting codes such that the probability of error at the receiver can be made arbitrarily small as the block length $n \to \infty$. Conversely, if $R > C$, the error probability is bounded away from zero.

Shannon-Hartley Theorem

For a continuous channel with bandwidth $B$ (Hz), signal power $S$, and additive white Gaussian noise power $N$:

$C = B \log_2\left(1 + \frac{S}{N}\right) \text{ bits/s}$
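
A quick numerical check (the bandwidth and SNR values are illustrative, roughly matching a classic voice-band telephone channel):

import numpy as np

def shannon_hartley(bandwidth_hz, snr_linear):
    """AWGN channel capacity in bits/s: C = B * log2(1 + S/N)."""
    return bandwidth_hz * np.log2(1.0 + snr_linear)

B = 3_000                        # bandwidth in Hz
snr_db = 30                      # signal-to-noise ratio in dB
snr_linear = 10 ** (snr_db / 10)

C = shannon_hartley(B, snr_linear)
print(f"C = {C / 1000:.1f} kbit/s for B = {B} Hz, SNR = {snr_db} dB")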

7. Differential Entropy

For a continuous random variable $X$ with PDF $f(x)$, Differential Entropy is:

$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$

Warning: Unlike discrete entropy, $h(X)$ can be negative and is not invariant under change of variables. For example, if $X \sim \mathrm{Uniform}(0, a)$, then $h(X) = \log a$, which is negative whenever $a < 1$. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$.
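
Both closed forms can be checked numerically. The sketch below uses scipy.stats, which reports differential entropy in nats, so the results are converted to bits; the parameter values are illustrative:

import numpy as np
from scipy.stats import norm, uniform

a = 0.5        # illustrative uniform width (a < 1, so the entropy is negative)
sigma = 2.0    # illustrative Gaussian standard deviation

# scipy returns differential entropy in nats; divide by ln 2 to get bits
h_uniform = float(uniform(loc=0, scale=a).entropy()) / np.log(2)
h_normal = float(norm(loc=0, scale=sigma).entropy()) / np.log(2)

print(f"h(Uniform(0, {a})) = {h_uniform:.4f} bits; formula log2({a}) = {np.log2(a):.4f}")
print(f"h(N(0, {sigma}^2))   = {h_normal:.4f} bits; "
      f"formula 0.5*log2(2*pi*e*sigma^2) = {0.5 * np.log2(2 * np.pi * np.e * sigma**2):.4f}")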

8. Maximum Entropy Principle

The Principle of Maximum Entropy (MaxEnt) states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy, subject to known constraints.

If we only know the mean and variance of a distribution, the MaxEnt distribution is the Normal (Gaussian) Distribution. If we only know the mean of a positive-valued variable, it is the Exponential Distribution.

In statistical mechanics, the Boltzmann distribution is found by maximizing entropy subject to a fixed average energy.
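
One way to see the MaxEnt claim for a known mean and variance is to compare the differential entropies of a few zero-mean, unit-variance distributions; the particular comparison distributions below are an illustrative choice:

import numpy as np
from scipy.stats import norm, laplace, uniform

# All three distributions have mean 0 and variance 1
candidates = {
    "Normal(0, 1)":      norm(loc=0, scale=1),
    "Laplace (var 1)":   laplace(loc=0, scale=1 / np.sqrt(2)),            # variance = 2*b^2
    "Uniform (var 1)":   uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # variance = width^2 / 12
}

for name, dist in candidates.items():
    h_bits = float(dist.entropy()) / np.log(2)   # scipy gives nats; convert to bits
    print(f"{name:18s} h = {h_bits:.4f} bits")

# The Gaussian attains the largest differential entropy, as the MaxEnt principle predicts.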

9. Python Implementation: Entropy and KL Divergence

import numpy as np
from scipy.stats import entropy

def calculate_shannon_entropy(p):
    \"\"\"Calculates Shannon Entropy of a discrete distribution p.\"\"\"
    p = np.array(p)
    # Filter out zero probabilities to avoid log(0)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def calculate_kl_divergence(p, q):
    \"\"\"Calculates D_KL(P || Q) for two discrete distributions.\"\"\"
    p = np.array(p)
    q = np.array(q)
    # Ensure they sum to 1
    p = p / np.sum(p)
    q = q / np.sum(q)
    
    # Using scipy for comparison
    kl_scipy = entropy(p, q, base=2)
    
    # Manual calculation
    # Only sum where p[i] > 0
    mask = p > 0
    kl_manual = np.sum(p[mask] * np.log2(p[mask] / q[mask]))
    
    return kl_manual, kl_scipy

# Example usage
p_dist = [0.2, 0.5, 0.3]
q_dist = [0.1, 0.6, 0.3]

h_p = calculate_shannon_entropy(p_dist)
kl_val, kl_ref = calculate_kl_divergence(p_dist, q_dist)

print(f"Entropy H(P): {h_p:.4f} bits")
print(f"KL Divergence D_KL(P||Q): {kl_val:.4f} bits")

Conceptual Check

Why is Kullback-Leibler Divergence $D_{KL}(P || Q)$ not considered a metric in the mathematical sense?

Conceptual Check

According to the Noisy Channel Coding Theorem, what is the primary requirement to achieve an arbitrarily low bit error rate?

Conceptual Check

Which distribution maximizes differential entropy for a continuous variable restricted to a fixed finite interval [a, b]?

Conceptual Check

What does the identity $I(X; Y) = H(X) + H(Y) - H(X, Y)$ represent?