
Statistical Inference & Estimation

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. We assume that the observed data $x_1, \dots, x_n$ are realizations of random variables $X_1, \dots, X_n$ distributed according to some member of a parametric family $\{ f(x \mid \theta) : \theta \in \Theta \}$.

1. Point Estimation

A point estimator $\hat{\theta} = T(X_1, \dots, X_n)$ is a statistic (a function of the data) used to approximate the unknown parameter $\theta$.

Bias and Mean Squared Error (MSE)

The Bias of an estimator $\hat{\theta}$ is defined as:

$$\operatorname{Bias}(\hat{\theta}) = \mathbb{E}_\theta[\hat{\theta}] - \theta$$

An estimator is unbiased if $\operatorname{Bias}(\hat{\theta}) = 0$, i.e. $\mathbb{E}_\theta[\hat{\theta}] = \theta$.

The Mean Squared Error (MSE) measures the average squared difference between the estimator and the parameter:

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big]$$

A fundamental decomposition of MSE is:

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2$$

This highlights the bias-variance tradeoff: as we reduce bias, variance often increases, and vice versa.
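
As a quick illustration, here is a minimal Monte Carlo sketch of the decomposition (the normal population, sample size, and seed are arbitrary choices, not part of the original text). It compares the unbiased variance estimator (divisor $n-1$) with the divisor-$n$ version and checks that $\operatorname{Var} + \operatorname{Bias}^2$ matches the empirical MSE.

import numpy as np

# Monte Carlo check of MSE = Var + Bias^2 for two variance estimators
# (illustrative setup: N(0, 2^2) population, n = 20).
rng = np.random.default_rng(0)
true_var = 4.0
n, n_sim = 20, 50_000

samples = rng.normal(0, 2, size=(n_sim, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divisor n-1 (unbiased)
s2_divn = samples.var(axis=1, ddof=0)       # divisor n (biased, MLE-style)

for name, est in [("unbiased (n-1)", s2_unbiased), ("divisor n", s2_divn)]:
    bias = est.mean() - true_var
    var = est.var()
    mse = np.mean((est - true_var) ** 2)
    print(f"{name:15s} bias={bias:+.4f}  var={var:.4f}  "
          f"var+bias^2={var + bias**2:.4f}  mse={mse:.4f}")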

Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true parameter:

$$\lim_{n \to \infty} P\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0 \quad \text{for every } \epsilon > 0$$

This is often denoted as $\hat{\theta}_n \xrightarrow{p} \theta$. By the Law of Large Numbers, the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu$.
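
A minimal sketch of consistency in action (the exponential distribution and sample sizes are assumed choices for illustration): running sample means settle around the population mean as $n$ grows.

import numpy as np

# Sample means for increasing n concentrate around the population mean (LLN).
rng = np.random.default_rng(1)
mu = 3.0                                   # population mean of the Exponential draws
draws = rng.exponential(scale=mu, size=100_000)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n={n:>7d}  sample mean={draws[:n].mean():.4f}  (true mu={mu})")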

2. Maximum Likelihood Estimation (MLE)

MLE is the most widely used method for point estimation. Given i.i.d. observations $x_1, \dots, x_n$, the Likelihood Function is:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

We seek the value $\hat{\theta}_{MLE}$ that maximizes $L(\theta)$. In practice, it is easier to maximize the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

The Score Function

The Score Function is the gradient of the log-likelihood:

$$S(\theta) = \frac{\partial \ell(\theta)}{\partial \theta}$$

The MLE is found by solving the likelihood equation $S(\hat{\theta}) = 0$.
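
As a worked example (the Poisson model, anticipating the code further below): for $X_1, \dots, X_n \sim \text{Poisson}(\lambda)$ the log-likelihood is $\ell(\lambda) = \sum_{i=1}^{n} \big(x_i \log \lambda - \lambda - \log x_i!\big)$, so the score is

$$S(\lambda) = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n$$

and solving $S(\hat{\lambda}) = 0$ gives $\hat{\lambda}_{MLE} = \bar{x}$, the sample mean.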

Asymptotic Properties of MLE

Under “regularity conditions,” MLEs possess desirable large-sample properties:

  1. Consistency: $\hat{\theta}_{MLE} \xrightarrow{p} \theta_0$.
  2. Asymptotic Normality: $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big)$, where $I(\theta_0)$ is the Fisher Information (a simulation sketch of this property follows the list).
  3. Efficiency: For large $n$, the MLE achieves the Cramér-Rao Lower Bound.
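
The sketch below (simulation settings are assumed for illustration) checks asymptotic normality for the Poisson MLE $\hat{\lambda} = \bar{X}$: across repeated samples, $\sqrt{n}(\hat{\lambda} - \lambda_0)$ should have standard deviation close to $\sqrt{I(\lambda_0)^{-1}} = \sqrt{\lambda_0}$.

import numpy as np

# Asymptotic normality of the Poisson MLE (lambda_hat = sample mean):
# sqrt(n) * (lambda_hat - lambda_0) is approximately N(0, lambda_0),
# since the Fisher information of Poisson(lambda) is 1/lambda.
rng = np.random.default_rng(2)
lam0, n, n_sim = 4.5, 200, 20_000

samples = rng.poisson(lam0, size=(n_sim, n))
mle = samples.mean(axis=1)
z = np.sqrt(n) * (mle - lam0)

print(f"empirical sd of sqrt(n)(mle - lam0): {z.std():.3f}")
print(f"theoretical sd sqrt(lam0):           {np.sqrt(lam0):.3f}")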

3. Method of Moments (MoM)

MoM estimates parameters by equating population moments to sample moments. If $\theta = (\theta_1, \dots, \theta_k)$, we solve the system:

$$\mathbb{E}_\theta[X^j] = \frac{1}{n}\sum_{i=1}^{n} X_i^j, \qquad j = 1, \dots, k$$

MoM is often easier to compute than MLE but is usually less efficient (higher variance).
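
For a concrete sketch (the Gamma model and its true parameter values are assumptions chosen for illustration): for $X \sim \text{Gamma}(k, \theta)$ we have $\mathbb{E}[X] = k\theta$ and $\operatorname{Var}(X) = k\theta^2$, so matching the first two sample moments gives $\hat{\theta} = (\widehat{m}_2 - \bar{X}^2)/\bar{X}$ and $\hat{k} = \bar{X}/\hat{\theta}$.

import numpy as np

# Method of Moments for Gamma(shape=k, scale=theta):
#   E[X] = k*theta,  Var(X) = k*theta^2
# => theta_hat = (m2 - m1^2) / m1,  k_hat = m1 / theta_hat
rng = np.random.default_rng(3)
k_true, theta_true = 2.0, 1.5
data = rng.gamma(shape=k_true, scale=theta_true, size=5_000)

m1 = data.mean()
m2 = np.mean(data ** 2)
theta_hat = (m2 - m1 ** 2) / m1
k_hat = m1 / theta_hat

print(f"MoM shape k_hat     = {k_hat:.3f}  (true {k_true})")
print(f"MoM scale theta_hat = {theta_hat:.3f}  (true {theta_true})")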

4. Sufficient Statistics

A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of the sample $X = (X_1, \dots, X_n)$ given $T(X)$ does not depend on $\theta$. This means $T(X)$ captures all the information in the sample about $\theta$.

Factorization Theorem (Fisher-Neyman)

$T(X)$ is sufficient for $\theta$ if and only if the joint density can be factored as:

$$f(x \mid \theta) = h(x)\, g\big(T(x), \theta\big)$$

where $h(x)$ does not depend on $\theta$ and $g$ depends on $x$ only through $T(x)$.
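
For example, for i.i.d. Bernoulli($p$) observations,

$$f(x \mid p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = \underbrace{1}_{h(x)} \cdot \underbrace{p^{T(x)}(1-p)^{\,n - T(x)}}_{g(T(x),\, p)}, \qquad T(x) = \sum_{i=1}^{n} x_i,$$

so the number of successes $T(x)$ is sufficient for $p$.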

5. Information & Efficiency

Fisher Information

The Fisher Information $I(\theta)$ represents the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$:

$$I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^{\!2}\right] = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right]$$

(the second equality holds under the usual regularity conditions).

Cramér-Rao Lower Bound (CRLB)

For any unbiased estimator $\hat{\theta}$ based on $n$ i.i.d. observations, its variance is bounded from below:

$$\operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\, I(\theta)}$$

An unbiased estimator that achieves this bound is called efficient; an efficient estimator is automatically the UMVUE (Uniformly Minimum Variance Unbiased Estimator), although a UMVUE need not attain the bound.
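
As a numerical check (the Poisson model and simulation settings are assumptions for illustration): for Poisson($\lambda$), $I(\lambda) = 1/\lambda$, so the CRLB for an unbiased estimator from $n$ observations is $\lambda/n$; the sample mean attains it.

import numpy as np

# The sample mean of Poisson(lambda) data is unbiased with variance lambda/n,
# which equals the CRLB 1 / (n * I(lambda)) because I(lambda) = 1/lambda.
rng = np.random.default_rng(4)
lam, n, n_sim = 4.5, 50, 40_000

samples = rng.poisson(lam, size=(n_sim, n))
xbar = samples.mean(axis=1)

print(f"empirical Var(xbar): {xbar.var():.4f}")
print(f"CRLB lambda/n:       {lam / n:.4f}")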

Rao-Blackwell Theorem

If $\hat{\theta}$ is an unbiased estimator and $T$ is a sufficient statistic, then the conditional expectation $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ is also unbiased and $\operatorname{Var}(\tilde{\theta}) \le \operatorname{Var}(\hat{\theta})$. This implies that we need only search for optimal estimators among functions of sufficient statistics.
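
A classic illustration, sketched under assumed simulation settings: to estimate $e^{-\lambda} = P(X = 0)$ from Poisson data, start with the crude unbiased estimator $\mathbf{1}\{X_1 = 0\}$ and condition on the sufficient statistic $T = \sum_i X_i$, which gives $\mathbb{E}[\mathbf{1}\{X_1 = 0\} \mid T] = (1 - 1/n)^T$. Both are unbiased, but the Rao-Blackwellized version has much smaller variance.

import numpy as np

# Rao-Blackwellization: estimating P(X=0) = exp(-lambda) for Poisson data.
# Crude unbiased estimator: 1{X_1 = 0}.  Conditioning on T = sum(X_i)
# gives E[1{X_1=0} | T] = (1 - 1/n)^T, also unbiased but with lower variance.
rng = np.random.default_rng(5)
lam, n, n_sim = 2.0, 30, 40_000

samples = rng.poisson(lam, size=(n_sim, n))
crude = (samples[:, 0] == 0).astype(float)
rb = (1 - 1 / n) ** samples.sum(axis=1)

print(f"target exp(-lambda) = {np.exp(-lam):.4f}")
print(f"crude: mean={crude.mean():.4f}  var={crude.var():.5f}")
print(f"RB:    mean={rb.mean():.4f}  var={rb.var():.5f}")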

6. Interval Estimation

Instead of a single point, we construct a Confidence Interval (CI) $\big(L(X),\, U(X)\big)$ such that:

$$P\big(L(X) \le \theta \le U(X)\big) = 1 - \alpha$$

This is often done using a Pivotal Quantity $Q(X, \theta)$, a function of the data and the parameter whose distribution does not depend on $\theta$. Example: For $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma$, $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a pivot since $Z \sim \mathcal{N}(0, 1)$.
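
A minimal sketch of this pivot in code (normal data with known $\sigma$ and the specific settings are assumptions of the example): build the 95% interval $\bar{x} \pm z_{0.975}\,\sigma/\sqrt{n}$ and check its coverage by simulation.

import numpy as np
from scipy.stats import norm

# 95% CI for mu from the pivot Z = (xbar - mu) / (sigma / sqrt(n)) ~ N(0, 1),
# with sigma assumed known; coverage is checked over repeated samples.
rng = np.random.default_rng(6)
mu, sigma, n, n_sim = 10.0, 2.0, 25, 20_000
z = norm.ppf(0.975)

samples = rng.normal(mu, sigma, size=(n_sim, n))
xbar = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)
covered = (xbar - half_width <= mu) & (mu <= xbar + half_width)

print(f"half-width = {half_width:.3f}")
print(f"empirical coverage = {covered.mean():.3f}  (nominal 0.95)")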

Python Implementation: MLE for Poisson Distribution

The following code computes the MLE for the parameter $\lambda$ of a Poisson distribution and visualizes the log-likelihood as a function of $\lambda$.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Generate synthetic Poisson data with true lambda = 4.5
np.random.seed(42)
true_lambda = 4.5
data = np.random.poisson(true_lambda, size=100)

def log_likelihood(lam, data):
    if lam <= 0: return -np.inf
    # Poisson PMF: (lam^k * e^-lam) / k!
    return np.sum(poisson.logpmf(data, lam))

# We want to maximize the log-likelihood, which is equivalent to minimizing the negative log-likelihood
def neg_log_likelihood(lam, data):
    return -log_likelihood(lam, data)

# Find MLE using scipy
res = minimize_scalar(neg_log_likelihood, args=(data,), bounds=(0.1, 10), method='bounded')
mle_lambda = res.x

print(f"Sample Mean: {np.mean(data):.4f}")
print(f"MLE Lambda: {mle_lambda:.4f}")

# Visualization
lam_range = np.linspace(2, 7, 100)
ll_values = [log_likelihood(l, data) for l in lam_range]

plt.figure(figsize=(10, 5))
plt.plot(lam_range, ll_values, label='Log-Likelihood', color='#2563eb', lw=2)
plt.axvline(mle_lambda, color='red', linestyle='--', label=rf'MLE $\hat{{\lambda}}$ = {mle_lambda:.2f}')
plt.title(r'Log-Likelihood Surface for Poisson Parameter $\lambda$')
plt.xlabel(r'$\lambda$')
plt.ylabel('Log-Likelihood')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

Advanced Concepts: Efficiency

An estimator’s performance is often compared via Relative Efficiency:

$$\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\operatorname{Var}(\hat{\theta}_2)}{\operatorname{Var}(\hat{\theta}_1)}$$

If the efficiency is greater than 1, $\hat{\theta}_1$ is superior. As $n \to \infty$, the asymptotic efficiency of the MLE is 1, meaning it is asymptotically the best one can do.
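
For a concrete comparison (the normal model and simulation settings are assumed for illustration): for $\mathcal{N}(\mu, 1)$ data the sample median is also consistent for $\mu$, but its asymptotic variance is $\pi/2$ times that of the sample mean, so its relative efficiency against the mean is about $2/\pi \approx 0.64$.

import numpy as np

# Relative efficiency of the sample median vs. the sample mean for N(mu, 1) data.
# Asymptotically Var(median) ~ (pi/2) * Var(mean), so eff(median, mean) ~ 2/pi.
rng = np.random.default_rng(7)
mu, n, n_sim = 0.0, 500, 20_000

samples = rng.normal(mu, 1.0, size=(n_sim, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print(f"Var(mean)   = {var_mean:.6f}")
print(f"Var(median) = {var_median:.6f}")
print(f"eff(median, mean) = {var_mean / var_median:.3f}  (~2/pi = {2 / np.pi:.3f})")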

Conceptual Check

According to the Factorization Theorem, what defines a sufficient statistic T(x)?

Conceptual Check

What is the significance of the Cramér-Rao Lower Bound?

Conceptual Check

Which property of MLE ensures that as sample size increases, the estimator converges to the true parameter?