Hypothesis Testing & Power
In statistical inference, hypothesis testing is the formal process of using data to evaluate the validity of a claim about a population parameter. This lesson moves beyond introductory “plug-and-chug” methods to explore the mathematical foundations of decision theory, likelihood ratios, and the optimization of test power.
1. The Decision Framework: $H_0$ and $H_1$
We define two competing hypotheses:
- Null Hypothesis ($H_0$): The status quo or a specific “no effect” state. Mathematically, it typically specifies a subset $\Theta_0$ of the parameter space $\Theta$, i.e., $H_0: \theta \in \Theta_0$.
- Alternative Hypothesis ($H_1$): The statement we seek to find evidence for, $H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$.
A test is a decision rule $\delta$ that maps the sample space $\mathcal{X}$ to the set $\{0, 1\}$ (fail to reject, reject). This is often defined via a rejection region $R \subseteq \mathcal{X}$:
$$\delta(x) = \begin{cases} 1 \;(\text{reject } H_0) & \text{if } x \in R \\ 0 \;(\text{fail to reject } H_0) & \text{if } x \notin R \end{cases}$$
2. Errors in Decision Making
Errors are unavoidable in frequentist inference. We quantify them as probabilities:
- Type I Error ($\alpha$): Rejecting $H_0$ when it is true, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$.
- Type II Error ($\beta$): Failing to reject $H_0$ when it is false, $\beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true})$.
- Power of the Test ($1 - \beta$): The probability of correctly rejecting a false null hypothesis.
Ideally, we minimize both $\alpha$ and $\beta$. However, for a fixed sample size $n$, there is an inverse relationship: decreasing the “size” ($\alpha$) of the test generally increases $\beta$.
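As a quick numerical sketch of this trade-off, the snippet below computes $\beta$ and power for a one-sided z-test of $H_0: \mu = 50$ against a fixed alternative $\mu = 52$ with $\sigma = 10$ and $n = 30$ (these values are purely illustrative); shrinking $\alpha$ visibly inflates $\beta$.

import numpy as np
from scipy import stats

# Illustrative setting: H0: mu = 50 vs fixed alternative mu = 52, sigma known
mu0, mu1, sigma, n = 50, 52, 10, 30
se = sigma / np.sqrt(n)

for alpha in [0.10, 0.05, 0.01]:
    z_crit = stats.norm.ppf(1 - alpha)                 # one-sided critical value
    x_crit = mu0 + z_crit * se                         # rejection threshold on the mean scale
    beta = stats.norm.cdf(x_crit, loc=mu1, scale=se)   # P(fail to reject | mu = mu1)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")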
3. Test Statistics and Rejection Regions
A test statistic $T(X_1, \dots, X_n)$ reduces the dimensionality of the data to a single value used for the decision. Common forms include:
The Z-Test (Known Variance)
Under $H_0: \mu = \mu_0$ with known variance $\sigma^2$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$$
The T-Test (Unknown Variance)
If $\sigma^2$ is unknown and estimated by the sample variance $S^2$:
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}$$
The Chi-Squared Test
For variance testing or goodness-of-fit:
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2_{n-1} \quad \text{or} \quad \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
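To make these formulas concrete, the sketch below computes the z, t, and variance chi-squared statistics (with right-tail p-values) for one simulated sample; the data-generating values are illustrative, not from any particular study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, sigma0, n = 50, 10, 30            # hypothesized mean and std dev, sample size
x = rng.normal(52, 10, n)              # simulated data

xbar, s2 = x.mean(), x.var(ddof=1)

z = (xbar - mu0) / (sigma0 / np.sqrt(n))         # z-test, variance known
t = (xbar - mu0) / (np.sqrt(s2) / np.sqrt(n))    # t-test, variance estimated
chi2 = (n - 1) * s2 / sigma0**2                  # chi-squared test for the variance

print(f"z    = {z:.3f},  right-tail p = {1 - stats.norm.cdf(z):.4f}")
print(f"t    = {t:.3f},  right-tail p = {1 - stats.t.cdf(t, n - 1):.4f}")
print(f"chi2 = {chi2:.2f}, right-tail p = {1 - stats.chi2.cdf(chi2, n - 1):.4f}")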
4. The P-Value: A Measure of Evidence
The p-value is not the probability that $H_0$ is true. Rather, it is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true.
Formally, for a test statistic $T$ where large values provide evidence against $H_0$:
$$p = P_{H_0}(T \geq t_{\text{obs}})$$
A p-value is a random variable itself. If $H_0$ is true, the p-value is uniformly distributed on $[0, 1]$ for continuous test statistics: $P_{H_0}(p \leq u) = u$ for all $u \in [0, 1]$.
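This uniformity is easy to verify by simulation. The sketch below repeatedly tests a true null with a one-sample t-test (the sample size and number of replications are arbitrary choices) and checks that the empirical p-value distribution behaves like Uniform(0, 1).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu0, n, n_sims = 50, 30, 10_000

# Generate data under H0 (true mean equals mu0) and collect the p-values
p_values = np.array([
    stats.ttest_1samp(rng.normal(mu0, 10, n), mu0).pvalue
    for _ in range(n_sims)
])

# Under H0 the p-value is Uniform(0, 1): about 5% fall below 0.05, about 50% below 0.5
print(np.mean(p_values < 0.05), np.mean(p_values < 0.5))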
5. The Neyman-Pearson Lemma
How do we choose the best rejection region $R$? For a simple null $H_0: \theta = \theta_0$ versus a simple alternative $H_1: \theta = \theta_1$, the Neyman-Pearson Lemma provides the Most Powerful (MP) test.
The lemma states that the region that maximizes power for a fixed $\alpha$ is defined by the Likelihood Ratio:
$$R = \left\{ x : \Lambda(x) = \frac{L(\theta_1 \mid x)}{L(\theta_0 \mid x)} \geq k \right\}$$
where $k$ is chosen such that $P_{\theta_0}(X \in R) = \alpha$. This ratio ensures that we reject $H_0$ when the data are significantly “more likely” under $H_1$ than under $H_0$.
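As a sketch of the lemma in action, consider a single observation $X \sim N(\mu, 1)$ and the simple hypotheses $H_0: \mu = 0$ vs $H_1: \mu = 1$ (values chosen for illustration). The likelihood ratio is increasing in $x$, so “reject when $\Lambda(x) \geq k$” is equivalent to “reject when $x$ is large”, and $k$ follows from the null quantile that fixes $\alpha$.

import numpy as np
from scipy import stats

alpha = 0.05
mu0, mu1 = 0.0, 1.0    # simple null vs simple alternative, X ~ N(mu, 1)

# The likelihood ratio L(mu1|x)/L(mu0|x) = exp((mu1 - mu0)*x - (mu1**2 - mu0**2)/2)
# is increasing in x, so rejecting for a large ratio means rejecting for large x.
x_crit = stats.norm.ppf(1 - alpha, loc=mu0)                 # P(X > x_crit | H0) = alpha
k = np.exp((mu1 - mu0) * x_crit - (mu1**2 - mu0**2) / 2)    # equivalent ratio threshold

power = 1 - stats.norm.cdf(x_crit, loc=mu1)                 # P(X > x_crit | H1)
print(f"x_crit = {x_crit:.3f}, k = {k:.3f}, power = {power:.3f}")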
6. Uniformly Most Powerful (UMP) Tests
When $H_1$ is composite (e.g., $H_1: \theta > \theta_0$), we seek a test that is the most powerful for every $\theta$ in the alternative simultaneously. Such a test is called Uniformly Most Powerful (UMP).
For one-sided hypotheses, the existence of a UMP test is guaranteed (via the Karlin-Rubin theorem) if the family of distributions possesses the Monotone Likelihood Ratio (MLR) property. A family $\{f(x \mid \theta)\}$ has MLR in a statistic $T(x)$ if for any $\theta_1 < \theta_2$, the ratio $f(x \mid \theta_2) / f(x \mid \theta_1)$ is a non-decreasing function of $T(x)$.
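For instance, for an i.i.d. sample from $N(\theta, \sigma^2)$ with $\sigma^2$ known, the joint likelihood ratio for $\theta_1 < \theta_2$ works out to
$$\frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} = \exp\!\left( \frac{(\theta_2 - \theta_1)\sum_{i=1}^{n} x_i}{\sigma^2} - \frac{n(\theta_2^2 - \theta_1^2)}{2\sigma^2} \right),$$
which is non-decreasing in $T(x) = \sum_i x_i$. The family therefore has MLR in $T$, and the one-sided test that rejects for large $\bar{X}$ is UMP for $H_0: \theta \leq \theta_0$ vs $H_1: \theta > \theta_0$.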
7. Likelihood Ratio Tests (LRT) and Wilks’ Theorem
For complex, multi-parameter composite hypotheses, we use the generalized Likelihood Ratio Test:
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)}$$
where $0 \leq \lambda(x) \leq 1$. Small values of $\lambda(x)$ lead to rejection.
Wilks’ Theorem: Under certain regularity conditions, as $n \to \infty$, the distribution of $-2 \ln \lambda(X)$ converges in distribution to a $\chi^2$ distribution with degrees of freedom equal to the difference in dimensionality between $\Theta$ and $\Theta_0$:
$$-2 \ln \lambda(X) \xrightarrow{d} \chi^2_{\dim(\Theta) - \dim(\Theta_0)}$$
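A minimal simulation sketch, assuming i.i.d. Exponential data and $H_0: \text{rate} = \lambda_0$: the MLE is $1/\bar{x}$, the generalized LRT reduces to $-2 \ln \lambda = 2n(\lambda_0 \bar{x} - 1 - \ln \lambda_0 \bar{x})$, and under $H_0$ the statistic should exceed the $\chi^2_1$ 0.95 quantile about 5% of the time. The sample size and replication count are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lam0, n, n_sims = 2.0, 100, 5000          # H0: rate = 2, illustrative values

# Simulate many datasets under H0 and evaluate -2 ln(lambda) for each
xbar = rng.exponential(scale=1 / lam0, size=(n_sims, n)).mean(axis=1)
lrt = 2 * n * (lam0 * xbar - 1 - np.log(lam0 * xbar))

# Wilks: -2 ln(lambda) is approximately chi-squared with 1 df, so roughly 5%
# of the simulated statistics should exceed its 0.95 quantile
print(np.mean(lrt > stats.chi2.ppf(0.95, df=1)))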
8. Multiple Testing and Bonferroni Correction
When conducting $m$ independent tests, each at significance level $\alpha$, the probability of committing at least one Type I error (the Family-Wise Error Rate, FWER) is $1 - (1 - \alpha)^m$. As $m$ grows, this approaches 1.
The Bonferroni correction guards against this by using a stricter threshold for each individual test:
$$\alpha_{\text{per-test}} = \frac{\alpha}{m}$$
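The sketch below simulates $m = 20$ independent tests of true nulls (so each p-value is Uniform(0, 1)) and estimates the FWER with and without the correction; the number of tests and replications are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
m, alpha, n_sims = 20, 0.05, 10_000

# Under independent true nulls, each p-value is Uniform(0, 1)
p = rng.uniform(size=(n_sims, m))

fwer_uncorrected = np.mean((p < alpha).any(axis=1))        # ~ 1 - (1 - 0.05)**20 ~ 0.64
fwer_bonferroni = np.mean((p < alpha / m).any(axis=1))     # ~ 0.05 or less
print(fwer_uncorrected, fwer_bonferroni)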
Python Implementation: T-Test and Visualization
The following code performs a one-sided, one-sample t-test ($H_1: \mu > \mu_0$) and visualizes the rejection region alongside the p-value area.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Parameters
np.random.seed(42)  # fix the seed so the example is reproducible
mu_null = 50
sample_size = 30
data = np.random.normal(52, 10, sample_size)  # true mean 52, std dev 10
# Perform a one-sided t-test (H1: mu > mu_null) so the reported p-value matches
# the right-tail area shaded in the plot (alternative= requires SciPy >= 1.6)
t_stat, p_val = stats.ttest_1samp(data, mu_null, alternative='greater')
df = sample_size - 1
# Plotting
x = np.linspace(-4, 4, 1000)
y = stats.t.pdf(x, df)
critical_value = stats.t.ppf(0.95, df)  # one-sided critical value at alpha = 0.05
plt.figure(figsize=(10, 6))
plt.plot(x, y, label=f't-distribution (df={df})')
# Rejection Region (Alpha = 0.05)
plt.fill_between(x, 0, y, where=(x > critical_value), color='red', alpha=0.3, label='Rejection Region')
# P-value area
plt.fill_between(x, 0, y, where=(x > t_stat), color='blue', alpha=0.5, label=f'p-value area (p={p_val:.4f})')
plt.axvline(t_stat, color='black', linestyle='--', label=f'Observed t={t_stat:.2f}')
plt.title('One-Sample T-Test: Rejection Region vs P-value')
plt.legend()
plt.show()