Hypothesis Testing & Power
In statistical inference, hypothesis testing is the formal process of using data to evaluate the validity of a claim about a population parameter. This lesson moves beyond introductory “plug-and-chug” methods to explore the mathematical foundations of decision theory, likelihood ratios, and the optimization of test power.
1. The Decision Framework: $H_0$ and $H_1$
We define two competing hypotheses:
- Null Hypothesis ($H_0$): The status quo or a specific “no effect” state. Mathematically, it typically specifies a subset $\Theta_0$ of the parameter space $\Theta$, i.e., $H_0: \theta \in \Theta_0$.
- Alternative Hypothesis ($H_1$): The statement we seek to find evidence for, $H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0$.
A test is a decision rule $\delta$ that maps the sample space $\mathcal{X}$ to the set $\{0, 1\}$ (fail to reject, reject). This is often defined via a rejection region $R \subseteq \mathcal{X}$:
$$\delta(x) = \begin{cases} 1 \;(\text{reject } H_0) & \text{if } x \in R \\ 0 \;(\text{fail to reject } H_0) & \text{if } x \notin R \end{cases}$$
2. Errors in Decision Making
Errors are unavoidable in frequentist inference. We quantify them as probabilities:
- Type I Error ($\alpha$): Rejecting $H_0$ when it is true, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$.
- Type II Error ($\beta$): Failing to reject $H_0$ when it is false, $\beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true})$.
- Power of the Test ($1 - \beta$): The probability of correctly rejecting a false null hypothesis.
Ideally, we minimize both $\alpha$ and $\beta$. However, for a fixed sample size $n$, there is an inverse relationship: decreasing the “size” ($\alpha$) of the test generally increases $\beta$.
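As a quick numerical sketch of this trade-off, the snippet below computes $\beta$ and power for a one-sided z-test of $H_0: \mu = 50$ against a fixed alternative $\mu = 52$ with $\sigma = 10$ and $n = 30$ (these values are purely illustrative); shrinking $\alpha$ visibly inflates $\beta$.

import numpy as np
from scipy import stats

# Illustrative setting: H0: mu = 50 vs fixed alternative mu = 52, sigma known
mu0, mu1, sigma, n = 50, 52, 10, 30
se = sigma / np.sqrt(n)

for alpha in [0.10, 0.05, 0.01]:
    z_crit = stats.norm.ppf(1 - alpha)                 # one-sided critical value
    x_crit = mu0 + z_crit * se                         # rejection threshold on the mean scale
    beta = stats.norm.cdf(x_crit, loc=mu1, scale=se)   # P(fail to reject | mu = mu1)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")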
3. Test Statistics and Rejection Regions
A test statistic $T(X_1, \dots, X_n)$ reduces the dimensionality of the data to a single value used for the decision. Common forms include:
The Z-Test (Known Variance)
Under $H_0: \mu = \mu_0$ with known variance $\sigma^2$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$$
The T-Test (Unknown Variance)
If $\sigma^2$ is unknown and estimated by the sample variance $S^2$:
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}$$
The Chi-Squared Test
For variance testing or goodness-of-fit:
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2_{n-1} \quad \text{or} \quad \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
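To make these formulas concrete, the sketch below computes the z, t, and variance chi-squared statistics (with right-tail p-values) for one simulated sample; the data-generating values are illustrative, not from any particular study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, sigma0, n = 50, 10, 30            # hypothesized mean and std dev, sample size
x = rng.normal(52, 10, n)              # simulated data

xbar, s2 = x.mean(), x.var(ddof=1)

z = (xbar - mu0) / (sigma0 / np.sqrt(n))         # z-test, variance known
t = (xbar - mu0) / (np.sqrt(s2) / np.sqrt(n))    # t-test, variance estimated
chi2 = (n - 1) * s2 / sigma0**2                  # chi-squared test for the variance

print(f"z    = {z:.3f},  right-tail p = {1 - stats.norm.cdf(z):.4f}")
print(f"t    = {t:.3f},  right-tail p = {1 - stats.t.cdf(t, n - 1):.4f}")
print(f"chi2 = {chi2:.2f}, right-tail p = {1 - stats.chi2.cdf(chi2, n - 1):.4f}")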
4. The P-Value: A Measure of Evidence
The p-value is not the probability that $H_0$ is true. Rather, it is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true.
Formally, for a test statistic $T$ where large values provide evidence against $H_0$:
$$p = P_{H_0}(T \geq t_{\text{obs}})$$
A p-value is a random variable itself. If $H_0$ is true, the p-value is uniformly distributed on $[0, 1]$ for continuous test statistics: $P_{H_0}(p \leq u) = u$ for all $u \in [0, 1]$.
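This uniformity is easy to verify by simulation. The sketch below repeatedly tests a true null with a one-sample t-test (the sample size and number of replications are arbitrary choices) and checks that the empirical p-value distribution behaves like Uniform(0, 1).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu0, n, n_sims = 50, 30, 10_000

# Generate data under H0 (true mean equals mu0) and collect the p-values
p_values = np.array([
    stats.ttest_1samp(rng.normal(mu0, 10, n), mu0).pvalue
    for _ in range(n_sims)
])

# Under H0 the p-value is Uniform(0, 1): about 5% fall below 0.05, about 50% below 0.5
print(np.mean(p_values < 0.05), np.mean(p_values < 0.5))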
5. The Neyman-Pearson Lemma
How do we choose the best rejection region $R$? For a simple null $H_0: \theta = \theta_0$ versus a simple alternative $H_1: \theta = \theta_1$, the Neyman-Pearson Lemma provides the Most Powerful (MP) test.
The lemma states that the region that maximizes power for a fixed $\alpha$ is defined by the Likelihood Ratio:
$$R = \left\{ x : \Lambda(x) = \frac{L(\theta_1 \mid x)}{L(\theta_0 \mid x)} \geq k \right\}$$
where $k$ is chosen such that $P_{\theta_0}(X \in R) = \alpha$. This ratio ensures that we reject $H_0$ when the data are significantly “more likely” under $H_1$ than under $H_0$.
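As a sketch of the lemma in action, consider a single observation $X \sim N(\mu, 1)$ and the simple hypotheses $H_0: \mu = 0$ vs $H_1: \mu = 1$ (values chosen for illustration). The likelihood ratio is increasing in $x$, so “reject when $\Lambda(x) \geq k$” is equivalent to “reject when $x$ is large”, and $k$ follows from the null quantile that fixes $\alpha$.

import numpy as np
from scipy import stats

alpha = 0.05
mu0, mu1 = 0.0, 1.0    # simple null vs simple alternative, X ~ N(mu, 1)

# The likelihood ratio L(mu1|x)/L(mu0|x) = exp((mu1 - mu0)*x - (mu1**2 - mu0**2)/2)
# is increasing in x, so rejecting for a large ratio means rejecting for large x.
x_crit = stats.norm.ppf(1 - alpha, loc=mu0)                 # P(X > x_crit | H0) = alpha
k = np.exp((mu1 - mu0) * x_crit - (mu1**2 - mu0**2) / 2)    # equivalent ratio threshold

power = 1 - stats.norm.cdf(x_crit, loc=mu1)                 # P(X > x_crit | H1)
print(f"x_crit = {x_crit:.3f}, k = {k:.3f}, power = {power:.3f}")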
6. Uniformly Most Powerful (UMP) Tests
When $H_1$ is composite (e.g., $H_1: \theta > \theta_0$), we seek a test that is the most powerful for every $\theta$ in the alternative simultaneously. Such a test is called Uniformly Most Powerful (UMP).
For one-sided hypotheses, the existence of a UMP test is guaranteed (via the Karlin-Rubin theorem) if the family of distributions possesses the Monotone Likelihood Ratio (MLR) property. A family $\{f(x \mid \theta)\}$ has MLR in a statistic $T(x)$ if for any $\theta_1 < \theta_2$, the ratio $f(x \mid \theta_2) / f(x \mid \theta_1)$ is a non-decreasing function of $T(x)$.
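For instance, for an i.i.d. sample from $N(\theta, \sigma^2)$ with $\sigma^2$ known, the joint likelihood ratio for $\theta_1 < \theta_2$ works out to
$$\frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} = \exp\!\left( \frac{(\theta_2 - \theta_1)\sum_{i=1}^{n} x_i}{\sigma^2} - \frac{n(\theta_2^2 - \theta_1^2)}{2\sigma^2} \right),$$
which is non-decreasing in $T(x) = \sum_i x_i$. The family therefore has MLR in $T$, and the one-sided test that rejects for large $\bar{X}$ is UMP for $H_0: \theta \leq \theta_0$ vs $H_1: \theta > \theta_0$.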
7. Likelihood Ratio Tests (LRT) and Wilks’ Theorem
For complex, multi-parameter composite hypotheses, we use the generalized Likelihood Ratio Test:
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)}$$
where $0 \leq \lambda(x) \leq 1$. Small values of $\lambda(x)$ lead to rejection.
Wilks’ Theorem: Under certain regularity conditions, as $n \to \infty$, the distribution of $-2 \ln \lambda(X)$ converges in distribution to a $\chi^2$ distribution with degrees of freedom equal to the difference in dimensionality between $\Theta$ and $\Theta_0$:
$$-2 \ln \lambda(X) \xrightarrow{d} \chi^2_{\dim(\Theta) - \dim(\Theta_0)}$$
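A minimal simulation sketch, assuming i.i.d. Exponential data and $H_0: \text{rate} = \lambda_0$: the MLE is $1/\bar{x}$, the generalized LRT reduces to $-2 \ln \lambda = 2n(\lambda_0 \bar{x} - 1 - \ln \lambda_0 \bar{x})$, and under $H_0$ the statistic should exceed the $\chi^2_1$ 0.95 quantile about 5% of the time. The sample size and replication count are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lam0, n, n_sims = 2.0, 100, 5000          # H0: rate = 2, illustrative values

# Simulate many datasets under H0 and evaluate -2 ln(lambda) for each
xbar = rng.exponential(scale=1 / lam0, size=(n_sims, n)).mean(axis=1)
lrt = 2 * n * (lam0 * xbar - 1 - np.log(lam0 * xbar))

# Wilks: -2 ln(lambda) is approximately chi-squared with 1 df, so roughly 5%
# of the simulated statistics should exceed its 0.95 quantile
print(np.mean(lrt > stats.chi2.ppf(0.95, df=1)))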
8. Multiple Testing and Bonferroni Correction
When conducting $m$ independent tests, each at significance level $\alpha$, the probability of committing at least one Type I error (the Family-Wise Error Rate, FWER) is $1 - (1 - \alpha)^m$. As $m$ grows, this approaches 1.
The Bonferroni correction guards against this by using a stricter threshold for each individual test:
$$\alpha_{\text{per-test}} = \frac{\alpha}{m}$$
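The sketch below simulates $m = 20$ independent tests of true nulls (so each p-value is Uniform(0, 1)) and estimates the FWER with and without the correction; the number of tests and replications are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
m, alpha, n_sims = 20, 0.05, 10_000

# Under independent true nulls, each p-value is Uniform(0, 1)
p = rng.uniform(size=(n_sims, m))

fwer_uncorrected = np.mean((p < alpha).any(axis=1))        # ~ 1 - (1 - 0.05)**20 ~ 0.64
fwer_bonferroni = np.mean((p < alpha / m).any(axis=1))     # ~ 0.05 or less
print(fwer_uncorrected, fwer_bonferroni)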
Python Implementation: T-Test and Visualization
The following code performs a one-sided, one-sample t-test ($H_1: \mu > \mu_0$) and visualizes the rejection region alongside the p-value area.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Parameters
np.random.seed(42)  # fix the seed so the example is reproducible
mu_null = 50
sample_size = 30
data = np.random.normal(52, 10, sample_size)  # true mean 52, std dev 10
# Perform a one-sided t-test (H1: mu > mu_null) so the reported p-value matches
# the right-tail area shaded in the plot (alternative= requires SciPy >= 1.6)
t_stat, p_val = stats.ttest_1samp(data, mu_null, alternative='greater')
df = sample_size - 1
# Plotting
x = np.linspace(-4, 4, 1000)
y = stats.t.pdf(x, df)
critical_value = stats.t.ppf(0.95, df)  # one-sided critical value at alpha = 0.05
plt.figure(figsize=(10, 6))
plt.plot(x, y, label=f't-distribution (df={df})')
# Rejection Region (Alpha = 0.05)
plt.fill_between(x, 0, y, where=(x > critical_value), color='red', alpha=0.3, label='Rejection Region')
# P-value area
plt.fill_between(x, 0, y, where=(x > t_stat), color='blue', alpha=0.5, label=f'p-value area (p={p_val:.4f})')
plt.axvline(t_stat, color='black', linestyle='--', label=f'Observed t={t_stat:.2f}')
plt.title('One-Sample T-Test: Rejection Region vs P-value')
plt.legend()
plt.show()