Bayesian Inference & Modeling
Bayesian statistics provides a coherent mathematical framework for updating the probability of a hypothesis as more evidence or information becomes available. Unlike the frequentist approach, which treats parameters as fixed and data as random, the Bayesian paradigm treats parameters themselves as random variables, allowing for a more natural integration of prior knowledge and uncertainty.
1. Frequentist vs. Bayesian Philosophies
The fundamental divide in statistical inference rests on the interpretation of probability:
- Frequentist Probability: Defined as the limit of an event’s relative frequency in a large number of trials. If we say a coin has $P(\text{Heads}) = 0.5$, we mean that as the number of flips $n \to \infty$, the proportion of heads converges to $0.5$. Parameters are unknown but fixed constants.
- Bayesian Probability: Defined as a measure of “degree of belief” or “plausibility.” It is a quantifiable state of knowledge. This allows us to assign probabilities to one-off events (e.g., “the probability that it rained on Mars yesterday”) where frequentist repetition is impossible. Parameters are random variables characterized by probability distributions.
2. The Bayesian Framework
The engine of Bayesian inference is Bayes’ Theorem. Let $D$ represent the observed data and $\theta$ denote the parameters of the model.
The Posterior Distribution
The goal is to compute the Posterior Distribution $p(\theta \mid D)$, which represents our updated belief about the parameters after observing the data:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Where:
- Prior $p(\theta)$: The distribution representing information about $\theta$ before the data is collected.
- Likelihood $p(D \mid \theta)$: The probability of the data occurring given a specific parameter value $\theta$.
- Evidence (Marginal Likelihood) $p(D)$: The total probability of the data, integrated over all possible parameters: $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$. This serves as a normalizing constant.
In most applications, we focus on the kernel of the distribution:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$

This captures the intuition that the posterior is a compromise between the evidence provided by the data (likelihood) and our pre-existing knowledge (prior).
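To make the proportionality concrete, a posterior can be approximated on a grid by multiplying likelihood and prior pointwise and normalizing at the end. A minimal sketch, assuming a hypothetical coin-flip dataset and a Beta(2, 2) prior (both chosen for illustration):

```python
import numpy as np
from scipy import stats

# Grid approximation: posterior ∝ likelihood × prior
theta_grid = np.linspace(0.001, 0.999, 999)      # candidate values of theta
y, n = 7, 10                                     # assumed data: 7 heads in 10 flips

prior = stats.beta.pdf(theta_grid, 2, 2)         # mildly informative Beta(2, 2) prior
likelihood = stats.binom.pmf(y, n, theta_grid)   # Binomial likelihood at each grid point
unnorm_post = likelihood * prior                 # kernel of the posterior

dx = theta_grid[1] - theta_grid[0]
posterior = unnorm_post / (unnorm_post.sum() * dx)   # normalize so it integrates to 1
print(f"Posterior mean ≈ {(theta_grid * posterior).sum() * dx:.3f}")
```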
3. Conjugate Priors
A significant portion of analytic Bayesian statistics relies on Conjugacy. A prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions as the prior.
Beta-Binomial Model
Consider $y$ successes in $n$ Bernoulli trials. The likelihood is Binomial:

$$p(y \mid \theta) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y}$$

If we choose a Beta distribution as our prior, $\theta \sim \text{Beta}(\alpha, \beta)$:

$$p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

The posterior will be:

$$p(\theta \mid y) \propto \theta^{y + \alpha - 1} (1 - \theta)^{n - y + \beta - 1}$$

which is recognized as $\text{Beta}(\alpha + y,\ \beta + n - y)$. This provides a simple rule: $\alpha$ and $\beta$ can be interpreted as “prior successes” and “prior failures.”
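A minimal numeric sketch of this update rule (the prior pseudo-counts and the data are made up for illustration):

```python
from scipy import stats

# Conjugate Beta-Binomial update: Beta(a, b) prior + y successes in n trials
a_prior, b_prior = 2, 2          # assumed prior "pseudo-counts"
y, n = 13, 20                    # hypothetical data: 13 successes in 20 trials

a_post, b_post = a_prior + y, b_prior + (n - y)   # Beta(a + y, b + n - y)
posterior = stats.beta(a_post, b_post)

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Posterior mean: {posterior.mean():.3f}")  # (a + y) / (a + b + n)
```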
Normal-Normal Model
If $x_1, \dots, x_n \sim N(\theta, \sigma^2)$ with $\sigma^2$ known and the prior is $\theta \sim N(\mu_0, \tau_0^2)$, the posterior is also Normal with:

$$\mu_{\text{post}} = \frac{\dfrac{\mu_0}{\tau_0^2} + \dfrac{n \bar{x}}{\sigma^2}}{\dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2}}, \qquad \tau_{\text{post}}^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$

This demonstrates that the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$.
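A small numeric check of the precision-weighting formula, with assumed values for the prior, the known variance, and a simulated sample:

```python
import numpy as np

# Precision-weighted posterior for a Normal mean with known variance
np.random.seed(0)
sigma2 = 1.0                                          # known data variance (assumed)
mu0, tau0_2 = 0.0, 4.0                                # prior mean and prior variance (assumed)
x = np.random.normal(2.5, np.sqrt(sigma2), size=25)   # hypothetical sample
n, xbar = len(x), x.mean()

prec_prior, prec_data = 1.0 / tau0_2, n / sigma2      # precisions = inverse variances
post_var = 1.0 / (prec_prior + prec_data)
post_mean = post_var * (prec_prior * mu0 + prec_data * xbar)

print(f"Sample mean: {xbar:.3f}")
print(f"Posterior mean: {post_mean:.3f}, posterior sd: {np.sqrt(post_var):.3f}")
```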
4. Uninformative and Objective Priors
When little or no prior information is available, we want a prior that contributes as little information as possible to the inference.
- Principle of Indifference: Using a flat (Uniform) prior. However, flat priors are not invariant under transformation: a uniform prior on $\sigma$ is not uniform on $\sigma^2$ (the variance).
- Jeffreys’ Prior: An objective prior that is invariant under reparameterization. It is proportional to the square root of the determinant of the Fisher Information $I(\theta)$: $p(\theta) \propto \sqrt{\det I(\theta)}$. For a Binomial proportion, the Jeffreys prior is $\text{Beta}(1/2,\, 1/2)$.
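To see where the $\text{Beta}(1/2, 1/2)$ result comes from, note that the Fisher information of a single Bernoulli trial is $I(\theta) = 1/(\theta(1-\theta))$, so $\sqrt{I(\theta)} \propto \theta^{-1/2}(1-\theta)^{-1/2}$. The short sketch below (grid values chosen arbitrarily) checks numerically that this kernel matches the Beta(1/2, 1/2) density up to a constant:

```python
import numpy as np
from scipy import stats

# Jeffreys prior for the Bernoulli/Binomial parameter: sqrt of the Fisher information
theta = np.linspace(0.01, 0.99, 99)
fisher_info = 1.0 / (theta * (1.0 - theta))      # I(theta) for a single Bernoulli trial
jeffreys_unnorm = np.sqrt(fisher_info)           # ∝ theta^(-1/2) * (1 - theta)^(-1/2)

# Compare (up to a constant) with the Beta(1/2, 1/2) density
beta_half = stats.beta.pdf(theta, 0.5, 0.5)
ratio = jeffreys_unnorm / beta_half
print(f"ratio is constant: min={ratio.min():.4f}, max={ratio.max():.4f}")
```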
5. Point Estimation: Mean, Median, and MAP
Unlike frequentist statistics, which yields a single point estimate (the MLE), Bayesian inference yields an entire posterior distribution. If a single number is required, decision theory guides the choice of summary:
- Posterior Mean: $\hat{\theta} = \mathbb{E}[\theta \mid D]$. Minimizes the expected squared error loss.
- Posterior Median: The value $m$ such that $P(\theta \le m \mid D) = 0.5$. Minimizes the expected absolute error loss.
- Maximum A Posteriori (MAP): The mode of the posterior, $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid D)$. MAP is the Bayesian analog of Maximum Likelihood Estimation.
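In practice these summaries are usually computed from posterior draws (e.g., from the MCMC sampler in Section 8). The sketch below is a minimal, sample-based version, using hypothetical Gamma-shaped draws and a simple histogram mode as one possible MAP approximation:

```python
import numpy as np

def posterior_summaries(samples, bins=50):
    """Posterior mean, median, and a histogram-based MAP from a 1-D sample array."""
    post_mean = np.mean(samples)                     # minimizes expected squared error
    post_median = np.median(samples)                 # minimizes expected absolute error
    counts, edges = np.histogram(samples, bins=bins)
    mode_bin = np.argmax(counts)                     # bin with the highest density
    post_map = 0.5 * (edges[mode_bin] + edges[mode_bin + 1])
    return post_mean, post_median, post_map

# Example with a skewed posterior (hypothetical Gamma-shaped draws)
draws = np.random.default_rng(1).gamma(shape=2.0, scale=1.0, size=10_000)
print(posterior_summaries(draws))
```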
6. Credible Intervals: HPD Intervals
A Bayesian Credible Interval is an interval in which the parameter lies with probability $1 - \alpha$, given the observed data.
- Equal-Tailed Interval: The interval between the $\alpha/2$ and $1 - \alpha/2$ quantiles of the posterior.
- Highest Posterior Density (HPD) Interval: The shortest possible interval such that the probability $1 - \alpha$ is contained within it. In an HPD interval, every point inside has a higher posterior density than any point outside. For asymmetric distributions, HPD intervals are generally preferable to equal-tailed ones.
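With posterior samples in hand, the equal-tailed interval is just two quantiles, and the HPD interval can be approximated by scanning every interval that contains a $1 - \alpha$ fraction of the sorted draws and keeping the shortest. A minimal sketch (the Gamma-shaped draws are hypothetical):

```python
import numpy as np

def equal_tailed(samples, alpha=0.05):
    # Interval between the alpha/2 and 1 - alpha/2 quantiles
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

def hpd_interval(samples, alpha=0.05):
    # Shortest interval containing a (1 - alpha) fraction of the sorted draws
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    m = int(np.floor((1 - alpha) * n))
    widths = sorted_s[m:] - sorted_s[: n - m]
    lo = np.argmin(widths)
    return sorted_s[lo], sorted_s[lo + m]

# On a skewed posterior the HPD interval is noticeably shorter
draws = np.random.default_rng(2).gamma(shape=2.0, scale=1.0, size=20_000)
print("Equal-tailed:", equal_tailed(draws))
print("HPD:         ", hpd_interval(draws))
```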
7. Model Choice: Bayes Factors and BIC
Bayes Factors
To compare models $M_1$ and $M_2$, we use the ratio of the marginal likelihoods (the evidence):

$$BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}$$

A Bayes Factor $BF_{12} > 10$ is generally considered “strong” evidence for $M_1$.
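For the Beta-Binomial model the evidence is available in closed form, $p(y \mid M) = \binom{n}{y} \frac{B(\alpha + y,\, \beta + n - y)}{B(\alpha, \beta)}$, so a Bayes Factor between two prior specifications can be computed directly. The priors and data below are purely illustrative:

```python
import numpy as np
from scipy.special import betaln, comb

def log_evidence_beta_binomial(y, n, a, b):
    # log p(y | M) = log C(n, y) + log B(a + y, b + n - y) - log B(a, b)
    return np.log(comb(n, y)) + betaln(a + y, b + n - y) - betaln(a, b)

y, n = 9, 10                    # hypothetical data: 9 successes in 10 trials
# M1: prior concentrated near theta = 0.9; M2: prior concentrated near theta = 0.5
log_m1 = log_evidence_beta_binomial(y, n, a=9, b=1)
log_m2 = log_evidence_beta_binomial(y, n, a=50, b=50)

bf_12 = np.exp(log_m1 - log_m2)
print(f"Bayes factor BF_12 = {bf_12:.2f}")
```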
BIC
The Bayesian Information Criterion is an asymptotic approximation to model comparison via Bayes Factors (the difference in BIC between two models approximates $-2 \ln BF$):

$$\text{BIC} = k \ln n - 2 \ln \hat{L}$$

where $k$ is the number of parameters, $n$ is the sample size, and $\hat{L}$ is the maximized likelihood. Lower BIC indicates a better model.
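A minimal sketch of the computation for two nested Normal models on simulated data (sample size, parameter values, and the fixed-mean null model are all assumptions for illustration):

```python
import numpy as np
from scipy import stats

def bic_normal(x, mu, sigma, k):
    # BIC = k * ln(n) - 2 * ln(L_hat), with the log-likelihood at the fitted parameters
    log_lik = np.sum(stats.norm.logpdf(x, mu, sigma))
    return k * np.log(len(x)) - 2 * log_lik

x = np.random.default_rng(3).normal(0.3, 1.0, size=200)   # hypothetical sample

# M1: mean fixed at 0, only sigma estimated (k = 1)
bic_m1 = bic_normal(x, 0.0, np.sqrt(np.mean(x**2)), k=1)
# M2: both mean and sigma estimated (k = 2)
bic_m2 = bic_normal(x, x.mean(), x.std(), k=2)

print(f"BIC (fixed mean): {bic_m1:.1f}")
print(f"BIC (free mean):  {bic_m2:.1f}")
```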
8. Computational Bayesian Statistics: MCMC
For complex models, the integral in the denominator of Bayes’ Theorem ($p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$) is intractable. We use Markov Chain Monte Carlo (MCMC) to sample from the posterior without knowing the normalizing constant.
The Metropolis-Hastings Algorithm
Metropolis-Hastings generates a sequence of samples $\theta^{(1)}, \theta^{(2)}, \dots$ whose distribution converges to the posterior $p(\theta \mid D)$.
- Start at an initial value $\theta^{(0)}$.
- Propose $\theta^{*}$ from a proposal distribution $q(\theta^{*} \mid \theta^{(t)})$.
- Calculate the acceptance ratio:
$$r = \frac{p(\theta^{*} \mid D)\, q(\theta^{(t)} \mid \theta^{*})}{p(\theta^{(t)} \mid D)\, q(\theta^{*} \mid \theta^{(t)})}$$
- Accept $\theta^{*}$ with probability $\min(1, r)$; otherwise set $\theta^{(t+1)} = \theta^{(t)}$.
Python: Metropolis-Hastings From Scratch
The following script estimates the mean $\theta$ of a Normal distribution with a known variance $\sigma^2 = 1$, using a wide Normal prior. Because the random-walk proposal is symmetric, the $q$ terms cancel and only the unnormalized log-posterior is needed.
```python
import numpy as np

def metropolis_hastings(data, p_mu, p_sd, n_iter, prop_sd):
    """
    Estimates the mean of a Normal distribution with known variance 1.
    data: Observed data points
    p_mu, p_sd: Prior mean and standard deviation
    n_iter: Number of MCMC iterations
    prop_sd: Standard deviation of the proposal distribution (tuning parameter)
    """
    theta_curr = 0.0  # Initial guess
    chain = []

    def log_post_unnorm(theta):
        # Log-likelihood: sum of log-PDFs of Normal(theta, 1), up to an additive constant
        log_lik = -0.5 * np.sum((data - theta)**2)
        # Log-prior: log-PDF of Normal(p_mu, p_sd), up to an additive constant
        log_pri = -0.5 * ((theta - p_mu) / p_sd)**2
        return log_lik + log_pri

    for _ in range(n_iter):
        # 1. Propose new theta (symmetric random-walk proposal)
        theta_prop = np.random.normal(theta_curr, prop_sd)
        # 2. Acceptance ratio in log space (proposal terms cancel by symmetry)
        log_acc = log_post_unnorm(theta_prop) - log_post_unnorm(theta_curr)
        # 3. Decision: accept with probability min(1, exp(log_acc))
        if np.log(np.random.rand()) < log_acc:
            theta_curr = theta_prop
        chain.append(theta_curr)

    return np.array(chain)

# Setup
np.random.seed(42)
true_mean = 3.5
data = np.random.normal(true_mean, 1, size=100)

# Execution
trace = metropolis_hastings(data, p_mu=0, p_sd=10, n_iter=5000, prop_sd=0.2)

# Results: discard the burn-in period before summarizing
burn_in = 1000
final_estimate = np.mean(trace[burn_in:])
print(f'Empirical Mean: {np.mean(data):.3f}')
print(f'Bayesian Posterior Mean: {final_estimate:.3f}')
```
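Because this is exactly the Normal-Normal conjugate setting from Section 3, the MCMC result can be cross-checked against the closed-form posterior mean, which is a useful sanity check when writing a sampler from scratch. The snippet below continues the script above (it reuses `data` and `trace`):

```python
# Continuation of the script above: assumes `data` and `trace` are already defined.

# Analytic posterior mean via the conjugate Normal-Normal result (sigma^2 = 1 known)
n = len(data)
prec_prior = 1 / 10**2          # prior precision: prior sd = 10, as in the sampler call
prec_data = n / 1.0             # data precision: known variance = 1
analytic_mean = (prec_prior * 0 + prec_data * np.mean(data)) / (prec_prior + prec_data)
print(f'Analytic Posterior Mean: {analytic_mean:.3f}')

# A quick diagnostic: fraction of accepted proposals (roughly 0.2-0.5 is typical for random walks)
acceptance_rate = np.mean(np.diff(trace) != 0)
print(f'Acceptance rate: {acceptance_rate:.2f}')
```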