Bayesian Inference & Modeling
Bayesian statistics provides a coherent mathematical framework for updating the probability of a hypothesis as more evidence or information becomes available. Unlike the frequentist approach, which treats parameters as fixed and data as random, the Bayesian paradigm treats parameters themselves as random variables, allowing for a more natural integration of prior knowledge and uncertainty.
1. Frequentist vs. Bayesian Philosophies
The fundamental divide in statistical inference rests on the interpretation of probability:
- Frequentist Probability: Defined as the limit of an event’s relative frequency in a large number of trials. If we say a coin has $P(\text{Heads}) = 0.5$, we mean that as the number of flips $n \to \infty$, the proportion of heads converges to $0.5$. Parameters are unknown but fixed constants.
- Bayesian Probability: Defined as a measure of “degree of belief” or “plausibility.” It is a quantifiable state of knowledge. This allows us to assign probabilities to one-off events (e.g., “the probability that it rained on Mars yesterday”) where frequentist repetition is impossible. Parameters are random variables characterized by probability distributions.
2. The Bayesian Framework
The engine of Bayesian inference is Bayes’ Theorem. Let $D$ represent the observed data and $\theta$ denote the parameters of the model.
The Posterior Distribution
The goal is to compute the Posterior Distribution $p(\theta \mid D)$, which represents our updated belief about the parameters after observing the data:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Where:
- Prior $p(\theta)$: The distribution representing information about $\theta$ before the data is collected.
- Likelihood $p(D \mid \theta)$: The probability of the data occurring given a specific parameter value $\theta$.
- Evidence (Marginal Likelihood) $p(D)$: The total probability of the data, integrated over all possible parameters: $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$. This serves as a normalizing constant.
In most applications, we focus on the kernel of the distribution:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$

This captures the intuition that the posterior is a compromise between the evidence provided by the data (likelihood) and our pre-existing knowledge (prior).
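To make the proportionality concrete, a posterior can be approximated on a grid by multiplying likelihood and prior pointwise and normalizing at the end. A minimal sketch, assuming a hypothetical coin-flip dataset and a Beta(2, 2) prior (both chosen for illustration):

```python
import numpy as np
from scipy import stats

# Grid approximation: posterior ∝ likelihood × prior
theta_grid = np.linspace(0.001, 0.999, 999)      # candidate values of theta
y, n = 7, 10                                     # assumed data: 7 heads in 10 flips

prior = stats.beta.pdf(theta_grid, 2, 2)         # mildly informative Beta(2, 2) prior
likelihood = stats.binom.pmf(y, n, theta_grid)   # Binomial likelihood at each grid point
unnorm_post = likelihood * prior                 # kernel of the posterior

dx = theta_grid[1] - theta_grid[0]
posterior = unnorm_post / (unnorm_post.sum() * dx)   # normalize so it integrates to 1
print(f"Posterior mean ≈ {(theta_grid * posterior).sum() * dx:.3f}")
```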
3. Conjugate Priors
A significant portion of analytic Bayesian statistics relies on Conjugacy. A prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions as the prior.
Beta-Binomial Model
Consider $y$ successes in $n$ Bernoulli trials. The likelihood is Binomial:

$$p(y \mid \theta) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y}$$

If we choose a Beta distribution as our prior, $\theta \sim \text{Beta}(\alpha, \beta)$:

$$p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

The posterior will be:

$$p(\theta \mid y) \propto \theta^{y + \alpha - 1} (1 - \theta)^{n - y + \beta - 1}$$

which is recognized as $\text{Beta}(\alpha + y,\ \beta + n - y)$. This provides a simple rule: $\alpha$ and $\beta$ can be interpreted as “prior successes” and “prior failures.”
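A minimal numeric sketch of this update rule (the prior pseudo-counts and the data are made up for illustration):

```python
from scipy import stats

# Conjugate Beta-Binomial update: Beta(a, b) prior + y successes in n trials
a_prior, b_prior = 2, 2          # assumed prior "pseudo-counts"
y, n = 13, 20                    # hypothetical data: 13 successes in 20 trials

a_post, b_post = a_prior + y, b_prior + (n - y)   # Beta(a + y, b + n - y)
posterior = stats.beta(a_post, b_post)

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Posterior mean: {posterior.mean():.3f}")  # (a + y) / (a + b + n)
```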
Normal-Normal Model
If $x_1, \dots, x_n \sim N(\theta, \sigma^2)$ with $\sigma^2$ known and the prior is $\theta \sim N(\mu_0, \tau_0^2)$, the posterior is also Normal with:

$$\mu_{\text{post}} = \frac{\dfrac{\mu_0}{\tau_0^2} + \dfrac{n \bar{x}}{\sigma^2}}{\dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2}}, \qquad \tau_{\text{post}}^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$

This demonstrates that the posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$.
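A small numeric check of the precision-weighting formula, with assumed values for the prior, the known variance, and a simulated sample:

```python
import numpy as np

# Precision-weighted posterior for a Normal mean with known variance
np.random.seed(0)
sigma2 = 1.0                                          # known data variance (assumed)
mu0, tau0_2 = 0.0, 4.0                                # prior mean and prior variance (assumed)
x = np.random.normal(2.5, np.sqrt(sigma2), size=25)   # hypothetical sample
n, xbar = len(x), x.mean()

prec_prior, prec_data = 1.0 / tau0_2, n / sigma2      # precisions = inverse variances
post_var = 1.0 / (prec_prior + prec_data)
post_mean = post_var * (prec_prior * mu0 + prec_data * xbar)

print(f"Sample mean: {xbar:.3f}")
print(f"Posterior mean: {post_mean:.3f}, posterior sd: {np.sqrt(post_var):.3f}")
```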
4. Uninformative and Objective Priors
When little or no prior information is available, we want a prior that contributes as little information as possible to the inference.
- Principle of Indifference: Using a flat (Uniform) prior. However, flat priors are not invariant under transformation: a uniform prior on $\sigma$ is not uniform on $\sigma^2$ (the variance).
- Jeffreys’ Prior: An objective prior that is invariant under reparameterization. It is proportional to the square root of the determinant of the Fisher Information $I(\theta)$: $p(\theta) \propto \sqrt{\det I(\theta)}$. For a Binomial proportion, the Jeffreys prior is $\text{Beta}(1/2,\, 1/2)$.
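To see where the $\text{Beta}(1/2, 1/2)$ result comes from, note that the Fisher information of a single Bernoulli trial is $I(\theta) = 1/(\theta(1-\theta))$, so $\sqrt{I(\theta)} \propto \theta^{-1/2}(1-\theta)^{-1/2}$. The short sketch below (grid values chosen arbitrarily) checks numerically that this kernel matches the Beta(1/2, 1/2) density up to a constant:

```python
import numpy as np
from scipy import stats

# Jeffreys prior for the Bernoulli/Binomial parameter: sqrt of the Fisher information
theta = np.linspace(0.01, 0.99, 99)
fisher_info = 1.0 / (theta * (1.0 - theta))      # I(theta) for a single Bernoulli trial
jeffreys_unnorm = np.sqrt(fisher_info)           # ∝ theta^(-1/2) * (1 - theta)^(-1/2)

# Compare (up to a constant) with the Beta(1/2, 1/2) density
beta_half = stats.beta.pdf(theta, 0.5, 0.5)
ratio = jeffreys_unnorm / beta_half
print(f"ratio is constant: min={ratio.min():.4f}, max={ratio.max():.4f}")
```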
5. Point Estimation: Mean, Median, and MAP
Unlike frequentist statistics, which yields a single point estimate (the MLE), Bayesian inference yields an entire posterior distribution. If a single number is required, decision theory guides the choice of summary:
- Posterior Mean: $\hat{\theta} = \mathbb{E}[\theta \mid D]$. Minimizes the expected squared error loss.
- Posterior Median: The value $m$ such that $P(\theta \le m \mid D) = 0.5$. Minimizes the expected absolute error loss.
- Maximum A Posteriori (MAP): The mode of the posterior, $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid D)$. MAP is the Bayesian analog of Maximum Likelihood Estimation.
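In practice these summaries are usually computed from posterior draws (e.g., from the MCMC sampler in Section 8). The sketch below is a minimal, sample-based version, using hypothetical Gamma-shaped draws and a simple histogram mode as one possible MAP approximation:

```python
import numpy as np

def posterior_summaries(samples, bins=50):
    """Posterior mean, median, and a histogram-based MAP from a 1-D sample array."""
    post_mean = np.mean(samples)                     # minimizes expected squared error
    post_median = np.median(samples)                 # minimizes expected absolute error
    counts, edges = np.histogram(samples, bins=bins)
    mode_bin = np.argmax(counts)                     # bin with the highest density
    post_map = 0.5 * (edges[mode_bin] + edges[mode_bin + 1])
    return post_mean, post_median, post_map

# Example with a skewed posterior (hypothetical Gamma-shaped draws)
draws = np.random.default_rng(1).gamma(shape=2.0, scale=1.0, size=10_000)
print(posterior_summaries(draws))
```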
6. Credible Intervals: HPD Intervals
A Bayesian Credible Interval is an interval in which the parameter lies with probability $1 - \alpha$, given the observed data.
- Equal-Tailed Interval: The interval between the $\alpha/2$ and $1 - \alpha/2$ quantiles of the posterior.
- Highest Posterior Density (HPD) Interval: The shortest possible interval such that the probability $1 - \alpha$ is contained within it. In an HPD interval, every point inside has a higher posterior density than any point outside. For asymmetric distributions, HPD intervals are generally preferable to equal-tailed ones.
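With posterior samples in hand, the equal-tailed interval is just two quantiles, and the HPD interval can be approximated by scanning every interval that contains a $1 - \alpha$ fraction of the sorted draws and keeping the shortest. A minimal sketch (the Gamma-shaped draws are hypothetical):

```python
import numpy as np

def equal_tailed(samples, alpha=0.05):
    # Interval between the alpha/2 and 1 - alpha/2 quantiles
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

def hpd_interval(samples, alpha=0.05):
    # Shortest interval containing a (1 - alpha) fraction of the sorted draws
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    m = int(np.floor((1 - alpha) * n))
    widths = sorted_s[m:] - sorted_s[: n - m]
    lo = np.argmin(widths)
    return sorted_s[lo], sorted_s[lo + m]

# On a skewed posterior the HPD interval is noticeably shorter
draws = np.random.default_rng(2).gamma(shape=2.0, scale=1.0, size=20_000)
print("Equal-tailed:", equal_tailed(draws))
print("HPD:         ", hpd_interval(draws))
```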
7. Model Choice: Bayes Factors and BIC
Bayes Factors
To compare models $M_1$ and $M_2$, we use the ratio of the marginal likelihoods (the evidence):

$$BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}$$

A Bayes Factor $BF_{12} > 10$ is generally considered “strong” evidence for $M_1$.
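For the Beta-Binomial model the evidence is available in closed form, $p(y \mid M) = \binom{n}{y} \frac{B(\alpha + y,\, \beta + n - y)}{B(\alpha, \beta)}$, so a Bayes Factor between two prior specifications can be computed directly. The priors and data below are purely illustrative:

```python
import numpy as np
from scipy.special import betaln, comb

def log_evidence_beta_binomial(y, n, a, b):
    # log p(y | M) = log C(n, y) + log B(a + y, b + n - y) - log B(a, b)
    return np.log(comb(n, y)) + betaln(a + y, b + n - y) - betaln(a, b)

y, n = 9, 10                    # hypothetical data: 9 successes in 10 trials
# M1: prior concentrated near theta = 0.9; M2: prior concentrated near theta = 0.5
log_m1 = log_evidence_beta_binomial(y, n, a=9, b=1)
log_m2 = log_evidence_beta_binomial(y, n, a=50, b=50)

bf_12 = np.exp(log_m1 - log_m2)
print(f"Bayes factor BF_12 = {bf_12:.2f}")
```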
BIC
The Bayesian Information Criterion is an asymptotic approximation to model comparison via Bayes Factors (the difference in BIC between two models approximates $-2 \ln BF$):

$$\text{BIC} = k \ln n - 2 \ln \hat{L}$$

where $k$ is the number of parameters, $n$ is the sample size, and $\hat{L}$ is the maximized likelihood. Lower BIC indicates a better model.
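A minimal sketch of the computation for two nested Normal models on simulated data (sample size, parameter values, and the fixed-mean null model are all assumptions for illustration):

```python
import numpy as np
from scipy import stats

def bic_normal(x, mu, sigma, k):
    # BIC = k * ln(n) - 2 * ln(L_hat), with the log-likelihood at the fitted parameters
    log_lik = np.sum(stats.norm.logpdf(x, mu, sigma))
    return k * np.log(len(x)) - 2 * log_lik

x = np.random.default_rng(3).normal(0.3, 1.0, size=200)   # hypothetical sample

# M1: mean fixed at 0, only sigma estimated (k = 1)
bic_m1 = bic_normal(x, 0.0, np.sqrt(np.mean(x**2)), k=1)
# M2: both mean and sigma estimated (k = 2)
bic_m2 = bic_normal(x, x.mean(), x.std(), k=2)

print(f"BIC (fixed mean): {bic_m1:.1f}")
print(f"BIC (free mean):  {bic_m2:.1f}")
```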
8. Computational Bayesian Statistics: MCMC
For complex models, the integral in the denominator of Bayes’ Theorem ($p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$) is intractable. We use Markov Chain Monte Carlo (MCMC) to sample from the posterior without knowing the normalizing constant.
The Metropolis-Hastings Algorithm
Metropolis-Hastings generates a sequence of samples $\theta^{(1)}, \theta^{(2)}, \dots$ whose distribution converges to the posterior $p(\theta \mid D)$.
- Start at an initial value $\theta^{(0)}$.
- Propose $\theta^{*}$ from a proposal distribution $q(\theta^{*} \mid \theta^{(t)})$.
- Calculate the acceptance ratio:
$$r = \frac{p(\theta^{*} \mid D)\, q(\theta^{(t)} \mid \theta^{*})}{p(\theta^{(t)} \mid D)\, q(\theta^{*} \mid \theta^{(t)})}$$
- Accept $\theta^{*}$ with probability $\min(1, r)$; otherwise set $\theta^{(t+1)} = \theta^{(t)}$.
Python: Metropolis-Hastings From Scratch
The following script estimates the mean $\theta$ of a Normal distribution with a known variance $\sigma^2 = 1$, using a wide Normal prior. Because the random-walk proposal is symmetric, the $q$ terms cancel and only the unnormalized log-posterior is needed.
```python
import numpy as np

def metropolis_hastings(data, p_mu, p_sd, n_iter, prop_sd):
    """
    Estimates the mean of a Normal distribution with known variance 1.
    data: Observed data points
    p_mu, p_sd: Prior mean and standard deviation
    n_iter: Number of MCMC iterations
    prop_sd: Standard deviation of the proposal distribution (tuning parameter)
    """
    theta_curr = 0.0  # Initial guess
    chain = []

    def log_post_unnorm(theta):
        # Log-likelihood: sum of log-PDFs of Normal(theta, 1), up to an additive constant
        log_lik = -0.5 * np.sum((data - theta)**2)
        # Log-prior: log-PDF of Normal(p_mu, p_sd), up to an additive constant
        log_pri = -0.5 * ((theta - p_mu) / p_sd)**2
        return log_lik + log_pri

    for _ in range(n_iter):
        # 1. Propose new theta (symmetric random-walk proposal)
        theta_prop = np.random.normal(theta_curr, prop_sd)
        # 2. Acceptance ratio in log space (proposal terms cancel by symmetry)
        log_acc = log_post_unnorm(theta_prop) - log_post_unnorm(theta_curr)
        # 3. Decision: accept with probability min(1, exp(log_acc))
        if np.log(np.random.rand()) < log_acc:
            theta_curr = theta_prop
        chain.append(theta_curr)

    return np.array(chain)

# Setup
np.random.seed(42)
true_mean = 3.5
data = np.random.normal(true_mean, 1, size=100)

# Execution
trace = metropolis_hastings(data, p_mu=0, p_sd=10, n_iter=5000, prop_sd=0.2)

# Results: discard the burn-in period before summarizing
burn_in = 1000
final_estimate = np.mean(trace[burn_in:])
print(f'Empirical Mean: {np.mean(data):.3f}')
print(f'Bayesian Posterior Mean: {final_estimate:.3f}')
```
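Because this is exactly the Normal-Normal conjugate setting from Section 3, the MCMC result can be cross-checked against the closed-form posterior mean, which is a useful sanity check when writing a sampler from scratch. The snippet below continues the script above (it reuses `data` and `trace`):

```python
# Continuation of the script above: assumes `data` and `trace` are already defined.

# Analytic posterior mean via the conjugate Normal-Normal result (sigma^2 = 1 known)
n = len(data)
prec_prior = 1 / 10**2          # prior precision: prior sd = 10, as in the sampler call
prec_data = n / 1.0             # data precision: known variance = 1
analytic_mean = (prec_prior * 0 + prec_data * np.mean(data)) / (prec_prior + prec_data)
print(f'Analytic Posterior Mean: {analytic_mean:.3f}')

# A quick diagnostic: fraction of accepted proposals (roughly 0.2-0.5 is typical for random walks)
acceptance_rate = np.mean(np.diff(trace) != 0)
print(f'Acceptance rate: {acceptance_rate:.2f}')
```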