Introduction to Bayesian Probability

Bayesian probability represents a paradigm shift in statistical inference, treating probability as a measure of belief or certainty rather than just frequency. This approach allows for the incorporation of prior knowledge and continuous updating as new evidence emerges.

Core Bayesian Principles:

  • Probability as Degree of Belief: Quantifies uncertainty about propositions
  • Prior Knowledge: Incorporates existing information before seeing data
  • Bayesian Updating: Continuously updates beliefs with new evidence
  • Full Probability Distributions: Provides complete uncertainty quantification
  • Model Comparison: Enables direct comparison of competing hypotheses

Bayesian Statistics

• Parameters are random variables

• Uses prior distributions

• Computes posterior distributions

• Provides full uncertainty quantification

• Natural for sequential updating

Frequentist Statistics

• Parameters are fixed unknown quantities

• No prior distribution is placed on parameters

• Relies on sampling distributions

• Provides point estimates and confidence intervals

• Based on long-run frequency properties

Bayesian methods have become far more widely used in recent decades, largely because advances in computation, especially Markov Chain Monte Carlo (MCMC), made posterior inference practical for complex problems in machine learning, data science, and decision theory.

Bayes' Theorem: The Foundation

Bayes' Theorem provides the mathematical foundation for updating beliefs in light of new evidence. It describes how to invert conditional probabilities.

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:

  • P(A|B): Posterior probability of A given B (what we want to find)
  • P(B|A): Likelihood of B given A (how well A predicts B)
  • P(A): Prior probability of A (our initial belief about A)
  • P(B): Marginal likelihood of B (normalizing constant)
Bayesian Inference Process
1️⃣

Prior Distribution

Initial beliefs about parameters before seeing data

P(θ) - represents existing knowledge or assumptions

2️⃣

Likelihood Function

Probability of observed data given parameters

P(D|θ) - how well parameters explain the data

3️⃣

Posterior Distribution

Updated beliefs after seeing data

P(θ|D) ∝ P(D|θ) × P(θ)
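The three steps above can be sketched numerically with a grid approximation. This is a minimal illustration, not a general-purpose method: the Beta(2, 2) prior, the data (8 successes in 10 trials), and the grid size are all assumptions chosen for the example.

```python
import numpy as np

# Grid of candidate values for theta (a success probability)
theta = np.linspace(0.0, 1.0, 1001)
dx = theta[1] - theta[0]

# Step 1: prior P(theta), here the Beta(2, 2) kernel (assumed for illustration)
prior = theta * (1.0 - theta)

# Step 2: likelihood P(D | theta) for 8 successes in 10 trials (assumed data);
# the binomial coefficient is dropped because it does not depend on theta
likelihood = theta**8 * (1.0 - theta) ** 2

# Step 3: posterior is proportional to likelihood x prior; normalize on the grid
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)

posterior_mean = (theta * posterior).sum() * dx
print(f"Grid posterior mean: {posterior_mean:.3f}")
```

Grid approximation is only practical for one or two parameters, which is part of why sampling methods such as MCMC are used for larger models.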

Medical Testing Example:

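Suppose a disease affects 1% of the population (the prior). A test is 99% accurate in both directions: it detects 99% of true cases (sensitivity, which is the likelihood here) and wrongly flags 1% of healthy people (false positive rate). If someone tests positive, what is the probability that they actually have the disease?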

P(Disease|Positive) = [P(Positive|Disease) × P(Disease)] / P(Positive)

P(Positive) = P(Positive|Disease)×P(Disease) + P(Positive|No Disease)×P(No Disease)

P(Positive) = (0.99 × 0.01) + (0.01 × 0.99) = 0.0198

P(Disease|Positive) = (0.99 × 0.01) / 0.0198 = 0.0099 / 0.0198 = 0.5

Interpretation: Even with a 99% accurate test, a positive result only gives a 50% chance of actually having the disease due to the low prior probability.
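The calculation can be reproduced in a few lines. The sketch below reads "99% accurate" as 99% sensitivity and a 1% false positive rate, as the arithmetic above does; the function and argument names are illustrative. The second call assumes a repeat test whose errors are independent of the first, which real tests may not satisfy.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) from Bayes' theorem."""
    # Law of total probability for the marginal P(Positive)
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive

# First positive test: prior is the 1% prevalence
p_first = posterior_given_positive(0.01, 0.99, 0.01)

# Second positive test: yesterday's posterior becomes today's prior
# (assumes the two test results are conditionally independent)
p_second = posterior_given_positive(p_first, 0.99, 0.01)

print(f"After one positive test:  {p_first:.3f}")
print(f"After two positive tests: {p_second:.3f}")
```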

Prior & Posterior Distributions

The choice of prior distribution is a fundamental aspect of Bayesian analysis, representing our beliefs about parameters before observing data.

📊

Informative Priors

Based on existing knowledge, previous studies, or expert opinion.

Example: Beta(α=10, β=10) for a probability parameter

Use when: Substantial prior information exists

⚖️

Weakly Informative Priors

Regularize without strongly influencing results.

Example: Normal(μ=0, σ=10) for regression coefficients

Use when: Some prior knowledge exists but is uncertain

🎲

Non-informative Priors

Attempt to represent prior ignorance.

Example: Uniform distribution or Jeffreys prior

Use when: No prior information available

⚠️

Improper Priors

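Do not integrate to a finite value, so they are not valid probability distributions, but they can still yield proper posteriors.

Example: p(σ) ∝ 1/σ for scale parameters

Use with caution: The posterior can also be improper, so check that it is proper before using it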
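To see how much the choice of prior matters, the sketch below compares posterior means under three of the priors mentioned above, using the Beta-Binomial update covered under Conjugate Priors. The two datasets are hypothetical; both have a 70% success rate but differ in size.

```python
# (label, alpha, beta) for three priors on a probability parameter
priors = [
    ("Informative Beta(10, 10)", 10.0, 10.0),
    ("Uniform Beta(1, 1)", 1.0, 1.0),
    ("Jeffreys Beta(0.5, 0.5)", 0.5, 0.5),
]

def posterior_mean(alpha, beta, successes, trials):
    """Mean of the Beta posterior after observing binomial data."""
    return (alpha + successes) / (alpha + beta + trials)

# Hypothetical datasets: same success rate, different sample sizes
for successes, trials in [(7, 10), (700, 1000)]:
    print(f"{successes}/{trials} successes")
    for label, a, b in priors:
        mean = posterior_mean(a, b, successes, trials)
        print(f"  {label:26s} posterior mean = {mean:.3f}")
```

With 10 trials the informative prior pulls the estimate noticeably toward 0.5; with 1000 trials the three posteriors nearly coincide.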

[Interactive figure: Prior-Posterior Visualization (example setting: 50 successes out of 100 trials)]

Posterior Distribution Properties

Property              Description                               Interpretation
Posterior Mean        E[θ|D] = ∫ θ p(θ|D) dθ                    Bayesian point estimate (minimizes expected squared error loss)
Posterior Median      Median of p(θ|D)                          Point estimate (minimizes expected absolute error loss)
Posterior Mode (MAP)  argmax p(θ|D)                             Maximum A Posteriori estimate
Credible Interval     [a,b] such that P(a ≤ θ ≤ b|D) = 0.95     Bayesian analogue of a confidence interval (direct probability interpretation)
Posterior Predictive  p(y_new|D) = ∫ p(y_new|θ) p(θ|D) dθ       Predictive distribution for new observations
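For a posterior with a known form, each property in the table can be read off directly. A minimal sketch, assuming a hypothetical Beta(10, 4) posterior for a success probability and using scipy.stats.beta:

```python
from scipy.stats import beta

# Hypothetical posterior for a success probability theta
a, b = 10, 4
posterior = beta(a, b)

post_mean = posterior.mean()              # E[theta | D]
post_median = posterior.median()          # 50% quantile
post_mode = (a - 1) / (a + b - 2)         # MAP; this formula needs a > 1 and b > 1
ci_lower, ci_upper = posterior.ppf([0.025, 0.975])  # equal-tailed 95% interval

# Posterior predictive for one new Bernoulli trial:
# P(y_new = 1 | D) = integral of theta * p(theta | D) = E[theta | D]
p_next_success = post_mean

print(f"mean={post_mean:.3f} median={post_median:.3f} mode={post_mode:.3f}")
print(f"95% credible interval: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"P(next trial succeeds | D) = {p_next_success:.3f}")
```

The interval here is equal-tailed; a highest-density interval would generally differ for a skewed posterior like this one.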

Conjugate Priors

Conjugate priors are mathematically convenient choices where the posterior distribution belongs to the same family as the prior distribution.

Prior ∈ Family F, Likelihood ∈ Family L ⇒ Posterior ∈ Family F
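
Likelihood                Conjugate Prior         Posterior Parameters                Common Use
Bernoulli / Binomial      Beta(α, β)              Beta(α + successes, β + failures)   Proportions, probabilities
Poisson                   Gamma(α, β)             Gamma(α + sum(x), β + n)            Count data, rates
Normal (known variance)   Normal(μ₀, σ₀²)         Normal(μₙ, σₙ²)                     Means, regression
Normal (known mean)       Inverse-Gamma(α, β)     Inverse-Gamma(α + n/2, β + SS/2)    Variances
Multinomial               Dirichlet(α₁,...,αₖ)    Dirichlet(α₁ + n₁,...,αₖ + nₖ)      Categorical data

Notes: n is the number of observations. Gamma(α, β) uses the shape-rate parameterization. SS = Σ(xᵢ − μ)², where μ is the known mean. For the Normal (known variance σ²) row, 1/σₙ² = 1/σ₀² + n/σ² and μₙ = σₙ² × (μ₀/σ₀² + n x̄/σ²).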
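As one more row of the table in code form, here is a sketch of the Gamma-Poisson update. The counts and the Gamma(2, 1) prior are hypothetical, and the Gamma is taken in the shape-rate parameterization, so scipy's scale argument is 1 / rate.

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical count data, e.g. events observed per day
counts = np.array([3, 5, 2, 4, 6])

# Gamma prior on the Poisson rate (shape alpha, rate beta), assumed for illustration
alpha_prior, beta_prior = 2.0, 1.0

# Conjugate update: add the total count to the shape and the sample size to the rate
alpha_post = alpha_prior + counts.sum()
beta_post = beta_prior + len(counts)

# scipy parameterizes the Gamma by shape and scale = 1 / rate
posterior = gamma(a=alpha_post, scale=1.0 / beta_post)
rate_mean = posterior.mean()
rate_lower, rate_upper = posterior.ppf([0.025, 0.975])

print(f"Posterior: Gamma(shape={alpha_post:.0f}, rate={beta_post:.0f})")
print(f"Posterior mean rate: {rate_mean:.3f}")
print(f"95% credible interval: [{rate_lower:.3f}, {rate_upper:.3f}]")
```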

Beta-Binomial Conjugacy Example:

Suppose we want to estimate the probability θ of success in a Bernoulli trial.

Prior: θ ∼ Beta(α=2, β=2)
Data: 8 successes in 10 trials
Likelihood: Binomial(n=10, k=8|θ)
Posterior: θ|D ∼ Beta(α=2+8, β=2+2) = Beta(10, 4)

    # Python implementation of Beta-Binomial conjugate update
    from scipy.stats import beta

    # Prior parameters
    alpha_prior, beta_prior = 2, 2

    # Observed data
    successes, trials = 8, 10

    # Posterior parameters
    alpha_post = alpha_prior + successes
    beta_post = beta_prior + (trials - successes)

    # Posterior mean and 95% credible interval
    posterior_mean = alpha_post / (alpha_post + beta_post)
    ci_lower, ci_upper = beta.ppf([0.025, 0.975], alpha_post, beta_post)

    print(f"Posterior mean: {posterior_mean:.3f}")
    print(f"95% Credible Interval: [{ci_lower:.3f}, {ci_upper:.3f}]")

Advantages of Conjugate Priors:

  • Analytical tractability: Closed-form posterior distributions
  • Computational efficiency: No need for numerical integration
  • Interpretability: Clear updating rules for parameters
  • Sequential updating: Natural for streaming data

Limitations:

  • May not represent true prior beliefs
  • Limited to specific likelihood families
  • Can be restrictive for complex models
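The sequential-updating advantage listed above can be checked directly: updating a Beta prior one observation at a time gives the same posterior as a single batch update. A small sketch, where the helper name and the outcome sequence are illustrative:

```python
def update_beta(alpha, beta, outcomes):
    """Beta-Binomial update: add successes to alpha and failures to beta."""
    successes = sum(outcomes)
    failures = len(outcomes) - successes
    return alpha + successes, beta + failures

# Hypothetical stream of Bernoulli outcomes (8 successes in 10 trials)
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Batch: process all observations at once
batch_posterior = update_beta(2, 2, data)

# Sequential: each posterior becomes the prior for the next observation
sequential_posterior = (2, 2)
for outcome in data:
    sequential_posterior = update_beta(*sequential_posterior, [outcome])

print("Batch:     ", batch_posterior)
print("Sequential:", sequential_posterior)
```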

Markov Chain Monte Carlo (MCMC)

MCMC methods enable Bayesian inference for complex models where analytical solutions are intractable by sampling from posterior distributions.

🔄

Metropolis-Hastings

The foundational MCMC algorithm that uses proposal distributions to explore the parameter space.

Key concept: Accept/reject mechanism based on acceptance ratio

Advantage: Very general; needs the posterior density only up to a normalizing constant

Gibbs Sampling

Samples from full conditional distributions when they're available.

Key concept: Update parameters one at a time

Advantage: No tuning needed, always accepts

🚀

Hamiltonian Monte Carlo

Uses Hamiltonian dynamics to efficiently explore high-dimensional spaces.

Key concept: Momentum variables for guided exploration

Advantage: Efficient for complex, high-dimensional posteriors

🎯

NUTS (No-U-Turn Sampler)

Adaptive variant of HMC that chooses the trajectory length automatically; the step size is typically adapted during warmup.

Key concept: Recursive tree-building to avoid U-turns

Advantage: No manual tuning, efficient and robust
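Of the samplers above, Gibbs sampling is the easiest to sketch when the full conditionals are known. The example below targets a standard bivariate normal with correlation rho, a textbook case where x given y is Normal(rho·y, 1 − rho²) and symmetrically for y given x. The target, the function name, and the burn-in length are assumptions made for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho**2)  # sd of each full conditional
    x, y = 0.0, 0.0                  # arbitrary starting point
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        # Update one coordinate at a time from its full conditional;
        # every draw is accepted, so there is no accept/reject step
        x = rng.normal(rho * y, cond_sd)
        y = rng.normal(rho * x, cond_sd)
        samples[i] = (x, y)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
kept = samples[1000:]  # discard an initial burn-in segment
sample_corr = np.corrcoef(kept.T)[0, 1]
print(f"Sample correlation: {sample_corr:.3f}")
```

Strong correlation between parameters slows this kind of one-at-a-time updating, which is one motivation for HMC and NUTS.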

Metropolis-Hastings Algorithm
Algorithm:
  1. Initialize θ⁽⁰⁾
  2. For t = 1 to T:
     a. Propose θ* ∼ q(θ*|θ⁽ᵗ⁻¹⁾)
     b. Compute acceptance probability:
        α = min(1, [p(θ*|D) q(θ⁽ᵗ⁻¹⁾|θ*)] / [p(θ⁽ᵗ⁻¹⁾|D) q(θ*|θ⁽ᵗ⁻¹⁾)])
     c. Accept θ⁽ᵗ⁾ = θ* with probability α, else θ⁽ᵗ⁾ = θ⁽ᵗ⁻¹⁾

    # Simplified Metropolis-Hastings implementation in Python
    import numpy as np

    def metropolis_hastings(log_posterior, theta_init, proposal_std, n_samples):
        """Simple random-walk Metropolis-Hastings sampler"""
        theta_current = theta_init
        samples = []
        accepted = 0

        for i in range(n_samples):
            # Propose new value (symmetric Gaussian proposal, so the q terms cancel)
            theta_proposed = np.random.normal(theta_current, proposal_std)

            # Compute log acceptance ratio; log_posterior may be unnormalized
            log_alpha = log_posterior(theta_proposed) - log_posterior(theta_current)

            # Metropolis acceptance step
            if np.log(np.random.uniform()) < log_alpha:
                theta_current = theta_proposed
                accepted += 1

            samples.append(theta_current)

        acceptance_rate = accepted / n_samples
        return np.array(samples), acceptance_rate

MCMC Convergence Diagnostics

Diagnostic              Purpose                                   Target Value
Trace Plots             Visual check of convergence and mixing    Stationary, well-mixed chains
Gelman-Rubin R̂          Compare within/between chain variance     R̂ < 1.1 (more recent guidance suggests R̂ < 1.01)
Effective Sample Size   Measure of independent samples            ESS > 400 per chain
Autocorrelation         Measure of chain dependence               ACF decays quickly to 0
Hierarchical Bayesian Models

Hierarchical models (multilevel models) allow parameters to themselves have distributions with hyperparameters, enabling partial pooling and borrowing of strength across groups.

yᵢⱼ ∼ p(y|θⱼ)
θⱼ ∼ p(θ|ϕ)
ϕ ∼ p(ϕ)

Eight Schools Example (Classic Hierarchical Model):

Estimating treatment effects across 8 schools, where each school has its own effect but all effects are assumed to come from a common distribution.

yⱼ ∼ Normal(θⱼ, σⱼ²) # Likelihood for school j
θⱼ ∼ Normal(μ, τ²) # School effects come from common distribution
μ ∼ Normal(0, 10) # Hyperpriors
τ ∼ Half-Cauchy(0, 5)

Key Insight: Schools with noisier estimates (larger σⱼ, typically because they have less data) are shrunk more toward the overall mean μ.
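The shrinkage effect can be illustrated without running a sampler by conditioning on fixed hyperparameters. In the sketch below, y and sigma are the values commonly quoted for the eight schools example, while mu = 8 and tau = 5 are assumed purely for illustration; a full analysis would infer them from the data. Given μ and τ, the posterior mean of θⱼ is a precision-weighted average of yⱼ and μ.

```python
import numpy as np

# Estimated effects and standard errors, as commonly quoted for this example
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

# Hyperparameters fixed by assumption; a full model would place priors on them
mu, tau = 8.0, 5.0

# Weight placed on the population mean: sigma^2 / (sigma^2 + tau^2)
shrink = (1.0 / tau**2) / (1.0 / sigma**2 + 1.0 / tau**2)

# Conditional posterior mean of each school effect (partial pooling)
theta_mean = (1.0 - shrink) * y + shrink * mu

for j in range(len(y)):
    print(f"School {j + 1}: y={y[j]:6.1f}  sigma={sigma[j]:5.1f}  "
          f"shrinkage={shrink[j]:.2f}  pooled estimate={theta_mean[j]:6.2f}")
```

Setting tau very large recovers the no-pooling estimates, and tau near zero recovers complete pooling.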

📈

Partial Pooling

Balances between complete pooling (all groups same) and no pooling (each group independent).

Benefit: More stable estimates for small groups

🔄

Borrowing Strength

Information from all groups informs estimates for each individual group.

Benefit: Improved estimates for groups with limited data

⚖️

Regularization

Hierarchical structure naturally regularizes parameter estimates.

Benefit: Reduces overfitting, especially with many parameters

🔍

Uncertainty Propagation

Uncertainty at all levels is properly accounted for in predictions.

Benefit: More honest uncertainty quantification

Real-World Applications

Bayesian methods are widely used across numerous fields due to their flexibility and natural handling of uncertainty.

🤖

Machine Learning

Bayesian Neural Networks: Place distributions over weights

Gaussian Processes: Non-parametric Bayesian regression

Bayesian Optimization: Efficient hyperparameter tuning

Variational Autoencoders: Deep generative models

💊

Clinical Trials

Adaptive Designs: Update trial parameters based on accumulating data

Dose Finding: Bayesian optimal design for phase I trials

Meta-analysis: Combine evidence from multiple studies

Predictive Probability: Early stopping rules

💰

Finance & Economics

Time Series: Bayesian VAR, state-space models

Risk Management: Value at Risk (VaR) estimation

Portfolio Optimization: Black-Litterman model

Forecasting: Bayesian structural models

🔬

Scientific Research

Genomics: Bayesian phylogenetics, GWAS

Ecology: Species distribution modeling

Physics: Parameter estimation in complex models

Psychology: Cognitive modeling, psychometrics

A/B Testing Example

Problem: Compare conversion rates of two website versions (A and B).

θ_A ∼ Beta(α=1, β=1) # Uniform prior for version A
θ_B ∼ Beta(α=1, β=1) # Uniform prior for version B
y_A ∼ Binomial(n_A, θ_A)
y_B ∼ Binomial(n_B, θ_B)
δ = θ_B - θ_A # Difference in conversion rates

Bayesian Advantages:

  • Direct probability statements: P(θ_B > θ_A | data)
  • Natural stopping rule: Stop when P(θ_B > θ_A) > 0.95
  • Easy to incorporate prior information from previous tests
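  • Posterior statements stay interpretable under optional stopping, although stopping at a posterior threshold does not by itself control frequentist error rates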
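Because both posteriors are Beta distributions, P(θ_B > θ_A | data) can be estimated by drawing from each and comparing. The visitor and conversion counts below are hypothetical, and the number of Monte Carlo draws is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical experiment results: visitors and conversions per version
n_A, y_A = 1000, 100
n_B, y_B = 1000, 125

# Beta(1, 1) priors give Beta(1 + conversions, 1 + non-conversions) posteriors
draws = 100_000
theta_A = rng.beta(1 + y_A, 1 + n_A - y_A, size=draws)
theta_B = rng.beta(1 + y_B, 1 + n_B - y_B, size=draws)

# Posterior of the difference in conversion rates
delta = theta_B - theta_A
prob_B_better = (delta > 0).mean()
delta_lower, delta_upper = np.percentile(delta, [2.5, 97.5])

print(f"P(theta_B > theta_A | data) = {prob_B_better:.3f}")
print(f"95% credible interval for delta: [{delta_lower:.4f}, {delta_upper:.4f}]")
```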

Python Implementation

Modern Python libraries make Bayesian analysis accessible and efficient.

🐍

PyMC3

Features: Probabilistic programming, NUTS sampler, Theano backend (newer releases are distributed as PyMC, imported as pymc, with a different backend)

    import pymc3 as pm

    # Hypothetical data: number of successes out of n trials
    n, successes = 10, 8

    with pm.Model():
        θ = pm.Beta('theta', alpha=1, beta=1)
        y = pm.Binomial('y', n=n, p=θ, observed=successes)
        trace = pm.sample(1000)

🐍

Pyro

Features: Deep probabilistic programming, PyTorch backend, scalable

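    import pyro
    import pyro.distributions as dist

    def model(data):
        # Uniform prior on the success probability
        θ = pyro.sample("theta", dist.Beta(1.0, 1.0))
        # Conditionally independent Bernoulli observations
        with pyro.plate("data", len(data)):
            pyro.sample("obs", dist.Bernoulli(θ), obs=data)
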
🐍

Stan

Features: High-performance, Hamiltonian Monte Carlo, C++ backend

    data {
      int<lower=0> N;
      array[N] int<lower=0, upper=1> y;  // older Stan versions wrote: int<lower=0, upper=1> y[N];
    }
    parameters {
      real<lower=0, upper=1> theta;
    }
    model {
      theta ~ beta(1, 1);
      y ~ bernoulli(theta);
    }

🐍

TensorFlow Probability

Features: Integration with TensorFlow, GPU acceleration, deep learning

    import tensorflow_probability as tfp

    tfd = tfp.distributions

    model = tfd.JointDistributionSequential([
        tfd.Beta(1., 1.),
        lambda theta: tfd.Bernoulli(probs=theta)
    ])

Complete Bayesian Regression Example

    # Bayesian linear regression with PyMC3
    import pymc3 as pm
    import numpy as np
    import matplotlib.pyplot as plt

    # Generate synthetic data
    np.random.seed(42)
    n = 100
    x = np.random.normal(0, 1, n)
    true_slope = 1.5
    true_intercept = 0.5
    y = true_intercept + true_slope * x + np.random.normal(0, 0.5, n)

    # Bayesian linear regression model
    with pm.Model() as linear_model:
        # Priors
        intercept = pm.Normal('intercept', mu=0, sigma=10)
        slope = pm.Normal('slope', mu=0, sigma=10)
        sigma = pm.HalfNormal('sigma', sigma=1)

        # Likelihood
        mu = intercept + slope * x
        likelihood = pm.Normal('y', mu=mu, sigma=sigma, observed=y)

        # Sample from posterior
        trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True)

    # Analyze results
    print(pm.summary(trace))
    pm.plot_trace(trace)
    plt.show()

Interactive Bayesian Tools

Bayesian Updating Simulator

Visualize how prior beliefs update to posterior beliefs with new evidence.

α = 2
β = 2
Successes: 8
Trials: 10

Posterior: Beta(α = 10, β = 4)

Posterior Mean: 0.714

95% Credible Interval (equal-tailed): [0.462, 0.909]

Probability θ > 0.5: 0.954

Problem 1: Drug Efficacy Trial
A new drug is tested on 50 patients, with 40 showing improvement. Using a Beta(1,1) prior, what is the posterior distribution of the improvement rate? What's the probability the true improvement rate exceeds 70%?

Solution:

Prior: θ ∼ Beta(1, 1) (Uniform distribution)

Data: 40 successes in 50 trials

Posterior: θ|D ∼ Beta(1+40, 1+10) = Beta(41, 11)

Posterior mean: 41/(41+11) = 41/52 ≈ 0.788

P(θ > 0.7|D) = 1 - F(0.7) where F is Beta(41,11) CDF

Using Beta CDF: P(θ > 0.7) ≈ 0.93

Interpretation: There's about a 93% probability that the true improvement rate exceeds 70%.

Problem 2: Coin Fairness Test
You flip a coin 100 times and get 65 heads. Starting with a Beta(10,10) prior (centered on fair coin but with uncertainty), what is the posterior probability that the coin is biased toward heads (θ > 0.55)?

Solution:

Prior: θ ∼ Beta(10, 10) (centered at 0.5 with moderate uncertainty)

Data: 65 heads in 100 flips

Posterior: θ|D ∼ Beta(10+65, 10+35) = Beta(75, 45)

Posterior mean: 75/(75+45) = 75/120 = 0.625

P(θ > 0.55|D) = 1 - F(0.55) where F is Beta(75,45) CDF

Using Beta CDF: P(θ > 0.55) ≈ 0.95

Interpretation: Strong evidence (about 95% probability) that the coin's probability of heads exceeds 0.55.
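Tail probabilities like the ones in these two problems are easy to check numerically. A minimal sketch using the survival function of scipy.stats.beta; the helper name is illustrative.

```python
from scipy.stats import beta

def prob_exceeds(alpha_prior, beta_prior, successes, trials, threshold):
    """Return the Beta posterior parameters and P(theta > threshold | data)."""
    a = alpha_prior + successes
    b = beta_prior + (trials - successes)
    # sf is the survival function, 1 - CDF
    return a, b, beta.sf(threshold, a, b)

# Problem 1: Beta(1, 1) prior, 40 improvements in 50 patients, threshold 0.70
a1, b1, p1 = prob_exceeds(1, 1, 40, 50, 0.70)

# Problem 2: Beta(10, 10) prior, 65 heads in 100 flips, threshold 0.55
a2, b2, p2 = prob_exceeds(10, 10, 65, 100, 0.55)

print(f"Problem 1: Beta({a1}, {b1}), P(theta > 0.70 | D) = {p1:.3f}")
print(f"Problem 2: Beta({a2}, {b2}), P(theta > 0.55 | D) = {p2:.3f}")
```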

Advanced Bayesian Topics

🧠

Variational Inference

Approximate Bayesian inference by optimizing a simpler distribution to match the true posterior.

Key idea: Minimize KL divergence between approximate and true posterior

Advantage: Much faster than MCMC for large datasets

Trade-off: Approximation error vs computational speed

⚖️

Bayesian Model Comparison

Comparing models using marginal likelihoods and Bayes factors.

Bayes Factor: B₁₂ = P(D|M₁) / P(D|M₂)

Interpretation: Evidence strength for M₁ over M₂

Challenge: Computing marginal likelihoods can be difficult
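For conjugate models the marginal likelihood has a closed form, so a Bayes factor can be computed exactly. The sketch below reuses the coin data from Problem 2 (65 heads in 100 flips) and compares two models chosen for illustration: M₁ places a uniform Beta(1, 1) prior on θ, and M₂ fixes θ = 0.5. The function names are hypothetical.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_binom_coeff(n, k):
    """log of the binomial coefficient C(n, k)."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_marginal_beta_prior(n, k, a, b):
    """log P(D | M) when theta ~ Beta(a, b): the beta-binomial probability."""
    return log_binom_coeff(n, k) + betaln(a + k, b + n - k) - betaln(a, b)

def log_marginal_point(n, k, theta0):
    """log P(D | M) when theta is fixed at theta0: the binomial probability."""
    return log_binom_coeff(n, k) + k * np.log(theta0) + (n - k) * np.log(1.0 - theta0)

n, k = 100, 65

# B12 = P(D | M1) / P(D | M2), computed on the log scale for stability
log_bf = log_marginal_beta_prior(n, k, 1.0, 1.0) - log_marginal_point(n, k, 0.5)
bayes_factor = np.exp(log_bf)

print(f"Bayes factor B12 (uniform prior vs fair coin): {bayes_factor:.1f}")
```

The result depends on the prior under M₁: a more diffuse or differently centered prior would change the Bayes factor even with the same data.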

🎯

Empirical Bayes

Estimate hyperparameters from data rather than specifying full hyperpriors.

Method: Maximize marginal likelihood to estimate hyperparameters

Advantage: Simpler and cheaper to compute than full hierarchical Bayes, with no hyperprior to specify

Limitation: Underestimates uncertainty in hyperparameters

🔮

Bayesian Nonparametrics

Models with infinite-dimensional parameter spaces.

Examples: Dirichlet Process, Gaussian Process, Indian Buffet Process

Advantage: Flexibility to adapt model complexity to data

Application: Clustering, density estimation, function approximation

Current Research Frontiers:

  • Scalable Inference: Methods for billion-scale datasets
  • Deep Probabilistic Programming: Integrating deep learning with Bayesian methods
  • Bayesian Optimization: For hyperparameter tuning and experimental design
  • Causal Inference: Bayesian approaches to causal discovery and estimation
  • Differential Privacy: Bayesian methods with privacy guarantees