Introduction to Bayesian Probability
Bayesian probability represents a paradigm shift in statistical inference, treating probability as a measure of belief or certainty rather than just frequency. This approach allows for the incorporation of prior knowledge and continuous updating as new evidence emerges.
Core Bayesian Principles:
- Probability as Degree of Belief: Quantifies uncertainty about propositions
- Prior Knowledge: Incorporates existing information before seeing data
- Bayesian Updating: Continuously updates beliefs with new evidence
- Full Probability Distributions: Provides complete uncertainty quantification
- Model Comparison: Enables direct comparison of competing hypotheses
Bayesian Statistics
• Parameters are random variables
• Uses prior distributions
• Computes posterior distributions
• Provides full uncertainty quantification
• Natural for sequential updating
Frequentist Statistics
• Parameters are fixed unknown quantities
• No prior information used
• Relies on sampling distributions
• Provides point estimates and confidence intervals
• Based on long-run frequency properties
Bayesian methods have become much more widely used in recent decades, driven by advances in computational methods such as MCMC and by their natural applicability to complex problems in machine learning, data science, and decision theory.
Bayes' Theorem: The Foundation
Bayes' Theorem provides the mathematical foundation for updating beliefs in light of new evidence. It describes how to invert conditional probabilities.
P(A|B) = P(B|A) × P(A) / P(B)
Where:
- P(A|B): Posterior probability of A given B (what we want to find)
- P(B|A): Likelihood of B given A (how well A predicts B)
- P(A): Prior probability of A (our initial belief about A)
- P(B): Marginal likelihood of B (normalizing constant)
Prior Distribution
Initial beliefs about parameters before seeing data
P(θ) - represents existing knowledge or assumptions
Likelihood Function
Probability of observed data given parameters
P(D|θ) - how well parameters explain the data
Posterior Distribution
Updated beliefs after seeing data
P(θ|D) ∝ P(D|θ) × P(θ)
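The proportionality P(θ|D) ∝ P(D|θ) × P(θ) can be made concrete with a grid approximation. The sketch below is only an illustration: the Beta(2, 2) prior, the data (8 successes in 10 trials), and the grid size are assumptions chosen for the example, and the function names are hypothetical.

```python
# Grid approximation of posterior ∝ likelihood × prior for a success probability θ.
# Assumed for illustration: Beta(2, 2) prior, 8 successes in 10 trials.
k, n = 8, 10
grid = [i / 1000 for i in range(1001)]          # candidate θ values in [0, 1]

def prior(theta):
    return theta * (1 - theta)                  # Beta(2, 2) kernel, up to a constant

def likelihood(theta):
    return theta**k * (1 - theta)**(n - k)      # binomial kernel, up to a constant

unnormalized = [likelihood(t) * prior(t) for t in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]   # normalize so the weights sum to 1

posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(f"grid posterior mean ≈ {posterior_mean:.3f}")
```

Constants dropped from the prior and likelihood cancel in the normalization step, which is why only the kernels are needed.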
Medical Testing Example:
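Suppose a disease affects 1% of the population (prior). A test is 99% accurate in both directions: it detects 99% of true cases (sensitivity) and returns a false positive for 1% of healthy people. If someone tests positive, what is the probability that they actually have the disease?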
P(Positive) = P(Positive|Disease)×P(Disease) + P(Positive|No Disease)×P(No Disease)
P(Positive) = (0.99 × 0.01) + (0.01 × 0.99) = 0.0198
P(Disease|Positive) = (0.99 × 0.01) / 0.0198 = 0.5
Interpretation: Even with a 99% accurate test, a positive result implies only a 50% chance of having the disease. Because the prior is so low, true positives (0.0099) and false positives (0.0099) are equally common.
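The calculation above can be sketched in a few lines of Python. The function name `bayes_posterior` is hypothetical, and the second update assumes a repeat test whose errors are independent of the first, which is an assumption added here rather than part of the example.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    # Marginal probability of a positive test (law of total probability).
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

first = bayes_posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.01)
# Assumption: a second, independent positive test reuses the first posterior as its prior.
second = bayes_posterior(prior=first, sensitivity=0.99, false_positive_rate=0.01)

print(f"after one positive test:  {first:.3f}")
print(f"after two positive tests: {second:.3f}")
```

Feeding one posterior back in as the next prior is the same updating rule applied twice, which is why a second positive result moves the probability so much further.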
Prior & Posterior Distributions
The choice of prior distribution is a fundamental aspect of Bayesian analysis, representing our beliefs about parameters before observing data.
Informative Priors
Based on existing knowledge, previous studies, or expert opinion.
Example: Beta(α=10, β=10) for a probability parameter
Use when: Substantial prior information exists
Weakly Informative Priors
Regularize without strongly influencing results.
Example: Normal(μ=0, σ=10) for regression coefficients
Use when: Some prior knowledge exists but is uncertain
Non-informative Priors
Attempt to represent prior ignorance.
Example: Uniform distribution or Jeffreys prior
Use when: No prior information available
Improper Priors
Do not integrate to a finite value, but can still yield proper posteriors.
Example: p(σ) ∝ 1/σ for scale parameters
Use with caution: Can lead to improper posteriors
Prior-Posterior Visualization
| Property | Description | Interpretation |
|---|---|---|
| Posterior Mean | E[θ\|D] = ∫ θ p(θ\|D) dθ | Bayesian point estimate (minimizes squared error loss) |
| Posterior Median | Median of p(θ\|D) | Point estimate (minimizes absolute error loss) |
| Posterior Mode (MAP) | argmax p(θ\|D) | Maximum A Posteriori estimate |
| Credible Interval | [a,b] such that P(a ≤ θ ≤ b\|D) = 0.95 | Bayesian confidence interval (direct probability interpretation) |
| Posterior Predictive | p(y_new\|D) = ∫ p(y_new\|θ) p(θ\|D) dθ | Predictive distribution for new observations |
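As a rough illustration of these summaries, the sketch below computes them for a Beta(10, 4) posterior using only the standard library. The choice of Beta(10, 4) and the helper names are assumptions. The CDF uses the binomial-tail identity, which holds only for positive integer parameters; in practice a library routine such as `scipy.stats.beta` would normally be used instead.

```python
from math import comb

A, B = 10, 4  # assumed posterior Beta(10, 4), for illustration only

def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def beta_quantile(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

post_mean = A / (A + B)                        # E[θ|D]
post_median = beta_quantile(0.5, A, B)
post_mode = (A - 1) / (A + B - 2)              # MAP, valid when a, b > 1
interval = (beta_quantile(0.025, A, B),        # equal-tailed 95% credible interval
            beta_quantile(0.975, A, B))
prob_above_half = 1 - beta_cdf(0.5, A, B)      # P(θ > 0.5 | D)
pred_success = post_mean                       # posterior predictive P(y_new = 1 | D)

print(f"mean {post_mean:.3f}, median {post_median:.3f}, mode {post_mode:.3f}")
print(f"95% interval [{interval[0]:.3f}, {interval[1]:.3f}], P(θ > 0.5) = {prob_above_half:.3f}")
```

For a Bernoulli outcome the posterior predictive probability of success is the posterior mean, so no separate integral is needed in this case.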
Conjugate Priors
Conjugate priors are mathematically convenient choices where the posterior distribution belongs to the same family as the prior distribution.
| Likelihood | Conjugate Prior | Posterior Parameters | Common Use |
|---|---|---|---|
| Bernoulli / Binomial | Beta(α, β) | Beta(α + successes, β + failures) | Proportions, probabilities |
| Poisson | Gamma(α, β) | Gamma(α + sum(x), β + n) | Count data, rates |
| Normal (known variance) | Normal(μ₀, σ₀²) | Normal(μₙ, σₙ²) | Means, regression |
| Normal (known mean) | Inverse-Gamma(α, β) | Inverse-Gamma(α + n/2, β + SS/2) | Variances |
| Multinomial | Dirichlet(α₁,...,αₖ) | Dirichlet(α₁ + n₁,...,αₖ + nₖ) | Categorical data |
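Two rows of the table can be sketched as small update functions. The function names and the toy data are hypothetical. The Gamma update assumes the rate parameterization (β is a rate, not a scale), and the Normal update spells out the μₙ and σₙ² that the table leaves implicit, using the standard precision-weighted form.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(alpha, beta) prior (beta is a rate) combined with Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

def normal_known_variance_update(mu0, var0, data, var):
    """Normal(mu0, var0) prior on a mean; observations have known variance `var`."""
    n = len(data)
    precision_n = 1 / var0 + n / var             # precisions (inverse variances) add
    var_n = 1 / precision_n
    mu_n = var_n * (mu0 / var0 + sum(data) / var)
    return mu_n, var_n

# Toy data, made up for illustration.
print(gamma_poisson_update(2, 1, counts=[3, 5, 4]))
print(normal_known_variance_update(0.0, 1.0, data=[1.0, 2.0, 3.0], var=1.0))
```

In both cases the posterior parameters are simple functions of sufficient statistics (the sum and the count of observations), which is what makes conjugate updating cheap.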
Beta-Binomial Conjugacy Example:
Suppose we want to estimate the probability θ of success in a Bernoulli trial.
Prior: θ ∼ Beta(α=2, β=2)
Data: 8 successes in 10 trials
Likelihood: Binomial(n=10, k=8|θ)
Posterior: θ|D ∼ Beta(α=2+8, β=2+2) = Beta(10, 4)
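The update in this example amounts to adding counts to the prior parameters. A minimal sketch follows, assuming the Beta(2, 2) prior implied by the posterior line; `beta_update` is a hypothetical helper, and the order of the individual outcomes is made up.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

# Batch update with the example's numbers: Beta(2, 2) prior, 8 successes, 2 failures.
batch = beta_update(2, 2, successes=8, failures=2)

# Sequential update: the same 10 outcomes fed one at a time (order is hypothetical).
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
alpha, beta = 2, 2
for y in outcomes:
    alpha, beta = beta_update(alpha, beta, successes=y, failures=1 - y)

print(batch, (alpha, beta))   # both are (10, 4)
```

Batch and one-at-a-time updating give the same posterior, which is the property that makes conjugate models convenient for streaming data.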
Advantages of Conjugate Priors:
- Analytical tractability: Closed-form posterior distributions
- Computational efficiency: No need for numerical integration
- Interpretability: Clear updating rules for parameters
- Sequential updating: Natural for streaming data
Limitations:
- May not represent true prior beliefs
- Limited to specific likelihood families
- Can be restrictive for complex models
Markov Chain Monte Carlo (MCMC)
MCMC methods enable Bayesian inference for complex models where analytical solutions are intractable by sampling from posterior distributions.
Metropolis-Hastings
The foundational MCMC algorithm that uses proposal distributions to explore the parameter space.
Key concept: Accept/reject mechanism based on acceptance ratio
Advantage: Very general, works with any posterior
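A minimal random-walk sampler illustrates the accept/reject step. This is a sketch under stated assumptions, not a production sampler: the target (the unnormalized Beta(10, 4) kernel), the step size, and the burn-in length are all chosen only for illustration. Because the Gaussian proposal is symmetric, the Hastings correction cancels and the acceptance ratio reduces to a ratio of target densities.

```python
import math
import random

def log_target(theta):
    """Unnormalized log posterior; here the Beta(10, 4) kernel (an assumed target)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 9 * math.log(theta) + 3 * math.log(1 - theta)

def metropolis_hastings(n_samples, start=0.5, step=0.2, seed=0):
    rng = random.Random(seed)
    theta, samples, accepted = start, [], 0
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)        # symmetric random-walk proposal
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept with probability min(1, ratio); otherwise keep the current value.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_samples

samples, acceptance_rate = metropolis_hastings(20_000)
kept = samples[2_000:]                                 # discard burn-in draws
mh_mean = sum(kept) / len(kept)
print(f"posterior mean ≈ {mh_mean:.3f}, acceptance rate ≈ {acceptance_rate:.2f}")
```

Only the unnormalized density is needed, which is why the method applies even when the marginal likelihood is unknown.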
Gibbs Sampling
Samples from full conditional distributions when they're available.
Key concept: Update parameters one at a time
Advantage: No proposal tuning needed; every draw is accepted
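As a sketch of the one-at-a-time update, consider a standard bivariate normal with correlation ρ, where each full conditional is itself normal: x | y ∼ Normal(ρy, 1 − ρ²). The target distribution, ρ = 0.8, and the function name are assumptions made for this illustration.

```python
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1 - rho**2) ** 0.5          # standard deviation of each full conditional
    x, y, draws = 0.0, 0.0, []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)    # draw x from p(x | y)
        y = rng.gauss(rho * x, cond_sd)    # draw y from p(y | x), using the new x
        draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(20_000)
# Both margins have mean 0 and variance 1, so E[xy] estimates the correlation.
estimated_rho = sum(x * y for x, y in draws) / len(draws)
print(f"estimated correlation ≈ {estimated_rho:.2f}")
```

There is no accept/reject step because each draw comes exactly from a conditional of the target; the price is slow mixing when parameters are strongly correlated.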
Hamiltonian Monte Carlo
Uses Hamiltonian dynamics to efficiently explore high-dimensional spaces.
Key concept: Momentum variables for guided exploration
Advantage: Efficient for complex, high-dimensional posteriors
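The role of the momentum variable can be sketched for a one-dimensional standard normal target. Everything specific here is an assumption for illustration: the target, the step size, the number of leapfrog steps, and the function names. Real samplers also adapt these settings and work in many dimensions.

```python
import math
import random

def potential(q):
    return 0.5 * q * q            # U(q) = -log density of a standard normal, up to a constant

def grad_potential(q):
    return q                      # dU/dq

def hmc(n_samples, step_size=0.15, n_leapfrog=10, seed=0):
    rng = random.Random(seed)
    q, samples = 0.0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                       # resample the momentum
        q_new, p_new = q, p
        # Leapfrog integration: half step in momentum, alternating full steps, half step.
        p_new -= 0.5 * step_size * grad_potential(q_new)
        for step in range(n_leapfrog):
            q_new += step_size * p_new
            if step < n_leapfrog - 1:
                p_new -= step_size * grad_potential(q_new)
        p_new -= 0.5 * step_size * grad_potential(q_new)
        # Metropolis correction for the integrator's discretization error.
        h_old = potential(q) + 0.5 * p * p
        h_new = potential(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

hmc_samples = hmc(5_000)
hmc_mean = sum(hmc_samples) / len(hmc_samples)
hmc_var = sum((s - hmc_mean) ** 2 for s in hmc_samples) / (len(hmc_samples) - 1)
print(f"sample mean ≈ {hmc_mean:.2f}, sample variance ≈ {hmc_var:.2f}")
```

The gradient steers each trajectory along the density, so proposals can land far from the current point while still being accepted with high probability.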
NUTS (No-U-Turn Sampler)
Adaptive variant of HMC that automatically selects the trajectory length (number of leapfrog steps); the step size is typically tuned during warmup.
Key concept: Recursive tree-building to avoid U-turns
Advantage: No manual tuning, efficient and robust
MCMC Convergence Diagnostics
| Diagnostic | Purpose | Target Value |
|---|---|---|
| Trace Plots | Visual check of convergence and mixing | Stationary, well-mixed chains |
| Gelman-Rubin R̂ | Compare within/between chain variance | R̂ < 1.1 (older rule of thumb; R̂ < 1.01 is now commonly recommended) |
| Effective Sample Size | Measure of independent samples | ESS > 400 per chain |
| Autocorrelation | Measure of chain dependence | ACF decays quickly to 0 |
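The R̂ diagnostic can be sketched directly from its definition. This is the classic, non-split version; modern libraries use rank-normalized split-R̂, so treat the code as an illustration only. The simulated chains are synthetic stand-ins for real sampler output.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Classic (non-split) potential scale reduction factor R-hat."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])     # W: average within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled estimate of posterior variance
    return (pooled / within) ** 0.5

rng = random.Random(0)
# Four synthetic chains that agree, and four where one chain sits in a different region.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 0, 3)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"well-mixed chains: R-hat ≈ {rhat_mixed:.3f}")
print(f"one stuck chain:   R-hat ≈ {rhat_stuck:.3f}")
```

When chains disagree about the mean, the between-chain term inflates the pooled variance and pushes R̂ well above 1.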
Hierarchical Bayesian Models
Hierarchical models (multilevel models) allow parameters to themselves have distributions with hyperparameters, enabling partial pooling and borrowing of strength across groups.
θⱼ ∼ p(θ|ϕ)
ϕ ∼ p(ϕ)
Eight Schools Example (Classic Hierarchical Model):
Estimating treatment effects across 8 schools, where each school has its own effect but all effects are assumed to come from a common distribution.
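yⱼ ∼ Normal(θⱼ, σⱼ²) # Observed effect in school j, with known standard error σⱼ
θⱼ ∼ Normal(μ, τ²) # School effects come from common distribution
μ ∼ Normal(0, 10) # Hyperpriors
τ ∼ Half-Cauchy(0, 5)
Key Insight: Schools with noisier estimates (larger standard errors) are shrunk more toward the overall mean μ.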
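The shrinkage effect can be sketched with the published eight schools data (Rubin, 1981). The sketch holds μ and τ fixed at hypothetical values purely to show the mechanics; in the full model both are uncertain and would be averaged over. It also assumes each observed effect yⱼ is normally distributed around θⱼ with known standard error σⱼ.

```python
# Eight schools data (Rubin, 1981): estimated effects and their standard errors.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

mu, tau = 8.0, 5.0   # hypothetical fixed hyperparameters, chosen only for illustration

def conditional_mean(y_j, sigma_j, mu, tau):
    """E[θ_j | y_j, μ, τ]: precision-weighted average of the data and the group mean."""
    shrinkage = sigma_j**2 / (sigma_j**2 + tau**2)    # weight placed on μ
    return shrinkage * mu + (1 - shrinkage) * y_j, shrinkage

for j, (y_j, s_j) in enumerate(zip(y, sigma), start=1):
    estimate, weight = conditional_mean(y_j, s_j, mu, tau)
    print(f"school {j}: y = {y_j:>3}, sigma = {s_j:>2}, "
          f"shrinkage = {weight:.2f}, estimate = {estimate:.1f}")
```

The weight on μ grows with σⱼ, so the schools with the largest standard errors move furthest toward the common mean.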
Partial Pooling
Balances between complete pooling (all groups same) and no pooling (each group independent).
Benefit: More stable estimates for small groups
Borrowing Strength
Information from all groups informs estimates for each individual group.
Benefit: Improved estimates for groups with limited data
Regularization
Hierarchical structure naturally regularizes parameter estimates.
Benefit: Reduces overfitting, especially with many parameters
Uncertainty Propagation
Uncertainty at all levels is properly accounted for in predictions.
Benefit: More honest uncertainty quantification
Real-World Applications
Bayesian methods are widely used across numerous fields due to their flexibility and natural handling of uncertainty.
Machine Learning
Bayesian Neural Networks: Place distributions over weights
Gaussian Processes: Non-parametric Bayesian regression
Bayesian Optimization: Efficient hyperparameter tuning
Variational Autoencoders: Deep generative models
Clinical Trials
Adaptive Designs: Update trial parameters based on accumulating data
Dose Finding: Bayesian optimal design for phase I trials
Meta-analysis: Combine evidence from multiple studies
Predictive Probability: Early stopping rules
Finance & Economics
Time Series: Bayesian VAR, state-space models
Risk Management: Value at Risk (VaR) estimation
Portfolio Optimization: Black-Litterman model
Forecasting: Bayesian structural models
Scientific Research
Genomics: Bayesian phylogenetics, GWAS
Ecology: Species distribution modeling
Physics: Parameter estimation in complex models
Psychology: Cognitive modeling, psychometrics
Problem: Compare conversion rates of two website versions (A and B).
θ_A ∼ Beta(α=1, β=1) # Uniform prior for version A
θ_B ∼ Beta(α=1, β=1) # Uniform prior for version B
y_A ∼ Binomial(n_A, θ_A)
y_B ∼ Binomial(n_B, θ_B)
δ = θ_B - θ_A # Difference in conversion rates
Bayesian Advantages:
- Direct probability statements: P(θ_B > θ_A | data)
- Natural stopping rule: Stop when P(θ_B > θ_A) > 0.95
- Easy to incorporate prior information from previous tests
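- Posterior statements remain valid under optional stopping, although stopping as soon as a threshold is crossed still inflates the frequentist false-positive rate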
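The probability P(θ_B > θ_A | data) has no simple closed form, but it is easy to estimate by sampling from the two Beta posteriors. The conversion counts below are made up for illustration, and `prob_b_beats_a` is a hypothetical helper; the Beta(1, 1) priors match the model above.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(θ_B > θ_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Conjugate posteriors: Beta(1 + conversions, 1 + non-conversions).
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws

# Hypothetical data: 120/1000 conversions for A, 150/1000 for B.
p_b_better = prob_b_beats_a(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"P(θ_B > θ_A | data) ≈ {p_b_better:.3f}")
```

The same draws can be reused to summarize δ = θ_B − θ_A, for example by reporting its mean or a credible interval.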
Python Implementation
Modern Python libraries make Bayesian analysis accessible and efficient.
PyMC (formerly PyMC3)
Features: Probabilistic programming, NUTS sampler, PyTensor backend (Theano in PyMC3)
Pyro
Features: Deep probabilistic programming, PyTorch backend, scalable
Stan (via CmdStanPy or PyStan)
Features: High-performance, Hamiltonian Monte Carlo, C++ backend with its own modeling language
TensorFlow Probability
Features: Integration with TensorFlow, GPU acceleration, deep learning
Interactive Bayesian Tools
Bayesian Updating Simulator
Visualize how prior beliefs update to posterior beliefs with new evidence.
Posterior: Beta(α = 10, β = 4)
Posterior Mean: 0.714
95% Credible Interval (equal-tailed): [0.462, 0.909]
Probability θ > 0.5: 0.954
A new drug is tested on 50 patients, with 40 showing improvement. Using a Beta(1,1) prior, what is the posterior distribution of the improvement rate? What's the probability the true improvement rate exceeds 70%?
Solution:
Prior: θ ∼ Beta(1, 1) (Uniform distribution)
Data: 40 successes in 50 trials
Posterior: θ|D ∼ Beta(1+40, 1+10) = Beta(41, 11)
Posterior mean: 41/(41+11) = 41/52 ≈ 0.788
P(θ > 0.7|D) = 1 - F(0.7) where F is Beta(41,11) CDF
Using Beta CDF: P(θ > 0.7) ≈ 0.933
Interpretation: There's about a 93% probability that the true improvement rate exceeds 70%.
You flip a coin 100 times and get 65 heads. Starting with a Beta(10,10) prior (centered on fair coin but with uncertainty), what is the posterior probability that the coin is biased toward heads (θ > 0.55)?
Solution:
Prior: θ ∼ Beta(10, 10) (centered at 0.5 with moderate uncertainty)
Data: 65 heads in 100 flips
Posterior: θ|D ∼ Beta(10+65, 10+35) = Beta(75, 45)
Posterior mean: 75/(75+45) = 75/120 = 0.625
P(θ > 0.55|D) = 1 - F(0.55) where F is Beta(75,45) CDF
Using Beta CDF: P(θ > 0.55) ≈ 0.95
Interpretation: Strong evidence (about 95% probability) that θ exceeds 0.55. The probability that the coin favors heads at all, P(θ > 0.5|D), is higher still, at about 0.998.
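Tail probabilities like those in the two practice problems can be checked numerically. The sketch below integrates the Beta density with a simple midpoint rule using only the standard library; the helper names and the number of integration steps are assumptions, and a library routine such as `scipy.stats.beta.sf` would be the usual choice in practice.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed on the log scale for numerical stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def beta_tail(threshold, a, b, n=20_000):
    """P(θ > threshold) for θ ~ Beta(a, b), by midpoint-rule integration."""
    width = (1 - threshold) / n
    return sum(beta_pdf(threshold + (i + 0.5) * width, a, b) for i in range(n)) * width

drug_tail = beta_tail(0.70, 41, 11)      # drug trial: Beta(41, 11) posterior
coin_tail = beta_tail(0.55, 75, 45)      # coin: Beta(75, 45) posterior
coin_any_bias = beta_tail(0.50, 75, 45)  # same posterior, lower threshold

print(f"P(θ > 0.70 | drug data) ≈ {drug_tail:.3f}")
print(f"P(θ > 0.55 | coin data) ≈ {coin_tail:.3f}")
print(f"P(θ > 0.50 | coin data) ≈ {coin_any_bias:.3f}")
```

Comparing the last two lines shows how sensitive a tail probability is to the threshold chosen in the question.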
Advanced Bayesian Topics
Variational Inference
Approximate Bayesian inference by optimizing a simpler distribution to match the true posterior.
Key idea: Minimize KL divergence between approximate and true posterior
Advantage: Much faster than MCMC for large datasets
Trade-off: Approximation error vs computational speed
Bayesian Model Comparison
Comparing models using marginal likelihoods and Bayes factors.
Bayes Factor: B₁₂ = P(D|M₁) / P(D|M₂)
Interpretation: Evidence strength for M₁ over M₂
Challenge: Computing marginal likelihoods can be difficult
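For conjugate models the marginal likelihood is available in closed form, which makes a small Bayes factor example possible. The sketch below compares two hypothetical models for coin-flip data: M₁ places a Beta(1, 1) prior on θ, and M₂ fixes θ = 0.5. The data (65 heads in 100 flips), the model choices, and the function names are all assumptions for illustration.

```python
from math import exp, lgamma, log

def log_choose(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_beta_prior(k, n, a, b):
    """log P(D | M) when θ ~ Beta(a, b): the beta-binomial marginal likelihood."""
    return log_choose(n, k) + log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)

def log_marginal_point(k, n, theta):
    """log P(D | M) when θ is fixed, so no integration is needed."""
    return log_choose(n, k) + k * log(theta) + (n - k) * log(1 - theta)

k, n = 65, 100                                         # hypothetical data
log_m1 = log_marginal_beta_prior(k, n, a=1, b=1)       # M1: θ ~ Beta(1, 1)
log_m2 = log_marginal_point(k, n, theta=0.5)           # M2: fair coin
bayes_factor_12 = exp(log_m1 - log_m2)
print(f"B12 = P(D|M1) / P(D|M2) ≈ {bayes_factor_12:.1f}")
```

Working on the log scale avoids underflow, and the result depends on the prior chosen under M₁: a more diffuse or more concentrated prior would change the Bayes factor even with the same data.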
Empirical Bayes
Estimate hyperparameters from data rather than specifying full hyperpriors.
Method: Maximize marginal likelihood to estimate hyperparameters
Advantage: Simpler and cheaper than full hierarchical Bayes, with no hyperprior to specify
Limitation: Underestimates uncertainty in hyperparameters
Bayesian Nonparametrics
Models with infinite-dimensional parameter spaces.
Examples: Dirichlet Process, Gaussian Process, Indian Buffet Process
Advantage: Flexibility to adapt model complexity to data
Application: Clustering, density estimation, function approximation
Current Research Frontiers:
- Scalable Inference: Methods for billion-scale datasets
- Deep Probabilistic Programming: Integrating deep learning with Bayesian methods
- Bayesian Optimization: For hyperparameter tuning and experimental design
- Causal Inference: Bayesian approaches to causal discovery and estimation
- Differential Privacy: Bayesian methods with privacy guarantees