Key Assumptions

Linear Model:
Y = β₀ + β₁X₁ + ... + βₖXₖ + ε
Where:
• ε ~ N(0, σ²)
• Cov(εᵢ, εⱼ) = 0 for i ≠ j
• No perfect multicollinearity

Introduction to Regression Assumptions

Linear regression is one of the most widely used statistical techniques, but its validity depends on several critical assumptions. Violating these assumptions can lead to biased estimates, incorrect standard errors, and invalid statistical inferences.

What are Regression Assumptions?

Regression assumptions are conditions that must be met for ordinary least squares (OLS) estimators to have desirable properties like unbiasedness, efficiency, and valid hypothesis tests. These assumptions form the foundation of reliable regression analysis.

Linearity

The relationship between the predictors and the response variable is linear.

Key Test: Residual vs Fitted plots

Independence

Observations are independent of each other.

Key Test: Durbin-Watson test

Homoscedasticity

Constant variance of errors across all observations.

Key Test: Breusch-Pagan test

Normality

Errors are normally distributed.

Key Test: Shapiro-Wilk test

Multicollinearity

No predictor is an exact linear combination of the other predictors.

Key Test: VIF (Variance Inflation Factor)

Why Regression Assumptions Matter

Understanding and checking regression assumptions is crucial because violations can have serious consequences for your analysis:

Assumption        | Consequence of Violation                          | Severity
Linearity         | Biased coefficient estimates, poor predictions    | High
Independence      | Incorrect standard errors, invalid p-values       | High
Homoscedasticity  | Inefficient estimates, wrong confidence intervals | Medium
Normality         | Invalid hypothesis tests with small samples       | Low (large samples)
Multicollinearity | Unstable estimates, high standard errors          | Medium

The Importance of Diagnostics

Always perform diagnostic checks after fitting a regression model. Skipping assumption checking is one of the most common mistakes in statistical analysis.

  • Visual Diagnostics: Residual plots, Q-Q plots, leverage plots
  • Statistical Tests: Formal hypothesis tests for each assumption
  • Remedial Measures: Transformations, robust methods, alternative models

Linearity Assumption

The linearity assumption states that the conditional mean of the response is a linear function of the parameters. Predictors may themselves be transformed (for example log X or X²), as long as the coefficients enter the model linearly.

E(Y|X) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

Diagnostic Methods

  • Residual vs Fitted Plot: look for random scatter around zero (no pattern = good)
  • Component + Residual Plot (partial residual plot): check linearity for each predictor (linear pattern = good)
  • Rainbow Test: compare the fit on the middle of the data with the fit on the full data (consistent fit = good)

Solutions for Nonlinearity

    # Common transformations for nonlinearity.
    # X, X1, X2 are predictor arrays; y is the response.
    import numpy as np

    # Logarithmic transformation; log1p computes log(1 + X), so zeros are allowed
    X_log = np.log1p(X)

    # Square root transformation (requires X >= 0)
    X_sqrt = np.sqrt(X)

    # Polynomial terms
    X_poly = np.column_stack([X, X**2, X**3])

    # Interaction term
    X_interaction = X1 * X2

    # Generalized Additive Models (GAMs), e.g. with the pygam package:
    # from pygam import LinearGAM, s
    # gam = LinearGAM(s(0) + s(1)).fit(X, y)

Independence Assumption

The independence assumption requires that observations are independent of each other. This is often violated in time series, spatial data, or clustered data.

Cov(εᵢ, εⱼ) = 0 for all i ≠ j

Common Violations

  • Time Series: autocorrelation (today's value depends on yesterday's). Tests: Durbin-Watson, Ljung-Box
  • Spatial Data: spatial autocorrelation (nearby locations are similar). Tests: Moran's I, Geary's C
  • Clustered Data: within-cluster correlation (students in the same class). Check: intraclass correlation (ICC)
  • Repeated Measures: within-subject correlation (multiple measurements per person). Test: Mauchly's test of sphericity

Solutions for Dependence

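    # Handling autocorrelation in time series.
    # y is the response array; X is the design matrix including a constant.
    import numpy as np
    import statsmodels.api as sm

    # Step 1: fit OLS and estimate the AR(1) coefficient from the residuals
    results = sm.OLS(y, X).fit()
    resid = np.asarray(results.resid)
    rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)

    # Step 2 (Cochrane-Orcutt): quasi-difference the data, dropping the first observation.
    # Prais-Winsten uses the same transformation but keeps the first observation,
    # scaled by sqrt(1 - rho**2).
    # Note: the constant column becomes (1 - rho) after this transformation.
    y_transformed = y[1:] - rho * y[:-1]
    X_transformed = X[1:] - rho * X[:-1]
    co_results = sm.OLS(y_transformed, X_transformed).fit()

    # Alternative: keep the OLS coefficients and use Newey-West (HAC) standard errors
    hac_results = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})

    # Mixed effects models for clustered data (groups holds the cluster labels)
    # mixed_results = sm.MixedLM(y, X, groups=groups).fit()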

Homoscedasticity Assumption

Homoscedasticity (constant variance) means the variance of errors is the same across all levels of the independent variables. Violation is called heteroscedasticity.

Var(ε|X) = σ² (constant for all X)

Diagnostic Methods

  • Scale-Location Plot (√|standardized residuals| vs fitted): look for a funnel or fan shape (flat line = good)
  • Residual vs Fitted Plot: check for changing variance (constant spread = good)
  • Statistical Tests: formal hypothesis tests (Breusch-Pagan, White, Goldfeld-Quandt)

Solutions for Heteroscedasticity

    # Weighted Least Squares (WLS).
    # y is the response array; X is the design matrix including a constant.
    import numpy as np
    import statsmodels.api as sm

    ols_results = sm.OLS(y, X).fit()

    # Estimate the variance function: regress |residuals| on the fitted values
    abs_resid = np.abs(ols_results.resid)
    fitted = ols_results.fittedvalues
    var_model = sm.OLS(abs_resid, sm.add_constant(fitted)).fit()

    # Weights are the inverse of the estimated variance; the estimated SDs must be positive
    sd_hat = np.clip(var_model.fittedvalues, 1e-8, None)
    weights = 1 / sd_hat**2

    # Fit WLS model
    wls_results = sm.WLS(y, X, weights=weights).fit()

    # Heteroscedasticity-robust (Huber-White) standard errors; HC3 is a common choice in small samples
    robust_results = sm.OLS(y, X).fit(cov_type='HC3')

    # Variance-stabilizing transformations of the response
    y_log = np.log(y)    # for multiplicative errors (requires y > 0)
    y_sqrt = np.sqrt(y)  # for count data (requires y >= 0)

Normality Assumption

The normality assumption states that the errors are normally distributed. This is important for small sample inference but less critical for large samples due to the Central Limit Theorem.

ε ~ N(0, σ²)

Diagnostic Methods

  • Q-Q Plot (quantile-quantile plot): compare residuals to a normal distribution (points on line = good)
  • Histogram of residuals: visual check of distribution shape (bell shape = good)
  • Statistical Tests: formal normality tests (Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling)

Solutions for Non-Normality

    # Transformations for non-normal errors.
    # y is the response array; X is the design matrix including a constant.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    from sklearn.utils import resample

    # Box-Cox transformation (requires strictly positive data;
    # the +1 shift only helps when y >= 0)
    y_transformed, lambda_ = stats.boxcox(y + 1)

    # Yeo-Johnson transformation (handles negative values)
    # from sklearn.preprocessing import PowerTransformer
    # pt = PowerTransformer(method='yeo-johnson')
    # y_transformed = pt.fit_transform(y.reshape(-1, 1))

    # Quantile regression as a robust alternative
    mod = sm.QuantReg(y, X)
    res = mod.fit(q=0.5)  # median regression

    # Bootstrapping for inference (resample rows with replacement)
    n_iterations = 1000
    boot_coefs = []
    for i in range(n_iterations):
        X_boot, y_boot = resample(X, y)
        model = sm.OLS(y_boot, X_boot).fit()
        boot_coefs.append(model.params)

Multicollinearity Assumption

Multicollinearity occurs when predictor variables are highly correlated with each other. Under perfect multicollinearity, X'X is singular, so the OLS estimates are not uniquely determined and cannot be computed by inverting X'X.

No perfect linear relationship: X'X is invertible
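To make the invertibility condition concrete, here is a small NumPy sketch in which a third predictor is an exact linear combination of the first two. The coefficients 2 and −3 and the variable names are illustrative assumptions. The design matrix then has fewer independent columns than coefficients, so X'X has no inverse.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - 3.0 * x2   # exact linear combination of x1 and x2

X_ok = np.column_stack([np.ones(n), x1, x2])        # intercept + 2 predictors
X_bad = np.column_stack([np.ones(n), x1, x2, x3])   # adds the redundant predictor

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)

print(f"X_ok:  {X_ok.shape[1]} columns, rank {rank_ok}")
print(f"X_bad: {X_bad.shape[1]} columns, rank {rank_bad}")
```

When the rank is smaller than the number of columns, infinitely many coefficient vectors give the same fitted values, which is why software either drops a column or reports an error or warning.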

Diagnostic Methods

  • Correlation Matrix (heatmap): check pairwise correlations (low values = good)
  • VIF Values (Variance Inflation Factors): VIF < 10 is acceptable; VIF > 10 indicates high multicollinearity
  • Condition Number (condition index): eigenvalue analysis of X'X; κ < 30 = mild, κ > 100 = severe
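The VIF for predictor j can be written as 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on all the other predictors. The sketch below computes it directly from that definition with NumPy, on simulated predictors where two columns have a correlation of about 0.95. The helper name `vif` and the simulation settings are illustrative assumptions. The condition number is computed on standardized predictors; other scaling conventions give different values, so treat any threshold as a rough guide.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # correlation ~0.95 with x1
x3 = rng.normal(size=n)                                      # unrelated predictor
X = np.column_stack([x1, x2, x3])

vifs = vif(X)

# Condition number of the standardized predictor matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
kappa = np.linalg.cond(Z)

print("VIF:", np.round(vifs, 2))
print(f"condition number: {kappa:.2f}")
```

The two correlated predictors get large VIFs while the unrelated one stays close to 1, which shows that VIF flags the specific variables involved rather than the model as a whole.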

Solutions for Multicollinearity

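    # Variance Inflation Factor calculation.
    # X is a pandas DataFrame of predictors (no constant column); y is the response.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Calculate VIF for each variable; include a constant so each auxiliary
    # regression has an intercept, then skip the constant (column 0) in the output
    X_const = sm.add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X_const.values, i)
                       for i in range(1, X_const.shape[1])]

    # Ridge Regression (L2 regularization); standardize predictors first,
    # since the penalty depends on their scale
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler
    X_scaled = StandardScaler().fit_transform(X)
    ridge = Ridge(alpha=1.0)  # alpha controls regularization strength
    ridge.fit(X_scaled, y)

    # Principal Component Regression (PCR)
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
    X_pca = pca.fit_transform(X_scaled)
    pcr_model = LinearRegression().fit(X_pca, y)

    # Remove highly correlated variables
    corr_matrix = X.corr().abs()
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.8)]
    X_reduced = X.drop(columns=to_drop)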

Comprehensive Diagnostic Tools

A complete regression diagnostic involves multiple checks. Here's a systematic approach:

1. Visual Diagnostics

Create diagnostic plots to visually assess assumptions:

    # Comprehensive diagnostic plots in R with ggplot2
    # (model is a fitted lm object)
    library(ggplot2)
    library(gridExtra)

    # 1. Residuals vs Fitted
    p1 <- ggplot(model, aes(.fitted, .resid)) +
      geom_point() +
      geom_hline(yintercept = 0) +
      labs(title = "Residuals vs Fitted")

    # 2. Normal Q-Q plot
    p2 <- ggplot(model, aes(sample = .stdresid)) +
      stat_qq() +
      stat_qq_line() +
      labs(title = "Normal Q-Q")

    # 3. Scale-Location plot
    p3 <- ggplot(model, aes(.fitted, sqrt(abs(.stdresid)))) +
      geom_point() +
      geom_smooth(se = FALSE) +
      labs(title = "Scale-Location")

    # 4. Residuals vs Leverage
    p4 <- ggplot(model, aes(.hat, .stdresid)) +
      geom_point() +
      geom_hline(yintercept = 0) +
      labs(title = "Residuals vs Leverage")

    # Combine all plots
    grid.arrange(p1, p2, p3, p4, ncol = 2)

2. Statistical Tests

Formal hypothesis tests for each assumption:

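Assumption        | Test          | Null Hypothesis                     | R Command
Linearity         | Rainbow Test  | Model is linear                     | raintest(model)
Independence      | Durbin-Watson | No autocorrelation                  | dwtest(model)
Homoscedasticity  | Breusch-Pagan | Constant variance                   | bptest(model)
Normality         | Shapiro-Wilk  | Normal errors                       | shapiro.test(resid(model))
Multicollinearity | VIF           | None (rule of thumb: VIF < 10)      | vif(model)

raintest, dwtest, and bptest come from the lmtest package; vif comes from the car package. VIF is a diagnostic measure rather than a hypothesis test.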

Solutions for Assumption Violations

When assumptions are violated, you have several options:

Transformations

  • Log transformation for skewness
  • Box-Cox for general issues
  • Differencing for autocorrelation
  • Weighting for heteroscedasticity

Robust Methods

  • Robust standard errors
  • Quantile regression
  • M-estimators (Huber, Tukey)
  • Bootstrap inference

Alternative Models

  • Generalized Linear Models
  • Mixed Effects Models
  • GAMs for nonlinearity
  • Time series models

Data Collection

  • Increase sample size
  • Improve measurement
  • Random sampling
  • Experimental design

Decision Tree for Violations

  1. Linearity violated? → Try polynomial terms, interactions, or GAMs
  2. Independence violated? → Use time series models, mixed models, or cluster-robust SE
  3. Homoscedasticity violated? → Use WLS, robust SE, or transformations
  4. Normality violated (small n)? → Use transformations or nonparametric methods
  5. Multicollinearity severe? → Use ridge regression, PCA, or remove variables
  6. Multiple violations? → Consider Generalized Linear Models or consult a statistician

    # Combined approach: robust standard errors plus bootstrap.
    # y is the response array; X is the design matrix (NumPy array) including a constant.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.utils import resample

    # Fit model with heteroscedasticity-robust (HC3) standard errors
    results = sm.OLS(y, X).fit(cov_type='HC3')

    # Bootstrap for inference (does not rely on normal errors)
    n_boot = 1000
    boot_coefs = np.zeros((n_boot, X.shape[1]))
    for i in range(n_boot):
        X_boot, y_boot = resample(X, y)
        boot_model = sm.OLS(y_boot, X_boot).fit()
        boot_coefs[i] = boot_model.params

    # Percentile bootstrap 95% confidence intervals
    ci_lower = np.percentile(boot_coefs, 2.5, axis=0)
    ci_upper = np.percentile(boot_coefs, 97.5, axis=0)

    # Compare with robust SE results
    print("Robust SE:", results.bse)
    print("Bootstrap CI:", list(zip(ci_lower, ci_upper)))

Practical Step-by-Step Guide

Follow this systematic approach to checking regression assumptions:

1. Before Fitting the Model
  • Check data quality: Missing values, outliers, measurement errors
  • Examine correlations: Look for obvious multicollinearity
  • Plot relationships: Scatterplots of Y vs each X
  • Consider context: Time series? Clustered data? Repeated measures?
2. After Fitting the Model
  • Create diagnostic plots: Residuals vs fitted, Q-Q, scale-location
  • Calculate VIF: Check for multicollinearity
  • Test assumptions: Durbin-Watson, Breusch-Pagan, Shapiro-Wilk
  • Check influential points: Cook's distance, leverage values
3. If Violations are Found
  • Minor violations: Use robust methods, report limitations
  • Major violations: Consider transformations or alternative models
  • Document everything: Report diagnostic results and remedial actions
  • Sensitivity analysis: Compare results with and without fixes

Practice Problems

Problem 1: You fit a regression model and the residual vs fitted plot shows a clear funnel pattern (variance increases with fitted values). What assumption is violated and what would you do?

Solution:

Violation: Homoscedasticity assumption (heteroscedasticity present)

Steps to address:

  1. Confirm with Breusch-Pagan test
  2. Try transforming the response variable (log, square root)
  3. Use Weighted Least Squares if you know the variance structure
  4. Use robust standard errors (HC3 recommended)
  5. Consider Generalized Least Squares
  6. Report both OLS with robust SE and WLS results

Problem 2: Your regression model has VIF values of 15, 22, and 18 for three predictor variables. The variables are all measures of economic activity. What's the issue and how would you proceed?

Solution:

Issue: Severe multicollinearity (VIF > 10 indicates high multicollinearity)

Steps to address:

  1. Examine correlation matrix to identify highly correlated pairs
  2. Consider creating composite indices or using principal components
  3. Use ridge regression to stabilize estimates
  4. Remove redundant variables (check theoretical importance)
  5. Collect more data if possible (multicollinearity is a data problem)
  6. Interpret coefficients cautiously - focus on model predictions rather than individual coefficients