Regression Assumptions: Complete Guide with Tests and Solutions

Introduction to Regression Assumptions

Linear regression is one of the most widely used statistical techniques, but its validity depends on several critical assumptions. Violating these assumptions can lead to biased estimates, incorrect standard errors, and invalid statistical inferences.

What are Regression Assumptions?

Regression assumptions are conditions that must be met for ordinary least squares (OLS) estimators to have desirable properties like unbiasedness, efficiency, and valid hypothesis tests. These assumptions form the foundation of reliable regression analysis.

Linearity

The relationship between the predictors and the response variable is linear.

Key Test: Residual vs Fitted plots

Independence

Observations are independent of each other.

Key Test: Durbin-Watson test

Homoscedasticity

Constant variance of errors across all observations.

Key Test: Breusch-Pagan test

Normality

Errors are normally distributed.

Key Test: Shapiro-Wilk test

Multicollinearity

Predictors are not perfectly correlated.

Key Test: VIF (Variance Inflation Factor)

Why Regression Assumptions Matter

Understanding and checking regression assumptions is crucial because violations can have serious consequences for your analysis:

Assumption | Consequence of Violation | Severity
Linearity | Biased coefficient estimates, poor predictions | High
Independence | Incorrect standard errors, invalid p-values | High
Homoscedasticity | Inefficient estimates, wrong confidence intervals | Medium
Normality | Invalid hypothesis tests with small samples | Low (large samples)
Multicollinearity | Unstable estimates, high standard errors | Medium
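
The "incorrect standard errors" consequence of violating independence can be illustrated with a small simulation. The sketch below is an illustration only: the sample size, the AR(1) coefficient of 0.8, and the time-trend predictor are all assumptions chosen for the demonstration. It compares the actual spread of the estimated slope across simulated datasets with the standard error that the usual OLS formula reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, rho = 100, 500, 0.8            # illustrative settings, not from the article
x = np.arange(n, dtype=float)             # a simple time trend as the predictor
X = np.column_stack([np.ones(n), x])      # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)

slopes, naive_ses = [], []
for _ in range(n_sims):
    # AR(1) errors: e[t] = rho * e[t-1] + white noise
    shocks = rng.normal(size=n)
    e = np.empty(n)
    e[0] = shocks[0]
    for t in range(1, n):
        e[t] = rho * e[t - 1] + shocks[t]
    y = 1.0 + 0.5 * x + e

    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    slopes.append(beta[1])
    naive_ses.append(np.sqrt(sigma2 * XtX_inv[1, 1]))

true_sd = np.std(slopes, ddof=1)          # actual sampling variability of the slope
mean_naive_se = np.mean(naive_ses)        # what the OLS formula reports on average
ratio = true_sd / mean_naive_se

print(f"mean slope estimate:     {np.mean(slopes):.4f}")
print(f"empirical SD of slope:   {true_sd:.4f}")
print(f"average OLS-reported SE: {mean_naive_se:.4f}")
print(f"ratio (SD / SE):         {ratio:.2f}")
```

In this setup the slope estimates stay centered on the true value of 0.5, but the reported standard error understates the real variability, which is why p-values and confidence intervals become unreliable.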

The Importance of Diagnostics

Always perform diagnostic checks after fitting a regression model. Skipping assumption checking is one of the most common mistakes in statistical analysis.

Visual Diagnostics: Residual plots, Q-Q plots, leverage plots
Statistical Tests: Formal hypothesis tests for each assumption
Remedial Measures: Transformations, robust methods, alternative models

Linearity Assumption

The linearity assumption states that the expected value of the response is a linear function of the parameters: each predictor (or a transformation of it) enters the model additively, multiplied by a constant coefficient.

E(Y|X) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ
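
"Linear in the parameters" means that a curved relationship in X can still be fitted by OLS as long as the coefficients enter linearly. A minimal sketch, assuming simulated data with made-up true coefficients (2.0, 1.5, -0.4), is:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
# Simulated response: curved in x, but linear in the three coefficients
y = 2.0 + 1.5 * x - 0.4 * x**2 + rng.normal(scale=0.3, size=200)

# x and x**2 are separate columns of the design matrix, so the model
# y = b0 + b1*x + b2*x**2 is still linear in (b0, b1, b2).
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 2))
```

The estimates should land close to the coefficients used to generate the data, even though the fitted curve is a parabola.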

Diagnostic Methods

Residual vs Fitted Plot: Residuals against fitted values. Look for random scatter around zero (no pattern = good).
Component + Residual Plot: Partial residual plot. Check linearity for each predictor (linear pattern = good).
Rainbow Test: Compare the fit on the middle of the data with the fit on the full data (consistent fit = good).
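
A residual vs fitted plot can be backed up with a numeric summary. The sketch below is one possible approach rather than a standard named test: it averages the residuals within equal-count bins of the fitted values. The helper names `fit_ols` and `binned_residual_means` are hypothetical, and the data are simulated with a truly quadratic relationship.

```python
import numpy as np

def fit_ols(X, y):
    """Return fitted values and residuals from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def binned_residual_means(fitted, resid, n_bins=5):
    """Mean residual within equal-count bins of the fitted values."""
    order = np.argsort(fitted)
    return np.array([chunk.mean() for chunk in np.array_split(resid[order], n_bins)])

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=300)   # truly quadratic

# Misspecified model: straight line only
X_lin = np.column_stack([np.ones_like(x), x])
means_lin = binned_residual_means(*fit_ols(X_lin, y))

# Better specified model: add the squared term
X_quad = np.column_stack([np.ones_like(x), x, x**2])
means_quad = binned_residual_means(*fit_ols(X_quad, y))

print("linear fit   :", np.round(means_lin, 2))
print("quadratic fit:", np.round(means_quad, 2))
```

For the straight-line fit the bin means trace a U shape (positive at both ends, negative in the middle), which is the numeric counterpart of a curved residual plot. After adding the squared term the bin means stay near zero.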

Solutions for Nonlinearity

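# Common transformations for nonlinearity
# Assumes X, X1, X2 are NumPy arrays of predictor values and y is the response
import numpy as np

# Logarithmic transformation (log1p computes log(1 + X), so zeros are handled; X must be > -1)
X_log = np.log1p(X)

# Square root transformation (X must be non-negative)
X_sqrt = np.sqrt(X)

# Polynomial terms
X_poly = np.column_stack([X, X**2, X**3])

# Interaction terms
X_interaction = X1 * X2

# Use Generalized Additive Models (GAMs), here via the pygam package
# from pygam import LinearGAM, s
# gam = LinearGAM(s(0) + s(1)).fit(X, y)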

Independence Assumption

The independence assumption requires that observations are independent of each other. This is often violated in time series, spatial data, or clustered data.

Cov(εᵢ, εⱼ) = 0 for all i ≠ j

Common Violations

Time Series

Autocorrelation: Today's value depends on yesterday's

Test: Durbin-Watson, Ljung-Box

Spatial Data

Spatial autocorrelation: Nearby locations are similar

Test: Moran's I, Geary's C

Clustered Data

Within-cluster correlation: Students in same class

Test: Intraclass correlation

Repeated Measures

Within-subject correlation: Multiple measurements per person

Test: Mauchly's test of sphericity (used with repeated-measures ANOVA)
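
For the time-series case, the Durbin-Watson statistic is simple enough to compute by hand. The sketch below uses only NumPy and treats two simulated series as if they were regression residuals; the function name `durbin_watson_stat` and the AR(1) coefficient of 0.7 are assumptions made for the illustration. statsmodels provides the same statistic as `statsmodels.stats.stattools.durbin_watson`.

```python
import numpy as np

def durbin_watson_stat(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Values near 2 suggest no lag-1 autocorrelation; values well below 2
    suggest positive autocorrelation, values well above 2 negative.
    """
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
n = 500
white = rng.normal(size=n)          # independent "residuals"

ar1 = np.empty(n)                   # positively autocorrelated "residuals"
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)

print(f"independent series:    DW = {dw_white:.2f}")
print(f"autocorrelated series: DW = {dw_ar1:.2f}")
```

The statistic is roughly 2(1 - r), where r is the lag-1 autocorrelation of the residuals, so the independent series lands near 2 and the autocorrelated one falls well below it.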

Solutions for Dependence

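# Handling autocorrelation in time series
# Assumes y is a 1-D NumPy array and X is a 2-D NumPy design matrix (including a constant)
import numpy as np
import statsmodels.api as sm

# Step 1: fit OLS and estimate the AR(1) coefficient from the residuals
results = sm.OLS(y, X).fit()
e = np.asarray(results.resid)
rho = np.dot(e[1:], e[:-1]) / np.dot(e[:-1], e[:-1])

# Step 2: Cochrane-Orcutt quasi-differencing (drops the first observation;
# Prais-Winsten would keep it, rescaled by sqrt(1 - rho**2))
y_transformed = y[1:] - rho * y[:-1]
X_transformed = X[1:] - rho * X[:-1]
co_results = sm.OLS(y_transformed, X_transformed).fit()

# Alternative: keep the OLS coefficients and use Newey-West (HAC) standard errors
hac_results = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})

# Mixed effects models for clustered data
# mixed_results = sm.MixedLM(y, X, groups=groups).fit()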

Homoscedasticity Assumption

Homoscedasticity (constant variance) means the variance of errors is the same across all levels of the independent variables. Violation is called heteroscedasticity.

Var(ε|X) = σ² (constant for all X)

Diagnostic Methods

Scale-Location Plot: √|standardized residuals| against fitted values. A flat trend is good; a funnel or fan shape signals changing variance.
Residual vs Fitted Plot: Residuals against fitted values. Constant spread is good; check for changing variance.
Statistical Tests: The Breusch-Pagan, White, and Goldfeld-Quandt tests provide formal hypothesis tests.
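
The idea behind the Breusch-Pagan test can be sketched in a few lines of NumPy: regress the squared residuals on the predictors and see how much of their variation is explained. The version below is the studentized (Koenker) form, LM = n × R², compared against the 5% chi-square critical value for one degree of freedom. The helper names and the simulated data are assumptions for illustration; in practice `statsmodels.stats.diagnostic.het_breuschpagan` also returns a p-value.

```python
import numpy as np

def ols_resid(X, y):
    """Residuals from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def breusch_pagan_lm(resid, X):
    """LM = n * R^2 from regressing squared residuals on X (X includes a constant)."""
    u2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r2

rng = np.random.default_rng(4)
n = 400
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])

y_const = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=n)       # constant variance
y_funnel = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)  # spread grows with x

lm_const = breusch_pagan_lm(ols_resid(X, y_const), X)
lm_funnel = breusch_pagan_lm(ols_resid(X, y_funnel), X)

CHI2_CRIT_1DF_5PCT = 3.841   # 95th percentile of chi-square with 1 degree of freedom

print(f"constant variance: LM = {lm_const:.2f}")
print(f"funnel pattern:    LM = {lm_funnel:.2f}")
print(f"5% critical value: {CHI2_CRIT_1DF_5PCT}")
```

An LM statistic above the critical value is evidence against constant variance. The degrees of freedom equal the number of predictors in the auxiliary regression, excluding the constant, which is one here.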

Solutions for Heteroscedasticity

# Weighted Least Squares (WLS)
# Assumes y is the response and X is the design matrix (including a constant)
import numpy as np
import statsmodels.api as sm

# Fit OLS first to obtain residuals and fitted values
ols_fit = sm.OLS(y, X).fit()
abs_resid = np.abs(ols_fit.resid)
fitted = ols_fit.fittedvalues

# Estimate the variance function: regress absolute residuals on fitted values
var_model = sm.OLS(abs_resid, sm.add_constant(fitted)).fit()
sd_hat = np.clip(var_model.fittedvalues, 1e-8, None)  # guard against non-positive predictions
weights = 1 / sd_hat**2

# Fit WLS model
wls_results = sm.WLS(y, X, weights=weights).fit()

# Robust standard errors (Huber-White)
ols_robust = sm.OLS(y, X).fit(cov_type='HC3')  # HC3 is a common choice for small samples

# Transformations of the response
y_log = np.log(y)    # For multiplicative errors (requires y > 0)
y_sqrt = np.sqrt(y)  # For count data (requires y >= 0)

Normality Assumption

The normality assumption states that the errors are normally distributed. This is important for small sample inference but less critical for large samples due to the Central Limit Theorem.

ε ~ N(0, σ²)

Diagnostic Methods

Q-Q Plot: Quantile-quantile plot comparing residuals to a normal distribution (points on the line = good).
Histogram: Residual histogram for a visual check of distribution shape (bell shape = good).
Statistical Tests: The Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests provide formal normality tests.
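
As a rough illustration of the Shapiro-Wilk test, the sketch below applies `scipy.stats.shapiro` to two simulated residual series, one normal and one right-skewed. The series are stand-ins for real model residuals, and the sample size of 300 is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_resid = rng.normal(size=300)
skewed_resid = rng.exponential(size=300) - 1.0   # centered but right-skewed

# Shapiro-Wilk: W close to 1 is consistent with normality; a small p-value rejects it
w_normal, p_normal = stats.shapiro(normal_resid)
w_skewed, p_skewed = stats.shapiro(skewed_resid)

# Sample skewness as a quick companion to the histogram
skew_normal = stats.skew(normal_resid)
skew_skewed = stats.skew(skewed_resid)

print(f"normal residuals: W = {w_normal:.3f}, p = {p_normal:.3f}, skew = {skew_normal:.2f}")
print(f"skewed residuals: W = {w_skewed:.3f}, p = {p_skewed:.3g}, skew = {skew_skewed:.2f}")
```

With large samples the test can flag departures that are too small to matter in practice, so it is usually read alongside the Q-Q plot rather than on its own.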

Solutions for Non-Normality

# Transformations for non-normal errors
# Assumes y is a 1-D NumPy array and X is a design matrix (including a constant)
import numpy as np
import statsmodels.api as sm
from scipy import stats
from sklearn.utils import resample

# Box-Cox transformation (requires strictly positive input)
y_transformed, lambda_ = stats.boxcox(y + 1)  # The +1 shift only helps when y >= 0 contains zeros

# Yeo-Johnson transformation (handles negative values)
# from sklearn.preprocessing import PowerTransformer
# pt = PowerTransformer(method='yeo-johnson')
# y_transformed = pt.fit_transform(y.reshape(-1, 1))

# Quantile regression: a fit that does not rely on normal errors
mod = sm.QuantReg(y, X)
res = mod.fit(q=0.5)  # Median regression

# Bootstrapping for inference (resample rows of X and y together)
n_iterations = 1000
boot_coefs = []
for i in range(n_iterations):
    X_boot, y_boot = resample(X, y, random_state=i)
    boot_coefs.append(np.asarray(sm.OLS(y_boot, X_boot).fit().params))
boot_coefs = np.array(boot_coefs)

# Percentile confidence intervals for each coefficient
ci_lower, ci_upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)

Multicollinearity Assumption

Multicollinearity occurs when predictor variables are highly correlated with each other. Perfect multicollinearity makes the OLS estimates impossible to compute.

No perfect linear relationship: X'X is invertible
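
A minimal sketch of what perfect multicollinearity looks like numerically, assuming simulated predictors where the third is an exact sum of the first two:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                        # exact linear combination of x1 and x2

X = np.column_stack([np.ones(n), x1, x2, x3])
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# lstsq still returns a solution (the minimum-norm one), but it is not unique:
# adding any multiple of (0, 1, 1, -1) to the coefficients leaves X @ beta unchanged.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_alt = beta + 5.0 * np.array([0.0, 1.0, 1.0, -1.0])
same_fit = np.allclose(X @ beta, X @ beta_alt)

print("different coefficients, identical fitted values:", same_fit)
```

Because the rank is lower than the number of columns, X'X has no inverse, and infinitely many coefficient vectors give the same fit. This is why the individual coefficients cannot be identified.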

Diagnostic Methods

Correlation Matrix: Correlation heatmap of pairwise correlations between predictors (low values = good).
VIF Values: Variance Inflation Factors. VIF < 10 is commonly treated as acceptable; VIF > 10 indicates high multicollinearity.
Condition Number: Condition index from an eigenvalue analysis of X'X. κ < 30 is mild; κ > 100 is severe.
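
The VIF for predictor j is 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. A minimal NumPy sketch of this definition follows; the function name `vif` and the simulated predictors are assumptions for illustration, and statsmodels offers `variance_inflation_factor` for routine use.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus a constant.

    X holds the predictors only (no constant column).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)                    # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Here x1 and x2 inflate each other's variance well past the rule-of-thumb cutoff of 10, while the unrelated x3 stays near the minimum possible value of 1.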

Solutions for Multicollinearity

# Variance Inflation Factor calculation
# Assumes X is a pandas DataFrame of predictors (no constant column) and y is the response
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each variable (a constant is added so each auxiliary regression has an intercept)
X_const = sm.add_constant(X)
vif_data = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(1, X_const.shape[1])],  # index 0 is the constant
})

# Ridge Regression (L2 regularization); standardize first so the penalty treats predictors equally
X_scaled = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=1.0)  # Alpha controls regularization strength
ridge.fit(X_scaled, y)

# Principal Component Regression (PCR)
pca = PCA(n_components=0.95)  # Keep enough components to explain 95% of the variance
X_pca = pca.fit_transform(X_scaled)
pcr_model = LinearRegression().fit(X_pca, y)

# Remove highly correlated variables (drops the later column of each pair with |r| > 0.8)
corr_matrix = X.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper_tri.columns
           if any(upper_tri[column] > 0.8)]
X_reduced = X.drop(columns=to_drop)

Comprehensive Diagnostic Tools

A complete regression diagnostic involves multiple checks. Here's a systematic approach:

1. Visual Diagnostics

Create diagnostic plots to visually assess assumptions:

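# Comprehensive diagnostic plots in R using ggplot2 (equivalent to plot(model))
# Assumes `model` is a fitted lm object; ggplot2 converts it via fortify(),
# which supplies the .fitted, .resid, .stdresid, and .hat columns used below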
library(ggplot2)
library(gridExtra)

# 1. Residuals vs Fitted
p1 <- ggplot(model, aes(.fitted, .resid)) +
  geom_point() + geom_hline(yintercept=0) +
  labs(title="Residuals vs Fitted")

# 2. Normal Q-Q plot
p2 <- ggplot(model, aes(sample=.stdresid)) +
  stat_qq() + stat_qq_line() +
  labs(title="Normal Q-Q")

# 3. Scale-Location plot
p3 <- ggplot(model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point() + geom_smooth(se=FALSE) +
  labs(title="Scale-Location")

# 4. Residuals vs Leverage
p4 <- ggplot(model, aes(.hat, .stdresid)) +
  geom_point() + geom_hline(yintercept=0) +
  labs(title="Residuals vs Leverage")

# Combine all plots
grid.arrange(p1, p2, p3, p4, ncol=2)
            

2. Statistical Tests

Formal hypothesis tests for each assumption:

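Assumption | Test | Null Hypothesis / Rule | R Command
Linearity | Rainbow Test | Model is linear | raintest(model)
Independence | Durbin-Watson | No autocorrelation | dwtest(model)
Homoscedasticity | Breusch-Pagan | Constant variance | bptest(model)
Normality | Shapiro-Wilk | Normal errors | shapiro.test(resid(model))
Multicollinearity | VIF | Rule of thumb: VIF < 10 (not a hypothesis test) | vif(model)

raintest, dwtest, and bptest are in the lmtest package; vif is in the car package.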

Solutions for Assumption Violations

When assumptions are violated, you have several options:

Transformations

Log transformation for skewness
Box-Cox for general issues
Differencing for autocorrelation
Weighting for heteroscedasticity

Robust Methods

Robust standard errors
Quantile regression
M-estimators (Huber, Tukey)
Bootstrap inference

Alternative Models

Generalized Linear Models
Mixed Effects Models
GAMs for nonlinearity
Time series models

Data Collection

Increase sample size
Improve measurement
Random sampling
Experimental design

Decision Tree for Violations

Linearity violated? → Try polynomial terms, interactions, or GAMs
Independence violated? → Use time series models, mixed models, or cluster-robust SE
Homoscedasticity violated? → Use WLS, robust SE, or transformations
Normality violated (small n)? → Use transformations or nonparametric methods
Multicollinearity severe? → Use ridge regression, PCA, or remove variables
Multiple violations? → Consider Generalized Linear Models or consult a statistician

# Comprehensive solution: robust standard errors plus bootstrap
# Assumes y is the response and X is the design matrix (including a constant)
import numpy as np
import statsmodels.api as sm
from sklearn.utils import resample

# Fit model with heteroscedasticity-robust standard errors
model = sm.OLS(y, X)
results = model.fit(cov_type='HC3')  # HC3 is conservative in small samples

# Bootstrap for inference (handles non-normality)
n_boot = 1000
boot_coefs = np.zeros((n_boot, X.shape[1]))

for i in range(n_boot):
    X_boot, y_boot = resample(X, y, random_state=i)  # resample rows of X and y together
    boot_coefs[i] = np.asarray(sm.OLS(y_boot, X_boot).fit().params)

# Calculate bootstrap percentile confidence intervals
ci_lower = np.percentile(boot_coefs, 2.5, axis=0)
ci_upper = np.percentile(boot_coefs, 97.5, axis=0)

# Compare with robust SE results
print("Robust SE:", results.bse)
print("Robust CI:", results.conf_int())
print("Bootstrap CI:", list(zip(ci_lower, ci_upper)))

Practical Step-by-Step Guide

Follow this systematic approach to checking regression assumptions:

1. Before Fitting the Model

Check data quality: Missing values, outliers, measurement errors
Examine correlations: Look for obvious multicollinearity
Plot relationships: Scatterplots of Y vs each X
Consider context: Time series? Clustered data? Repeated measures?

2. After Fitting the Model

Create diagnostic plots: Residuals vs fitted, Q-Q, scale-location
Calculate VIF: Check for multicollinearity
Test assumptions: Durbin-Watson, Breusch-Pagan, Shapiro-Wilk
Check influential points: Cook's distance, leverage values
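
The influence measures in the last item can be computed directly from the hat matrix. The sketch below uses simulated data with one deliberately planted unusual observation; the planted values are arbitrary. Leverage is the diagonal of H = X(X'X)⁻¹X', and Cook's distance combines leverage with residual size.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Plant one unusual observation: far out in x and far off the line
x[0], y[0] = 6.0, -5.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]                        # number of estimated coefficients

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

worst = int(np.argmax(cooks_d))
print(f"most influential observation: index {worst}")
print(f"leverage = {leverage[worst]:.3f}, Cook's D = {cooks_d[worst]:.2f}")
print(f"sum of leverages = {leverage.sum():.3f} (equals p = {p})")
```

Common rules of thumb flag leverage above 2p/n and Cook's distance above 1 (some authors use 4/n). These are screening cutoffs rather than formal tests.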

3. If Violations are Found

Minor violations: Use robust methods, report limitations
Major violations: Consider transformations or alternative models
Document everything: Report diagnostic results and remedial actions
Sensitivity analysis: Compare results with and without fixes

Practice Problems

Problem 1: You fit a regression model and the residual vs fitted plot shows a clear funnel pattern (variance increases with fitted values). What assumption is violated and what would you do?

Solution:

Violation: Homoscedasticity assumption (heteroscedasticity present)

Steps to address:

Confirm with Breusch-Pagan test
Try transforming the response variable (log, square root)
Use Weighted Least Squares if you know the variance structure
Use robust standard errors (HC3 recommended)
Consider Generalized Least Squares
Report both OLS with robust SE and WLS results

Problem 2: Your regression model has VIF values of 15, 22, and 18 for three predictor variables. The variables are all measures of economic activity. What's the issue and how would you proceed?

Solution:

Issue: Severe multicollinearity (VIF > 10 indicates high multicollinearity)

Steps to address:

Examine correlation matrix to identify highly correlated pairs
Consider creating composite indices or using principal components
Use ridge regression to stabilize estimates
Remove redundant variables (check theoretical importance)
Collect more data if possible (multicollinearity is a data problem)
Interpret coefficients cautiously - focus on model predictions rather than individual coefficients
