What Is Bias in Linear Regression? Causes and Effects

Bias in linear regression is the systematic error that causes a model’s predictions to consistently miss the true values in one direction. Instead of scattering randomly around the correct answer, a biased model’s predictions are, on average, too high or too low. This differs from random error, which fluctuates unpredictably. Bias is a structural problem: something about the model itself is wrong.

How Bias Works in Simple Terms

Imagine you’re trying to predict home prices using only square footage. Your model draws the best straight line it can through the data, but home prices also depend on location, condition, and dozens of other factors. Because those factors aren’t in the model, the line will systematically over-predict prices for some kinds of homes and under-predict them for others. For those groups, the average prediction won’t land on the true average. That gap, between where the model’s predictions land on average and where the true values actually are, is bias.

Mathematically, the expected prediction error of any model breaks down into three components: bias squared, variance, and irreducible noise. Irreducible noise is the randomness inherent in real-world data that no model can capture. Variance reflects how much your model’s predictions would change if you trained it on a different sample of data. Bias is the piece that doesn’t shrink no matter how many samples you train on or how much data you collect, because it comes from the model being fundamentally too simple or incorrectly specified for the pattern in the data.
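This decomposition can be seen in a small simulation: fit a straight line to curved data many times, each time on a fresh sample, and the average prediction settles a fixed distance from the truth (the bias), while the spread around that average is the variance. A minimal NumPy sketch, with all numbers invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return x ** 2  # curved ground truth a straight line cannot match

x_test = 1.5
preds = []
for _ in range(500):
    # Fresh training sample each time, with irreducible noise added
    x = rng.uniform(0, 2, 50)
    y = true_f(x) + rng.normal(0, 0.3, 50)
    slope, intercept = np.polyfit(x, y, 1)  # force a straight-line fit
    preds.append(slope * x_test + intercept)

preds = np.array(preds)
bias = preds.mean() - true_f(x_test)  # systematic miss: survives averaging
variance = preds.var()                # sample-to-sample wobble
print(bias, variance)
```

The bias stays put no matter how many of these 500 replications you average over; only the variance would shrink if each training sample were larger.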

What Causes Bias

Omitting Important Variables

The most common source of bias in linear regression is leaving out a variable that matters. If a variable influences both the outcome you’re predicting and one of the predictors already in your model, its absence warps the coefficients of the included predictors. Going back to the home price example: if houses in expensive neighborhoods also tend to be larger, then leaving out neighborhood makes the square footage coefficient absorb some of the neighborhood effect. The coefficient for square footage becomes inflated, and predictions for small homes in expensive neighborhoods will be systematically wrong.

This is called omitted variable bias, and it requires two conditions to occur. The omitted variable must genuinely affect the outcome, and it must be correlated with at least one variable already in the model. If either condition is missing, leaving the variable out won’t bias your estimates.
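A quick simulation makes the mechanism concrete. Below, a hypothetical “quality” variable drives both square footage and price; fitting the model with and without it shows the square-footage coefficient absorbing the omitted effect. All numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical setup: neighborhood "quality" raises both size and price
quality = rng.normal(0, 1, n)
sqft = 1_000 + 300 * quality + rng.normal(0, 100, n)
price = 50 * sqft + 40_000 * quality + rng.normal(0, 5_000, n)

# Full model: price on sqft AND quality
X_full = np.column_stack([np.ones(n), sqft, quality])
coef_full = np.linalg.lstsq(X_full, price, rcond=None)[0]

# Short model: quality omitted
X_short = np.column_stack([np.ones(n), sqft])
coef_short = np.linalg.lstsq(X_short, price, rcond=None)[0]

print(coef_full[1])   # recovers the true sqft effect of 50
print(coef_short[1])  # inflated: sqft soaks up part of the quality effect
```

Both conditions from the paragraph above are met here: quality affects price directly, and it is correlated with sqft. Break either link in the simulation and the two coefficients converge.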

Forcing a Line Through a Curve

Linear regression assumes the relationship between predictors and the outcome follows a straight line (or a flat plane in higher dimensions). When the true relationship is curved, a straight line will systematically miss the pattern. It will over-predict in some ranges and under-predict in others, not randomly, but in a consistent, patterned way. This is a form of underfitting: the model is too rigid to capture what’s actually happening in the data.
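You can watch this happen by fitting a line to data generated from a curve: the residuals flip sign by region instead of hovering near zero everywhere. A small sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 4, 200)
y = 0.5 * x ** 2 + rng.normal(0, 0.2, 200)  # true relationship is curved

slope, intercept = np.polyfit(x, y, 1)  # force a straight line through it
residuals = y - (slope * x + intercept)

# The misses are patterned, not random: for this convex curve the line
# runs too low at both ends and too high in the middle
print(residuals[x < 0.5].mean())              # positive: under-predicting
print(residuals[(x > 1.5) & (x < 2.5)].mean())  # negative: over-predicting
print(residuals[x > 3.5].mean())              # positive again
```

A correctly specified model would show residual means near zero in every range; the consistent sign pattern here is the signature of underfitting.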

Measurement Error in Predictors

When your predictor variables contain measurement errors, the regression slope gets pulled toward zero. Research assessing the consequences of measurement error in explanatory variables found that the ordinary least squares method underestimates the regression slope on average, with the bias increasing as the measurement error grows larger and as the true relationship gets stronger. Interestingly, measurement error in the outcome variable (the thing you’re predicting) does not bias the slope, as long as the errors in the predictor and outcome aren’t correlated with each other.
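The attenuation effect is easy to reproduce. In the sketch below (invented numbers), the measurement error added to the predictor has the same variance as the predictor itself, so classical errors-in-variables theory predicts the estimated slope shrinks by about half; the same amount of noise added to the outcome instead leaves the slope essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
true_slope = 2.0

x_true = rng.normal(0, 1, n)
y = true_slope * x_true + rng.normal(0, 0.5, n)

# Error in the predictor: slope is pulled toward zero (attenuation)
x_noisy = x_true + rng.normal(0, 1, n)
slope_noisy = np.polyfit(x_noisy, y, 1)[0]  # expect roughly true_slope / 2

# Error in the outcome: slope stays unbiased (errors are uncorrelated)
y_noisy = y + rng.normal(0, 1, n)
slope_y_noisy = np.polyfit(x_true, y_noisy, 1)[0]  # expect roughly true_slope

print(slope_noisy, slope_y_noisy)
```

The halving follows the classical attenuation factor Var(x) / (Var(x) + Var(error)), which is 1/2 here because the two variances are equal.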

Non-Representative Samples

If the data used to build the model doesn’t represent the population you care about, the regression coefficients will be biased estimates of the true relationships. This selection bias can distort which variables appear important and by how much. Research published in the Annals of Applied Statistics found that large volunteer samples and convenience samples can produce misleading coefficient estimates, with some coefficients showing substantial bias compared to what probability-based samples would yield.
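Here is one toy version of the problem: the population follows a known slope, but the “sample” only includes cases with large outcomes, the way a convenience sample might over-represent the most enthusiastic participants. This is just one of many selection mechanisms, and all values are invented, but it shows how the selected-sample slope lands well below the truth:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(0, 1, n)
y = 1.0 * x + rng.normal(0, 1, n)  # true population slope is 1.0

# Convenience "sample": only cases with large outcomes show up
keep = y > 0.5
slope_selected = np.polyfit(x[keep], y[keep], 1)[0]
slope_full = np.polyfit(x, y, 1)[0]

print(slope_full)      # close to 1.0 on the representative data
print(slope_selected)  # noticeably attenuated on the selected data
```

Nothing about the fitting procedure changed between the two estimates; only who made it into the dataset did.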

When OLS Estimates Are Unbiased

Ordinary least squares regression, the standard method for fitting a linear model, produces unbiased estimates under specific conditions. The key requirement for unbiasedness is that the errors (the differences between predicted and actual values) average to zero at every value of the predictors, which rules out omitted confounders and a mis-specified functional form. If, in addition, the errors have constant variance across observations and no error is correlated with any other, OLS is not just unbiased but achieves the lowest possible variance among all linear unbiased estimators. Statisticians call this property “BLUE” (best linear unbiased estimator), established by the Gauss-Markov theorem.
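Unbiasedness is a statement about averages across repeated samples, which a simulation can illustrate directly: generate many datasets where the conditions hold, fit OLS to each, and the average slope estimate converges on the truth. A minimal sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(5)
true_slope, true_intercept = 3.0, 1.0

slopes = []
for _ in range(2_000):
    x = rng.uniform(0, 10, 40)
    # Errors: mean zero everywhere, constant variance, independent
    y = true_intercept + true_slope * x + rng.normal(0, 2, 40)
    slopes.append(np.polyfit(x, y, 1)[0])

# Any single estimate wobbles around 3.0, but the average over many
# samples lands on the true slope: that is what "unbiased" means
print(np.mean(slopes), np.std(slopes))
```

Swap the error term for one that depends on x (say, proportional to x squared) and the same experiment shows the average drifting away from 3.0.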

In practice, these assumptions are often violated to some degree. The question isn’t whether bias exists but whether it’s large enough to matter for your specific analysis.

How to Detect Bias

The most practical tool for spotting bias is a residual plot. Residuals are the differences between your model’s predictions and the actual values. Plot them against the predicted values, and look for patterns. If the model is unbiased, residuals should scatter randomly around zero across the entire range of predictions, with no discernible shape or trend.

A telltale sign of bias is a curved pattern in the residuals. If residuals are negative (predictions too high) for low predicted values and positive (predictions too low) for high predicted values, the model is systematically getting the relationship wrong. This often means the true relationship is nonlinear, or an important variable is missing. Another red flag is residuals that fan out or compress as predicted values increase; strictly speaking this signals non-constant variance rather than bias in the coefficients, but it still means the model’s accuracy varies across different ranges and its standard errors can’t be taken at face value.

If your residual plot shows any of these patterns, the regression coefficients and predictions can’t be trusted at face value. The model needs adjustment, whether that means adding variables, transforming existing ones, or using a different functional form.
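The eyeball test can also be made mechanical: bin the predictions and check whether residuals average near zero within each bin. In this sketch (invented data with an exponential ground truth), a straight-line fit leaves bin means that swing from positive to negative and back, flagging systematic bias without any plotting:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 500)
y = np.exp(0.5 * x) + rng.normal(0, 0.3, 500)  # nonlinear ground truth

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
residuals = y - predicted

# Split predictions into quartile bins; an unbiased model's residuals
# should average near zero within every bin
bins = np.quantile(predicted, [0, 0.25, 0.5, 0.75, 1.0])
idx = np.clip(np.digitize(predicted, bins) - 1, 0, 3)
bin_means = [residuals[idx == k].mean() for k in range(4)]
print(bin_means)  # positive, negative, negative, positive: a clear pattern
```

Overall, OLS residuals always average exactly zero, which is why the check has to be done range by range: the bias hides in the subgroups, not the grand total.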

Bias vs. Variance: The Tradeoff

Bias and variance pull in opposite directions. A very simple model (like a straight line through complex data) has high bias because it can’t capture the true pattern, but low variance because it gives similar results regardless of which data sample you use. A very complex model (with many predictors and interaction terms) has low bias because it can flex to fit the data closely, but high variance because it’s sensitive to the specific sample and may not generalize well to new data.

This tradeoff matters when you’re choosing how complex to make your regression model. Adding more predictors reduces bias by giving the model more information to work with, but at some point the added complexity starts fitting noise rather than signal. The goal is to find the complexity level where total error (bias squared plus variance plus irreducible noise) is minimized. In practice, techniques like cross-validation help identify this sweet spot by testing how well the model performs on data it wasn’t trained on.
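A hold-out experiment shows the tradeoff in action: fit polynomials of increasing degree to noisy data from a sine curve, then score each on a validation sample the model never saw. The straight line loses because its bias dominates; training error, by contrast, only ever goes down as complexity grows. A sketch with invented numbers (full cross-validation would average over several such splits):

```python
import numpy as np

rng = np.random.default_rng(7)
x_train = rng.uniform(0, 3, 60)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.25, 60)
x_val = rng.uniform(0, 3, 60)
y_val = np.sin(2 * x_val) + rng.normal(0, 0.25, 60)

train_mse, val_mse = {}, {}
for degree in range(1, 7):
    c = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(c, x_train) - y_train) ** 2)
    val_mse[degree] = np.mean((np.polyval(c, x_val) - y_val) ** 2)

best = min(val_mse, key=val_mse.get)
print(best)           # the degree with the lowest held-out error
print(val_mse[1])     # the straight line's validation error: high bias
print(train_mse[6])   # training error keeps falling with complexity
```

Picking the degree by training error would always choose the most complex model; picking by held-out error lands near the sweet spot where squared bias plus variance is smallest.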

What Bias Does to Your Results

A biased model produces coefficients that don’t reflect the true relationships in your data. If you’re using regression to understand which factors matter and by how much (say, whether a marketing campaign increased sales), biased coefficients lead to wrong conclusions. You might attribute an effect to one variable that actually belongs to another, or underestimate a relationship that’s genuinely strong.

For prediction, bias means your model will consistently miss in the same direction for certain types of inputs. It might under-predict high values and over-predict low ones, or systematically get a particular subgroup wrong. Unlike random error, which averages out over many predictions, bias accumulates. If you’re using a biased model to make thousands of predictions, every single one carries that same systematic distortion.