What Is Collinearity and How Does It Affect Regression?

Collinearity is a situation in statistical modeling where two or more predictor variables are strongly correlated with each other, making it difficult to determine which variable is actually driving the outcome. In its simplest form, collinearity between two variables means that knowing one lets you largely predict the other. This creates real problems when you’re trying to build a regression model and figure out which factors matter most.

How Collinearity Works

Imagine you’re building a model to predict housing prices using square footage and number of rooms as your two predictor variables. Bigger houses tend to have more rooms, so these two variables move together. That overlap is collinearity. Your model can still predict housing prices reasonably well overall, but it struggles to say how much of the price comes from square footage versus how much comes from the number of rooms, because the two are tangled together.
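That overlap is easy to see in miniature. The following sketch uses NumPy and entirely synthetic data (the numbers and variable names are invented for illustration): house sizes are drawn at random, a room count is derived from them with some noise, and the correlation between the two predictors is measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic housing data: room count closely tracks square footage.
sqft = rng.uniform(800, 3500, n)
rooms = sqft / 400 + rng.normal(0, 0.5, n)   # bigger house, more rooms
price = 100 * sqft + 5000 * rooms + rng.normal(0, 20000, n)

# The two predictors are strongly correlated: that overlap is collinearity.
r = np.corrcoef(sqft, rooms)[0, 1]
print(f"correlation(sqft, rooms) = {r:.2f}")
```

A correlation this high means the model can estimate the predictors' combined effect on price far more reliably than either variable's separate effect.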

A perfect linear relationship between two variables, like one being exactly double the other, is called exact collinearity. But the relationship doesn’t need to be perfect to cause problems. A strong correlation is enough. When the same kind of overlap involves three or more variables at once, the technical term shifts to multicollinearity. In practice, most people use the two terms interchangeably.

Real-World Examples

Collinearity shows up constantly in real data. In a sample of young people, age and years of education move almost in lockstep because most students advance one grade per year. In health research, measurements of body fat taken from the triceps, thigh, and mid-arm all capture the same underlying concept (total body fat), so they’re naturally correlated with each other.

Sometimes the relationship is less obvious. In automotive data, miles per gallon, vehicle weight, and price are all entangled because they reflect different market segments. Trucks are heavier, more expensive, and get worse mileage. A lightweight sedan is cheaper and more fuel-efficient. The variables don’t directly cause each other, but they cluster together because of how the car market works. This kind of hidden collinearity is the type most likely to catch you off guard.

Why It Causes Problems

Collinearity doesn’t necessarily make your model’s overall predictions worse. If you just want to forecast an outcome, the model’s total explanatory power (its R-squared) can remain perfectly fine. The damage shows up when you try to interpret individual variables.

When predictor variables are correlated, the model can’t cleanly separate their individual contributions. This inflates the standard errors of the regression coefficients, which are the estimates of how much each variable matters. Larger standard errors mean wider confidence intervals and higher p-values, making it harder to declare any single variable statistically significant. A variable that genuinely influences the outcome can appear insignificant simply because it shares too much information with another variable in the model.
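The inflation can be shown directly by comparing the standard error of a slope when its partner predictor is independent versus nearly collinear. This is a minimal sketch under simplifying assumptions (known noise variance, synthetic data); the function name `slope_se` is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

def slope_se(x1, x2, noise_sd=1.0):
    """Standard error of the x1 slope in y ~ 1 + x1 + x2,
    using Var(beta_hat) = sigma^2 * (X'X)^{-1}."""
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    cov = noise_sd ** 2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[1, 1])

x1 = rng.normal(size=n)
independent = rng.normal(size=n)             # uncorrelated with x1
collinear = x1 + 0.1 * rng.normal(size=n)    # correlation ~0.995 with x1

se_indep = slope_se(x1, independent)
se_coll = slope_se(x1, collinear)
print(f"SE, independent partner: {se_indep:.3f}")
print(f"SE, collinear partner:   {se_coll:.3f}")
```

Nothing about x1 or the noise changed between the two cases; only the partner predictor did, yet the standard error grows by roughly an order of magnitude.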

Worse, the coefficient estimates themselves become unstable. Adding or removing a single data point, or dropping one variable from the model, can cause the remaining coefficients to swing dramatically. You might even see coefficients flip signs, suggesting a variable has the opposite effect of what you’d expect. None of this means the model's overall predictions are wrong. It means the model can’t tell you which specific variable deserves the credit.
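The instability is easy to demonstrate: fit the same model twice, once on the full data and once with a handful of rows dropped, and compare. A sketch with synthetic data, where the two predictors are nearly identical and the true slopes are 1 and 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60

x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # nearly identical to x1
y = x1 + x2 + rng.normal(0, 0.5, size=n)   # true slopes: 1 and 1

def slopes(mask):
    X = np.column_stack([np.ones(mask.sum()), x1[mask], x2[mask]])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta[1:]                        # slopes for x1 and x2

full = np.ones(n, dtype=bool)
subset = full.copy()
subset[:5] = False                         # drop five rows

b_full, b_sub = slopes(full), slopes(subset)
print("full data:     ", b_full, "sum:", b_full.sum())
print("5 rows dropped:", b_sub, "sum:", b_sub.sum())
```

In runs like this, the individual slopes typically shift far more between the two fits than their sum does: the combined effect (close to 2) is well determined, but the split between x1 and x2 is not.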

How to Detect It

The simplest check is a correlation matrix. Calculate the Pearson correlation between every pair of predictor variables. Correlations above 0.7 or 0.8 are a common warning sign, though no single cutoff is universal.
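This check is a few lines in Python. The sketch below uses NumPy and synthetic automotive-style data (the variables and numbers are invented to mimic the car-market example above), then flags predictor pairs whose absolute correlation exceeds 0.8.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Hypothetical predictors: weight and price move together, mpg moves against them.
weight = rng.normal(3000, 500, n)
price = 10 * weight + rng.normal(0, 2000, n)
mpg = 40 - 0.005 * weight + rng.normal(0, 1.5, n)

names = ["weight", "price", "mpg"]
R = np.corrcoef([weight, price, mpg])     # rows are variables

# Flag predictor pairs whose absolute correlation exceeds 0.8.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(R[i, j]) > 0.8:
            print(f"{names[i]} vs {names[j]}: r = {R[i, j]:.2f}")
```

Note that large negative correlations (like weight versus mpg here) are just as much a warning sign as large positive ones, which is why the check uses the absolute value.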

A correlation matrix only catches pairwise relationships, though. One variable might be predictable from a combination of three others without being strongly correlated with any single one. For that, the Variance Inflation Factor (VIF) is the standard tool. VIF measures how much the variance of a coefficient is inflated because of collinearity: you regress each predictor on all the other predictors, and if that regression has coefficient of determination R², then VIF = 1 / (1 − R²). A VIF of 1 means no collinearity at all. Values above 5 suggest moderate collinearity, and values above 10 are widely considered problematic. Most statistical software can calculate VIF for every variable in your model with a single command.
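Because VIF is just a regression of each predictor on the rest, it can be sketched in a few lines of NumPy (in practice you would use your statistical package's built-in routine; the `vif` function and the synthetic data here are for illustration only):

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the others, VIF = 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
n = 400
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.3 * rng.normal(size=n)   # predictable from a and b together

v = vif(np.column_stack([a, b, c]))
print(v)
```

Here c is built from a and b combined, so its pairwise correlation with each is only around 0.7, yet its VIF is large: exactly the kind of multi-variable overlap a correlation matrix misses.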

Prediction vs. Interpretation

Whether collinearity actually matters depends on what you’re trying to do. If your goal is pure prediction, say forecasting next quarter’s sales, collinearity is often harmless. The correlated variables still carry useful information collectively, and the model’s overall accuracy holds up.

If your goal is inference, meaning you want to understand which specific factors drive an outcome and by how much, collinearity is a serious issue. A health researcher trying to determine whether body fat on the thigh or the triceps better predicts heart disease risk needs clean separation between those variables. A policymaker asking whether education or age has a bigger effect on income needs the same. In these cases, leaving collinearity unaddressed leads to misleading results.

How to Fix It

The most straightforward fix is removing one of the correlated variables. If two variables measure essentially the same thing, dropping one often costs you very little predictive power while making the remaining coefficients much more stable and interpretable. Domain knowledge helps here: pick the variable that’s more relevant to your research question or easier to measure.
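A quick way to check what dropping a redundant variable costs is to compare R-squared with and without it. In this synthetic sketch (variable names invented for illustration), rooms is a near-duplicate of sqft and price truly depends only on square footage:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
sqft = rng.normal(2000, 600, n)
rooms = sqft / 400 + rng.normal(0, 0.5, n)   # near-duplicate of sqft
price = 100 * sqft + rng.normal(0, 30000, n)

def r_squared(*predictors):
    X = np.column_stack([np.ones(n), *predictors])
    beta, *_ = np.linalg.lstsq(X, price, rcond=None)
    resid = price - X @ beta
    return 1 - resid.var() / price.var()

r2_both = r_squared(sqft, rooms)
r2_one = r_squared(sqft)
print(f"R^2 with both predictors: {r2_both:.3f}")
print(f"R^2 with sqft alone:      {r2_one:.3f}")
```

Dropping rooms barely moves R-squared, while the remaining sqft coefficient becomes far more stable and easier to interpret.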

When you don’t want to lose variables, combining them works well. Principal component analysis (PCA) transforms a set of correlated variables into a smaller set of uncorrelated components. Each component is a weighted blend of the original variables, arranged so the first component captures as much of the overall variation as possible, the second captures the next largest share, and so on. You then use these components as your new predictors. The tradeoff is that the components are harder to interpret than the original variables, since each one is a mixture.
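The idea can be sketched with NumPy's singular value decomposition, which is how PCA is typically computed, on synthetic body-fat-style measurements (the data and names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200

# Three correlated measurements of the same underlying quantity.
base = rng.normal(size=n)
triceps = base + 0.3 * rng.normal(size=n)
thigh = base + 0.3 * rng.normal(size=n)
midarm = base + 0.3 * rng.normal(size=n)

X = np.column_stack([triceps, thigh, midarm])
Xc = X - X.mean(axis=0)                  # center each column before PCA

# SVD of the centered data: rows of Vt are the component loading vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T                   # component scores: the new predictors

# Off-diagonal correlations between components are (numerically) zero.
R = np.corrcoef(components.T)
print(np.round(R, 6))
```

The components are mutually uncorrelated by construction, which is what makes them safe to use together as regression predictors; the price, as noted above, is that each one is a blend of the original measurements.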

Ridge regression takes a different approach. Standard regression fails completely when variables are perfectly collinear because the math has no unique solution. Ridge regression adds a penalty, proportional to the sum of the squared coefficients, to the least-squares objective, which forces a unique solution to exist and produces stable coefficient estimates even under perfect collinearity. The penalty shrinks coefficients toward zero, introducing a small amount of bias in exchange for a large reduction in variance. LASSO regression works similarly but tends to set some coefficients exactly to zero, effectively selecting a subset of variables. One key difference: LASSO can also break down under perfect collinearity, while ridge regression handles it cleanly.
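The contrast is easy to show with NumPy. Under exact collinearity the normal-equations matrix X'X is singular, so ordinary least squares has no unique solution, while the ridge system (X'X + λI) is always invertible. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = 2 * x1                      # exact collinearity: x2 is exactly double x1
y = 3 * x1 + rng.normal(0, 0.1, n)

X = np.column_stack([x1, x2])
rank = np.linalg.matrix_rank(X.T @ X)
print("rank of X'X:", rank)      # 1, not 2: OLS has no unique solution

# Ridge: adding lambda * I makes the system invertible.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print("ridge coefficients:", beta_ridge)
```

Ridge cannot say how the credit should be split between x1 and x2 either (no method can, from this data), but it returns stable coefficients whose combined effect, beta1 + 2·beta2, recovers the true slope of 3.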

For simpler situations, centering your variables (subtracting the mean from each value) can help, particularly when your model includes interaction terms or polynomial terms like “age squared.” Centering won’t fix collinearity between two distinct predictors, but it sharply reduces the artificial collinearity between a variable and its own square or interactions, which arises whenever the variable’s values sit far from zero.
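The polynomial case is simple to verify: for ages concentrated between 20 and 60, age and age-squared are almost perfectly correlated, and centering nearly eliminates the overlap. A NumPy sketch with synthetic ages:

```python
import numpy as np

rng = np.random.default_rng(8)
age = rng.uniform(20, 60, 500)

# Far from zero, a variable and its square move almost in lockstep.
raw = np.corrcoef(age, age ** 2)[0, 1]

# Centering puts zero at the middle of the data, breaking that link.
centered = age - age.mean()
after = np.corrcoef(centered, centered ** 2)[0, 1]

print(f"corr(age, age^2) before centering: {raw:.3f}")
print(f"corr after centering:              {after:.3f}")
```

The model's fitted values are unchanged by centering; only the coefficients' interpretation shifts (they now describe effects at the mean rather than at zero), and their estimates become much more stable.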