What Are the Assumptions of Linear Regression?

Linear regression relies on a specific set of assumptions about your data. When these assumptions hold, the model produces reliable, unbiased estimates. When they’re violated, your coefficients, p-values, or confidence intervals can become misleading. There are five core assumptions, and understanding each one helps you judge whether your regression results are trustworthy.

Linearity Between Variables

The most fundamental assumption is that the relationship between your independent variables and the outcome is linear. More precisely, the expected value of the outcome is a straight-line function of each predictor, holding the others fixed. The slope of that line doesn’t depend on the values of other variables, and the effects of different predictors are additive, meaning they stack on top of each other rather than interacting in complex ways.

This doesn’t mean your raw data has to look like a perfect straight line. It means the underlying relationship between predictors and the average outcome follows a linear pattern. You can still include squared terms or log-transformed variables in a linear regression, because “linear” here refers to the model’s coefficients, not the shape of the predictors themselves. A model like “outcome = 3 + 2(age) + 5(age squared)” is still linear in its coefficients even though the relationship with age curves.
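To make "linear in the coefficients" concrete, here is a small numpy-only sketch with made-up data. The true relationship matches the example above (outcome = 3 + 2·age + 5·age²); the model is still fit by ordinary least squares because age² simply becomes another column in the design matrix.

```python
import numpy as np

# Hypothetical data: a curved relationship that is still "linear"
# in the coefficients, because age**2 enters as its own predictor.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
outcome = 3 + 2 * age + 5 * age**2 + rng.normal(0, 1, size=200)

# Design matrix with an intercept column, age, and age squared.
X = np.column_stack([np.ones_like(age), age, age**2])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
# coef should land close to [3, 2, 5]: ordinary least squares
# recovers the coefficients even though the outcome curves with age.
```

The fit is a straight least-squares solve; nothing about the curvature requires a non-linear estimator.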

If the true relationship is non-linear and you force a straight line through it, your residual plot will show a telltale pattern, for example residuals that are positive for small values of the predictor, negative in the middle, and positive again for large values. Any systematic, non-random pattern like this signals that a linear model isn’t capturing the real structure in your data.
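The positive-negative-positive pattern is easy to reproduce with simulated data. In this sketch the true relationship is a parabola, a straight line is forced through it, and the average residual is computed in the left, middle, and right thirds of the predictor's range:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 400)
y = x**2 + rng.normal(0, 0.5, 400)   # the true relationship is curved

# Force a straight line through it anyway.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Average residual by region of the predictor: positive, then
# negative, then positive again, the U-shaped pattern described above.
left = resid[x < -1.5].mean()
middle = resid[np.abs(x) <= 1.5].mean()
right = resid[x > 1.5].mean()
```

A plot of resid against x would show the same thing visually; the slice means just make the pattern checkable without a chart.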

Independence of Observations

Each observation in your dataset should be independent of the others. The value of the outcome for one data point should not influence or predict the outcome for another. This assumption is especially important in time-series data, where measurements taken close together in time often share similar values, a phenomenon called autocorrelation.

Positive autocorrelation means that when one residual is high, the next one tends to be high too. You can spot this by plotting residuals in the order they were collected. If the points scatter randomly around zero, independence holds. If they trend upward and downward in waves, or show a cyclic pattern (which can reflect systematic environmental changes like shift rotations), the assumption is violated. The Durbin-Watson test is a formal way to check for autocorrelation, but the visual pattern is often enough to identify the problem.
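The Durbin-Watson statistic itself is simple enough to compute by hand: the sum of squared successive differences of the residuals divided by their sum of squares. This sketch, using simulated residuals rather than a fitted model, shows the statistic sitting near 2 for independent residuals and well below 2 for positively autocorrelated ones:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic computed by hand: near 2 means no
    autocorrelation; values well below 2 mean positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
white = rng.normal(size=500)       # independent residuals

ar = np.empty(500)                 # AR(1) residuals: each value carries
ar[0] = rng.normal()               # 80% of the previous one
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

dw_white = durbin_watson(white)    # close to 2
dw_ar = durbin_watson(ar)          # well below 2
```

In practice you would run this on the residuals of your fitted model, in collection order, rather than on simulated series.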

When observations aren’t independent, there’s no simple fix within ordinary least squares. The standard errors your model produces will be wrong, which means your confidence intervals and p-values won’t reflect the actual uncertainty in your estimates. Time-series models or mixed-effects models are typically better suited for data with built-in dependencies.

Constant Variance (Homoscedasticity)

The spread of your residuals should be roughly the same across all levels of your predictors. In technical terms, the variance of the errors is constant. When this holds, a plot of residuals versus fitted values looks like a uniform band of points. When it doesn’t, you’ll see a funnel shape, where the residuals fan out (or narrow) as the predicted values increase.

This assumption matters more than many analysts realize. Research published in General Psychiatry found that regression inference is actually more sensitive to violations of constant variance than to violations of normality. When variance differs across groups, the rate of false positives (incorrectly concluding an effect exists) climbs above the expected level. And unlike some other assumption violations, this problem gets worse as your sample size increases from 10 to 100 to 1,000: larger samples shrink your standard errors and sharpen your tests, but they do nothing to fix the variance mis-specification, so the test becomes more confidently wrong rather than more accurate.

If your residuals show unequal spread, you have several options. Weighted least squares adjusts for the unequal variance by giving less weight to observations with larger residual spread. Robust standard errors (sometimes called Huber-White or sandwich estimators) correct the standard errors without changing the coefficient estimates themselves. Log or square root transformations of the outcome variable can also stabilize variance in many practical situations.
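Robust (sandwich) standard errors are available in most statistics packages, but the HC0 (Huber-White) version is short enough to sketch directly in numpy. Here hypothetical data is simulated with variance that grows with the predictor; the classical standard error assumes one shared variance, while the sandwich version keeps each observation's own squared residual:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 0.2 * x**2, n)   # spread grows with x

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Classical OLS standard errors assume one shared error variance.
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))
# HC0 sandwich: each observation keeps its own squared residual, so
# the estimate stays valid when the variance is unequal.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Note that the coefficient estimates themselves are untouched; only the standard errors change, with the robust slope error coming out larger here because the classical formula understates the uncertainty.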

Normality of Residuals

The errors in a linear regression model are assumed to follow a normal distribution. This is specifically about the residuals, not the raw outcome variable or the predictors. A common misconception is that your dependent variable itself needs to be normally distributed. It doesn’t. Only the errors, the differences between observed and predicted values, need to approximate a bell curve.

Normality is primarily needed for valid hypothesis testing. Violations do not bias your coefficient estimates. Your model’s best guesses about the effect of each predictor remain accurate even when residuals aren’t perfectly normal. What goes wrong is inference, especially in small samples: the test statistics no longer follow the t-distributions your p-values and confidence intervals are computed from. If the residuals are heavily skewed, your p-values may be too small or too large.

The good news is that this assumption becomes less important as your sample grows. With large samples (generally more than 10 observations per variable in the model), the central limit theorem kicks in and the sampling distribution of the coefficients approaches normality regardless of the residual distribution. A Q-Q plot, which compares your residuals to what you’d expect from a normal distribution, is the standard visual check. Points falling along a diagonal line indicate normality. Systematic departures at the tails suggest skewness or heavy-tailed data.
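The numbers behind a Q-Q plot can be computed without drawing anything. scipy's probplot pairs each sorted residual with the quantile a normal distribution would predict, and also reports the correlation r of that pairing: r near 1 means the points hug the diagonal. This sketch contrasts simulated normal residuals with heavily skewed ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_resid = rng.normal(0, 1, 300)          # well-behaved residuals
skewed_resid = rng.exponential(1, 300) - 1    # heavily skewed residuals

# probplot returns the (theoretical, ordered) quantile pairs plus a
# straight-line fit; r close to 1 indicates approximate normality.
(_, _), (_, _, r_normal) = stats.probplot(normal_resid)
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid)
```

Passing a matplotlib axes to probplot via its plot argument produces the familiar Q-Q chart; the correlation alone is enough to see that the skewed residuals fit the diagonal noticeably worse.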

No Perfect Multicollinearity

Your predictor variables cannot be perfect linear functions of each other. If one predictor can be calculated exactly from another (for example, temperature in Celsius and temperature in Fahrenheit in the same model), the regression cannot produce unique coefficient estimates. The estimation genuinely breaks down: the matrix of predictors becomes singular, so there is no unique solution and no way to separate the effect of one variable from the other.

Perfect collinearity is rare in practice and usually easy to catch. The more common problem is high (but not perfect) multicollinearity, where predictors are strongly correlated without being exact duplicates. This doesn’t violate the assumption in a strict sense, but it inflates the standard errors of your coefficients, making it harder to detect real effects. Your model might show non-significant results for predictors that genuinely matter.

The Variance Inflation Factor (VIF) is the standard diagnostic tool. A VIF of 1 means no correlation between a predictor and the others. Values above 5 to 10 indicate problematic multicollinearity. You can address this by removing one of the correlated predictors, combining them into a single index, or using techniques like ridge regression that are designed to handle correlated inputs.
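The VIF has a direct definition you can implement by hand: regress each predictor on all the others and compute 1 / (1 − R²). This numpy-only sketch, with simulated predictors, shows an uncorrelated predictor landing near 1 and a near-duplicate predictor blowing up:

```python
import numpy as np

def vif(X, j):
    """VIF for column j of X: regress it on the other columns (plus an
    intercept) and return 1 / (1 - R^2). Hand-rolled sketch of the
    usual diagnostic."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    fitted = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)             # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 500)     # nearly a copy of x1 -> huge VIF
X = np.column_stack([x1, x2, x3])
```

In real work you would typically call a library implementation (statsmodels ships one), but the hand-rolled version makes the "regress each predictor on the rest" definition explicit.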

How to Check These Assumptions

Most assumption checks come down to examining residuals. A residual versus fitted values plot is the single most useful diagnostic. In a well-behaved model, this plot shows a random scatter of points centered around zero with no discernible pattern. A curved pattern suggests non-linearity. A funnel shape suggests unequal variance. Clusters or trends suggest dependence.

A Q-Q plot handles the normality check. Plot your residuals against the theoretical values you’d expect from a normal distribution. If the points hug the diagonal line, normality holds. Deviations at the ends of the plot indicate heavy tails or skewness. For multicollinearity, calculate VIF values for each predictor before interpreting your results.

These checks should happen after fitting the model, not before. You’re evaluating whether the model’s errors behave the way the theory requires, so you need the model’s output (specifically the residuals) to run the diagnostics.

What to Do When Assumptions Fail

Not all violations are equally damaging, and not all require the same fix. Non-linearity is arguably the most serious because it means your model is fundamentally wrong about the relationship in the data. Adding polynomial terms, using splines, or transforming variables can help capture curved relationships.

For unequal variance, robust standard errors are the most popular modern fix because they correct your inference without requiring you to restructure the model. Weighted least squares is another option that gives less influence to observations where the variance is highest. Bootstrapping, which draws thousands of random samples from your data to estimate standard errors empirically, works well for both non-normality and unequal variance problems.
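The bootstrap idea in the paragraph above (case resampling: refit the model on random rows drawn with replacement, then take the spread of the refitted slopes as the standard error) fits in a few lines of numpy. The data here is simulated with unequal variance to match the scenario being discussed:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 1 + 0.3 * x, n)   # unequal variance

X = np.column_stack([np.ones(n), x])

def slope(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Case resampling: draw rows with replacement, refit, and use the
# spread of the refitted slopes as an empirical standard error.
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(slope(X[idx], y[idx]))
boot = np.asarray(boot)
se_boot = boot.std(ddof=1)
```

Because each bootstrap sample keeps predictor and outcome rows together, the resampled datasets inherit whatever variance structure the original data has, which is why this works without modeling the heteroscedasticity explicitly.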

For autocorrelation in time-series data, transforming both the outcome and predictor using lagged values can remove the dependency structure. For multicollinearity, the simplest approach is dropping redundant predictors or combining highly correlated variables into a composite score. When the issue is mild, simply being aware that your standard errors are inflated may be enough, especially if your coefficients are still statistically significant despite the inflation.
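The lag-transformation idea for autocorrelation can be sketched in the style of a Cochrane-Orcutt correction (an assumption on my part; the text does not name a specific procedure): estimate the autocorrelation rho from the residuals, then quasi-difference both sides, y*ₜ = yₜ − rho·yₜ₋₁ and x*ₜ = xₜ − rho·xₜ₋₁, and refit. With simulated AR(1) errors:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
e = np.empty(n)                    # AR(1) errors with rho = 0.7
e[0] = rng.normal()
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1 + 2 * x + e

# Fit once, estimate rho from the lag-1 residual relationship.
X = np.column_stack([np.ones(n), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
rho = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])

# Quasi-difference both outcome and predictor, then refit.
y_star = y[1:] - rho * y[:-1]
x_star = x[1:] - rho * x[:-1]
Xs = np.column_stack([np.ones(n - 1), x_star])
beta = np.linalg.lstsq(Xs, y_star, rcond=None)[0]
resid_star = y_star - Xs @ beta

# Durbin-Watson on the transformed residuals should be back near 2.
dw_after = np.sum(np.diff(resid_star) ** 2) / np.sum(resid_star ** 2)
```

The slope estimate survives the transformation (the intercept is rescaled by 1 − rho), and the dependency structure in the residuals is largely removed.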