OLS regression, short for ordinary least squares regression, is the most common method for fitting a straight line (or plane, with multiple variables) through a set of data points. It works by finding the line that minimizes the total squared vertical distance between each data point and the line’s prediction. If you’ve ever seen a “line of best fit” drawn through a scatterplot, OLS is almost certainly how it was calculated.
How OLS Finds the Best Fit
Imagine you have a scatterplot with data points scattered around a general trend. You could draw many possible lines through those points. OLS picks the one where the sum of all squared errors is as small as possible. An “error” here is simply the vertical gap between a real data point and the value the line predicts for it. These gaps are called residuals.
Why squared? Squaring does two things. It makes all the gaps positive, so points above and below the line don’t cancel each other out. And it penalizes large misses more heavily than small ones, pushing the line toward a position that avoids big errors rather than tolerating a few large misses in exchange for many small ones. The math behind this produces a direct formula for the line’s slope and intercept, so there’s no guessing or trial and error involved. Given your data, there is exactly one OLS solution.
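For a single predictor, that direct formula is simple enough to compute by hand. The sketch below does exactly that; the data are made up purely for illustration.

```python
# Closed-form OLS for one predictor: slope and intercept from the data alone.
# The numbers here are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# slope = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
        / sum((xi - x_mean) ** 2 for xi in x)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # one unique solution, no iteration needed
```

For these points the formula lands on a slope of about 1.96 and an intercept of about 0.14; with any other dataset the same two lines of arithmetic give the unique OLS answer.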
What the Output Tells You
An OLS regression produces coefficients: one intercept and one slope for each predictor variable. The intercept is the predicted value of your outcome when every predictor equals zero. Each slope coefficient tells you how much the outcome changes, on average, for a one-unit increase in that predictor, holding all other predictors constant.
For example, if you regress writing test scores on a binary variable for student sex, a coefficient of 5.4 means one group scores about 5.4 points higher on average than the other. If your predictor or outcome is log-transformed, the interpretation shifts to percentage changes or ratios rather than raw units, but the underlying logic is the same: the coefficient quantifies how much one variable moves when another changes.
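With a single binary predictor, the slope has an especially concrete meaning: it equals the difference between the two group means. The sketch below verifies that with a small invented dataset (the 0/1 indicator and scores are hypothetical).

```python
# With one binary predictor, the OLS slope equals the difference in group means.
# Both the indicator and the scores below are made up for illustration.
group = [0, 0, 0, 1, 1, 1]            # hypothetical 0/1 group indicator
score = [50.0, 52.0, 51.0, 56.0, 57.0, 58.0]

mean0 = sum(s for g, s in zip(group, score) if g == 0) / 3
mean1 = sum(s for g, s in zip(group, score) if g == 1) / 3

# The same closed-form slope formula as for any single predictor
x_mean = sum(group) / len(group)
y_mean = sum(score) / len(score)
slope = sum((g - x_mean) * (s - y_mean) for g, s in zip(group, score)) \
        / sum((g - x_mean) ** 2 for g in group)

print(slope, mean1 - mean0)  # the two quantities agree
```

Here both come out to 6.0: group 1 scores six points higher on average, and that is exactly what the coefficient reports.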
Testing Whether a Relationship Is Real
Every coefficient in an OLS model comes with a p-value. This tests a specific question: could this coefficient be zero in the broader population, with the pattern you see arising from chance alone? That’s the null hypothesis. If the p-value falls below 0.05, the conventional threshold, you reject that null hypothesis and conclude the relationship is statistically significant.
The p-value is calculated using a t-distribution, which accounts for sample size. With very small samples, the bar for significance is higher because there’s more uncertainty. A statistically significant result doesn’t mean the effect is large or practically important. It only means the data provide enough evidence that the relationship isn’t zero.
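To make the mechanics concrete, the sketch below fits a one-predictor model by hand, builds the t-statistic as the slope divided by its standard error, and converts it to a two-sided p-value using SciPy’s t-distribution. The data are invented for illustration.

```python
# Sketch: t-statistic and p-value for an OLS slope (illustrative data).
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.2, 4.8, 6.1, 6.7, 8.0, 8.6]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
intercept = y_mean - slope * x_mean

# Residual variance uses n - 2 degrees of freedom (slope and intercept)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r ** 2 for r in resid) / (n - 2)
se_slope = (s2 / sxx) ** 0.5

t_stat = slope / se_slope                       # null hypothesis: slope == 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided tail probability
print(t_stat, p_value)
```

Note the degrees of freedom, n − 2, shrinking with sample size: with fewer observations the t-distribution has heavier tails, so the same t-statistic yields a larger p-value, which is exactly the higher bar for significance described above.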
Measuring How Well the Model Fits
R-squared is the most common measure of model performance. It represents the proportion of variation in your outcome that the model explains, ranging from 0 (explains nothing) to 1 (explains everything). An R-squared of 0.60 means your predictors account for 60% of the variation in the outcome.
There’s a catch. R-squared can never decrease when you add another predictor, even a useless one, because more variables always allow the model to fit the existing data a little better. This makes standard R-squared an optimistic estimate of how well your model would perform on new data. Adjusted R-squared corrects for this by penalizing the addition of predictors that don’t improve the model enough to justify their inclusion. It can actually go down when you add a weak variable, making it more honest about your model’s true explanatory power.
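The sketch below demonstrates this on simulated data: it adds a predictor that is pure noise and checks that R-squared does not decrease, while adjusted R-squared applies its penalty. (Adjusted R-squared will often, though not always, drop when a useless variable is added; R-squared never drops.)

```python
# Sketch: R^2 never decreases when a predictor is added, even a useless one.
# All data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                      # predictor unrelated to y
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

def r2_and_adjusted(preds, y):
    """Fit OLS by least squares; return (R^2, adjusted R^2)."""
    X = np.column_stack([np.ones(len(y))] + preds)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = ((y - X @ beta) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    p = X.shape[1] - 1                         # number of predictors
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

r2_small, adj_small = r2_and_adjusted([x1], y)
r2_big, adj_big = r2_and_adjusted([x1, junk], y)
print(r2_big >= r2_small)   # True: R^2 cannot drop when a predictor is added
```

Adjusted R-squared is always less than or equal to plain R-squared, and the gap widens as you pile on predictors relative to your sample size.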
The Assumptions Behind OLS
OLS isn’t just a convenient formula. Under certain conditions, it’s provably the best option. The Gauss-Markov theorem states that when the following assumptions hold, OLS produces estimates with the lowest possible variance among all linear, unbiased estimators:
- Linearity: The relationship between predictors and the outcome is linear (or at least well-approximated by a line).
- Zero mean errors: The errors average out to zero. The model isn’t systematically over- or under-predicting.
- Constant variance: The spread of errors is roughly the same across all values of the predictors. This property is called homoscedasticity.
- Uncorrelated errors: One data point’s error doesn’t predict another’s. This matters especially with time-series data, where consecutive observations can be related.
When these assumptions hold, OLS gives you the tightest, most reliable estimates possible without needing more complex methods. When they don’t, the estimates may still be unbiased, but you lose the guarantee that they’re the most precise available.
Checking Your Model With Residual Plots
The most practical way to check whether your assumptions hold is to plot the residuals (the gaps between predicted and actual values) against the fitted values. A well-behaved model produces a residual plot that looks like a random cloud centered on zero, with no obvious pattern.
If you see a curve, such as residuals that are positive for low predicted values, negative in the middle, and positive again for high values, your relationship isn’t linear and you may need a different model structure. If you see a fan or funnel shape where residuals spread out (or tighten) as predicted values increase, your error variance isn’t constant. Both patterns signal that OLS results may be unreliable, and they’re easy to spot visually even without formal statistical tests.
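You can see the curved pattern numerically as well as visually. The sketch below deliberately fits a straight line to quadratic data; the residuals come out positive at both ends and negative in the middle, the classic sign of a missed nonlinearity. The data are constructed for illustration.

```python
# Sketch: residuals reveal nonlinearity when a line is fit to curved data.
import numpy as np

x = np.linspace(0, 10, 50)
y = 0.5 * x ** 2                     # a deliberately nonlinear relationship

# Fit a straight line by least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# A well-behaved residual plot looks like random noise around zero; here the
# residuals are positive at the ends and negative in the middle (a U shape).
print(resid[0] > 0, resid[len(x) // 2] < 0, resid[-1] > 0)
```

Plotting `resid` against the fitted values would show the same U shape at a glance, which is why the residual plot is the standard first diagnostic.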
Where OLS Struggles
OLS performs well when the data are clean and the assumptions are reasonably met. Two problems can seriously undermine it.
Outliers, data points that fall far from the general pattern, have an outsized effect on OLS because squaring the errors magnifies their influence. A single extreme point can pull the entire regression line toward it, distorting coefficients and inflating error estimates.

Multicollinearity, which occurs when predictor variables are highly correlated with each other, is the other common problem. When two predictors move in lockstep, the model can’t reliably separate their individual effects, leading to unstable coefficients that may swing wildly with small changes in the data. When both problems occur simultaneously, OLS estimates become especially unreliable.
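The pull of a single outlier is easy to demonstrate. In the sketch below, ten points lie exactly on the line y = x, so the fitted slope is 1; replacing just the last y-value with an extreme one roughly doubles the slope. The data are made up for illustration.

```python
# Sketch: one extreme point can drag the OLS slope far from the true trend.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = x.copy()               # a perfect y = x relationship ...
y_out = y.copy()
y_out[-1] = 30.0           # ... except one wild outlier in place of 10.0

def fitted_slope(x, y):
    """Slope of the least-squares line through (x, y)."""
    return np.polyfit(x, y, 1)[0]

print(fitted_slope(x, y), fitted_slope(x, y_out))
```

The clean fit recovers a slope of exactly 1.0, while the contaminated fit jumps above 2: squaring the outlier’s large residual makes moving the whole line cheaper than leaving that one point far away.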
How Much Data You Need
Rules of thumb for minimum sample sizes vary. Some guidelines suggest at least 50 observations plus 8 per predictor variable. Others recommend 10 to 20 observations per predictor. A study published in PLOS One found that a minimum of 25 observations is needed for data with moderate to high variance, while as few as 8 can work when the data follow a very tight pattern with little spread. In practice, more data nearly always helps, particularly when you have several predictors or expect noisy measurements.
Common Uses in Research
OLS regression appears across virtually every field that works with quantitative data. In medical research, it’s used to assess relationships between predictors like treatment group or patient sex and outcomes like blood pressure, and to predict outcome values based on one or more measured variables. In economics, it models how wages relate to education and experience. In ecology, it links species counts to environmental conditions. The method’s simplicity, interpretability, and strong theoretical foundation make it the default starting point for understanding relationships in data, even when researchers eventually move on to more specialized techniques.

