What Is OLS Regression? Key Concepts and Applications

OLS regression, short for ordinary least squares regression, is the most common method for fitting a straight line (or a more complex equation) through a set of data points. It works by finding the line that makes the total squared distance between each data point and the line as small as possible. If you’ve ever seen a “line of best fit” on a scatter plot, it was almost certainly drawn using OLS.

How OLS Finds the Best Fit

Imagine you have a scatter plot with data points spread across it. You want to draw a line that captures the general trend. OLS does this by measuring the vertical gap between each data point and a candidate line. These gaps are called residuals. Some residuals are positive (the point sits above the line) and some are negative (below the line), so simply adding them up would let positive and negative errors cancel each other out. To avoid that, OLS squares each residual, then adds them all together. The “best” line is the one where that total sum of squared residuals is the smallest it can be.

For a simple case with one input variable, the model looks like this: the outcome equals an intercept plus a slope times the input, plus some error. The intercept tells you where the line crosses the vertical axis, and the slope tells you how much the outcome changes for each one-unit increase in the input. OLS calculates the exact intercept and slope that minimize the squared errors. When you have multiple input variables, the same principle applies, just extended into more dimensions.
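For the one-input case, the fit can be sketched in a few lines of numpy using the closed-form formulas. The data here is hypothetical, invented purely for illustration:

```python
import numpy as np

# Hypothetical data lying roughly along y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Closed-form OLS for one input:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residuals: the vertical gaps between the data and the fitted line
residuals = y - (intercept + slope * x)
```

One property worth knowing: when the model includes an intercept, the residuals always sum to zero, which makes a quick sanity check after fitting.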

Why Squaring the Errors Matters

Squaring serves two purposes. First, it treats overshooting and undershooting equally, since both become positive numbers. Second, it penalizes large errors more heavily than small ones. A residual of 10 contributes 100 to the sum, while a residual of 2 contributes only 4. This means OLS tries especially hard to avoid big misses, which is usually desirable but also makes it sensitive to outliers (more on that below).

The Assumptions Behind OLS

OLS will always produce a line, but that line is only trustworthy under certain conditions. These conditions come from a result in statistics called the Gauss-Markov theorem, which says that when the assumptions hold, OLS gives you the most precise estimates among all linear unbiased methods, that is, among all estimators that combine the data linearly and are correct on average.

The key assumptions are:

  • Linearity. The true relationship between your variables is linear, or at least close enough that a line is a reasonable approximation.
  • Zero-mean errors. On average, the errors balance out to zero. The model isn’t systematically too high or too low.
  • Constant variance (homoscedasticity). The spread of the errors is roughly the same across all values of the input variable. If errors fan out as the input grows, this assumption is violated.
  • Uncorrelated errors. One data point’s error doesn’t predict another’s. This matters especially with time-series data, where today’s value often influences tomorrow’s.
  • No perfect multicollinearity. When using multiple inputs, none of them can be an exact linear combination of the others.
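The last assumption is the easiest to check mechanically: a design matrix with an exact linear dependence among its columns loses rank. A small numpy sketch with made-up columns:

```python
import numpy as np

# Hypothetical design matrix whose last column is an exact linear
# combination of the others (x3 = x1 + 2 * x2)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])
x3 = x1 + 2 * x2
X = np.column_stack([np.ones(4), x1, x2, x3])

# Full column rank would mean no perfect multicollinearity;
# here the rank falls short of the number of columns
rank = np.linalg.matrix_rank(X)
```

When the rank is less than the number of columns, the usual OLS formula has no unique solution, which is why perfect multicollinearity must be ruled out.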

When these assumptions hold, OLS is called the “best linear unbiased estimator,” often abbreviated BLUE. “Best” here means it has the lowest possible sampling variability, so your estimates bounce around less from sample to sample. If you further assume the errors follow a normal (bell-curve) distribution, OLS also turns out to be identical to maximum likelihood estimation, another widely used statistical technique. In other words, under normality, no unbiased method can extract more information from the data than OLS can.

Reading the Output

When you run an OLS regression, the software gives you several pieces of information for each input variable. The most important are the coefficient, the standard error, and the p-value.

The coefficient (also called the slope) tells you the direction and size of the relationship. A positive coefficient means the outcome tends to increase as the input increases. A negative coefficient means the outcome tends to decrease. For example, if you’re predicting house prices and the coefficient on square footage is 150, that means each additional square foot is associated with a $150 increase in price, holding everything else constant.

The standard error tells you how precise that estimate is. A small standard error relative to the coefficient suggests a stable, reliable estimate. A large one suggests the estimate could shift a lot with different data.

The p-value tests whether the relationship you see in your sample is likely to exist in the broader population, or whether it could easily be a fluke of random variation. A p-value below 0.05 is the conventional threshold for calling a result “statistically significant,” meaning that if there were truly no relationship at all, you would see one at least this strong less than 5% of the time. Variables with p-values above 0.05 are often dropped from the final model, though this kind of automatic pruning is debated and is best guided by subject-matter reasoning as well as the statistics.
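These quantities are straightforward to compute by hand in the one-input case. The sketch below uses hypothetical data; the p-value itself would come from a t distribution (for example via scipy.stats.t), so only the t-statistic is computed here:

```python
import numpy as np

# Hypothetical data with a clear upward trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.9])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)

# Residual variance uses n - 2 degrees of freedom (two estimated parameters)
s2 = np.sum(residuals ** 2) / (n - 2)
se_slope = np.sqrt(s2 / sxx)

# t-statistic for the null hypothesis "true slope is zero"; the p-value
# would come from a t distribution with n - 2 degrees of freedom,
# e.g. 2 * scipy.stats.t.sf(abs(t_stat), n - 2)
t_stat = slope / se_slope
```

A t-statistic far above 2 in absolute value corresponds to a p-value well below 0.05, which is how software arrives at the significance stars in its output.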

Measuring How Well the Model Fits

The most common measure of overall model quality is R-squared. It represents the proportion of variation in your outcome that the model explains. An R-squared of 0.80 means the model accounts for 80% of the variability in the data, with the remaining 20% unexplained.

R-squared has a known weakness, though: it never decreases when you add more input variables, even if those variables are irrelevant. Because OLS minimizes squared errors, any additional variable can only help (or at worst do nothing) in the sample you’re working with. This makes the standard R-squared an optimistic estimate of how well your model would perform on new data.

Adjusted R-squared corrects for this by penalizing the addition of variables that don’t meaningfully improve the model. The gap between R-squared and adjusted R-squared grows when the sample size is small, when many predictors are included, or when R-squared itself is low. If you’re comparing models with different numbers of inputs, adjusted R-squared is the more honest metric.
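Both measures follow directly from the sums of squares. A numpy sketch on hypothetical data, where p is the number of input variables:

```python
import numpy as np

# Hypothetical data for a one-input model (p = 1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)

n, p = len(x), 1
ss_res = np.sum(residuals ** 2)           # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Note that adj_r2 is always at most r2, and the gap widens as p grows relative to n, which is exactly the penalty described above.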

Checking Your Work With Residual Plots

The best way to verify that OLS assumptions hold is to look at the residuals after fitting the model. Two plots are especially useful.

A residuals-versus-fitted-values plot places the model’s predictions on the horizontal axis and the residuals on the vertical axis. If everything looks good, you’ll see a random cloud of points with no pattern. A funnel shape (residuals spreading wider as predictions grow) signals that the constant-variance assumption is violated. A curved pattern means the relationship isn’t truly linear and you may need to transform your variables or use a different model.

A normal Q-Q plot compares the distribution of your residuals to what a perfect bell curve would look like. If the points fall roughly along a diagonal line, the normality assumption is reasonable. Large departures at the tails suggest skewed or heavy-tailed errors, which can affect the reliability of your p-values and confidence intervals.
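The plots themselves are usually drawn with a library such as matplotlib, but the symptom a residuals-versus-fitted plot reveals can also be checked numerically. The sketch below generates hypothetical data whose error spread grows with the input, then compares the residual spread across the two halves of the fitted values:

```python
import numpy as np

# Hypothetical data whose error spread grows with x (a funnel shape)
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y = 2 * x + 1 + rng.normal(0.0, 0.2 * x)   # noise scale proportional to x

# Fit by OLS and form fitted values and residuals
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x
residuals = y - fitted

# Crude numeric stand-in for the plot: compare residual spread in the
# lower and upper halves of the fitted values
low = residuals[fitted < np.median(fitted)]
high = residuals[fitted >= np.median(fitted)]
# high.std() substantially exceeding low.std() signals heteroscedasticity
```

Formal versions of this idea exist as heteroscedasticity tests, but eyeballing the plot (or the two spreads) catches most problems in practice.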

The Outlier Problem

Because OLS squares the residuals, extreme data points exert outsized influence on the results. A single observation far from the rest of the data can tilt the entire regression line toward it. Technically, the breakdown point of OLS equals one divided by the sample size, meaning even one sufficiently extreme outlier can make the estimates unreliable.

There are a few ways to handle this. Transforming the data (for instance, taking the logarithm of a skewed variable) can compress extreme values and reduce their pull. Removing outliers directly is sometimes done, but deleting data points without a clear justification risks introducing bias. A more principled approach is to use robust regression methods that automatically downweight extreme observations instead of eliminating them. These methods produce estimates that are less sensitive to outliers while still using all available data.
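As a sketch of the robust approach, the function below runs a simple iteratively reweighted least squares loop with Huber-style weights. The threshold k = 1.345 is a conventional choice, and the data is invented so that the last point is a gross outlier:

```python
import numpy as np

def huber_irls(x, y, k=1.345, iters=100):
    """Iteratively reweighted least squares with Huber-style weights."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from plain OLS
    for _ in range(iters):
        r = y - X @ beta
        s = max(np.median(np.abs(r)) / 0.6745, 1e-8)   # robust scale (MAD)
        w = np.where(np.abs(r) <= k * s,
                     1.0,
                     k * s / np.maximum(np.abs(r), 1e-12))
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

# Hypothetical data: five points exactly on y = 2x plus one gross outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 40.0])

ols = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
robust = huber_irls(x, y)
# ols[1] is pulled far above 2 by the outlier; robust[1] stays near 2
```

The outlier is never deleted; it simply receives a small weight once its residual grows large relative to the robust scale estimate, so the fit settles on the pattern the majority of the data supports.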

Where OLS Shows Up in Practice

OLS is the default tool across a remarkable range of fields. In economics, it’s used to estimate relationships like how consumer spending changes with income. A classic model might express consumption as a function of disposable income, where the slope represents the marginal propensity to consume: for every additional dollar of income, how many cents go toward spending. When economists believe the relationship is exponential rather than linear, they often take the logarithm of both variables and still run OLS on the transformed data, which lets them interpret the coefficient as an elasticity (a percentage change in one variable per percentage change in another).
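A small sketch of the log-log trick on synthetic data: the series is generated from y = 3 * x^0.6, so the true elasticity is 0.6, and OLS on the logged variables recovers it as the slope:

```python
import numpy as np

# Synthetic data from y = 3 * x**0.6, so the true elasticity is 0.6
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3 * x ** 0.6

# OLS on the logged variables: the slope is the elasticity
lx, ly = np.log(x), np.log(y)
slope = np.sum((lx - lx.mean()) * (ly - ly.mean())) / np.sum((lx - lx.mean()) ** 2)
# slope is approximately 0.6: a 1% increase in x goes with a 0.6% increase in y
```

With noisy real-world data the recovered slope would only approximate the true elasticity, but the interpretation of the coefficient is the same.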

In public health, OLS helps quantify the relationship between risk factors and outcomes. In marketing, it estimates how advertising spend relates to sales. In environmental science, it models how pollutant levels change over time. Virtually any situation where you want to quantify a relationship between measurable variables is a candidate for OLS.

One important caution with real-world data, especially time series: many variables trend upward or downward over time, which can create spurious correlations. Two completely unrelated variables that both happen to be growing will appear strongly correlated in an OLS model. A classic, deliberately absurd example from econometrics: Somalia’s population and U.S. GDP both increased over the same period, but one obviously doesn’t cause the other. Careful analysts check for this by examining whether the underlying data is stationary or by using techniques designed for trending data.
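The effect is easy to reproduce with simulated data: two independent random walks with drift look strongly related in levels, and the apparent relationship collapses once the series are differenced:

```python
import numpy as np

def fit_r2(a, b):
    """R-squared from a simple OLS regression of b on a."""
    slope = np.sum((a - a.mean()) * (b - b.mean())) / np.sum((a - a.mean()) ** 2)
    resid = (b - b.mean()) - slope * (a - a.mean())
    return 1 - np.sum(resid ** 2) / np.sum((b - b.mean()) ** 2)

# Two independent random walks with drift: both trend upward by construction
rng = np.random.default_rng(42)
n = 300
a = np.cumsum(1.0 + rng.normal(size=n))
b = np.cumsum(1.0 + rng.normal(size=n))

r2_levels = fit_r2(a, b)                  # substantial, despite independence
r2_diff = fit_r2(np.diff(a), np.diff(b))  # near zero once the trend is removed
```

Differencing is only one remedy; the broader point is that a high R-squared between two trending series is not evidence of a genuine relationship.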