OLS (ordinary least squares) regression is used to measure the relationship between one or more input variables and a continuous outcome, and to predict future values of that outcome. It’s the most widely used form of linear regression in economics, psychology, public health, and machine learning. The method works by finding the line (or plane, with multiple inputs) that minimizes the sum of squared vertical distances between each data point and its predicted value.
How OLS Regression Works
The core idea is simple: you have data points scattered on a graph, and you want to draw a straight line that best represents the trend. OLS defines “best” in a specific way. It calculates the vertical distance between each actual data point and the line, squares each of those distances, and then adds them all up. The line that produces the smallest total is the OLS solution.
Squaring the distances serves two purposes. It prevents positive and negative errors from canceling each other out, and it penalizes large errors more heavily than small ones. A data point that’s 10 units away from the line contributes 100 to the total, while a point just 2 units away contributes only 4. This means OLS naturally tries harder to pull the line toward points that are far away from it.
Finding the exact line involves calculus. For a model with one input variable, OLS solves for two unknowns: the intercept (where the line crosses the vertical axis) and the slope (how steeply it rises or falls). It takes the derivative of the total squared error with respect to each unknown, sets both derivatives to zero, and solves the resulting system of equations. With multiple input variables, the same logic applies but uses matrix algebra to handle the additional unknowns simultaneously.
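The system of equations described above (the “normal equations”) can be solved directly with a few lines of NumPy. The data here is made up purely for illustration:

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: a column of ones for the intercept, then x.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X'X) b = X'y for intercept and slope.
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
```

This is the same matrix-algebra route the text describes for multiple inputs: adding more variables just means adding more columns to the design matrix.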
Three Main Uses
Predicting outcomes. Once you’ve fit an OLS model to existing data, you can plug in new input values and get a predicted outcome. Businesses use this to forecast sales or revenue based on advertising spend. Climate scientists use it to project temperature trends from historical data. Election analysts use regression models to forecast vote shares from polling numbers and demographic variables.
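As a minimal sketch of the prediction workflow, with entirely hypothetical advertising-spend and sales figures:

```python
import numpy as np

# Hypothetical monthly data: ad spend vs. sales (both in thousands).
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([120.0, 150.0, 195.0, 230.0, 260.0])

# Fit a one-variable OLS model; polyfit returns slope first.
slope, intercept = np.polyfit(spend, sales, 1)

# Plug in a new spend level the model hasn't seen to get a forecast.
predicted = intercept + slope * 60.0
```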
Measuring relationships. OLS tells you how much the outcome changes, on average, when an input variable increases by one unit, while holding other inputs constant. In economics, this might mean estimating how an extra year of education affects earnings. In psychology, it might mean quantifying how social background variables relate to attitudes measured on a standardized scale. The key detail is that the outcome variable must be a quantitative measure, not a category like “yes” or “no.”
Testing hypotheses. Researchers use OLS to determine whether a specific input variable has a statistically significant effect on the outcome. The model produces not just a coefficient for each variable but also a standard error that indicates how precisely that coefficient is estimated. If the coefficient is large relative to its standard error, you have evidence that the relationship is real rather than a product of random noise in the data.
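The coefficient, standard error, and t-statistic can all be computed from first principles. A sketch with synthetic data, using the usual formulas (residual variance with n − 2 degrees of freedom for a two-parameter model):

```python
import numpy as np

# Illustrative data with a strong linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Residual variance with n - 2 degrees of freedom (two parameters).
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - 2)

# Standard errors: square roots of the diagonal of sigma2 * (X'X)^-1.
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic for the slope: coefficient divided by its standard error.
t_slope = beta[1] / se[1]
```

A coefficient many times larger than its standard error, as here, is the kind of evidence the paragraph describes.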
Reading the Results
Two numbers matter most when interpreting an OLS model: the coefficients and R-squared.
Each coefficient tells you the expected change in the outcome for a one-unit increase in that input variable. If you’re modeling home prices and the coefficient on square footage is 150, that means each additional square foot is associated with a $150 increase in price, all else being equal. The intercept represents the predicted outcome when all input variables equal zero, which sometimes has a meaningful interpretation and sometimes doesn’t.
R-squared measures how well the model fits the data overall. It represents the proportion of variation in the outcome that your input variables collectively explain. An R-squared of 0.80 means the model accounts for 80% of the variation in the outcome, leaving 20% unexplained. An R-squared of 0 means the model does no better than simply predicting the average outcome for every observation. Perfect prediction yields an R-squared of 1, which almost never happens with real data.
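R-squared follows directly from its definition as one minus the ratio of unexplained to total variation. A short sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

ss_resid = np.sum((y - fitted) ** 2)    # variation the model leaves unexplained
ss_total = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_resid / ss_total
```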
When OLS Is the Right Choice
The primary decision point is the nature of your outcome variable. OLS is designed for continuous outcomes: things like income, temperature, blood pressure, test scores, or days of hospitalization. If your outcome is categorical (survived vs. died, purchased vs. didn’t purchase), logistic regression is the appropriate tool instead. This distinction is the single most important factor in choosing between the two methods.
OLS also assumes a roughly linear relationship between the inputs and the outcome. If the true relationship is curved, a straight line will systematically miss the pattern. You can sometimes address this by transforming your variables (using the logarithm of income instead of raw income, for example), but if the underlying relationship is fundamentally nonlinear, other methods may fit better.
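The log-transformation idea can be illustrated with synthetic data that grows multiplicatively, where a straight line fits log(y) far better than raw y:

```python
import numpy as np

# Synthetic data: y is roughly e**x, so growth is multiplicative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

def r2(xv, yv):
    # R-squared of a straight-line fit to (xv, yv).
    slope, intercept = np.polyfit(xv, yv, 1)
    fitted = intercept + slope * xv
    return 1 - np.sum((yv - fitted) ** 2) / np.sum((yv - yv.mean()) ** 2)

r2_raw = r2(x, y)          # a line through exponential data fits poorly
r2_log = r2(x, np.log(y))  # the same data is nearly linear on a log scale
```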
Assumptions That Must Hold
OLS earns its reputation as the “best linear unbiased estimator” only when certain conditions are met. The Gauss-Markov theorem proves that when the errors in the model have a mean of zero and constant variance, and are uncorrelated with each other, no other linear unbiased method will produce more precise estimates. When these conditions break down, the results can become misleading.
The most commonly violated assumption is constant variance, known as homoscedasticity. This means the spread of errors should be roughly the same across all levels of the input variables. If errors fan out as the predicted value gets larger (common in financial data, where higher-value transactions have more variability), you have heteroscedasticity. The consequence is serious: the coefficient estimates remain unbiased, but they are no longer the most precise available, and the standard errors become unreliable, which can be inflated or deflated. Hypothesis tests and confidence intervals built on those standard errors can then produce incorrect conclusions.
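One standard response is heteroscedasticity-robust (White, or HC0) standard errors, which replace the constant-variance assumption with each observation's own squared residual. A sketch on synthetic data whose error spread deliberately grows with x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heteroscedastic data: noise scale grows with x.
n = 200
x = np.linspace(1.0, 10.0, n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors assume one constant error variance.
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) robust errors: the "meat" uses each squared residual,
# sandwiched between two copies of (X'X)^-1.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

The coefficients themselves are identical under both approaches; only the uncertainty attached to them changes.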
Another important assumption is that the errors are normally distributed. While OLS can still produce reasonable coefficient estimates with non-normal errors, the confidence intervals and significance tests depend on normality to be accurate, especially in smaller samples.
Checking Whether Your Model Works
After fitting an OLS model, you check its validity by examining the residuals, which are the differences between actual and predicted values. Two diagnostic plots do most of the heavy lifting.
A residuals vs. fitted values plot displays predicted values on the horizontal axis and residuals on the vertical axis. If the model is well-specified, you should see residuals scattered randomly around a horizontal line at zero, with no discernible pattern. A curved pattern suggests the relationship isn’t linear. A funnel shape (residuals spreading wider as fitted values increase) signals heteroscedasticity.
A Q-Q (quantile-quantile) plot checks whether the residuals follow a normal distribution. It plots the residuals against the values you’d expect if they were perfectly normal. Points falling along a straight diagonal line indicate normality. Severe deviations at either end suggest heavy tails or skewness in the errors, which can undermine the reliability of your significance tests.
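Both diagnostic plots can be produced with a few lines of matplotlib; the Q-Q quantiles here come from the standard library's NormalDist rather than a stats package. The data is synthetic, so in this sketch both plots should look healthy:

```python
import statistics

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Fit a simple model on synthetic linear data with normal noise.
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Residuals vs. fitted: should be a patternless band around zero.
ax1.scatter(fitted, resid, s=10)
ax1.axhline(0.0, color="gray")
ax1.set(xlabel="Fitted values", ylabel="Residuals")

# Q-Q plot: sorted residuals against standard normal quantiles.
n = resid.size
probs = (np.arange(1, n + 1) - 0.5) / n
norm_q = [statistics.NormalDist().inv_cdf(p) for p in probs]
ax2.scatter(norm_q, np.sort(resid), s=10)
ax2.set(xlabel="Normal quantiles", ylabel="Sorted residuals")

fig.savefig("ols_diagnostics.png")
```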
Sensitivity to Outliers
Because OLS minimizes squared errors, it gives disproportionate weight to extreme values. A single data point that sits far from the general pattern can pull the entire regression line toward it, distorting both the slope and intercept. Research published in PLOS ONE found that OLS has a low “breakdown point,” meaning even one outlier can meaningfully bias the results. In a study of antiplatelet therapy costs, for instance, OLS produced lower estimates of treatment benefit than robust alternatives specifically because outliers in the cost data dragged the estimates down.
This sensitivity is both a strength and a weakness. In clean data, it produces highly efficient estimates. In messy real-world data, especially financial or medical cost data where extreme values are common, it may require supplementary techniques. Identifying and investigating outliers before finalizing your model is a standard and necessary step in any OLS analysis.
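The pull of a single outlier is easy to demonstrate. The sketch below corrupts one point in otherwise perfectly linear data and compares the OLS slope against a Theil-Sen estimate (the median of all pairwise slopes), one simple robust alternative:

```python
from itertools import combinations

import numpy as np

# Clean linear data (y = 2x + 1) plus one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2.0 * x + 1.0
y_out = y.copy()
y_out[-1] = 60.0  # one corrupted point (the true value is 17)

# OLS slope with and without the outlier.
slope_clean, _ = np.polyfit(x, y, 1)
slope_ols, _ = np.polyfit(x, y_out, 1)

# Theil-Sen: the median of all pairwise slopes shrugs off
# a single outlier that drags the OLS line far off course.
pair_slopes = [(y_out[j] - y_out[i]) / (x[j] - x[i])
               for i, j in combinations(range(len(x)), 2)]
slope_theilsen = float(np.median(pair_slopes))
```

Here the single bad point pulls the OLS slope from 2 to above 5, while the median-based estimate stays at 2.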

