What the Method of Least Squares Is and How It Works

The method of least squares is a mathematical technique for finding the line (or curve) that best fits a set of data points. It works by minimizing the total squared distance between each observed data point and the predicted value on the line. First published by the French mathematician Adrien-Marie Legendre in 1805, it remains the most widely used approach for fitting models to data in statistics, engineering, economics, and the sciences.

The Core Idea: Why Square the Errors?

Imagine you’ve plotted a handful of data points on a graph and you want to draw a straight line through them. No single line will pass through every point perfectly, so each point will sit some distance above or below your line. That vertical distance between a data point and the line is called a “residual,” and it represents how far off your prediction is for that particular observation.

The method of least squares finds the line where the sum of all those residuals, squared and added together, is as small as possible. If you have 20 data points, you square each of the 20 residuals and add them up. The line that produces the smallest total is the least squares fit.

Squaring the residuals serves two purposes. First, it prevents positive and negative errors from canceling each other out. A point five units above the line and a point five units below the line would sum to zero if you just added the raw residuals, making it look like your predictions were perfect when they clearly weren’t. Second, squaring penalizes large errors more heavily than small ones. A residual of 10 contributes 100 to the total, while a residual of 2 contributes only 4. This forces the best-fit line to avoid straying too far from any single point.
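Both effects are easy to verify directly. The snippet below (a minimal illustration, not tied to any particular dataset) sums the raw and squared residuals for the two five-unit errors described above:

```python
# Two residuals of equal size but opposite sign: one point five units
# above the line, one five units below.
residuals = [5.0, -5.0]

raw_sum = sum(residuals)                    # raw errors cancel out
squared_sum = sum(r ** 2 for r in residuals)  # squared errors accumulate

print(raw_sum)      # 0.0 — misleadingly suggests a perfect fit
print(squared_sum)  # 50.0 — correctly reflects the total error

# Squaring also penalizes large errors disproportionately:
print(10 ** 2)  # a residual of 10 contributes 100
print(2 ** 2)   # a residual of 2 contributes only 4
```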

How the Best-Fit Line Is Calculated

For the simplest case, a straight line through a scatter plot, the method produces two numbers: a slope and an intercept. Together these define the equation of the line, typically written as ŷ = b₀ + b₁x, where b₀ is the intercept (where the line crosses the vertical axis) and b₁ is the slope (how steeply the line rises or falls).

The slope is calculated by dividing the sample covariance of x and y by the sample variance of x. In plain terms, the covariance measures how much x and y tend to move together, and the variance measures how spread out the x values are. Dividing one by the other tells you how much y changes, on average, for each one-unit increase in x. The intercept is then calculated by taking the average y value and subtracting the slope times the average x value. This guarantees the best-fit line always passes through the point defined by the two averages.
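The two formulas translate directly into code. The sketch below implements them in plain Python; the data points are invented for illustration and chosen to lie exactly on a known line so the recovered coefficients are easy to check:

```python
def least_squares_line(x, y):
    """Fit y = b0 + b1*x by ordinary least squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = sample covariance of x and y over sample variance of x.
    # The 1/(n-1) factors in both cancel, so they are omitted.
    cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    var_x = sum((xi - x_mean) ** 2 for xi in x)
    slope = cov_xy / var_x
    # Intercept = mean of y minus slope times mean of x, which forces
    # the line through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Points lying exactly on y = 1 + 2x recover those coefficients.
b0, b1 = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```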

These formulas have a clean, closed-form solution, meaning you can compute them in one pass through the data without trial and error. That mathematical convenience is one reason the method became so dominant. For models with multiple predictors (say, predicting house price from square footage, number of bedrooms, and neighborhood), the same principle applies but the algebra extends into matrix form.
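The matrix form can be sketched with NumPy. Each row of the matrix X holds one observation: a leading 1 for the intercept, then one column per predictor. The data values here are invented for illustration, constructed so the exact coefficients are known:

```python
import numpy as np

# Design matrix: a column of ones for the intercept, then two predictors.
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
])
# Responses generated from y = 1 + 1*x1 + 2*x2, so the fit is exact.
y = np.array([6.0, 5.0, 12.0, 11.0])

# np.linalg.lstsq solves the least squares problem directly, which is
# numerically safer than forming and inverting XᵀX by hand.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # [1. 1. 2.] — intercept, then one coefficient per predictor
```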

What Makes It “Best”

The method of least squares isn’t just popular by convention. Under certain conditions, it produces the best possible estimates of the relationship between variables. Specifically, if the errors in your data have an average value of zero and are equally spread out (no pattern of larger errors in one region of the data than another), the least squares solution has the lowest possible variance among all unbiased linear estimators. This result is known as the Gauss-Markov theorem, and it means no other linear method can consistently get closer to the true underlying relationship.

“Unbiased” here means the estimates don’t systematically overshoot or undershoot the true values. “Lowest variance” means the estimates are as stable as possible from one dataset to the next. Together, these properties make least squares the default starting point for fitting a model to data. When the assumptions hold, you genuinely cannot do better with a linear approach.

The Outlier Problem

Because squaring amplifies large errors, the method is highly sensitive to outliers. Even a single extreme data point can pull the best-fit line dramatically toward itself, distorting the results for the rest of the data. A study in BMC Bioinformatics demonstrated that when an outlier is present, the least squares curve can be “dramatically influenced,” producing misleading parameter estimates and inflated error measures.

This happens because an outlier with a residual of, say, 50 contributes 2,500 to the sum of squares, while a typical point with a residual of 3 contributes just 9. The optimization process will bend the line toward that extreme point to reduce its outsized contribution to the total. If the outlier is a genuine measurement error rather than a real data pattern, the resulting line misrepresents the actual trend.

One alternative is to minimize the sum of absolute deviations instead of squared deviations. This approach treats a residual of 50 as only about 17 times worse than a residual of 3, rather than nearly 280 times worse. The result is a fit that’s more robust to outliers, though it sacrifices the mathematical elegance and optimality guarantees of least squares. In practice, analysts often start with least squares, check for outliers, and switch to a robust method if needed.
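The ratios quoted above can be checked in a couple of lines, using the same residuals from the text (50 for the outlier, 3 for a typical point):

```python
outlier, typical = 50.0, 3.0

# Under squared loss, the outlier's contribution relative to a typical point:
squared_ratio = outlier ** 2 / typical ** 2   # 2500 / 9
# Under absolute loss, the same comparison:
absolute_ratio = outlier / typical            # 50 / 3

print(round(squared_ratio, 1))   # 277.8 — nearly 280 times worse
print(round(absolute_ratio, 1))  # 16.7 — only about 17 times worse
```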

Beyond Straight Lines

The least squares principle extends well beyond fitting a straight line. You can use it with polynomial curves, exponential growth models, or any function where you want to minimize squared errors. When the model is linear in its parameters (even if it includes squared or cubed terms of x), the same direct formulas apply.
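For instance, fitting a quadratic is still a linear least squares problem, because the model is linear in its coefficients even though it is curved in x. The sketch below uses NumPy's `polyfit`, which minimizes the sum of squared residuals internally; the data are generated from a known quadratic so the recovered coefficients are checkable:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2 - 2 * x + 1  # points generated from y = x² - 2x + 1

# Fit a degree-2 polynomial by least squares.
# polyfit returns coefficients from highest degree to lowest.
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [1. -2. 1.]
```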

When the model is genuinely nonlinear, meaning the parameters appear inside functions like exponentials or logarithms in ways that can’t be rearranged into a linear form, there’s no single formula for the answer. Instead, algorithms iterate toward the solution, adjusting the parameters step by step until the sum of squares stops decreasing. The two most common approaches are the Gauss-Newton method, which uses information about the curve’s shape to choose each step, and the Levenberg-Marquardt algorithm, which adds a safeguard to prevent the steps from overshooting. Both repeat the process until the model fits the data satisfactorily.
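The iterative idea can be sketched with a bare-bones Gauss-Newton loop for the nonlinear model y = a·eᵇˣ. This is a simplified illustration with invented data and parameter names, not production code: real implementations add convergence checks and safeguards such as the Levenberg-Marquardt damping mentioned above, and Gauss-Newton converges only from a starting guess reasonably close to the solution.

```python
import numpy as np

def gauss_newton(x, y, a, b, iterations=30):
    """Fit y = a * exp(b * x) by repeated linearization."""
    for _ in range(iterations):
        pred = a * np.exp(b * x)
        residuals = y - pred
        # Jacobian of the model with respect to (a, b), one row per point.
        J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
        # Each iteration solves a *linear* least squares problem for the
        # parameter update, then takes that step.
        step, *_ = np.linalg.lstsq(J, residuals, rcond=None)
        a, b = a + step[0], b + step[1]
    return a, b

x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(1.5 * x)  # noiseless data from known parameters a=2, b=1.5

# Start from a guess near the true values; the loop refines it stepwise.
a, b = gauss_newton(x, y, a=1.5, b=1.2)
print(a, b)  # close to 2.0 and 1.5
```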

Where Least Squares Shows Up in Practice

Any time you see a trend line on a graph, a regression equation in a research paper, or a prediction model in a software tool, there’s a good chance least squares is running underneath. In economics, it estimates the relationship between variables like income and spending. In engineering, it calibrates sensor measurements against known standards. In medicine, penalized versions of least squares have been used to model how radioactive tracers move through the body during brain imaging, to identify which brain regions are most active during specific tasks, and to link brain structure measurements to cognitive test scores.

The method also forms the backbone of machine learning techniques like linear regression and ridge regression. Even more complex models often use least squares as a building block, adding penalties or constraints on top of the basic sum-of-squares objective to handle high-dimensional data or prevent overfitting.
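Ridge regression is a concrete example of such a penalty: it adds the sum of squared coefficients, scaled by a strength parameter, to the usual sum-of-squares objective. The sketch below uses the closed-form solution, which simply adds λI to XᵀX before solving; the penalty strength `lam` and the data are invented for illustration:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: solve (XᵀX + λI) b = Xᵀy."""
    n_features = X.shape[1]
    # With lam = 0 this reduces to ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Responses generated from y = 1*x1 + 2*x2, so the unpenalized fit is exact.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

print(ridge(X, y, lam=0.0))   # [1. 2.] — ordinary least squares solution
print(ridge(X, y, lam=10.0))  # coefficients shrink toward zero
```

The penalty trades a small amount of bias for lower variance, which is what makes it useful on high-dimensional or nearly collinear data.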

A Brief Origin Story

Legendre published the method in 1805 as an appendix to a book on determining the orbits of comets, calling it “la Méthode des moindres quarrés.” Four years later, the German mathematician Carl Friedrich Gauss published his own account, claiming he had been using the technique since 1795 but acknowledging that Legendre published it first. The priority dispute between the two was never fully resolved, but the method’s connection to both mathematicians persists: the standard version is still called “ordinary least squares,” and the theorem guaranteeing its optimality bears Gauss’s name alongside Markov’s.