How to Use Linear Regression and Interpret the Results

Linear regression estimates the relationship between two or more variables by fitting a straight line through your data. You use it to predict an outcome (like sales revenue) based on one or more inputs (like advertising spend), or to measure how strongly those inputs influence the outcome. The core idea is simple: find the line that best fits your data points, then use that line to make predictions or draw conclusions.

The Basic Equation

Every linear regression model follows the same structure. You have a dependent variable (Y), which is the thing you’re trying to predict, and one or more independent variables (X), which are the inputs you believe influence Y. In simple linear regression, with just one input, the equation looks like this:

Y = a + bX

Here, “a” is the Y-intercept (the value of Y when X is zero), and “b” is the slope (how much Y changes for each one-unit increase in X). If you’re predicting house prices based on square footage, Y is the price, X is the square footage, “a” is the baseline price, and “b” tells you how much each additional square foot adds to the price.

Multiple linear regression extends this by adding more input variables. You might predict house prices using square footage, number of bedrooms, and distance to the nearest school. The equation just adds more terms: Y = a + b₁X₁ + b₂X₂ + b₃X₃. Each coefficient (b₁, b₂, b₃) tells you the effect of that variable while holding the others constant.
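Plugging numbers into that equation is all a prediction is. Here is a minimal sketch of the multiple-regression equation applied to one house, using made-up coefficients purely for illustration:

```python
# Hypothetical house-price model: Y = a + b1*X1 + b2*X2 + b3*X3.
# All coefficient values below are invented for illustration only.
a = 50_000           # baseline price (intercept)
b_sqft = 150         # dollars per additional square foot
b_bedrooms = 10_000  # dollars per additional bedroom
b_distance = -5_000  # dollars per mile from the nearest school

def predict_price(sqft, bedrooms, distance_miles):
    """Apply the linear equation to a single house."""
    return a + b_sqft * sqft + b_bedrooms * bedrooms + b_distance * distance_miles

# 50,000 + 150*2000 + 10,000*3 - 5,000*1.5
print(predict_price(2000, 3, 1.5))  # 372500.0
```

Note the negative coefficient on distance: as that input increases, the predicted price goes down, exactly as described above.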

How the Line Gets Fitted

The model finds the “best” line using a method called ordinary least squares (OLS). It calculates the vertical distance (the residual) between every data point and a proposed line, squares each distance, adds them all up, and then adjusts the line until that total is as small as possible. Squaring the distances ensures that points far from the line get penalized more heavily than those close to it. The resulting line minimizes these squared errors, which is why it’s called the “best-fitting line.”

You don’t need to do this math by hand. Software like Python (scikit-learn, statsmodels), R, Excel, or even Google Sheets can fit the model in seconds. Your job is understanding what the output means and whether the model is trustworthy.
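As a sketch of how little code this takes, here is a scikit-learn fit on made-up square-footage data (assumes scikit-learn and NumPy are installed; the data is generated from a known line plus noise so you can see the fit recover it):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from price = 50,000 + 150*sqft, plus random noise
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=100).reshape(-1, 1)
price = 50_000 + 150 * sqft.ravel() + rng.normal(0, 10_000, size=100)

# OLS happens inside .fit(); no hand calculation required
model = LinearRegression().fit(sqft, price)
print(f"intercept (a): {model.intercept_:.0f}")  # close to 50,000
print(f"slope (b):     {model.coef_[0]:.1f}")    # close to 150
```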

Four Assumptions Your Data Must Meet

Linear regression only produces reliable results when four conditions hold. Violating them can lead to misleading coefficients, inflated confidence, or predictions that fall apart on new data.

  • Linearity. The relationship between each input variable and the outcome is a straight line, not a curve. If the true relationship is curved, your model will systematically over-predict in some ranges and under-predict in others.
  • Independence. Each data point’s error is unrelated to the others. This assumption breaks down with time-series data, where today’s value often depends on yesterday’s. You can check this with the Durbin-Watson statistic, which ranges from 0 to 4. A value near 2 indicates no autocorrelation. Values closer to 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.
  • Homoscedasticity. The spread of errors stays consistent across all levels of the input variables. If your predictions are accurate for low values of X but wildly off for high values, this assumption is violated.
  • Normality of errors. The differences between your predictions and the actual values (residuals) follow a roughly bell-shaped distribution centered on zero. Small departures from normality are usually fine, especially with large datasets, but major skew or heavy tails can distort your results.

Reading the Output

Once you fit a model, the software gives you several key numbers. Knowing what each one means is the difference between running a regression and actually understanding your results.

Coefficients

Each input variable gets a coefficient that represents its effect on the outcome. A coefficient of 150 for square footage in a house price model means each additional square foot is associated with a $150 increase in price, assuming all other variables stay the same. Negative coefficients mean the outcome decreases as that input increases.

P-Values

Each coefficient comes with a p-value that helps you judge whether the relationship is likely real or could have appeared by chance. The conventional threshold is 0.05: a p-value below 0.05 is typically considered “statistically significant.” But a p-value doesn’t tell you the size or practical importance of the effect. A variable can be statistically significant while having a tiny real-world impact. And a p-value of 0.06 doesn’t mean the relationship is fake; it simply didn’t clear that particular bar.

R-Squared

R-squared tells you what proportion of the variation in your outcome is explained by your model. An R-squared of 0.50 means your inputs account for about half the variation. The rest comes from factors you haven’t included or from random noise. What counts as “good” depends entirely on your field. In physics, you might expect 0.99. In social science or behavioral research, 0.30 can be perfectly useful.

One catch: R-squared never decreases when you add more input variables, even if those variables are meaningless. Adjusted R-squared corrects for this by penalizing the addition of variables that don’t genuinely improve the model. If adjusted R-squared drops when you add a new variable, that variable is probably not helping. Always use adjusted R-squared when comparing models with different numbers of inputs.

Checking Your Model With Residual Plots

The single most useful diagnostic tool is a residual plot: a scatterplot of your predicted values (x-axis) against the residuals (y-axis). A healthy model produces a random cloud of points scattered around zero with no visible pattern.

If you see a U-shape or an arc, the relationship isn’t linear. Your model is missing a curve in the data, and you may need to add a squared term or use a different approach. If the residuals fan out (narrow on one side, wide on the other), you have unequal error variance, which means your predictions are more reliable in some ranges than others. Individual points with standardized residuals beyond 3 or below -3 are potential outliers worth investigating, since you’d expect fewer than 0.2% of observations to fall that far from the mean.
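The outlier check in particular is easy to automate. A sketch that computes standardized residuals and flags anything beyond ±3, with one outlier planted deliberately so there is something to find:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 5 + 2 * x.ravel() + rng.normal(0, 1, 100)
y[10] += 8  # plant one obvious outlier at index 10

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)
# Standardize: each residual divided by the residuals' standard deviation
standardized = residuals / residuals.std()

outliers = np.where(np.abs(standardized) > 3)[0]
print(f"potential outliers at indices: {outliers}")
```

The planted point shows up in the list; in real data, each flagged index is a row worth inspecting by hand before deciding what to do with it.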

Scaling Your Input Variables

When your input variables are measured on very different scales, you should standardize or normalize them before fitting the model. If you’re predicting house prices using square footage (values in the thousands) and number of bedrooms (values from 1 to 5), the raw coefficients will be hard to compare directly. Standardization converts each variable to have a mean of 0 and a standard deviation of 1, putting them on equal footing so you can see which inputs have the strongest relative influence.
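A sketch of that standardization step using scikit-learn’s StandardScaler, on the same square-footage/bedrooms example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = np.column_stack([
    rng.uniform(800, 3000, 50),  # square footage: values in the thousands
    rng.integers(1, 6, 50),      # bedrooms: values from 1 to 5
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Coefficients fitted on `X_scaled` are then directly comparable: the larger one belongs to the input with the stronger relative influence.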

Scaling is especially important if you’re using gradient descent to fit your model (common in machine learning implementations). Without scaling, the optimization process can take much longer to converge because large-valued features cause erratic updates to the coefficients.

Watching for Multicollinearity

In multiple regression, problems arise when two or more input variables are highly correlated with each other. If square footage and number of rooms both increase together, the model struggles to separate their individual effects. Coefficients become unstable, and small changes in the data can flip their signs or dramatically change their values.

You can detect this using the Variance Inflation Factor (VIF), which measures how much the variance of each coefficient is inflated by correlation with other inputs. A VIF above 5 to 10 signals problematic multicollinearity. The fix is usually to drop one of the correlated variables, combine them into a single variable, or collect more data.

When Linear Regression Doesn’t Work

Linear regression requires that the model is “linear in the parameters,” meaning the equation is built by multiplying each coefficient by a variable and adding the terms together. You can actually include squared variables, logged variables, or interaction terms and still have a linear model. Predicting Y using X² is fine because the coefficient itself isn’t being raised to a power or transformed.
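To make “linear in the parameters” concrete, here is a sketch of fitting a curved relationship with ordinary OLS by feeding X and X² as two separate input columns (data generated from a known quadratic so the fit can be checked):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data from y = 1 + 2x + 4x^2, plus noise: a curved relationship
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, 200)
y = 1 + 2 * x + 4 * x**2 + rng.normal(0, 1, 200)

# The model is still linear in its coefficients; only the features are squared
X = np.column_stack([x, x**2])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # roughly 1 and [2, 4]
```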

The model breaks down when the true relationship requires the parameters themselves to be exponentiated, logged, or otherwise transformed. Power functions, exponential growth curves, and wave-like patterns (Fourier functions) all require nonlinear regression. If your residual plot shows a clear curve that persists even after adding polynomial terms, you’ve likely reached the limits of what a linear model can capture.

Linear regression also handles only continuous numerical outcomes. If you’re trying to predict a yes/no outcome (whether a customer will buy something), you need logistic regression instead. And if your data has strong outliers that you can’t remove, the squared-error penalty in OLS gives those outliers outsized influence on your line, pulling it away from where it best fits the majority of your data.
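For the yes/no case, the switch in scikit-learn is one class name. A sketch on made-up “will the customer buy?” data, where the underlying purchase probability is invented purely so the labels have a learnable pattern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary outcome: more site visits -> higher chance of buying
rng = np.random.default_rng(7)
visits = rng.uniform(0, 20, 300).reshape(-1, 1)
prob = 1 / (1 + np.exp(-(visits.ravel() - 10)))  # invented true probabilities
bought = (rng.uniform(size=300) < prob).astype(int)

clf = LogisticRegression().fit(visits, bought)
print(clf.predict_proba([[2.0]])[0, 1])   # low predicted probability of buying
print(clf.predict_proba([[18.0]])[0, 1])  # high predicted probability
```

Unlike linear regression, the output is a probability bounded between 0 and 1, which is what a yes/no question actually calls for.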