R-squared tells you the proportion of variation in your data that is explained by your model. It’s usually a number between 0 and 1 (or 0% and 100%), where higher values mean the model accounts for more of the variability in the outcome you’re measuring. An R-squared of 0.85, for example, means your model explains 85% of why the data points differ from one another, with the remaining 15% left unexplained.
How R-Squared Works
Imagine you have a set of data points and you draw a flat horizontal line through their average. The total spread of those points around that average is called the total sum of squares. It captures all the variation in your data before you try to explain any of it.
Now you fit a regression line (or curve) through those points. Some of the original spread is now “explained” by the pattern your model found. The leftover spread, the part the model missed, is the residual sum of squares. R-squared is simply 1 minus the ratio of that leftover spread to the total spread. If the leftover is small relative to the total, R-squared is close to 1. If the model barely reduces the spread at all, R-squared is close to 0.
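That definition translates directly into a few lines of code. Here is a minimal sketch that computes R-squared from scratch; the data and the model predictions are hypothetical numbers chosen for illustration:

```python
# Computing R-squared by hand: 1 minus (residual spread / total spread).

def r_squared(y_actual, y_predicted):
    mean_y = sum(y_actual) / len(y_actual)
    # Total sum of squares: spread around a flat line at the average
    tss = sum((y - mean_y) ** 2 for y in y_actual)
    # Residual sum of squares: the spread the model failed to explain
    rss = sum((y - p) ** 2 for y, p in zip(y_actual, y_predicted))
    return 1 - rss / tss

y = [2.0, 4.0, 6.0, 8.0]        # hypothetical observed values
y_hat = [2.5, 3.5, 6.5, 7.5]    # hypothetical model predictions
print(r_squared(y, y_hat))      # -> 0.95
```

Here the total spread around the mean is 20, the leftover spread is 1, so the model explains 1 − 1/20 = 95% of the variation.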
What the Values Actually Mean
An R-squared of 0 means your model does no better than just predicting the average for every observation. An R-squared of 1 means every data point falls exactly on the predicted line, with zero unexplained variation. Most real-world results land somewhere in between, and what counts as “good” depends entirely on the field you’re working in.
In physics and chemistry, where experiments can tightly control variables, R-squared values of 0.70 to 0.99 are considered good. In social sciences and psychology, where human behavior introduces far more noise, values as low as 0.10 to 0.30 are often acceptable and still considered meaningful. A model predicting crop yield from rainfall will naturally explain more variation than a model predicting job satisfaction from salary, simply because people are more complicated than plants.
R-Squared and Correlation
In simple linear regression (one predictor, one outcome), R-squared is literally the square of the Pearson correlation coefficient, r. If the correlation between hours studied and exam score is 0.80, R-squared is 0.64, meaning study time explains 64% of the variation in scores. Squaring the correlation always produces a positive number, so R-squared doesn’t tell you the direction of the relationship. A correlation of -0.80 also gives an R-squared of 0.64. The sign of the original correlation matches the sign of the slope: positive means both variables rise together, negative means one falls as the other rises.
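This identity is easy to verify numerically. The sketch below computes Pearson's r and a least-squares R-squared independently, by hand, on a made-up hours-studied/exam-score dataset, and shows that r² matches R-squared:

```python
# Verifying that R-squared equals r**2 in simple linear regression.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def fit_r_squared(x, y):
    # Least-squares slope and intercept, computed directly
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    pred = [intercept + slope * a for a in x]
    rss = sum((b - p) ** 2 for b, p in zip(y, pred))
    tss = sum((b - my) ** 2 for b in y)
    return 1 - rss / tss

x = [1, 2, 3, 4, 5]        # hypothetical hours studied
y = [52, 60, 61, 74, 79]   # hypothetical exam scores
print(pearson_r(x, y) ** 2)   # same value both ways
print(fit_r_squared(x, y))
```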
Why a High R-Squared Can Be Misleading
A high R-squared does not mean your model is correct, and it certainly does not mean one variable causes changes in the other. Two variables can move together for reasons that have nothing to do with a direct cause-and-effect link. Ice cream sales and drowning rates both rise in summer, producing a high R-squared, but buying ice cream doesn’t cause drowning. Temperature drives both.
R-squared can also look impressive even when the model is the wrong shape for the data. A straight-line model forced through data that actually curves might still produce a decent R-squared overall, while being consistently wrong in specific ranges. This is why checking a residual plot matters. If the leftover errors show a clear pattern (a U-shape, a fan, or clusters) rather than random scatter around zero, the model is biased regardless of what R-squared says. Residual plots reveal non-linearity, unequal error spread, and outliers that a single summary number like R-squared can hide.
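The wrong-shape problem is easy to demonstrate. In this sketch, a straight line is fit to a toy dataset that is genuinely quadratic; R-squared comes out high (about 0.92), yet the residuals trace a clear U-shape rather than random scatter:

```python
# A straight line fit to curved data: high R-squared, patterned residuals.

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]   # y = x**2: genuinely curved toy data

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
    sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
pred = [intercept + slope * a for a in x]
residuals = [b - p for b, p in zip(y, pred)]

rss = sum(r ** 2 for r in residuals)
tss = sum((b - my) ** 2 for b in y)
r2 = 1 - rss / tss

print(r2)         # above 0.9 despite the wrong shape
print(residuals)  # -> [2.0, -1.0, -2.0, -1.0, 2.0]: a U-shape
```

The summary number says the fit is good; the residual pattern (positive, then negative, then positive again) says the model is systematically wrong at the edges and in the middle.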
It’s also possible for R-squared to be negative in certain situations. This happens when a constrained model, or a non-linear equation that fits poorly, actually predicts worse than a simple horizontal line at the average. A negative R-squared is a strong signal that the model is fundamentally wrong for the data.
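A contrived but concrete example: a model that gets the direction of the trend backwards produces a large residual sum of squares and therefore a negative R-squared. The data here are hypothetical:

```python
# R-squared goes negative when a model predicts worse than the mean.

def r_squared(y_actual, y_predicted):
    mean_y = sum(y_actual) / len(y_actual)
    tss = sum((y - mean_y) ** 2 for y in y_actual)
    rss = sum((y - p) ** 2 for y, p in zip(y_actual, y_predicted))
    return 1 - rss / tss

y = [1.0, 2.0, 3.0]
bad_pred = [3.0, 2.0, 1.0]      # trend direction reversed
print(r_squared(y, bad_pred))   # -> -3.0, far worse than predicting the mean
```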
R-Squared vs. Adjusted R-Squared
Every time you add a new variable to a regression model, R-squared will go up (or, at worst, stay exactly the same), even if that variable is meaningless noise. This is a mathematical inevitability, not evidence that the model improved. If you threw in random numbers as a predictor, R-squared would still tick upward slightly.
Adjusted R-squared corrects for this by penalizing the addition of variables. It accounts for both the sample size and the number of predictors, and it will actually decrease if a new variable doesn’t improve the model enough to justify its inclusion. Adjusted R-squared is always smaller than R-squared, though the gap is usually tiny unless you’re fitting too many variables to too small a dataset. When comparing models with different numbers of predictors, adjusted R-squared gives you a fairer comparison.
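The adjustment is a simple formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the sample size and p is the number of predictors. The sketch below applies it to made-up figures to show how the penalty grows as predictors are added to a fixed-size sample:

```python
# Adjusted R-squared: penalizes predictor count relative to sample size.

def adjusted_r_squared(r2, n, p):
    """n = sample size, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: raw R-squared of 0.64 from 30 observations
print(adjusted_r_squared(0.64, 30, 3))    # modest penalty with 3 predictors
print(adjusted_r_squared(0.64, 30, 10))   # much larger penalty with 10
```

With 3 predictors the adjusted value is about 0.60; with 10 predictors it drops to about 0.45, even though the raw R-squared is identical.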
What R-Squared Doesn’t Tell You
R-squared tells you how well the model fits the data you already have. It doesn’t directly tell you how accurate your predictions will be on new data. A model can fit historical data beautifully (high R-squared) yet perform poorly on future observations, especially if it’s overfit, meaning it learned the noise in the training data rather than the true underlying pattern.
It also doesn’t tell you the size of prediction errors in practical units. An R-squared of 0.90 sounds great, but if you’re predicting house prices and the typical error is still $50,000, that matters. The standard error of the regression (reported alongside R-squared in most software) gives you that missing piece: the typical distance between predicted and actual values, in the same units as your outcome variable. The two numbers move in opposite directions: as R-squared rises, the standard error falls, so an improvement in one is an improvement in the other.
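For a model with p predictors, the standard error of the regression is sqrt(RSS / (n − p − 1)). A minimal sketch on toy numbers, assuming one predictor:

```python
# Standard error of the regression: typical error in the outcome's own units.

y = [2.0, 4.0, 6.0, 8.0]        # hypothetical observed values
pred = [2.5, 3.5, 6.5, 7.5]     # hypothetical model predictions

n, p = len(y), 1                # sample size, number of predictors
rss = sum((a - b) ** 2 for a, b in zip(y, pred))
se = (rss / (n - p - 1)) ** 0.5
print(se)  # roughly 0.707, in the same units as y
```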
Finally, R-squared says nothing about whether you included the right variables, whether the relationship is linear, or whether your sample is representative. It’s one diagnostic tool among several, useful for gauging overall explanatory power but never sufficient on its own to validate a model.