R-squared is a number between 0 and 1 that tells you how much of the variation in your outcome variable is explained by your regression model. An R-squared of 0.75, for example, means the model accounts for 75% of the variation in the data, while the remaining 25% is unexplained. It’s one of the first metrics people look at when evaluating a regression, and understanding what it actually measures (and what it doesn’t) will help you use it correctly.
How R-Squared Works
Every dataset has natural variation. If you measured the test scores of 100 students, those scores would spread out around an average. R-squared quantifies how much of that spread your model captures versus how much is left over as error.
The calculation compares two things: the total variation in your data and the variation your model fails to explain (the errors, or “residuals”). The formula is:
R² = 1 − (sum of squared errors / total sum of squares)
The total sum of squares measures how far each data point falls from the overall average. The sum of squared errors measures how far each data point falls from your model’s predicted value. If your model’s predictions are much closer to the actual values than the simple average is, R-squared will be high. If your model barely improves on just guessing the average every time, R-squared will be near zero.
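The calculation above can be sketched in a few lines of numpy. The data here is made up purely for illustration; the point is the two sums of squares and their ratio:

```python
import numpy as np

# Hypothetical observed values and a model's predictions for them.
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7, 11.2])

# Total sum of squares: spread of the data around its own average.
ss_total = np.sum((y - y.mean()) ** 2)

# Sum of squared errors: spread of the data around the model's predictions.
ss_error = np.sum((y - y_pred) ** 2)

r_squared = 1 - ss_error / ss_total
print(round(r_squared, 4))  # close to 1: predictions beat the average by a lot
```

If the predictions were no better than guessing the mean, `ss_error` would be about as large as `ss_total` and the result would sit near zero.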
R-Squared and the Correlation Coefficient
In simple linear regression, where you have one predictor and one outcome, R-squared is literally the square of the Pearson correlation coefficient (r). If the correlation between a child’s height and their lung capacity is 0.85, R-squared is 0.72, meaning height accounts for 72% of the variation in lung capacity among those children. This relationship only holds cleanly in simple regression. Once you add multiple predictors, R-squared reflects the combined explanatory power of all variables together, not a single correlation.
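The identity is easy to verify numerically. This sketch uses synthetic data (the seed, slope, and sample size are arbitrary choices) and compares the squared Pearson correlation against R-squared computed from a one-predictor least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # noisy linear relationship

# Pearson correlation between predictor and outcome.
r = np.corrcoef(x, y)[0, 1]

# Fit a simple linear regression and compute R-squared from its residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(np.isclose(r ** 2, r2))  # True: in simple regression, R² equals r²
```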
What Counts as a “Good” Value
There’s no universal threshold for a good R-squared. It depends entirely on your field and what you’re trying to predict.
- Physical sciences and engineering: Values above 0.70 are typically expected, and in physics or chemistry, 0.70 to 0.99 is common because controlled experiments reduce noise.
- Finance: Values between 0.40 and 0.70 are often considered good, depending on the type of analysis.
- Ecology: Values from 0.20 to 0.50 can be acceptable, given the complexity of natural systems.
- Social sciences and psychology: Values as low as 0.10 to 0.30 are frequently considered acceptable because human behavior is inherently hard to predict.
- Clinical medicine: An R-squared of 0.15 to 0.20 has been proposed as a reasonable benchmark, though researchers often use arbitrary cutoffs.
A model predicting how far a ball falls under gravity will have an R-squared near 1.00. A model predicting which patients will respond to a medication might sit at 0.20 and still be clinically useful. Judging R-squared without context is a common mistake.
Adjusted R-Squared: Penalizing Extra Variables
Regular R-squared has a quirk: it never decreases when you add more predictor variables to a model, even if those variables are meaningless. Throw in a column of random numbers as a predictor and R-squared will tick up slightly, simply because the model has more flexibility to fit the data. This makes it unreliable for comparing models with different numbers of predictors.
Adjusted R-squared fixes this by imposing a penalty for each additional predictor. It accounts for both the number of observations and the number of variables in the model. If a new variable genuinely improves the model’s explanatory power, adjusted R-squared goes up. If the variable adds noise without real value, adjusted R-squared goes down. When you’re deciding between two models with different numbers of predictors, the one with the higher adjusted R-squared is generally the better choice.
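The penalty can be written out directly. This is the standard adjusted R-squared formula, shown here with made-up numbers; n is the number of observations and p the number of predictors:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1).

    Penalizes R-squared for model size: each extra predictor p
    shrinks the denominator and therefore the result.
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same raw R-squared looks much worse once many predictors are involved.
print(adjusted_r_squared(0.80, n=50, p=2))   # light penalty
print(adjusted_r_squared(0.80, n=50, p=20))  # heavy penalty
```

A model needs to earn back its complexity: with 20 predictors and only 50 observations, a raw R-squared of 0.80 is considerably less impressive than the same number from a 2-predictor model.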
Where R-Squared Can Mislead You
R-squared is easy to misinterpret, and a high value can create false confidence in several situations.
Non-Linear Relationships
R-squared can be very high even when your model is fundamentally wrong. A University of Virginia analysis demonstrated this by generating clearly non-linear data and fitting a straight line through it. The R-squared came out to 0.85, suggesting an excellent fit, but the linear model completely missed the actual curved pattern in the data. If you only looked at R-squared and never plotted your data, you’d conclude the model was solid when it was capturing the wrong relationship entirely.
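The UVA example itself isn't reproduced here, but the same effect is easy to demonstrate with synthetic data: fit a straight line to a pure quadratic and R-squared still comes out high, even though the model misses the curvature entirely:

```python
import numpy as np

x = np.linspace(0, 10, 100)
y = x ** 2                       # clearly curved data, no noise at all

# Fit a straight line anyway.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(round(r2, 3))  # above 0.9 despite fitting the wrong functional form
```

A residual plot would give the game away immediately: the errors are systematically negative in the middle of the range and positive at the ends, the signature of an unmodeled curve.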
Outliers
Extreme data points can dramatically distort R-squared in either direction. Outliers dominate the sum-of-squares calculations that R-squared relies on. A single unusual point can inflate R-squared by stretching the regression line toward it, or it can deflate R-squared by adding large residual error. The tricky part is that outliers can pull the fitted line so far toward themselves that their own residuals look small, masking the problem. Plotting your data before trusting any summary statistic is essential.
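A small synthetic demonstration of the inflation case: the data below has no real relationship at all, yet appending a single extreme point sends R-squared soaring, because the fitted line pivots toward the outlier:

```python
import numpy as np

def fit_r2(x, y):
    """R-squared of a simple least-squares line fit."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = rng.normal(0, 1, 30)         # unrelated to x: R-squared should be near zero

print(round(fit_r2(x, y), 3))    # small, as expected

# One extreme point drags the line toward itself and inflates R-squared.
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
print(round(fit_r2(x_out, y_out), 3))  # very high, driven by a single point
```

Note the masking effect described above: the outlier's own residual is tiny precisely because it controls where the line goes.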
Correlation, Not Causation
An R-squared of 0.90 between two variables means they move together in a highly predictable way. It says nothing about whether one causes the other. Ice cream sales and drowning deaths both rise in summer, and a regression of one on the other would yield a high R-squared. The model describes a pattern, not a mechanism.
R-Squared vs. RMSE
R-squared tells you what proportion of variation your model explains, but it doesn’t tell you how far off your predictions are in real units. That’s where RMSE (root mean square error) comes in. If you’re predicting house prices, R-squared might tell you the model explains 80% of the variation, while RMSE tells you a typical prediction is off by about $23,000 (with large errors weighted more heavily, since errors are squared before averaging). Both metrics evaluate the same model, but they answer different questions.
R-squared is more intuitive for understanding how well your predictors explain the outcome. RMSE is more useful for comparing the accuracy of different models, because it’s expressed in the same units as your outcome variable and gives you a concrete sense of prediction error. In practice, you’ll want both: R-squared for the big picture and RMSE for practical accuracy.
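Both metrics fall out of the same residuals. This sketch uses hypothetical house prices (the numbers are invented for illustration) to compute them side by side:

```python
import numpy as np

# Hypothetical house prices (dollars) and a model's predictions.
actual = np.array([250_000, 310_000, 180_000, 420_000, 275_000], dtype=float)
predicted = np.array([262_000, 295_000, 196_000, 401_000, 288_000], dtype=float)

errors = actual - predicted

# R-squared: proportion of variation explained (unitless).
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

# RMSE: typical prediction error, in the same units as the outcome (dollars).
rmse = np.sqrt(np.mean(errors ** 2))

print(round(r2, 3))
print(round(rmse))
```

Here the model explains most of the variation while still missing by roughly fifteen thousand dollars per house, which is exactly the kind of practical detail R-squared alone can't convey.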
Practical Tips for Using R-Squared
Always plot your data. R-squared is a single number summary, and single numbers hide patterns. A scatterplot with the regression line overlaid will reveal non-linear relationships, clusters, and outliers that R-squared alone cannot detect.
Use adjusted R-squared when comparing models with different numbers of predictors. Regular R-squared will always favor the model with more variables, regardless of whether those variables contribute meaningful information.
Interpret R-squared relative to your field. A model explaining 30% of the variance in human behavior might represent a genuine advance, while a physics model at 30% would signal something is badly wrong. The question isn’t whether R-squared is “high enough” in the abstract. It’s whether the model explains enough variation to be useful for your specific purpose.

