R-squared tells you what percentage of the variation in your outcome is explained by your model. If you’re using study hours to predict test scores and get an R-squared of 0.75, that means 75% of the differences in scores across students can be accounted for by how long they studied. The remaining 25% comes from other factors your model doesn’t capture.
How R-Squared Works
Every dataset has natural variation. Test scores differ from student to student, house prices differ from neighborhood to neighborhood, crop yields differ from field to field. R-squared (also called the coefficient of determination) measures how much of that natural spread your regression model accounts for. It’s expressed as a value between 0 and 1, or equivalently as a percentage between 0% and 100%.
An R-squared of 1.0 means your model perfectly predicts every data point, with zero leftover error. An R-squared of 0 means your model explains none of the variation at all, and you’d do just as well predicting the average every time. Most real-world models land somewhere in between.
The math behind it compares two quantities: how much your data points vary around the model’s predictions (the error, or unexplained, sum of squares) versus how much they vary around the simple average (the total sum of squares). If a dataset on skin cancer mortality across U.S. cities has a total variation of about 53,637, and latitude alone explains 36,464 of that while 17,173 remains as random error, the R-squared is 36,464 divided by 53,637, or roughly 0.68. Latitude explains 68% of the differences in mortality rates across those cities.
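That calculation is simple enough to sketch directly. Here is a minimal Python version, assuming NumPy; the `r_squared` helper is an illustration, not part of any particular library. It also plugs in the sums of squares from the skin-cancer example above:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: share of total variation explained."""
    ss_total = np.sum((y - y.mean()) ** 2)  # variation around the plain average
    ss_error = np.sum((y - y_pred) ** 2)    # leftover variation around the model
    return 1 - ss_error / ss_total

# The skin-cancer example, using its sums of squares directly:
# latitude explains 36,464 of a total 53,637.
print(36464 / 53637)  # roughly 0.68

# Sanity checks: a perfect model scores 1, predicting the mean scores 0.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # 1.0
print(r_squared(y, np.full(4, y.mean())))   # 0.0
```

The two sanity checks mirror the definitions above: zero leftover error gives an R-squared of 1, and a model that only ever predicts the average gives 0.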
What Counts as a “Good” R-Squared
There’s no universal threshold. What qualifies as a good R-squared depends entirely on the field you’re working in and how messy the thing you’re measuring is.
- Social sciences and psychology: Values as low as 0.10 to 0.30 are often considered acceptable. Human behavior is inherently unpredictable, so explaining even 10% of the variance can be meaningful.
- Finance: Good values typically range from 0.40 to 0.70, depending on the type of analysis and data quality.
- Ecology: Values from 0.20 to 0.50 are considered acceptable, given how many uncontrollable factors affect natural systems.
- Physical sciences and engineering: Expectations are much higher, generally above 0.70. Physics and chemistry studies often target 0.70 to 0.99.
The pattern is straightforward: the more controllable and measurable your variables are, the higher an R-squared you should expect. Predicting the boiling point of a chemical compound from its molecular weight will yield a tighter fit than predicting voter turnout from income levels.
Low R-Squared Can Still Be Valuable
A low R-squared doesn’t automatically mean your model is useless. In medical research, for example, a new drug might have highly variable effects from patient to patient, producing a low R-squared when predicting individual outcomes. Yet the drug’s average benefit could still be statistically significant across thousands of patients. That kind of result can save lives and be worth millions of dollars, even though the model explains only a small fraction of individual variation.
The key distinction is between statistical significance and explanatory power. Your model’s coefficients (the relationships it identifies) can be reliably different from zero, confirmed by low p-values, even when R-squared is modest. A low R-squared with significant coefficients tells you: “This factor genuinely matters, but there’s a lot of other stuff going on too.” That’s a perfectly valid finding in many contexts.
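The distinction is easy to see in a simulation. The sketch below (assuming NumPy and SciPy; the variable names like `hours` and `score` are invented for illustration) builds data where a real effect exists but noise dominates, then fits a line:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
hours = rng.uniform(0, 10, n)
# A genuine effect (slope 1.0) buried under much larger noise:
score = 50 + 1.0 * hours + rng.normal(0, 15, n)

res = stats.linregress(hours, score)
print(f"R-squared: {res.rvalue**2:.3f}")  # small: most variation unexplained
print(f"p-value:   {res.pvalue:.2e}")     # tiny: the slope is clearly nonzero
```

The model explains only a few percent of the variance, yet the p-value leaves no doubt that study hours matter. That is exactly the "this factor genuinely matters, but there's a lot of other stuff going on" situation.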
That said, you should be more cautious with low R-squared models. Make sure your variables were chosen based on a real hypothesis rather than by fishing through random options. Check that the data is clean, without outliers or measurement problems skewing results. And ideally, test the model on a separate dataset to see if the relationships hold up outside the original sample.
A High R-Squared Can Be Misleading
A high R-squared doesn’t guarantee your model is correct. Penn State’s statistics program highlights a striking example: a regression of tire groove depth on mileage produced an R-squared of 95.26%, which looks excellent on paper. But the residual plot (a graph of the prediction errors) revealed a clear curved pattern, meaning a straight-line model was the wrong shape. Prediction would improve substantially with a nonlinear model instead. The high R-squared told you that mileage matters for predicting groove depth, but it didn’t tell you the model was the right one.
This is why experienced analysts never rely on R-squared alone. Always look at residual plots to check whether the errors scatter randomly or show a pattern. Patterns in residuals signal that your model is missing something structural about the relationship.
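The tire-groove situation is easy to reproduce: fit a straight line to genuinely curved data and the R-squared comes out high, but the residuals show a systematic pattern. A minimal sketch with NumPy (the simulated data is hypothetical, not the Penn State dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, 200)  # a truly curved relationship

# A straight-line fit still yields a high R-squared...
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r2 = 1 - resid.var() / y.var()
print(f"R-squared of the linear fit: {r2:.3f}")

# ...but the residuals are not random scatter: negative in the middle,
# positive at both ends -- the signature of a missing curve.
print("mean residual, middle third:", resid[66:133].mean())
print("mean residual, outer thirds:",
      np.concatenate([resid[:66], resid[133:]]).mean())
```

On a residual plot, random scatter would put those two averages near zero; the large opposite-signed values here are the pattern that R-squared alone cannot reveal.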
The Overfitting Problem
R-squared has a mathematical quirk that can trap you: it never decreases when you add more variables to a model. Throw in a completely irrelevant predictor, like shoe size in a model predicting salary, and R-squared will either stay the same or tick upward. This is guaranteed to happen with any fixed dataset. The model simply finds tiny coincidental patterns in the noise, tailoring itself to the specific data you have rather than capturing real relationships.
This is called overfitting. The model looks better on your training data but performs worse on new data, because it learned the noise instead of the signal. The same thing happens with polynomial models: increase the degree of the polynomial and R-squared climbs steadily, even though the added complexity isn’t improving genuine predictive power.
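You can watch the quirk in action by adding a pure-noise column to a regression and comparing R-squared before and after. A sketch using NumPy's least-squares solver (the `r2_ols` helper is invented for this example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 3 * x + rng.normal(size=n)

def r2_ols(X, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

r2_one = r2_ols(x.reshape(-1, 1), y)
noise_col = rng.normal(size=n)  # a predictor with no relationship to y at all
r2_two = r2_ols(np.column_stack([x, noise_col]), y)

print(f"{r2_one:.4f} -> {r2_two:.4f}")  # the second value is never smaller
```

The extra column is random noise, yet R-squared does not drop; the fit latches onto coincidental alignments between that noise and the residuals.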
Adjusted R-Squared Fixes This
Adjusted R-squared was created specifically to address overfitting. It applies a penalty based on the number of predictors relative to the number of data points. Unlike regular R-squared, adjusted R-squared only increases when a new variable improves the model more than you’d expect by random chance. If a new predictor just adds noise, adjusted R-squared will decrease, even as regular R-squared inches up. When comparing models with different numbers of variables, adjusted R-squared is the more trustworthy metric.
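The standard formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A quick sketch shows how the penalty grows with model size:

```python
def adjusted_r2(r2, n, p):
    """Penalize R-squared for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R-squared of 0.70 on 50 data points, different model sizes:
print(adjusted_r2(0.70, 50, 2))   # 2 predictors: barely below 0.70
print(adjusted_r2(0.70, 50, 20))  # 20 predictors: much lower
```

With only two predictors the adjustment is mild, but cramming 20 predictors into 50 observations drags the adjusted value down sharply, which is exactly the overfitting penalty described above.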
R-Squared Does Not Prove Causation
A high R-squared between two variables means they move together. It does not mean one causes the other. Most data used in regression comes from observational studies, where the researcher simply records what happens rather than controlling conditions. In an experiment, you can manipulate one variable and watch what happens to another, which gives you grounds for causal claims. In observational data, you can’t.
Ice cream sales and drowning deaths both rise in summer, producing a high R-squared if you regress one on the other. The actual cause is temperature, not ice cream. This is an obvious example, but subtler versions of the same problem appear constantly in real analysis. R-squared quantifies how well variables track each other. Figuring out why they track each other requires a different kind of evidence entirely.
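A short simulation makes the mechanism concrete. In the sketch below (assuming NumPy; all numbers are made up for illustration), temperature drives both series, and neither has any effect on the other, yet regressing one on the other yields a substantial R-squared:

```python
import numpy as np

rng = np.random.default_rng(3)
temp = rng.uniform(10, 35, 365)                   # daily temperature (the real driver)
ice_cream = 20 * temp + rng.normal(0, 40, 365)    # sales respond to temperature
drownings = 0.3 * temp + rng.normal(0, 1.5, 365)  # drownings also respond to temperature

r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"R-squared between ice cream and drownings: {r**2:.2f}")
# Sizable R-squared, yet neither variable causes the other:
# both simply track the shared confounder, temperature.
```

The code never links ice cream to drownings at all; the correlation is manufactured entirely by the shared dependence on temperature.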

