What Is Explained Variance? Definition and Examples

Explained variance is the proportion of a dataset’s total variability that a statistical model accounts for. If a model has 70% explained variance, that means 70% of the spread in your data can be attributed to the factors your model captures, while the remaining 30% is left unexplained. It’s one of the most common ways to judge whether a model is doing a good job.

How Total Variance Gets Split

Every dataset has natural spread. People’s heights vary, test scores vary, monthly sales figures vary. That total spread, measured as the sum of squared differences from the overall average, is called the total sum of squares (SS Total). The core idea behind explained variance is splitting that total into two pieces: the part your model predicts and the part it doesn’t.

In a regression model, the explained portion (called the regression sum of squares) captures how much the predicted values differ from the overall average. The unexplained portion (the error sum of squares) captures how much the actual data points differ from the predictions. These two pieces always add up to the total:

SS Total = SS Explained + SS Error
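A minimal numpy sketch (with made-up numbers) can verify that this partition holds exactly for an ordinary least-squares fit:

```python
import numpy as np

# Toy data: a roughly linear relationship with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit a simple least-squares line and compute predictions.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)          # total spread around the mean
ss_explained = np.sum((y_hat - y.mean()) ** 2)  # spread the line accounts for
ss_error = np.sum((y - y_hat) ** 2)             # leftover scatter around the line

# For least-squares fits with an intercept, the split is exact (up to rounding):
print(np.isclose(ss_total, ss_explained + ss_error))  # True
```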

This partitioning works the same way in an ANOVA (analysis of variance), where the explained portion is called “between-group” variance and the unexplained portion is “within-group” variance. The between-group piece measures how much the group averages differ from each other. The within-group piece measures how much individual data points scatter around their own group’s average. If the groups truly differ, the between-group variance will be large relative to the within-group variance.
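The same between/within split can be checked directly. Here is a hedged sketch using three hypothetical groups of test scores (the numbers are invented for illustration):

```python
import numpy as np

# Three hypothetical groups of test scores (made-up numbers).
groups = [
    np.array([78.0, 82.0, 80.0, 79.0]),
    np.array([85.0, 88.0, 86.0, 90.0]),
    np.array([70.0, 73.0, 69.0, 72.0]),
]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group: how far each group mean sits from the grand mean,
# weighted by the group's size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group: scatter of individual scores around their own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

ss_total = np.sum((all_scores - grand_mean) ** 2)
print(np.isclose(ss_total, ss_between + ss_within))  # True
```

For these particular numbers the group means clearly differ, so the between-group piece dominates the within-group piece, which is exactly the pattern an ANOVA looks for.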

R-Squared: The Most Common Measure

The coefficient of determination, written as R², is explained variance expressed as a ratio. It equals the regression sum of squares divided by the total sum of squares, or equivalently, one minus the error sum of squares divided by the total:

R² = 1 − (SS Error / SS Total)

For an ordinary least-squares model with an intercept, the result falls between 0 and 1. An R² of 0.45 means 45% of the variation in your outcome variable is explained by the predictor(s) in your model. You can interpret it this way: “45% of the variation in y is explained by the variation in x.” The remaining 55% comes from factors the model doesn’t include, measurement error, or randomness.

To see how this connects to correlation: if two variables have a correlation of r = 0.4, squaring that gives R² = 0.16. Only 16% of the variation in one variable is explained by the other. A correlation that feels moderate actually explains a fairly small share of the total variance, which is why squaring the correlation is so useful. It gives you a more honest picture of how much one variable actually tells you about another.
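For a simple one-predictor regression, this identity (R² equals the squared Pearson correlation) can be confirmed numerically. A short sketch with invented data:

```python
import numpy as np

# Two made-up variables with a moderate positive relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 4.0, 8.0, 6.0])

r = np.corrcoef(x, y)[0, 1]          # Pearson correlation

# R² from the regression itself: 1 - SS Error / SS Total.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# For simple (one-predictor) regression, R² is exactly r squared:
print(np.isclose(r ** 2, r_squared))  # True
```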

What Counts as High or Low

There’s no universal threshold for “good” explained variance. It depends entirely on the field you’re working in. Jacob Cohen, a statistician whose benchmarks are still widely cited, proposed general guidelines for the social sciences based on correlation strength. A small effect corresponds to r = 0.10 (explaining about 1% of variance), a medium effect to r = 0.30 (about 9%), and a large effect to r = 0.50 (about 25%).

Those numbers might seem surprisingly low if you’re used to thinking in physics or engineering, where R² values above 0.90 are routine. But in fields like personality psychology, clinical research, and sociology, variables are harder to measure precisely and human behavior is influenced by countless overlapping factors. An R² of 0.09 for the relationship between self-concept and academic achievement, for example, qualifies as a medium-sized effect by Cohen’s standards. In clinical psychology, where measurement challenges are especially steep, that same effect size might be considered large.

The practical takeaway: always compare your explained variance to what’s typical in your specific domain. An R² of 0.30 could be excellent or mediocre depending on context.

The Overfitting Problem and Adjusted R²

Regular R² has a well-known flaw. Every time you add another predictor to a regression model, R² either stays the same or goes up, even if that predictor has no real relationship with the outcome. This happens because each new variable gives the model more flexibility to fit the noise in your specific dataset. With enough predictors, you can inflate R² to look impressive while the model is actually learning patterns that won’t hold up on new data.

Adjusted R² corrects for this by penalizing model complexity. Its formula divides the error and total sums of squares by their respective degrees of freedom (which account for sample size and the number of predictors) before taking the ratio:

Adjusted R² = 1 − [(SS Error / (n − p)) / (SS Total / (n − 1))]

Here, n is the number of observations and p is the number of estimated parameters, including the intercept. If you add a predictor that doesn’t meaningfully reduce the error, the penalty from losing a degree of freedom makes adjusted R² decrease. This makes it a more honest measure when comparing models of different complexity.
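The inflation effect is easy to demonstrate. The sketch below (synthetic data, hypothetical helper name `r2_and_adjusted`) fits a model with one real predictor, then adds a pure-noise column. R² cannot go down when a column is added, while adjusted R² applies the degrees-of-freedom penalty from the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 60
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)          # y really depends only on x
noise = rng.normal(size=n)                # an irrelevant "predictor"

def r2_and_adjusted(X, y):
    """Fit OLS via least squares; return (R², adjusted R²)."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_error = np.sum(resid ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    n_obs, p = X1.shape                          # p counts the intercept too
    r2 = 1 - ss_error / ss_total
    adj = 1 - (ss_error / (n_obs - p)) / (ss_total / (n_obs - 1))
    return r2, adj

r2_one, adj_one = r2_and_adjusted(x.reshape(-1, 1), y)
r2_two, adj_two = r2_and_adjusted(np.column_stack([x, noise]), y)

# Adding a column can never raise SS Error, so plain R² never drops;
# adjusted R² typically falls unless the new predictor pulls its weight.
print(r2_two >= r2_one)  # True
```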

Explained Variance in Dimensionality Reduction

The concept extends beyond regression. In principal component analysis (PCA), a technique for reducing the number of variables in a dataset, each component captures a share of the total variance. The first component explains the largest share, the second explains the next largest, and so on. Each component’s explained variance ratio tells you what percentage of the original dataset’s information that component retains.

If the first three components together account for 85% of the total variance, you could potentially replace dozens of original variables with just three components and lose only 15% of the information. The cumulative explained variance ratio guides the decision of how many components to keep: enough to capture the bulk of the variance without dragging along components that contribute almost nothing.
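One way to see these ratios in practice is to compute PCA by hand from the covariance matrix: its eigenvalues are the component variances, so each eigenvalue divided by their sum is that component’s explained variance ratio. A sketch with synthetic data built from two underlying factors (all numbers here are simulated, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 200 observations of 6 correlated variables,
# driven by only two latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

# PCA via the covariance matrix: eigenvalues are the variances
# captured by the principal components.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]       # sorted descending

explained_ratio = eigenvalues / eigenvalues.sum()  # per-component share
cumulative = np.cumsum(explained_ratio)            # running total

print(np.round(explained_ratio, 3))
print(np.round(cumulative, 3))
```

Because the data were generated from two factors, the first two ratios dominate and the cumulative total climbs quickly, which is the pattern that tells you how many components are worth keeping.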

Why It Matters in Practice

Explained variance gives you a concrete, interpretable number for how well your model captures reality. A prediction model with 5% explained variance is barely doing better than guessing the average every time. One with 80% explained variance is capturing most of what’s going on. That distinction matters when you’re deciding whether to trust a model’s predictions, whether adding more predictors is worth the complexity, or whether the factors you’re studying actually drive the outcome you care about.

It also keeps expectations realistic. When someone reports a “statistically significant” correlation of r = 0.15, squaring it reveals that the predictor explains only about 2% of the variance. Statistically significant and practically meaningful are very different things, and explained variance is one of the clearest ways to tell them apart.