Effect size is a way of measuring how big a difference or relationship actually is, not just whether it exists. While a p-value tells you whether a result is likely due to chance, the effect size tells you whether that result matters in practical terms. Think of it this way: a p-value answers “is there a difference?” while effect size answers “how much of a difference?”
Why P-Values Aren’t Enough
A p-value depends heavily on sample size. With a large enough group of people, even a tiny, meaningless difference will show up as statistically significant. The classic example is the Physicians' Health Study, which followed more than 22,000 subjects to test whether aspirin prevents heart attacks. The study produced a highly significant p-value of less than .00001, and it was stopped early because the evidence seemed so conclusive. But the actual effect size was extremely small: aspirin reduced the absolute risk of a heart attack by just 0.77 percentage points. The statistical test screamed “real effect!” while the effect size whispered “barely moves the needle.”
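You can see this dynamic in a quick simulation. The sketch below (in Python, with made-up numbers rather than the actual study data) draws two very large samples whose true difference is trivial, then runs a standard t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two populations whose true means differ by a trivial amount
# (0.03 standard deviations -- a negligible effect).
n = 50_000  # a very large sample per group
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.03, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: mean difference divided by the pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value: {p_value:.5f}")   # typically "significant" at this sample size
print(f"Cohen's d: {d:.3f}")       # yet the effect is around 0.03 -- trivial
```

With 50,000 people per group, the p-value will usually come out “significant” even though d lands around 0.03, far below even the “small” benchmark discussed next.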
This is why researchers and major organizations like the American Psychological Association now require effect sizes alongside p-values in published studies. As the statistician Jacob Cohen put it, “The primary product of a research inquiry is one or more measures of effect size, not p-values.”
The Most Common Type: Cohen’s d
When a study compares two groups (say, a treatment group and a placebo group), the most widely used effect size measure is Cohen’s d. It works by taking the difference between the two group averages and dividing it by the pooled standard deviation of the two groups, which is a measure of how spread out the data are. This standardization puts different studies on the same scale, making them directly comparable even when they measured completely different things.
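Here is what that calculation looks like in practice, a minimal sketch using hypothetical exam scores (the numbers are invented purely for illustration):

```python
import numpy as np

# Hypothetical exam scores for two groups (illustrative numbers only).
treatment = np.array([78, 85, 90, 72, 88, 95, 81, 84])
control   = np.array([70, 75, 80, 68, 77, 83, 72, 74])

# Pooled standard deviation: a weighted average of the two group variances.
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
pooled_sd = np.sqrt(pooled_var)

# Cohen's d: the difference in means, in units of the pooled spread.
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```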
Cohen proposed benchmarks that are still widely used today:
- Small (d = 0.2): 58% of the higher-scoring group exceeds the average of the lower-scoring group
- Medium (d = 0.5): 69% of the higher-scoring group exceeds that average
- Large (d = 0.8): 79% of the higher-scoring group exceeds that average
To make this concrete: if a tutoring program produces a Cohen’s d of 0.5, that means about 69% of tutored students scored above the average of the non-tutored students. That’s a noticeable, meaningful boost.
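Those percentages fall out of the normal curve: assuming both groups are roughly bell-shaped with similar spread, the share of the higher-scoring group above the other group’s average is the standard normal cumulative probability of d. A few lines reproduce the benchmarks:

```python
from scipy.stats import norm

# Proportion of the higher-scoring group expected to exceed the
# lower-scoring group's mean, assuming normality and equal spread.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {norm.cdf(d):.0%} score above the other group's mean")
# d = 0.2 -> 58%, d = 0.5 -> 69%, d = 0.8 -> 79%
```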
Effect Sizes for Other Types of Data
Cohen’s d works well when comparing two groups, but researchers use different measures depending on the type of analysis.
Correlations (Pearson’s r)
When measuring how strongly two things are related (like hours of sleep and test performance), the correlation coefficient r is itself an effect size. It ranges from -1 to 1, with 0 meaning no relationship at all. The standard benchmarks: r = 0.1 is small, r = 0.3 is medium, and r = 0.5 is large.
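Computing r takes one line once the data are in arrays. A sketch with invented sleep-and-score numbers:

```python
import numpy as np

# Hypothetical data: hours of sleep vs. test score (illustrative only).
sleep = np.array([5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0])
score = np.array([62,  70,  68,  75,  74,  82,  80,  85])

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry.
r = np.corrcoef(sleep, score)[0, 1]
print(f"Pearson's r = {r:.2f}")  # near 1 here, since the toy data trend tightly
```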
Variance Explained (Eta-Squared)
When a study compares three or more groups, researchers often report eta-squared, which tells you what proportion of the total variation in outcomes is explained by group membership. It ranges from 0 to 1. An eta-squared of 0.234, for instance, means the grouping variable accounts for about 23% of the differences in outcomes.
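Under the hood, eta-squared is a ratio of sums of squares: the variation between group means divided by the total variation. A sketch with invented scores for three groups:

```python
import numpy as np

# Hypothetical outcome scores for three teaching methods (illustrative).
groups = [
    np.array([70, 74, 78, 72, 76]),   # method A
    np.array([80, 85, 83, 87, 81]),   # method B
    np.array([75, 79, 77, 82, 78]),   # method C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group sum of squares: variation explained by group membership.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Total sum of squares: all variation around the grand mean.
ss_total = ((all_scores - grand_mean) ** 2).sum()

eta_squared = ss_between / ss_total
print(f"eta-squared = {eta_squared:.3f}")
```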
Odds Ratios and Risk Ratios
In medical research, outcomes are often binary: you either had a heart attack or you didn’t. Here, effect size is typically expressed as an odds ratio or risk ratio. A risk ratio compares the probability of the outcome in one group to the probability in the other; an odds ratio compares the odds (events divided by non-events) instead, and the two diverge when the outcome is common. For both, a value of 1 means no difference between groups, values above 1 mean a higher rate for one group, and values below 1 mean a lower rate. An odds ratio of 2.5 means one group has 2.5 times the odds of the outcome compared to the other.
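The distinction is easy to see in code. Here is a sketch with an invented 2×2 table, showing that the risk ratio and odds ratio of the same data can differ:

```python
# A hypothetical 2x2 outcome table (illustrative counts, not real data):
#                    event   no event
exposed_event, exposed_none = 30, 70
control_event, control_none = 15, 85

# Risk ratio: compares the probabilities of the event.
risk_exposed = exposed_event / (exposed_event + exposed_none)  # 0.30
risk_control = control_event / (control_event + control_none)  # 0.15
risk_ratio = risk_exposed / risk_control                       # 2.0

# Odds ratio: compares the odds (event / no event) instead.
odds_exposed = exposed_event / exposed_none                    # ~0.43
odds_control = control_event / control_none                    # ~0.18
odds_ratio = odds_exposed / odds_control                       # ~2.43

print(f"risk ratio = {risk_ratio:.2f}, odds ratio = {odds_ratio:.2f}")
```

Because 30% is a fairly common outcome here, the odds ratio (about 2.43) overstates the risk ratio (2.0); for rare outcomes the two converge.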
Cohen’s f² for Regression
When researchers use regression models to predict outcomes from multiple variables, Cohen’s f² measures how much a specific variable contributes. The benchmarks: f² of 0.02 is small, 0.15 is medium, and 0.35 is large.
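One standard way to compute f² for a single predictor is to compare the R² of the regression with and without that predictor, scaled by the variance the full model leaves unexplained. A sketch with simulated data (the formula, not the data, is the point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an outcome predicted from two variables (illustrative).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)

def r_squared(X, y):
    """R-squared from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# f^2 for x2: the R-squared it adds, relative to the unexplained variance.
r2_full = r_squared(np.column_stack([x1, x2]), y)
r2_reduced = r_squared(x1.reshape(-1, 1), y)
f2 = (r2_full - r2_reduced) / (1 - r2_full)
print(f"f-squared for x2 = {f2:.3f}")
```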
Why “Small” Doesn’t Always Mean “Unimportant”
Cohen’s benchmarks are useful starting points, but they can be misleading if applied without context. A small effect size can be enormously important when the outcome is serious, the population is large, or the effects accumulate over time. Aspirin’s effect on heart attack prevention has a correlation of roughly r = 0.03, which looks trivial by any standard benchmark. But because heart attacks are catastrophic and aspirin is cheap with minimal side effects, that tiny effect translates to about 85 prevented heart attacks for every 10,845 people treated. That’s a small effect size with massive real-world value.
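The arithmetic behind that claim is worth seeing. Using the rounded figures quoted above (the precise study counts would shift these numbers slightly):

```python
# Back-of-the-envelope arithmetic from the figures quoted in the text.
risk_reduction = 0.0077      # absolute risk reduction: 0.77 percentage points
people_treated = 10_845

prevented = risk_reduction * people_treated
nnt = 1 / risk_reduction     # "number needed to treat" per prevented event

print(f"heart attacks prevented: ~{prevented:.0f}")  # ~84, in line with the text
print(f"people treated per prevented heart attack: ~{nnt:.0f}")  # ~130
```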
The same logic works in reverse. A large effect size for a rare condition affecting twelve people may not justify an expensive intervention. Context determines whether a given number is meaningful, and researchers increasingly argue that rigid benchmarks should be treated as rough guides rather than absolute rules. One researcher has proposed higher thresholds for social science research, suggesting a minimum meaningful effect of d = 0.41 or r = 0.20, while acknowledging that studies with highly reliable outcomes like death or disease status can justify much lower values.
Effect Size and Clinical Significance
In healthcare, there’s a related concept called the minimal clinically important difference, or MCID. This is the smallest change in a health outcome that a patient would actually notice or care about. Some methods for calculating the MCID are directly tied to effect size, using benchmarks of 0.2 or 0.3 standard deviations as a starting point. But purely statistical approaches have a weakness: they don’t capture the patient’s perspective. A change that’s statistically detectable isn’t necessarily a change someone feels in their daily life. This is why clinical researchers often combine effect size calculations with patient surveys to determine whether a treatment difference is genuinely meaningful.
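Distribution-based MCID estimates like these are simple to compute: take a fixed fraction of the outcome’s spread. A sketch with invented quality-of-life scores:

```python
import numpy as np

# Hypothetical baseline scores on a quality-of-life scale (illustrative).
baseline_scores = np.array([52, 60, 47, 55, 63, 58, 50, 66, 54, 61])

sd = baseline_scores.std(ddof=1)

# Distribution-based MCID estimates: a fixed fraction of the spread.
for fraction in (0.2, 0.3):
    print(f"MCID at {fraction} SD: {fraction * sd:.1f} points")
```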
How to Read Effect Sizes in Studies
When you encounter effect sizes in research articles, news stories, or health recommendations, a few principles help you interpret them. First, check what type of effect size is being reported. A “0.5” means very different things depending on whether it’s a Cohen’s d, a correlation, or an odds ratio. Second, compare the effect size to other interventions for the same outcome rather than relying solely on generic benchmarks. A d of 0.3 might be excellent for a low-cost educational app but disappointing for an intensive year-long program. Third, look at the confidence interval around the effect size if one is reported. A study might estimate d = 0.5, but if the confidence interval ranges from 0.1 to 0.9, there’s a lot of uncertainty about the true size of the effect.
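For Cohen’s d, a widely used large-sample approximation gives the confidence interval directly from d and the group sizes. This sketch shows why the same d = 0.5 is far more certain from a big study than from a small one:

```python
import numpy as np

def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d
    (large-sample formula; a sketch, not the only method)."""
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# A d of 0.5 from a small study is far less certain than from a large one.
print(cohens_d_ci(0.5, n1=20, n2=20))    # wide: roughly (-0.1, 1.1)
print(cohens_d_ci(0.5, n1=500, n2=500))  # much narrower: roughly (0.37, 0.63)
```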
Effect size is ultimately about answering the question that matters most: not “did something happen?” but “how much happened, and should I care?” It’s the difference between knowing a medication works and knowing whether it works enough to be worth taking.

