Effect size is a statistical measure that tells you how big or meaningful a difference, relationship, or change actually is. Where a p-value only tells you whether a result is likely real or due to chance, the effect size tells you how much it matters. Think of it this way: a p-value answers “Is there a difference?” while effect size answers “How large is that difference?”
Why P-Values Aren’t Enough
A p-value tells you how likely it is that you would see a difference at least as large as the one you observed if, in reality, no difference existed between the groups. If that probability is very low (typically below 0.05), researchers call the result “statistically significant.” But here’s the catch: statistical significance depends heavily on sample size. With a large enough sample, almost any tiny difference will register as statistically significant, even if it’s meaningless in practice.
Consider a weight-loss trial with 10,000 participants. If the treatment group loses an average of 0.5 kg more than the control group, that difference could easily be statistically significant. But losing half a kilogram is not a meaningful outcome for someone trying to manage their weight. The p-value says “this is real,” but the effect size reveals “this is trivial.”
Effect size is independent of sample size. It captures the magnitude of an effect on its own terms, which is why researchers increasingly treat it as the primary product of a study, not just a supplement to the p-value. Both pieces of information together give the full picture: whether a finding is real, and whether it’s worth caring about.
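The contrast can be sketched numerically. This is a minimal illustration, assuming a two-sample t statistic with equal group sizes and using the weight-loss numbers above (0.5 kg difference, with an assumed 5 kg standard deviation, both made up for the example):

```python
import math

def t_statistic(mean_diff, pooled_sd, n_per_group):
    """Two-sample t statistic with equal group sizes."""
    return mean_diff / (pooled_sd * math.sqrt(2 / n_per_group))

# The same 0.5 kg difference and 5 kg standard deviation at three sample
# sizes: Cohen's d stays at 0.1 regardless of n, but t (and hence
# statistical significance) grows with the square root of n.
d = 0.5 / 5.0
for n in (25, 100, 10_000):
    print(f"n={n:>6} per group: d={d:.2f}, t={t_statistic(0.5, 5.0, n):.2f}")
```

At 25 participants per group the t value falls well short of the usual 1.96 cutoff; at 10,000 per group it sails past it, while d never moves from a trivial 0.1.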
How Cohen’s d Works
The most common way to express effect size when comparing two groups is Cohen’s d. The calculation is straightforward: take the difference between the two group averages and divide it by the pooled standard deviation (a weighted average of the two groups’ spreads). The result tells you how many standard deviations apart the two groups are. If a tutoring program raises test scores by a full standard deviation compared to a control group, that’s a Cohen’s d of 1.0, which is large. If the difference is only a quarter of a standard deviation, d equals 0.25, which is small.
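The calculation can be sketched in a few lines of Python; the test-score data below are made up for illustration:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical tutoring example: tutored vs. control test scores
tutored = [85, 90, 95, 100, 105]
control = [75, 80, 85, 90, 95]
print(cohens_d(tutored, control))  # about 1.26: a large effect
```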
Jacob Cohen, the statistician who popularized this measure, proposed a set of rough benchmarks that are still widely used today:
- Small effect: d = 0.2
- Medium effect: d = 0.5
- Large effect: d = 0.8
Cohen intended these as guidelines for situations where researchers had little prior data to judge what counts as a meaningful effect in their field. A “medium” effect was meant to represent roughly the average effect you might see across a whole discipline. These thresholds aren’t rigid rules. In some fields, an effect of 0.3 could be groundbreaking; in others, 0.8 might be expected. Context always matters more than the label.
Other Ways to Measure Effect Size
Cohen’s d is the go-to metric when you’re comparing the averages of two groups, but different types of data call for different measures.
Correlation Coefficients
When researchers measure the strength of a relationship between two variables (say, hours of exercise per week and blood pressure), the Pearson correlation coefficient, r, doubles as an effect size. It ranges from -1 to 1, where 0 means no relationship at all. Cohen’s benchmarks for correlations are: 0.1 for a small effect, 0.3 for medium, and 0.5 for large. Squaring r gives you the proportion of variance explained. A correlation of 0.5, for example, means one variable accounts for about 25% of the variation in the other.
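A minimal sketch of r and the variance-explained calculation, with made-up exercise and blood-pressure numbers:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both variables' spreads."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly exercise hours vs. systolic blood pressure
hours = [0, 2, 4, 6, 8]
bp    = [140, 135, 132, 128, 126]
r = pearson_r(hours, bp)
print(r, r ** 2)  # r is negative here: more exercise, lower pressure
```

Squaring r discards the sign, which is why variance explained alone can’t tell you the direction of the relationship.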
Odds Ratios and Risk Ratios
When the outcome is binary (something either happened or it didn’t, like developing a disease versus staying healthy), researchers use odds ratios or risk ratios. A risk ratio of 2.0 means the event is twice as likely in one group compared to another. A value of 1.0 means no difference between groups. These are especially common in medical research and clinical trials. The further the ratio is from 1.0 in either direction, the larger the effect.
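Both ratios fall out of simple event counts; the trial numbers below are hypothetical:

```python
def risk_ratio(events_a, total_a, events_b, total_b):
    """Risk ratio: probability of the event in group A divided by group B."""
    return (events_a / total_a) / (events_b / total_b)

def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds ratio: odds of the event in group A divided by group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b

# Hypothetical trial: 30 of 100 untreated patients develop the disease,
# versus 15 of 100 treated patients.
print(risk_ratio(30, 100, 15, 100))  # 2.0: twice as likely without treatment
print(odds_ratio(30, 100, 15, 100))  # about 2.43: odds exaggerate the ratio
```

Note that the two metrics diverge when the event is common, which is one reason risk ratios are often easier to interpret.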
Eta-Squared
When comparing more than two groups at once (for instance, testing three different doses of a medication), researchers often use eta-squared. This metric represents the proportion of total variability in the outcome that can be attributed to the grouping variable. An eta-squared of 0.10 means the groups explain 10% of the variation in the outcome. A related metric, partial eta-squared, is preferred in more complex designs with multiple variables because it isolates the effect of one variable without being diluted by the others.
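Concretely, eta-squared is the between-group sum of squares divided by the total sum of squares. A sketch with made-up dose data:

```python
import statistics

def eta_squared(*groups):
    """Eta-squared: between-group sum of squares over total sum of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total

# Hypothetical symptom scores under three doses of a medication
low    = [8, 9, 10]
medium = [6, 7, 8]
high   = [4, 5, 6]
print(eta_squared(low, medium, high))  # 0.8: dose explains 80% of the variation
```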
Small Samples Need a Correction
Cohen’s d has a known weakness: it’s slightly biased when sample sizes are small, tending to overestimate the true effect. A variant called Hedges’ g corrects for this by applying a small adjustment factor. In practice, the two numbers are nearly identical for large studies, but Hedges’ g is the safer choice when working with groups of fewer than 20 or so participants. Most statistical software can calculate either one.
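The adjustment is a single multiplicative factor applied to d. This sketch uses a common approximation to the exact correction:

```python
def hedges_g(d, n_a, n_b):
    """Shrink Cohen's d to correct its small-sample bias (Hedges' g)."""
    df = n_a + n_b - 2
    correction = 1 - 3 / (4 * df - 1)  # common approximation to the exact factor
    return d * correction

# Nearly identical for large studies, noticeably smaller for tiny ones
print(hedges_g(0.8, 10, 10))    # small groups: g shrinks to about 0.77
print(hedges_g(0.8, 500, 500))  # large groups: barely changes
```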
Effect Size in Meta-Analysis
One of the most powerful uses of effect size is combining results across multiple studies. A meta-analysis gathers every relevant study on a question (does this drug reduce symptoms? does this teaching method improve outcomes?) and pools their findings into a single overall estimate. This only works because effect sizes provide a common metric. A study done in 50 patients and another done in 5,000 patients may have wildly different p-values, but their effect sizes can be directly compared and averaged. Each study’s effect size and its variance are calculated individually, then weighted and combined to produce a global estimate that’s more reliable than any single study alone.
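The pooling step is typically an inverse-variance weighted average. Shown here in its simplest fixed-effect form, with made-up numbers echoing the 50- and 5,000-patient example:

```python
def pooled_effect(effects, variances):
    """Fixed-effect meta-analysis: inverse-variance weighted average."""
    weights = [1 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Hypothetical pair of studies measuring the same treatment:
# a small study (d = 0.50, large variance) and a large, precise one
# (d = 0.30, small variance). The precise study dominates the pool.
effects   = [0.50, 0.30]
variances = [0.08, 0.001]
print(pooled_effect(effects, variances))  # close to 0.30, not the midpoint
```

Real meta-analyses often use random-effects models instead, which add a between-study variance term, but the weighting logic is the same.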
How to Interpret Effect Sizes in Practice
Reading an effect size means asking two questions: is the result statistically reliable, and is it practically meaningful? A large Cohen’s d paired with a tiny p-value is the clearest signal that a real, substantial effect exists. A statistically significant but tiny effect size suggests the finding is real but may not matter in everyday terms. A large effect size that fails to reach statistical significance usually means the study was underpowered: too small to confirm the result definitively.
The specific threshold for what counts as “meaningful” depends entirely on the domain. In psychotherapy research, a Cohen’s d of 0.3 might represent genuine clinical improvement for patients. In engineering, where precision matters, even an effect of 0.1 could influence design decisions. And in education, where interventions reach millions of students, a small effect size can translate into large real-world impact simply because of the scale.
When you encounter research results, look for the effect size alongside the p-value. If a study only reports statistical significance without any measure of effect magnitude, you’re only getting half the story. The effect size is what tells you whether the finding is large enough to change anything in the real world.

