Effect Size in Psychology: What It Is and Why It Matters

Effect size is a standardized way of measuring how big a difference or relationship actually is in a psychological study. Where a p-value only tells you how unlikely a result would be if chance alone were at work, effect size tells you whether that result matters in practical terms. It’s one of the most important concepts in psychological research because it answers the question researchers and readers actually care about: how strong is this effect?

Why P-Values Aren’t Enough

Statistical significance (the familiar p-value) tells you how likely you would be to observe a difference at least as large as the one found if there were, in truth, no difference between the groups. If p is less than 0.05, researchers generally call the result “significant.” But significance depends heavily on sample size. Run a study with thousands of participants and you can get a statistically significant result for a tiny, practically meaningless difference. A therapy that improves anxiety scores by half a point on a 100-point scale could easily produce p < 0.05 with a large enough group, even though that half-point change wouldn’t make any real difference in someone’s life.

Effect size is independent of sample size. It captures the magnitude of the difference, not the probability that the difference is real. Two studies can both report p = 0.01, but one might reflect a large, clinically meaningful improvement while the other reflects a trivial one. Without effect size, you can’t tell them apart.

Cohen’s d: Comparing Two Groups

The most widely used effect size measure in psychology is Cohen’s d, which quantifies the difference between two group averages in standard deviation units. If a treatment group scores 10 points higher than a control group on a depression scale, and the pooled standard deviation is 20 points, Cohen’s d is 0.5. That means the treatment group scored half a standard deviation above the control group.
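
To make that calculation concrete, here is a minimal Python sketch using the standard pooled-standard-deviation formula. The function name and the sample data are invented for illustration; the data are chosen so the pooled SD is 20 and the mean difference is 10, matching the example above.

    import math

    def cohens_d(group1, group2):
        """Cohen's d for two independent groups: mean difference in pooled-SD units."""
        n1, n2 = len(group1), len(group2)
        mean1, mean2 = sum(group1) / n1, sum(group2) / n2
        var1 = sum((x - mean1) ** 2 for x in group1) / (n1 - 1)
        var2 = sum((x - mean2) ** 2 for x in group2) / (n2 - 1)
        pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
        return (mean1 - mean2) / pooled_sd

    treatment = [40, 60, 80]   # mean 60
    control   = [30, 50, 70]   # mean 50, pooled SD 20
    print(cohens_d(treatment, control))  # 0.5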

Jacob Cohen, the statistician who popularized this measure, proposed benchmarks that researchers still use today:

  • Small effect: d = 0.2. The difference exists but is hard to see with the naked eye. Think of the height difference between 15-year-old and 16-year-old girls.
  • Medium effect: d = 0.5. The difference is noticeable. This is roughly the size of effect you’d expect from many well-established psychological interventions.
  • Large effect: d = 0.8. The difference is obvious and substantial.

Cohen himself cautioned that these benchmarks are rough guides, not rigid cutoffs. A “small” effect size for a low-cost, easy-to-implement intervention might still be worth pursuing, while a “large” effect size from a single underpowered study might not hold up.

Correlation as Effect Size

When researchers measure the strength of a relationship between two variables (say, sleep quality and test performance), they often use Pearson’s r as the effect size. The same small/medium/large framework applies, with different numbers:

  • Small: r = 0.10
  • Medium: r = 0.30
  • Large: r = 0.50

A correlation of r = 0.30 between childhood adversity and adult anxiety, for example, would be considered a medium effect. It means the relationship is real and meaningful, but plenty of other factors also contribute. Correlation values range from -1 to +1, where 0 means no linear relationship at all.
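
For readers who want to see the computation, here is a minimal Python sketch of Pearson’s r (the variable names and data are invented for illustration; in practice you would typically call a library routine such as scipy.stats.pearsonr).

    import math

    def pearson_r(xs, ys):
        """Pearson correlation: covariance scaled by both standard deviations."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)

    sleep_quality = [4, 6, 7, 5, 8, 3]        # hypothetical ratings
    test_score    = [60, 72, 75, 68, 80, 58]  # hypothetical scores
    print(round(pearson_r(sleep_quality, test_score), 2))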

Eta-Squared: Comparing Three or More Groups

When a study compares more than two groups (for instance, testing three different therapy approaches against each other), researchers use a measure called eta-squared or partial eta-squared. This value represents the proportion of overall variation in the outcome that can be attributed to the group differences. The benchmarks here are smaller numbers because they represent proportions rather than standard deviation units:

  • Small effect: 0.01 (1% of variation explained)
  • Medium effect: 0.06 (6% of variation explained)
  • Large effect: 0.14 (14% of variation explained)

If a study comparing cognitive behavioral therapy, medication, and a waitlist control finds a partial eta-squared of 0.10, that means the type of treatment accounted for about 10% of the differences in patient outcomes, a medium-to-large effect.
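
Under the usual one-way ANOVA decomposition, eta-squared is simply the between-groups sum of squares divided by the total sum of squares. A minimal Python sketch of that ratio, with hypothetical outcome data:

    def eta_squared(groups):
        """Eta-squared: proportion of total variance explained by group membership."""
        all_scores = [x for g in groups for x in g]
        grand_mean = sum(all_scores) / len(all_scores)
        ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
        ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
        return ss_between / ss_total

    cbt      = [30, 35, 32]   # hypothetical outcome scores
    meds     = [28, 33, 30]
    waitlist = [22, 25, 24]
    print(round(eta_squared([cbt, meds, waitlist]), 2))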

Hedges’ g: A Correction for Small Samples

Cohen’s d has a slight bias: it tends to overestimate the true effect when sample sizes are small. Hedges’ g corrects for this by applying a small adjustment factor, and it’s generally preferred when total sample sizes fall below about 50 participants or when the two groups being compared are very different in size. For large, balanced samples, Cohen’s d and Hedges’ g produce nearly identical numbers and use the same interpretation benchmarks.
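
The adjustment itself is simple. A widely used approximation multiplies d by a correction factor J = 1 - 3/(4*df - 1), where df = n1 + n2 - 2. A minimal sketch:

    def hedges_g(d, n1, n2):
        """Hedges' g: Cohen's d shrunk by the small-sample correction factor J."""
        df = n1 + n2 - 2
        correction = 1 - 3 / (4 * df - 1)
        return d * correction

    # With 10 participants per group, d = 0.5 shrinks to roughly 0.48
    print(round(hedges_g(0.5, 10, 10), 3))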

Effect Size, Sample Size, and Statistical Power

Effect size plays a direct role in how researchers design studies. Statistical power is the probability that a study will detect a real effect if one exists, and the standard target is 80% power (meaning a 4-in-5 chance of catching a true effect). Three things determine power: the significance threshold (usually p < 0.05), the sample size, and the expected effect size.

The practical consequence is striking. To detect a large effect (d = 0.8) with 80% power in a two-group comparison, you need roughly 52 participants (26 per group). For a medium effect (d = 0.5), you need about 128 (64 per group). For a small effect (d = 0.2), you need around 788 (394 per group). This is why psychology studies looking for subtle effects, like the impact of a brief mindfulness exercise on reaction time, require far more participants than studies testing powerful interventions. Running a study with too few participants to detect the effect it’s looking for is essentially a waste of resources.
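
These sample sizes fall straight out of a power calculation. Assuming the statsmodels library is available, a sketch of solving for the required n per group for a two-sample t-test:

    import math
    from statsmodels.stats.power import TTestIndPower

    power_analysis = TTestIndPower()
    # Participants per group for 80% power, two-tailed alpha = .05
    for d in (0.8, 0.5, 0.2):
        n = power_analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
        print(d, math.ceil(n))   # about 26, 64, and 394 per group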

The Role of Effect Size in Meta-Analysis

One of the most important uses of effect size is in meta-analysis, where researchers combine results from dozens or even hundreds of studies on the same topic. Individual studies might use different scales, different populations, and different specific measures, but because effect sizes are standardized, they can be directly compared and averaged across studies. A meta-analysis of therapy for depression, for instance, might combine studies that used the Beck Depression Inventory with studies that used the Hamilton Rating Scale. The raw scores are incomparable, but the effect sizes speak the same language.
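
A fixed-effect meta-analysis, for instance, averages the studies’ standardized effects weighted by the inverse of their sampling variances. Here is a minimal Python sketch using the standard large-sample variance approximation for d; the study values are invented purely for illustration.

    def d_variance(d, n1, n2):
        """Approximate sampling variance of Cohen's d."""
        return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

    def pooled_effect(studies):
        """Inverse-variance weighted mean of per-study effect sizes.
        studies: list of (d, n1, n2) tuples."""
        weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
        return sum(w * d for (d, _, _), w in zip(studies, weights)) / sum(weights)

    # Three hypothetical depression-therapy studies, measured on different
    # scales but already converted to d, so they can be averaged directly
    studies = [(0.45, 30, 30), (0.60, 50, 48), (0.52, 80, 75)]
    print(round(pooled_effect(studies), 2))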

This is why effect sizes are sometimes called the “common currency” of research synthesis. Without them, the field couldn’t systematically determine whether a treatment works across different settings and populations.

Clinical Significance vs. Statistical Significance

Effect size bridges the gap between a statistically significant finding and one that actually matters to people. Consider two studies of different treatments, both producing p = 0.01. The first finds that the treatment extends life expectancy by five years. The second finds a different treatment extends life by five months. Both results are statistically significant at the same level, but their practical importance is vastly different. Effect size captures that distinction in a way that p-values cannot.

In psychology specifically, this matters constantly. A new educational program might produce a statistically significant improvement in reading scores, but if the effect size is d = 0.1, the improvement is so small that it may not justify the cost of implementing the program. Conversely, a therapy for PTSD with d = 0.8 represents a meaningful change in people’s daily functioning.

Reporting Standards in Psychology

The American Psychological Association requires researchers to report effect sizes alongside their statistical tests. This has been part of APA style guidelines for years, reflecting a broader shift in the field away from relying solely on p-values. When you read a well-reported psychology study, you should find not just whether the result was significant, but how large the effect was, typically expressed as Cohen’s d, Hedges’ g, r, or eta-squared depending on the type of analysis.

This reporting standard exists because effect size gives readers something a p-value never can: a sense of scale. Knowing that a finding is “significant” tells you it’s probably real. Knowing its effect size tells you whether it’s worth caring about.