Effect size is a standardized number that tells you how large a difference or relationship actually is, independent of the sample sizes and measurement quirks of individual studies. In meta-analysis, it’s the common currency that lets researchers combine results from dozens or even hundreds of studies into a single, more reliable estimate. Without effect sizes, meta-analysis wouldn’t work, because you can’t meaningfully merge findings reported in different units, scales, or formats.
Why Effect Size Matters More Than P-Values
A p-value tells you whether a result is statistically significant, but it says nothing about whether that result is meaningful in the real world. A massive study with 50,000 participants can produce a tiny, trivial difference that still reaches statistical significance. Effect size fills this gap by quantifying how big the difference or relationship actually is.
Consider a study testing whether a new therapy reduces anxiety. The p-value might tell you the reduction is “statistically significant,” but the effect size tells you whether patients improved by a barely noticeable amount or experienced a genuinely meaningful drop in symptoms. When a meta-analysis pools many studies together, it calculates a combined effect size that represents the best available estimate of the true magnitude of an effect across all the evidence.
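To make the distinction concrete, here is a minimal sketch in Python (assuming NumPy and SciPy, with simulated rather than real data): a very large two-group comparison where a trivial half-point difference still comes out highly “significant,” yet the effect size exposes how small it is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate a very large two-group study with a tiny true difference:
# control mean 50, treatment mean 50.5, both with SD 10.
n = 25_000  # per group
control = rng.normal(50.0, 10.0, n)
treatment = rng.normal(50.5, 10.0, n)

# The t-test comfortably reaches "statistical significance"...
t_stat, p_value = stats.ttest_ind(treatment, control)

# ...but Cohen's d shows the groups differ by only about 0.05 SD, a trivial effect.
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"p-value: {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```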
Common Types of Effect Size
Different research questions call for different effect size measures. The three most widely used in meta-analysis each handle a specific type of comparison.
Standardized Mean Difference
When studies compare two groups on a continuous outcome (like test scores, blood pressure, or pain ratings), the standardized mean difference expresses how far apart the group averages are in standard deviation units. The two most common versions are Cohen’s d and Hedges’ g, which differ only in a small correction Hedges’ g applies to avoid overestimating effects in small samples. A value of 0.2 is generally considered a small effect, 0.5 medium, and 0.8 large, though these benchmarks depend heavily on the field. In clinical medicine, a “small” effect size can represent thousands of lives saved when applied across a population.
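As a rough illustration, here is how both measures could be computed from summary statistics. The numbers are invented, and the Hedges’ g correction uses the common approximation to the exact small-sample factor.

```python
import numpy as np

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d with the small-sample correction factor J applied."""
    d = cohens_d(mean1, sd1, n1, mean2, sd2, n2)
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)  # common approximation to the exact correction
    return d * j

# Illustrative summary statistics from a hypothetical small trial.
print(cohens_d(28.0, 9.5, 20, 24.0, 10.0, 22))  # ~0.41, a small-to-medium effect
print(hedges_g(28.0, 9.5, 20, 24.0, 10.0, 22))  # slightly smaller after the correction
```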
Odds Ratios and Risk Ratios
When the outcome is binary (something either happens or it doesn’t, like developing a disease or recovering from one), meta-analyses typically use odds ratios or risk ratios. An odds ratio of 1.0 means there’s no difference between groups. An odds ratio of 2.0 means the odds of the event are twice as high in one group as in the other, which translates to roughly “twice as likely” only when the event is rare; risk ratios compare the probabilities directly. These measures are especially common in medical meta-analyses comparing treatments or examining risk factors.
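A small sketch of how both ratios fall out of a 2×2 table, using made-up counts, shows why they can diverge when the event is common:

```python
def odds_and_risk_ratio(events_treat, n_treat, events_ctrl, n_ctrl):
    """Odds ratio and risk ratio from the counts of a hypothetical 2x2 table."""
    risk_treat = events_treat / n_treat
    risk_ctrl = events_ctrl / n_ctrl
    odds_treat = events_treat / (n_treat - events_treat)
    odds_ctrl = events_ctrl / (n_ctrl - events_ctrl)
    return odds_treat / odds_ctrl, risk_treat / risk_ctrl

# 30 of 200 patients recover on treatment vs. 18 of 200 on control (invented numbers).
or_, rr = odds_and_risk_ratio(30, 200, 18, 200)
print(f"Odds ratio: {or_:.2f}, Risk ratio: {rr:.2f}")  # OR ~1.78, RR ~1.67
```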
Correlation Coefficients
When studies measure the strength of a relationship between two variables rather than comparing groups, the correlation coefficient (r) serves as the effect size. Values range from -1 to 1, where 0 means no relationship. In behavioral research, correlations of 0.10, 0.30, and 0.50 are often used as rough benchmarks for small, medium, and large effects.
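A brief illustration with simulated data follows; the Fisher z step at the end reflects a common meta-analytic practice for pooling correlations rather than anything required by the definition of r.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative variables with a moderate underlying relationship.
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")  # roughly a "medium" effect by the usual behavioral benchmarks

# Meta-analyses often pool correlations on the Fisher z scale, which has a
# simpler variance (1 / (n - 3)), then convert the pooled value back to r.
z = np.arctanh(r)
print(f"Fisher z = {z:.2f}")
```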
How Meta-Analysis Combines Effect Sizes
The core mechanics of a meta-analysis involve converting each study’s results into a common effect size metric, then combining those individual effect sizes into a single pooled estimate. This isn’t a simple average. Each study’s effect size gets weighted, typically by the inverse of its variance, which means larger, more precise studies contribute more to the final result than smaller, less precise ones.
Think of it like polling. If one survey interviews 100 people and another interviews 10,000, you wouldn’t trust both equally. The meta-analytic weighting works on the same principle. The pooled effect size that emerges represents a weighted average across all included studies, and it comes with a confidence interval that tells you the range within which the true effect likely falls.
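Here is a minimal sketch of inverse-variance pooling under a fixed-effect model, with hypothetical effect sizes and variances standing in for real study data:

```python
import numpy as np

def inverse_variance_pool(effects, variances):
    """Fixed-effect pooled estimate: weight each study by 1 / variance."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(weights * effects) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))            # standard error of the pooled estimate
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)  # 95% confidence interval
    return pooled, ci

# Hypothetical standardized mean differences and their variances from five studies.
effects = [0.10, 0.45, -0.05, 0.70, 0.38]
variances = [0.04, 0.02, 0.09, 0.03, 0.05]
pooled, ci = inverse_variance_pool(effects, variances)
print(f"Pooled d = {pooled:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```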
Researchers choose between two statistical models for this pooling. A fixed-effect model assumes every study is estimating the exact same underlying effect, and all variation between studies is due to chance. A random-effects model assumes the true effect might genuinely vary from study to study (because of differences in populations, settings, or methods) and accounts for that extra variability. Random-effects models are more conservative and more commonly used, since it’s rarely realistic to assume all studies share an identical true effect.
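The random-effects version needs an estimate of the between-study variance (tau-squared). The sketch below uses the widely used DerSimonian-Laird estimator, again with made-up inputs; real meta-analysis software offers several alternatives.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau-squared estimate."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v

    fixed = np.sum(w * y) / np.sum(w)            # fixed-effect estimate
    q = np.sum(w * (y - fixed) ** 2)             # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)      # between-study variance

    w_star = 1.0 / (v + tau2)                    # weights widened by tau-squared
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2

effects = [0.10, 0.45, -0.05, 0.70, 0.38]
variances = [0.04, 0.02, 0.09, 0.03, 0.05]
pooled, ci, tau2 = dersimonian_laird(effects, variances)
print(f"Pooled d = {pooled:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, tau^2 = {tau2:.3f}")
```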
Heterogeneity: When Effect Sizes Disagree
One of the most important things a meta-analysis examines is how much the individual effect sizes vary from study to study. This variability is called heterogeneity. If 20 studies all find effect sizes clustered tightly around 0.4, you can be fairly confident in that estimate. If they’re scattered from -0.1 to 1.2, the pooled average is less informative on its own, and the more interesting question becomes why the results differ so much.
Researchers quantify heterogeneity using a statistic called I², which represents the percentage of variation across studies that reflects real differences rather than sampling error. An I² of 0% means all variability is consistent with chance. Values around 25% suggest low heterogeneity, 50% moderate, and 75% or higher substantial heterogeneity. When heterogeneity is high, meta-analysts often conduct subgroup analyses or meta-regression to identify what factors (patient age, dosage, study quality, treatment duration) explain the differences in effect sizes across studies.
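I² follows directly from Cochran’s Q, the weighted sum of squared deviations from the pooled estimate. A short sketch with hypothetical, deliberately scattered effect sizes:

```python
import numpy as np

def i_squared(effects, variances):
    """Cochran's Q and the I-squared heterogeneity statistic."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - pooled) ** 2)
    df = len(y) - 1
    return q, max(0.0, (q - df) / q) * 100  # percent of variation beyond sampling error

effects = [0.10, 0.45, -0.05, 0.70, 0.38]
variances = [0.04, 0.02, 0.09, 0.03, 0.05]
q, i2 = i_squared(effects, variances)
print(f"Q = {q:.2f}, I^2 = {i2:.0f}%")  # moderate heterogeneity for these invented studies
```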
Interpreting Effect Sizes in Context
The benchmarks mentioned earlier (Cohen’s small, medium, and large categories) are useful starting points, but they can be misleading if applied blindly. What counts as a “meaningful” effect size depends entirely on the context. In education research, an effect size of 0.2 on student achievement might represent months of additional learning. In preventive medicine, an odds ratio of 1.3 for a common disease could translate to millions of additional cases across a population.
It also helps to translate effect sizes into more intuitive metrics. Some meta-analyses convert their findings into the “number needed to treat,” which tells you how many patients would need to receive a treatment for one additional person to benefit. Others express effect sizes as overlap between two groups’ distributions, showing what percentage of the treatment group outperformed the average person in the control group. A Cohen’s d of 0.5, for instance, means that roughly 69% of the treatment group scored above the control group’s average, compared to the 50% you’d expect with no effect at all.
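Both translations are easy to compute. The sketch below derives the 69% figure (often called Cohen’s U3) from the normal distribution, and computes a number needed to treat from two illustrative event rates:

```python
from scipy.stats import norm

# Cohen's U3: the share of the treatment group scoring above the control group's mean,
# assuming roughly normal outcomes with equal spread in both groups.
d = 0.5
u3 = norm.cdf(d)
print(f"U3 for d = {d}: {u3:.0%}")  # about 69%

# Number needed to treat from two hypothetical event rates: 1 / absolute risk difference.
risk_control, risk_treatment = 0.20, 0.12
nnt = 1 / (risk_control - risk_treatment)
print(f"NNT: {nnt:.1f}")  # treat ~13 patients for one additional person to benefit
```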
What Can Distort Effect Sizes
Pooled effect sizes are only as trustworthy as the studies that feed into them. Several well-documented problems can inflate or distort the estimate a meta-analysis produces.
Publication bias is the most widely recognized threat. Studies with large, statistically significant effects are more likely to get published than studies with null or small findings. If the unpublished studies are missing from the meta-analysis, the pooled effect size will be artificially inflated. Researchers test for this using funnel plots (which graph each study’s effect size against its precision and look for asymmetry) and statistical tests that estimate how many missing studies it would take to nullify the result.
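A funnel plot is simple to draw once each study’s effect size and standard error are in hand. This sketch (assuming matplotlib, with simulated studies skewed the way publication bias would skew them) shows the characteristic asymmetry:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical effect sizes and standard errors; smaller studies (larger SE) are
# simulated with somewhat larger effects to mimic publication bias.
se = rng.uniform(0.05, 0.40, 40)
effects = 0.30 + 0.8 * se * rng.uniform(0, 1, 40) + rng.normal(0, se)

plt.scatter(effects, se)
plt.gca().invert_yaxis()             # precise studies at the top, by convention
plt.axvline(effects.mean(), linestyle="--")
plt.xlabel("Effect size")
plt.ylabel("Standard error")
plt.title("Funnel plot: asymmetry hints at missing small null studies")
plt.show()
```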
Small-study effects present a related concern. Smaller studies tend to report larger effect sizes, sometimes because only the most dramatic small-study results get published, sometimes because of methodological differences. Low study quality can also inflate effects: trials without proper blinding or randomization tend to produce larger effect sizes than rigorously designed studies. Good meta-analyses account for these issues by conducting sensitivity analyses, removing studies one at a time or excluding lower-quality evidence to see whether the pooled estimate holds up.
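A leave-one-out sensitivity analysis is straightforward to sketch: recompute the pooled estimate with each study dropped in turn and check whether any single study drives the result (hypothetical numbers again):

```python
import numpy as np

def pooled(effects, variances):
    """Simple inverse-variance pooled estimate."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return np.sum(w * np.asarray(effects, dtype=float)) / np.sum(w)

effects = [0.10, 0.45, -0.05, 0.70, 0.38]
variances = [0.04, 0.02, 0.09, 0.03, 0.05]

# Leave-one-out: re-pool with each study removed and compare against the full estimate.
print(f"All studies: pooled = {pooled(effects, variances):.2f}")
for i in range(len(effects)):
    rest_e = effects[:i] + effects[i + 1:]
    rest_v = variances[:i] + variances[i + 1:]
    print(f"Without study {i + 1}: pooled = {pooled(rest_e, rest_v):.2f}")
```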
Measurement differences between studies can also complicate things. If one study measures depression with a 10-item questionnaire and another uses a clinical interview, their effect sizes might not be perfectly comparable even after standardization. This is one reason why clearly defined inclusion criteria and careful coding of study characteristics are so important in producing a reliable meta-analytic result.

