Statistical power is the probability that a study will detect a real effect when one truly exists. Expressed as a number between 0 and 1 (or 0% to 100%), a power of 0.80 means an experiment has an 80% chance of producing a statistically significant result if the effect being tested genuinely exists. The conventional target in most research is 0.80, or 80%, though studies sometimes aim higher when missing a real effect would be especially costly.
Why Power Matters
Every study carries two risks of getting things wrong. A Type I error means concluding something works when it actually doesn’t (a false positive). A Type II error means missing a real effect entirely (a false negative). Power is directly tied to the second risk: it equals 1 minus the Type II error rate. So if your power is 0.80, there’s a 20% chance you’ll fail to detect a real effect and wrongly conclude nothing is going on.
This has serious real-world consequences. An underpowered clinical trial testing a genuinely effective treatment might conclude the treatment doesn’t work, simply because the study wasn’t set up to catch the signal. Patients and future researchers are then steered away from something that could have helped. Underpowered trials are also less likely to get published, meaning the time, money, and participation of volunteers can be wasted entirely. Some ethicists argue that running an underpowered trial is itself unethical, because it exposes participants to risks without a reasonable chance of producing useful knowledge.
The Four Ingredients of Power
Power analysis involves four interconnected parameters. If you know any three of them, you can calculate the fourth. This is what makes power analysis so useful for planning a study.
- Sample size (N): The number of participants or observations. Larger samples give you more information and a better chance of detecting a real effect.
- Effect size: How big the real-world difference or relationship actually is. A drug that cuts symptoms in half is easier to detect than one that reduces them by 5%.
- Significance level (alpha): The threshold for calling a result “statistically significant,” usually set at 0.05. This is the maximum false positive rate you’re willing to accept.
- Power (1 minus beta): The probability of detecting a true effect, conventionally set at 0.80.
The most common use of power analysis is plugging in a target power (0.80), a significance level (0.05), and an estimated effect size to figure out how many participants you need before the study begins.
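As a concrete illustration, here is a minimal sketch of that calculation in Python, assuming a two-sample t-test and the statsmodels library; the effect size of 0.5 is an illustrative assumption, not a recommendation.

```python
# Minimal sketch of an a priori sample-size calculation, assuming a
# two-sample (independent groups) t-test and a hypothetical
# standardized effect size (Cohen's d) of 0.5.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,   # assumed Cohen's d (hypothetical value)
    alpha=0.05,        # significance level
    power=0.80,        # target power
    alternative="two-sided",
)
print(f"Required participants per group: {n_per_group:.0f}")
```

With these inputs the answer works out to roughly 64 participants per group, a number that is only as trustworthy as the assumed effect size behind it.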
How Sample Size Drives Power
Of all the factors that influence power, sample size is the one researchers have the most control over. The relationship is straightforward: more data means more power. With a small sample, natural variability can easily drown out a real signal. With a larger sample, random noise averages out and genuine patterns become visible.
This is why you’ll see large clinical trials enrolling thousands of participants. When the expected effect is small, like a modest reduction in blood pressure from a new medication, you need a lot of people to reliably distinguish the drug’s effect from normal fluctuation. A study of 30 people might have only 40% power to detect that small difference, meaning it would miss the effect more often than it caught it. The same study with 300 people might reach 90% power.
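The exact percentages depend entirely on the effect size you plug in, but a short sketch (again assuming a two-sample t-test, statsmodels, and a hypothetical effect size of 0.3) makes the pattern visible:

```python
# Sketch: how power changes with sample size for a two-sample t-test,
# using an assumed (hypothetical) small-to-moderate effect size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3  # assumed Cohen's d; swap in your own estimate

for n_per_group in (30, 100, 300, 1000):
    power = analysis.power(
        effect_size=effect_size,
        nobs1=n_per_group,
        alpha=0.05,
        alternative="two-sided",
    )
    print(f"n = {n_per_group:>4} per group -> power = {power:.2f}")
```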
Effect Size: The Signal You’re Trying to Catch
Effect size describes how large the real phenomenon is. Think of it as the signal strength. A loud signal is easy to pick up even with crude equipment. A faint signal requires much more sensitive detection.
In statistical terms, the standardized effect size accounts for both the magnitude of the difference and how much natural variation exists in the population. There are two routes to a large standardized effect size: either the actual difference is big, or the variation among individuals is small. A study comparing a potent painkiller to a placebo has a large effect size because the difference in pain relief is dramatic. A study comparing two similar painkillers has a small effect size because the difference between them is subtle, and you’ll need far more participants to detect it.
Researchers often estimate effect sizes from previous studies or pilot data. When no prior data exists, conventions developed by the statistician Jacob Cohen provide rough benchmarks for small, medium, and large effects. These estimates are critical because an overestimated effect size leads to an undersized study, one that looks well-planned on paper but can’t actually detect what it set out to find.
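For a difference between two group means, the usual standardized effect size is Cohen's d: the difference in means divided by the pooled standard deviation. A minimal sketch with made-up pilot data:

```python
# Sketch: Cohen's d from two samples of hypothetical pilot data.
# d = (mean1 - mean2) / pooled standard deviation.
import numpy as np

treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4])  # made-up values
control   = np.array([4.2, 5.0, 4.6, 5.5, 4.9, 5.1])  # made-up values

n1, n2 = len(treatment), len(control)
s1, s2 = treatment.std(ddof=1), control.std(ddof=1)
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")

# Cohen's rough benchmarks for d: about 0.2 small, 0.5 medium, 0.8 large.
```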
The Trade-Off With False Positives
Power and the false positive rate sit on opposite ends of a seesaw. When you lower the significance threshold to reduce false positives (say, from 0.05 to 0.005), you make it harder to reach statistical significance. That increased strictness reduces power unless you compensate by increasing sample size. This is a real tension in science: tightening your standards for evidence means you need more resources to maintain the same ability to detect true effects.
Reducing the probability of one type of error inherently increases the risk of the other, unless you adjust something else. In practice, this usually means recruiting more participants, which costs more time and money. The conventional balance of alpha at 0.05 and power at 0.80 is not based on any deep scientific principle. It’s a widely accepted convention that researchers and funding agencies have settled on as a reasonable trade-off.
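That cost shows up directly in the sample-size calculation. A quick sketch, holding target power and an assumed effect size fixed while tightening alpha:

```python
# Sketch: how a stricter alpha raises the required sample size,
# holding power (0.80) and an assumed effect size (d = 0.5) fixed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.005):
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.80,
                             alternative="two-sided")
    print(f"alpha = {alpha:<5} -> about {n:.0f} participants per group")
```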
When Power Analysis Should (and Shouldn’t) Be Done
Power analysis is most valuable before a study begins. This is called an a priori power analysis, and it’s the standard approach for responsible study design. The process involves choosing the right statistical test for your research question, estimating the expected effect size, setting alpha (usually 0.05) and target power (usually 0.80), then calculating the sample size needed. Free software tools like G*Power walk researchers through these steps for a wide range of statistical tests.
What researchers should avoid is calculating power after a study is already finished, known as post hoc (or retrospective) power analysis. This is a surprisingly common mistake, sometimes even requested by peer reviewers, but it’s both mathematically and conceptually flawed. The core problem is that post hoc power is calculated from the observed effect size, which may not reflect the true effect size. Worse, post hoc power has a one-to-one relationship with the p-value from the study’s results. If a study produces a p-value above 0.05 (a non-significant result), the post hoc power will always be below 50%, regardless of how many participants were enrolled. This means post hoc power can’t distinguish between a study that failed because it was too small and a study that failed because there was simply no real effect to find. It tells you nothing you didn’t already know from the p-value itself.
When a study produces non-significant results, the more useful approach is to report confidence intervals, which show the range of effect sizes compatible with the data. This gives readers a much clearer picture of whether the study was too imprecise to be conclusive or whether the true effect is likely very small or nonexistent.
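As a minimal sketch of that kind of reporting, here is a pooled-variance t interval for a difference in group means, computed with scipy on made-up data:

```python
# Sketch: 95% confidence interval for a difference in group means,
# using a pooled-variance t interval. Data are hypothetical.
import numpy as np
from scipy import stats

treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4])  # made-up values
control   = np.array([4.2, 5.0, 4.6, 5.5, 4.9, 5.1])  # made-up values

n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()
sp2 = (((n1 - 1) * treatment.var(ddof=1) +
        (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(f"Difference: {diff:.2f}, "
      f"95% CI: [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
```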
What 80% Power Actually Means in Practice
Setting power at 0.80 means accepting a 1-in-5 chance of missing a real effect. That might sound surprisingly high, and in some contexts it is. For a pivotal trial of a life-saving medication, some researchers argue power should be 0.90 or higher. For an exploratory study with limited resources, 0.80 may be a reasonable compromise. The “right” level of power depends on what’s at stake if you get a false negative result.
It’s also worth remembering that power is a property of the study design, not a guarantee about any single result. If you run a study with 80% power, it doesn’t guarantee you’ll detect the effect. It means that if you could repeat the same study many times, about 80% of those repetitions would produce a statistically significant finding. The remaining 20% would miss it, even though the effect is real. This is why replication across multiple studies is so important in science. No single study, no matter how well powered, is a guarantee.
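The repeated-study interpretation is easy to demonstrate with a simulation. The sketch below assumes a true effect of d = 0.5 and 64 participants per group (a design with roughly 80% power) and counts how often a simulated study reaches significance:

```python
# Sketch: simulate many repetitions of the "same" study and count how
# often a two-sample t-test reaches p < 0.05. With an assumed true effect
# of d = 0.5 and 64 participants per group, roughly 80% should.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, n_studies = 64, 0.5, 5000

hits = 0
for _ in range(n_studies):
    treated = rng.normal(loc=true_d, scale=1.0, size=n_per_group)
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(treated, control)
    hits += p < 0.05
print(f"Significant in {hits / n_studies:.0%} of simulated studies")
```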

