Statistical power is calculated with a simple formula: Power = 1 − β, where β (beta) is the probability of a false negative, meaning you miss a real effect. In practice, though, calculating power means plugging four interconnected variables into a formula or software tool. If you know any three of them, you can solve for the fourth. Most researchers aim for a power of at least 80%, which means an 80% chance of detecting a real effect if one exists.
The Four Variables You Need
Every power calculation relies on four pieces:
- Significance level (alpha): The threshold for calling a result “statistically significant.” In most academic research, this is set at 0.05 (a 5% chance of a false positive). In clinical trials submitted to the FDA, the standard is stricter: 0.025, one-sided.
- Effect size: How large of a difference or relationship you expect to find. A bigger expected effect means you need fewer participants to detect it.
- Sample size: The number of observations or participants in your study.
- Variability (standard deviation): How spread out your data is. More variability in measurements makes effects harder to detect, which lowers power. You typically estimate this from a pilot study or prior research.
These variables, together with power itself, are locked together: fix all but one and you can solve for the remainder. (In practice, variability is usually folded into a standardized effect size, leaving four quantities to juggle: alpha, power, effect size, and sample size.) The most common use case is fixing alpha at 0.05, choosing your target power (usually 80% or 90%), estimating your effect size, and then solving for the sample size you need.
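This "fix the rest, solve for one" relationship can be sketched with statsmodels in Python: pass the knowns to solve_power and leave the quantity you want as unset. The inputs below (d = 0.5, n = 100) are illustrative, not from any particular study.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()  # power analysis for an independent two-sample t-test

# Fix alpha, power, and effect size -> solve for the sample size per group.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

# Or fix alpha, power, and sample size -> solve for the smallest detectable effect.
d_min = analysis.solve_power(nobs1=100, alpha=0.05, power=0.80)

print(round(n), round(d_min, 2))  # ~64 per group; minimum detectable d ~0.4
```

Whichever argument you omit is the one solve_power computes, which makes it easy to explore the tradeoffs before committing to a design.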
How Effect Size Works
Effect size is the piece most people struggle with because it requires you to predict how large the difference or relationship will be before you’ve run the study. There are standard benchmarks to guide you. For Cohen’s d, which measures the difference between two group means in standard deviation units, the conventional thresholds are 0.20 for a small effect, 0.50 for medium, and 0.80 for large. For correlations (Pearson’s r), the equivalent benchmarks are 0.10, 0.30, and 0.50.
These benchmarks come from Jacob Cohen’s 1988 guidelines, and they’re still widely used. However, real-world research often produces smaller effects than these categories suggest. A large-scale analysis of over 2,900 studies in aging research found that actual “large” effects averaged around 0.76 for group comparisons and 0.32 for correlations, both below Cohen’s original thresholds. The practical takeaway: if you’re unsure what effect size to expect, lean toward a smaller estimate. Overestimating your effect size leads to an underpowered study.
The best source for your effect size estimate is previous research on the same topic. If no prior work exists, a small pilot study can give you a rough number to work with.
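If you do run a pilot, Cohen's d is straightforward to compute from the raw scores: the difference in group means divided by the pooled standard deviation. Here is a minimal sketch with entirely made-up pilot data for two teaching methods:

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical pilot test scores (not real data).
method_a = [78, 85, 90, 72, 88, 81, 79, 86]
method_b = [70, 75, 82, 68, 80, 74, 77, 73]

d = cohens_d(method_a, method_b)
print(round(d, 2))
```

Keep in mind that effect sizes from small pilots are noisy, so treat the result as a rough anchor, not a precise estimate.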
A Step-by-Step Example
Say you’re planning a study comparing test scores between two groups after different teaching methods. Here’s how you’d walk through the power calculation:
1. Set your significance level. For most purposes, alpha = 0.05.
2. Decide on your target power. The standard minimum is 80%, meaning you accept a 20% chance of missing a real effect. If the stakes are high, bump this to 90%.
3. Estimate your effect size. Based on similar studies, you expect a medium effect (Cohen's d = 0.50).
4. Solve for the unknown. Either you know your sample size and want to check whether it's large enough, or you want to find the minimum sample size needed.
With alpha = 0.05, power = 0.80, and d = 0.50, a two-sample t-test requires roughly 64 participants per group (128 total). If you can only recruit 40 per group, your power drops to roughly 60%, and you'd need to either accept that risk or redesign the study.
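As a quick check of those numbers with statsmodels (using the same assumed inputs as the example):

```python
from statsmodels.stats.power import TTestIndPower

t = TTestIndPower()

# Sample size needed per group for d = 0.5, alpha = 0.05, power = 0.80.
n_needed = t.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_needed))  # ~64 per group

# Power achieved if you can only recruit 40 per group.
power_40 = t.solve_power(effect_size=0.5, nobs1=40, alpha=0.05)
print(round(power_40, 2))  # ~0.60
```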
Before the Study vs. After the Study
Power analysis is meant to be done before you collect data. This is called a priori power analysis, and its purpose is to determine how many participants you need so your study has a realistic chance of detecting a meaningful effect. It shapes your budget, your timeline, and your recruitment strategy.
Post-hoc power analysis, done after data collection, is a different story and widely considered flawed. The core problem is that power describes the probability of a future event: detecting an effect that may exist. Once you’ve already collected the data, that event has already happened. A post-hoc calculation substitutes the observed effect size for the true population effect size, but these can differ substantially due to sampling variability. The result is a power estimate that’s mathematically tied to your p-value and tells you nothing you didn’t already know. If your result was significant, post-hoc power will be high; if it wasn’t, post-hoc power will be low. It’s circular.
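The circularity can be made concrete. In the sketch below (an illustrative scenario, not a recommendation to ever run this calculation), a study of 64 per group lands exactly at p = 0.05; back-calculating "observed power" from the observed effect size yields roughly 50%, purely as a mechanical consequence of the p-value:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Hypothetical two-group study, n = 64 per group, result exactly at p = 0.05:
# the observed t statistic equals the two-sided critical value.
n = 64
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)     # critical t for alpha = 0.05, two-sided
d_obs = t_crit * np.sqrt(2 / n)     # observed effect size implied by that t

# "Post-hoc power" computed from the observed effect size.
posthoc = TTestIndPower().solve_power(effect_size=d_obs, nobs1=n, alpha=0.05)
print(round(posthoc, 2))            # ~0.5, regardless of the true effect
```

A borderline-significant result always "has" about 50% post-hoc power; the calculation adds no information beyond the p-value itself.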
Software Tools for Power Calculations
You don’t need to solve power equations by hand. G*Power is the most popular free tool for this. It has a graphical interface, supports t-tests, chi-square tests, F-tests, z-tests, and exact tests, and walks you through selecting your test type, input parameters, and output. You can download it from Heinrich-Heine-Universität Düsseldorf’s website.
If you work in R, the pwr package handles the same calculations with a few lines of code. For a two-sample t-test, you’d call pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample"), and it returns the required sample size per group. Python users can use the statsmodels library, which includes similar power analysis functions. For simple scenarios, online power calculators from sites like ClinCalc or Raosoft work fine, though they’re limited to basic test types.
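For Python, the statsmodels call mirroring that R example looks something like this (a minimal sketch using TTestIndPower from statsmodels.stats.power):

```python
from statsmodels.stats.power import TTestIndPower

# Same inputs as the pwr.t.test example: d = 0.5, alpha = 0.05, power = 0.80.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(n_per_group)  # ~63.8, so recruit 64 per group
```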
What Lowers Power (and How to Fix It)
Small sample sizes are the most obvious power killer, but they’re not the only one. High variability in your measurements has the same effect. If your outcome measure is noisy, with scores bouncing around widely from person to person, even a real effect gets buried in the noise. You can combat this by using more precise measurement tools, standardizing your procedures to reduce observer bias, or choosing outcome measures with less natural variability.
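The effect of measurement noise on power is easy to see numerically. In the sketch below, the raw group difference is held fixed at 5 points (a hypothetical value) while the outcome's standard deviation grows; Cohen's d shrinks accordingly, and power falls with it:

```python
from statsmodels.stats.power import TTestIndPower

# Fixed raw difference of 5 points between groups, n = 64 per group.
# As the outcome SD grows, the standardized effect (d = diff / SD) shrinks.
raw_diff = 5.0
powers = []
for sd in (5.0, 10.0, 20.0):  # hypothetical outcome standard deviations
    d = raw_diff / sd
    p = TTestIndPower().solve_power(effect_size=d, nobs1=64, alpha=0.05)
    powers.append(p)
    print(f"SD={sd:4.0f}  d={d:.2f}  power={p:.2f}")
```

Doubling the noise here has the same consequence as a drastic cut in effect size, which is why tightening your measurement can be as valuable as recruiting more participants.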
Your choice of statistical test matters too. Non-parametric tests, which rank data instead of using raw values, are generally less powerful than parametric tests like the t-test when the parametric test's assumptions hold. This tradeoff is sometimes necessary when your data isn't normally distributed, but it means you'll need more participants to achieve the same power. Combining a non-parametric test with a small sample size is a double hit to your ability to detect effects.
A few other design strategies can boost power without simply adding more participants. Reducing participant dropout and non-adherence preserves the effective sample size you planned for. Using a paired or repeated-measures design, where you measure the same people before and after an intervention, reduces variability because each person serves as their own control. And choosing a one-tailed test instead of a two-tailed test increases power, though this is only appropriate when you have a strong directional hypothesis and can justify it before data collection.
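The paired-design advantage shows up clearly in the sample size math. The sketch below compares an independent-groups design with a paired design analyzed as a one-sample t-test on difference scores, assuming (for illustration) the same standardized effect of d = 0.5 in both; in practice the paired effect size depends on how correlated the repeated measurements are:

```python
import math
from statsmodels.stats.power import TTestIndPower, TTestPower

# Independent-groups design: sample size per group for d = 0.5.
n_independent = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                            power=0.80)

# Paired design: one-sample t-test on difference scores, same assumed d = 0.5.
n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)

print(math.ceil(n_independent), "per group (independent);",
      math.ceil(n_paired), "total (paired)")
```

Under these assumptions the paired design needs roughly half the participants, and fewer still if the pre/post measurements are strongly correlated.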
Common Power Targets by Field
The 80% power standard applies broadly across academic research, but some fields demand more. FDA-regulated clinical trials conventionally require 80% to 90% power at a significance level of 0.025 (one-sided), which is more conservative than the typical 0.05 used in academic work. Sponsors submitting trial designs to the FDA must demonstrate that their sample size achieves the desired power under a range of plausible effect sizes, not just the most optimistic one.
In fields like psychology and education, where effect sizes tend to be small, reaching 80% power often requires sample sizes in the hundreds. In pharmacology, where drug effects can be large and measurable, smaller trials may suffice. The numbers always come back to the same tradeoff: the smaller the effect you’re trying to detect, the more data you need to detect it reliably.

