When to Correct for Multiple Comparisons and When Not To

You should correct for multiple comparisons whenever you’re testing several hypotheses on the same dataset and a single false positive would undermine your conclusions. That’s the short answer, but the real question most researchers face is more nuanced: which situations genuinely demand correction, which don’t, and how aggressive should the correction be? The answer depends on whether your work is confirmatory or exploratory, how many tests you’re running, and what the cost of a false positive looks like in your specific context.

Why Multiple Tests Create a Problem

Every time you run a statistical test at the conventional 0.05 significance level, you accept a 5% chance of a false positive. That’s manageable for a single test. But the probability of at least one false positive across a set of tests compounds quickly, following the formula 1 − (1 − α)^n, where n is the number of tests. If you run 20 independent tests and the null hypothesis is true for all of them, the chance of at least one false positive jumps to 0.64. That means you’d get a “significant” result more often than not, purely by chance.
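The compounding is easy to verify numerically. A minimal sketch (the function name is illustrative):

```python
# Probability of at least one false positive across n independent
# tests, each run at significance level alpha: 1 - (1 - alpha)^n.
def familywise_error_rate(alpha: float, n: int) -> float:
    return 1 - (1 - alpha) ** n

print(round(familywise_error_rate(0.05, 1), 2))    # one test: 0.05
print(round(familywise_error_rate(0.05, 20), 2))   # twenty tests: 0.64
print(round(familywise_error_rate(0.05, 100), 2))  # a hundred tests: 0.99
```

By 100 tests, a spurious “hit” is all but guaranteed if nothing is done about the threshold.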

This inflated error rate across a set of tests is called the family-wise error rate. Correction methods exist to keep it at or below your chosen threshold (usually 0.05), so that the overall risk of any false positive remains controlled. The trade-off is real, though: every correction that reduces false positives also increases false negatives. An effective treatment can look no better than placebo simply because the significance bar was raised too high.

When Correction Is Clearly Necessary

Confirmatory research, where you’re testing predefined hypotheses to draw firm conclusions, is the clearest case for correction. Clinical trials with multiple primary or secondary endpoints fall squarely here. The FDA’s guidance on multiple endpoints in clinical trials states that failure to account for multiplicity can lead to false positive conclusions about a drug’s effects, and multiplicity adjustments are expected for trials intended to demonstrate effectiveness and support approval.

Any study where you’re comparing many groups after an omnibus test (like ANOVA) also calls for correction. If you find a significant overall effect and then test every possible pair of groups, the pairwise comparisons need adjustment. The same applies to brain imaging studies testing thousands of locations, or any analysis where the number of tests is large enough that chance alone would produce “hits.”

The most extreme example is genomics. Genome-wide association studies test hundreds of thousands or millions of genetic variants simultaneously. The field has settled on a fixed significance threshold of 5 × 10⁻⁸, roughly equivalent to a Bonferroni correction for one million independent tests. That standard has held for over a decade because the cost of chasing a false positive through expensive laboratory follow-up is high.

When Correction May Not Be Needed

Not every set of multiple tests belongs to the same “family” requiring joint correction. The key principle is that correction is justified when several tests collectively address a single overarching question. If your tests address genuinely separate research questions, each with its own hypothesis, many statisticians argue they can be evaluated independently at the standard threshold.

Preplanned comparisons are another common exception. If you specify a small number of specific contrasts before collecting data, based on theory or prior evidence, these carry more weight than post-hoc exploration. A set of fewer than roughly ten planned contrasts is generally treated differently from exhaustive pairwise testing, though even planned comparisons benefit from at least a mild correction when the number grows.

Exploratory or hypothesis-generating research occupies a gray area. When you’re screening many variables to identify patterns worth investigating further, aggressive correction can bury real signals. The consensus here is unsettled. Some methodologists argue that correction is only justified when you need to test an omnibus null hypothesis piecewise, and that exploratory analyses should flag promising findings for future confirmation rather than applying strict thresholds that eliminate them prematurely.

Choosing the Right Correction Method

Controlling for Any False Positive (FWER)

When the cost of even one false positive is high, you want family-wise error rate control. The Bonferroni correction is the simplest approach: divide your significance level by the number of tests. Testing six hypotheses at α = 0.05 means each individual test must reach p < 0.0083. It’s easy to apply but becomes very conservative with many tests or when tests are correlated, making it harder to detect real effects.
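As a sketch, the adjustment amounts to a single division (the p-values below are made up for illustration):

```python
# Bonferroni: compare each p-value against alpha divided by the
# number of tests, rather than against alpha itself.
def bonferroni_reject(p_values, alpha=0.05):
    threshold = alpha / len(p_values)  # 0.05 / 6 ~= 0.0083 for six tests
    return [p < threshold for p in p_values]

pvals = [0.001, 0.008, 0.012, 0.030, 0.200, 0.700]
print(bonferroni_reject(pvals))
# [True, True, False, False, False, False] -- only the two smallest survive
```

Note that 0.012 would count as significant at the unadjusted 0.05 level but fails the corrected threshold.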

The Holm method (sometimes called Holm-Bonferroni) offers the same family-wise protection with more statistical power. It works by ranking your p-values from smallest to largest and comparing each to a progressively less strict threshold. The smallest p-value must meet the full Bonferroni criterion, but subsequent ones face easier cutoffs. If any p-value fails its threshold, all remaining (larger) p-values are declared nonsignificant. There’s no reason to use the original Bonferroni when Holm is available, since Holm is never less powerful and controls the same error rate.
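The step-down logic can be sketched in a few lines (p-values are illustrative):

```python
# Holm step-down: the k-th smallest p-value (k = 0, 1, ...) is compared
# against alpha / (n - k); the first failure makes every larger
# p-value nonsignificant as well.
def holm_reject(p_values, alpha=0.05):
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (n - k):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values fail too
    return reject

pvals = [0.012, 0.008, 0.039, 0.041, 0.200, 0.001]
# Plain Bonferroni (threshold 0.05/6 ~= 0.0083) would reject only
# 0.001 and 0.008; Holm's relaxed later thresholds also admit 0.012.
print(holm_reject(pvals))  # [True, True, False, False, False, True]
```

The smallest p-value faces 0.05/6, the next 0.05/5, then 0.05/4, and so on, which is where the extra power comes from.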

Tolerating Some False Positives (FDR)

When you’re running many tests and care more about the proportion of false positives among your discoveries than about avoiding any single one, false discovery rate control is more appropriate. The Benjamini-Hochberg procedure controls FDR: at the q = 0.05 level, it guarantees that, on average, no more than 5% of the results you declare significant are false positives. This is less strict than FWER methods, which means more statistical power to detect real effects.
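A minimal sketch of the step-up procedure (p-values are illustrative):

```python
# Benjamini-Hochberg step-up: sort p-values ascending and find the
# largest rank k with p_(k) <= q * k / n; reject hypotheses 1..k.
def benjamini_hochberg_reject(p_values, q=0.05):
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    n_reject = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= q * k / n:
            n_reject = k  # keep the largest qualifying rank
    reject = [False] * n
    for i in order[:n_reject]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.020, 0.041, 0.045, 0.600]
# Bonferroni would keep only 0.001 and 0.008; BH also keeps 0.020.
print(benjamini_hochberg_reject(pvals))  # [True, True, True, False, False, False]
```

Because each rank k gets its own threshold q·k/n, the bar rises gently as you move up the sorted list instead of staying fixed at q/n for every test.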

FDR control is widely used in genomics, proteomics, and other high-throughput settings where thousands of tests are routine. If you’re testing 10,000 genes and are willing to accept that 5% of your “hits” might be false, Benjamini-Hochberg is the standard choice. If you need to be confident that every single hit is real, as in a confirmatory clinical trial, FWER methods are more appropriate.

Post-Hoc Tests After ANOVA

When comparing group means after a significant ANOVA result, the choice of post-hoc test depends on your study design. Tukey’s HSD test is the most commonly recommended option for unplanned pairwise comparisons. Simulation studies have found it less conservative than alternatives like the Bonferroni or Dunn-Šidák corrections, with lower rates of missed true effects. For unequal sample sizes, the Tukey-Kramer variant should be used instead.

If you’re only comparing several treatment groups to a single control group rather than testing every possible pair, Dunnett’s test is designed specifically for that situation and will give you more power than a general method. Scheffé’s test is the most conservative of the common options but has a useful property: it’s fully consistent with the ANOVA result, meaning a nonsignificant ANOVA will never produce a significant pairwise difference. It’s best suited for studies that test linear combinations of means rather than simple pairwise comparisons.

For data that violate ANOVA’s assumptions, two options are commonly used for unplanned comparisons: the Dunn procedure (typically following a Kruskal-Wallis test) when you can’t assume normal distributions, and the Games-Howell test when group variances are unequal.
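If you work in Python, SciPy (version 1.8 or later) ships Tukey’s HSD directly; the group data below are made up for illustration:

```python
from scipy.stats import tukey_hsd

# Three groups of measurements (illustrative numbers).
group_a = [24.5, 23.5, 26.4, 27.1, 29.9]
group_b = [28.4, 34.2, 29.5, 32.2, 30.1]
group_c = [26.1, 28.3, 24.3, 26.2, 27.8]

result = tukey_hsd(group_a, group_b, group_c)
print(result)               # pairwise mean differences with adjusted p-values
print(result.pvalue[0, 1])  # adjusted p-value for group_a vs. group_b
```

For the many-to-one design described above, `scipy.stats.dunnett` (SciPy 1.11+) implements Dunnett’s test in the same style.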

A Practical Decision Framework

The decision comes down to three questions. First, are these tests part of the same family? Tests addressing the same overarching question, run on the same dataset, belong together. Tests addressing separate hypotheses that happen to appear in the same paper may not. Second, is this confirmatory or exploratory? Confirmatory work demands correction; exploratory work may not, as long as findings are clearly labeled as preliminary. Third, what’s the cost of a false positive versus a false negative? When a false positive triggers expensive follow-up, regulatory action, or clinical decisions, use FWER control. When missing a true effect is equally costly and you can tolerate some noise, FDR control preserves more power.

Sample size also matters practically. Having more observations per group does more to reduce false negatives than having more groups does to increase them. If you’re planning a study with many comparisons, investing in larger samples per group will help offset the power loss that any correction method introduces.

One common mistake is applying no correction during analysis and then selectively reporting only the significant results. This is effectively the same as running uncorrected multiple tests, and it inflates false positive rates just as badly. If you test 20 variables and report only the three that reached significance, the reader has no way to evaluate whether those results are real or expected by chance. Transparency about the total number of tests conducted is just as important as the correction method you choose.