An adjusted p-value is a p-value that has been recalculated upward to account for the fact that multiple statistical tests were performed at the same time. The adjustment corrects for the inflated risk of false positives that occurs whenever you test more than one hypothesis in the same study. Without it, a study testing dozens or thousands of variables will almost certainly produce results that look significant but are actually coincidental.
Why Regular P-Values Break Down
A standard p-value threshold of 0.05 means you accept a 5% chance that any single result is a false positive. That works fine when you’re running one test. But the math changes quickly when you run several.
The probability of getting at least one false positive across m independent tests follows a simple formula: 1 − (1 − 0.05)^m. With just three comparisons, your real false-positive rate climbs to about 14.3%. With six, it hits 26.5%. Run 10 independent tests and there’s a 40% chance at least one will appear significant by pure coincidence. At 100 tests, that probability reaches 99.4%, meaning a false positive is virtually guaranteed.
| Number of tests | Chance of at least one false positive |
| --- | --- |
| 1 | 5% |
| 3 | 14.3% |
| 5 | 22.6% |
| 10 | 40% |
| 20 | 64% |
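The inflation in this table follows directly from the formula above. As a minimal sketch in Python (the `familywise_rate` helper is my own name, not a library function):

```python
def familywise_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha: 1 - (1 - alpha)^m."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 5, 10, 20, 100):
    print(f"{m:>3} tests: {familywise_rate(m):.1%}")
```

Running this reproduces the table, including the 99.4% figure at 100 tests.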
Adjusted p-values compensate for this inflation. Each individual p-value is pushed upward so that the overall error rate across all your tests stays at 0.05 (or whatever threshold you’ve chosen). A result that looked significant before adjustment may no longer clear the bar once you account for how many tests were performed.
The Two Main Approaches
Adjustment methods fall into two broad camps, and they control for different things.
Family-Wise Error Rate (FWER)
FWER methods try to keep the chance of making even a single false discovery below 0.05 across all tests combined. They’re strict. The most well-known is the Bonferroni correction, which simply multiplies each raw p-value by the total number of tests (or equivalently, divides the significance threshold by the number of tests). If you run 20 tests, a result needs a raw p-value below 0.0025 to count as significant.
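The Bonferroni rule is simple enough to sketch in a couple of lines of Python (the `bonferroni` helper name is mine):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each raw p-value by the
    number of tests, capping the result at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

# With three tests, only the first two adjusted values stay below 0.05.
print(bonferroni([0.001, 0.004, 0.03]))
```

The cap at 1 matters: an adjusted p-value is still a probability, so multiplying a large raw p-value by the number of tests must not push it past 1.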
The main criticism of Bonferroni is that it’s overly conservative, especially with large numbers of comparisons. By guarding so aggressively against false positives, it increases the risk of missing real effects. The interpretation of any single finding also becomes dependent on how many other tests happened to be in the study, which some statisticians find philosophically problematic. Stepwise alternatives like the Holm, Hochberg, and Hommel methods improve on Bonferroni’s power while still controlling the family-wise error rate, though Hochberg and Hommel additionally assume the tests are independent or positively dependent. In terms of statistical power (ability to detect real effects), these rank from least to most powerful: Bonferroni, Holm, Hochberg, Hommel.
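To make the step-down idea concrete, here is a sketch of the Holm procedure in Python, assuming its usual formulation: the smallest p-value is multiplied by m, the next by m − 1, and so on, with a running maximum so the adjusted values stay monotone (the `holm` helper name is mine):

```python
def holm(pvals):
    """Holm step-down adjusted p-values: same FWER control as
    Bonferroni, but uniformly more powerful."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):             # rank 0 = smallest p-value
        candidate = (m - rank) * pvals[idx]        # multiplier shrinks with rank
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

Because the multiplier shrinks at each step, every Holm-adjusted p-value is at most its Bonferroni counterpart, which is where the extra power comes from.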
False Discovery Rate (FDR)
FDR methods take a different, more lenient approach. Instead of trying to prevent any single false positive, they control the expected proportion of false positives among the results you call significant. An FDR-adjusted p-value of 0.05 means you’re willing to accept that about 5% of the results you’ve flagged as significant could be wrong. Compare that with an unadjusted 0.05 threshold, which caps the false-positive risk of each individual test at 5% but says nothing about how many of your significant results are false.
The most common FDR method is the Benjamini-Hochberg (BH) procedure. It works by ranking all your raw p-values from smallest to largest, then comparing each one to a threshold that scales with its rank. The smallest p-value gets the strictest threshold, and each subsequent one gets a slightly more generous cutoff. This makes BH less conservative than Bonferroni, giving it more power to detect true effects while still keeping the proportion of false discoveries in check.
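The ranking-and-scaling logic of BH can be sketched as follows, walking from the largest p-value downward and taking a running minimum so the adjusted values stay monotone (the function name is mine; in practice you would normally reach for a library implementation):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values: each ranked p-value
    is scaled by (number of tests) / rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):              # walk from largest p-value down
        idx = order[rank]
        candidate = pvals[idx] * m / (rank + 1)    # scale by m / rank (1-indexed)
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_min)
    return adjusted
```

Note how the smallest p-value is scaled by the full factor m (matching Bonferroni), while later ranks get progressively gentler multipliers, which is exactly why BH admits more discoveries.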
A variant called Benjamini-Yekutieli (BY) is designed for situations where the tests are not independent of each other, which is common in biological data where genes or proteins interact in networks.
Q-Values and FDR-Adjusted P-Values
In genomics and other high-throughput fields, you’ll often see results reported as “q-values” rather than adjusted p-values. A q-value is simply an FDR-adjusted p-value. A q-value of 0.05 for a particular gene means that among all genes with q-values at or below that level, you’d expect about 5% to be false positives. The distinction matters: a regular p-value of 0.05 tells you about the error rate across all tests, while a q-value of 0.05 tells you about the error rate among just the significant results.
A Concrete Example
Imagine you’ve measured whether 10 genes are expressed differently between a disease group and a healthy group. You get the following raw p-values: 0.0001, 0.001, 0.006, 0.03, 0.095, 0.117, 0.234, 0.552, 0.751, 0.985. At the standard 0.05 cutoff, four genes look significantly different.
After Bonferroni correction (multiplying each p-value by 10 and capping the result at 1), the adjusted values become: 0.001, 0.01, 0.06, 0.30, 0.95, 1.0, 1.0, 1.0, 1.0, 1.0. Now only two genes remain significant. The third gene, with a raw p-value of 0.006, no longer passes. Without the correction, you would have reported two additional genes as meaningfully different when they may not have been.
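A quick numerical check of this example, applying the Bonferroni rule described above (multiply by the number of tests, cap at 1):

```python
# Raw p-values for the 10 genes in the example.
raw = [0.0001, 0.001, 0.006, 0.03, 0.095, 0.117, 0.234, 0.552, 0.751, 0.985]

# Bonferroni: multiply each p-value by the number of tests (10), cap at 1.
adjusted = [min(1.0, p * len(raw)) for p in raw]

uncorrected_hits = sum(p < 0.05 for p in raw)       # 4 genes before adjustment
corrected_hits = sum(p < 0.05 for p in adjusted)    # 2 genes after adjustment
print(uncorrected_hits, corrected_hits)
```

The gene with raw p = 0.117 illustrates the cap: 0.117 × 10 = 1.17, which is truncated to 1.0 in the adjusted list.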
This example involves only 10 tests. In a modern genomics experiment testing 20,000 genes simultaneously, uncorrected p-values would produce roughly 1,000 false positives at the 0.05 level. Adjustment is not optional in that context.
Choosing the Right Method
The choice comes down to what kind of errors you can tolerate. FWER methods like Bonferroni are appropriate when false positives carry serious consequences: clinical trials, regulatory decisions, or any situation where acting on a wrong result is costly. FDR methods like Benjamini-Hochberg are better suited for exploratory research where you’re screening many variables and plan to follow up on the significant ones with further testing. You’d rather cast a wider net and accept a small fraction of false leads than miss genuinely important findings.
Most statistical software includes these methods as built-in options. In R, for example, the `p.adjust` function offers Bonferroni, Holm, Hochberg, Hommel, Benjamini-Hochberg, and Benjamini-Yekutieli corrections. Python’s `statsmodels` library provides similar functionality. Whichever method you use, the key is to report it. The American Statistical Association emphasizes that proper inference requires full transparency, and that scientific conclusions should not rest on whether a p-value (adjusted or otherwise) crosses any single threshold. Effect sizes, confidence intervals, and the broader context of the study all matter alongside the adjusted p-value.

