You reject the null hypothesis when your p-value is less than your chosen significance level, commonly called alpha. Most fields set alpha at 0.05, meaning you reject the null hypothesis when the p-value falls below 0.05. That simple rule, though, hides important nuances about what the threshold should be, when to adjust it, and why a low p-value alone doesn’t tell the full story.
The Basic Decision Rule
Before running a statistical test, you pick a significance level (alpha) that represents the maximum false-positive risk you’re willing to accept. Alpha of 0.05 means you’re comfortable with a 5% chance of concluding there’s an effect when there actually isn’t. After running the test, you compare the resulting p-value to that threshold. If the p-value is smaller than alpha, you reject the null hypothesis and call the result statistically significant. If the p-value is equal to or larger than alpha, you fail to reject the null.
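The comparison itself is a one-line rule. A minimal sketch in Python, with placeholder p-values:

```python
# The basic decision rule: compare a p-value to the chosen alpha.
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value and alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.03))  # p below alpha: reject the null hypothesis
print(decide(0.08))  # p at or above alpha: fail to reject the null hypothesis
```

Note that a p-value exactly equal to alpha falls on the "fail to reject" side, matching the rule as stated above.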
That comparison is the entire mechanical rule. Everything else in this article is about choosing the right alpha, understanding what rejection actually means, and avoiding the mistakes that come from treating p-values as the final word.
Choosing the Right Alpha
Alpha of 0.05 is a convention, not a law. The right threshold depends on the consequences of being wrong. A false positive (called a Type I error) means you conclude something is real when it’s not. If that error would be costly or dangerous, like approving an ineffective drug or convicting an innocent person, you should use a stricter threshold such as 0.01 or 0.001. If the cost of missing a real effect (a Type II error) is more concerning than a false alarm, a more lenient alpha like 0.10 may be appropriate.
Some fields have adopted their own standards. Genome-wide association studies in genetics use an alpha of 5 × 10⁻⁸, roughly one in twenty million, because these studies test millions of genetic variants at once and need to guard against a flood of false positives. Particle physics famously requires a “five-sigma” result, corresponding to a p-value of about 3 × 10⁻⁷, before claiming a discovery. Social science research typically sticks with 0.05, though there’s growing pressure to tighten that standard.
Why Multiple Tests Change the Threshold
Every time you test a true null hypothesis at alpha = 0.05, you have a 5% chance of a false positive. Run 20 such tests and you'd expect about one false positive even though nothing real is going on. This is the multiple comparisons problem, and it means your effective threshold needs to shrink as the number of tests grows.
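The arithmetic is worth seeing directly. A short sketch of the chance of at least one false positive across m independent tests, assuming every null hypothesis is true:

```python
# Probability of at least one false positive across m independent tests,
# each run at alpha = 0.05, when every null hypothesis is true.
alpha = 0.05
for m in (1, 5, 20, 100):
    familywise = 1 - (1 - alpha) ** m
    print(f"{m:3d} tests: P(at least one false positive) = {familywise:.2f}")
```

At 20 tests the probability of at least one false positive is already about 64%, and at 100 tests it is nearly certain.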
The simplest fix is the Bonferroni correction: divide your alpha by the number of tests you’re running. If you’re testing six hypotheses at once, your new threshold becomes 0.05 / 6, or roughly 0.0083. Only p-values below that adjusted cutoff count as significant. This method is conservative, meaning it’s good at preventing false positives but can miss real effects.
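A minimal sketch of the Bonferroni correction, using six made-up p-values:

```python
# Bonferroni correction: divide alpha by the number of tests.
# The p-values below are hypothetical, for illustration only.
alpha = 0.05
p_values = [0.001, 0.008, 0.012, 0.034, 0.046, 0.21]  # six hypothetical tests

adjusted_alpha = alpha / len(p_values)  # 0.05 / 6, roughly 0.0083
significant = [p for p in p_values if p < adjusted_alpha]
print(f"adjusted alpha: {adjusted_alpha:.4f}")
print(f"significant: {significant}")  # only 0.001 and 0.008 survive
```

Four of the six p-values would have cleared the unadjusted 0.05 cutoff; after correction, only two do, which is exactly the conservatism described above.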
Less strict alternatives exist. The Holm method tests hypotheses in order from smallest to largest p-value, with a slightly different threshold for each step, giving you more power to detect real effects while still controlling the overall error rate. The Benjamini-Hochberg method takes a different approach entirely, controlling the expected proportion of false positives among your significant results rather than guarding against any false positive at all. This makes it popular in fields like genomics where you expect many true effects among thousands of tests.
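The Benjamini-Hochberg procedure is simple enough to sketch directly: rank the p-values, find the largest one that falls under its rank-scaled threshold, and declare it and everything below it significant. This is an illustrative implementation with hypothetical p-values:

```python
# A sketch of the Benjamini-Hochberg procedure, controlling the false
# discovery rate at level q. The input p-values are hypothetical.
def benjamini_hochberg(p_values, q=0.05):
    """Return the p-values declared significant under BH at level q."""
    m = len(p_values)
    cutoff = 0.0
    # Find the largest ranked p-value p_(k) with p_(k) <= (k / m) * q.
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= (k / m) * q:
            cutoff = p
    return [p for p in p_values if p <= cutoff]

print(benjamini_hochberg([0.001, 0.008, 0.012, 0.034, 0.046, 0.21]))
# → [0.001, 0.008, 0.012]
```

On this list, BH flags three results where a Bonferroni cutoff of 0.05 / 6 would flag only two, illustrating the extra power that comes from controlling the false discovery rate instead of the chance of any false positive.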
What “Failing to Reject” Actually Means
When the p-value exceeds your alpha, the correct conclusion is "we fail to reject the null hypothesis." That phrasing matters. It does not mean the null hypothesis is true. It means you didn't find strong enough evidence to rule it out. The difference is like a jury returning "not guilty" rather than "innocent." The data are consistent with the null hypothesis, but they might also be consistent with a small real effect that your study lacked the statistical power to detect.
This is where Type II errors come in. A Type II error happens when you fail to reject the null hypothesis even though it’s actually false. Studies with small sample sizes are especially prone to this, missing real effects simply because they didn’t collect enough data to distinguish a signal from noise. So a p-value of 0.08 in a small study doesn’t mean the effect doesn’t exist. It means the study couldn’t pin it down with enough certainty.
A Low P-Value Is Not the Whole Story
A p-value tells you how incompatible your data are with the null hypothesis. It tells you nothing about how large or important an effect is. A study with thousands of participants can produce a tiny p-value for a difference so small it has no practical relevance. A blood pressure drug that lowers systolic pressure by 0.5 mmHg might achieve p = 0.001 in a large enough trial, but that reduction is clinically meaningless.
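The sample-size effect can be seen numerically. This rough sketch uses a normal approximation for a two-sample comparison; the numbers (a 0.5-unit mean difference with a standard deviation of 5) are hypothetical:

```python
# Illustration: the same tiny effect is "not significant" in a small
# study and highly significant in a large one, while the effect size
# (Cohen's d) never changes. Numbers are hypothetical; two-sided
# p-values come from a normal approximation.
import math

def two_sample_p(diff, sd, n):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * math.sqrt(2 / n)        # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))

cohens_d = 0.5 / 5                    # 0.1: a trivially small effect
print(f"n =   100 per group: p = {two_sample_p(0.5, 5, 100):.3f}")
print(f"n = 10000 per group: p = {two_sample_p(0.5, 5, 10000):.1e}")
print(f"Cohen's d = {cohens_d} regardless of n")
```

With 100 participants per group the p-value is far above 0.05; with 10,000 it is astronomically small. The effect itself never moved.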
This is why effect size matters alongside statistical significance. Effect size quantifies the magnitude of the difference or relationship you’re studying. As statistician Jacob Cohen put it, the primary product of research should be measures of effect size, not p-values. The American Statistical Association made a similar point in a landmark 2016 statement: a p-value does not measure the size of an effect or the importance of a result. Both the p-value and the effect size are essential pieces of the picture.
The ASA statement also emphasized that scientific conclusions and policy decisions should not be based only on whether a p-value crosses a specific threshold. A result with p = 0.049 is not meaningfully different from one with p = 0.051, yet treating the 0.05 line as a cliff edge is common practice.
Common Misinterpretations to Avoid
The p-value is not the probability that the null hypothesis is true. If you get p = 0.03, it does not mean there’s a 3% chance the effect is due to random chance. What it actually means is: if the null hypothesis were true, there would be a 3% probability of seeing data at least as extreme as what you observed. That distinction sounds subtle but leads to very different reasoning.
A p-value also doesn’t tell you the probability that your results would replicate. Two studies can both get p = 0.04 and have very different underlying realities, one reflecting a genuine effect and the other a statistical fluke. Replication depends on effect size, sample size, and study design, not on the p-value of the original finding.
Complementary Tools for Better Decisions
Confidence intervals give you something a p-value can’t: a range of plausible values for the true effect. A 95% confidence interval that runs from 0.2 to 4.8 tells you the effect is likely positive but could be anywhere from trivially small to moderately large. That’s far more informative than simply knowing p < 0.05. If the interval is narrow and entirely above zero, you can be more confident the effect is both real and meaningful.
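A quick sketch of how such an interval is computed under a normal approximation; the point estimate and standard error below are hypothetical, chosen to reproduce an interval of roughly 0.2 to 4.8:

```python
# A 95% confidence interval for an effect estimate, using the normal
# approximation estimate +/- 1.96 * SE. The numbers are hypothetical.
estimate = 2.5       # hypothetical effect estimate
std_error = 1.17     # hypothetical standard error
z = 1.96             # critical value for a 95% interval

lower = estimate - z * std_error
upper = estimate + z * std_error
print(f"95% CI: ({lower:.1f}, {upper:.1f})")  # roughly (0.2, 4.8)
```

The width of the interval carries the information a bare p-value hides: a wide interval like this one says the effect is probably positive but leaves its practical size very uncertain.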
Bayes factors offer another alternative. Instead of asking “how surprising are these data if the null is true,” a Bayes factor compares how well two competing hypotheses explain the data. It can provide evidence in favor of the null hypothesis, something a p-value is structurally unable to do. Some researchers and journals have moved toward Bayesian methods for this reason, and at least one psychology journal (Basic and Applied Social Psychology) banned p-values entirely in 2015, recommending greater use of descriptive statistics, effect sizes, and graphical displays instead.
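A textbook toy example makes the contrast concrete: testing whether a coin is fair, comparing H0 (heads probability exactly 0.5) against H1 (heads probability uniform on 0 to 1). Under that uniform prior, the marginal likelihood of k heads in n flips works out to 1 / (n + 1). This is an illustrative sketch, not a general-purpose tool:

```python
# Toy Bayes factor for a coin: H0 says p = 0.5; H1 puts a uniform
# prior on p, under which the marginal likelihood of k heads in
# n flips is 1 / (n + 1). Textbook illustration only.
import math

def bayes_factor_01(k, n):
    """Evidence for H0 (fair coin) over H1 (uniform prior on p)."""
    likelihood_h0 = math.comb(n, k) * 0.5 ** n
    likelihood_h1 = 1 / (n + 1)
    return likelihood_h0 / likelihood_h1

print(f"50 heads in 100 flips: BF01 = {bayes_factor_01(50, 100):.1f}")
print(f"65 heads in 100 flips: BF01 = {bayes_factor_01(65, 100):.2f}")
```

A Bayes factor above 1 favors the null: 50 heads in 100 flips yields moderate positive evidence that the coin is fair, which is exactly the kind of statement a p-value cannot make.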
None of these tools replace careful thinking about study design, sample size, and what the results mean in context. The p-value remains useful as one piece of evidence, but the decision to reject or not should always factor in effect size, the precision of your estimates, and the real-world consequences of being wrong.