You reject the null hypothesis when your p-value is less than your chosen significance level, commonly set at 0.05. If your p-value comes in at, say, 0.03, it falls below that 0.05 threshold, and you reject the null hypothesis in favor of the alternative. If your p-value is 0.07, it doesn’t clear the bar, and you fail to reject the null. That’s the core decision rule, but applying it well requires understanding what’s actually happening behind that comparison.
The Basic Decision Rule
Before running any statistical test, you pick a significance level, called alpha (α). This is the cutoff you’ll use to judge your results. The most common alpha is 0.05, but researchers also use 0.01 or 0.10 depending on the context. Once your test produces a p-value, you compare it directly to alpha:
- P-value < alpha: Reject the null hypothesis. Your result is statistically significant at your chosen level.
- P-value ≥ alpha: Fail to reject the null hypothesis. Your result is not statistically significant at your chosen level.
So if you set alpha at 0.05 and your test returns a p-value of 0.02, you reject the null. If it returns 0.06, you don’t. The p-value itself represents the probability of observing results at least as extreme as yours if the null hypothesis were actually true. A small p-value means your data would be unlikely under the null hypothesis, which is why it counts as evidence against it.
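The comparison itself is mechanical, which makes it easy to express in code. Here is a minimal sketch of the decision rule; the function name and the example p-values are illustrative, not drawn from any particular library:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a p-value to a pre-chosen significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.02))  # 0.02 < 0.05 -> reject the null hypothesis
print(decide(0.06))  # 0.06 >= 0.05 -> fail to reject the null hypothesis
```

Note that everything interesting happens before this comparison: choosing alpha, designing the study, and computing the p-value. The rule itself is a single inequality.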
Why the Threshold Isn’t Always 0.05
The 0.05 level is a convention, not a law of nature. In some fields, the stakes demand a much stricter cutoff. Genome-wide association studies, for instance, use a threshold of 5 × 10⁻⁸ (0.00000005) because these studies test millions of genetic variants simultaneously. Without that extreme threshold, random noise would produce floods of false positives. The more comparisons you make, the more likely you are to stumble on a “significant” result by chance alone, so the threshold has to shrink to compensate.
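You can see why the threshold must shrink with a short calculation. If each of m independent tests is run at level alpha, the chance of at least one false positive across the family is 1 − (1 − alpha)^m, which grows quickly with m:

```python
def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests, each run at significance level alpha, when all nulls are true."""
    return 1 - (1 - alpha) ** m

print(familywise_error(0.05, 1))      # a single test: 0.05
print(familywise_error(0.05, 100))    # 100 tests: almost certain to see a "hit"
# Bonferroni-style correction: shrink the per-test threshold to alpha / m
print(familywise_error(0.05 / 100, 100))  # back near 0.05 for the whole family
```

This is the logic behind genome-wide thresholds: with millions of variants tested, only an extremely small per-test alpha keeps the family-wise false positive rate under control.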
In exploratory research or fields where missing a real effect is more costly than a false alarm, researchers sometimes use an alpha of 0.10. The key point is that you set alpha before looking at your data, not after. Choosing your threshold after seeing your p-value is a form of cherry-picking that undermines the entire framework.
What “Fail to Reject” Actually Means
You’ll notice the language is “fail to reject the null hypothesis,” not “accept the null hypothesis.” This distinction matters more than it might seem. A non-significant p-value doesn’t prove the null hypothesis is true. It simply means your data didn’t provide strong enough evidence to rule it out.
The reason is mathematical: the null hypothesis typically claims that some value (like the difference between two groups) is exactly zero. Proving something equals exactly zero would require perfect data with infinite precision. No real experiment can do that. So when your p-value is 0.15, you haven’t shown there’s no effect. You’ve shown that your study, with its particular sample size and measurement precision, couldn’t detect one. The effect might be real but too small for your study to pick up, or your sample might have been too small to generate the statistical power needed.
Treating a non-significant result as proof of no effect is one of the most common mistakes in research interpretation. A study that fails to reject the null might simply have been underpowered.
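The role of power can be made concrete with a rough calculation. The sketch below approximates the power of a two-sided one-sample z-test at alpha = 0.05 for a given standardized effect size; the specific effect size and sample sizes are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sample_z(d: float, n: int) -> float:
    """Approximate power of a two-sided one-sample z-test at alpha = 0.05,
    for standardized effect size d and sample size n."""
    cdf = NormalDist().cdf
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided critical value, ~1.96
    shift = d * sqrt(n)
    # probability the test statistic lands beyond either critical value
    return (1 - cdf(z_crit - shift)) + cdf(-z_crit - shift)

print(power_one_sample_z(0.2, 30))   # small effect, n = 30: badly underpowered
print(power_one_sample_z(0.2, 200))  # same effect, n = 200: reasonable power
```

With 30 observations, a standardized effect of 0.2 is detected well under half the time, so a p-value of 0.15 from such a study says almost nothing about whether the effect exists.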
Type I and Type II Errors
Every time you make a reject-or-don’t decision, two kinds of mistakes are possible. A Type I error is a false positive: you reject the null hypothesis when it’s actually true. Think of it as convicting an innocent person. When the null hypothesis is in fact true, the probability of making this error is exactly your alpha level. Setting alpha at 0.05 means you’re accepting a 5% chance of declaring a significant result when nothing real is going on.
A Type II error is a false negative: you fail to reject the null hypothesis when it’s actually false. This is like letting a guilty person walk free. The probability of a Type II error is called beta (β), and it depends on your sample size, the true size of the effect, and how much variability exists in your data. There’s a natural tension between the two. Lowering alpha to reduce false positives makes it harder to detect real effects, increasing your false negative rate. You’re always balancing these risks.
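Both error rates can be checked by simulation. The sketch below runs a two-sided z-test on simulated data, first with the null true (so every rejection is a Type I error) and then with a true effect present (so every non-rejection is a Type II error). The effect size, sample size, and seed are all arbitrary choices for illustration:

```python
import math
import random

random.seed(42)  # fixed seed so the simulation is reproducible

def z_test_rejects(true_mean: float, n: int = 25) -> bool:
    """Draw n observations from N(true_mean, 1) and run a two-sided
    z-test of 'mean = 0' at alpha = 0.05."""
    sample_mean = sum(random.gauss(true_mean, 1) for _ in range(n)) / n
    return abs(sample_mean * math.sqrt(n)) > 1.96

trials = 20_000
# Null is true: any rejection is a Type I error. Rate should be near 0.05.
type1 = sum(z_test_rejects(0.0) for _ in range(trials)) / trials
# Null is false (true mean 0.3): any failure to reject is a Type II error.
type2 = sum(not z_test_rejects(0.3) for _ in range(trials)) / trials
print(f"Type I rate:  {type1:.3f}")
print(f"Type II rate: {type2:.3f}")
```

Notice that the Type I rate is pinned near alpha by construction, while the Type II rate depends on the effect size and sample size; with this small effect and n = 25, the test misses the real effect most of the time.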
A Small P-Value Doesn’t Mean a Big Effect
One of the most important things to understand is that statistical significance and practical significance are not the same thing. A p-value tells you whether an effect likely exists. It tells you nothing about how large or meaningful that effect is.
Here’s why this matters: with a large enough sample, virtually any difference becomes statistically significant, even trivially small ones. If you enroll 10,000 people in a study, you can detect differences so tiny they have no real-world relevance. A drug might lower blood pressure by 0.5 mmHg with a p-value of 0.001, but that reduction is clinically meaningless. The p-value looks impressive, but the actual impact on a patient’s health is negligible.
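The blood-pressure scenario can be reproduced with a back-of-the-envelope z-test. The numbers here are hypothetical (a 0.5 mmHg difference against an assumed standard deviation of 10 mmHg), and the known-variance z-test is a simplification, but the pattern is the point: the same tiny effect goes from non-significant to "highly significant" purely by increasing n.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, assuming a known
    common standard deviation (an illustrative simplification)."""
    se = sd * sqrt(2 / n_per_group)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 0.5 mmHg effect, sd = 10 mmHg, two very different sample sizes:
print(two_sample_z_p(0.5, 10, n_per_group=100))     # not significant
print(two_sample_z_p(0.5, 10, n_per_group=10_000))  # p well below 0.001
```

Nothing about the effect changed between the two calls; only the sample size did. The p-value shrank because the standard error shrank, not because the effect got any more meaningful.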
This is why researchers are increasingly expected to report effect sizes alongside p-values. Effect size measures the magnitude of a difference, independent of sample size. A p-value tells you whether a finding is likely real. Effect size tells you whether it’s worth caring about. Both pieces of information are essential for interpreting results properly. As statistician Jacob Cohen put it, “The primary product of a research inquiry is one or more measures of effect size, not p-values.”
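One common effect size measure for a difference between two groups is Cohen's d, the difference in means divided by the pooled standard deviation. Applied to the hypothetical blood-pressure numbers above, it makes the triviality of the effect explicit regardless of sample size:

```python
from math import sqrt

def cohens_d(mean1: float, mean2: float,
             sd1: float, sd2: float, n1: int, n2: int) -> float:
    """Cohen's d: difference in means over the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)

# Hypothetical: a 0.5 mmHg drop (130.0 vs 129.5) with sd ~10 in each group.
d = cohens_d(130.0, 129.5, 10.0, 10.0, 10_000, 10_000)
print(d)  # 0.05, far below Cohen's conventional "small" benchmark of 0.2
```

Unlike the p-value, d stays at 0.05 whether each group has 20 people or 20,000, which is exactly why it answers the "worth caring about?" question that the p-value cannot.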
Confidence Intervals as a Cross-Check
Confidence intervals offer another way to reach the same decision. A 95% confidence interval contains all the values for the true effect that your data are compatible with at the 5% significance level. If that interval doesn’t include the null value (usually zero for a difference, or one for a ratio), you would reject the null hypothesis at α = 0.05. If the interval does include the null value, you wouldn’t.
Confidence intervals carry more information than a p-value alone because they show the range of plausible effect sizes. A confidence interval of 0.2 to 8.5 technically excludes zero and is “significant,” but it also tells you the true effect could be anywhere from barely noticeable to very large. That uncertainty is invisible if you only look at the p-value. A confidence interval of 3.1 to 3.9, by contrast, not only excludes zero but tells you the effect is fairly precisely estimated. Both are significant, but you’d have much more confidence in the practical interpretation of the second result.
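The cross-check is simple to express. Assuming an approximately normal estimator, a 95% interval is the estimate plus or minus 1.96 standard errors, and "significant at 0.05" is just "the null value falls outside the interval." The estimates and standard errors below are chosen to reproduce the two intervals from the text:

```python
def ci_95(estimate: float, se: float) -> tuple[float, float]:
    """95% confidence interval, assuming approximate normality."""
    margin = 1.96 * se
    return (estimate - margin, estimate + margin)

def significant_at_05(ci: tuple[float, float], null_value: float = 0.0) -> bool:
    """The interval-based version of the decision rule."""
    lo, hi = ci
    return not (lo <= null_value <= hi)

wide = ci_95(4.35, 2.117)    # roughly the 0.2 to 8.5 interval above
narrow = ci_95(3.5, 0.204)   # roughly the 3.1 to 3.9 interval above
print(wide, significant_at_05(wide))      # significant, but very imprecise
print(narrow, significant_at_05(narrow))  # significant and tightly estimated
```

Both intervals exclude zero, so both correspond to p < 0.05, but only the interval's width reveals how differently the two results should be interpreted.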
Putting It All Together
The mechanical rule is straightforward: compare your p-value to your pre-set alpha, and reject the null hypothesis if the p-value is smaller. But using this rule well means keeping several things in mind. Your alpha should be chosen before you analyze your data, and it should reflect the consequences of being wrong. A non-significant result is not evidence that the null hypothesis is true. A significant result doesn’t tell you the effect is large or important. And the p-value is most informative when paired with an effect size estimate and a confidence interval, which together tell you whether the effect is real, how big it probably is, and how precisely you’ve measured it.