A p-value greater than 0.05 means your data did not provide strong enough evidence, by the conventional standard, to reject the hypothesis that there is no real effect. In formal terms, you “fail to reject the null hypothesis.” But failing to reject is not the same as proving nothing happened, and the distinction matters more than most statistics courses let on.
What “Fail to Reject” Actually Means
When you run a statistical test, you start with a null hypothesis, which typically states there is no difference or no effect. A p-value of 0.05 or less is conventionally taken as enough evidence to reject that null hypothesis. When the p-value lands above 0.05, the formal conclusion is: “The null hypothesis cannot be rejected at the 0.05 significance level.”
Notice the careful wording. You don’t “accept” the null hypothesis. You don’t conclude that there is no effect. You simply didn’t find sufficient evidence against it. This is a critical distinction that trips up students and working scientists alike. A non-significant result tells you one thing: the data you collected are reasonably compatible with a world where the null hypothesis is true. But they may also be compatible with a world where a real effect exists and you just didn’t detect it.
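To make the distinction concrete, here is a minimal sketch using NumPy and SciPy (the numbers are invented for illustration): a simulated study in which a real effect exists by construction, tested with a two-sample t-test. Depending on the draw, the p-value can easily land above 0.05, and in that case the only defensible conclusion is the “fail to reject” wording above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=50.0, scale=10.0, size=25)   # placebo group scores
treated = rng.normal(loc=53.0, scale=10.0, size=25)   # a real +3 point effect, by construction

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

if p_value <= 0.05:
    print("Reject the null hypothesis at the 0.05 level.")
else:
    # Correct reading: the data are compatible with no effect, but a real effect
    # (here, a true +3 points) may simply have gone undetected.
    print("Fail to reject the null hypothesis at the 0.05 level.")
```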
The astronomer Carl Sagan put it simply: “Absence of evidence is not evidence of absence.” A p-value of 0.07 or 0.12 or 0.30 does not prove the effect is zero. It means your particular study, with its particular sample size and design, didn’t catch it.
Why 0.05 Is the Threshold (and Why It’s Arbitrary)
The 0.05 cutoff dates back to Ronald Fisher in the 1920s. He noted that a value falling more than about two standard deviations from the mean of a normal distribution corresponds to a probability of roughly 0.05, or 1 in 20. At a time before computers, this was a convenient round number tied to easy mental math. Fisher himself wrote that “no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses.” His own statistical tables included multiple significance columns, not just one, because he never intended 0.05 to be a universal gate.
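If you want to verify Fisher’s shortcut yourself, a two-line check with SciPy (my own sketch, not part of Fisher’s tables) reproduces the arithmetic: the area beyond about two standard deviations on a normal distribution is roughly 1 in 20.

```python
from scipy import stats

two_sided_tail = 2 * stats.norm.sf(1.96)               # area beyond +/-1.96 SD
print(f"P(|Z| > 1.96) = {two_sided_tail:.4f}")          # ~0.05, i.e. about 1 in 20

# The exact z cutoff that gives a two-sided tail of 0.05:
print(f"z for alpha = 0.05: {stats.norm.ppf(0.975):.3f}")   # ~1.960
```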
Despite that, 0.05 hardened into a bright line that separates “significant” from “not significant” in most fields. Some researchers have pushed back. In 2016, the American Statistical Association released its first-ever formal statement warning against misuse of p-values and significance thresholds. A 2018 proposal from a group of 72 researchers suggested lowering the default threshold to 0.005. Others, writing in Nature, argued for abandoning fixed thresholds entirely. One psychology journal banned p-values altogether in 2015. The debate is ongoing, but the consensus among statisticians is clear: treating 0.05 as a magic number oversimplifies the science.
Three Reasons a Real Effect Can Produce a High P-Value
A p-value above 0.05 doesn’t always mean the effect isn’t there. Several common scenarios can mask a genuine result.
Small sample size. Statistical power is the probability that your test will detect a real effect when one exists. Power depends heavily on how many participants or observations you have. A study with 30 people testing a subtle treatment effect might easily return p = 0.15, while the same design with 300 people might well return p = 0.002. When sample sizes are small, the risk of a Type II error (missing a real effect) goes up substantially. If your study was underpowered, a non-significant result is genuinely ambiguous.
Small effect size. Some real effects are small. A therapy that improves symptoms by 5% is harder to detect statistically than one that improves them by 40%, even if both are real. Even a treatment with a moderate effect size (say, a Cohen’s d of 0.7) can produce a p-value well above 0.05 when paired with a small sample; a genuinely small effect is harder still to catch.
High variability. If the measurements in your data are noisy, with people responding very differently to the same intervention, the signal gets buried in the noise. High variability inflates the standard error and pushes the p-value upward, even when the average difference between groups is meaningful.
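To see how strongly sample size drives this, here is a small simulation sketch (the effect size, group sizes, and seed are illustrative assumptions, not from any particular study): the same modest real effect is detected far more often at 300 per group than at 30 per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3          # a modest but real effect (Cohen's d)
n_sims = 5000

for n_per_group in (30, 300):
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue <= 0.05:
            hits += 1
    # The detection rate approximates statistical power for this design.
    print(f"n = {n_per_group:3d} per group: significant in {hits / n_sims:.0%} of simulated studies")
```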
Statistical Significance vs. Practical Significance
The relationship between p-values and real-world importance runs in both directions, and neither direction is straightforward. A statistically significant result can be practically meaningless: in a clinical trial with 10,000 participants, an extra 0.5 kg of weight loss in the treatment group compared with placebo might produce a p-value below 0.05, but half a kilogram is not a clinically meaningful change for most patients. The massive sample size made a tiny difference detectable.
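A back-of-the-envelope calculation shows how the sample size does the work here; the standard deviation and group sizes below are assumptions chosen for illustration, not figures from any real trial.

```python
import math
from scipy import stats

diff_kg = 0.5          # mean difference in weight loss between groups
sd_kg = 8.0            # assumed within-group standard deviation
n_per_group = 5000     # 10,000 participants total

se = sd_kg * math.sqrt(2.0 / n_per_group)
t = diff_kg / se
p = 2 * stats.t.sf(abs(t), df=2 * n_per_group - 2)
print(f"t = {t:.2f}, p = {p:.4f}")   # comfortably below 0.05, yet 0.5 kg is clinically trivial
```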
Conversely, a non-significant result can hide something that matters. A cancer treatment that extends life by an average of three months could fail to reach p < 0.05 in a small trial, yet those three months represent a clinically important difference for patients and their families. This is why major medical journals now insist that p-values should never be reported alone. The Journal of the American Medical Association’s guidelines state that results should always include the actual comparison (the size of the difference), a confidence interval, and the p-value together. Reporting just “p = 0.08” or “not significant” strips away the context a reader needs.
What Confidence Intervals Tell You That P-Values Don’t
When a result is non-significant, one of the most useful things you can do is look at the confidence interval. A 95% confidence interval gives you a range of plausible values for the true effect. Two studies can both produce non-significant results but tell very different stories.
Imagine you’re testing whether a new drug lowers blood pressure more than a placebo, and the difference between groups is measured in mmHg. If the 95% confidence interval for the difference is (-0.2 to +0.1), the data suggest the true effect, if any, is tiny. You can feel reasonably confident that there’s no large hidden benefit. But if the confidence interval is (-10 to +20), the result is also non-significant (because the interval includes zero), yet the true effect could be anywhere from a 10-point advantage for the placebo to a 20-point advantage for the drug. That’s an enormous range of uncertainty, and concluding “no effect” would be reckless.
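Here is a short sketch of that comparison in code (the groups and numbers are invented): the helper computes a pooled-variance confidence interval for a difference in means, which is what distinguishes “the effect, if any, is tiny” from “we simply can’t tell.”

```python
import numpy as np
from scipy import stats

def mean_diff_ci(a, b, level=0.95):
    """Confidence interval for the difference in means (b - a), pooled-variance t interval."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    diff = b.mean() - a.mean()
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1 / na + 1 / nb))
    t_crit = stats.t.ppf(0.5 + level / 2, df=na + nb - 2)
    return diff - t_crit * se, diff + t_crit * se

rng = np.random.default_rng(1)
placebo = rng.normal(0.0, 12.0, 20)
drug = rng.normal(-4.0, 12.0, 20)    # drug lowers blood pressure by ~4 mmHg on average
low, high = mean_diff_ci(placebo, drug)
print(f"95% CI for the difference: ({low:.1f}, {high:.1f}) mmHg")
# If the interval is wide and straddles zero, the honest reading is "inconclusive", not "no effect".
```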
This is why modern statistical guidelines favor confidence intervals over a simple significant/not-significant label. They preserve the uncertainty in your data rather than collapsing it into a binary answer.
How to Report a Non-Significant Result
If you’re writing up results with p-values above 0.05, a few conventions are widely accepted across major journals. Report the exact p-value, not just “NS” or “not significant.” The New England Journal of Medicine asks for p-values above 0.01 to be reported to two decimal places (p = 0.08, not p > 0.05). The Annals of Internal Medicine similarly requires actual values, with rounding to the nearest hundredth for values above 0.20.
Always pair the p-value with the effect size and a confidence interval. A statement like “the treatment group scored 4.2 points higher (95% CI: -1.1 to 9.5, p = 0.12)” gives readers something to interpret. A statement like “the difference was not significant (p = 0.12)” does not.
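A small helper along these lines (a sketch of the convention, not any journal’s official tooling) keeps the three pieces together and applies the kind of rounding described above.

```python
def report(effect, ci_low, ci_high, p, unit="points"):
    """Format an effect estimate, 95% CI, and exact p-value as one reportable string."""
    if p < 0.001:
        p_text = "p < 0.001"
    elif p >= 0.01:
        p_text = f"p = {p:.2f}"      # two decimal places for larger p-values
    else:
        p_text = f"p = {p:.3f}"
    return (f"difference of {effect:.1f} {unit} "
            f"(95% CI: {ci_low:.1f} to {ci_high:.1f}, {p_text})")

print(report(4.2, -1.1, 9.5, 0.12))
# -> "difference of 4.2 points (95% CI: -1.1 to 9.5, p = 0.12)"
```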
Avoid language that implies the null hypothesis has been confirmed. Phrases like “there was no effect” or “the treatment did not work” overstate what a non-significant p-value can tell you. More accurate phrasing: “We did not find statistically significant evidence of a difference at the 0.05 level.”
What Not to Do With a Non-Significant Result
One common but problematic response is to run a “post-hoc power analysis,” calculating the power your study had after the fact using the observed effect size. This sounds reasonable but is mathematically circular: post-hoc power is a direct function of the p-value itself, so it adds no new information. If your p-value was 0.08, the post-hoc power will be low, and that’s guaranteed by the math, not by anything useful about your study. Multiple reviews of this practice have concluded that confidence intervals are a better tool for interpreting non-significant findings.
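The circularity is easy to demonstrate. For a two-sided z-test, “observed power” can be computed from the p-value alone, which is exactly the problem: the sketch below (my own illustration, not from the reviews cited above) shows that p = 0.05 always maps to roughly 50% observed power, and larger p-values always map to less.

```python
from scipy import stats

def observed_power(p, alpha=0.05):
    """Post-hoc ("observed") power of a two-sided z-test, computed from the p-value alone."""
    z_obs = stats.norm.ppf(1 - p / 2)       # |z| implied by the p-value
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - z_obs) + stats.norm.sf(z_crit + z_obs)

for p in (0.049, 0.05, 0.08, 0.20):
    print(f"p = {p:.3f}  ->  observed power = {observed_power(p):.2f}")
# p = 0.05 always maps to ~0.50; larger p-values always map to lower "power".
```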
Another mistake is treating p = 0.049 and p = 0.051 as fundamentally different results. The difference between “just significant” and “just non-significant” is not itself statistically meaningful. A study that finds p = 0.049 has not proven an effect, and a study that finds p = 0.051 has not disproven one. The ASA’s statement is explicit: scientific conclusions should not be based solely on whether a p-value crosses a specific threshold.
The Bayesian Alternative
One limitation of p-values is that they only evaluate the data under the assumption that the null hypothesis is true. They never directly ask, “How likely is the null hypothesis compared to the alternative?” Bayesian statistics addresses this through something called a Bayes factor, which compares how well the data are predicted by the null versus the alternative hypothesis. Written with the null hypothesis on top (often labeled BF01), a Bayes factor of 1 means the data support both hypotheses equally, values above 1 favor the null, and values below 1 favor the alternative.
This matters for non-significant results because a Bayes factor can distinguish between “the evidence supports no effect” and “the evidence is inconclusive.” A traditional p-value of 0.35 could correspond to either scenario, and there’s no way to tell which one from the p-value alone. Bayesian methods remain less common in many fields, but they’re increasingly used when the goal is to quantify evidence for the null hypothesis rather than simply failing to reject it.
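As a rough illustration, the BIC approximation gives a quick (if crude) Bayes factor from two fitted models. The data, model formulas, and use of statsmodels below are assumptions made for the sketch, not a prescribed workflow; dedicated Bayesian tools give better-calibrated answers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], 40),
    "score": np.concatenate([rng.normal(0.0, 1.0, 40), rng.normal(0.1, 1.0, 40)]),
})

null_model = smf.ols("score ~ 1", data=df).fit()         # no group effect allowed
alt_model = smf.ols("score ~ C(group)", data=df).fit()   # group effect allowed

# BIC approximation: BF01 ~= exp((BIC_alt - BIC_null) / 2); > 1 favors the null, < 1 the alternative.
bf01 = np.exp((alt_model.bic - null_model.bic) / 2)
print(f"BF01 ~= {bf01:.2f}")
```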

