The standard threshold for statistical significance is 5%, meaning that results at least as extreme as the ones observed would occur less than 5% of the time if chance alone were at work. This 5% cutoff, written as a p-value of 0.05, is the most widely used convention in research. But it’s not a universal law, and understanding what it actually means (and what it doesn’t) matters more than memorizing the number.
Why 5% Is the Standard Cutoff
When researchers test whether a treatment works or whether two groups differ, they start by assuming there’s no real effect. This assumption is called the null hypothesis. The p-value then measures how likely it would be to see results at least as extreme as the ones observed if that assumption were true.
A p-value of 0.05 means there’s a 5% chance of seeing results at least this extreme even if nothing real were happening. Setting your threshold at 5% means you’re accepting a 5-in-100 risk of declaring something significant when it isn’t. Flip that around, and the test is calibrated so that, when there truly is no effect, it avoids a false alarm 95% of the time; it does not mean you’re 95% sure your conclusion is correct.
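To make the mechanics concrete, here is a minimal Python sketch. Every number in it is invented: it simulates a treatment group and a control group, runs a standard two-sample t-test against the null hypothesis that the groups share the same mean, and checks the resulting p-value against the 5% cutoff.

```python
# A minimal sketch: simulate two groups (all numbers here are invented)
# and run a standard two-sample t-test against the null hypothesis that
# both groups share the same underlying mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
treatment = rng.normal(loc=6.5, scale=2.0, size=30)  # simulated outcome scores
control = rng.normal(loc=5.0, scale=2.0, size=30)

t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"p-value: {p_value:.3f}")
print("Significant at 5%" if p_value < 0.05 else "Not significant at 5%")
```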
This convention traces back to the statistician R.A. Fisher in the early 20th century. There is no mathematical proof that 0.05 is the “right” number. It caught on through practice and became the default in most scientific fields. Researchers are free to set stricter or looser thresholds depending on the stakes involved.
Other Common Significance Levels
While 5% dominates, two other thresholds appear regularly:
- 1% (p = 0.01): Used when the consequences of a false positive are serious, such as in confirmatory drug trials. You’re allowing only a 1-in-100 chance of a false alarm.
- 10% (p = 0.10): Sometimes used in exploratory research or social sciences when the goal is to identify trends worth investigating further, not to make definitive claims.
In fields like genetics, where hundreds of thousands of comparisons may be made simultaneously, thresholds can be far stricter. Genome-wide association studies typically require p-values below 0.00000005 (5 × 10⁻⁸) to account for the sheer number of tests being run.
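One common way to arrive at a stricter cutoff is a Bonferroni-style correction, which splits the overall 5% error budget across all the comparisons. A rough sketch, with purely illustrative test counts:

```python
# A rough sketch of why many comparisons demand stricter cutoffs: a
# Bonferroni correction divides the overall 5% error budget across all
# tests. The test counts below are purely illustrative.
alpha = 0.05

for n_tests in (1, 20, 1_000, 1_000_000):
    print(f"{n_tests:>9,} tests -> per-test threshold {alpha / n_tests:.1e}")

# With about a million independent comparisons, the per-test cutoff lands
# near 5e-8, which matches the conventional genome-wide threshold.
```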
What a P-Value Actually Tells You
The American Statistical Association issued a formal statement in 2016 clarifying what p-values do and don’t measure, because misinterpretation had become so widespread. The key points are worth knowing if you’re reading any research.
A p-value does not tell you the probability that a hypothesis is true. A result with p = 0.03 does not mean there’s a 97% chance the treatment works. It means that if the treatment had zero effect, you’d see data this extreme only 3% of the time. That’s a subtle but important distinction.
P-values also say nothing about the size of an effect. A tiny, meaningless difference can produce a very small p-value if the study includes enough participants. Conversely, a large and important difference can fail to reach significance if the study was too small. The p-value tells you whether something probably isn’t zero. It doesn’t tell you whether it’s big enough to matter.
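A quick simulation makes that last point visible. The numbers below are made up: two groups differ by a trivial 0.15 points on a scale where the typical spread is 15 points, yet with hundreds of thousands of participants per group the p-value usually comes out far below 0.05.

```python
# Illustrative simulation: a trivially small difference (0.15 points on a
# scale with a standard deviation of 15) becomes "significant" once the
# sample is huge. All values below are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 500_000  # participants per group
group_a = rng.normal(loc=100.00, scale=15.0, size=n)
group_b = rng.normal(loc=100.15, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"Mean difference: {group_b.mean() - group_a.mean():.2f} points")  # tiny
print(f"p-value: {p_value:.1e}")  # usually far below 0.05 despite the tiny effect
```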
How Confidence Intervals Relate to the 5% Threshold
A 95% confidence interval is the flip side of a p-value of 0.05. If you’re testing whether a treatment differs from a placebo, the confidence interval gives you a range of plausible values for the true effect. When that range doesn’t include zero (or whatever value represents “no difference”), the result is statistically significant at the 0.05 level.
Confidence intervals are often more useful than p-values alone because they show you the likely size of the effect, not just whether it exists. A drug that lowers blood pressure by somewhere between 1 and 15 points tells you a very different story than one that lowers it by somewhere between 8 and 10 points, even if both are statistically significant.
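Here is a minimal sketch of that relationship, using made-up blood-pressure numbers: it computes an approximate 95% confidence interval for the difference between two group means, and the closing comment notes how that interval lines up with significance at the 0.05 level.

```python
# A minimal sketch with made-up blood-pressure data: compute an approximate
# 95% confidence interval for the difference between two group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
drug = rng.normal(loc=-9.0, scale=6.0, size=40)     # change in blood pressure
placebo = rng.normal(loc=-1.0, scale=6.0, size=40)

diff = drug.mean() - placebo.mean()
se = np.sqrt(drug.var(ddof=1) / len(drug) + placebo.var(ddof=1) / len(placebo))
t_crit = stats.t.ppf(0.975, df=len(drug) + len(placebo) - 2)  # two-sided 95% cutoff

lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"Difference: {diff:.1f} points, 95% CI ({lower:.1f}, {upper:.1f})")
# If the interval excludes zero, the corresponding test gives p < 0.05.
```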
Statistical Significance vs. Practical Significance
A result can be statistically significant without being meaningful in real life. If a new diet produces a statistically significant weight loss of 0.3 pounds over six months, that’s real in a mathematical sense but useless in practice. This distinction between statistical significance and practical (or clinical) significance trips up a lot of readers.
Practical significance asks whether the size of the effect actually improves someone’s life, changes a decision, or justifies the cost. In medicine, a clinically significant result is one where patients function better, feel better, or live longer in ways that outweigh the downsides of treatment. Researchers measure this using something called effect size, which quantifies how large the difference between groups actually is rather than just confirming it’s not zero.
Common benchmarks for effect size: a small effect means the difference between groups is noticeable only with careful measurement, a medium effect is large enough to be apparent in everyday observation, and a large effect is obvious and hard to miss. As a rough guide, when comparing group averages, a standardized difference below 0.2 is trivial, 0.2 to 0.5 is small, 0.5 to 0.8 is medium, and above 0.8 is large.
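The standardized difference described above is usually called Cohen’s d: the gap between two group means divided by their pooled standard deviation. A minimal sketch with invented group data:

```python
# A minimal sketch of the standardized difference (Cohen's d): the gap
# between two group means divided by their pooled standard deviation.
# The group data below are invented.
import numpy as np

def cohens_d(a, b):
    pooled_var = (((len(a) - 1) * np.var(a, ddof=1) +
                   (len(b) - 1) * np.var(b, ddof=1)) / (len(a) + len(b) - 2))
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(seed=7)
treated = rng.normal(loc=52.0, scale=10.0, size=50)
control = rng.normal(loc=47.0, scale=10.0, size=50)

print(f"Cohen's d: {cohens_d(treated, control):.2f}")  # around 0.5, a medium effect
```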
Why a Result Above 5% Isn’t Automatically Meaningless
Treating 0.05 as a hard line creates a false binary. A study with p = 0.049 gets celebrated as significant while one with p = 0.051 gets dismissed, even though the two results are nearly identical. Fisher himself never intended the threshold to work this way. He viewed the p-value as a continuous measure of evidence, where smaller values provide stronger evidence against the null hypothesis but no single cutoff separates “real” from “not real.”
A p-value of 0.06 or 0.08 still represents reasonably strong evidence that something is going on, especially if the study was small or the effect size is meaningful. Context matters: the study design, sample size, whether the findings align with other research, and whether the results have been replicated all factor into how seriously you should take a finding.
Statistical Power and the Other Kind of Error
The 5% threshold protects against one type of mistake: concluding something is real when it isn’t (a false positive). But there’s a second type of mistake: missing a real effect because your study wasn’t large or sensitive enough to detect it (a false negative).
Statistical power measures your ability to avoid that second error. The standard convention is 80% power, meaning an 80% chance of detecting a real effect if one exists. This is the benchmark most funding agencies, including the NIH, expect researchers to meet. A study with 80% power still has a 20% chance of missing a genuine effect, which is why replication across multiple studies matters so much.
In practice, many published studies are underpowered, meaning they had too few participants to reliably detect the effects they were looking for. This is one reason that findings sometimes fail to replicate: the original study may have gotten lucky with a small sample, producing a significant p-value that doesn’t hold up when tested again.
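To see how the pieces fit together, here is a sketch of a power calculation, assuming the statsmodels library; the effect size and sample sizes are illustrative. It asks how many participants per group are needed to detect a medium effect with 80% power at the 5% threshold, and then how much power a much smaller study actually has.

```python
# A sketch of a power calculation, assuming the statsmodels library is
# available. The effect size and sample sizes are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group to detect a medium effect (d = 0.5)
# with 80% power at the 5% threshold?
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Needed per group: {n_per_group:.0f}")  # roughly 64

# And how much power does a study with only 20 per group actually have?
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20)
print(f"Power with 20 per group: {power:.2f}")  # roughly 0.3
```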
How to Read Significance Claims
When you encounter a claim that something is “statistically significant,” ask three questions. First, what was the threshold? Most studies use 5%, but not all, and the choice matters. Second, how large was the effect? A significant p-value with a tiny effect size is less impressive than it sounds. Third, how big was the study? Large studies can make trivially small differences reach significance, while small studies may miss important ones.
The 5% threshold is a useful convention, not a magic number. It gives researchers a shared standard for evaluating evidence, but no single percentage can tell you whether a finding is true, important, or relevant to your life. The best evidence comes from looking at the full picture: the p-value, the effect size, the confidence interval, and whether other studies have found the same thing.

