The 0.05 significance level is not a law of mathematics or nature. It became the standard largely because of a computational shortcut from the 1920s that stuck around long after the original reason for it disappeared. Understanding where it came from, what it actually means, and why many statisticians now question it can help you interpret research results with much sharper judgment.
The 0.05 Threshold Started as a Convenience
The story traces back to Ronald Fisher, one of the founders of modern statistics. In his hugely influential 1925 book, Fisher noted that “the value for which P = .05, or 1 in 20, is 1.96 or nearly 2” standard deviations from the mean of a normal distribution. He suggested it was “convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.” The key word is convenient.
In the 1920s, there were no computers. Researchers calculated statistics by hand and looked up values in printed tables. The 0.05 cutoff mapped neatly onto a round number: roughly two standard deviations from the mean. That made it easy to remember and quick to apply. It also happened to correspond to three “probable errors,” an older measure of spread that was common in early statistics (one probable error is about two-thirds of a standard deviation, so three of them land almost exactly at the same cutoff). So a single rule of thumb worked on two different scales that researchers already used. Fisher himself wrote that “we shall not often be astray if we draw a conventional line at .05,” framing it as a useful guideline, not a sacred boundary.
Over the following decades, textbooks codified 0.05 as the default, and generations of students learned it as though it were a fixed rule. What began as one statistician’s practical suggestion for an era of hand calculations became the near-universal standard across science.
What the 0.05 Level Actually Means
When you set the significance level at 0.05, you’re deciding how much risk of a false alarm you’re willing to accept. Specifically, if the thing you’re testing for doesn’t actually exist (no real effect, no real difference), there’s a 5% chance your data will look convincing enough to make you incorrectly declare a discovery. Statisticians call this a Type I error, or a false positive.
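To make that concrete, here’s a short Python simulation (the study counts, sample sizes, and seed are illustrative choices, not anything from the research discussed here). It generates thousands of two-group comparisons in a world where no real difference exists, tests each one at the 0.05 level, and counts the false alarms:

```python
# A minimal sketch of the Type I error rate: test many datasets drawn
# from a world where the null hypothesis is true, and count how often
# we "discover" an effect that isn't there. All parameters are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_studies = 10_000   # hypothetical number of null studies
n_per_group = 30     # hypothetical sample size per group

false_positives = 0
for _ in range(n_studies):
    # Both groups come from the same distribution: no real effect exists.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of tests should cross the threshold by chance alone.
print(f"False positive rate: {false_positives / n_studies:.3f}")
```

Run it and the rate lands very close to 0.05. The threshold is doing exactly what it promises, no more and no less.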
Think of it like a courtroom. The default assumption (the null hypothesis) is “no effect,” similar to “innocent until proven guilty.” Setting your significance level at 0.05 means you accept a 1-in-20 chance of convicting an innocent defendant. The system is deliberately conservative because wrongly claiming something works, whether it’s a drug, an intervention, or a theory, can have serious consequences.
The math ties directly to the normal distribution. About 95% of values in a bell curve fall within roughly two standard deviations of the mean (1.96, to be precise). The remaining 5% sits in the extreme tails, split evenly between the two ends: 2.5% in each. When your test result lands in that extreme 5% zone, it passes the 0.05 threshold and gets labeled “statistically significant.” The result looks unusual enough that random chance alone seems like an unlikely explanation.
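You can verify Fisher’s “1.96 or nearly 2” arithmetic in a couple of lines (a sketch using scipy’s standard normal functions; nothing here comes from Fisher’s own tables):

```python
# Recovering the numbers behind the 0.05 convention with scipy.
from scipy import stats

# The cutoff that leaves 2.5% in each tail of a standard normal:
print(stats.norm.ppf(0.975))     # 1.959... -- Fisher's "1.96 or nearly 2"

# The total probability outside +/- 1.96 standard deviations:
print(2 * stats.norm.sf(1.96))   # ~0.05, i.e. about 1 in 20
```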
Why 0.05 Is Not the Only Option
Different fields use different thresholds depending on what’s at stake. Before statistical software existed, researchers typically chose between just two options: 0.05, which became standard in fields like exercise science and psychology, or 0.01, which was common in pharmaceutical research where the consequences of a false positive (approving an ineffective or harmful drug) are more severe.
Particle physics takes this to an extreme. When physicists at CERN announced the discovery of the Higgs boson, they required their data to reach “five sigma,” a threshold so stringent that random fluctuations alone would produce a signal that strong only about once in 3.5 million tries. The logic is straightforward: when you’re claiming to have found a new fundamental particle, and billions of dollars and the direction of an entire field hinge on being right, you need to be extraordinarily confident it’s not a statistical accident.
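The same normal-distribution arithmetic from earlier translates sigma levels into tail probabilities (a quick sketch; the loop and formatting are ours, and we assume the one-tailed convention physicists use for sigma levels):

```python
# Converting "sigma" thresholds to one-tailed tail probabilities.
from scipy import stats

for sigma in [2, 3, 4, 5]:
    p = stats.norm.sf(sigma)  # probability of a fluctuation above sigma
    print(f"{sigma} sigma: p = {p:.1e}  (about 1 in {1 / p:,.0f})")
```

Five sigma works out to about 1 in 3.5 million, which is where that figure comes from.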
At the other end of the spectrum, some exploratory research uses 0.10 as a threshold when the goal is simply to flag promising leads for further investigation, not to make definitive claims.
Why Many Statisticians Want to Retire 0.05
A growing number of researchers argue that the 0.05 threshold is too lenient and contributes to a reproducibility crisis across science. A large group of prominent statisticians has formally proposed changing the default threshold from 0.05 to 0.005 for claims of new discoveries. Their core argument: statistical standards of evidence in many fields are simply too low, and this is a leading cause of findings that fail to replicate when other labs try to confirm them.
The American Statistical Association took the unusual step in 2016 of releasing a formal statement on p-values and statistical significance, the first of its kind in the organization’s 177-year history. Among its six principles, the statement made several points that directly challenge how most people use the 0.05 cutoff:
- A p-value does not measure the probability that your hypothesis is true. A result of p = 0.03 does not mean there’s a 97% chance the effect is real. It means data at least this extreme would show up only 3% of the time if nothing were actually going on.
- Crossing the 0.05 line says nothing about how large or important an effect is. A tiny, practically meaningless difference can be statistically significant if the study is large enough, as the sketch after this list shows.
- Scientific conclusions should not be based only on whether a p-value passes a specific threshold. The statement called the 0.05 cutoff “conventional and arbitrary.”
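Here’s that second point in code (a sketch; the 0.03-standard-deviation “effect” and the sample sizes are invented purely to illustrate the mechanism):

```python
# How sample size, not practical importance, drives statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
tiny_effect = 0.03  # 3% of a standard deviation: practically negligible

for n in [100, 10_000, 1_000_000]:
    a = rng.normal(0.0, 1.0, size=n)          # control group
    b = rng.normal(tiny_effect, 1.0, size=n)  # "treated" group
    result = stats.ttest_ind(a, b)
    print(f"n = {n:>9,} per group: p = {result.pvalue:.4f}")

# At n = 100 the difference is invisible; by n = 1,000,000 it is
# overwhelmingly "significant" -- yet it never stopped being trivial.
```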
The Real Problem With a Hard Cutoff
One of the most damaging effects of treating 0.05 as a bright line is the all-or-nothing thinking it encourages. A study with p = 0.049 gets published as a “significant finding.” A study with p = 0.051, testing the same question with nearly identical results, gets filed away as a failure. The difference between those two numbers is trivially small, yet the conclusions drawn from them can be completely opposite.
This creates perverse incentives. Researchers may run extra analyses, tweak their methods, or selectively report outcomes until they find a p-value that squeaks below 0.05. This practice, sometimes called p-hacking, inflates the published literature with findings that look significant on paper but don’t hold up under scrutiny.
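A small simulation shows how fast this inflates the error rate (a sketch: the setup of ten unrelated outcome measures per study is invented for illustration, not drawn from any real field):

```python
# How cherry-picking the best of several analyses inflates false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_studies, n_outcomes, n_per_group = 5_000, 10, 30

hacked_hits = 0
for _ in range(n_studies):
    # Ten outcome measures, none with any real group difference.
    p_values = []
    for _ in range(n_outcomes):
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    # "Report" only the best-looking result.
    if min(p_values) < 0.05:
        hacked_hits += 1

print(f"False positive rate with cherry-picking: {hacked_hits / n_studies:.2f}")
```

Each individual test is honest at 5%, but taking the minimum of ten pushes the effective false positive rate to roughly 40% (1 − 0.95¹⁰ ≈ 0.40).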
A 1-in-20 false positive rate also compounds quickly. If a field runs thousands of studies per year, 5% of them will produce “significant” results by pure chance even when there’s nothing to find. Without strong replication practices, those false positives accumulate and get cited as established facts.
How to Think About 0.05 in Practice
When you encounter a study that reports “statistically significant” results at the 0.05 level, it helps to keep a few things in perspective. The 0.05 threshold tells you the result was unlikely enough under chance to be worth paying attention to, but it doesn’t tell you the effect is large, important, or guaranteed to replicate. It’s a minimum filter, not a stamp of truth.
The size of the effect matters just as much as the p-value. A blood pressure drug that lowers readings by 1 point with p = 0.04 has cleared the statistical bar, but that reduction is too small to matter clinically. Meanwhile, a study showing a 15-point reduction with p = 0.06 just barely misses the cutoff despite a potentially meaningful effect. Context, effect size, study design, and replication are all part of evaluating evidence.
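You can reproduce that contrast from summary statistics alone (a sketch: the means, the shared standard deviation of 15 points, and the group sizes are hypothetical numbers chosen to land near those p-values, not data from any real trial):

```python
# Two hypothetical blood pressure trials, computed from summary statistics.
from scipy import stats

# Big trial: trivial 1-point drop, 1,900 patients per arm.
big = stats.ttest_ind_from_stats(mean1=0, std1=15, nobs1=1900,
                                 mean2=-1, std2=15, nobs2=1900)

# Small trial: meaningful 15-point drop, only 8 patients per arm.
small = stats.ttest_ind_from_stats(mean1=0, std1=15, nobs1=8,
                                   mean2=-15, std2=15, nobs2=8)

print(f"1-point drop, n = 1,900/arm: p = {big.pvalue:.3f}")    # ~0.04
print(f"15-point drop, n = 8/arm:    p = {small.pvalue:.3f}")  # just above 0.05
```

The large trial clears the bar with an effect no physician would care about; the small one misses it with an effect they very much would.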
The 0.05 level persists because it’s deeply embedded in journal standards, regulatory frameworks, and statistical training. It was never meant to be a universal rule. Fisher himself used different thresholds for different situations and viewed significance testing as a flexible tool for interpreting data, not a mechanical yes-or-no gate. The number stuck because institutions needed a default, and a convenient shortcut from a pre-computer era became one of the most influential conventions in modern science.

