A p-value tells you how likely you’d be to see your data (or something more extreme) if there were truly no effect or no difference. It’s a number between 0 and 1, and smaller values suggest your results aren’t easily explained by random chance alone. That much sounds straightforward; the details are where most people, including many researchers, get tripped up.
The Core Idea Behind a P-Value
Every p-value starts with an assumption called the null hypothesis. This is the default position that nothing interesting is happening: the drug doesn’t work, the two groups are the same, the variable has no effect. The p-value then asks a specific question: if the null hypothesis were true, how often would random sampling produce results at least as extreme as the ones you actually observed?
A p-value of 0.03, for example, means there’s about a 3% chance of seeing data this extreme (or more extreme) in a world where the null hypothesis is true. A p-value of 0.40 means you’d see results like yours 40% of the time by pure chance, which isn’t very surprising at all.
This framing matters because the p-value is not the probability that your hypothesis is right or wrong. It’s the probability of the data, given a specific assumption. That distinction is the single most important thing to understand, and it’s the one most people miss.
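To make the logic concrete, here’s a minimal simulation sketch. The coin-flip scenario and every number in it are our illustration, not drawn from any study: assume the null hypothesis is true (the coin is fair), then count how often chance alone produces a result at least as lopsided as the one observed.

```python
import numpy as np

# Hypothetical example: we observe 60 heads in 100 flips. Under the null
# hypothesis (a fair coin), how often does random flipping produce a
# result at least this lopsided, in either direction?
rng = np.random.default_rng(0)
n_flips, observed_heads = 100, 60
sims = rng.binomial(n_flips, 0.5, size=100_000)  # 100,000 fair-coin runs

# "At least as extreme" (two-sided): 60 or more heads, or 40 or fewer.
extreme = (sims >= observed_heads) | (sims <= n_flips - observed_heads)
print(f"simulated p-value: {extreme.mean():.3f}")  # comes out near 0.057
```

Counting simulated outcomes this way is exactly the question a p-value answers; formal tests simply compute the same tail probability from a formula instead of simulating it.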
Why 0.05 Is the Standard Cutoff
In most fields, a p-value below 0.05 is considered “statistically significant.” That threshold has a surprisingly practical origin. The statistician Ronald Fisher noted in the 1920s that a p-value of 0.05 corresponds to roughly two standard deviations from the mean of a normal distribution. At a time when researchers calculated everything by hand, this made for a clean, easy-to-use rule of thumb. Fisher himself wrote that “we shall not often be astray if we draw a conventional line at .05.”
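If you want to check that correspondence yourself, it takes two lines (assuming Python with SciPy installed):

```python
from scipy.stats import norm

# Probability of landing more than 1.96 standard deviations from the
# mean of a normal distribution, in either direction.
print(f"{2 * (1 - norm.cdf(1.96)):.4f}")  # prints 0.0500
```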
Even early statisticians acknowledged the number was arbitrary. L.H.C. Tippett wrote in 1931 that the 0.05 threshold was “quite arbitrary” but “in common use.” Fisher viewed p-values as one piece of evidence in a larger scientific process, not as a final verdict. He never intended a single number to carry the weight that modern research culture has placed on it.
More recently, a group of over 70 researchers proposed lowering the default threshold to 0.005 for new scientific discoveries, arguing this would reduce the number of false positives that slip through. That proposal, published in Nature Human Behaviour, hasn’t become the new standard, but it reflects a growing recognition that 0.05 alone isn’t a reliable gatekeeper.
What a P-Value Does Not Tell You
The most common misreading is treating the p-value as the probability that the null hypothesis is true. If you get p = 0.05, it’s tempting to say “there’s only a 5% chance this result is due to chance.” That’s wrong. The p-value is calculated under the assumption that the null hypothesis is already true. It can’t loop back and tell you how likely that assumption is. As one widely cited paper in Clinical Orthopaedics and Related Research put it: “The p value is computed on the basis that the null hypothesis is true and therefore it cannot give any probability of it being more or less true.”
A p-value also doesn’t tell you:
- How big the effect is. A tiny, meaningless difference can produce a very small p-value if you have enough data.
- Whether the result matters in the real world. Statistical significance and practical importance are completely separate questions.
- Whether your study was well designed. A low p-value from a flawed experiment is still a flawed result.
Why Sample Size Changes Everything
P-values are highly sensitive to sample size. If there is any real difference between groups at all, even a trivial one, a larger dataset will tend to produce a smaller p-value. This is a mathematical inevitability: more data gives a statistical test more power to detect any deviation from the null hypothesis, no matter how tiny.
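A quick simulation makes the point; the 0.02-standard-deviation shift, the sample sizes, and the random seed below are all arbitrary, illustrative choices:

```python
import numpy as np
from scipy.stats import ttest_ind

# A trivially small true difference (a 0.02 standard-deviation shift)
# goes from undetectable to "significant" as the sample grows.
rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.00, 1.0, n)  # control group
    b = rng.normal(0.02, 1.0, n)  # shifted group
    print(f"n = {n:>9,}  p = {ttest_ind(a, b).pvalue:.4f}")
```

With a hundred observations per group the shift is invisible; with a million, the test flags it almost every time, even though the effect itself never changed.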
A classic example comes from the Physicians’ Health Study, which tested whether aspirin prevents heart attacks. With more than 22,000 participants followed over five years, the study found a highly significant result: p less than 0.00001. That looks overwhelming. But the actual size of the effect was minuscule. The difference in heart attack risk between the aspirin and placebo groups was 0.77%, and the effect size (measured as r²) was 0.001. Based on those results, aspirin was widely recommended for prevention. Many people who took it experienced no benefit but were exposed to side effects like bleeding. Later studies found even smaller effects, and the recommendation has since been scaled back.
This is a textbook case of statistical significance without practical significance. When you’re reading a study, a low p-value in a very large sample should prompt the question: how big is the actual effect?
Effect Size and Confidence Intervals
Because the p-value can’t tell you how large or meaningful a difference is, researchers increasingly report two additional pieces of information: effect size and confidence intervals.
Effect size measures the magnitude of the difference between groups. Unlike a p-value, it doesn’t shrink toward zero just because you collected more data; a larger sample only estimates it more precisely. A large effect size with a small p-value is a strong signal. A tiny effect size with a small p-value, as in the aspirin study, is a warning sign that the finding may not matter much in practice.
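For two groups, one common effect-size measure is Cohen’s d: the difference in means expressed in standard-deviation units. A minimal sketch (the function name and the synthetic data are ours, not from any study):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# The estimate centers on the true effect (0.02 here) at any sample size;
# more data just tightens it, unlike a p-value, which keeps shrinking.
rng = np.random.default_rng(7)
for n in (1_000, 1_000_000):
    a = rng.normal(0.02, 1.0, n)
    b = rng.normal(0.00, 1.0, n)
    print(f"n = {n:>9,}  d = {cohens_d(a, b):+.3f}")
```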
Confidence intervals show the range of values that plausibly contain the true effect. A 95% confidence interval, for instance, gives you boundaries: the real answer likely falls somewhere in this range. A narrow interval means the estimate is precise. A wide one means there’s a lot of uncertainty. The p-value and the confidence interval are mathematically related: if a 95% interval excludes the null value (say, a difference of zero), the corresponding p-value is below 0.05, and the further the interval sits from that value, the smaller the p-value gets. But the confidence interval communicates something the p-value can’t on its own: the likely size of the effect and how much the estimate might vary. If you only look at one number, the confidence interval is generally more informative.
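A minimal sketch of a 95% interval for a difference in means, using the usual normal approximation (the function name and data are ours):

```python
import numpy as np
from scipy.stats import norm

def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for the difference in means."""
    diff = np.mean(a) - np.mean(b)
    se = np.sqrt(np.var(a, ddof=1) / len(a) + np.var(b, ddof=1) / len(b))
    z = norm.ppf(0.5 + level / 2)  # 1.96 for a 95% interval
    return diff - z * se, diff + z * se

# If the interval excludes zero, the matching two-sided p-value is below
# 0.05; the further it sits from zero, the smaller that p-value is.
rng = np.random.default_rng(3)
treated = rng.normal(0.5, 1.0, 500)
control = rng.normal(0.0, 1.0, 500)
low, high = mean_diff_ci(treated, control)
print(f"95% CI for the difference: ({low:.2f}, {high:.2f})")
```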
How to Read a P-Value in Practice
When you encounter a p-value in a news article, a study summary, or a medical report, here’s a practical way to interpret it. First, check whether it’s above or below the significance threshold being used (usually 0.05). A value below that line means the researchers consider the result unlikely to be explained by chance alone. A value above it means the data didn’t provide strong enough evidence to rule out chance.
Next, look at the effect size. A statistically significant finding with a tiny real-world effect may not change anything about how you’d act. Then check the sample size. Thousands of participants can make minuscule differences look “significant” in the statistical sense. Finally, look for confidence intervals. They’ll give you a clearer picture of how uncertain the estimate really is.
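Here’s the whole checklist run on synthetic data: a hypothetical large study with a deliberately tiny true effect (every number below is illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical large study: 200,000 people per group, true effect 0.03 SD.
rng = np.random.default_rng(1)
n = 200_000
control = rng.normal(0.00, 1.0, n)
treated = rng.normal(0.03, 1.0, n)

p = ttest_ind(treated, control).pvalue
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)

print(f"p-value: {p:.1e}")        # far below 0.05: "significant"
print(f"effect:  {diff:.3f} SD")  # but tiny in practical terms
print(f"95% CI:  ({diff - 1.96 * se:.3f}, {diff + 1.96 * se:.3f})")
```

The p-value alone says “significant.” The effect size and the interval tell the more useful story: a real but trivial difference, measured precisely.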
A p-value is a useful starting point, not a finish line. It answers one narrow question, “how surprising is this data if nothing is going on?”, and it’s most valuable when you read it alongside the size of the effect and the precision of the estimate. Fisher himself saw it as one piece of evidence in a larger puzzle, and that’s still the best way to treat it.