Larger sample sizes push p-values lower when a real effect exists. This happens through a specific mathematical chain: more data reduces random noise, which makes your test statistic larger, which produces a smaller p-value. But the relationship isn’t as simple as “bigger is always better,” because a massive sample can also make trivially small effects look statistically significant.
The Math Behind the Relationship
The connection between sample size and p-value runs through something called the standard error. The standard error equals the standard deviation divided by the square root of the sample size. So if you quadruple your sample size, you cut the standard error in half. If you multiply it by nine, the standard error drops to a third.
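As a minimal sketch of that scaling, a few lines of Python make the pattern concrete (the standard deviation of 10 is purely illustrative):

```python
import numpy as np

sd = 10.0  # illustrative standard deviation

for n in (100, 400, 900):
    se = sd / np.sqrt(n)  # standard error = SD / sqrt(n)
    print(f"n = {n:4d}  ->  SE = {se:.2f}")

# n =  100  ->  SE = 1.00
# n =  400  ->  SE = 0.50   (4x the sample, half the standard error)
# n =  900  ->  SE = 0.33   (9x the sample, a third of the standard error)
```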
This matters because most statistical tests work by dividing the observed difference between groups by the standard error. That ratio is your test statistic. A smaller standard error means a larger test statistic, and a larger test statistic means a smaller p-value. The actual difference between groups doesn’t need to change at all. Simply collecting more data tightens the estimate enough to make the same difference more “visible” to the test.
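A small sketch using scipy shows this mechanism directly; the means, standard deviation, and group sizes are made up for illustration, and the observed difference is held fixed while only the sample grows:

```python
from scipy.stats import ttest_ind_from_stats

# Same observed difference (means of 52 vs 50, SD of 10 in both groups),
# evaluated at two different sample sizes per group.
for n in (20, 500):
    t, p = ttest_ind_from_stats(mean1=52, std1=10, nobs1=n,
                                mean2=50, std2=10, nobs2=n)
    print(f"n per group = {n:3d}  t = {t:5.2f}  p = {p:.4f}")

# n per group =  20  t =  0.63  p = 0.5313   (noise swamps the difference)
# n per group = 500  t =  3.16  p = 0.0016   (same difference, now visible to the test)
```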
Think of it like listening for a faint sound. With a small sample, there’s a lot of background static, and you can’t tell if you’re hearing something real. A larger sample turns down the static, and the same faint sound becomes unmistakable. The sound didn’t get louder. The noise just got quieter.
When No Real Effect Exists
Here’s the important flip side: if there truly is no difference between groups, increasing the sample size does not drive the p-value toward zero. When the null hypothesis is true (meaning there’s genuinely nothing going on), p-values follow a uniform distribution. They’re equally likely to land anywhere between 0 and 1, regardless of whether your study has 50 participants or 50,000. A larger sample won’t manufacture a false finding out of thin air.
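A quick simulation illustrates this; the group sizes and number of simulated studies are arbitrary, and both groups are deliberately drawn from the same distribution:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulate many studies in which the null hypothesis is true:
# both groups come from the same distribution.
for n in (50, 5000):
    pvals = [ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
             for _ in range(2000)]
    print(f"n = {n:5d}  share of p < 0.05: {np.mean(np.array(pvals) < 0.05):.3f}")

# Both sample sizes give a share close to 0.05 — more data does not
# manufacture significance when there is genuinely nothing to find.
```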
This is why the sample size effect is conditional. It amplifies real signals, but it doesn’t create them. The catch, though, is that in practice almost nothing has an effect of exactly zero. Two groups will nearly always differ by some tiny, meaningless amount. And that’s where large samples create problems.
The Large Sample Size Fallacy
With enough participants, even a negligible difference produces a highly significant p-value. The classic example comes from the Physicians’ Health Study, which enrolled more than 22,000 subjects over five years to test whether aspirin prevents heart attacks. The result was a p-value below .00001, which looks overwhelmingly convincing. But the actual risk difference was just 0.77%. The effect size, measured by how much of the outcome aspirin explained, was 0.1%. That’s an extremely small real-world impact dressed up in a tiny p-value.
The study was terminated early because the evidence seemed conclusive, and aspirin was broadly recommended for prevention. But the statistical significance far outstripped the practical significance. A study with 200 people and the same true effect would likely have returned a nonsignificant p-value, not because aspirin worked differently, but because the smaller sample wouldn’t have had the precision to detect such a small difference.
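To see the same contrast numerically, here is a rough sketch using a two-proportion z-test from statsmodels. The event rates (1.0% versus 1.8%, a gap close to the 0.77-point risk difference above) and the group sizes are illustrative, not the study’s actual counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Illustrative only: event rates of roughly 1.0% vs 1.8%,
# first with about 11,000 per arm, then with 100 per arm.
for n_per_arm in (11_000, 100):
    events = np.round(np.array([0.010, 0.018]) * n_per_arm).astype(int)
    stat, p = proportions_ztest(events, np.array([n_per_arm, n_per_arm]))
    print(f"n per arm = {n_per_arm:6d}  events = {events}  p = {p:.6f}")

# n per arm =  11000  events = [110 198]  p well below .00001
# n per arm =    100  events = [  1   2]  p around 0.56
```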
This pattern is common. If a sample size reaches 10,000 or more, a significant p-value is likely even when the difference between groups is negligible and wouldn’t justify changing a treatment or policy. Researchers have called this the “large sample size fallacy,” treating statistical significance as the finish line when it should be just the starting point.
Effect Size Is the Missing Piece
The p-value is driven by two things: the size of the effect and the size of the sample. A small effect with a huge sample can produce the same p-value as a large effect with a modest sample. The p-value alone can’t tell you which situation you’re in.
This is why effect size measures exist. They quantify how big the difference actually is, independent of sample size. If you see a study reporting p = 0.001 but the effect size is tiny, the finding is statistically detectable but may not matter in practice. If a study reports p = 0.04 with a large effect size, the result is both real and meaningful, even though the p-value is less dramatic. Relatively few published studies explicitly report and discuss effect sizes, which makes it harder for readers to judge whether a significant p-value reflects something that actually matters.
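A small sketch (all numbers invented for illustration) shows two hypothetical studies that land at roughly the same p-value with very different effect sizes, here measured as Cohen’s d:

```python
from scipy.stats import ttest_ind_from_stats

def cohens_d(mean1, mean2, sd):
    """Standardized mean difference, assuming the same SD in both groups."""
    return (mean1 - mean2) / sd

# Two hypothetical studies that land at roughly the same p-value.
scenarios = [
    ("tiny effect, huge sample",    50.2, 50.0, 10, 20_000),
    ("large effect, modest sample", 58.0, 50.0, 10, 14),
]
for label, m1, m2, sd, n in scenarios:
    t, p = ttest_ind_from_stats(m1, sd, n, m2, sd, n)
    print(f"{label:28s}  d = {cohens_d(m1, m2, sd):.2f}  p = {p:.3f}")

# Both p-values sit near 0.045, but d = 0.02 versus d = 0.80 tells a very
# different story about whether the difference matters in practice.
```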
How Sample Size Connects to Statistical Power
Statistical power is the probability that your study will detect a real effect if one exists. Small sample sizes are the most common reason studies miss real effects, a mistake known as a Type II error or false negative. When you combine a small sample with a modest effect size, power drops sharply, and you’re likely to get a large, nonsignificant p-value even though the effect is real.
Increasing the sample size increases power, which means the study is more likely to return a small p-value when there’s genuinely something to find. This is the whole logic behind sample size calculations before a study begins: researchers estimate how big the effect is likely to be, then figure out how many participants they need to have a reasonable chance of detecting it. The conventional target is 80% power, meaning an 80% chance of getting a significant result if the effect is real.
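As a sketch of that planning step, statsmodels can solve for the required sample size given an assumed effect size, significance level, and target power; the effect sizes below are the conventional small, medium, and large benchmarks, used purely as examples:

```python
from statsmodels.stats.power import TTestIndPower

# How many participants per group are needed for an 80% chance of
# detecting a given standardized effect at alpha = 0.05?
analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"effect size d = {d}  ->  about {n:.0f} participants per group")

# d = 0.2  ->  about 394 per group
# d = 0.5  ->  about  64 per group
# d = 0.8  ->  about  26 per group
```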
A useful rule of thumb from the research literature: when researchers have difficulty estimating the expected effect size in advance, a minimum of 300 subjects often provides enough precision for stable, reliable results across a range of common analyses. That’s not a magic number, but it reflects the point at which estimates from the sample start reliably matching the true values in the broader population.
Putting It All Together
The relationship between sample size and p-value follows a clear logic. More participants reduce random variability, which increases the test statistic, which shrinks the p-value. This is exactly what you want when a meaningful effect exists and you need enough data to detect it. But the same mechanism means that massive datasets will flag differences too small to care about. The p-value tells you whether an effect is statistically detectable. It doesn’t tell you whether it’s large enough to be useful. For that, you need the effect size, and ideally a sample that’s large enough to detect meaningful differences without being so large that it detects everything.

