Why Are Small Sample Sizes Bad in Research?

Small sample sizes weaken research because they make results unstable, inflate the apparent size of effects, and increase the chance of both false positives and false negatives. A study with too few participants is essentially rolling the dice: any finding it produces, whether positive or negative, carries so much uncertainty that it may not hold up when tested again with more people. This matters whether you’re reading a nutrition study, a drug trial, or a psychology experiment.

The Core Problem: Statistical Power

Statistical power is the probability that a study will detect a real effect if one actually exists. The widely accepted minimum is 80%, meaning the study has an 8-in-10 chance of catching a true result. Small samples drag power well below that threshold. When a study is underpowered, it’s like trying to hear a conversation across a loud room: the signal might be there, but you can’t pick it up through the noise.

The relationship between sample size and power isn’t subtle. For a moderate effect, increasing a sample from 8 to 30 participants can dramatically improve a study’s ability to detect real differences. And when the effect being studied is small (which is common in medicine and social science), the required sample size climbs steeply. A study looking for a small but meaningful benefit of a new therapy might need hundreds of participants to have adequate power, while a study with 20 participants would miss the same benefit most of the time.
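
To see how steep this relationship is, here’s a rough simulation (a sketch in Python, assuming a hypothetical true effect of half a standard deviation; the numbers are illustrative, not from any real study). It simply counts how often a two-group comparison of a given size detects the effect:

```python
# Monte Carlo estimate of statistical power for a simple two-group comparison.
# Assumes a hypothetical true effect of 0.5 standard deviations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect=0.5, alpha=0.05, n_sims=10_000):
    """Fraction of simulated studies that detect the true effect."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect, 1.0, n_per_group)
        _, p = stats.ttest_ind(treated, control)
        if p < alpha:
            hits += 1
    return hits / n_sims

for n in (8, 30, 100):
    print(f"n = {n:3d} per group -> power ≈ {estimated_power(n):.2f}")
```

With these illustrative numbers, 8 participants per group detects the effect only about 15% of the time, 30 per group still less than half the time, and it takes on the order of 100 per group to clear the 80% benchmark comfortably.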

False Negatives Become Likely

A Type II error, or false negative, happens when a study concludes there’s no effect even though one actually exists. Small samples make this far more likely. If a new treatment genuinely works but your trial only enrolled 15 people, random variation among those few participants can easily mask the benefit. The study reports “no significant difference,” and a potentially useful treatment gets shelved.

This isn’t just a statistical inconvenience. Increasing sample size is the primary way researchers reduce false negatives, and it also makes positive findings more trustworthy, because a significant result from a well-powered study is far more likely to reflect a real effect than a lucky draw. When a sample is too small to reliably represent the broader population, the data it produces can point in almost any direction, regardless of what’s actually true.
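
To make that concrete, here’s a quick sketch of how far small-trial data can wander. It assumes a hypothetical real benefit of 0.5 standard deviations and, generously, 15 participants per group rather than 15 in total:

```python
# With 15 people per group, the observed effect swings wildly from study to
# study even though the true effect never changes. Illustrative numbers only.
import numpy as np

rng = np.random.default_rng(1)
true_effect, n = 0.5, 15
observed = np.array([
    rng.normal(true_effect, 1.0, n).mean() - rng.normal(0.0, 1.0, n).mean()
    for _ in range(1000)
])

print(f"true effect:      {true_effect:+.2f}")
print(f"observed effects: {observed.min():+.2f} to {observed.max():+.2f}")
print(f"wrong direction:  {(observed < 0).mean():.0%} of studies")
```

Across a thousand simulated replications, the observed effects typically range from an apparent harm to a benefit roughly triple the real one, and nearly one study in ten points in the wrong direction entirely.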

Inflated Effects and the Winner’s Curse

Here’s a counterintuitive problem: small studies that do find statistically significant results tend to overestimate how large the effect really is. This phenomenon is sometimes called the winner’s curse. It happens because, in a small sample, the only way to clear the bar for statistical significance is if the data happens to skew in the direction of a large effect. Smaller, more realistic effects get filtered out as non-significant, so the “winning” results that get published are disproportionately inflated.

This creates a predictable pattern. A small, splashy study reports that some intervention cuts risk by 40%. A larger replication study, with hundreds or thousands of participants, finds the real benefit is closer to 10%. The original wasn’t necessarily fraudulent or poorly designed. It was just too small to produce a stable estimate. The confidence interval around its finding was so wide that the reported number could have landed almost anywhere.
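
The winner’s curse is easy to reproduce in simulation. The sketch below assumes a hypothetical true effect of 0.3 standard deviations with 20 participants per group, then looks only at the “winning” studies that cleared p < 0.05:

```python
# Winner's curse: keep only the significant small studies and their effect
# estimates come out inflated. Hypothetical true effect: 0.3 SD, n = 20/group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n = 0.3, 20
winners = []
for _ in range(20_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        winners.append(treated.mean() - control.mean())

print(f"true effect:                {true_effect:.2f}")
print(f"average 'winning' estimate: {np.mean(winners):.2f}")
```

With these numbers, the significant studies report an average effect of roughly 0.8 standard deviations, well over double the truth, which is exactly the inflate-then-shrink pattern described above.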

P-Values Become Unreliable

The p-value, the number researchers use to judge whether a result is statistically significant, behaves erratically in small samples. Whether a result crosses the traditional significance threshold of 0.05 depends not just on whether an effect is real, but on the sample size and how much variation exists in the measurements.

Consider a concrete example. In a study with just 10 people per group, a clinically meaningless difference between groups can appear statistically significant if the measurements happen to be very precise (p = 0.038). But a large, clinically meaningful difference in that same small sample can fail to reach significance if there’s even moderate variability in the data (p = 0.071). The p-value is bouncing around based on noise rather than reflecting reality. In larger samples, these quirks smooth out, and p-values become much more stable indicators of whether an effect is genuine.
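
Rerunning the same small experiment in simulation makes the instability vivid. This sketch (illustrative numbers again: a true effect of 0.5 standard deviations, 10 people per group) repeats the identical study five times:

```python
# Five replications of the identical small study: same true effect every
# time, wildly different p-values. Illustrative numbers only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for run in range(1, 6):
    control = rng.normal(0.0, 1.0, 10)
    treated = rng.normal(0.5, 1.0, 10)   # true effect never changes
    _, p = stats.ttest_ind(treated, control)
    print(f"replication {run}: p = {p:.3f}")
```

A typical set of runs lands on both sides of the 0.05 line, sometimes by a wide margin, even though nothing about the underlying effect changed between replications.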

Poor Generalizability

Every study sample is meant to stand in for a larger population. When that sample is small, the gap between the two can be enormous, purely by chance. A sample of 12 college students doesn’t capture the diversity of age, genetics, health conditions, and life circumstances in the general population. Even with perfectly random selection, small samples routinely differ from the population they’re drawn from in ways that skew results.

Research on generalization has shown that sharp inferences from small experiments to large populations are difficult even under ideal sampling conditions. The statistics commonly used to assess generalizability are themselves sensitive to sample size, meaning that small studies lack the tools to even diagnose their own representativeness. In practice, many small studies also use convenience samples (whoever is available), compounding the problem further. The result is findings that may be true for a narrow slice of people but misleading when applied broadly.
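
The chance component alone is easy to quantify. A sketch, assuming a hypothetical IQ-like population with mean 100 and standard deviation 15:

```python
# How far do randomly drawn samples stray from the population they represent?
# Hypothetical population: mean 100, SD 15 (an IQ-like scale).
import numpy as np

rng = np.random.default_rng(4)
population = rng.normal(100, 15, 1_000_000)

for n in (12, 1000):
    means = [rng.choice(population, n).mean() for _ in range(2000)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    print(f"n = {n:4d}: 95% of sample means land between {lo:.1f} and {hi:.1f}")
```

Samples of 12 routinely miss the population mean by several points in either direction; samples of 1,000 rarely stray by more than one. And that’s with perfect random sampling on a single, well-behaved variable; convenience sampling and real human diversity only widen the gap.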

Publication Bias Amplifies the Damage

Small sample sizes don’t exist in a vacuum. They interact with a well-documented flaw in how science gets published: journals are far more likely to accept studies with significant, positive results. When a small study finds nothing noteworthy, it often ends up in what researchers call the “file drawer,” unpublished and invisible. This means the published record is disproportionately filled with small studies that happened to get lucky, while the null results that would balance the picture are missing.

This matters most for meta-analyses, which pool results from many studies to estimate a true effect. If the pool is contaminated with inflated small-study results and missing the null findings, the meta-analytic estimate will be larger than the real effect. The entire evidence base on a topic can be distorted because small, underpowered studies fed selectively into the literature.
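
The distortion is straightforward to simulate: generate many small studies of the same weak effect, “publish” only the significant ones, and compare the pooled estimates. The numbers below are illustrative (a hypothetical true effect of 0.2 standard deviations, 20 participants per group):

```python
# File-drawer effect: pooling only published (significant) small studies
# inflates the estimate; pooling everything recovers the truth.
# Hypothetical true effect: 0.2 SD; 500 studies of 20 per group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_effect, n = 0.2, 20
all_studies, published = [], []
for _ in range(500):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    diff = treated.mean() - control.mean()
    all_studies.append(diff)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        published.append(diff)    # the rest go in the file drawer

print(f"true effect:            {true_effect:.2f}")
print(f"pooled, all studies:    {np.mean(all_studies):.2f}")
print(f"pooled, published only: {np.mean(published):.2f}  ({len(published)} of 500)")
```

Pooling every study, nulls included, recovers the true effect almost exactly; pooling only what survived the significance filter lands several times higher. A real meta-analysis weights studies more carefully than this simple average, but it inherits the same filtering problem.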

The Ethical Dimension

When research involves human participants or animals, small sample sizes raise ethical concerns beyond just bad statistics. Every participant in a clinical trial accepts some burden: time, potential side effects, the possibility of receiving an inferior treatment. If the study is too small to produce a reliable answer, those burdens were shouldered for nothing. International clinical research guidelines state that the number of subjects should always be large enough to provide a reliable answer, and that underpowered trials should be avoided.

There’s also a flip side. An underpowered study might fail to detect that a treatment works, delaying access for patients who could benefit. Or it might miss harmful side effects entirely. The ethical imperative runs in both directions: enroll enough people to get a trustworthy answer, but not so many that participants are exposed to an inferior treatment unnecessarily.

Why Larger Trials Use More Participants

The structure of clinical drug trials illustrates why sample size scales with the stakes of the question being asked. Phase 1 trials typically involve 20 to 80 people and focus narrowly on safety and dosing. Phase 2 expands to a few hundred patients and begins evaluating whether the drug works, though these studies are explicitly acknowledged as not large enough to confirm a treatment benefit. Phase 3 trials enroll 300 to 3,000 participants specifically because that size is needed to demonstrate whether a product offers a real benefit and to catch long-term or rare side effects that smaller studies would miss.

Each phase increases the sample because the questions get harder. Detecting whether a drug is broadly effective across a diverse patient population, or whether it causes a side effect in 1 out of 500 people, requires statistical power that only comes with numbers. A Phase 1 trial with 30 people simply cannot answer Phase 3 questions, no matter how well it’s designed.
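
The rare-side-effect point is simple arithmetic: if a side effect strikes 1 in 500 patients, the chance that a trial of n participants sees even one case is 1 - (1 - 1/500)^n. At the typical size of each phase:

```python
# Probability that a trial observes at least one case of a side effect
# affecting 1 in 500 patients, at the typical size of each trial phase.
risk = 1 / 500
for n in (30, 300, 3000):
    p_seen = 1 - (1 - risk) ** n
    print(f"n = {n:4d}: {p_seen:5.1%} chance of seeing the side effect at all")
```

A 30-person Phase 1 trial has less than a 6% chance of observing the side effect even once, while a 3,000-person Phase 3 trial is all but guaranteed to. And that’s just the chance of seeing a single case, before any statistical work to attribute it to the drug.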

What “Too Small” Actually Means

There’s no single number that separates an adequate sample from an inadequate one. The required size depends on the expected effect size (how big the difference is between groups), the variability in the measurements, and the acceptable error rates. A study comparing a powerful surgical intervention to a placebo might need only 30 participants because the effect is large and obvious. A study comparing two similar blood pressure medications might need thousands because the difference between them is small.
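
That dependence on effect size can be made precise. A minimal sketch in Python using statsmodels, with the conventional targets (80% power, 5% significance) and Cohen’s rough benchmarks for large, moderate, and small effects:

```python
# Minimum per-group sample for 80% power at alpha = 0.05, as the assumed
# effect size (Cohen's d) shrinks.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.8, 0.5, 0.2):                    # large / moderate / small effect
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
    print(f"d = {d}: about {n:.0f} participants per group")
```

With these standard inputs, a large effect needs only about 26 participants per group, a moderate one about 64, and a small one just under 400. Required size scales with the inverse square of the effect, so halving the expected effect roughly quadruples the sample a trustworthy study needs.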

The standard approach is a power analysis, done before the study begins, that calculates the minimum sample needed to have at least an 80% chance of detecting the expected effect with a false positive rate of 5% or lower. Studies that skip this step, or that proceed despite knowing they’re underpowered, produce results that are hard to interpret and potentially misleading. When you encounter a study with a small sample, the most important question isn’t whether its finding reached statistical significance. It’s whether the study was ever capable of giving a reliable answer in the first place.