Null hypothesis significance testing (NHST) is a method of statistical inference used to determine whether an observed result in a study is likely due to a real effect or simply due to chance. It works by assuming there is no effect, then calculating how probable your data would be under that assumption. If the probability is low enough, you reject the assumption and conclude the effect is likely real. NHST is the backbone of most published research in psychology, medicine, and the social sciences, though it comes with important limitations worth understanding.
How the Two Hypotheses Work
Every study using NHST starts with two competing statements. The null hypothesis proposes that nothing interesting is happening: there is no difference between groups, no relationship between variables, no effect of a treatment. The alternative hypothesis is what the researcher actually believes or is trying to demonstrate, such as a new drug working better than a placebo.
Counterintuitively, the entire framework is built around trying to disprove the null hypothesis rather than directly proving the alternative. You don’t gather evidence for your theory. Instead, you gather evidence against the “no effect” scenario, and if that evidence is strong enough, you reject the null. Think of it like a courtroom: the null hypothesis is “innocent until proven guilty,” and your data is the prosecution’s case. You either find enough evidence to reject innocence, or you don’t.
These hypotheses can be one-tailed or two-tailed. A two-tailed hypothesis simply states that there is a difference between groups without specifying which direction. A one-tailed hypothesis predicts a specific direction, like “Treatment A is better than Treatment B.” Most studies use two-tailed tests because they allow for surprises in either direction.
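The relationship between the two is easy to see numerically. The sketch below (all names and the test statistic value are illustrative, using a standard normal reference distribution) shows that a two-tailed p-value is simply twice the one-tailed value for the same statistic:

```python
from math import erf, sqrt

def normal_sf(z: float) -> float:
    """Survival function P(Z > z) for a standard normal variable."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

z = 1.8  # a hypothetical observed test statistic
one_tailed = normal_sf(z)            # predicts a specific direction
two_tailed = 2 * normal_sf(abs(z))   # allows a difference in either direction
print(round(one_tailed, 4), round(two_tailed, 4))  # 0.0359 0.0719
```

Note that this same result would be "significant" at 0.05 under a one-tailed test but not under a two-tailed one, which is why the choice must be made before seeing the data.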
What a P-Value Actually Tells You
The p-value is the probability of obtaining a result equal to or more extreme than what you observed, assuming the null hypothesis is true. In plain terms: if there really were no effect, how often would you see data this dramatic just by random chance? A p-value of 0.03, for instance, means there’s a 3% chance of getting results at least this extreme in a world where the treatment does absolutely nothing.
A common misunderstanding is that the p-value tells you the probability that your hypothesis is correct. It does not. It only tells you how surprising your data would be if the null hypothesis were true. That distinction matters enormously, because a small p-value doesn’t prove your theory is right. It only suggests that the “no effect” explanation is hard to square with what you observed.
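For the simplest possible null hypothesis, "this coin is fair," the p-value can be computed exactly rather than looked up in a table. This sketch (the function name and the example counts are invented for illustration) sums the probability of every outcome at least as extreme as the one observed:

```python
from math import comb

def binomial_p_value(heads: int, flips: int) -> float:
    """Exact two-tailed p-value for the null hypothesis 'the coin is fair'.

    Sums the probability, under a fair coin, of every outcome at least as
    far from the expected flips/2 heads as the observed count.
    """
    expected = flips / 2
    observed_dev = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_dev:
            total += comb(flips, k) * 0.5 ** flips
    return total

# 60 heads in 100 flips: unusual for a fair coin, yet p is about 0.057,
# so it narrowly misses the conventional 0.05 threshold.
print(round(binomial_p_value(60, 100), 4))
```

The output is the probability of data this lopsided given a fair coin, not the probability that the coin is fair, which is exactly the distinction the paragraph above draws.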
The 0.05 Threshold and Why It Exists
In practice, researchers compare their p-value to a pre-set cutoff called the alpha level. The most widely used alpha level is 0.05, meaning you’ll reject the null hypothesis if there’s less than a 5% probability of seeing your results under the assumption of no effect. If your p-value lands at or below alpha, the result is declared “statistically significant.”
The 0.05 standard is more convention than law. Some fields use stricter thresholds. Journal editors have historically favored 0.01, and a 2018 proposal by a large group of researchers recommended shifting the default to 0.005 to reduce false positives. Particle physics famously requires a threshold equivalent to about 0.0000003 (the “five sigma” standard) before claiming a discovery. The key point is that 0.05 is not a magic number that separates real effects from fake ones. It is an arbitrary but widely accepted line in the sand.
Type I and Type II Errors
Because NHST is based on probabilities, it can go wrong in two specific ways. A Type I error (false positive) happens when you reject the null hypothesis even though it is actually true. You conclude an effect exists when it doesn’t. When the null hypothesis is in fact true, the probability of making this mistake equals your alpha level. At the standard 0.05 threshold, you accept a 5% chance of a false positive.
A Type II error (false negative) happens when you fail to reject the null hypothesis even though there really is an effect. You miss something real. The probability of this error is called beta. Reducing one type of error generally increases the other, so researchers have to decide which mistake is more costly for their particular study. In drug safety research, for example, a false negative (missing a dangerous side effect) could be worse than a false positive.
Statistical Power and Sample Size
Power is the probability that your study will correctly detect a real effect when one exists. It is calculated as 1 minus beta, and a power of 0.80 (80%) is the most common target. That means you’re accepting a 20% chance of missing a real effect.
Three things drive power. First, sample size: more participants give you more ability to detect effects. Second, effect size: bigger effects are easier to spot. Third, your alpha level: a stricter threshold makes it harder to reach significance, reducing power unless you compensate with more data. These three factors are linked, so researchers need to plan carefully before collecting data. Studies with small samples and small expected effects often have dangerously low power, meaning they waste time and resources because they’re unlikely to find anything even if the effect is real. As the expected effect gets smaller, the required sample size climbs sharply to maintain adequate power.
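The link between sample size, effect size, and power can be made concrete by simulation. In this sketch (the 0.3-standard-deviation effect, trial counts, and seed are all illustrative assumptions), power is estimated as the fraction of simulated studies that reach significance when a real effect exists:

```python
import random
from math import erf, sqrt

random.seed(1)  # illustrative seed, makes the estimates reproducible

def rejects_null(sample, alpha=0.05, sigma=1.0):
    """True if a two-tailed z-test of 'mean == 0' is significant at alpha."""
    n = len(sample)
    z = abs(sum(sample) / n) / (sigma / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2)))) < alpha

def estimated_power(effect, n, trials=2000):
    """Fraction of simulated studies that detect a real effect of size `effect`."""
    hits = sum(
        rejects_null([random.gauss(effect, 1) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials

# A modest real effect (0.3 standard deviations) at increasing sample sizes:
# power climbs from well below the 0.80 target toward near-certainty.
for n in (20, 50, 100, 200):
    print(n, round(estimated_power(0.3, n), 2))
```

With n = 20 such a study detects the effect only around a quarter of the time, which is what "dangerously low power" means in practice.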
Why Effect Size Matters More Than P-Values
One of the most important things to understand about NHST is that statistical significance does not tell you how large or meaningful an effect is. A p-value tells you whether an effect likely exists, not whether it matters. With a large enough sample, a statistical test will almost always produce a significant result, even if the actual difference is trivially small.
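That sample-size dependence is easy to demonstrate with a hypothetical calculation: the same trivially small mean difference (here, 1% of a standard deviation, an invented illustrative figure) tested with a z-test at increasing sample sizes:

```python
from math import erf, sqrt

def two_tailed_p(mean_diff, sigma, n):
    """Two-tailed z-test p-value for an observed mean difference (sigma known)."""
    z = abs(mean_diff) / (sigma / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

tiny_effect = 0.01  # 1% of a standard deviation: practically negligible
for n in (100, 10_000, 1_000_000):
    # The effect never changes, but the p-value collapses toward zero as n grows.
    print(n, two_tailed_p(tiny_effect, sigma=1.0, n=n))
```

The effect size is identical in every row; only the sample size changes, yet the result moves from nowhere near significance to overwhelmingly "significant."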
A well-known example comes from a large aspirin study that found a statistically significant reduction in heart attack risk. The p-value was impressive. But the actual risk difference was only 0.77%, an extremely small effect. Many people were advised to take aspirin based on that finding, exposing them to side effects for a benefit that was barely measurable at the individual level. This illustrates why effect size, which quantifies the magnitude of a difference independent of sample size, is essential to report alongside p-values. As one prominent statistician put it: “Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude, not just does a treatment affect people, but how much does it affect them.”
Common Criticisms of NHST
NHST has drawn criticism from methodologists for more than half a century, and those concerns have intensified during the replication crisis, the discovery that many published findings in psychology, neuroscience, and biomedicine fail to hold up when other researchers try to reproduce them.
One core problem is the “nil hypothesis” issue. Most studies set the null hypothesis as a true mean difference of exactly zero. In the real world, it is almost always the case that two groups differ by at least some tiny, meaningless amount. With enough data, you can detect that tiny difference and declare it significant, even though it has no practical relevance. The NHST machinery guarantees that any non-zero effect, no matter how small, will become statistically significant if you collect enough data.
Another criticism is binary thinking. Reducing a complex result to “significant” or “not significant” based on whether a p-value falls above or below a cutoff encourages black-and-white conclusions. A p-value of 0.049 gets treated as a discovery, while 0.051 gets treated as nothing, even though those two results are nearly identical. Critics argue that this all-or-nothing framework is unreliable for making nuanced scientific judgments, especially when theoretical predictions are weak.
Perhaps the most persistent misuse is treating a failure to reject the null as proof that no effect exists. NHST does not let you draw that conclusion. All you can say is that no significant effect was observed. The absence of evidence is not evidence of absence, especially in underpowered studies that lacked the sample size to detect real effects in the first place.
NHST in Context
NHST is actually a hybrid of two historically separate ideas. Ronald Fisher developed significance testing in the 1920s as a way to measure the strength of evidence against a null hypothesis using p-values. Jerzy Neyman and Egon Pearson developed hypothesis testing in the 1930s as a decision-making framework with formal error rates (alpha and beta) and pre-defined rejection regions. Modern NHST blends these two approaches, sometimes awkwardly, treating the p-value as both a measure of evidence and a decision tool.
Despite its flaws, NHST remains the dominant framework in published research because it provides a standardized, relatively simple procedure for evaluating results. The key is understanding what it can and cannot tell you. It can tell you whether your data are surprising under the assumption of no effect. It cannot tell you the probability that your hypothesis is true, how large or important an effect is, or whether a result will replicate. Used carefully alongside effect sizes, confidence intervals, and adequate sample sizes, it is one useful tool among several for interpreting data.