What Is a Null Distribution in Statistics?

A null distribution is the pattern of results you’d expect to see from a statistical test if nothing interesting were actually happening in your data. It represents the full range of outcomes that could occur purely by chance, with no real effect or relationship present. Every time a researcher calculates a p-value or declares a result “statistically significant,” they’re comparing what they actually found against this baseline distribution of chance outcomes.

The Core Idea

Imagine you’re testing whether a new teaching method improves exam scores. Your null hypothesis is the skeptic’s position: “This teaching method makes no difference.” The null distribution is the answer to a follow-up question: “If that skeptic is right, what range of results would I still expect to see just from normal variation among students?”

Some classes will score a little higher, some a little lower, purely by luck. The null distribution maps out exactly how often each of those random differences would occur. It’s a probability curve, and every point on it represents how likely a particular result is when nothing real is going on. Most results cluster near zero difference (as you’d expect if the method truly doesn’t help), with increasingly extreme results becoming increasingly rare out in the tails of the curve.
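You can watch a null distribution take shape by simulating this scenario directly. The sketch below uses made-up numbers (scores averaging 75 with a standard deviation of 10, 30 students per class, and no real effect in either group) purely for illustration:

```python
import random
import statistics

random.seed(42)

# Hypothetical setup: scores vary randomly around 75 with SD 10,
# and the teaching method has no effect in either class.
def simulate_difference(n_per_class=30):
    """Mean score gap between two classes when nothing real is going on."""
    class_a = [random.gauss(75, 10) for _ in range(n_per_class)]
    class_b = [random.gauss(75, 10) for _ in range(n_per_class)]
    return statistics.mean(class_a) - statistics.mean(class_b)

# Collect 5,000 chance differences: an approximate null distribution
null_diffs = [simulate_difference() for _ in range(5000)]

# Most gaps cluster near zero; large gaps sit rarely out in the tails
print(round(statistics.mean(null_diffs), 2))
print(sum(abs(d) > 5 for d in null_diffs) / len(null_diffs))
```

Plotting `null_diffs` as a histogram would show exactly the curve described above: a peak at zero with thin tails on either side.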

How It Connects to P-Values

The p-value is simply the area under the null distribution curve from your observed result outward into the tail (counting both tails, for a two-sided test). If your test produces a statistic of 2.5 on a t-distribution, the p-value equals the proportion of that curve that sits at 2.5 or beyond. A small area means your result would be rare under pure chance, which gives you grounds to reject the null hypothesis.

Think of it this way: the null distribution tells you what “normal randomness” looks like. Your test statistic is what you actually got. The p-value measures how far into unusual territory your result falls on that distribution. A p-value of 0.03, for instance, means only 3% of the null distribution sits at or beyond your result. That’s a fairly unlikely outcome if nothing real is happening.
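For a test statistic measured on the standard normal curve, this tail-area calculation is short enough to write out. A minimal sketch using Python's built-in `statistics.NormalDist` (a normal approximation; with small samples a t-curve would give a slightly larger p-value):

```python
from statistics import NormalDist

def two_sided_p(z):
    """Proportion of the standard normal null distribution at |z| or beyond."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A result 2.5 standard errors from zero is rare under pure chance
print(round(two_sided_p(2.5), 4))  # about 0.0124
```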

The Role of Alpha and the Rejection Region

Before running a test, researchers set an alpha level, which is the threshold for calling a result statistically significant. The most common alpha is 0.05, meaning the researcher accepts a 5% chance of rejecting the null hypothesis when it is actually true. There’s nothing mathematically sacred about 0.05. It was historically chosen because a 1-in-20 chance of error seemed acceptable for many applications, but researchers working in high-stakes fields often use stricter thresholds like 0.01.

The alpha level carves out a “rejection region” on the null distribution. For a two-sided test at the 0.05 level, the rejection region is the most extreme 5% of the curve, split into 2.5% on each tail. Any test statistic that lands in this zone is considered statistically significant. For a one-sided test, the full 5% sits in a single tail, making the critical value less extreme on that side (a z-score of 1.645 rather than 1.96).

If your test statistic falls inside the rejection region, you reject the null hypothesis. If it doesn’t, you fail to reject it. The null distribution is the map, the alpha level draws the boundary line, and your test statistic is the pin you drop on it.

Common Null Distribution Shapes

The specific shape of a null distribution depends on the type of test you’re running. A few show up constantly across scientific research:

  • Normal (z) distribution: The classic bell curve, used when sample sizes are large or the population standard deviation is known. It’s symmetric and centered at zero.
  • t-distribution: Similar to the normal curve but with heavier tails, accounting for extra uncertainty in smaller samples. The smaller the sample, the fatter the tails. As sample size grows, it converges toward the normal distribution.
  • Chi-square distribution: Used for tests involving categorical data, like whether the distribution of responses across categories differs from what you’d expect. It’s right-skewed rather than symmetric.
  • F-distribution: Used when comparing variability across multiple groups, as in ANOVA tests. Also right-skewed.

Each of these is a theoretical null distribution, meaning mathematicians have worked out their exact shapes based on known properties. When a researcher runs a t-test, for example, they already know the null distribution is a t-curve with a specific number of degrees of freedom. They don’t have to generate it from scratch.
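The t-distribution’s heavier tails are easy to see by comparing tail areas at the same cutoff. A quick sketch, assuming `scipy` is available:

```python
from scipy import stats

# Two-sided probability of a result at least 2.5 from zero, under
# t-distributions with increasing degrees of freedom
for df in (3, 10, 30, 1000):
    print(df, round(2 * stats.t.sf(2.5, df), 4))

# The normal curve is the limit as degrees of freedom grow
print("z", round(2 * stats.norm.sf(2.5), 4))
```

The smaller the degrees of freedom, the larger the tail area: the same statistic counts as less surprising when the sample is small.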

When Assumptions Matter

Theoretical null distributions come with conditions. For a t-test to produce a valid null distribution, the data generally needs to be drawn randomly, measured on a numeric scale, approximately normally distributed (especially with small samples), and the groups being compared should have roughly equal variability. When these conditions hold, the mathematical t-distribution accurately represents what chance alone would produce.

When they don’t hold, the null distribution may not match reality. A test might tell you a result is rare under the null, when in fact it’s only rare under the assumed distribution shape. This is one reason researchers check their data’s properties before choosing a test, and it’s also why alternative approaches exist for data that breaks the rules.

Building a Null Distribution From Scratch

Sometimes the data doesn’t fit neatly into a known theoretical distribution. In those cases, researchers can build a null distribution empirically using a technique called permutation testing. The logic is straightforward: if the null hypothesis is true and there’s no real difference between groups, then it shouldn’t matter which data points belong to which group. You can shuffle the group labels, recalculate the test statistic, and see what you get.

The process works in five steps. First, calculate the test statistic on your real data. Second, randomly reshuffle which observations belong to which group. Third, recalculate the test statistic on the shuffled data. Fourth, repeat the shuffling and recalculating many times, typically 1,000 to 10,000 times. Fifth, compare your original test statistic to this collection of shuffled results. That collection is your empirical null distribution. If only 20 of your 10,000 permutations produced a result as extreme as your real one, your p-value is roughly 0.002.

By scrambling the data, you’re destroying any real relationship that might exist. What remains is pure noise. Doing this thousands of times builds a picture of what noise looks like for your specific dataset, no theoretical assumptions required.
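The five steps above can be sketched in a few lines of Python. The group data here is hypothetical, invented just to show the mechanics:

```python
import random
import statistics

random.seed(1)

def permutation_p_value(group_a, group_b, n_permutations=10_000):
    """Build an empirical null distribution by shuffling group labels."""
    observed = statistics.mean(group_a) - statistics.mean(group_b)  # step 1
    pooled = group_a + group_b
    n_a = len(group_a)
    count_extreme = 0
    for _ in range(n_permutations):                 # step 4: repeat many times
        random.shuffle(pooled)                      # step 2: reshuffle labels
        diff = (statistics.mean(pooled[:n_a]) -
                statistics.mean(pooled[n_a:]))      # step 3: recompute statistic
        if abs(diff) >= abs(observed):              # step 5: compare to observed
            count_extreme += 1
    return count_extreme / n_permutations

# Hypothetical exam scores: "new method" class vs. control class
new_method = [78, 82, 85, 90, 74, 88, 81, 79]
control    = [72, 75, 80, 70, 77, 74, 76, 73]
print(permutation_p_value(new_method, control))
```

The returned value is the fraction of shuffled datasets that produced a difference at least as extreme as the real one, which is exactly the empirical p-value described above.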

Type I and Type II Errors

The null distribution is directly tied to two kinds of mistakes researchers can make. A Type I error (false positive) happens when you reject the null hypothesis even though it’s actually true. The probability of this equals alpha, your significance threshold. At alpha = 0.05, you’ll incorrectly reject a true null hypothesis 5% of the time in the long run.

A Type II error (false negative) happens when you fail to reject the null hypothesis even though a real effect exists. The probability of this is called beta. Type II errors relate to the overlap between the null distribution and the distribution of results you’d see if the effect were real. When those two distributions overlap heavily, it becomes hard to tell signal from noise, and your risk of missing a real effect goes up. Increasing your sample size or looking for larger effects pulls the two distributions apart, reducing this risk.
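The claim that alpha equals the long-run false positive rate can be checked by simulation: run many experiments where the null hypothesis is true by construction and count how often a z-test rejects it. A stdlib sketch:

```python
import math
import random
from statistics import NormalDist, mean

random.seed(7)

def z_statistic(sample, mu=0, sigma=1):
    """Z statistic for a sample mean against a known population."""
    return (mean(sample) - mu) / (sigma / math.sqrt(len(sample)))

critical = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05

# Simulate 4,000 experiments in which the null is true (no effect at all)
false_positives = 0
for _ in range(4000):
    sample = [random.gauss(0, 1) for _ in range(25)]
    if abs(z_statistic(sample)) > critical:
        false_positives += 1

print(false_positives / 4000)  # hovers near alpha = 0.05
```

Every rejection in this simulation is a Type I error, because no effect exists; the rejection rate settles near 5%, just as alpha promises.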

Both errors are unavoidable in statistical testing. Lowering your alpha (say, from 0.05 to 0.01) reduces false positives but makes false negatives more likely, because you’re demanding stronger evidence before you’ll reject the null. The null distribution sits at the center of this tradeoff: it defines what “normal” looks like, and every decision about significance is ultimately a judgment call about how far from that normal your result needs to be before you take it seriously.