What Is a Significance Test in Statistics?

A significance test is a statistical method for deciding whether a pattern you see in data reflects something real or could just be a fluke of random chance. You start with the assumption that nothing interesting is happening (no real effect, no real difference), then check whether your data is strange enough to cast doubt on that assumption. If it is, you call the result “statistically significant.” The entire process runs on probability, and the key output is a number called the p-value.

The Core Logic: Null and Alternative Hypotheses

Every significance test begins with two competing claims. The first is the null hypothesis, which states that there is no effect, no difference, or no relationship. If you’re testing whether a new drug lowers blood pressure, the null hypothesis says it doesn’t. The second is the alternative hypothesis, which says the effect is real. These two claims are mutually exclusive: one must be wrong.

The test doesn’t try to prove the alternative hypothesis directly. Instead, it asks: if the null hypothesis were true, how likely would we be to see data this extreme? That backward logic is the foundation of the whole process. You assume nothing is happening, then see if your data is too surprising to support that assumption.

What the P-Value Actually Tells You

The p-value is the probability of getting results as extreme as (or more extreme than) what you observed, assuming the null hypothesis is true. A p-value of 0.03 means there’s a 3% chance of seeing data this unusual if there really were no effect. The smaller the p-value, the harder it becomes to blame your results on random variation.
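To make that definition concrete, here is a hypothetical coin-flip experiment (the numbers are invented for illustration): you flip a coin 100 times and see 60 heads. Under the null hypothesis that the coin is fair, the p-value is the probability of a result at least that extreme, which can be computed exactly from the binomial distribution:

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """Two-sided p-value for observing k or more successes out of n trials,
    under a null hypothesis that each trial succeeds with probability p."""
    # One-sided tail: probability of k or more successes if the null is true.
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    # Double the tail for a two-sided test (the null p = 0.5 is symmetric).
    return min(1.0, 2 * tail)

p_val = binomial_p_value(100, 60)
print(round(p_val, 4))  # roughly 0.057: not below the usual 0.05 threshold
```

So 60 heads out of 100 is surprising, but not quite surprising enough to reject fairness at the conventional 0.05 level.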

Researchers typically compare the p-value to a preset threshold, most commonly 0.05 (5%). If the p-value falls below that line, the result is declared statistically significant. If it’s above, you don’t have enough evidence to reject the null hypothesis. Ronald Fisher, the statistician who popularized this approach, suggested 0.05 as a reasonable convention, not an iron law. He noted that p-values between 0.1 and 0.9 give no reason to doubt the null hypothesis, while values below 0.02 strongly suggest it doesn’t account for the data.

One critical point: there is no sharp boundary between “significant” and “not significant.” A p-value of 0.049 is not meaningfully different from 0.051. The evidence against the null hypothesis gets gradually stronger as the p-value shrinks. A result with p = 0.02 is stronger evidence than one with p = 0.04, but neither is a certainty.

How a Significance Test Works, Step by Step

The general procedure follows a consistent pattern regardless of which specific test you use:

  • State your hypotheses. Define the null hypothesis (no effect) and the alternative hypothesis (there is an effect). Decide whether you’re testing for a difference in one direction or in either direction.
  • Choose your significance level. Set your threshold (alpha) before looking at the data. This is usually 0.05, meaning you’ll accept a 5% risk of a false alarm.
  • Collect data and calculate a test statistic. The test statistic is a single number that summarizes how far your observed data falls from what the null hypothesis predicts. It’s computed from your sample’s values, the hypothesized value, and the sample size.
  • Find the p-value. Using the test statistic, determine the probability of seeing results this extreme under the null hypothesis. Standard statistical tables or software convert the test statistic into a p-value.
  • Make a decision. If the p-value is below your threshold, reject the null hypothesis. If it’s above, you lack sufficient evidence to reject it.

For a simple example: in a two-sided test, a test statistic (often called z) greater than 1.96 in absolute value corresponds to a p-value below 0.05, and one greater than 2.58 in absolute value corresponds to a p-value below 0.01.
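The five steps above can be sketched in code. This is a minimal one-sample z-test using only the standard library; the sample values and the hypothesized mean of 100 are invented for illustration, and using the sample standard deviation makes the test only approximate for small samples (a t-test would be preferred there):

```python
from math import erfc, sqrt
from statistics import mean, stdev

def one_sample_z_test(sample, null_mean, alpha=0.05):
    """Two-sided one-sample z-test. Returns the test statistic, the
    p-value, and whether the result is significant at the given alpha."""
    n = len(sample)
    # Step 3: test statistic = distance from the null, in standard errors.
    z = (mean(sample) - null_mean) / (stdev(sample) / sqrt(n))
    # Step 4: two-sided p-value from the standard normal tail, P(|Z| >= |z|).
    p = erfc(abs(z) / sqrt(2))
    # Step 5: compare to the preset threshold.
    return z, p, p < alpha

sample = [104, 98, 107, 103, 101, 99, 106, 102, 105, 100]  # invented data
z, p, reject = one_sample_z_test(sample, null_mean=100)
print(round(z, 2), reject)  # z is about 2.61, so the test rejects at 0.05
```

The `erfc` identity used here is exactly the z-to-p conversion that statistical tables perform: for z = 1.96 it returns a p-value of about 0.05, and for z = 2.58 about 0.01.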

Common Types of Significance Tests

Which test you use depends on the type of data you have and how many groups you’re comparing. The choice matters because each test is built on different mathematical assumptions.

When comparing the averages of two groups with normally distributed numerical data, the unpaired t-test is standard. If you’re comparing more than two groups, you use ANOVA (analysis of variance) instead. A common mistake is running multiple t-tests when comparing three or more groups, say Group A vs. B, B vs. C, and C vs. A. This inflates your chance of a false positive. ANOVA handles all groups simultaneously, and if it finds a significant difference, follow-up tests identify which specific groups differ.

For categorical data (counts of people in different categories, like how many patients improved vs. didn’t), the chi-square test is the go-to. When your numerical data is skewed or ranked rather than normally distributed, nonparametric alternatives like the Mann-Whitney U test (for two groups) or the Kruskal-Wallis test (for more than two) replace the t-test and ANOVA respectively.

Type I and Type II Errors

Significance tests can go wrong in two ways. A Type I error, or false positive, happens when you reject the null hypothesis even though it’s actually true. You conclude there’s an effect when there isn’t one. Think of it like convicting an innocent person. When the null hypothesis is true, the probability of making this mistake is exactly your alpha level. Setting alpha at 0.05 means accepting a 5% chance of a false positive whenever there is truly no effect.

A Type II error, or false negative, is the opposite: you fail to detect a real effect. The null hypothesis is actually false, but your test doesn’t catch it. This is like letting a guilty person go free. The probability of a Type II error is called beta, and it’s influenced by your sample size, the actual size of the effect, and your chosen alpha level. There’s an inherent tradeoff: making your alpha stricter (say, 0.005 instead of 0.05) reduces false positives but increases the risk of missing real effects.
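A quick simulation makes the Type I error rate concrete. The setup below is a hypothetical one: both groups are drawn from the same normal distribution, so the null hypothesis is true by construction, and the seed and sample sizes are arbitrary. Testing at alpha = 0.05, roughly 5% of tests should reject anyway:

```python
import random
from math import erfc, sqrt
from statistics import mean, stdev

def two_sample_p(a, b):
    """Approximate two-sided two-sample z-test p-value."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return erfc(abs(mean(a) - mean(b)) / se / sqrt(2))

random.seed(1)
trials = 2000
false_positives = 0
for _ in range(trials):
    # Both groups come from the same distribution, so any "effect" is noise.
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if two_sample_p(a, b) < 0.05:
        false_positives += 1

print(false_positives / trials)  # close to 0.05, as the alpha level predicts
```

Estimating beta the same way requires choosing a true effect size, which is exactly why power depends on how large the real effect is.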

Why the 0.05 Threshold Varies by Field

While 0.05 is the default in most of medicine, psychology, and economics, it isn’t universal. Some researchers have argued for stricter thresholds like 0.005 or 0.001 to reduce the growing problem of false positives in published research. Others push back, noting that stricter cutoffs mean more real effects go undetected.

The optimal threshold actually depends on the research context. In fields where true effects are rare (the base rate of real effects is below about 10%), a stricter alpha like 0.005 tends to improve overall accuracy. When true effects are common (base rates above 40%), using 0.005 can actually hurt because you’ll miss too many real findings. The relative cost of each type of mistake also matters: in drug safety, a false positive might mean pulling a useful treatment from the market, while a false negative could leave a dangerous side effect undetected.
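The interaction between base rate and alpha can be worked out directly with Bayes-style bookkeeping. The sketch below holds statistical power fixed at 0.8 (a conventional assumption, not a figure from the text, and in reality a stricter alpha also lowers power) and asks what fraction of "significant" findings reflect real effects:

```python
def true_positive_share(base_rate, alpha=0.05, power=0.8):
    """Fraction of statistically significant results that reflect real
    effects, given the prevalence of real effects and the test's power."""
    true_pos = base_rate * power          # real effects correctly detected
    false_pos = (1 - base_rate) * alpha   # null effects wrongly flagged
    return true_pos / (true_pos + false_pos)

# When real effects are rare, a stricter alpha sharply improves reliability:
print(round(true_positive_share(0.10, alpha=0.05), 2))   # 0.64
print(round(true_positive_share(0.10, alpha=0.005), 2))  # 0.95
```

At a 10% base rate and alpha = 0.05, more than a third of significant findings would be false positives, which is the arithmetic behind the argument for stricter thresholds in low-base-rate fields.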

Statistical Significance Is Not the Whole Story

A p-value tells you how surprising your data would be if there were no effect, but it says nothing about how large or meaningful the effect is. This distinction between statistical significance and practical significance is one of the most important things to understand about these tests. With a large enough sample, even a trivially small effect will produce a tiny p-value. A study of 100,000 people might find that a supplement raises test scores by 0.1 points with p = 0.001. Statistically significant, yes, but practically meaningless.

This is why researchers increasingly report effect sizes alongside p-values. Effect size measures the magnitude of the difference between groups, independent of sample size. As one widely cited guideline puts it: “Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude, not just does a treatment affect people, but how much does it affect them.”
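One widely used effect-size measure is Cohen's d: the difference between two group means divided by their pooled standard deviation. A minimal sketch, with made-up sample values:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d: standardized difference between two group means,
    independent of sample size."""
    na, nb = len(a), len(b)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                     / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd

d = cohens_d([6, 7, 8, 7, 6, 8], [5, 6, 5, 6, 5, 6])  # invented scores
print(round(d, 2))  # about 2.02, a very large effect by common benchmarks
```

Unlike a p-value, d does not shrink toward "more significant" as the sample grows; it answers the "how much?" question directly.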

Confidence Intervals as a Companion Tool

A confidence interval gives you a range of plausible values for the true effect, which is often more informative than a yes-or-no significance verdict. A 95% confidence interval means that if you repeated the study many times using the same method, 95% of the intervals you calculated would contain the true effect size.

Confidence intervals and significance tests are mathematically linked. If a 95% confidence interval for a difference between two groups doesn’t include zero, that corresponds to a p-value below 0.05. The interval also shows you something the p-value can’t: the range of effect sizes that are compatible with your data. A confidence interval of 2.1 to 2.3 tells a very different story than one of 0.1 to 25.0, even if both are statistically significant.
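That link can be seen in a short calculation. The sketch below builds an approximate 95% interval for a difference between two group means using the normal critical value 1.96 (a z-interval; for samples this small a t critical value would be more appropriate, and the measurements are invented):

```python
from math import sqrt
from statistics import mean, stdev

def diff_ci_95(a, b):
    """Approximate 95% z-confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff - 1.96 * se, diff + 1.96 * se

low, high = diff_ci_95([12.1, 11.8, 12.5, 12.0, 11.9, 12.4],
                       [11.2, 11.5, 11.0, 11.4, 11.1, 11.3])
print(round(low, 2), round(high, 2))
# The interval excludes zero, so the two-sided test rejects at the 0.05 level,
# and its width shows the range of effect sizes compatible with the data.
```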

Common Misinterpretations to Avoid

In 2016, the American Statistical Association took the unusual step of issuing a formal statement on p-values, the first position paper of its kind in the organization’s 177-year history. Several of its key principles address widespread misunderstandings.

A p-value does not measure the probability that the null hypothesis is true. A p-value of 0.03 does not mean there’s a 3% chance the treatment has no effect. It means there’s a 3% chance of seeing data this extreme if the treatment had no effect. That distinction sounds subtle but changes the interpretation completely. Similarly, a large p-value doesn’t prove the null hypothesis is true. It simply means your data isn’t unusual enough to rule it out.

The ASA also warned against basing scientific conclusions solely on whether a p-value crosses a specific threshold, and against running many tests on the same data and reporting only the significant ones. That practice, sometimes called p-hacking, makes results uninterpretable because running enough tests on random data will inevitably produce some low p-values by chance alone. A low p-value should never be the sole basis for a scientific claim. Full reporting of all analyses, significant or not, is essential for honest interpretation.