How to Find Statistical Significance: Step by Step

Finding statistical significance means determining whether the results of a study or experiment reflect a real effect or could have happened by chance alone. The standard method involves comparing a calculated p-value to a pre-set threshold, most commonly 0.05. If your p-value falls below that threshold, the result is considered statistically significant. But getting to that point requires a structured process, and interpreting the result correctly requires understanding what the numbers actually tell you.

The Core Concept: P-Values and Alpha Levels

A p-value measures how incompatible your data are with the assumption that there’s no real effect. Specifically, it tells you the probability of seeing results at least as extreme as yours if nothing were actually going on. A p-value of 0.03, for example, means there’s a 3% chance of seeing data at least this extreme if the effect weren’t real.

Before you run any analysis, you set an alpha level, which is the cutoff you’ll use to judge significance. The most widely used alpha is 0.05, popularized by the statistician Ronald Fisher. This means you’re accepting a 5% chance of a false positive (declaring something significant when no real effect exists) whenever the null hypothesis is true. In fields where false positives carry serious consequences, researchers sometimes use a stricter threshold like 0.01 (a 1% chance of error). The key rule: if your p-value is less than your alpha, the result is statistically significant.
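A short simulation makes the 5% false-positive rate concrete. The sketch below (assuming Python with NumPy and SciPy installed; the data are randomly generated, not from any real study) repeatedly runs a t-test on two groups drawn from the same distribution, so every "significant" result is a false positive:

```python
import numpy as np
from scipy import stats

# Simulate 2,000 experiments in which the null hypothesis is TRUE:
# both groups come from the same normal distribution.
rng = np.random.default_rng(0)
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:               # "significant" even though no effect exists
        false_positives += 1

rate = false_positives / n_experiments   # lands close to alpha = 0.05
```

With alpha set at 0.05, roughly one experiment in twenty crosses the threshold by chance alone, which is exactly what the alpha level promises.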

The Step-by-Step Process

Hypothesis testing follows a consistent sequence, regardless of the specific test you’re running.

State your null hypothesis. This is the default assumption that there’s no difference or no effect. If you’re testing whether a new teaching method improves exam scores, the null hypothesis says both methods produce the same average score.

State your alternative hypothesis. This is what you suspect is true. It doesn’t have to claim that every group differs from every other group. It simply states that at least one difference exists. Before collecting data, you also need to decide whether your alternative hypothesis is directional. If you’re only interested in whether the new method is better (not just different), you’d use a one-tailed test. If the effect could go in either direction, use a two-tailed test. This decision must be made before you look at any data.

Set your alpha level. Again, typically 0.05. Do this before collecting data, not after. Choosing your threshold after seeing results introduces bias.

Collect data and calculate a test statistic. The specific statistic depends on your data type and study design (more on this below). The test statistic captures how far your observed results fall from what the null hypothesis predicts.

Compare the p-value to your alpha. Your statistical software or calculation will produce a p-value based on the test statistic. If p is less than alpha, you reject the null hypothesis and conclude the result is statistically significant. If p is greater than alpha, you fail to reject it.
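The whole sequence can be sketched in a few lines of Python. The exam-score numbers below are invented for illustration, and SciPy's `ttest_ind` is assumed as the test (a two-tailed unpaired t-test by default):

```python
from scipy import stats

# Hypothetical exam scores under the old and new teaching methods.
# Null hypothesis: both methods produce the same average score.
old_method = [72, 75, 68, 71, 74, 70, 69, 73]
new_method = [78, 74, 80, 75, 77, 79, 76, 81]

alpha = 0.05                    # set BEFORE collecting the data
t_stat, p_value = stats.ttest_ind(old_method, new_method)

if p_value < alpha:
    conclusion = "reject the null hypothesis: statistically significant"
else:
    conclusion = "fail to reject the null hypothesis"
```

Note that the code never "accepts" the null hypothesis; a large p-value only means the data weren't surprising enough to reject it.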

Choosing the Right Statistical Test

The test you use depends on two things: what kind of data you have and how many groups you’re comparing. Picking the wrong test can produce misleading p-values.

For comparing two groups on a numerical outcome that follows a normal (bell-shaped) distribution, use an unpaired t-test. The t-test produces a test statistic calculated as the difference between the two group averages divided by a measure of variability adjusted for sample size. That statistic is then converted to a p-value based on the degrees of freedom, which for an unpaired test is the total sample size minus two (one subtracted per group).
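The arithmetic behind that test statistic can be written out directly. This sketch uses only the standard library and invented measurements:

```python
import statistics

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]   # made-up measurements
group_b = [5.9, 6.1, 5.7, 6.0, 5.8]
na, nb = len(group_a), len(group_b)

# Pooled variance: a sample-size-weighted average of the two group variances.
pooled_var = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)

# Standard error of the difference between the two means.
se = (pooled_var * (1 / na + 1 / nb)) ** 0.5

# t = difference between averages / variability adjusted for sample size
t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
df = na + nb - 2                      # degrees of freedom: total n minus two
```

The resulting t is then looked up against a t-distribution with `df` degrees of freedom to get the p-value, which is what statistical software does behind the scenes.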

If you’re comparing more than two groups on a normally distributed numerical outcome, use an analysis of variance (ANOVA) instead. ANOVA produces an F-statistic rather than a t-statistic, but the logic is the same: it tells you whether the differences between groups are larger than you’d expect from random variation alone.
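For three or more groups, SciPy's `f_oneway` runs a one-way ANOVA directly (the scores below are invented):

```python
from scipy import stats

# Hypothetical exam scores from three teaching methods.
method_a = [70, 72, 68, 71, 69]
method_b = [75, 77, 74, 76, 78]
method_c = [71, 70, 73, 72, 69]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
# A significant F only says that SOME difference exists among the groups;
# a post-hoc test (e.g. Tukey's HSD) is needed to find which pairs differ.
```

This matches the alternative hypothesis described earlier: at least one group differs, not necessarily all of them.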

When your data aren’t normally distributed or are ranked (ordinal), you need nonparametric alternatives. For two groups, that’s the Mann-Whitney U test. For three or more groups, it’s the Kruskal-Wallis test.
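Both nonparametric tests are available in SciPy. A minimal sketch with invented ordinal ratings:

```python
from scipy import stats

# Ordinal satisfaction ratings on a 1-5 scale, invented for illustration.
group_a = [2, 3, 2, 4, 3, 2, 3]
group_b = [4, 5, 4, 3, 5, 4, 4]

# Two groups: Mann-Whitney U test.
u_stat, p_two_groups = stats.mannwhitneyu(group_a, group_b,
                                          alternative="two-sided")

# Three or more groups: Kruskal-Wallis test.
group_c = [3, 3, 4, 2, 3, 4, 3]
h_stat, p_three_groups = stats.kruskal(group_a, group_b, group_c)
```

Both tests work on the ranks of the values rather than the values themselves, which is why they tolerate skewed distributions.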

For categorical data, where you’re counting how many people fall into different categories, the chi-square test is standard. And if you’re looking at the relationship between two continuous variables rather than comparing groups, you’d calculate a correlation coefficient: Pearson’s if both variables are normally distributed, Spearman’s if one or both are skewed.
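Both cases have one-line SciPy equivalents. The counts and measurements below are invented:

```python
from scipy import stats

# Categorical data: counts of patients improving under two treatments.
#                  improved  not improved
contingency = [[30, 10],    # treatment group
               [18, 22]]    # control group

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

# Two continuous variables: a correlation coefficient instead.
heights = [160, 165, 170, 175, 180]
weights = [55, 60, 66, 70, 78]
r, p_corr = stats.pearsonr(heights, weights)   # use spearmanr if skewed
```

`chi2_contingency` also returns the expected counts under independence, which is useful for checking that no cell's expected count is too small for the test to be reliable.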

Using Confidence Intervals

Confidence intervals offer another way to assess significance, and many researchers find them more informative than p-values alone. A 95% confidence interval gives you a range of plausible values for the true effect. If that interval doesn’t include zero (for a difference between groups) or doesn’t include one (for a ratio like an odds ratio), the result is statistically significant at the 0.05 level.

When comparing two groups, you might also look at whether their confidence intervals overlap. Non-overlapping 95% confidence intervals guarantee a p-value below 0.05. But be cautious with the reverse: overlapping intervals don’t necessarily mean the difference is non-significant. This is a common misinterpretation. Intervals can overlap slightly and the difference can still be statistically significant.
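A 95% confidence interval for the difference between two means can be built from the same pooled standard error used by the t-test. A sketch with invented data, using SciPy only for the t critical value:

```python
import statistics
from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]   # invented measurements
group_b = [5.9, 6.1, 5.7, 6.0, 5.8]
na, nb = len(group_a), len(group_b)
diff = statistics.mean(group_a) - statistics.mean(group_b)

pooled_var = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
se = (pooled_var * (1 / na + 1 / nb)) ** 0.5

t_crit = stats.t.ppf(0.975, df=na + nb - 2)   # two-tailed, alpha = 0.05
ci = (diff - t_crit * se, diff + t_crit * se)

# If the interval excludes zero, the difference is significant at 0.05.
significant = not (ci[0] <= 0 <= ci[1])
```

Unlike a bare p-value, the interval also shows how large the difference plausibly is, which feeds directly into the practical-importance question below.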

Why Sample Size Matters

Sample size has a direct and powerful effect on whether you’ll find statistical significance. A real effect can easily go undetected in a small study simply because there isn’t enough data to distinguish the signal from random noise. This is called a Type II error, or a false negative.

The probability of correctly detecting a real effect is called statistical power. As sample size increases, power increases. The relationship is dramatic for small effects: at 80% power and a two-tailed alpha of 0.05, detecting a small effect (standardized difference of 0.2) requires roughly 788 participants, while detecting an enormous effect (standardized difference of 2.5) needs only about 8. Running an underpowered study wastes time and resources because it’s unlikely to find significance even when a real effect exists.
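Those sample sizes can be reproduced with a standard normal-approximation formula for a two-sample comparison (a sketch, assuming 80% power and a two-tailed alpha of 0.05; the exact t-based calculation gives slightly larger numbers):

```python
from math import ceil
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison,
    using the normal approximation: n = 2 * ((z_alpha + z_beta) / d)^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)            # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

small = n_per_group(0.2)    # ~393 per group, ~786 total
large = n_per_group(2.5)    # only a handful per group
```

Doubling the effect size cuts the required sample size by a factor of four, which is why small effects are so expensive to detect.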

This also means that very large studies can produce statistically significant p-values for differences that are trivially small and practically meaningless, which brings up a critical distinction.

Statistical Significance vs. Practical Importance

A p-value tells you how surprising your data would be if there were no effect. It tells you nothing about how large or meaningful that effect is. A study with thousands of participants might find that a new intervention improves test scores by half a point on a 100-point scale, with p = 0.001. That’s statistically significant but probably not worth changing anything over.

This is why researchers also report effect size. The most common measure for comparing two groups is Cohen’s d, which expresses the difference between group averages in standard deviation units. A d of 0.2 is considered small, 0.5 is medium, and 0.8 or higher is large. Unlike p-values, effect size is independent of sample size, so it gives you a cleaner picture of whether the finding actually matters.
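Cohen's d follows directly from that definition: the difference in group means divided by the pooled standard deviation. A sketch with invented scores:

```python
import statistics

def cohens_d(a, b):
    """Difference between group means in pooled standard deviation units."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Invented scores: a 2-point mean difference against a pooled SD of ~1.4.
a = [10, 12, 11, 13, 9, 11]
b = [12, 14, 13, 15, 11, 13]
d = cohens_d(a, b)    # about -1.41: well past the 0.8 "large" threshold
```

The sign only reflects the order of the arguments; it is the magnitude of d that is compared against the small/medium/large benchmarks.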

The American Statistical Association issued a formal statement in 2016 emphasizing that scientific conclusions should not be based solely on whether a p-value crosses a specific threshold. Their key points: p-values don’t measure the probability that your hypothesis is true, they don’t measure the size of an effect, and by themselves they don’t provide strong evidence for or against a hypothesis. A complete picture requires the p-value, the effect size, and the confidence interval together.

Common Mistakes to Avoid

The most frequent error is interpreting a p-value of, say, 0.03 as meaning there’s a 3% chance the null hypothesis is true. That’s not what it means. It means there’s a 3% chance of seeing data this extreme if the null hypothesis were true. The distinction is subtle but important.

Another common mistake is choosing your alpha level or switching between one-tailed and two-tailed tests after seeing your data. Both practices inflate your chance of a false positive. The alpha level, the directionality of your hypothesis, and the statistical test should all be decided before you collect any data.

Running multiple statistical tests on the same dataset without adjusting your alpha is another pitfall. If you test 20 different comparisons at the 0.05 level, you’d expect about one to come up significant by pure chance. Correction methods exist for this, and they work by making the threshold stricter as the number of comparisons increases.
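The arithmetic behind this pitfall, and the simplest fix (the Bonferroni correction), fits in a few lines:

```python
# Family-wise error: probability of at least one false positive across
# 20 independent tests, each run at alpha = 0.05.
n_tests = 20
alpha = 0.05
fwer = 1 - (1 - alpha) ** n_tests    # about 0.64, not 0.05

# Bonferroni correction: divide alpha by the number of comparisons,
# so each individual test uses a stricter threshold.
bonferroni_alpha = alpha / n_tests   # 0.0025
```

Bonferroni is deliberately conservative; less strict alternatives exist (such as false-discovery-rate procedures), but all of them tighten the per-test threshold as the number of comparisons grows.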

Finally, don’t confuse “not significant” with “no effect.” A non-significant result might simply mean your study was too small to detect the effect. Check the confidence interval: if it’s wide and includes both meaningful and trivial values, the study was inconclusive rather than negative.