How to Test Statistical Significance (Step by Step)

Testing statistical significance means determining whether a result you observed in data is likely real or just due to random chance. The core process follows a consistent framework regardless of your field: set up a hypothesis, choose a threshold, calculate a test statistic from your data, and compare the resulting p-value to that threshold. A p-value of 0.05 or lower is the most common cutoff: it means that if there were truly no effect, data at least this extreme would show up only 5% of the time or less.

The Step-by-Step Process

Every significance test follows the same logical sequence, whether you’re comparing two averages, looking at survey responses, or testing a new drug.

State your null hypothesis. This is the default assumption that nothing interesting is happening. If you’re testing whether a new fertilizer improves plant growth, the null hypothesis is that it doesn’t: the average growth is the same with or without the fertilizer. You write this as “there is no difference between groups” or “the effect equals zero.”

State your alternative hypothesis. This is what you suspect is actually true. It could be directional (“the fertilizer increases growth”) or non-directional (“the fertilizer changes growth in either direction”). Your choice here affects whether you run a one-tailed or two-tailed test.
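To see how this choice plays out in practice, here is a short Python sketch using SciPy (the growth numbers are invented, and the `alternative` keyword requires SciPy 1.6 or later):

```python
from scipy import stats

# Hypothetical fertilizer data: growth (cm) for treated plants,
# compared against a known untreated mean of 10 cm.
treated = [11.2, 10.8, 11.5, 10.9, 11.1, 11.4, 10.7, 11.3]

# Two-tailed: "the fertilizer changes growth in either direction"
t_stat, p_two = stats.ttest_1samp(treated, popmean=10.0)

# One-tailed: "the fertilizer increases growth"
_, p_one = stats.ttest_1samp(treated, popmean=10.0, alternative="greater")

print(f"t = {t_stat:.2f}, two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

When the test statistic points in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why a one-tailed test is easier to pass and why the direction must be chosen before looking at the data.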

Set your significance level (alpha). This is the threshold you’ll use to judge your results, and you must choose it before looking at your data. The standard alpha is 0.05, which means you’re willing to accept a 5% chance of incorrectly declaring a result significant. Some fields use stricter thresholds: drug approval sometimes requires p-values near 0.001, and particle physics famously uses a “five-sigma” standard equivalent to roughly 0.0000003.

Collect your data and calculate the test statistic. The test statistic measures how far your observed data falls from what the null hypothesis predicts, scaled by how much variability is in your sample. For comparing a single group’s average to a known value, the formula is straightforward: take the difference between your sample mean and the hypothesized mean, then divide by the standard error (the sample standard deviation divided by the square root of your sample size). Every test statistic follows this same basic logic of “observed difference divided by standard error.”
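As a quick illustration, here is that calculation done by hand with made-up numbers: a sample of 16 plants with mean growth 11.1 cm and standard deviation 1.2 cm, tested against a hypothesized mean of 10 cm.

```python
import math

# Invented summary statistics for a one-sample test.
sample_mean, hyp_mean = 11.1, 10.0
sample_sd, n = 1.2, 16

# t = (observed difference) / (standard error)
standard_error = sample_sd / math.sqrt(n)   # 1.2 / 4 = 0.3
t_stat = (sample_mean - hyp_mean) / standard_error
print(f"t = {t_stat:.2f}")  # (11.1 - 10.0) / 0.3 = 3.67
```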

Get your p-value and make a decision. Software will convert your test statistic into a p-value, which represents the probability of seeing a result this extreme if the null hypothesis were true. If the p-value falls below your alpha, you reject the null hypothesis. If it doesn’t, you fail to reject it. Note the careful language: you never “prove” the null hypothesis is true. You simply don’t have enough evidence to rule it out.
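With invented numbers, here is how a t statistic becomes a two-tailed p-value and a decision in SciPy (assuming a one-sample test that produced t = 3.67 from a sample of 16):

```python
from scipy import stats

t_stat, n, alpha = 3.67, 16, 0.05

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(f"p = {p_value:.4f} -> {decision}")
```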

Choosing the Right Test

Three things determine which statistical test to use: the number of variables involved, what type of data you have, and whether your observations are paired or independent.

Data type is the biggest factor. Continuous data (things you measure on a scale, like weight, temperature, or test scores) calls for different tests than categorical data (things you count or classify, like yes/no outcomes or color preferences). Within continuous data, you also need to consider whether your data follows a roughly bell-shaped distribution, which determines whether you can use parametric tests or need nonparametric alternatives.

  • Comparing two group averages (unpaired): Use an independent samples t-test. Example: test scores from class A vs. class B.
  • Comparing two measurements on the same subjects: Use a paired t-test. Example: blood pressure before and after a medication in the same patients.
  • Comparing three or more group averages: Use ANOVA (analysis of variance). Example: crop yields across four different fertilizer types.
  • Testing a relationship between two continuous variables: Use Pearson correlation. Example: the relationship between hours studied and exam score.
  • Comparing categories or proportions: Use a chi-square test or Fisher’s exact test. Example: whether men and women choose different product colors at different rates.

Fisher’s exact test works best with small sample sizes and binary outcomes in a simple two-by-two table. The chi-square test handles larger samples and can compare more than two groups or more than two categories at once.
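Both tests are one call each in SciPy. A minimal sketch, using an invented two-by-two table of product color choices:

```python
from scipy import stats

# Hypothetical counts: color choice (blue vs. red) by group.
#            blue  red
table = [[18,  7],   # group A
         [11, 14]]   # group B

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

print(f"chi-square p = {p_chi2:.3f}, Fisher exact p = {p_fisher:.3f}")
```

For a two-by-two table the two tests usually agree closely; Fisher’s exact test is the safer choice when any expected cell count is small.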

Checking Your Assumptions First

Every statistical test comes with assumptions about your data, and violating them can make your results unreliable. Parametric tests like the t-test and ANOVA assume your data is roughly normally distributed, that observations are independent of each other, and that the groups you’re comparing have similar variability (spread). If you know your subject matter well, you can often justify the normality assumption based on how the data is generated. Many biological measurements, for instance, naturally follow a bell curve.

When your data clearly violates these assumptions, nonparametric tests are the alternative. These don’t require normal distributions, but they carry their own constraints. Many nonparametric tests that compare groups assume the data in all groups has the same spread, even if the shape of the distribution differs. If your data is continuous but skewed, you can also sometimes transform it (using a log transformation, for example) to meet the assumptions of a parametric test, which tends to be more statistically powerful.
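Here is a sketch of both options in Python with SciPy, on invented skewed (log-normal) data: check the assumptions first, then either fall back to the nonparametric Mann-Whitney U test or log-transform and run the ordinary t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed measurements in two groups.
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=30)
group_b = rng.lognormal(mean=0.3, sigma=0.5, size=30)

# Check normality (Shapiro-Wilk) and equal spread (Levene).
_, p_shapiro = stats.shapiro(group_a)
_, p_levene = stats.levene(group_a, group_b)
print(f"Shapiro p = {p_shapiro:.4f}, Levene p = {p_levene:.4f}")

# Option 1: nonparametric test that doesn't assume normality.
_, p_mw = stats.mannwhitneyu(group_a, group_b)

# Option 2: log-transform, then use the more powerful parametric t-test.
_, p_t_log = stats.ttest_ind(np.log(group_a), np.log(group_b))
print(f"Mann-Whitney p = {p_mw:.4f}, t-test on logs p = {p_t_log:.4f}")
```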

Understanding Type I and Type II Errors

Two kinds of mistakes can happen when you test significance. A Type I error (false positive) means you rejected the null hypothesis when it was actually true: you declared a result significant when there was no real effect. Your alpha level directly controls this risk: at alpha = 0.05, you accept a 5% chance of a Type I error whenever the null is true. A Type II error (false negative) means you missed a real effect and concluded there was no difference when one actually existed.
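You can verify that alpha really does control the Type I error rate with a short simulation: draw both groups from the same distribution (so the null is true by construction) and count how often the t-test falsely reports significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 2000

# Simulate many experiments where the null is TRUE and count false positives.
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, size=30)
    b = rng.normal(0, 1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"Type I error rate: {false_positives / n_sims:.3f}")  # close to 0.05
```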

Statistical power is your ability to avoid Type II errors. It represents the probability of correctly detecting a real effect when one exists. Power depends on three things: your sample size, the size of the effect you’re trying to detect, and your alpha level. Small samples have low power, meaning they can easily miss real but modest effects. This is why researchers calculate the required sample size before collecting data, a step called a power analysis. Most studies aim for at least 80% power, meaning a 20% chance of missing a true effect.
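A quick back-of-the-envelope power analysis can be done with the common normal approximation (dedicated tools such as statsmodels’ `TTestIndPower` give exact answers). The sketch below assumes a two-sample t-test aiming for 80% power to detect a medium effect:

```python
import math
from scipy import stats

# Normal approximation: n per group ≈ 2 * ((z_alpha/2 + z_power) / d)^2
alpha, power, d = 0.05, 0.80, 0.5   # d = 0.5 is a "medium" effect size

z_alpha = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96
z_power = stats.norm.ppf(power)           # ≈ 0.84
n_per_group = math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(f"About {n_per_group} subjects per group")  # 63
```

Halving the effect size you want to detect roughly quadruples the required sample size, which is why small effects demand large studies.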

Why P-Values Aren’t the Whole Story

A p-value tells you how surprising your data would be if there were no real effect. It does not tell you how large or meaningful that effect is. With a big enough sample, even a trivially small difference becomes statistically significant. A study might find that a new teaching method raises test scores with p = 0.01, but if the actual improvement is only 0.3 points on a 100-point scale, that’s not useful to anyone.
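A simulation makes this concrete. With invented data, a true difference of just 0.3 points on a 100-point scale becomes overwhelmingly significant once the groups are large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two "teaching methods" whose true difference is a trivial 0.3 points,
# measured on 200,000 students per group.
scores_a = rng.normal(70.0, 10.0, size=200_000)
scores_b = rng.normal(70.3, 10.0, size=200_000)

t_stat, p = stats.ttest_ind(scores_a, scores_b)
diff = scores_b.mean() - scores_a.mean()
# Statistically significant, practically meaningless.
print(f"p = {p:.2g}, observed difference = {diff:.2f} points")
```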

This is where effect size comes in. Effect size measures the magnitude of a difference, independent of sample size. One common measure, Cohen’s d, expresses the difference between two groups in terms of standard deviations. A d of 0.2 is considered small, 0.5 is medium, and 0.8 or above is large. Another measure, r-squared, tells you what proportion of the variation in your outcome is explained by the variable you’re studying. A statistically significant result with an r-squared of 0.001 means the variable explains only 0.1% of the variation, which is negligible in practical terms.
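Cohen’s d is simple to compute by hand. A minimal sketch with invented exam scores, using the pooled-standard-deviation formula:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical exam scores: method A vs. method B, 40 students each.
d = cohens_d(78.0, 10.0, 40, 73.0, 10.0, 40)
print(f"d = {d:.2f}")  # 0.50 -> a "medium" effect
```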

Reporting both the p-value and an effect size gives the full picture. The p-value answers “is this real?” and the effect size answers “does it matter?”

Common Pitfalls to Avoid

The American Statistical Association issued a formal statement warning against the mechanical use of p-value thresholds as the sole basis for scientific conclusions. Several specific practices undermine the validity of significance testing.

P-hacking is the most widespread problem. This happens when you run many different analyses on the same data and only report the ones that produce significant p-values. If you test 20 different variables at alpha = 0.05, you’d expect one to appear significant by pure chance. Selectively reporting that one result while hiding the other 19 makes the finding essentially uninterpretable. The same issue arises more subtly when you decide what to analyze based on patterns you already noticed in the data, rather than specifying your analysis plan in advance.
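A simulation shows how easily this happens. Below, 20 variables of pure noise are tested against a pure-noise outcome, so any “significant” hit is a false positive by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# No real effects exist anywhere in this data.
outcome = rng.normal(size=100)
significant = 0
for _ in range(20):
    predictor = rng.normal(size=100)
    r, p = stats.pearsonr(predictor, outcome)
    if p < 0.05:
        significant += 1

print(f"{significant} of 20 noise variables came out 'significant'")
```

On average about one of the twenty will clear the 0.05 bar. Reporting only that one, while hiding the other nineteen tests, is exactly what p-hacking looks like.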

Confusing statistical significance with practical significance is equally problematic. A large clinical trial might detect a blood pressure reduction of 0.5 mmHg with high statistical confidence, but no doctor would change a treatment plan based on that size of difference. Always ask whether the magnitude of the effect is large enough to matter in the real-world context you care about.

Treating a non-significant result as proof of no effect is another mistake. Failing to reject the null hypothesis means you didn’t find sufficient evidence for a difference. It does not mean the difference is zero. The distinction matters, especially with small samples where power is low. If your study only had a 40% chance of detecting a real effect, a non-significant result doesn’t say much.