Finding statistical significance means determining whether a result from your data is likely real or just due to random chance. The standard method involves calculating a p-value and comparing it to a threshold (usually 0.05). If your p-value falls below that threshold, your result is considered statistically significant. The process has a few core steps, and understanding each one will help you interpret your results correctly.
Start With a Hypothesis
Every significance test begins with two competing statements. The null hypothesis states that nothing is going on: no difference between groups, no relationship between variables. The alternative hypothesis is your actual claim, the thing you’re trying to find evidence for.
For example, if you’re testing whether a new teaching method improves test scores, your null hypothesis would be “there is no difference in scores between the two methods,” and your alternative hypothesis would be “students using the new method score higher.” If you’re looking at whether height and shoe size are related, the null says there’s no relationship, and the alternative says there is. These two hypotheses are mutually exclusive: you can’t accept both, and the entire test is designed to help you decide between them.
Choose the Right Statistical Test
The test you use depends on two things: what type of data you have and how many groups you’re comparing. Picking the wrong test will give you meaningless results.
- T-test: Use this when comparing the averages of two groups. An independent samples t-test works when the groups are separate (men vs. women, treatment vs. placebo). A paired t-test works when the same people are measured twice (before and after an intervention). A one-sample t-test compares a single group’s average to a known value.
- ANOVA: Use this when comparing averages across three or more groups. If you’re testing whether four different diets lead to different weight loss, ANOVA is the right choice. A repeated measures version works when the same people are measured multiple times.
- Chi-square test: Use this when your data is categorical rather than numerical. If you’re asking whether men and women prefer different brands, or whether a coin flip is fair, chi-square is the tool.
The key distinction is whether your outcome is a number you can average (use t-test or ANOVA) or a category you can count (use chi-square). Most statistical software will walk you through the selection, but knowing this logic helps you verify you’re on the right track.
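This selection logic can be sketched with scipy.stats. The numbers below are made up purely for illustration, not real study data:

```python
from scipy import stats

# Two groups of exam scores -> independent-samples t-test (compares two means).
group_a = [78, 82, 85, 74, 90, 88, 76, 81]
group_b = [85, 89, 91, 84, 95, 88, 90, 86]
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# Three or more groups -> one-way ANOVA (compares several means at once).
diet1 = [2.1, 3.0, 1.8, 2.5]
diet2 = [3.5, 4.1, 3.8, 3.2]
diet3 = [1.0, 1.5, 0.8, 1.2]
f_stat, p_anova = stats.f_oneway(diet1, diet2, diet3)

# Counts in categories -> chi-square test of independence.
# Rows: men / women; columns: counts preferring brand X / brand Y.
observed = [[30, 10], [20, 25]]
chi2, p_chi, dof, expected = stats.chi2_contingency(observed)

print(f"t-test p = {p_ttest:.4f}, ANOVA p = {p_anova:.4f}, chi-square p = {p_chi:.4f}")
```

Note how the data shapes differ: the t-test and ANOVA take lists of measurements you can average, while chi-square takes a table of counts.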
Calculate the P-Value
The p-value is the core number in significance testing. It answers a specific question: if the null hypothesis were true (if there really were no effect), how likely would you be to see results at least as extreme as what you observed?
The calculation follows four steps, as outlined by Penn State’s statistics program. First, state your null and alternative hypotheses. Second, use your sample data to calculate a test statistic, which is a standardized number that summarizes how far your observed data falls from what the null hypothesis predicts. Third, use the known distribution of that test statistic to find the p-value, which is the probability of getting a result this extreme or more extreme under the null hypothesis. Fourth, compare your p-value to your chosen significance level.
In practice, you won’t calculate p-values by hand. Software like Excel, R, Python, SPSS, or even free online calculators handle the math. What matters is that you understand what the output means.
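To see what that software output represents, here is a minimal sketch of the four steps for a one-sample t-test, using a hypothetical sample and a null-hypothesis mean of 100:

```python
import math
from scipy import stats

sample = [104, 98, 110, 102, 107, 99, 105, 103]  # hypothetical data
mu0 = 100  # Step 1: the null hypothesis says the true mean is 100

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Step 2: the test statistic standardizes how far the sample mean
# falls from what the null hypothesis predicts.
t = (mean - mu0) / (sd / math.sqrt(n))

# Step 3: the p-value is the two-tailed probability, under the null,
# of a test statistic at least this extreme.
p = 2 * stats.t.sf(abs(t), df=n - 1)

# Step 4: compare the p-value to the chosen significance level.
print(f"t = {t:.3f}, p = {p:.4f}, significant at 0.05: {p <= 0.05}")
```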
Compare the P-Value to Your Threshold
The significance level, called alpha, is the cutoff you set before running your test. In most fields, including psychology, medicine, and economics, the standard alpha is 0.05. This means you’re willing to accept a 5% chance of a false positive.
The decision rule is straightforward. If your p-value is less than or equal to alpha, you reject the null hypothesis. Your result is statistically significant. If your p-value is greater than alpha, you do not reject the null hypothesis. You haven’t proven the null is true; you simply don’t have enough evidence against it.
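The rule is simple enough to express as a tiny helper (a sketch, using the conventional 0.05 default):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    # "Fail to reject" is deliberately not "accept": a large p-value
    # means insufficient evidence, not proof that the null is true.
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))  # reject H0
print(decide(0.20))  # fail to reject H0
```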
Some fields use stricter thresholds. A group of 72 researchers published a proposal arguing that the standard should shift to 0.005 to reduce false positive rates. In particle physics, the threshold is roughly 0.0000003 (the so-called “five sigma” standard). The optimal alpha can range from approximately 0.001 to 0.12 depending on the context, sample size, and how common the effect you’re studying is in reality. The 0.05 standard is conventional and, as the American Statistical Association has noted, arbitrary.
Using Confidence Intervals Instead
Confidence intervals offer another way to assess significance, and many statisticians consider them more informative than p-values alone. A 95% confidence interval gives you a range of values that are compatible with your data. If you’re measuring the difference between two groups and the 95% confidence interval for that difference does not include zero, the result is significant at the 0.05 level. If the interval does include zero, it’s not.
The advantage of confidence intervals is that they show you the size of the effect and the precision of your estimate, not just whether something crossed a threshold. A confidence interval of 2.1 to 14.8 tells you much more than “p = 0.03” does. You can see the range of plausible effect sizes and judge for yourself whether the result is meaningful in practical terms. One caution: overlapping confidence intervals between two groups don’t automatically mean the difference is non-significant. Direct comparison tests are still needed in that situation.
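As a sketch of the mechanics, here is a 95% confidence interval for a difference between two group means, computed with the pooled-variance (equal-variance) formula on hypothetical data:

```python
import math
from scipy import stats

group_a = [12.1, 14.8, 13.5, 11.9, 15.2, 13.0, 14.1, 12.6]
group_b = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 10.4, 11.8]

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    return m, v

ma, va = mean_var(group_a)
mb, vb = mean_var(group_b)
na, nb = len(group_a), len(group_b)

# Pooled variance and standard error of the difference in means.
pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
se = math.sqrt(pooled * (1 / na + 1 / nb))
t_crit = stats.t.ppf(0.975, df=na + nb - 2)  # critical t for a 95% interval

diff = ma - mb
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"difference = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# If the interval excludes zero, the difference is significant at the 0.05 level.
print("significant at 0.05:", not (lo <= 0 <= hi))
```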
Why Sample Size Matters
Sample size has a direct relationship with your ability to find significance. Larger samples reduce random error, which increases precision and makes it easier to detect smaller differences between groups. A study with 30 participants per group might miss a real but modest effect, while the same study with 200 per group would likely catch it.
Statistical power is the probability that your test will detect a real effect when one exists. Power depends on three things: your sample size, the size of the actual effect, and your alpha level. The standard target for power is 0.80, meaning an 80% chance of detecting a true effect. For a moderate effect size with an alpha of 0.05, you typically need around 60 participants per group to reach that power level. Smaller effects require even larger samples.
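One way to sketch this calculation is the common normal-approximation formula for a two-sample t-test, n = 2 × ((z_alpha/2 + z_power) / d)² per group, where d is the standardized effect size. Exact t-based methods give slightly larger answers (about 64 per group for a moderate effect):

```python
import math
from scipy import stats

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Normal-approximation sample size per group for a two-sample t-test.
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed critical z
    z_power = stats.norm.ppf(power)          # z for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # moderate effect (Cohen's d = 0.5): about 63 per group
print(n_per_group(0.2))  # small effect: roughly 393 per group
```

Notice how halving the effect size roughly quadruples the required sample; detecting small effects is expensive.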
This is why planning your sample size before collecting data is so important. Running a test on too few observations and finding “no significance” doesn’t mean there’s no effect. It may just mean you didn’t have enough data to see it. That’s a Type II error, or false negative.
Statistical Significance vs. Practical Significance
A statistically significant result isn’t necessarily an important one. With a large enough sample, even trivial differences become statistically significant. If you study 10,000 people and find that one drug lowers blood pressure by 0.5 mmHg more than another, that difference might reach p < 0.05 but have zero clinical relevance. No patient would notice it, and no doctor would change a prescription over it.
Effect size measures how large the difference actually is, independent of sample size. It tells you whether the finding matters in the real world. When evaluating any result, look at both the p-value and the effect size. A large effect that’s statistically significant is a strong finding. A tiny effect that barely clears the significance threshold in a massive sample is not.
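For two-group comparisons, the most common effect-size measure is Cohen's d, the difference in means divided by the pooled standard deviation. A minimal sketch:

```python
import math

def cohens_d(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Rough conventions: 0.2 is small, 0.5 medium, 0.8 large.
print(cohens_d([5, 6, 7, 8, 9], [4, 5, 6, 7, 8]))  # about 0.632
```

Unlike a p-value, d does not shrink or grow with sample size, which is exactly why it complements the significance test.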
Two Types of Errors to Watch For
Every significance test carries the risk of two mistakes. A Type I error (false positive) happens when you conclude there’s an effect but there actually isn’t one. Your alpha level directly controls this risk: an alpha of 0.05 means you accept a 5% chance of a false positive. A Type II error (false negative) happens when you conclude there’s no effect but there actually is one. This risk is controlled by your sample size and statistical power.
These two errors pull in opposite directions. Making your alpha stricter (say, 0.01 instead of 0.05) reduces false positives but increases the chance of missing real effects. Increasing your sample size is the main way to reduce both risks simultaneously.
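A quick simulation makes the Type I error rate concrete: when the null is true (both groups drawn from the same distribution), about 5% of tests at alpha = 0.05 come out "significant" purely by chance. The setup below is illustrative:

```python
import random
from scipy import stats

random.seed(42)
trials = 2000
false_positives = 0
for _ in range(trials):
    # Both groups come from the same normal distribution: no real effect.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if stats.ttest_ind(a, b).pvalue <= 0.05:
        false_positives += 1

print(f"false positive rate: {false_positives / trials:.3f}")  # close to 0.05
```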
What the Experts Say About P-Values
The American Statistical Association released its first-ever formal statement on p-values in 2016, after 177 years of existence. The statement laid out several principles worth knowing. A p-value does not tell you the probability that your hypothesis is true. It does not measure the size or importance of an effect. And scientific conclusions should never rest on whether a p-value crosses 0.05 alone.
The practical takeaway: treat p-values as one piece of evidence, not a verdict. Report exact p-values rather than just saying “significant” or “not significant.” Look at effect sizes. Consider confidence intervals. And be transparent about every analysis you ran, not just the one that produced the result you wanted. A low p-value combined with a meaningful effect size, a well-designed study, and a pre-registered hypothesis is strong evidence. A low p-value by itself, pulled from dozens of comparisons on a small sample, is not.

