How to Tell If an ANOVA Result Is Significant

An ANOVA result is significant when the p-value is less than your chosen significance level, most commonly 0.05. If your p-value falls below that threshold, you reject the null hypothesis and conclude that at least one group mean differs from the others. But a p-value alone doesn’t tell the full story. Understanding the F-statistic, checking your assumptions, and knowing what to do after a significant result are all part of interpreting ANOVA correctly.

What the F-Statistic Tells You

ANOVA works by comparing two types of variation in your data: the variation between your groups and the variation within each group. The F-statistic is the ratio of these two quantities. Specifically, it divides the mean square between groups by the mean square within groups, where each mean square is a sum of squares divided by its degrees of freedom.

A large F-statistic means the differences between your group means are large relative to the natural spread of scores within each group. That’s what makes it easier to reject the null hypothesis, which states that all group means are equal. An F-value near 1 suggests the between-group differences are about the same size as the random variation you’d expect within groups, pointing toward no meaningful difference. The further the F-statistic climbs above 1, the stronger the evidence that something real is going on.

You can’t interpret the F-statistic in isolation, though. Its meaning depends on the degrees of freedom, which are determined by how many groups you have and how many total observations are in your data. The numerator degrees of freedom equal the number of groups minus one. The denominator degrees of freedom equal the total number of observations minus the number of groups. These two values shape the F-distribution you’re comparing against, which is why the same F-value can be significant in one study but not another.
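To make the pieces concrete, here is a minimal sketch that computes the F-statistic and both degrees of freedom by hand and checks the result against SciPy's `f_oneway`. The three groups of scores are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (made-up data for illustration)
groups = [
    np.array([4.0, 5.0, 6.0, 5.5]),
    np.array([6.5, 7.0, 8.0, 7.5]),
    np.array([5.0, 5.5, 6.0, 6.5]),
]

k = len(groups)                      # number of groups
N = sum(len(g) for g in groups)     # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: squared deviations of each group mean
# from the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: squared deviations from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1                   # numerator degrees of freedom
df_within = N - k                    # denominator degrees of freedom

# F = mean square between / mean square within
f_stat = (ss_between / df_between) / (ss_within / df_within)

# SciPy computes the same statistic, plus its p-value
f_scipy, p_value = stats.f_oneway(*groups)
print(f_stat, f_scipy, p_value)
```

The hand-computed F matches SciPy's to floating-point precision, which is a useful sanity check when learning what the software is doing.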

Comparing the P-Value to Your Alpha Level

The most common way to check significance is to look at the p-value that your software outputs alongside the F-statistic. The p-value tells you the probability of getting an F-statistic at least as extreme as yours if the null hypothesis were actually true. In other words, it answers: “How likely is this result if there’s really no difference between groups?”

You compare the p-value to a pre-set significance level (alpha). Most researchers use 0.05, meaning they accept a 5% chance of incorrectly rejecting the null hypothesis. If p is less than 0.05, the result is significant. If p is greater than 0.05, you fail to reject the null hypothesis. There’s nothing magical about 0.05. Alpha can be set at 0.01 for stricter standards or 0.10 for more lenient ones, depending on the context. The key is choosing your alpha before you run the test, not after you see the results.

Setting alpha to 0.05 means that 5 times out of 100, you’d reject the null hypothesis even though it’s actually true. That’s the tradeoff you accept. Lowering alpha to 0.01 reduces that risk but makes it harder to detect real differences.
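The p-value itself is just the right-tail probability of the F-distribution beyond your observed F. As a sketch, using the example values reported later in this article (F = 6.31 with 2 and 45 degrees of freedom):

```python
from scipy import stats

# Right-tail probability of the F-distribution beyond the observed F
f_stat, df_between, df_within = 6.31, 2, 45
p_value = stats.f.sf(f_stat, df_between, df_within)  # survival function = 1 - CDF

alpha = 0.05
print(p_value, p_value < alpha)  # significant if p falls below alpha
```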

Using the F-Critical Value Instead

If you’re working without software or want to verify a result by hand, you can compare your calculated F-statistic to a critical value from an F-distribution table. You look up the critical value using your numerator degrees of freedom, denominator degrees of freedom, and chosen alpha level. If your F-statistic is greater than the critical value, the result is significant. If it’s smaller, it’s not.

This is exactly what the p-value method does behind the scenes. The critical value is simply the F-value that corresponds to your alpha threshold. Both approaches give the same answer.
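Rather than a printed table, you can also pull the critical value from the F-distribution directly. A minimal sketch, again using the hypothetical F = 6.31 with df = (2, 45):

```python
from scipy import stats

alpha = 0.05
df_between, df_within = 2, 45

# The critical value is the F cutting off the top alpha of the distribution
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

f_stat = 6.31
print(f_crit, f_stat > f_crit)  # significant if F exceeds the critical value
```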

Check Your Assumptions First

A significant ANOVA result is only trustworthy if three assumptions hold. Violating them can inflate your F-statistic and produce misleading p-values.

  • Normality: The data within each group should be approximately normally distributed. You can check this visually with a QQ plot or formally with a Shapiro-Wilk test. ANOVA is fairly robust to mild violations of normality, especially with larger sample sizes.
  • Homogeneity of variance: The spread of scores should be roughly equal across all groups. Levene’s test checks this directly: if it comes back significant (p less than 0.05), that’s a warning that the group variances are not equal and this assumption is violated. In that case, alternatives like Welch’s ANOVA may give more reliable results.
  • Independence: Each observation should be unrelated to every other observation. This is violated when the same person appears in multiple conditions (a repeated-measures design) or when data points are clustered in some way. There’s no simple test for independence. It comes down to how you designed your study.
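The first two assumptions can be checked in a few lines. A sketch with simulated groups (the data here are drawn from normal distributions, so both checks should typically pass):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three hypothetical groups, each drawn from a normal distribution
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality within each group (Shapiro-Wilk):
# p > 0.05 is consistent with normality
for i, g in enumerate(groups):
    stat, p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk p = {p:.3f}")

# Homogeneity of variance (Levene):
# a significant result warns that the variances differ
stat, p_levene = stats.levene(*groups)
print(f"Levene p = {p_levene:.3f}")
```

Independence, as noted above, has no analogous test; it is a property of the study design.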

A Significant Result Does Not Mean a Large Effect

A tiny p-value doesn’t necessarily mean the difference between your groups is meaningful in a practical sense. With a large enough sample, even trivially small differences between group means can produce a significant F-statistic. That’s why effect size matters.

The most common effect size measure for ANOVA is eta squared, which represents the proportion of total variance in your data that’s explained by the group differences. An eta squared of 0.01 is generally considered small, 0.06 is medium, and 0.14 or above is large, based on widely used benchmarks from the statistician Jacob Cohen. If your ANOVA is significant but eta squared is 0.02, the groups do differ, but the factor you’re testing accounts for only 2% of the variation in your outcome. That may or may not matter depending on your field.

Partial eta squared is a related measure you’ll see in studies with more than one factor. It captures how much variance a single factor explains after removing variance from other factors in the model. Most statistical software reports one or both of these automatically.
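Eta squared falls out of the same sums of squares used to build the F-statistic: it is the between-group sum of squares divided by the total sum of squares. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up example data for three groups
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([5.0, 6.0, 7.0])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Between-group sum of squares, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Total sum of squares: all deviations from the grand mean
ss_total = ((all_data - grand_mean) ** 2).sum()

eta_squared = ss_between / ss_total  # proportion of variance explained
print(round(eta_squared, 3))
```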

What a Significant ANOVA Does and Doesn’t Tell You

A significant result tells you that at least one group mean is different from the others. It does not tell you which groups differ. If you’re comparing three treatments and get a significant F-test, you know the three means aren’t all equal, but you don’t yet know whether Treatment A differs from B, A from C, B from C, or all three from each other.

To answer that, you need post-hoc tests. The most commonly used is Tukey’s HSD (honestly significant difference), which compares every possible pair of group means while controlling for the increased error rate that comes from running multiple comparisons. Tukey’s HSD is the standard choice when you want all pairwise comparisons and your groups are roughly equal in size. For unequal group sizes, a modified version called Tukey-Kramer works better.

Bonferroni correction is another option, best suited when you only have a small number of planned comparisons rather than every possible pair. It’s simpler but more conservative. ScheffĂ©’s method is the most flexible, covering complex comparisons beyond simple pairs, but it’s also the most conservative and generally not recommended when you only care about pairwise differences.
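SciPy ships a Tukey HSD implementation (`scipy.stats.tukey_hsd`, available in SciPy 1.8 and later). A sketch with three hypothetical treatment groups, where group b is clearly shifted relative to a and c:

```python
from scipy import stats

# Hypothetical scores for three treatments
a = [24, 27, 29, 26, 25]
b = [31, 33, 30, 34, 32]
c = [25, 26, 28, 27, 26]

# Tukey's HSD compares every pair of group means while
# controlling the family-wise error rate
res = stats.tukey_hsd(a, b, c)
print(res.pvalue)  # matrix of pairwise p-values
```

Here the pairs involving b should come back significant while a vs. c should not, which is exactly the "which groups differ" question the omnibus F-test cannot answer.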

Reading an ANOVA Output

Whether you’re using SPSS, R, Excel, or another tool, the output typically appears as a table with these columns: source of variation, degrees of freedom, sum of squares, mean square, F-statistic, and p-value. Here’s what to look at in order:

  • F-statistic: The ratio of between-group variance to within-group variance. Larger values point toward group differences.
  • P-value: Often labeled “Sig.” in SPSS. Compare this to your alpha. Below 0.05 means significant at the 5% level.
  • Degrees of freedom: Two numbers, one for between groups (k minus 1) and one for within groups (N minus k). You need both to report your results properly.

In formal writing, ANOVA results are typically reported in a standard format: F(between df, within df) = F-value, p = p-value. For example, F(2, 45) = 6.31, p = .004. The degrees of freedom go in parentheses after the F, followed by the F-value and the exact p-value. Statistically significant results are often flagged with asterisks in tables, with a footnote specifying the alpha level.
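Assembling that report string from a test result is straightforward. A sketch using SciPy's `f_oneway` on hypothetical data:

```python
from scipy import stats

# Hypothetical three-group data
g1 = [5.1, 4.8, 6.0, 5.5, 5.2]
g2 = [6.9, 7.2, 6.5, 7.0, 6.8]
g3 = [5.9, 6.1, 5.8, 6.3, 6.0]

f_stat, p = stats.f_oneway(g1, g2, g3)

k = 3                                # number of groups
N = len(g1) + len(g2) + len(g3)      # total observations

# Standard reporting format: F(between df, within df) = F-value, p = p-value
report = f"F({k - 1}, {N - k}) = {f_stat:.2f}, p = {p:.3f}"
print(report)
```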

Two-Way ANOVA Has Multiple Tests

If your design has two factors (for example, treatment type and dosage level), a two-way ANOVA produces three separate F-tests, each with its own p-value. One tests the main effect of the first factor, another tests the main effect of the second factor, and the third tests whether the two factors interact. Each of these can be significant or not, independently of the others. An interaction effect means that the impact of one factor depends on the level of the other factor. You evaluate significance for each F-test the same way: compare its p-value to your alpha level.
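One common way to run this in Python is `statsmodels` with a formula interface; the ANOVA table then has one row (and one p-value) per main effect and per interaction. A sketch with a simulated 2x2 design, where only the treatment factor has a built-in effect (the column names and effect sizes here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
# Hypothetical 2x2 design: treatment (A/B) crossed with dosage (low/high),
# 10 observations per cell
df = pd.DataFrame({
    "treatment": np.repeat(["A", "B"], 20),
    "dosage": np.tile(np.repeat(["low", "high"], 10), 2),
})
# Simulated outcome: treatment B shifts scores up by 1.5; dosage has no effect
df["score"] = rng.normal(5, 1, 40) + (df["treatment"] == "B") * 1.5

# "*" expands to both main effects plus the interaction
model = smf.ols("score ~ C(treatment) * C(dosage)", data=df).fit()
table = anova_lm(model, typ=2)
print(table)  # one F-test and p-value per effect
```

Each row of the table is evaluated against your alpha separately, just as with a one-way ANOVA.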