How to Interpret ANOVA: F-Statistic, P-Value, and More

Interpreting ANOVA comes down to one core question: is the variation between your groups large enough, relative to the variation within your groups, to conclude that at least one group mean is genuinely different? The answer lives in a handful of numbers, most importantly the F-statistic and its associated p-value. But reading those numbers correctly requires understanding what each piece of the ANOVA output actually tells you.

What the F-Statistic Means

The F-statistic is a ratio. It divides the variance between your groups by the variance within your groups. Think of it this way: the numerator captures how spread apart your group averages are from each other, while the denominator captures how spread out individual data points are inside each group.

When the null hypothesis is true (meaning all group means are actually equal), any small differences between groups are just noise, roughly the same size as the variability within groups. That produces an F-value near 1. The further the F-value climbs above 1, the more evidence you have that something beyond random chance is driving the differences between groups. An F-value of 4.7, for example, means the between-group variance is 4.7 times as large as the within-group variance.
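This ratio is easy to see in practice. The sketch below runs a one-way ANOVA on three made-up groups of scores using SciPy; the data are purely illustrative, not from any real study.

```python
# One-way ANOVA on toy data: the F-statistic is the ratio of
# between-group variance to within-group variance.
from scipy.stats import f_oneway

group_a = [85, 90, 88, 84, 89]
group_b = [78, 82, 80, 79, 81]
group_c = [91, 94, 92, 90, 93]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Because these toy groups are well separated relative to their internal spread, the F-value comes out far above 1.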

Reading the ANOVA Table

Most software outputs an ANOVA table with columns labeled Source, SS (Sum of Squares), df (degrees of freedom), MS (Mean Square), and F. Here’s what each one means in practice.

Sum of Squares (SS) measures total variability. The “between” or “treatment” row shows how much variability comes from differences between group averages. The “error” or “within” row shows how much variability comes from individual differences inside each group. A large treatment SS relative to the error SS suggests your groups differ meaningfully.

Degrees of freedom (df) adjust for sample size and number of groups. The between-groups df equals the number of groups minus one. If you’re comparing four teaching methods, that’s 3. The within-groups df equals the total number of observations minus the number of groups. If you have 100 students across those four groups, that’s 96.

Mean Square (MS) is simply the sum of squares divided by its degrees of freedom. This step converts raw variability into a per-unit estimate of variance, which makes groups of different sizes comparable. The F-statistic is then the treatment mean square divided by the error mean square.
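The table's arithmetic can be carried out by hand. The sketch below builds every column (SS, df, MS, F) for three small made-up groups, following exactly the formulas described above.

```python
# Build the ANOVA table columns by hand for toy data.
groups = {
    "A": [85, 90, 88],
    "B": [78, 82, 80],
    "C": [91, 94, 92],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-groups SS: spread of group means around the grand mean,
# weighted by group size.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)
# Within-groups SS: spread of each point around its own group mean.
ss_within = sum(
    (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
)

df_between = len(groups) - 1               # k - 1
df_within = len(all_scores) - len(groups)  # N - k

ms_between = ss_between / df_between       # Mean Square = SS / df
ms_within = ss_within / df_within

f_stat = ms_between / ms_within
print(f"SS_between={ss_between:.2f}, SS_within={ss_within:.2f}, F={f_stat:.2f}")
```

For this toy data the between-groups SS (about 232.67) dwarfs the within-groups SS (about 25.33), producing a large F.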

The P-Value Decision

Your p-value tells you the probability of seeing an F-statistic this large (or larger) if all group means were truly equal. You compare it to your chosen significance level, typically 0.05. If the p-value is less than or equal to 0.05, you reject the null hypothesis and conclude that at least one group mean differs from the others. If it’s greater than 0.05, you do not reject the null hypothesis.
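Concretely, the p-value is the upper-tail probability of the F distribution with your two degrees-of-freedom values. The sketch below uses SciPy's F distribution with illustrative numbers (an F of 4.72 with df of 3 and 96, matching the reporting example later in this article).

```python
# p-value = P(F >= observed F) under the null hypothesis of equal means.
from scipy.stats import f

f_observed = 4.72   # illustrative F-statistic
df_between = 3      # number of groups minus one
df_within = 96      # total observations minus number of groups

p_value = f.sf(f_observed, df_between, df_within)  # survival function: upper tail
print(f"p = {p_value:.4f}")

if p_value <= 0.05:
    print("Reject the null: at least one group mean differs.")
```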

A critical nuance: a significant ANOVA result does not tell you which groups differ. It only tells you that not all group means are the same. Pinpointing the specific differences requires a follow-up step.

Checking Assumptions Before You Trust the Results

ANOVA results are only reliable if three assumptions hold: your data within each group are roughly normally distributed, the variance across groups is approximately equal, and your observations are independent of each other.

The equal-variance assumption is the one most commonly tested. Levene’s test checks whether group variances are similar enough. It works like a mini hypothesis test of its own: if the p-value from Levene’s test is above 0.05, you have no evidence that the variances differ, and the assumption is treated as met. If it falls below 0.05, the assumption is violated, and you may need a corrected version of the F-test (like Welch’s ANOVA) or a non-parametric alternative.
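Levene's test is available directly in SciPy. The groups below are the same made-up toy data used earlier, so the outcome is illustrative only.

```python
# Levene's test for equal variances across groups (toy data).
from scipy.stats import levene

group_a = [85, 90, 88, 84, 89]
group_b = [78, 82, 80, 79, 81]
group_c = [91, 94, 92, 90, 93]

stat, p_value = levene(group_a, group_b, group_c)
print(f"Levene's W = {stat:.2f}, p = {p_value:.3f}")

if p_value > 0.05:
    print("No evidence of unequal variances; standard ANOVA is reasonable.")
else:
    print("Variances look unequal; consider Welch's ANOVA.")
```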

Normality matters less as sample sizes grow. With 30 or more observations per group, ANOVA is robust to moderate departures from a normal distribution.

Effect Size: How Big Is the Difference?

A significant p-value tells you the difference is unlikely to be due to chance, but it says nothing about whether the difference is practically meaningful. That’s what effect size measures. The most common effect size for ANOVA is eta-squared (η²), which represents the proportion of total variability explained by group membership.

The standard benchmarks are:

  • η² = 0.01: small effect (group membership explains about 1% of the variance)
  • η² = 0.06: medium effect (about 6%)
  • η² = 0.14: large effect (about 14%)

An η² of 0.03, even with a significant p-value, means the groups differ statistically but the practical difference is tiny. Conversely, an η² of 0.18 means group membership accounts for a substantial chunk of what’s driving variation in your outcome. Always report effect size alongside your p-value, because with a large enough sample, even trivial differences become statistically significant.
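Eta-squared is simple to compute from the sums of squares in the ANOVA table. A minimal sketch, with hypothetical SS values chosen to land on the benchmarks above:

```python
# Eta-squared: proportion of total variability explained by group membership.
def eta_squared(ss_between: float, ss_within: float) -> float:
    return ss_between / (ss_between + ss_within)

# Hypothetical sums of squares that total 100, for easy reading.
print(eta_squared(14.0, 86.0))  # large effect by the usual benchmarks
print(eta_squared(6.0, 94.0))   # medium effect
```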

Post-Hoc Tests: Finding Which Groups Differ

Once ANOVA flags a significant overall difference, post-hoc tests compare groups in pairs to identify exactly where the differences lie. The challenge is that running many comparisons inflates the chance of a false positive. Post-hoc methods control for this.

Tukey’s HSD (honestly significant difference) is the default choice when you want to compare every group to every other group. It controls the overall error rate across all pairwise comparisons and works best when group sizes are equal. For unequal group sizes, a modified version called Tukey-Kramer handles the adjustment.
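Tukey's HSD is available in statsmodels. The sketch below runs it on made-up scores for three methods; the values are fabricated purely to make the code runnable.

```python
# Tukey's HSD on toy data: one adjusted comparison per pair of groups.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = [85, 90, 88, 84, 89,   # method A
          78, 82, 80, 79, 81,   # method B
          91, 94, 92, 90, 93]   # method C
methods = ["A"] * 5 + ["B"] * 5 + ["C"] * 5

result = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(result.summary())  # one row per pair: A-B, A-C, B-C
```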

The Bonferroni correction is better suited when you have a small number of planned comparisons rather than all possible pairs. It divides your significance threshold by the number of tests you’re running. This is straightforward and widely applicable, but it becomes overly conservative when the number of comparisons is large (roughly ten or more), reducing your ability to detect real differences.
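The correction itself is one line of arithmetic. A minimal sketch with hypothetical p-values from three planned comparisons:

```python
# Bonferroni: divide the significance threshold by the number of tests.
alpha = 0.05
planned_comparisons = 3
adjusted_alpha = alpha / planned_comparisons  # about 0.0167 per comparison

# Hypothetical p-values from three planned pairwise tests.
p_values = [0.005, 0.031, 0.080]
decisions = [p <= adjusted_alpha for p in p_values]
print(adjusted_alpha, decisions)  # only the first comparison survives
```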

Scheffé’s procedure is the most flexible option, designed for complex comparisons that go beyond simple pairwise tests, such as comparing the average of two groups against a third. It applies the most stringent error control, which makes it the most conservative. It’s generally not recommended if you only care about pairwise comparisons.

Interpreting Two-Way ANOVA and Interactions

When your design includes two independent variables (say, drug dosage and sex), a two-way ANOVA produces three results: a main effect for each variable and an interaction effect. The interaction is the most important result to check first, because it can change the meaning of everything else.

An interaction means the effect of one variable depends on the level of the other. For example, a pain study might find that a low dose reduces pain more in women, while a high dose reduces pain more in men. Neither “dose” nor “sex” alone tells the full story. The pattern reverses depending on which group you look at.

If the interaction is significant, interpret it by describing the pattern at each level of one factor separately. The main effects become misleading on their own, because they average across a pattern that isn’t consistent. If the interaction is not significant, you can interpret each main effect independently, just as you would in a one-way ANOVA.

Reporting Your Results

Standard reporting includes the F-value, degrees of freedom, and p-value. In APA format, that looks like: F(3, 96) = 4.72, p = .004, η² = .13. The first number in parentheses is the between-groups degrees of freedom, the second is the within-groups degrees of freedom. Including effect size alongside these numbers gives readers the full picture.
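If you generate reports programmatically, a small helper can produce this format. The function name below is hypothetical, and the leading-zero handling follows the APA convention of dropping the zero before the decimal point.

```python
# Hypothetical helper to format ANOVA results in APA style.
def apa_anova(f_stat, df_between, df_within, p, eta_sq):
    p_text = f"{p:.3f}".lstrip("0")       # APA drops the leading zero
    eta_text = f"{eta_sq:.2f}".lstrip("0")
    return f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_text}, η² = {eta_text}"

print(apa_anova(4.72, 3, 96, 0.004, 0.13))
# -> F(3, 96) = 4.72, p = .004, η² = .13
```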

In an ANOVA table, list sources of variance in rows (between-subjects variables and error first, then within-subjects if applicable). Report the mean square error in parentheses, and use asterisks to flag statistically significant F-ratios. A probability footnote at the bottom of the table specifies what each asterisk means (for instance, *p < .05). Keep your asterisk conventions consistent throughout any paper or report.

A Practical Walkthrough

Suppose you’re comparing test scores across three tutoring methods with 30 students per group. Your ANOVA output shows F(2, 87) = 5.34, p = .007, η² = .11. Here’s how to read that step by step. The F-value of 5.34 means the variance between tutoring groups is more than five times as large as the variance within groups. The p-value of .007 is below 0.05, so you reject the null hypothesis: at least one tutoring method produces a different average score. The η² of .11 is a medium-to-large effect, meaning tutoring method explains about 11% of the variation in test scores.

You’d then run Tukey’s HSD to compare all three pairs. Maybe Method A outperforms Method C (p = .005), but Methods A and B don’t significantly differ (p = .31), and neither do B and C (p = .08). The conclusion: Method A clearly beats Method C, but Method B falls somewhere in between without a clear statistical distinction from either.
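The whole pipeline, omnibus test first, post-hoc second, fits in a few lines. The scores below are fabricated and will not reproduce the exact numbers in the walkthrough above; the structure of the analysis is the point.

```python
# End-to-end: omnibus F-test, then Tukey's HSD follow-up (fabricated data).
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

method_a = [88, 92, 85, 91, 90, 87]
method_b = [84, 86, 83, 88, 85, 87]
method_c = [80, 82, 79, 84, 81, 83]

f_stat, p_value = f_oneway(method_a, method_b, method_c)
print(f"Omnibus: F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:  # only probe pairs if the omnibus test is significant
    scores = method_a + method_b + method_c
    labels = ["A"] * 6 + ["B"] * 6 + ["C"] * 6
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05).summary())
```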