How to Read and Interpret an ANOVA Results Table

An ANOVA results table tells you whether the averages of three or more groups are meaningfully different from each other. The key number to look at first is the p-value in the far-right column. If it’s below 0.05, at least one group’s average is significantly different from the others. But that single number is just the starting point. Understanding what each column in the table means, and what to do after you get a significant result, will help you draw real conclusions from your data.

What Each Column in the Table Means

A standard ANOVA output table has six columns: Source, Sum of Squares (SS), Degrees of Freedom (DF), Mean Square (MS), F-statistic, and p-value. They build on each other from left to right, so understanding them in order makes the whole table click.

Source splits your data’s variability into two buckets. “Between groups” (sometimes labeled “Model” or “Factor”) captures how much the group averages differ from the overall average. “Within groups” (sometimes labeled “Residual” or “Error”) captures the natural spread of individual data points inside each group. A “Total” row at the bottom sums everything up.

Sum of Squares (SS) quantifies each source of variability as a single number. The between-groups SS measures how spread apart your group means are. The within-groups SS measures how much individual observations scatter around their own group’s mean. Bigger between-groups SS relative to within-groups SS is what you’re hoping for: it means the groups really do differ, not just because of random noise.

Degrees of Freedom (DF) adjusts for sample size and number of groups. Between-groups DF equals the number of groups minus one. Within-groups DF equals the total number of observations minus the number of groups. These numbers matter because they feed directly into the next column.

Mean Square (MS) is simply SS divided by DF. This step standardizes the variability so you can fairly compare between-group differences to within-group differences, even when the counts are unequal.

F-statistic is the ratio of the between-groups Mean Square to the within-groups Mean Square. If there’s truly no difference between your groups, this ratio should hover around 1, because both sources of variability would be roughly equal. The further the F-value climbs above 1, the stronger the evidence that at least one group mean stands apart.

P-value translates the F-statistic into a probability. It answers: “If all the group means were actually identical, how likely is it I’d see an F-value this large just by chance?” A p-value below 0.05 is the conventional cutoff for calling a result statistically significant. Some fields use a stricter threshold of 0.01 for stronger confidence. These cutoffs are conventions, not laws of nature, so treat them as guidelines rather than bright lines.
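The columns above can be traced end to end in a few lines of code. Here's a minimal sketch in Python that builds every column of a one-way ANOVA table by hand, using scipy only for the final p-value. The three groups are hypothetical data chosen for illustration.

```python
# Sketch: computing each ANOVA table column from raw data.
# The three groups below are made-up numbers for illustration.
from scipy.stats import f as f_dist

groups = [
    [23, 25, 21, 27, 24],   # group 1
    [30, 28, 33, 29, 31],   # group 2
    [22, 26, 24, 23, 25],   # group 3
]

all_obs = [x for g in groups for x in g]
grand_mean = sum(all_obs) / len(all_obs)

# Between-groups SS: how far each group mean sits from the grand mean
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups)
# Within-groups SS: scatter of observations around their own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1             # k - 1
df_within = len(all_obs) - len(groups)   # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within

f_stat = ms_between / ms_within
p_value = f_dist.sf(f_stat, df_between, df_within)  # right-tail probability

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.4f}")
```

Because group 2's mean clearly sits apart from the other two, the between-groups Mean Square dwarfs the within-groups Mean Square and the F-statistic comes out well above 1.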

How to Interpret the F-Statistic

The F-value is the centerpiece of the ANOVA table. Think of it as a signal-to-noise ratio. The “signal” is the variation between your groups (did the treatment, condition, or category actually move the needle?). The “noise” is the variation within your groups (how much do individuals naturally differ from each other?).

An F-value near 1 means the differences between groups are about the same size as the random variation you’d expect within groups. There’s no real signal. An F-value of 4, 10, or 20 means between-group differences are that many times larger than within-group noise, which is increasingly hard to explain away as chance. Your software pairs every F-value with a p-value, so you don’t need to judge the F-value in isolation. But understanding what it represents helps you see why the same F-value can be significant in one study and not another: it depends on sample size and degrees of freedom.
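You can see the role of degrees of freedom directly by holding the F-value fixed and varying the sample size. In this sketch, the F-value of 4 and the two df pairs are arbitrary choices for illustration:

```python
# Sketch: the same F-value yields different p-values depending on the
# degrees of freedom, which is why F can't be judged in isolation.
from scipy.stats import f as f_dist

f_value = 4.0
# (between-groups df, within-groups df): a small study vs a larger one
for df_between, df_within in [(2, 12), (2, 87)]:
    p = f_dist.sf(f_value, df_between, df_within)
    print(f"F = {f_value} with df = ({df_between}, {df_within}) -> p = {p:.4f}")
```

With more observations (larger within-groups df), the same F = 4 produces a smaller p-value, because random noise is less able to generate a ratio that large in a big sample.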

What a Significant Result Actually Tells You

A significant ANOVA result (p < 0.05) tells you something limited but important: at least one group mean is different from the others. It does not tell you which groups differ, or how many pairs are different. If you’re comparing three training methods and ANOVA is significant, you know they’re not all equal, but you don’t yet know whether Method A beats Method B, Method C, or both.

This is where post-hoc tests come in. These follow-up comparisons test every pair of groups while adjusting for the fact that you’re running multiple tests at once (which otherwise inflates your chance of a false positive).

  • Tukey’s HSD is the most common choice when you want to compare every group to every other group. It controls your overall error rate cleanly and works best when group sizes are equal. For unequal groups, a modified version called Tukey-Kramer handles the adjustment.
  • Bonferroni correction divides your significance threshold by the number of comparisons you’re making. It’s simple and works with any type of statistical test, but it becomes overly conservative when you have many groups, making it harder to detect real differences.
  • Scheffé’s test is the most cautious option. It can handle complex comparisons beyond simple pairs, like testing whether the average of Groups A and B together differs from Group C. But it’s less powerful for straightforward pairwise comparisons, so it’s generally not the best pick if that’s all you need.
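As a concrete illustration of the Bonferroni idea, here is a sketch that runs every pairwise two-sample t-test and divides the 0.05 threshold by the number of comparisons. The three groups are hypothetical; in practice your software's Tukey HSD routine would be the more common choice.

```python
# Sketch: Bonferroni-corrected pairwise comparisons on hypothetical data.
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "A": [23, 25, 21, 27, 24],
    "B": [30, 28, 33, 29, 31],
    "C": [22, 26, 24, 23, 25],
}

pairs = list(combinations(groups, 2))
# Bonferroni: divide the significance threshold by the number of comparisons
alpha = 0.05 / len(pairs)

for name1, name2 in pairs:
    t, p = ttest_ind(groups[name1], groups[name2])
    verdict = "different" if p < alpha else "not distinguishable"
    print(f"{name1} vs {name2}: p = {p:.4f} -> {verdict} (corrected alpha = {alpha:.4f})")
```

Here group B stands apart from A and C, while A and C are indistinguishable, which is exactly the kind of pattern a bare ANOVA result can't reveal on its own.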

If your ANOVA result is not significant, you stop there. Running post-hoc tests after a non-significant ANOVA doesn’t make statistical sense.

Effect Size: How Big Is the Difference?

Statistical significance tells you whether a difference exists. Effect size tells you whether it matters. A massive sample can make a tiny, practically meaningless difference “significant,” so you always want to check both.

The most common ANOVA effect size is eta-squared, which you can calculate directly from the table: divide the between-groups Sum of Squares by the total Sum of Squares. The result is a proportion between 0 and 1 that represents how much of the overall variation in your data is explained by group membership.

Cohen’s widely used benchmarks for interpreting eta-squared (or the closely related omega-squared) are: 0.01 is a small effect, 0.06 is a medium effect, and 0.14 is a large effect. So if your eta-squared is 0.03, your groups are statistically different but the grouping variable only explains 3% of the variation, a small effect. If it’s 0.18, the grouping variable accounts for 18% of the variation, which is substantial. Omega-squared is a slightly more conservative alternative that corrects for bias in small samples, but the same benchmarks apply.
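Both effect sizes come straight from numbers already printed in the table. This sketch uses hypothetical SS and MS values and the standard omega-squared formula, ω² = (SS_between − DF_between × MS_within) / (SS_total + MS_within):

```python
# Sketch: effect sizes from the ANOVA table itself.
# The SS/DF/MS values below are hypothetical, for illustration only.
ss_between, ss_total = 420.0, 3000.0
df_between, ms_within = 2, 29.66

eta_squared = ss_between / ss_total
# Omega-squared subtracts out the variability expected by chance,
# so it runs slightly smaller than eta-squared in small samples.
omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(f"eta^2 = {eta_squared:.3f}, omega^2 = {omega_squared:.3f}")
```

Note that omega-squared lands a bit below eta-squared, which is the bias correction doing its job.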

Reading a Two-Way ANOVA Table

A two-way ANOVA tests two grouping variables at once and adds a critical third element: the interaction between them. Your output table will have separate rows for each main effect (Factor A, Factor B) and a row for their interaction (A × B).

Each row has its own F-statistic and p-value. The main effects tell you whether each variable independently influences the outcome. The interaction tells you whether the effect of one variable depends on the level of the other. For example, a medication might work well for younger patients but not for older ones. The medication’s main effect could look modest, but the interaction between medication and age group would be significant.

Interactions are important to check first. When the interaction is significant, interpreting the main effects in isolation can be misleading, because the story changes depending on which subgroup you’re looking at. In some cases, neither main effect reaches significance on its own, but the interaction does, meaning the two variables only matter in combination. In other cases, all three rows are significant, meaning each variable has an independent effect and also a combined effect that’s different from what you’d predict by adding them together.
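For a balanced design, all three rows of a two-way table can be computed by partitioning the Sum of Squares, just as in the one-way case. The sketch below uses made-up medication-by-age data constructed so that the drug helps only the younger patients, which makes the interaction row the one to watch:

```python
# Sketch: a balanced two-way ANOVA computed by hand on hypothetical data.
# data[i, j, :] = observations for level i of factor A, level j of factor B.
# A = medication (0 = placebo, 1 = drug), B = age group (0 = young, 1 = old).
import numpy as np
from scipy.stats import f as f_dist

data = np.array([
    [[50, 52, 48, 51], [49, 51, 50, 48]],   # placebo: young, old
    [[60, 62, 59, 61], [50, 49, 51, 52]],   # drug:    young, old
], dtype=float)

a, b, n = data.shape
grand = data.mean()
mean_a = data.mean(axis=(1, 2))   # marginal means of factor A
mean_b = data.mean(axis=(0, 2))   # marginal means of factor B
cell = data.mean(axis=2)          # cell means

ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_within = np.sum((data - cell[:, :, None]) ** 2)

df_a, df_b = a - 1, b - 1
df_ab, df_within = df_a * df_b, a * b * (n - 1)
ms_within = ss_within / df_within

for name, ss, df in [("A (medication)", ss_a, df_a),
                     ("B (age)", ss_b, df_b),
                     ("A x B", ss_ab, df_ab)]:
    F = (ss / df) / ms_within
    p = f_dist.sf(F, df, df_within)
    print(f"{name}: F({df}, {df_within}) = {F:.2f}, p = {p:.4f}")
```

With this data the interaction row comes out significant, flagging that the medication's effect depends on age and that its main effect shouldn't be read in isolation.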

Check the Assumptions Before Trusting Results

ANOVA results are only reliable if three assumptions hold. Most statistical software can test these for you, and your output may already include them.

First, the observations need to be independent, meaning one person’s score doesn’t influence another’s. This is a study design issue, not something you test statistically. Repeated measurements on the same people, for instance, violate this assumption and require a different type of ANOVA.

Second, the data within each group should be roughly normally distributed. You can check this with a Shapiro-Wilk test (a significant result means the data depart from normality) or simply by eyeballing a histogram or Q-Q plot. ANOVA is fairly robust to mild violations of normality, especially with larger samples.

Third, the variance (spread of data) should be roughly equal across groups. Levene’s test is the standard check. It’s less sensitive to departures from normality than alternatives like Bartlett’s test, which makes it a reliable default. If Levene’s test comes back significant, your groups have unequal variances and you may need to use a corrected version of the F-test (like Welch’s ANOVA) or a non-parametric alternative.
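The second and third checks take only a few lines in most software. Here's a sketch using scipy on three hypothetical groups with similar spreads, so both tests should come back non-significant:

```python
# Sketch: normality and equal-variance checks on hypothetical data.
from scipy.stats import shapiro, levene

groups = [
    [23, 25, 21, 27, 24, 26, 22],
    [30, 28, 33, 29, 31, 27, 32],
    [22, 26, 24, 23, 25, 21, 27],
]

# Normality within each group: a small p-value flags departure from normality
for i, g in enumerate(groups, start=1):
    stat, p = shapiro(g)
    print(f"Group {i}: Shapiro-Wilk p = {p:.3f}")

# Equal variances across groups: a small p-value flags unequal variances
stat, p_levene = levene(*groups)
print(f"Levene's test p = {p_levene:.3f}")
```

Remember that both checks are reversed relative to the ANOVA itself: here a large p-value is the reassuring outcome, because it means no detectable violation.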

A Quick Walkthrough of a Real Output

Suppose you’re comparing test scores across three teaching methods with 30 students per group. Your ANOVA table shows a between-groups SS of 420, a within-groups SS of 2,580, and a total SS of 3,000. Between-groups DF is 2 (three groups minus one), and within-groups DF is 87 (90 total students minus three groups).

Mean Square for between groups: 420 ÷ 2 = 210. Mean Square for within groups: 2,580 ÷ 87 = 29.66. The F-statistic: 210 ÷ 29.66 = 7.08. With 2 and 87 degrees of freedom, this F-value would produce a p-value well below 0.05, so you’d conclude the teaching methods don’t all produce the same average score.

For effect size, eta-squared = 420 ÷ 3,000 = 0.14, which lands right at Cohen’s threshold for a large effect. The teaching method explains about 14% of the variation in test scores. You’d then run Tukey’s HSD to find out which specific methods differ from each other, and report the pairwise results alongside the overall ANOVA.
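The walkthrough arithmetic is easy to reproduce, with scipy supplying the exact p-value the table would show:

```python
# Reproducing the walkthrough: SS and DF values from the example above.
from scipy.stats import f as f_dist

ss_between, ss_within, ss_total = 420.0, 2580.0, 3000.0
df_between, df_within = 2, 87

ms_between = ss_between / df_between     # 210.0
ms_within = ss_within / df_within        # ~29.66
f_stat = ms_between / ms_within          # ~7.08
p_value = f_dist.sf(f_stat, df_between, df_within)
eta_squared = ss_between / ss_total      # 0.14

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, "
      f"p = {p_value:.4f}, eta^2 = {eta_squared:.2f}")
```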