How to Interpret ANOVA Results: P-Values and Effect Size

An ANOVA result tells you whether the averages of three or more groups differ more than you’d expect by chance. The core output is an F-statistic and a p-value, but making sense of the full table, including effect sizes and follow-up tests, is what turns raw output into a meaningful answer. Here’s how to read each piece.

What the ANOVA Table Shows

Most software produces a table with the same basic columns: Source, Degrees of Freedom (df), Sum of Squares (SS), Mean Square (MS), F-statistic, and p-value. Each row separates the total variability in your data into two parts: variability between groups and variability within groups (often labeled “Error”).

The between-groups row captures how far each group’s average sits from the overall average. The within-groups row captures how much individual scores vary inside each group. Mean Square is simply the Sum of Squares divided by its degrees of freedom, which converts raw variability into an average variability per unit. The F-statistic is the ratio of those two averages: between-group Mean Square divided by within-group Mean Square. A large F means the group averages are spread apart relative to the noise inside the groups.

Degrees of freedom appear as two numbers. The first (numerator) is the number of groups minus one. The second (denominator) is the total number of observations minus the number of groups. So if you’re comparing 4 groups with 60 total participants, your degrees of freedom are 3 and 56. These numbers shape the distribution used to calculate your p-value, and you’ll report them alongside F, like F(3, 56) = 8.12.
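The arithmetic above is small enough to check by hand. Here is a minimal Python sketch using three made-up groups of four scores each (the data are hypothetical, chosen so the numbers come out round):

```python
# One-way ANOVA table arithmetic, done by hand on hypothetical data.

def anova_table(groups):
    """Return df, Mean Squares, and F for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    k = len(groups)
    grand_mean = sum(all_scores) / n_total

    # Between-groups: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups ("Error"): variability of scores around their own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between   # Mean Square = SS / df
    ms_within = ss_within / df_within
    return df_between, df_within, ms_between, ms_within, ms_between / ms_within

groups = [[4, 5, 6, 5], [7, 8, 6, 7], [9, 8, 10, 9]]
df_b, df_w, ms_b, ms_w, f_stat = anova_table(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")   # F(2, 9) = 24.00
```

With 3 groups and 12 observations, the degrees of freedom come out to 2 and 9, exactly as the rules above predict.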

Reading the P-Value

The p-value tells you the probability of seeing an F-statistic at least this large if there were truly no difference between the groups. The conventional threshold is 0.05: a p-value below that is typically considered statistically significant, meaning you reject the idea that all group means are equal. A p-value below 0.01 is often described as highly significant, and below 0.001 as very highly significant.

These cutoffs are conventions, not laws of nature. They trace back to the statistician R.A. Fisher, who proposed 0.05 as a reasonable line but never intended it as an absolute rule. A p-value of 0.048 and one of 0.052 carry nearly identical evidence, so treat the threshold as a guideline rather than a cliff edge. What matters more is the size of the effect, which the p-value alone cannot tell you.

Why Effect Size Matters More Than You Think

A significant p-value tells you the group differences are unlikely to be pure noise. It does not tell you how large those differences are. With a big enough sample, even trivially small differences produce significant p-values. That’s where effect size comes in.

The most common effect size for ANOVA is eta-squared, which represents the proportion of total variability in your data explained by group membership. A related measure, omega-squared, adjusts for bias and is slightly more conservative. Both use the same general benchmarks from Cohen’s guidelines:

  • Small effect: 0.01 (about 1% of variability explained)
  • Medium effect: 0.06 (about 6%)
  • Large effect: 0.14 (about 14%)

If your ANOVA output shows p = 0.003 but eta-squared is 0.02, you have a statistically significant but small effect. The groups genuinely differ, but the practical magnitude of that difference is modest. Reporting both the p-value and an effect size gives a far more complete picture than either number alone.
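Both measures come straight from the sums of squares in the ANOVA table. A small sketch with hypothetical values:

```python
# Eta-squared and omega-squared from an ANOVA table (hypothetical SS values).

def eta_squared(ss_between, ss_total):
    # Proportion of total variability explained by group membership
    return ss_between / ss_total

def omega_squared(ss_between, ss_total, df_between, ms_within):
    # Bias-adjusted version; slightly smaller (more conservative) than eta-squared
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

ss_between, ss_within = 32.0, 6.0
ss_total = ss_between + ss_within
df_between, df_within = 2, 9
ms_within = ss_within / df_within

eta2 = eta_squared(ss_between, ss_total)
omega2 = omega_squared(ss_between, ss_total, df_between, ms_within)
print(f"eta-squared = {eta2:.3f}, omega-squared = {omega2:.3f}")
```

Note that omega-squared always comes out a bit smaller than eta-squared, which is exactly the bias adjustment described above.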

What a Significant Result Does Not Tell You

ANOVA is an omnibus test. A significant F-statistic tells you that at least one group differs from at least one other group, but it doesn’t tell you which groups differ. If you’re comparing three teaching methods and get a significant result, you know the methods aren’t all equally effective, but you don’t yet know which pairs differ: Method A versus Method B, Method A versus Method C, Method B versus Method C, or some combination of them.

To answer that question, you need post-hoc tests, which are pairwise comparisons that adjust for the fact that you’re running multiple tests at once. Without that adjustment, your chance of a false positive climbs with every comparison you make.
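That inflation is easy to quantify: with m independent tests each run at a 0.05 threshold, the chance of at least one false positive is 1 − (1 − 0.05)^m.

```python
# Family-wise false-positive rate for uncorrected multiple comparisons.

alpha = 0.05
for m in (1, 3, 6, 10):   # e.g. 3 pairwise tests for 3 groups, 6 for 4 groups
    family_wise = 1 - (1 - alpha) ** m
    print(f"{m:2d} comparisons -> {family_wise:.1%} chance of a false positive")
```

By ten comparisons the family-wise rate has climbed past 40%, which is why post-hoc procedures adjust for the number of tests.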

Choosing the Right Post-Hoc Test

The right post-hoc test depends on what you’re comparing and how many comparisons you plan to make.

Tukey’s HSD is the most widely recommended choice when you want to compare every group to every other group. It controls the overall false-positive rate while maintaining reasonable sensitivity. If your groups are different sizes, the Tukey-Kramer variation handles that automatically. For most standard ANOVA follow-ups, Tukey is the default.

Bonferroni correction works well when you have a small, specific set of comparisons planned in advance, generally fewer than ten. It simply divides your significance threshold by the number of tests, which makes it straightforward but increasingly conservative as comparisons multiply.

Dunnett’s test is purpose-built for designs where you’re comparing several experimental groups against a single control group rather than comparing all groups to each other. It has strong statistical power for that specific situation.

Scheffé’s procedure handles the broadest range of comparisons, including complex combinations of groups, not just simple pairs. However, that flexibility comes at a cost: it’s the most conservative option and not recommended if you’re only interested in pairwise differences.

One test to avoid in most cases is Fisher’s Least Significant Difference (LSD), which doesn’t adequately control the false-positive rate when you’re running multiple comparisons.

Interpreting a Two-Way ANOVA

When your design includes two factors (for example, medication type and therapy type), the output expands to include three tests instead of one: a main effect for each factor and an interaction effect between them.

A main effect tells you whether one factor influences the outcome after averaging across the levels of the other factor. If the main effect for medication is significant, the medication groups differ overall regardless of which therapy they received. If the main effect for therapy is significant, therapy types differ regardless of medication.

The interaction effect is often the most interesting line in the table. A significant interaction means the effect of one factor depends on the level of the other. For instance, a drug might produce large improvements when combined with cognitive behavioral therapy but only modest improvements when combined with a waitlist condition. The drug’s effect isn’t consistent across therapy conditions, and that inconsistency is the interaction.

When the interaction is significant, interpret the main effects cautiously. The overall averages for each factor can mask the pattern. An interaction plot (a line graph with one factor on the x-axis and separate lines for each level of the other factor) makes this visible. Parallel lines suggest no interaction. Lines that cross or diverge sharply suggest the factors combine in ways that matter. Start with the interaction, and only interpret main effects independently if the interaction is non-significant or if the main effects hold consistently across all conditions.
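The “effect of one factor depends on the other” idea can be checked directly from the cell means as a difference of differences. A sketch using hypothetical means for a drug/placebo by CBT/waitlist design:

```python
# Interaction as a difference of differences (hypothetical cell means).

cell_means = {
    ("drug",    "cbt"):      22.0,
    ("drug",    "waitlist"): 12.0,
    ("placebo", "cbt"):      10.0,
    ("placebo", "waitlist"):  9.0,
}

drug_effect_cbt = cell_means[("drug", "cbt")] - cell_means[("placebo", "cbt")]
drug_effect_wait = cell_means[("drug", "waitlist")] - cell_means[("placebo", "waitlist")]

# A nonzero difference of differences is the interaction contrast;
# zero would correspond to parallel lines on the interaction plot.
interaction = drug_effect_cbt - drug_effect_wait
print(f"drug effect under CBT: {drug_effect_cbt}, under waitlist: {drug_effect_wait}")
print(f"interaction contrast: {interaction}")
```

Here the drug helps by 12 points under CBT but only 3 points under the waitlist, so the lines would diverge sharply rather than run parallel.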

Checking Assumptions Before You Trust the Results

ANOVA results are only reliable if your data meet certain conditions. Checking these assumptions before you interpret the results is standard practice, and most software makes it straightforward.

Normality

ANOVA assumes the residuals (the differences between each data point and its group mean) follow a roughly normal distribution. You can check this visually with a histogram, boxplot, or Q-Q plot, where points should fall along a straight diagonal line. For a formal test, the Shapiro-Wilk test is the most commonly recommended option. It tests whether your data deviate from normality; a non-significant result (p > 0.05) means you have no evidence of a problem. Visual inspection and a formal test together give the most reliable assessment.

ANOVA is fairly robust to mild violations of normality, especially with larger samples. Severe skewness or heavy outliers in small samples are more concerning.
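Shapiro-Wilk itself needs a statistics package, but the Q-Q idea can be sketched with only the standard library: sort the residuals, pair them with the normal quantiles they should match, and see how close the relationship is to a straight line. The residuals below are hypothetical:

```python
# A stdlib-only Q-Q check: correlate sorted residuals with normal quantiles.

from statistics import NormalDist, mean, stdev

residuals = [-1.9, -1.2, -0.8, -0.4, -0.1, 0.2, 0.5, 0.9, 1.3, 1.8]

n = len(residuals)
sample = sorted(residuals)
# Theoretical normal quantiles at plotting positions (i + 0.5) / n
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Pearson correlation between sample and theoretical quantiles:
# values near 1 mean the Q-Q points sit close to a straight line.
mx, my = mean(sample), mean(theoretical)
r = (sum((x - mx) * (y - my) for x, y in zip(sample, theoretical))
     / ((n - 1) * stdev(sample) * stdev(theoretical)))
print(f"Q-Q correlation: {r:.3f}")
```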

Equal Variances

The groups should have roughly similar spread. Levene’s test checks this formally: a significant result means the variances differ enough to be a concern. If that happens, you can use a corrected version of ANOVA (Welch’s ANOVA) that doesn’t require equal variances.
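Levene’s statistic is nothing exotic: it is a one-way ANOVA F computed on the absolute deviations of each score from its own group mean (a common variant uses group medians instead). A sketch with hypothetical groups, the middle one deliberately more spread out:

```python
# Levene's test statistic: an ANOVA F on absolute deviations from group means.

from statistics import mean

def levene_w(groups):
    # Transform each score into its absolute deviation from the group mean
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    all_z = [v for g in z for v in g]
    n, k = len(all_z), len(z)
    grand = mean(all_z)
    # Then run ordinary one-way ANOVA arithmetic on the transformed scores
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    ss_within = sum(sum((v - mean(g)) ** 2 for v in g) for g in z)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[4, 5, 6, 5], [2, 8, 11, 7], [9, 8, 10, 9]]
w = levene_w(groups)
print(f"Levene's W = {w:.2f}")   # compare against the F(k-1, N-k) distribution
```

A large W (and a significant p-value) would flag unequal spread, which is the signal to switch to Welch’s ANOVA.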

Independence

Each observation should be independent of the others. This is a design issue rather than something you test statistically. If the same participants appear in multiple groups (repeated measures), you need a repeated-measures ANOVA instead of a standard one-way ANOVA.

Reporting Your Results

A complete ANOVA result in a paper or report typically follows a standard format. For a one-way ANOVA, it looks something like: F(2, 87) = 6.41, p = .003, η² = .13. The numbers in parentheses are the degrees of freedom (between-groups, within-groups). Then comes the F value, the exact p-value, and an effect size measure.

If the overall F is significant, follow it with your post-hoc results, specifying which group means differed, the direction of the differences, and the adjusted p-values. Including group means and standard deviations in a table gives readers everything they need to evaluate your findings independently. The combination of a significant F, a meaningful effect size, and clearly identified group differences is what turns an ANOVA from a single number into a complete story about your data.