How to Use ANOVA: Steps, Assumptions, and Results

ANOVA (analysis of variance) is a statistical test that tells you whether the average values of three or more groups are meaningfully different from each other. If you only have two groups to compare, a simple t-test works fine. Once you have three or more, ANOVA becomes the right tool because running multiple t-tests inflates your chance of a false positive. The core idea is straightforward: ANOVA compares how much variation exists between your groups to how much variation exists within them, and produces a single number (the F-statistic) that tells you whether the group differences are larger than you’d expect by chance.

How the F-Statistic Works

ANOVA breaks total variability in your data into two pieces. The first is between-group variance, which captures how spread out the group averages are from the overall average. The second is within-group variance (also called error), which captures how spread out individual data points are within each group. The F-statistic is simply the ratio of between-group variance to within-group variance.

When F is close to 1, it means the differences between your groups are about the same size as the random noise within them, and there’s no real evidence of a group effect. When F is much larger than 1, the group differences are bigger than you’d expect from noise alone, and the associated p-value will be small. A p-value below 0.05 is the conventional threshold for concluding that at least one group differs from the others.

Critically, a significant ANOVA result only tells you that some difference exists somewhere among the groups. It does not tell you which groups differ. That’s what post-hoc tests are for.
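To make the ratio concrete, here is a small sketch that computes the F-statistic by hand and checks it against SciPy's `f_oneway`. The scores are made up for the example:

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for three groups
a = np.array([78, 82, 75, 80, 79])
b = np.array([85, 88, 84, 90, 86])
c = np.array([70, 74, 68, 72, 71])

groups = [a, b, c]
grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                            # number of groups
n_total = sum(len(g) for g in groups)      # total observations

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of points around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy, p = stats.f_oneway(a, b, c)
print(f_manual, f_scipy, p)
```

The two F values agree, and because the group means here are far apart relative to the within-group noise, F is much larger than 1 and p is small.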

Choosing the Right Type of ANOVA

The type of ANOVA you need depends on two things: how many factors (independent variables) you’re testing, and whether the same people appear in multiple groups.

One-way ANOVA is the simplest version. You have one categorical independent variable with three or more levels and one continuous outcome. For example, comparing average test scores across three different teaching methods. Each participant belongs to only one group.

Two-way ANOVA adds a second factor. You might compare test scores across teaching methods and across class sizes simultaneously. The advantage here is that you can also detect interaction effects, where the impact of one factor depends on the level of the other.

Repeated measures ANOVA is for situations where the same participants are measured more than once on the same outcome. This covers two common designs: tracking changes over multiple time points (measuring blood pressure before, during, and after an exercise program) or testing the same people under different conditions (having the same participants rate three different products). Because the same individuals contribute data to every group, this design is more statistically powerful and requires fewer total participants.
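A repeated measures design can be run with statsmodels' `AnovaRM`. This sketch uses hypothetical ratings in which six participants each rate three products, so every participant contributes three data points:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical data: 6 participants each rate products A, B, and C
data = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5, 6] * 3,
    "product": ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "rating":  [7, 6, 8, 5, 7, 6,    # product A
                8, 7, 9, 6, 8, 7,    # product B
                5, 4, 6, 4, 5, 5],   # product C
})

# Each subject appears once per product level, so AnovaRM can pair the scores
result = AnovaRM(data, depvar="rating", subject="subject",
                 within=["product"]).fit()
print(result)
```

Because each subject's scores shift consistently across products, the within-subject error is small and the test detects the effect with only six participants, illustrating the power advantage described above.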

Assumptions You Need to Check First

ANOVA is a parametric test, which means it makes specific assumptions about your data. Running ANOVA when these assumptions are violated can produce misleading results. You need to verify three things before interpreting your output.

Independence

Each observation must be independent of the others. One person’s score shouldn’t influence another’s. This is primarily a design issue rather than something you test statistically. If participants were measured in groups where they could influence each other, or if your data contains repeated measurements that you’re treating as independent, this assumption is violated. Repeated measures ANOVA handles the repeated measurement scenario specifically, so use that version when participants contribute more than one data point.

Normality

The data within each group should be approximately normally distributed. You can check this visually with histograms or Q-Q plots, or formally with tests like the Shapiro-Wilk test. A non-significant result (p > 0.05) on the Shapiro-Wilk test suggests your data don’t deviate meaningfully from a normal distribution. In practice, ANOVA is fairly robust to mild violations of normality, especially with larger sample sizes. With very small or heavily skewed samples, a non-parametric alternative like the Kruskal-Wallis test is safer.
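A Shapiro-Wilk check is one line per group in SciPy. The scores below are invented for illustration; in practice you would run this on each of your groups:

```python
from scipy import stats

# Hypothetical scores for one group
group_a = [78, 82, 75, 80, 79, 81, 77, 83]

stat, p = stats.shapiro(group_a)
print(stat, p)
# Interpret as above: p > 0.05 suggests no meaningful
# departure from normality; p < 0.05 suggests a violation
```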

Homogeneity of Variance

The spread of scores within each group should be roughly equal. Levene’s test checks this: if the p-value is above 0.05, you can assume variances are similar enough to proceed. If Levene’s test is significant, your groups have unequal variances, and you’ll want to switch to an alternative such as Welch’s ANOVA, which doesn’t assume equal variances.
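Levene’s test is also a one-liner in SciPy, taking all groups at once (again with made-up scores):

```python
from scipy import stats

# Hypothetical scores for three groups
a = [78, 82, 75, 80, 79]
b = [85, 88, 84, 90, 86]
c = [70, 74, 68, 72, 71]

# SciPy's levene centers on the median by default,
# which is robust to skewed data
stat, p = stats.levene(a, b, c)
print(stat, p)
# p > 0.05: variances similar enough for standard ANOVA
# p < 0.05: consider Welch's ANOVA instead
```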

Running the Analysis Step by Step

Regardless of which software you use (SPSS, R, Excel, Python, or others), the workflow follows the same sequence.

  • Set up your data correctly. You need one column for your continuous outcome variable (the thing you measured) and one column for your grouping variable (which group each observation belongs to). Each row represents one observation.
  • Check assumptions. Run normality and variance tests as described above. If assumptions are met, proceed with standard ANOVA. If not, choose an appropriate alternative or correction.
  • Run the ANOVA. The output will give you an F-statistic, degrees of freedom, and a p-value. If p is below your chosen significance level (typically 0.05), you have evidence that at least one group mean differs from the others.
  • Calculate effect size. A significant p-value doesn’t tell you how large the difference is. Eta-squared is the most common effect size measure for ANOVA. It represents the proportion of total variance explained by the group variable. An eta-squared of 0.01 is considered small, 0.06 is medium, and 0.14 is large.
  • Run post-hoc tests if the overall ANOVA is significant. This identifies exactly which group pairs differ.
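The middle steps of this workflow, running the test and computing eta-squared, can be sketched as follows. Eta-squared is just the between-group sum of squares divided by the total sum of squares (data invented for the example):

```python
import numpy as np
from scipy import stats

# Hypothetical outcome scores, one array per group
groups = [
    np.array([78, 82, 75, 80, 79]),
    np.array([85, 88, 84, 90, 86]),
    np.array([70, 74, 68, 72, 71]),
]

# Step: run the one-way ANOVA
f, p = stats.f_oneway(*groups)

# Step: effect size (eta-squared) = SS_between / SS_total
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total

print(f"F = {f:.2f}, p = {p:.4f}, eta-squared = {eta_sq:.3f}")
```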

Picking the Right Post-Hoc Test

Once ANOVA tells you a difference exists, post-hoc tests compare every possible pair of groups while controlling the overall error rate. The choice depends on your situation.

Tukey’s HSD (honestly significant difference) is the most widely used option when you want to compare all possible pairs of groups. It’s straightforward, controls the overall error rate well, and is considered the default choice for standard pairwise comparisons. If your groups have unequal sizes, the Tukey-Kramer modification handles that.

Bonferroni correction works best when you have a small number of planned comparisons, roughly fewer than ten. It adjusts your significance threshold by dividing 0.05 by the number of comparisons you’re making. This is a conservative approach, and it loses statistical power quickly as the number of comparisons grows. Use it when you decided in advance which specific pairs you wanted to test, not when you’re exploring all possible combinations.

ScheffĂ©’s procedure is the most conservative option and is designed for completely exploratory analysis where you had no specific comparisons planned beforehand. It controls errors well but is the least powerful, meaning it’s the hardest to get a significant result from. For most standard pairwise comparison situations, Tukey’s HSD or Bonferroni will serve you better.

Sample Size and Statistical Power

Before collecting data, it’s worth doing a power analysis to make sure you’ll have enough observations to detect a real effect if one exists. Four factors determine the sample size you need: the expected effect size, your significance level (alpha, typically 0.05), the desired power (typically 0.80, meaning an 80% chance of detecting a true effect), and the number of groups.

Smaller effects require more participants to detect. If you expect a large difference between groups, you can get away with fewer people per group. If you expect a subtle difference, you need substantially more. Cohen’s f is the standard effect size measure used in ANOVA power calculations, and most statistical software packages include power analysis tools that will calculate the required sample size once you input these parameters. Running an underpowered study risks missing real effects entirely, so this step is worth the effort upfront.
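As one way to run this calculation, statsmodels provides `FTestAnovaPower`, which solves for the total sample size given the other parameters. This sketch assumes a medium effect (Cohen’s f = 0.25) across three groups:

```python
from statsmodels.stats.power import FTestAnovaPower

# Solve for total N given effect size, alpha, power, and group count
n_total = FTestAnovaPower().solve_power(
    effect_size=0.25,   # Cohen's f, a medium effect
    alpha=0.05,
    power=0.80,
    k_groups=3,
)
print(n_total)  # total sample size across all groups; round up per group
```

Halving the expected effect size to f = 0.125 roughly quadruples the required sample, which is why honest effect size estimates matter before data collection.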

Interpreting Your Results

A complete ANOVA result includes several pieces of information. Here’s what to look at and what it means.

The F-statistic and its p-value tell you whether the overall test is significant. A typical way to report this would be: F(2, 87) = 5.43, p = 0.006. The numbers in parentheses are degrees of freedom. The first reflects the number of groups minus one, and the second reflects the total number of observations minus the number of groups. To illustrate what this looks like in practice, a clinical study tracking hemoglobin levels across four time points during an anemia treatment reported F(2.438, 1026.279) = 62.210, p < 0.0001, showing that hemoglobin levels changed significantly over the 45-day treatment period, with an average increase of 0.73 g/dl. (The fractional degrees of freedom there come from a sphericity correction, a common adjustment in repeated measures designs.)
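The degrees of freedom arithmetic for the standard F(2, 87) report works out as:

```python
k = 3    # number of groups
n = 90   # total observations

df_between = k - 1   # numerator degrees of freedom
df_within = n - k    # denominator degrees of freedom
print(df_between, df_within)  # 2 87, matching F(2, 87)
```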

Effect size tells you how meaningful the difference is in practical terms. A significant p-value with a tiny effect size means the difference is real but possibly too small to matter. An eta-squared of 0.06 means your grouping variable explains about 6% of the total variation in outcomes, which is a medium-sized effect.

Post-hoc comparisons then give you the specific picture: group A differs from group C (p = 0.003), but neither differs from group B (p = 0.21 and p = 0.34). This granular information is what actually answers your research question.

Common Mistakes to Avoid

The most frequent error is running multiple t-tests instead of ANOVA. With three groups, that’s three separate t-tests, and your real chance of a false positive rises from 5% to about 14%. With more groups, it gets worse. ANOVA solves this by testing all groups simultaneously.
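The inflation figure follows from the probability of at least one false positive across independent tests at alpha = 0.05:

```python
alpha = 0.05

# Three groups means three pairwise t-tests (A-B, A-C, B-C)
m = 3
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 4))  # 0.1426, i.e. about 14%

# With six groups there are 15 pairs, and the problem gets much worse
familywise_6 = 1 - (1 - alpha) ** 15
print(round(familywise_6, 2))
```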

Another common mistake is stopping at the overall F-test. A significant ANOVA only tells you something differs. Without post-hoc tests, you don’t know what. Conversely, running post-hoc tests when the overall ANOVA is not significant is unnecessary and can produce misleading results.

Ignoring effect size is a subtler problem. With a large enough sample, even trivial differences become statistically significant. Always pair your p-value with an effect size to judge whether the difference is practically important, not just statistically detectable.

Finally, treating ANOVA results as proof of causation is a mistake unless your data come from a true randomized experiment. If participants self-selected into groups or if groups differed in ways beyond the variable you’re studying, the group differences could reflect those other factors rather than the one you care about.