A post hoc test is needed after an ANOVA (or similar omnibus test) produces a statistically significant result and you need to find out which specific groups differ from each other. ANOVA tells you that at least one group mean is different from the others, but it doesn’t tell you which ones. The post hoc test fills that gap by comparing groups in pairs while controlling for the inflated error risk that comes with making multiple comparisons.
Why ANOVA Alone Isn’t Enough
When you compare three or more groups, ANOVA gives you a single p-value for the overall F-test. If that p-value falls below your significance threshold (typically 0.05), you know something is going on, but the test is intentionally vague about where. Imagine testing whether three different study methods produce different exam scores. A significant ANOVA result means the methods aren’t all equal, but it won’t tell you whether Method A beat Method B, Method A beat Method C, or both.
That’s where the post hoc test comes in. It takes each possible pair of groups, calculates the difference between their means, and determines whether that difference is large enough to be considered statistically significant. In an experiment with three groups, that means three pairwise comparisons. With four groups, it jumps to six. With five, ten. The number of comparisons grows quickly, and that creates a problem.
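The growth in comparisons follows directly from counting pairs: k groups yield k(k − 1)/2 pairwise comparisons, which is the binomial coefficient "k choose 2". A quick sketch:

```python
from math import comb

# Number of pairwise comparisons for k groups: k * (k - 1) / 2, i.e. C(k, 2)
for k in range(3, 8):
    print(f"{k} groups -> {comb(k, 2)} pairwise comparisons")
```

Three groups give 3 pairs, four give 6, five give 10, exactly as described above.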
The Multiple Comparisons Problem
You might wonder why you can’t just run a series of simple two-group tests instead of bothering with a post hoc procedure. The reason is error accumulation. Every time you run a statistical test at the 0.05 level, there’s a 5% chance of a false positive. Run one test and you’re fine. Run ten, and the probability that at least one of them produces a false positive climbs well above 5%. This overall false positive rate across a set of tests is called the family-wise error rate.
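The accumulation is easy to quantify. Assuming the tests are independent, the chance of at least one false positive across n tests at level alpha is 1 − (1 − alpha)^n:

```python
# Family-wise error rate for n independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^n
alpha = 0.05
for n in (1, 3, 10):
    fwer = 1 - (1 - alpha) ** n
    print(f"{n} tests -> family-wise error rate ~ {fwer:.3f}")
```

At ten tests, the family-wise error rate is already around 40%, far above the nominal 5%.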
Post hoc tests are specifically designed to keep that family-wise error rate in check. They do this by applying stricter criteria to each individual comparison, so that even after running all of them, the total risk of any false positive stays at or below 0.05. The tradeoff is reduced statistical power: you’re less likely to detect a real difference. But that’s a far better tradeoff than reporting differences that don’t actually exist.
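The simplest version of a "stricter criterion" is Bonferroni-style: test each comparison at alpha divided by the number of comparisons. A minimal sketch, using hypothetical p-values from six pairwise comparisons:

```python
# Hypothetical raw p-values from six pairwise comparisons
raw_p = [0.004, 0.020, 0.030, 0.008, 0.200, 0.045]
alpha = 0.05
n = len(raw_p)

# Bonferroni: test each comparison at alpha / n instead of alpha,
# which caps the family-wise error rate at alpha
threshold = alpha / n
for p in raw_p:
    print(f"p = {p:.3f} -> significant at {threshold:.4f}: {p < threshold}")
```

Note how comparisons that would clear 0.05 on their own (such as 0.020 or 0.045) no longer count as significant, which is the power cost mentioned above.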
Two Conditions That Must Be Met
A post hoc test is appropriate only when two conditions are true. First, your ANOVA must be significant. If the overall F test isn’t significant, there’s no statistical basis for digging into pairwise differences. Running post hoc tests after a non-significant ANOVA is like searching a room after being told nothing is in it.
Second, you need three or more groups. If you’re comparing exactly two groups, ANOVA and a standard t-test give you the same information. A significant result already tells you which two groups differ, because there’s only one possible pair. Post hoc testing adds nothing in that scenario.
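The two-group equivalence is exact: with two groups, the one-way ANOVA F statistic equals the squared t statistic from a standard (equal-variance) t-test, and the p-values match. A small demonstration with hypothetical scores:

```python
from scipy import stats

# Two hypothetical groups of exam scores
a = [78, 85, 90, 72, 88, 81]
b = [70, 75, 68, 74, 80, 72]

f_stat, p_anova = stats.f_oneway(a, b)
t_stat, p_ttest = stats.ttest_ind(a, b)  # equal-variance t-test by default

# With exactly two groups, F = t^2 and the p-values coincide
print(f"ANOVA:  F = {f_stat:.3f}, p = {p_anova:.4f}")
print(f"t-test: t = {t_stat:.3f}, p = {p_ttest:.4f}")
```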
Post Hoc Tests vs. Planned Comparisons
Not every multi-group study calls for a post hoc test. If you decided before collecting data that you only cared about specific comparisons (say, each treatment group versus a control), you can use planned comparisons instead. These are also called a priori contrasts, and they carry two advantages: they’re more powerful, meaning they’re better at detecting real differences, and they force you to think carefully about your research design before looking at results.
Post hoc tests, by contrast, are exploratory. They test every possible pair without any advance prediction. This makes them appropriate when you don’t have strong hypotheses about which groups will differ, or when you want a comprehensive picture of all the differences in your data. The phrase post hoc is Latin for “after this,” reflecting the fact that these comparisons are decided after seeing a significant omnibus result.
Choosing the Right Post Hoc Test
Several post hoc procedures exist, and the right choice depends on your data and your research question.
- Tukey’s HSD is the most widely used option when you want to compare every group to every other group. It provides a single critical value: if the difference between any two group means exceeds that value, the pair is significantly different. For standard pairwise comparisons, this is generally the default recommendation.
- Bonferroni correction works best when you’re making a small number of comparisons or testing only selected pairs rather than all possible pairs. It simply divides the significance threshold by the number of comparisons. The limitation is that it becomes overly conservative as the number of comparisons grows, making it harder to detect real effects.
- ScheffĂ©’s test is designed for complex comparisons, not just pairs. If you want to compare combinations of groups (for example, the average of groups A and B against group C), ScheffĂ© handles that. For simple pairwise comparisons, though, it’s less powerful than other options and generally not recommended.
When Your Data Violates Assumptions
Standard post hoc tests like Tukey’s HSD assume that all groups have roughly equal variance (spread). When that assumption is violated, meaning some groups are much more variable than others, you need a different set of tests. Games-Howell is a common choice for larger samples (50 or more per group), while Dunnett’s T3 is preferred for smaller samples. These procedures don’t assume equal variances, so they give more reliable results when your groups differ in variability.
If your data has unequal variances, you’d typically run a Welch ANOVA or Brown-Forsythe ANOVA instead of a standard ANOVA, and then follow up with one of these adjusted post hoc tests.
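Games-Howell itself isn’t in SciPy, but the underlying idea can be sketched with pairwise Welch t-tests (which don’t assume equal variances) plus a Bonferroni adjustment. This is a simplified stand-in, not the full Games-Howell procedure, and the data are hypothetical:

```python
from itertools import combinations
from scipy import stats

# Hypothetical groups with clearly unequal variances
groups = {
    "A": [50, 52, 51, 49, 53],
    "B": [60, 70, 55, 75, 65],
    "C": [80, 81, 79, 82, 80],
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold

for g1, g2 in pairs:
    # equal_var=False requests Welch's t-test, which tolerates unequal variances
    t, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False)
    print(f"{g1} vs {g2}: p = {p:.4f}, significant: {p < alpha}")
```

Games-Howell refines this by using the studentized range distribution with Welch-adjusted degrees of freedom, which makes it less conservative than a plain Bonferroni correction.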
Reading Post Hoc Results
A post hoc output typically gives you a table showing every pair of groups, the difference between their means, and whether that difference is statistically significant. For example, if three teaching methods produce average test scores of 43.7, 63.2, and 84.3, the post hoc test would compare the three possible pairs: the difference of 40.6 between the first and third group, the 21.1-point gap between the second and third, and the 19.5-point difference between the first and second.
Each of those differences is compared against a critical value that the procedure calculates. If a mean difference exceeds the critical value, that pair of groups is significantly different. If it falls short, you can’t conclude those two groups differ. It’s common for some pairs to be significant and others not. That’s exactly the kind of nuance the omnibus ANOVA can’t provide on its own, and it’s the reason post hoc tests exist.
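The comparison logic reduces to checking each absolute mean difference against the procedure’s critical value. Using the group means from the example above and a hypothetical critical value of 20.0 (a real procedure computes this from the data):

```python
from itertools import combinations

# Group means from the teaching-methods example
means = {"Method 1": 43.7, "Method 2": 63.2, "Method 3": 84.3}

# Hypothetical critical value; Tukey's HSD would derive this from the
# within-group variability, group sizes, and the studentized range distribution
critical_value = 20.0

for (n1, m1), (n2, m2) in combinations(means.items(), 2):
    diff = abs(m2 - m1)
    verdict = "significant" if diff > critical_value else "not significant"
    print(f"{n1} vs {n2}: difference = {diff:.1f} -> {verdict}")
```

With this illustrative threshold, the 40.6- and 21.1-point gaps clear it while the 19.5-point gap does not: a mix of significant and non-significant pairs, as described above.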

