What Is a Tukey Test? Post-Hoc ANOVA Explained

A Tukey test is a statistical method used to figure out which specific group averages differ from each other after an ANOVA (analysis of variance) has already told you that at least one difference exists somewhere. It’s one of the most widely used “post-hoc” tests in statistics, meaning it comes after the main analysis to pinpoint exactly where the differences lie.

Think of it this way: ANOVA is like a smoke detector that tells you there’s a fire somewhere in the building. The Tukey test is what tells you which room it’s in.

Why ANOVA Alone Isn’t Enough

When you compare three or more groups, ANOVA tests whether all the group averages are equal or whether at least one is different. If the result is significant, you know something is going on, but ANOVA doesn’t tell you which pairs of groups actually differ. Are groups A and B different? B and C? All three from each other?

You might think you could just run a bunch of individual comparisons between every pair. The problem is that each comparison carries a small risk of a false positive (typically 5%), and those risks stack up fast. With five groups, you’d need 10 pairwise comparisons, and your overall chance of at least one false positive balloons well beyond 5%. This is called the family-wise error rate, and it’s the core problem the Tukey test solves.
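The “balloons well beyond 5%” claim is easy to verify with a little arithmetic. Assuming the comparisons were independent (an approximation, but good enough to show the problem), the chance of at least one false positive is 1 minus the chance of zero false positives:

```python
from math import comb

# Family-wise error rate when running many tests at alpha = 0.05 each.
alpha = 0.05
groups = 5
n_comparisons = comb(groups, 2)  # 5 groups -> 10 pairwise comparisons

# Probability of at least one false positive across all comparisons,
# assuming independence between tests (an approximation).
fwer = 1 - (1 - alpha) ** n_comparisons
print(n_comparisons)       # 10
print(round(fwer, 3))      # about 0.401
```

With five groups, roughly a 40% chance of at least one spurious “significant” finding, eight times the 5% you intended.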

How the Tukey Test Controls False Positives

The Tukey test, formally called Tukey’s Honestly Significant Difference (HSD), keeps your overall false-positive rate at whatever threshold you set (usually 5%) across all pairwise comparisons simultaneously. It does this by using a special distribution called the studentized range distribution rather than the t-distribution used in ordinary pairwise tests. This distribution accounts for the fact that you’re making multiple comparisons at once, so it sets a higher bar for calling any single comparison “significant.”

This is what makes it different from just running repeated tests. The Tukey method is specifically designed for situations where you want to compare every group to every other group, and it adjusts for all of those comparisons in one step.

How It Works in Practice

The basic idea is straightforward. The test calculates a critical “yardstick” value, often labeled w. If the difference between any two group averages exceeds this yardstick, those groups are considered significantly different. If the difference falls short, you can’t conclude they differ.

The yardstick value depends on three things: the number of groups being compared, the total number of observations, and the amount of variability within the groups (pulled from the ANOVA results). More groups or more variability means a larger yardstick, making it harder to declare a difference significant. More observations make the yardstick smaller, giving you more power to detect real differences.
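For a balanced design, the yardstick is w = q · √(MSE / n), where q is a critical value from the studentized range distribution, MSE is the within-group mean square from the ANOVA table, and n is the per-group sample size. Here is a minimal sketch using SciPy’s `studentized_range`; the MSE and sample size are made-up numbers, not taken from the article’s example:

```python
import math
from scipy.stats import studentized_range

# Sketch of the HSD "yardstick": w = q * sqrt(MSE / n).
# The MSE and per-group n below are hypothetical.
k = 4                    # number of groups
n = 10                   # observations per group (balanced design assumed)
mse = 4.0                # mean square error from the ANOVA table (made up)
df_error = k * (n - 1)   # error degrees of freedom = 36

# Critical value of the studentized range at a 5% family-wise error rate
q_crit = studentized_range.ppf(0.95, k, df_error)

# Any pair of group means farther apart than w is significantly different
w = q_crit * math.sqrt(mse / n)
print(round(w, 3))
```

You can see the dependencies described above directly in the formula: a larger k raises q, a larger MSE inflates the square root, and a larger n shrinks it.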

Here’s a simplified walkthrough. Say you tested four fertilizers on plant growth and got group averages of 21.0, 25.9, 28.6, and 29.2. After computing the yardstick (suppose it’s 2.824), you’d compare every pair:

  • 29.2 minus 21.0 = 8.2, which exceeds 2.824: significantly different
  • 29.2 minus 25.9 = 3.3, which exceeds 2.824: significantly different
  • 28.6 minus 25.9 = 2.7, which does not exceed 2.824: not significantly different
  • 28.6 minus 21.0 = 7.6, which exceeds 2.824: significantly different

You’d work through all six possible pairs this way. The results are often summarized with letter labels: groups sharing the same letter aren’t significantly different from each other, while groups with different letters are.
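The walkthrough above can be sketched in a few lines, using the hypothetical fertilizer means and the yardstick value from the example (group labels A–D are added here just for readability):

```python
from itertools import combinations

# Pairwise check from the walkthrough: hypothetical fertilizer means
# and the example yardstick value w = 2.824.
means = {"A": 21.0, "B": 25.9, "C": 28.6, "D": 29.2}
w = 2.824

significant = {}
for (g1, m1), (g2, m2) in combinations(means.items(), 2):
    diff = abs(m1 - m2)
    significant[(g1, g2)] = diff > w
    verdict = "significantly different" if diff > w else "not significantly different"
    print(f"{g1} vs {g2}: |{m1} - {m2}| = {diff:.1f} -> {verdict}")
```

Running this covers all six pairs, including the two the bullet list above leaves out (A vs B at 4.9, and C vs D at 0.6, which falls short of the yardstick).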

Reading the Output

Most statistical software reports Tukey test results as a table showing each pair of groups, the difference between their averages, a confidence interval for that difference, and a p-value. The confidence intervals are the most intuitive piece to interpret.

If a confidence interval for a given pair does not contain zero, that pair is significantly different. Zero in the interval means the true difference could plausibly be nothing at all, so you can’t claim the groups differ. This logic works the same way as checking whether the difference exceeds the yardstick value; it’s just a different way of presenting the same information. A p-value below your threshold (typically 0.05) will always correspond to a confidence interval that excludes zero.
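SciPy’s `tukey_hsd` produces exactly this kind of output. The sketch below runs it on made-up data for three groups (two deliberately close together, one far away) and applies the “zero in the interval” rule; the group means and sample sizes are assumptions for illustration only:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Hypothetical measurements for three groups, to show how tukey_hsd's
# output maps onto the "zero in the interval" rule.
rng = np.random.default_rng(0)
a = rng.normal(10.0, 1.0, size=12)
b = rng.normal(10.2, 1.0, size=12)   # close to a
c = rng.normal(14.0, 1.0, size=12)   # far from both

res = tukey_hsd(a, b, c)
ci = res.confidence_interval(confidence_level=0.95)

# A pair differs significantly exactly when its interval excludes zero,
# which agrees with p < 0.05.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    low, high = ci.low[i, j], ci.high[i, j]
    sig = low > 0 or high < 0
    print(f"groups {i} vs {j}: CI=({low:.2f}, {high:.2f}), "
          f"p={res.pvalue[i, j]:.4f}, significant={sig}")
```

Whatever the data, the significance verdict from the interval and from the p-value will always agree, since both come from the same studentized range distribution.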

Assumptions You Need to Meet

The Tukey test requires three conditions. First, the observations must be independent, meaning one measurement doesn’t influence another. Second, the data within each group should be roughly normally distributed. Third, the groups need to have similar amounts of variability, a property called homogeneity of variance. If one group’s data is far more spread out than another’s, the test can give misleading results.

The classic Tukey HSD also assumes equal sample sizes across groups. When your groups have different numbers of observations, you can use a modified version called the Tukey-Kramer method, which adjusts the standard error for each pair based on the two groups’ sample sizes. This adjustment is built into most statistical software (R, SAS, SPSS, and others) and runs automatically when group sizes are unequal. The Tukey-Kramer method tends to be slightly conservative with unbalanced designs, meaning it’s a bit less likely to detect real differences, but it still controls the false-positive rate reliably.
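Concretely, Tukey-Kramer replaces the single yardstick with a pair-specific one, w_ij = q · √((MSE / 2) · (1/n_i + 1/n_j)). A minimal sketch, with a hypothetical MSE and group sizes chosen to be unequal:

```python
import math
from itertools import combinations
from scipy.stats import studentized_range

# Tukey-Kramer sketch: with unequal group sizes the yardstick becomes
# pair-specific. The MSE and group sizes below are hypothetical.
mse = 4.0
sizes = {"A": 8, "B": 15, "C": 30}
k = len(sizes)
df_error = sum(sizes.values()) - k   # 53 - 3 = 50

q_crit = studentized_range.ppf(0.95, k, df_error)

# Per-pair critical difference: q * sqrt((MSE / 2) * (1/n_i + 1/n_j)).
yardsticks = {}
for (g1, n1), (g2, n2) in combinations(sizes.items(), 2):
    yardsticks[(g1, g2)] = q_crit * math.sqrt((mse / 2) * (1 / n1 + 1 / n2))
    print(f"{g1} vs {g2}: critical difference = {yardsticks[(g1, g2)]:.3f}")
```

Notice that pairs involving the smallest group get the largest yardstick: comparisons with less data need a bigger observed difference before they count as significant.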

Tukey vs. Other Post-Hoc Tests

The Tukey test sits in the middle of the spectrum between liberal and conservative post-hoc methods. Understanding where it falls helps you choose the right tool.

Fisher’s Least Significant Difference (LSD) is the most liberal option. It’s essentially just running standard pairwise comparisons without correcting for the number of tests. It finds the most “significant” differences, but it doesn’t control the family-wise error rate, so some of those findings are likely false positives. It’s generally not recommended unless you have only three groups and your ANOVA was already significant.

The Bonferroni correction takes the opposite approach, dividing your significance threshold by the number of comparisons. It’s stricter than Tukey, which means fewer false positives but also less statistical power. As the number of groups grows, Bonferroni becomes increasingly conservative and can miss real differences.
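The Bonferroni arithmetic is simple enough to show directly. Reusing the five-group scenario from earlier:

```python
from math import comb

# Bonferroni sketch: each of the m pairwise tests runs at alpha / m.
# With 5 groups there are 10 comparisons, so each test uses 0.005.
alpha = 0.05
m = comb(5, 2)
per_test_alpha = alpha / m
print(m, per_test_alpha)   # 10 comparisons, threshold 0.005 each
```

As the group count grows, m grows quadratically, so the per-test threshold shrinks fast, which is exactly why Bonferroni loses power relative to Tukey for all-pairs comparisons.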

ScheffĂ©’s method is the most conservative of the common options. It controls the error rate not just for pairwise comparisons but for all possible contrasts between groups (including complex combinations). That extra protection comes at a cost: it has the least power to detect simple pairwise differences. If you’re only comparing pairs, ScheffĂ© is overkill.

For most situations where you want to compare every group to every other group, the Tukey test is the standard recommendation. It offers the best balance of false-positive control and the ability to detect real differences when all pairwise comparisons are of interest.

When to Use a Tukey Test

The Tukey test is the right choice when you’ve run a one-way ANOVA, gotten a significant result, and want to know which specific groups differ. It’s ideal when you’re interested in all possible pairwise comparisons rather than just comparing each group to a single control (for that, Dunnett’s test is more appropriate and more powerful).

It’s commonly used in fields like agriculture, psychology, medicine, and engineering: anywhere researchers compare multiple treatments or conditions and need to identify which ones actually produce different outcomes. If your analysis involves comparing group averages and you’ve met the assumptions of independence, normality, and similar variability across groups, the Tukey HSD is typically the first post-hoc test to reach for.