The Tukey method, also known as Tukey’s Honestly Significant Difference (HSD) test, is a statistical test used after an ANOVA to determine exactly which groups differ from each other. When you compare three or more groups, a significant ANOVA result only tells you that not all the group means are equal; it doesn’t tell you where the differences lie. The Tukey method fills that gap by testing every possible pair of groups simultaneously while keeping your risk of a false positive under control.
Why ANOVA Alone Isn’t Enough
Suppose you’re comparing the effectiveness of four different fertilizers on plant growth. You run an ANOVA and get a significant result. That tells you the four fertilizers don’t all produce the same growth, but it says nothing about which fertilizer outperforms which. Is Fertilizer A better than B? Is C better than D? To answer those questions, you need a follow-up test, often called a post-hoc test. The Tukey method is the most commonly used one for this purpose.
With four groups, there are six possible pairs to compare. With five groups, there are ten. Each comparison carries a chance of producing a false positive, and as the number of comparisons grows, those chances stack up. If you simply ran individual tests on each pair at the 0.05 significance level, you’d expect a false “significant” result about 5% of the time per test. Across many tests, the overall risk of at least one false positive climbs well above 5%: with six independent tests it’s roughly 26%, and with ten, about 40%. This inflated risk is why running multiple uncorrected comparisons is considered unreliable.
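A quick way to see this inflation is to count the pairs and compound the per-test error rate. The sketch below is plain Python and assumes the individual tests are independent (in practice they aren’t exactly, but the qualitative picture holds):

```python
from math import comb

def familywise_error(k_groups, alpha=0.05):
    """Number of pairwise comparisons among k groups, and the chance of
    at least one false positive if each pair were tested separately,
    uncorrected, at the given alpha (independence assumed)."""
    n_pairs = comb(k_groups, 2)           # k choose 2 pairs
    fwer = 1 - (1 - alpha) ** n_pairs     # P(at least one false positive)
    return n_pairs, fwer

for k in (3, 4, 5):
    pairs, fwer = familywise_error(k)
    print(f"{k} groups: {pairs} pairs, family-wise error ~{fwer:.0%}")
```

Running it shows the risk climbing from about 14% with three groups to about 40% with five, which is exactly the problem the Tukey correction exists to solve.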
How the Tukey Method Controls False Positives
The core strength of the Tukey method is that it controls the family-wise error rate across all pairwise comparisons at once. If you set your significance level at 0.05, the probability of making even one false positive across all comparisons stays at or below 5%, not per test, but for the entire set of tests combined.
It achieves this by using the studentized range distribution rather than the standard t-distribution. This distribution describes how far apart the largest and smallest of a set of group means can be expected to fall by chance alone, so it accounts for the fact that you’re examining the full range of means, not just two at a time. From this distribution the method derives a single threshold, the “honestly significant difference,” and applies it to every pair: if the difference between two group means exceeds that threshold, the pair is considered significantly different.
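To make the threshold concrete: SciPy exposes the studentized range distribution, so the balanced-design cutoff (HSD = q × √(MSE / n)) can be sketched directly. The MSE, group count, and sample sizes below are hypothetical values, not from any real dataset:

```python
import math
from scipy.stats import studentized_range

def hsd_threshold(mse, n_per_group, k_groups, df_error, alpha=0.05):
    """Honestly significant difference for a balanced one-way design."""
    # critical value q from the studentized range distribution
    q_crit = studentized_range.ppf(1 - alpha, k_groups, df_error)
    # any pair of group means further apart than this is declared different
    return q_crit * math.sqrt(mse / n_per_group)

# hypothetical ANOVA: 4 groups of 10 observations, MSE = 5.2, error df = 36
print(hsd_threshold(mse=5.2, n_per_group=10, k_groups=4, df_error=36))
```

The same q critical value is applied to every pairwise difference, which is how one threshold controls the error rate for the whole family of comparisons.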
Reading Tukey Results
Most statistical software presents Tukey results as a table of pairwise comparisons. For each pair of groups, you’ll typically see the difference between their means, a confidence interval for that difference, and an adjusted p-value. The confidence interval is the key piece to interpret: if it does not contain zero, the two groups are significantly different. If zero falls within the interval, you can’t conclude those groups differ.
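The zero-in-the-interval rule is mechanical enough to script. The helper below is a plain-Python sketch; the matrix layout (lower and upper bounds indexed by group i vs. group j) mirrors how libraries such as SciPy return pairwise intervals, and the bounds shown are hypothetical:

```python
def pairs_that_differ(labels, low, high):
    """Return label pairs whose confidence interval excludes zero."""
    differing = []
    k = len(labels)
    for i in range(k):
        for j in range(i + 1, k):
            if low[i][j] > 0 or high[i][j] < 0:  # zero not in the interval
                differing.append((labels[i], labels[j]))
    return differing

# hypothetical bounds for groups A, B, C (only the upper triangle is used)
low  = [[0.0, -1.5,  3.2], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]]
high = [[0.0,  7.1, 11.8], [0.0, 0.0, 9.4], [0.0, 0.0, 0.0]]
print(pairs_that_differ(["A", "B", "C"], low, high))  # → [('A', 'C'), ('B', 'C')]
```

Here A vs. B straddles zero, so only the A–C and B–C pairs are reported as different.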
Some software also produces what’s called a compact letter display, where groups sharing the same letter are not significantly different from each other. For example, if groups A and B both get the letter “a” and group C gets “b,” that tells you C differs from both A and B, while A and B are not significantly different from each other. This makes it easy to scan results when you have many groups.
When It’s the Right Choice
The Tukey method is designed specifically for situations where you want to compare every group to every other group. If you only care about comparing each treatment to a single control group, other methods (like Dunnett’s test) are more appropriate and more powerful for that narrower question. And if you’re testing a broader set of comparisons beyond simple pairs, Scheffé’s method handles that flexibility, though it tends to be more conservative and less powerful.
A published comparison in Restorative Dentistry and Endodontics noted that Tukey’s HSD provides the simplest way to control the family-wise error rate and is the preferred method when all pairwise comparisons are the goal. Stepwise procedures like the Student-Newman-Keuls test offer more statistical power but at the cost of weaker error control, making them more prone to false positives; Duncan’s multiple range test is even more liberal. At the other extreme, Scheffé’s method is too conservative for simple pairwise work. The Tukey method sits in a practical middle ground: good power with tight error control.
Assumptions the Data Must Meet
The Tukey method carries the same assumptions as the ANOVA it follows. Your observations need to be independent of each other, meaning one measurement shouldn’t influence another. The data within each group should be roughly normally distributed. And the variability within each group should be similar across all groups, a condition called homogeneity of variance. Violations of the equal-variance assumption are the most common problem in practice and can distort results.
Handling Unequal Group Sizes
The original Tukey HSD was designed for balanced designs, where every group has the same number of observations. When your groups have different sizes, a modification called the Tukey-Kramer method is used instead. Most statistical software applies this adjustment automatically.
With equal sample sizes, the confidence level for the full set of comparisons is exactly at your chosen level (for instance, exactly 95% if you set alpha at 0.05). With unequal sizes, the Tukey-Kramer method becomes slightly conservative, meaning the true confidence level is somewhat higher than the nominal value. In practice, this means you’re a bit less likely to detect real differences when group sizes are unbalanced, but you’re also less likely to get false positives. It’s a reasonable tradeoff for most applications.
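The Tukey-Kramer adjustment amounts to replacing the common group size in the HSD formula with a term built from both groups’ sizes. A minimal sketch of the half-width of the interval for one pair, taking the critical value q and the ANOVA’s mean squared error as given inputs:

```python
import math

def tukey_kramer_margin(q_crit, mse, n_i, n_j):
    """Tukey-Kramer half-width for the comparison of groups i and j:
    q * sqrt((MSE / 2) * (1/n_i + 1/n_j)).
    With n_i == n_j == n this reduces to the balanced HSD, q * sqrt(MSE / n)."""
    return q_crit * math.sqrt((mse / 2.0) * (1.0 / n_i + 1.0 / n_j))
```

Because 2 / (1/n_i + 1/n_j) is the harmonic mean of the two sizes, an unbalanced pair always yields a slightly wider margin than a balanced pair with the same total, which is where the method’s mild conservatism comes from.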
A Practical Example
Imagine you teach three sections of the same course using different teaching methods and want to know if final exam scores differ. You run a one-way ANOVA and get a significant F-test (p = 0.01). This tells you the three methods don’t all produce the same results, but not which ones differ.
You then run a Tukey HSD test, which produces three comparisons: Method A vs. B, A vs. C, and B vs. C. The results might show that A vs. C has a confidence interval of (3.2, 11.8) points, meaning section A scored between 3.2 and 11.8 points higher than section C, a significant difference since zero isn’t in the interval. Meanwhile, A vs. B might produce an interval of (-1.5, 7.1), which includes zero, so you can’t conclude those two sections differ. This kind of granular information is exactly what makes the Tukey method useful: it moves you from “something is different” to “here is what’s different and by how much.”
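The whole workflow fits in a few lines with SciPy. The exam scores below are made up to mimic the story above (they are not real data), and `tukey_hsd` applies the unequal-sample-size adjustment automatically when needed:

```python
from scipy.stats import f_oneway, tukey_hsd

# hypothetical final-exam scores for three course sections
a = [88, 91, 85, 90, 87, 93, 89, 86]   # Method A
b = [84, 88, 90, 82, 87, 85, 89, 86]   # Method B
c = [80, 83, 79, 84, 81, 78, 82, 85]   # Method C

f_stat, p = f_oneway(a, b, c)          # omnibus ANOVA first
res = tukey_hsd(a, b, c)               # then every pairwise comparison
ci = res.confidence_interval(confidence_level=0.95)

# pair (0, 2) is A vs. C: its interval sits above zero -> significant
print("A vs. C:", ci.low[0, 2], "to", ci.high[0, 2])
# pair (0, 1) is A vs. B: its interval straddles zero -> inconclusive
print("A vs. B:", ci.low[0, 1], "to", ci.high[0, 1])
```

With these numbers the omnibus test is significant, A vs. C excludes zero, and A vs. B does not, matching the pattern described in the example.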

