A Tukey HSD (Honestly Significant Difference) test produces a table of pairwise comparisons between group means, and the core skill is knowing which columns matter and what the numbers tell you. Once you understand a few key pieces, the output becomes straightforward to read regardless of which software generated it.
The Tukey test is a follow-up to ANOVA. ANOVA tells you that at least one group mean differs from the others, but it doesn’t tell you which ones. The Tukey HSD fills that gap by comparing every possible pair of groups and telling you exactly where the differences are.
What Each Column in the Output Means
Most software produces a table with one row per pair of groups. If you have three groups (A, B, and C), you’ll see three rows: A vs. B, A vs. C, and B vs. C. Four groups produce six rows, five groups produce ten, and so on. Each row contains several columns, and while the exact labels vary by software, they represent the same information.
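The row count follows directly from combinatorics: with k groups there are k(k-1)/2 pairs. A quick Python sketch (the group names are illustrative, and the "B-A" label style mimics R's output):

```python
from itertools import combinations
from math import comb

def pairwise_rows(groups):
    """Return the pairwise comparison labels a Tukey HSD table will contain."""
    # R-style labels: the second name is subtracted from the first, e.g. "B-A".
    return [f"{b}-{a}" for a, b in combinations(groups, 2)]

print(pairwise_rows(["A", "B", "C"]))  # ['B-A', 'C-A', 'C-B']
print(comb(4, 2))                      # 6 rows for four groups
print(comb(5, 2))                      # 10 rows for five groups
```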
Pair label: This identifies which two groups are being compared. In R, it looks like “B-A” or “Group2-Group1.” The label is read as a subtraction: the second group’s mean is subtracted from the first’s, which determines the sign of the difference.
Diff (or mean difference): The difference between the two group means. A positive number means the first-listed group has a higher mean; a negative number means it has a lower mean. This tells you the size of the effect. A statistically significant result with a tiny difference might not matter practically, so always look at this number alongside the p-value.
Lwr and Upr (lower and upper bounds): These define the confidence interval around the mean difference, typically at the 95% level. The interval gives you a range of plausible values for the true difference between the two group means. This is where one of the most important interpretation rules comes in: if the interval includes zero, the difference is not statistically significant. If both bounds are on the same side of zero (both positive or both negative), the difference is significant.
P-adj (adjusted p-value): This is the p-value after adjusting for the fact that you’re making multiple comparisons at once. The adjustment is the whole point of using Tukey instead of running separate t-tests. A p-adj below your significance threshold (usually 0.05) means that pair of groups is significantly different. A value above 0.05 means you don’t have evidence of a real difference for that pair.
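The interval rule and the p-value rule can be written as a tiny helper. The numeric rows below are made up for illustration; when the significance threshold matches the confidence level (here 0.05 and 95%), the two rules agree:

```python
def is_significant(lwr, upr, p_adj, alpha=0.05):
    """Apply both decision rules to one Tukey HSD row."""
    ci_rule = not (lwr <= 0 <= upr)  # interval excludes zero
    p_rule = p_adj < alpha           # adjusted p-value below threshold
    return ci_rule, p_rule

# Hypothetical rows: 95% CI [2.1, 14.3] with p-adj = 0.006,
# and 95% CI [-1.8, 10.4] with p-adj = 0.21.
print(is_significant(2.1, 14.3, 0.006))  # (True, True)
print(is_significant(-1.8, 10.4, 0.21))  # (False, False)
```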
Why the Adjusted P-Value Matters
When you compare many pairs of groups, the chance of a false positive increases with each comparison. If you have five groups, that’s ten pairwise comparisons. Running ten individual tests at the 0.05 level gives you roughly a 40% chance of at least one false positive across the whole set. The Tukey HSD controls this “family-wise” error rate, keeping the overall chance of any false positive at 0.05 no matter how many pairs you compare. That’s why it’s the preferred method when you want to compare all pairs.
This also means the adjusted p-values will be larger (more conservative) than what you’d get from individual t-tests. A pair that looks significant in a simple t-test might not survive the Tukey adjustment, and that’s by design.
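The roughly-40% figure comes from 1 − (1 − α)^m for m independent comparisons at level α, a quick sanity check in Python:

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent tests
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_comparisons

# Five groups -> 5 * 4 / 2 = 10 pairwise comparisons
print(round(familywise_error(0.05, 10), 3))  # 0.401
```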
Reading the Confidence Interval Plot
Many software packages produce a plot alongside the table, showing each pairwise comparison as a horizontal line segment. Each line represents the confidence interval for that pair’s mean difference, and a vertical dashed line sits at zero. The interpretation rule is simple: any interval that crosses the zero line means no significant difference for that pair. Any interval that falls entirely to the left or right of zero indicates a significant difference.
The plot makes it easy to scan many comparisons at once. You can quickly spot which pairs are significant (their intervals don’t touch zero) and get a visual sense of how large the differences are. Intervals far from zero represent large, clear differences. Intervals that barely miss zero suggest borderline results.
How Compact Letter Displays Work
Some results use a shorthand called a compact letter display, where each group is assigned one or more letters. The rule: groups that share a letter are not significantly different from each other; groups with no letters in common are significantly different.
For example, if you see these assignments across six groups:
- Group OJ.0.5: b
- Group VC.0.5: a
- Group OJ.1: c
- Group VC.1: b
- Group OJ.2: c
- Group VC.2: c
OJ.0.5 and VC.1 both have the letter “b,” so they are not detectably different. OJ.1, OJ.2, and VC.2 all share “c,” so those three are in the same group. VC.0.5 is the only one with “a,” meaning it differs significantly from every group that doesn’t also carry “a.” When a group carries two letters (like “ab”), it overlaps with both sets and is not significantly different from any group in either letter category.
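The shared-letter rule is mechanical enough to express as a lookup. This sketch just encodes the letter assignments listed above:

```python
# Letter assignments from the compact letter display above.
cld = {
    "OJ.0.5": "b", "VC.0.5": "a", "OJ.1": "c",
    "VC.1": "b", "OJ.2": "c", "VC.2": "c",
}

def detectably_different(g1, g2, letters=cld):
    """Two groups differ significantly iff they share no letter.
    Multi-letter assignments like 'ab' work via set intersection."""
    return not (set(letters[g1]) & set(letters[g2]))

print(detectably_different("OJ.0.5", "VC.1"))  # False: both carry "b"
print(detectably_different("VC.0.5", "OJ.2"))  # True: "a" vs "c"
```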
A Step-by-Step Example
Suppose you ran an ANOVA comparing test scores across three teaching methods (Lecture, Discussion, Hybrid) and got a significant result. You then run a Tukey HSD and see this output:
- Discussion-Lecture: diff = 8.2, lwr = 2.1, upr = 14.3, p-adj = 0.006
- Hybrid-Lecture: diff = 12.5, lwr = 6.4, upr = 18.6, p-adj = 0.0001
- Hybrid-Discussion: diff = 4.3, lwr = -1.8, upr = 10.4, p-adj = 0.21
Here’s how to read each row. Discussion scored 8.2 points higher than Lecture on average, the confidence interval doesn’t include zero, and the adjusted p-value is well below 0.05, so this difference is significant. Hybrid scored 12.5 points higher than Lecture, again with a confidence interval entirely above zero and a very small p-value, so this is also significant. But Hybrid vs. Discussion shows only a 4.3-point difference, the confidence interval spans from -1.8 to 10.4 (crossing zero), and the p-value is 0.21. No significant difference between those two.
The practical conclusion: both Discussion and Hybrid methods outperform Lecture, but you can’t distinguish Discussion from Hybrid based on this data.
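The same three rows can be scanned programmatically; the tuples below simply transcribe the hypothetical table above:

```python
# Hypothetical Tukey HSD output for the teaching-methods example:
# (pair label, diff, lwr, upr, p-adj)
rows = [
    ("Discussion-Lecture", 8.2, 2.1, 14.3, 0.006),
    ("Hybrid-Lecture", 12.5, 6.4, 18.6, 0.0001),
    ("Hybrid-Discussion", 4.3, -1.8, 10.4, 0.21),
]

verdicts = {}
for label, diff, lwr, upr, p_adj in rows:
    significant = not (lwr <= 0 <= upr)  # CI excludes zero
    verdicts[label] = significant
    status = "significant" if significant else "not significant"
    print(f"{label}: diff = {diff:+.1f}, p-adj = {p_adj}, {status}")
```

Running this flags the two Lecture comparisons as significant and Hybrid-Discussion as not, matching the reading above.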
Assumptions That Affect Your Results
Tukey HSD results are only reliable when its assumptions hold. First, the observations must be independent, meaning one participant’s score doesn’t influence another’s. Second, the scores within each group should be approximately normally distributed. Third, the groups need roughly equal variance, a property called homogeneity of variance. If one group’s scores are tightly clustered while another’s are wildly spread out, the test can give misleading results. You can check equal variance with Levene’s test or by comparing group standard deviations visually.
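A rough variance check needs nothing beyond the group standard deviations. The score samples below are made up, and the threshold of 2 is an informal rule of thumb, not part of the Tukey procedure (for a formal check, use Levene’s test, e.g. scipy.stats.levene):

```python
from statistics import stdev

def variance_ratio(groups):
    """Ratio of the largest to smallest group standard deviation.
    A common rule of thumb treats ratios above about 2 as a warning
    sign that the equal-variance assumption is shaky."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)

# Hypothetical score samples for three groups
a = [70, 72, 75, 71, 74]
b = [80, 85, 78, 83, 81]
c = [60, 90, 55, 95, 58]  # far more spread out than the others

print(variance_ratio([a, b, c]) > 2)  # True: variances look unequal
```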
Sample size also matters. When all groups have the same number of observations, the confidence level is exactly what you set (typically 95%). When group sizes are unequal, the test becomes slightly conservative, meaning it’s less likely to flag a difference as significant. This version is sometimes called the Tukey-Kramer method, though most software applies the adjustment automatically without requiring you to do anything different.
Common Mistakes When Reading Results
The most frequent error is looking only at p-values and ignoring the mean differences. A significant p-value tells you a difference exists, not that it’s meaningful. If two groups differ by 0.3 points on a 100-point scale with p-adj = 0.04, that difference is real but probably irrelevant in practice. Always pair statistical significance with the actual size of the difference.
Another common mistake is treating a non-significant result as proof that two groups are the same. A p-adj of 0.08 doesn’t mean the groups are identical. It means you don’t have enough evidence to conclude they’re different, which could change with a larger sample. The confidence interval is more informative here: a wide interval crossing zero tells you the data is simply too noisy to draw a conclusion, while a narrow interval centered near zero gives you more confidence the groups truly are similar.
Finally, avoid running a Tukey test when your ANOVA was not significant. If the overall F-test doesn’t reject the null hypothesis, pairwise comparisons aren’t warranted. The Tukey HSD is designed as a follow-up to a significant ANOVA, not a replacement for it.