What Is the Kruskal-Wallis Test and How Does It Work?

The Kruskal-Wallis test is a statistical method for comparing three or more independent groups when your data don’t meet the assumptions required for a standard one-way ANOVA. It’s often called the “one-way ANOVA on ranks” because instead of comparing group means directly, it ranks all the data points from smallest to largest and then checks whether those ranks are distributed evenly across groups. If one group’s ranks are consistently higher or lower than the others, the test flags a statistically significant difference.

When to Use It

The Kruskal-Wallis test fits situations where you have one independent variable with two or more categories (such as treatment groups, age brackets, or regions) and one outcome variable that is either continuous or ordinal. Continuous variables are things like blood pressure, income, or reaction time. Ordinal variables have a clear order but no consistent numeric spacing, like pain rated on a 1-to-10 scale or education level ranked from high school through graduate degree.

The classic reason to reach for this test instead of a one-way ANOVA is that your data violate the normality assumption. Maybe your sample sizes are small, your distributions are heavily skewed, or you’re working with ranked survey responses that don’t produce a neat bell curve. In all of these cases, the Kruskal-Wallis test gives you a valid way to ask: “Do these groups differ?”

What It Assumes

Every statistical test has ground rules, and the Kruskal-Wallis test has four. First, each observation must be independent, both within and between groups. A patient in Group A can’t also appear in Group B, and one person’s score shouldn’t influence another’s. Second, the measurement scale needs to be at least ordinal, meaning the values can be meaningfully ranked from low to high. Third, the distributions in each group should have roughly the same shape, even if their centers differ. The test detects shifts in location (one group tending to score higher), so wildly different spreads or shapes can distort the results. Fourth, each group should contain at least five observations. At that size, the chi-square approximation to the distribution of the test statistic is generally considered adequate, which is what lets you calculate a p-value.
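Independence and ordinality are design questions, but the last two assumptions can be eyeballed with a few summary statistics. A minimal sketch (the group names and values below are invented for illustration):

```python
import numpy as np

# Hypothetical example groups (names and values are illustrative only).
groups = {
    "A": [2.1, 3.4, 2.8, 3.9, 2.5, 3.1],
    "B": [4.0, 3.7, 4.4, 3.2, 4.8],
    "C": [5.1, 4.9, 5.6, 4.2, 5.3, 6.0],
}

for name, values in groups.items():
    data = np.asarray(values, dtype=float)
    n = data.size
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    # Rule of thumb from the text: at least five observations per group,
    # and roughly similar spreads (compare the IQRs across groups).
    print(f"{name}: n={n}, median={np.median(data):.2f}, "
          f"IQR={iqr:.2f}, n>=5: {n >= 5}")
```

Similar IQRs across groups are a rough (not conclusive) sign that the equal-shape assumption is plausible.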

How the Ranking Works

The mechanics are straightforward. You pool every data point from every group into a single list, then rank them from smallest (rank 1) to largest (rank N, where N is the total number of observations). If two or more values are identical, they each receive the average of the ranks they would have occupied. Once every value has a rank, you go back and sort those ranks into their original groups.

If the groups are truly equivalent, you’d expect each group’s average rank to be close to the overall average rank. If one group’s ranks cluster noticeably higher or lower, that’s evidence of a real difference.
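The pooling-and-ranking step can be sketched with SciPy’s rankdata, which assigns tied values the average of the ranks they span. The three groups below are made up, with a deliberate tie at 7:

```python
import numpy as np
from scipy.stats import rankdata

# Three illustrative groups (made-up values, including a tie at 7).
a = [3, 5, 7]
b = [7, 9, 11]
c = [2, 4, 6]

pooled = np.concatenate([a, b, c])
ranks = rankdata(pooled)   # ties get the average of the ranks they span
print(ranks)               # the two 7s share rank (6 + 7) / 2 = 6.5

# Split the ranks back into their original groups and compare mean ranks.
ra, rb, rc = ranks[:3], ranks[3:6], ranks[6:]
print(ra.mean(), rb.mean(), rc.mean())  # overall mean rank is (N + 1) / 2 = 5
```

Group b’s mean rank sits well above 5 and group c’s well below, which is exactly the kind of pattern the test is built to detect.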

The H Statistic

The test produces a single number called H. For k groups, with R_j the sum of ranks in group j, n_j the size of group j, and N the total number of observations, the formula is H = [12 / (N(N + 1))] × Σ(R_j² / n_j) − 3(N + 1). In more concrete terms, H quantifies how much the group rank totals deviate from what you’d expect under pure chance.

A larger H means the groups differ more. You compare H against a chi-square distribution with degrees of freedom equal to the number of groups minus one. So if you’re comparing four groups, you use three degrees of freedom. If the resulting p-value falls below your chosen threshold (commonly 0.05), you reject the null hypothesis.
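As a sketch, H can be computed by hand and cross-checked against SciPy’s built-in kruskal. The data below are invented, and because they contain no tied values, no tie correction is needed and the two calculations should agree:

```python
import numpy as np
from scipy.stats import rankdata, kruskal, chi2

# Made-up example data for three groups.
groups = [[6.4, 7.1, 8.2, 5.9, 7.5],
          [4.8, 5.2, 6.0, 4.1, 5.5],
          [8.8, 9.4, 7.9, 9.1, 8.5]]

sizes = [len(g) for g in groups]
N = sum(sizes)
ranks = rankdata(np.concatenate(groups))

# H = 12 / (N(N + 1)) * sum(R_j^2 / n_j) - 3(N + 1),
# where R_j is the rank sum of group j.
h = 0.0
start = 0
for n in sizes:
    rank_sum = ranks[start:start + n].sum()
    h += rank_sum ** 2 / n
    start += n
h = 12.0 / (N * (N + 1)) * h - 3 * (N + 1)   # works out to 11.52 here

p = chi2.sf(h, df=len(groups) - 1)   # compare H to chi-square, df = k - 1

# Cross-check against SciPy's built-in implementation.
h_scipy, p_scipy = kruskal(*groups)
print(h, h_scipy)
```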

Null and Alternative Hypotheses

The null hypothesis states that all groups come from identical populations, meaning no group tends to produce systematically larger or smaller values than any other. The alternative hypothesis states that at least one group differs. Notice the phrasing: “at least one.” A significant result tells you something is different somewhere, but it doesn’t tell you which specific groups differ from each other. That requires a follow-up step.

Post-Hoc Pairwise Comparisons

When the Kruskal-Wallis test returns a significant result, the natural next question is: which groups are actually different? The most widely used follow-up is Dunn’s test, which compares each pair of groups individually. It works by taking the difference in mean ranks between two groups and dividing it by the standard error of that difference, which depends on the total sample size and the sizes of the two groups being compared.

Because you’re running multiple comparisons, the risk of a false positive increases. Dunn’s test adjusts for this, typically using the Bonferroni correction or a similar method. Most statistical software packages (R, SPSS, Prism) include Dunn’s test as a built-in option alongside the Kruskal-Wallis test; in Python, SciPy provides the Kruskal-Wallis test itself, while Dunn’s test is available through the separate scikit-posthocs package. Either way, the process is mostly automated.
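For illustration, a bare-bones version of Dunn’s test with a Bonferroni correction can be written in a few lines. This is a minimal sketch: it ignores the tie-correction term that production implementations (such as the scikit-posthocs package) include, and the data are invented:

```python
import itertools
import numpy as np
from scipy.stats import rankdata, norm

def dunn_pairwise(groups):
    """Minimal sketch of Dunn's test with a Bonferroni correction.

    Ignores the tie-correction term for simplicity; real implementations
    handle ties and offer other p-value adjustments.
    """
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    ranks = rankdata(np.concatenate(groups))

    # Mean rank per group.
    mean_ranks, start = [], 0
    for n in sizes:
        mean_ranks.append(ranks[start:start + n].mean())
        start += n

    pairs = list(itertools.combinations(range(len(groups)), 2))
    results = {}
    for i, j in pairs:
        # Standard error of the difference in mean ranks under the null.
        se = np.sqrt(N * (N + 1) / 12.0 * (1.0 / sizes[i] + 1.0 / sizes[j]))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p = 2 * norm.sf(abs(z))              # two-sided p-value
        p_adj = min(1.0, p * len(pairs))     # Bonferroni correction
        results[(i, j)] = (z, p_adj)
    return results

# Illustrative data: the third group is clearly shifted upward.
out = dunn_pairwise([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [8, 9, 10, 11, 12]])
for pair, (z, p_adj) in out.items():
    print(pair, round(z, 2), round(p_adj, 4))
```

As expected, only the comparisons involving the shifted third group come close to significance; the first two groups overlap heavily and their adjusted p-value is near 1.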

Reporting Effect Size

A p-value tells you whether a difference exists, but it says nothing about how large that difference is. For the Kruskal-Wallis test, the two most common effect size measures are eta-squared and epsilon-squared, both calculated from the H statistic.

Eta-squared is computed as (H − k + 1) / (N − k), where k is the number of groups and N the total number of observations. Epsilon-squared is H divided by (N² − 1)/(N + 1), which simplifies to H / (N − 1). Both range from 0 to 1, where values closer to 1 indicate that group membership explains more of the variability in ranks, and both give you a practical sense of whether the differences you detected are trivially small or meaningfully large. Including one of these alongside your p-value makes your results far more interpretable, especially for readers who need to judge practical significance rather than just statistical significance.
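Both measures are one-liners once you have H. The sketch below assumes the standard definitions eta-squared = (H − k + 1) / (N − k) and epsilon-squared = H / (N − 1), applied to invented data:

```python
from scipy.stats import kruskal

# Made-up data for three groups of five observations each.
groups = [[6.4, 7.1, 8.2, 5.9, 7.5],
          [4.8, 5.2, 6.0, 4.1, 5.5],
          [8.8, 9.4, 7.9, 9.1, 8.5]]

h, p = kruskal(*groups)
k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations

eta_squared = (h - k + 1) / (n - k)            # eta-squared based on H
epsilon_squared = h / ((n**2 - 1) / (n + 1))   # simplifies to h / (n - 1)

print(round(eta_squared, 3), round(epsilon_squared, 3))
```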

Kruskal-Wallis vs. One-Way ANOVA

The one-way ANOVA assumes your data are normally distributed within each group and measured on a continuous scale with equal variances. When those assumptions hold, ANOVA is more statistically powerful, meaning it’s better at detecting real differences. The Kruskal-Wallis test trades some of that power for flexibility. It works with skewed data, ordinal scales, and small samples where normality is hard to verify.

If your data are roughly normal and your groups have similar variances, stick with ANOVA. If your data are clearly non-normal, ordinal, or you have small groups where checking normality isn’t practical, the Kruskal-Wallis test is the safer choice. In large samples with mild skew, both tests will usually give you the same conclusion.

Kruskal-Wallis vs. Mann-Whitney U

The Mann-Whitney U test does the same kind of rank-based comparison, but only for two groups. The Kruskal-Wallis test generalizes this to three or more groups. If you only have two groups, either test works, and in fact the Kruskal-Wallis test applied to two groups produces results mathematically equivalent to the Mann-Whitney U. But if you have three or more groups and you run separate Mann-Whitney tests on every pair, you inflate your false positive rate. The Kruskal-Wallis test avoids this by testing all groups simultaneously first, then using a controlled post-hoc procedure like Dunn’s test for pairwise comparisons.
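The two-group equivalence is easy to check numerically. On tie-free data, the Kruskal-Wallis p-value should match the asymptotic Mann-Whitney U p-value computed without a continuity correction (the groups below are made up):

```python
from scipy.stats import kruskal, mannwhitneyu

# Two made-up groups with no tied values.
a = [1.2, 3.4, 5.6, 7.8, 9.1]
b = [2.3, 4.5, 6.7, 8.9, 10.2]

_, p_kw = kruskal(a, b)

# The asymptotic Mann-Whitney U without continuity correction gives the
# same p-value as Kruskal-Wallis restricted to two groups (no ties here).
_, p_mwu = mannwhitneyu(a, b, alternative="two-sided",
                        use_continuity=False, method="asymptotic")

print(p_kw, p_mwu)
```

With SciPy’s default settings (continuity correction, or an exact method for small samples) the two p-values will differ slightly, which is why the options are pinned down here.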

A Quick Example

Suppose you want to know whether customer satisfaction scores differ across three store locations. You collect ratings on a 1-to-5 scale from 20 customers at each store. The ratings aren’t normally distributed (most people give 4s and 5s, with a few 1s), and the scale is ordinal. A one-way ANOVA would be questionable here. Instead, you run a Kruskal-Wallis test. All 60 ratings get pooled and ranked 1 through 60. The test calculates H, compares it to a chi-square distribution with 2 degrees of freedom, and returns a p-value. If that p-value is below 0.05, you follow up with Dunn’s test to see which specific store locations differ from each other, and you report an effect size to show whether the difference is large enough to care about.
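A sketch of this workflow, with simulated ratings standing in for real survey data (the store names and rating probabilities are invented):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)

# Hypothetical 1-to-5 satisfaction ratings, 20 customers per store,
# skewed toward 4s and 5s as described above.
probs = {
    "Store A": [0.05, 0.05, 0.15, 0.35, 0.40],
    "Store B": [0.05, 0.10, 0.25, 0.35, 0.25],
    "Store C": [0.02, 0.03, 0.10, 0.30, 0.55],
}
ratings = {store: rng.choice([1, 2, 3, 4, 5], size=20, p=p)
           for store, p in probs.items()}

h, p = kruskal(*ratings.values())
print(f"H = {h:.2f}, df = {len(ratings) - 1}, p = {p:.4f}")
# If p < 0.05, follow up with Dunn's test and report an effect size.
```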