How to Interpret Wilcoxon Signed Rank Test Results

The Wilcoxon signed-rank test tells you whether the differences between paired observations are systematically shifted away from zero. If your p-value is below your significance threshold (typically 0.05), you can conclude that the median difference between your two conditions is not zero, meaning one condition consistently produces higher or lower values than the other. But the p-value alone doesn’t tell the full story. Understanding what the test statistic, ranks, and effect size each contribute will help you draw meaningful conclusions from your results.

What the Test Actually Measures

The Wilcoxon signed-rank test is built for paired data: before-and-after measurements, matched subjects, or repeated observations on the same individuals. It tests whether the differences within those pairs are centered on zero. Unlike a paired t-test, it doesn’t assume those differences follow a normal distribution. It only assumes the distribution of differences is roughly symmetric around a central value.

The null hypothesis states that the median difference between paired observations is zero, meaning no systematic shift between the two conditions. The alternative hypothesis states that the median difference is not zero. You can also run one-sided versions if you have a directional prediction (for example, that scores increased after treatment).
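Both the two-sided and one-sided versions can be run with SciPy's `scipy.stats.wilcoxon`. A minimal sketch, using hypothetical pre/post scores for eight matched subjects:

```python
import numpy as np
from scipy.stats import wilcoxon  # assumes SciPy is installed

# Hypothetical pre- and post-training scores for 8 matched subjects
pre  = np.array([62, 70, 58, 65, 71, 60, 68, 64])
post = np.array([68, 74, 57, 72, 75, 66, 70, 69])

# Two-sided test: is the median difference nonzero?
stat_two, p_two = wilcoxon(post - pre)

# One-sided test, for a pre-stated directional prediction
# that scores increased after treatment
stat_one, p_one = wilcoxon(post - pre, alternative="greater")

print(p_two, p_one)
```

Note that the one-sided p-value is roughly half the two-sided one, which matches the halving rule discussed later for directional predictions.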

How the Test Statistic Is Calculated

Understanding how W is built helps you interpret it. The test works in four steps:

  • Calculate differences. For each pair, subtract one measurement from the other. If you’re testing against a hypothesized value rather than comparing two conditions, subtract that value from each observation.
  • Take absolute values. Drop the sign temporarily so you can rank by magnitude.
  • Rank the absolute differences. The smallest absolute difference gets rank 1, the next gets rank 2, and so on.
  • Sum the positive ranks. Go back to the original signs. W equals the sum of ranks for differences that were positive.

If there’s no real difference between conditions, you’d expect the positive and negative ranks to be roughly balanced, and W would land near the middle of its possible range. A very large W means most of the bigger differences favored the positive direction. A very small W means the negative direction dominated.
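The four steps above can be sketched directly. This uses a hypothetical set of paired differences and SciPy's `rankdata`, which assigns averaged ranks to ties:

```python
import numpy as np
from scipy.stats import rankdata  # assumes SciPy is available

# Step 1: hypothetical paired differences (post - pre)
diffs = np.array([6, 4, -1, 7, 4, 6, 2, 5])

# Steps 2-3: rank the absolute differences (ties share averaged ranks)
ranks = rankdata(np.abs(diffs))

# Step 4: W is the sum of ranks where the original difference was positive
W = ranks[diffs > 0].sum()
print(W)  # 35.0
```

Here seven of the eight differences are positive, so W sits near the top of its possible range (the total rank sum is 36), signalling that the positive direction dominated.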

Reading the P-Value

The p-value tells you the probability of getting a test statistic as extreme as yours (or more extreme) if there were truly no difference between the paired conditions. A p-value below 0.05 is conventionally considered statistically significant, letting you reject the null hypothesis of no systematic difference between conditions.

A non-significant p-value (above 0.05) does not prove the two conditions are identical. It means you don’t have enough evidence to conclude they differ. This distinction matters: with a small sample, you may simply lack the power to detect a real but modest effect.

For small samples (roughly under 20 pairs), software typically calculates an exact p-value by working out every possible arrangement of ranks. For larger samples, the distribution of W becomes approximately normal, and the software converts W into a Z-score using this relationship: the expected value of W under the null hypothesis is n(n+1)/4, where n is the number of pairs with non-zero differences. The standard deviation is the square root of n(n+1)(2n+1)/24. Your output may report this Z-score alongside or instead of W.
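The normal approximation described above is a few lines of arithmetic. This sketch uses hypothetical values for n and W:

```python
import math

# Hypothetical values: n pairs with non-zero differences,
# W = sum of the positive ranks
n, W = 30, 330

mu = n * (n + 1) / 4                              # expected W under H0
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # SD of W under H0
z = (W - mu) / sigma
print(round(z, 2))  # 2.01
```

Software may additionally apply a continuity correction or a tie correction to sigma, so hand-computed Z values can differ slightly from reported output.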

Interpreting the Z-Score

When your software reports a Z value, it’s simply the standardized version of W. A Z-score near zero supports the null hypothesis. Values further from zero (positive or negative) indicate a larger departure from what you’d expect if there were no difference. The sign of Z tells you the direction: it reflects whether the positive or negative ranks dominated. In most software output, a negative Z means the sum of positive ranks was smaller than expected.

The p-value attached to Z comes from the standard normal distribution. A two-tailed p-value tests for a difference in either direction. If you predicted a specific direction ahead of time, you can halve the two-tailed p-value to get a one-tailed result, but only if the hypothesis was stated before looking at the data and the observed effect falls in the predicted direction.

Effect Size: How Big Is the Difference?

A significant p-value tells you the difference is unlikely due to chance, but it says nothing about how large or meaningful the difference is. For that, you need an effect size. The most common approach for the Wilcoxon signed-rank test is to calculate r by dividing the Z-score by the square root of the total number of observations (r = Z / √N).

Cohen’s guidelines for interpreting r:

  • Small effect: r ≥ 0.10
  • Medium effect: r ≥ 0.30
  • Large effect: r ≥ 0.50

Another option is the matched-pairs rank-biserial correlation, which uses the difference between the sum of positive ranks and the sum of negative ranks, divided by their total. This gives a value between -1 and +1, with the same interpretation thresholds. It’s more intuitive because it directly reflects the balance between ranks favoring each direction. A rank-biserial correlation of 0.60, for example, means a large proportion of the ranked differences lean toward one condition.
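The rank-biserial correlation can be computed from the same rank sums used for W. A sketch with hypothetical differences:

```python
import numpy as np
from scipy.stats import rankdata  # assumes SciPy is available

# Hypothetical paired differences (zeros already removed)
diffs = np.array([6, 4, -1, 7, 4, 6, 2, 5])

ranks = rankdata(np.abs(diffs))
w_pos = ranks[diffs > 0].sum()   # sum of positive ranks
w_neg = ranks[diffs < 0].sum()   # sum of negative ranks

# Matched-pairs rank-biserial correlation: ranges from -1 to +1
r_rb = (w_pos - w_neg) / (w_pos + w_neg)
print(round(r_rb, 2))  # 0.94
```

A value this close to +1 reflects that nearly all of the ranked differences favored the positive direction.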

Handling Ties and Zero Differences

Two situations complicate ranking. Your software handles both automatically, but it helps to understand what it's doing.

When two or more pairs have the same absolute difference, they receive “tied ranks,” meaning each is assigned the average of the ranks they would have occupied. For example, if the 3rd and 4th smallest absolute differences are identical, both receive rank 3.5.

When a pair has zero difference (both measurements are identical), that pair is typically dropped from the analysis entirely before ranking. A zero difference is neither positive nor negative, so it would contribute nothing to either rank sum, and removing it is standard practice. Your effective sample size n then equals only the pairs with non-zero differences, which is the n used in the p-value calculation.
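Both behaviors can be seen with a small hypothetical set of differences containing a zero and a tied pair of magnitudes:

```python
import numpy as np
from scipy.stats import rankdata  # assumes SciPy is available

# Hypothetical differences including one zero and two tied magnitudes
diffs = np.array([3, -3, 0, 5, 2])

# Drop zero differences: they contribute to neither rank sum
nonzero = diffs[diffs != 0]

# Tied absolute values share the average of the ranks they would occupy:
# |3| and |-3| would take ranks 2 and 3, so both get 2.5
ranks = rankdata(np.abs(nonzero))
print(nonzero, ranks)  # [ 3 -3  5  2] [2.5 2.5 4.  1. ]
```

SciPy's `wilcoxon` applies the same zero-handling by default (its `zero_method="wilcox"` option), so you normally don't do this step by hand.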

How It Compares to a Paired T-Test

If your differences are normally distributed, the paired t-test is the standard choice. But the Wilcoxon signed-rank test loses surprisingly little power in that scenario. Research comparing the two tests found that even under conditions optimized for the t-test (normally distributed data), the average power difference was only about 0.01, essentially negligible. Under non-normal distributions like heavy-tailed or skewed data, the Wilcoxon test consistently outperformed the t-test, often by a meaningful margin.

This makes the Wilcoxon test a reliable default for paired data. You give up almost nothing when the t-test’s assumptions are met, and you gain substantial protection when they aren’t. The only real requirement is that the distribution of differences is approximately symmetric. If your differences are heavily skewed, even the Wilcoxon test may not be appropriate, and a sign test could be a better alternative.

Reporting Your Results

When writing up Wilcoxon signed-rank results, include the test statistic (W or Z), the p-value, the sample size, and ideally the medians for each condition along with an effect size. A typical write-up looks like this:

“A Wilcoxon signed-rank test indicated that post-training scores were significantly higher than pre-training scores, Z = -2.33, p = .020, r = 0.42.” Including the medians and interquartile ranges for both conditions gives readers a concrete sense of the shift. Since this is a nonparametric test, report medians rather than means as your measure of central tendency.

If you have a table of ranks from your output, pay attention to the number of negative ranks, positive ranks, and ties. These tell you how many pairs shifted in each direction. A significant result driven by 12 positive ranks and 3 negative ranks paints a clearer picture than a Z-score alone. The mean rank column, if present, tells you the average magnitude of shifts in each direction, helping you see whether one direction involved larger changes.