What Does a Shapiro-Wilk Test Show for Normality?

The Shapiro-Wilk test shows whether a set of data follows a normal distribution, the familiar bell curve where most values cluster around the average and fewer appear at the extremes. It produces a test statistic called W and a p-value that together indicate how consistent your data is with having come from a normally distributed population. This matters because many common statistical procedures, like t-tests and ANOVA, assume your data is roughly normal before they can give reliable results.

What the Test Actually Measures

The Shapiro-Wilk test compares your sample data against what a perfectly normal distribution with the same average and spread would look like. Its null hypothesis is straightforward: “this sample comes from a normal distribution.” The alternative hypothesis is that it does not. So when you run the test, you’re essentially asking the data to prove it isn’t normal, and seeing whether the evidence is strong enough to make that case.

The test outputs a statistic called W, which ranges from 0 to 1. A W value close to 1 means your data closely matches a normal distribution. Small values of W indicate departure from normality. The calculation works by comparing the ordered values in your sample against the values you’d theoretically expect from a normal distribution of the same size, then checking how well those two sets of numbers line up.
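A minimal sketch of this in Python using SciPy's `scipy.stats.shapiro` (a real function; the sample data and seed here are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Sample drawn from a normal distribution: W should be close to 1.
normal_sample = rng.normal(loc=50, scale=10, size=100)
res_normal = stats.shapiro(normal_sample)

# Sample drawn from a strongly right-skewed (exponential) distribution:
# W should be noticeably smaller, signaling departure from normality.
skewed_sample = rng.exponential(scale=10, size=100)
res_skewed = stats.shapiro(skewed_sample)

print(f"normal data: W = {res_normal.statistic:.4f}, p = {res_normal.pvalue:.4f}")
print(f"skewed data: W = {res_skewed.statistic:.4f}, p = {res_skewed.pvalue:.4f}")
```

`shapiro` returns both the W statistic and the p-value in one result object, so you can inspect how close W is to 1 alongside the formal decision.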

How to Read the P-Value

The p-value is what most people actually use to make a decision. The standard threshold is 0.05:

  • P-value greater than 0.05: You fail to reject the null hypothesis. Your data is consistent with a normal distribution, and you can proceed with parametric tests that assume normality.
  • P-value less than 0.05: You reject the null hypothesis. The data likely does not come from a normal distribution, and you may need to use non-parametric alternatives or transform your data.

A critical point that trips people up: a high p-value doesn’t prove your data is normal. It just means the test didn’t find enough evidence to say it isn’t. With small samples, the test may lack the power to detect subtle departures from normality, so a passing result is weaker evidence than it might seem.
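The decision rule above can be sketched as a small helper. The function name `check_normality` is hypothetical, not part of any library; `scipy.stats.shapiro` is real:

```python
import numpy as np
from scipy import stats

def check_normality(data, alpha=0.05):
    """Hypothetical helper: apply the Shapiro-Wilk decision rule at level alpha."""
    result = stats.shapiro(data)
    if result.pvalue > alpha:
        verdict = "consistent with normality (fail to reject H0)"
    else:
        verdict = "likely not normal (reject H0)"
    return result.statistic, result.pvalue, verdict

# Example: uniform data is flat rather than bell-shaped, so with a
# reasonably large sample the test should reject normality.
rng = np.random.default_rng(0)
w, p, verdict = check_normality(rng.uniform(size=500))
print(f"W = {w:.4f}, p = {p:.4g}: {verdict}")
```

Note that the "consistent with normality" branch reflects exactly the caveat above: it reports a failure to reject, not proof of normality.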

Why It’s Used Before Other Tests

The most common reason to run a Shapiro-Wilk test is as a preliminary check before performing a parametric analysis. Parametric tests like t-tests, ANOVA, and linear regression all assume that the underlying data (or residuals) follow a normal distribution. If that assumption is violated, the results of those tests can be misleading, producing p-values and confidence intervals you can’t trust.

So the typical workflow looks like this: collect your data, run the Shapiro-Wilk test, and check the result. If normality holds, you move forward with your planned parametric test. If not, you either transform the data (a log transformation is common for right-skewed distributions) or switch to a non-parametric test like the Mann-Whitney U or Kruskal-Wallis, which don’t require a normal distribution.
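That workflow might look like the following sketch, assuming right-skewed data (here simulated as lognormal, so the log transform should work especially well); `mannwhitneyu` is SciPy's real implementation of the non-parametric fallback:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A right-skewed sample, as might come from reaction times or costs.
raw = rng.lognormal(mean=3.0, sigma=0.8, size=60)

# Step 1: check normality of the raw data.
p_raw = stats.shapiro(raw).pvalue

# Step 2a: if normality is rejected, a log transform often helps for
# right-skewed data (a lognormal sample becomes normal under log).
p_log = stats.shapiro(np.log(raw)).pvalue
print(f"Shapiro p, raw data: {p_raw:.4g}; after log transform: {p_log:.4g}")

# Step 2b: or compare two groups with a non-parametric test instead,
# which makes no normality assumption at all.
other = rng.lognormal(mean=3.3, sigma=0.8, size=60)
u = stats.mannwhitneyu(raw, other)
print(f"Mann-Whitney U p-value: {u.pvalue:.4g}")
```

Whether to transform or to switch tests depends on your analysis goals; the transform keeps you in the parametric world but changes the scale your results are reported on.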

Sample Size: Where It Works Best

The original 1965 version of the test was designed for sample sizes between 3 and 50. A later extension expanded its range up to 2,000 observations, and that extended version is what most modern statistical software uses. The test is particularly recommended for small samples under 30, where visual methods like histograms and Q-Q plots can be hard to interpret because there simply aren’t enough data points to see a clear shape.

With very large samples, the test becomes extremely sensitive. When you have thousands of observations, even trivial deviations from perfect normality, ones that would have zero practical impact on your analysis, can produce a significant p-value. This is one reason statisticians recommend pairing the Shapiro-Wilk test with a visual check like a Q-Q plot rather than relying on the p-value alone. If the Q-Q plot looks reasonably straight but the test flags a significant result, the deviation is probably too small to worry about.
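This sensitivity is easy to see in a quick sketch. A Student's t distribution with 20 degrees of freedom is nearly normal, with slightly heavier tails that rarely matter in practice; `scipy.stats.probplot` supplies the Q-Q points and a straightness measure without any plotting library:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Nearly-normal data: t distribution with 20 degrees of freedom.
small = stats.t.rvs(df=20, size=50, random_state=rng)
large = stats.t.rvs(df=20, size=4000, random_state=rng)

# The same mild deviation: often undetected at n=50, flagged at n=4000.
print(f"n = 50:   Shapiro p = {stats.shapiro(small).pvalue:.4f}")
print(f"n = 4000: Shapiro p = {stats.shapiro(large).pvalue:.4f}")

# The Q-Q check: probplot returns the ordered points and a fitted line,
# whose correlation r measures how straight the plot looks.
(osm, osr), (slope, intercept, r) = stats.probplot(large, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")  # near 1: visually almost straight
```

If r is very close to 1 while the large-sample p-value is significant, you are likely looking at exactly the kind of statistically detectable but practically irrelevant deviation described above.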

How It Compares to Other Normality Tests

The Shapiro-Wilk test is widely regarded as the most powerful general-purpose normality test. A comprehensive comparison of four common normality tests found that Shapiro-Wilk outperformed the Anderson-Darling, Lilliefors, and Kolmogorov-Smirnov tests across all types of non-normal distributions and all sample sizes from 10 to 2,000. This held true for both symmetric non-normal distributions (like one with heavier tails than a bell curve) and asymmetric distributions (skewed to one side).

The Anderson-Darling test came in a close second and performs comparably in many situations. The Kolmogorov-Smirnov test, despite being widely taught, was consistently the weakest and required much larger sample sizes to achieve the same detection power as the other tests. If you’re choosing a single normality test, Shapiro-Wilk is the standard recommendation.

Limitations to Keep in Mind

The test requires that your observations are independent of each other. If your data points are correlated, for example repeated measurements on the same person over time, the test results won’t be valid without accounting for that structure first.

It also tests only for normality. A significant result tells you the data likely isn’t normal, but not which distribution it does follow. And as mentioned, the test’s high sensitivity at large sample sizes can lead to rejecting normality for deviations that are statistically detectable but practically irrelevant. Many parametric tests are robust to mild departures from normality, especially with larger samples, so a significant Shapiro-Wilk result doesn’t automatically mean your planned analysis is invalid.

For these reasons, most statisticians treat the Shapiro-Wilk test as one piece of evidence rather than the final word. Combining it with a Q-Q plot and looking at skewness and kurtosis values gives a more complete picture of whether your data’s distribution is close enough to normal for your purposes.
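Pulling those pieces of evidence together is a few lines in SciPy; `skew` and `kurtosis` are real functions (the latter reports excess kurtosis by default, so 0 corresponds to a normal distribution), and the sample data here is illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(size=300)

res = stats.shapiro(data)
skewness = stats.skew(data)          # near 0 for a symmetric distribution
excess_kurt = stats.kurtosis(data)   # near 0 for normal tails (Fisher definition)
(_, _), (_, _, r) = stats.probplot(data, dist="norm")  # Q-Q straightness

print(f"Shapiro-Wilk: W = {res.statistic:.4f}, p = {res.pvalue:.4f}")
print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurt:.3f}, Q-Q r = {r:.4f}")
```

No single number here is decisive on its own; it is the combination, test result, shape statistics, and a straight-looking Q-Q plot, that supports treating the data as close enough to normal.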