How to Interpret the Shapiro-Wilk Normality Test

The Shapiro-Wilk test assesses whether your data is consistent with a normal (bell-curve) distribution. It produces two numbers: a W statistic and a p-value. If the p-value is greater than your chosen significance level (typically 0.05), your data is consistent with a normal distribution. If the p-value is 0.05 or less, your data deviates significantly from normality. That’s the core interpretation, but the details matter.

What the Test Actually Tests

The Shapiro-Wilk test sets up two competing claims. The null hypothesis says your sample came from a normally distributed population. The alternative hypothesis says it did not. Your job is to look at the p-value and decide whether there’s enough evidence to reject that null hypothesis.

This framing is important because the test can never prove your data is normal. A high p-value simply means the test found no strong evidence against normality. That’s a subtle but real distinction: you’re either rejecting normality or failing to reject it, never confirming it.

Reading the W Statistic

The W statistic ranges from 0 to 1. It essentially measures how well your data’s sorted values match what you’d expect from a perfect normal distribution. A W value close to 1 means your data closely resembles a bell curve. Small values of W indicate departure from normality.

In practice, most software users focus on the p-value rather than W itself, since interpreting W directly requires comparing it against critical value tables for your specific sample size. The p-value does that comparison for you and gives a straightforward answer.

How to Use the P-Value

Before running the test, pick a significance level. The most common choice is 0.05 (a 5% threshold). Then follow this logic:

  • P-value > 0.05: You cannot reject the null hypothesis. Your data is consistent with a normal distribution, and you can proceed with statistical methods that assume normality (like t-tests or ANOVA).
  • P-value ≤ 0.05: You reject the null hypothesis. Your data deviates significantly from a normal distribution, and you should consider non-parametric alternatives or data transformations.

For example, if you run the test and get W = 0.981 and p = 0.169, the p-value is well above 0.05. You’d conclude there’s no statistically significant evidence that your data is non-normal. If instead you got W = 0.874 and p = 0.002, that low p-value tells you the data departs meaningfully from a bell curve.
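The decision rule above can be sketched as a small helper. The function name `interpret_shapiro` is hypothetical, not part of any library; it simply encodes the logic described in the bullets:

```python
def interpret_shapiro(p_value, alpha=0.05):
    """Map a Shapiro-Wilk p-value to a plain-language conclusion."""
    if p_value > alpha:
        # Data is consistent with a normal distribution.
        return "fail to reject normality"
    # Significant departure from a bell curve.
    return "reject normality"

# The two examples from the text:
print(interpret_shapiro(0.169))  # fail to reject normality
print(interpret_shapiro(0.002))  # reject normality
```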

Why Sample Size Changes Everything

The Shapiro-Wilk test was originally designed for samples between 3 and 50 observations, later extended to handle up to 2,000, and in modern implementations up to 5,000. It’s considered especially useful for small samples (under 30), where visual inspection alone can be unreliable.

With large samples, the test becomes overly sensitive. A dataset of several hundred or thousand observations will often produce a significant p-value even when the deviation from normality is trivially small and wouldn’t meaningfully affect your analysis. This is one of the most common pitfalls: a statistically significant result doesn’t always mean a practically significant departure from normality. A tiny wobble in the tails of your distribution might trigger a rejection with 1,000 data points but wouldn’t cause any real problems for a t-test or regression.

Conversely, with very small samples (under 20), the test has limited power to detect real non-normality. You might get a non-significant p-value simply because there isn’t enough data for the test to pick up on the problem.
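A quick simulation makes the sample-size point concrete. This is a sketch using NumPy and SciPy with a deliberately skewed (exponential) population; the seed and sample sizes are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=1000)  # clearly non-normal population

# Large sample: the test detects the skew easily (tiny p-value).
_, p_large = stats.shapiro(skewed)

# Small sample from the same skewed population: the test often
# lacks the power to flag it.
_, p_small = stats.shapiro(skewed[:15])

print(f"n=1000: p = {p_large:.4g}")
print(f"n=15:   p = {p_small:.4g}")
```

The same mechanism works in reverse for the large-sample pitfall: with enough data, even a deviation far milder than this one will push the p-value below 0.05.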

Combine It With a Q-Q Plot

A single p-value doesn’t tell you the whole story. Pairing the Shapiro-Wilk test with a Q-Q (quantile-quantile) plot gives you both a formal statistical answer and a visual understanding of where your data departs from normality.

A Q-Q plot graphs your data’s values against what those values would be if the data were perfectly normal. If the points fall roughly along a straight diagonal line, your data is approximately normal. Curves, S-shapes, or points flying off the line at the ends reveal specific problems: skewness, heavy tails, or outliers. This visual context helps you decide whether a significant Shapiro-Wilk result reflects a real problem or just the test’s sensitivity to large samples. It also helps you understand the nature of the non-normality, which a p-value alone cannot do.
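SciPy can compute the Q-Q coordinates directly via `scipy.stats.probplot`, which also fits a reference line. A minimal sketch with simulated data (the seed and parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=40)

# probplot returns the Q-Q coordinates: theoretical normal quantiles (osm)
# paired with the sorted sample values (osr), plus a least-squares line.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")

# For roughly normal data the points hug the line, so r is close to 1.
print(f"correlation of Q-Q points with the fitted line: r = {r:.3f}")
```

Passing `plot=plt` (with matplotlib imported) draws the plot instead of just returning the coordinates.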

Shapiro-Wilk vs. Other Normality Tests

If you’re choosing between normality tests, the Shapiro-Wilk test is generally the strongest option. Comparative studies rank it as the most powerful normality test across both symmetric and asymmetric non-normal distributions, outperforming the Anderson-Darling, Lilliefors, and Kolmogorov-Smirnov tests. The Kolmogorov-Smirnov test, despite being widely taught, is consistently the weakest performer and requires much larger sample sizes to achieve comparable detection power. If your software offers multiple options, Shapiro-Wilk is the default recommendation.

Running It in Common Software

In Python’s SciPy library, the function scipy.stats.shapiro() returns two values: the W statistic and the p-value. A call might return something like ShapiroResult(statistic=0.981, pvalue=0.169), which you’d interpret as no evidence against normality.
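A minimal call looks like this; the seed and sample are illustrative, so your exact W and p will differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(size=30)  # simulated data for illustration

result = stats.shapiro(sample)
print(result.statistic, result.pvalue)  # W statistic and p-value

if result.pvalue > 0.05:
    print("no evidence against normality")
```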

In R, the function shapiro.test() returns the same two values. SPSS includes the test under its Explore procedure, and Excel requires an add-in or manual calculation. Regardless of the tool, interpretation is identical: check the p-value against your significance level.

How to Report Results

When writing up your findings, report the W statistic and the exact p-value, each to two or three decimal places. If the p-value is extremely small, report it as p < .001 rather than writing out a long string of zeros. A typical write-up looks like this:

“A Shapiro-Wilk test indicated that the distribution of scores was not significantly different from normal, W = 0.98, p = .169.”

Or for a significant result: “A Shapiro-Wilk test indicated that the data significantly deviated from a normal distribution, W = 0.87, p < .001.”
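If you report many tests, a small formatting helper keeps the style consistent. `report_shapiro` is a hypothetical function name, a sketch of the convention above rather than any library API:

```python
def report_shapiro(w, p, alpha=0.05):
    """Format a Shapiro-Wilk result in the reporting style shown above.
    Hypothetical helper, not part of any statistics library."""
    # APA-style p-value: drop the leading zero, cap tiny values at "< .001".
    p_text = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".", 1)
    verdict = ("was not significantly different from normal"
               if p > alpha
               else "significantly deviated from a normal distribution")
    return f"W = {w:.2f}, {p_text} ({verdict})"

print(report_shapiro(0.98, 0.169))
print(report_shapiro(0.87, 0.0002))
```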

State the practical implication: did you proceed with parametric tests, switch to non-parametric methods, or apply a transformation? That context matters more to your reader than the numbers alone.

Common Mistakes to Avoid

The biggest misinterpretation is treating a non-significant result as proof of normality. A p-value of 0.45 doesn’t mean your data is normally distributed. It means you lack evidence to say otherwise. With a small sample, this could simply reflect low statistical power.

Another common error is blindly rejecting parametric methods after a significant Shapiro-Wilk result on a large dataset. Many parametric tests (t-tests, ANOVA, regression) are robust to mild departures from normality, especially with larger samples. A significant Shapiro-Wilk result in a sample of 500 often reflects a deviation so small it won’t affect your main analysis. Look at the Q-Q plot, check how far W is from 1, and make a judgment call rather than treating the p-value as an automatic gate.

Finally, don’t run the test on your dependent variable in isolation when the assumption you’re actually checking is about residuals. Many statistical models assume the residuals (the errors left after fitting the model) are normal, not the raw data itself. Run Shapiro-Wilk on the residuals from your model, not on the original measurements.
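The workflow for a simple linear regression might look like the following sketch, using NumPy and SciPy on simulated data (the seed, sample size, and coefficients are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=60)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=60)  # linear signal + normal noise

# Fit a simple linear regression, then test the residuals, not y itself.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

w, p = stats.shapiro(residuals)
print(f"residuals: W = {w:.3f}, p = {p:.3f}")
```

Note that `y` itself may look non-normal here simply because it depends on `x`; the normality assumption applies to the residuals, which is exactly what this checks.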