The Shapiro-Wilk test in R checks whether your data follows a normal distribution. You run it with shapiro.test(x), and the output gives you two key values: a W statistic and a p-value. If the p-value is greater than 0.05, your data is consistent with a normal distribution. If it’s less than 0.05, your data deviates significantly from normality.
What the Test Actually Does
The null hypothesis of the Shapiro-Wilk test is that your sample came from a normally distributed population. The alternative hypothesis is that it did not. This is the opposite of what many people expect: a “significant” result (small p-value) means your data is not normal, and a non-significant result means you don’t have evidence against normality.
The W statistic ranges from 0 to 1. Values close to 1 indicate the data closely matches a normal distribution. As W drops further from 1, the data deviates more from normality. In practice, you’ll rely on the p-value for your decision, but the W statistic gives you a sense of how close the fit is.
Running the Test in R
The function takes a single numeric vector. Missing values are allowed, but you need between 3 and 5,000 non-missing values.
# Generate some example data
set.seed(42)
my_data <- rnorm(100, mean = 50, sd = 10)
# Run the test
shapiro.test(my_data)
The output looks like this:
Shapiro-Wilk normality test
data: my_data
W = 0.994, p-value = 0.9407
This result has a W of 0.994 (very close to 1) and a p-value of 0.94, well above 0.05. You would fail to reject the null hypothesis, meaning there's no evidence the data is non-normal. That's expected here since the data was generated from a normal distribution.
Now compare that with skewed data:
skewed_data <- rexp(100, rate = 1)
shapiro.test(skewed_data)
This will produce a small p-value (well below 0.05) and a W statistic noticeably less than 1, telling you the data is not normally distributed.
Reading the Output Components
The function returns a list with four components you can access programmatically:
- statistic: The W value.
- p.value: The approximate p-value for the test.
- method: A label confirming it's the Shapiro-Wilk normality test.
- data.name: The name of the variable you passed in.
If you need to use the p-value in a script (for example, to decide which statistical test to run next), store the result and extract it:
result <- shapiro.test(my_data)
result$p.value
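One common pattern is to let the extracted p-value choose between a parametric test and its rank-based alternative. Here is a minimal sketch of that branching; the group data and variable names are made up for illustration:

```r
set.seed(42)
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 55, sd = 10)

# Check each group for normality and keep the p-values
normal_a <- shapiro.test(group_a)$p.value > 0.05
normal_b <- shapiro.test(group_b)$p.value > 0.05

# Use the t-test only if both groups look normal; otherwise fall
# back to the rank-based Wilcoxon test
if (normal_a && normal_b) {
  comparison <- t.test(group_a, group_b)
} else {
  comparison <- wilcox.test(group_a, group_b)
}
comparison$p.value
```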
The 0.05 Decision Rule
The standard threshold is 0.05. A p-value above 0.05 means you fail to reject the null hypothesis: the data could reasonably have come from a normal distribution. A p-value below 0.05 means the departure from normality is statistically significant, and you should consider non-parametric alternatives for your downstream analysis.

This matters because many common statistical methods assume normality. T-tests, ANOVA, and linear regression all work best when the underlying data (or residuals) are approximately normal. If the Shapiro-Wilk test rejects normality, you might switch to a Wilcoxon rank-sum test instead of a t-test, or a Kruskal-Wallis test instead of ANOVA.
Why Sample Size Changes Everything
The Shapiro-Wilk test behaves very differently depending on how many data points you have, and this is the most common source of misinterpretation.
With small samples (around 30 observations), the test has low power. A simulation study found that at n = 30, the Shapiro-Wilk test's power was only about 0.51 at the conventional 0.05 significance level. That means it correctly flags non-normal data only about half the time, essentially a coin flip. Small samples often pass the normality test not because the data is truly normal, but because there isn't enough data to detect a departure.
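You can see this concretely with a small simulation sketch: draw many samples of size 30 from a genuinely non-normal distribution and count how often the test fails to flag them. The t-distribution with 5 degrees of freedom is my choice of a mild (heavy-tailed) departure; the exact pass rate will vary with the seed and the distribution chosen.

```r
set.seed(1)
n_sims <- 1000

# Each sample is size 30, drawn from a mildly heavy-tailed
# t-distribution -- genuinely non-normal data
p_values <- replicate(n_sims, shapiro.test(rt(30, df = 5))$p.value)

# Proportion of non-normal samples the test fails to flag at 0.05
mean(p_values > 0.05)
```

With a mild departure like this, a large share of the truly non-normal samples sail through the test at n = 30.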
With large samples (several thousand observations), the opposite problem occurs. The test becomes so sensitive that it flags trivially small deviations from perfect normality as "significant." Your data might be plenty normal for practical purposes, but the test rejects it because no real-world data is ever perfectly normally distributed.
R enforces a hard upper limit of 5,000 observations. If your vector is longer than that, the function throws an error. You can work around this by testing a random subset, but the sensitivity issue means the result may not be very informative at that scale anyway.
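A sketch of the subsetting workaround (the variable name is illustrative):

```r
set.seed(7)
big_data <- rnorm(20000)

# shapiro.test(big_data) would throw an error, since the sample
# size must be between 3 and 5000. Test a random subset instead:
shapiro.test(sample(big_data, 5000))
```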
Pair It With a Q-Q Plot
Because of these sample size issues, you should never rely on the Shapiro-Wilk test alone. A Q-Q (quantile-quantile) plot gives you a visual check that complements the numeric result.
qqnorm(my_data)
qqline(my_data, col = "red")
If your data is normally distributed, the points will fall roughly along the red reference line. A consistent bow above or below the line indicates skewness, while an S-shape, with points peeling away at both ends, indicates tails that are heavier or lighter than normal. This visual information tells you not just whether the data departs from normality, but how it departs, something the p-value alone cannot do.
A practical approach: if both the Shapiro-Wilk p-value and the Q-Q plot agree, you can be confident in your conclusion. If they disagree (common with large samples where the test is overly strict), trust the visual. A Q-Q plot that looks reasonably straight usually means parametric methods are fine, even if the Shapiro-Wilk p-value dips below 0.05.
What To Do When Data Isn't Normal
If the test tells you your data is non-normal, you have three main options.
First, you can transform the data. Common transformations include log, square root, and reciprocal. These often pull skewed distributions closer to a bell curve. The Box-Cox family of power transformations provides a systematic way to find the best transformation for your specific data. In R, the MASS package has a boxcox() function that estimates the optimal transformation parameter. After transforming, run the Shapiro-Wilk test again to confirm the transformation worked.
Second, you can switch to non-parametric tests that don't assume normality. For comparing two groups, use wilcox.test() instead of t.test(). For comparing three or more groups, use kruskal.test() instead of aov(). These tests use ranks rather than raw values, so the shape of the distribution doesn't matter.
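A minimal sketch of both substitutions, with made-up exponential groups standing in for skewed real data:

```r
set.seed(42)
group_a <- rexp(40, rate = 1)
group_b <- rexp(40, rate = 0.5)   # larger values on average
group_c <- rexp(40, rate = 0.25)  # larger still

# Two groups: rank-sum test in place of t.test()
wilcox.test(group_a, group_b)

# Three or more groups: Kruskal-Wallis in place of aov()
values <- c(group_a, group_b, group_c)
groups <- factor(rep(c("a", "b", "c"), each = 40))
kruskal.test(values ~ groups)
```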
Third, if your sample size is large enough (roughly 30 or more per group), many parametric tests are robust to moderate violations of normality thanks to the central limit theorem. In these cases, a mildly non-normal Shapiro-Wilk result may not require any action at all.
Reporting Your Results
When writing up results for a paper or report, include the W statistic, the p-value, and your sample size. A typical format looks like: "A Shapiro-Wilk test indicated that the data were normally distributed, W = 0.994, p = 0.94, n = 100." If the test was significant, state the opposite: "A Shapiro-Wilk test indicated a significant departure from normality, W = 0.912, p = 0.003, n = 50."
If you tested normality on residuals from a model rather than on raw data, note that as well. Testing residuals is often more appropriate than testing the raw dependent variable, particularly for regression and ANOVA, since those methods assume normally distributed residuals rather than normally distributed raw scores.
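A sketch of testing residuals rather than the raw response; the simulated regression here is illustrative:

```r
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1.5)  # linear trend plus normal errors

model <- lm(y ~ x)

# Test the residuals, not y itself: the raw response mixes the
# x-driven trend with the error term, so it need not look normal
# even when the model's assumptions hold
shapiro.test(residuals(model))
```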

