How To Tell If Residuals Are Normally Distributed

You can check whether residuals are normally distributed using visual methods like Q-Q plots and histograms, or formal statistical tests like the Shapiro-Wilk test. Most practitioners use a combination of both, since each approach has blind spots depending on your sample size. Getting this right matters: for regression to produce valid p-values and confidence intervals, the residuals need to be approximately normal. When that assumption breaks down, the probability that your confidence interval actually contains the true value can drift far from what you calculated.

Why Normality of Residuals Matters

The normality assumption isn’t about your raw data. It’s about the residuals: the differences between your observed values and the values your model predicted. For least squares regression to work as intended, those residuals need to be independent and normally distributed, with constant variance.

When residuals aren’t normally distributed, your confidence intervals and p-values become unreliable. The actual probability of a Type I error (concluding there’s an effect when there isn’t one) can be very different from the 5% threshold you think you’re using. In practical terms, you might be much more likely to report a false positive, or you might be missing real effects because your intervals are too wide. This is especially consequential in smaller samples, where the central limit theorem can’t bail you out.

Start With a Q-Q Plot

A Q-Q (quantile-quantile) plot is the single most useful visual tool for assessing normality. It plots your residuals against the values you’d expect if they came from a perfect normal distribution. If the residuals are normal, the points fall along a straight diagonal line. Deviations from that line tell you exactly what’s going wrong.

Here’s what to look for:

  • Points curving away at both ends in the same direction: This indicates skewness. If the points bow above the line at both ends (a concave-up curve), your residuals are right-skewed. The opposite curve means left-skewed.
  • An S-shaped pattern: Points that fall below the line on the left and above it on the right (or vice versa) suggest heavy tails, meaning your residuals have more extreme values than a normal distribution would. The relationship between sample and theoretical percentiles is no longer linear.
  • A mostly straight line with one or two points far off: This typically signals individual outliers rather than a fundamental problem with normality. The relationship is approximately linear except for those isolated data points.

Q-Q plots are effective at any sample size and give you diagnostic information that a single pass/fail test can’t. They tell you not just whether normality is violated, but how.
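If you want a numeric companion to the visual read, SciPy’s `probplot` returns the quantile pairs a Q-Q plot is built from, along with the correlation of the straight-line fit. A sketch with simulated residuals (the seeds and sample sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated residuals: one normal set, one right-skewed set for contrast.
normal_resid = rng.normal(0, 1, size=200)
skewed_resid = rng.exponential(1, size=200) - 1  # mean-centered, right-skewed

# probplot returns (theoretical, sample) quantile pairs plus a straight-line
# fit; a fit correlation r near 1 means the points hug the diagonal.
(_, _), (slope_n, intercept_n, r_normal) = stats.probplot(normal_resid)
(_, _), (slope_s, intercept_s, r_skewed) = stats.probplot(skewed_resid)

print(f"fit correlation, normal residuals: {r_normal:.3f}")
print(f"fit correlation, skewed residuals: {r_skewed:.3f}")
```

Passing `plot=plt` (a matplotlib module or axes) to `probplot` draws the plot itself; here only the fit correlation is used, as a rough numeric summary of how straight the line is.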

Histograms: Useful but Limited

Plotting a histogram of your residuals is intuitive. You’re looking for the familiar bell-shaped curve: symmetric, with most values clustered near zero and tails tapering off evenly on both sides. Obvious skewness, multiple peaks, or a flat shape all suggest non-normality.

The catch is sample size. NIST notes that residual sample sizes are generally small (under 50) because experiments have limited treatment combinations. With so few data points, a histogram’s shape becomes unreliable. The bars are chunky, gaps appear by chance, and a perfectly normal distribution can look lopsided. If you have fewer than 50 residuals, treat a histogram as a rough sanity check and rely more heavily on the Q-Q plot.
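You can see the small-sample instability for yourself without any plotting by binning simulated normal data with `numpy.histogram`. This is purely illustrative; the bin edges and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

small = rng.normal(0, 1, size=30)
large = rng.normal(0, 1, size=1000)

# Same bin edges for both samples, so the shapes are comparable.
edges = np.linspace(-4, 4, 9)
counts_small, _ = np.histogram(small, bins=edges)
counts_large, _ = np.histogram(large, bins=edges)

def asymmetry(counts):
    # Fraction of mass imbalance between the left and right halves;
    # 0 means perfectly balanced, a crude stand-in for "looks symmetric".
    half = len(counts) // 2
    left, right = counts[:half].sum(), counts[half:].sum()
    return abs(left - right) / counts.sum()

print("small-sample asymmetry:", round(asymmetry(counts_small), 2))
print("large-sample asymmetry:", round(asymmetry(counts_large), 2))
```

Rerunning with different seeds shows the point: the 30-observation histogram’s shape bounces around from draw to draw, while the 1,000-observation one stays recognizably bell-shaped.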

The Shapiro-Wilk Test

The Shapiro-Wilk test is the most widely recommended formal test for normality, particularly for smaller samples. It works by comparing your residuals to what a normally distributed set of values with the same mean and spread would look like. The null hypothesis is that the distribution is normal.

Interpreting the result is straightforward: if the p-value is greater than 0.05, you don’t have sufficient evidence to reject normality. If the p-value is below 0.05, the test is flagging a statistically significant departure from a normal distribution. A p-value below 0.01 suggests a stronger violation.

The Shapiro-Wilk test is the better choice when your sample is under 50 observations, as it has more statistical power to detect non-normality in small datasets. For samples of 50 or more, the Kolmogorov-Smirnov test becomes an option, though the Shapiro-Wilk handles larger samples adequately too.
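In Python, running the test is a one-liner with `scipy.stats.shapiro`. A sketch, with simulated residuals standing in for whatever your model produced:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_resid = rng.normal(0, 1, size=40)
skewed_resid = rng.exponential(1, size=40) - 1  # right-skewed

# shapiro returns the W statistic (near 1 for normal-looking data)
# and the p-value for the null hypothesis of normality.
stat_n, p_n = stats.shapiro(normal_resid)
stat_s, p_s = stats.shapiro(skewed_resid)

print(f"normal residuals:  W={stat_n:.3f}, p={p_n:.3f}")
print(f"skewed residuals:  W={stat_s:.3f}, p={p_s:.3f}")
```

Remember the interpretation from above: p above 0.05 means you lack evidence against normality, not proof of it.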

Why You Shouldn’t Rely on Tests Alone

Formal normality tests have a fundamental tension built in. With small samples, they often lack the sensitivity to detect real departures from normality. With large samples, they become overly sensitive, flagging trivial deviations that have no practical impact on your regression results. A dataset with 10,000 observations will almost always fail a Shapiro-Wilk test even if the residuals are close enough to normal for your confidence intervals to be perfectly trustworthy.

This is why experienced analysts pair a formal test with a Q-Q plot. The test gives you an objective threshold. The plot lets you judge whether a detected violation is severe enough to actually matter. A slight curve in the tails of an otherwise straight Q-Q plot is very different from a dramatic S-shape, even if both produce p-values below 0.05.

Skewness and Kurtosis as Quick Checks

You can also examine the skewness and kurtosis of your residuals directly. A normal distribution has a skewness of 0 (perfectly symmetric) and a kurtosis of 3 (sometimes reported as “excess kurtosis” of 0). The Jarque-Bera test formalizes this by combining both measures into a single statistic, checking whether the skewness and kurtosis of your residuals are jointly consistent with normality.

As a quick rule of thumb for small samples, a skewness or kurtosis z-score outside ±1.96 is enough to raise concern at the 0.05 significance level, and a z-score outside ±2.58 points to a more serious violation. These thresholds give you a fast numeric check without running a full test, though they work best as a supplement to visual inspection.
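SciPy exposes all three checks directly: `skewtest` and `kurtosistest` return the z-scores the rule of thumb refers to, and `jarque_bera` gives the combined test. A sketch on simulated right-skewed residuals (the sample size and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
resid = rng.exponential(1, size=100) - 1  # mean-centered, right-skewed

# Both tests return (z-score, p-value); |z| > 1.96 flags the 0.05 level.
z_skew, p_skew = stats.skewtest(resid)
z_kurt, p_kurt = stats.kurtosistest(resid)

# Jarque-Bera combines skewness and kurtosis into one statistic.
jb_stat, jb_p = stats.jarque_bera(resid)

print(f"skewness z-score: {z_skew:.2f}")
print(f"kurtosis z-score: {z_kurt:.2f}")
print(f"Jarque-Bera: stat={jb_stat:.2f}, p={jb_p:.4f}")
```

Note that `kurtosistest` is only considered reliable for samples of 20 or more, which fits the small-sample caveat above.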

What To Do When Residuals Aren’t Normal

If your checks reveal non-normal residuals, you have several options depending on what the Q-Q plot and tests are telling you.

Transform Your Response Variable

The most common fix is to transform the variable you’re trying to predict. A log transformation works well for right-skewed data, which is extremely common in biological and financial measurements. A square root transformation is a milder correction. The Box-Cox method takes this further by finding the optimal transformation automatically. It searches across a family of power transformations and identifies which one best normalizes your residuals. The key values to know: a Box-Cox parameter of 1 means no transformation is needed, 0.5 corresponds to a square root, 0 to a log transformation, and -1 to taking the reciprocal.

After transforming, refit your model and recheck the residuals. The goal is for the new residuals to fall more closely along that diagonal line in a Q-Q plot.
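With SciPy, `stats.boxcox` both searches for the parameter and applies the transformation (the response must be strictly positive). A sketch on simulated lognormal data, where the estimated parameter should land near 0, the log transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Right-skewed, strictly positive response: Box-Cox requires y > 0.
y = rng.lognormal(mean=2.0, sigma=0.6, size=200)

# With no lambda given, boxcox finds the value that best normalizes y
# and returns the transformed data alongside it.
transformed, lam = stats.boxcox(y)

print(f"estimated Box-Cox lambda: {lam:.2f}")
```

A fitted lambda near 1 would have meant no transformation was needed; here, because the data are lognormal, the estimate sits near 0 instead.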

Use Robust or Alternative Methods

If the non-normality is driven by outliers or heavy tails, robust regression is a practical alternative to ordinary least squares. Rather than treating every data point equally, robust regression downweights observations with large residuals. It’s a middle ground between throwing out outliers entirely and letting them distort your results. If your robust regression estimates look very different from your ordinary regression estimates, that’s a strong signal that outliers were heavily influencing your original model.
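Libraries such as statsmodels ship robust regression out of the box, but the core idea fits in a few lines: iteratively reweighted least squares with Huber weights. The sketch below is illustrative rather than production code; `huber_irls`, its tuning constant, and the injected outliers are all our own choices:

```python
import numpy as np

def huber_irls(x, y, k=1.345, n_iter=50):
    """Robust line fit via iteratively reweighted least squares.
    Points with residuals beyond k scale units get weight k/|r|,
    so large outliers contribute less to the fit. A sketch only."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r)) / 0.6745  # robust scale estimate
        if scale == 0:
            scale = 1.0
        u = np.abs(r / scale)
        w = np.where(u <= k, 1.0, k / u)       # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 60)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=60)
y[::15] += 8.0  # inject a few large outliers

ols = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
robust = huber_irls(x, y)
print("OLS    intercept, slope:", np.round(ols, 2))
print("Huber  intercept, slope:", np.round(robust, 2))
```

Comparing the two printed fits illustrates the diagnostic mentioned above: the OLS estimates are pulled toward the outliers, while the Huber fit stays near the true intercept of 2.0 and slope of 0.5.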

Bootstrapping is another option. Instead of relying on the normality assumption to calculate confidence intervals and p-values, bootstrapping generates them empirically by resampling your data thousands of times. This sidesteps the normality requirement altogether.
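A minimal case-resampling bootstrap for a regression slope needs only NumPy; the sample size, seed, and number of resamples below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 80
x = rng.uniform(0, 10, size=n)
# Deliberately skewed (non-normal) errors: the bootstrap doesn't care.
y = 1.0 + 0.7 * x + (rng.exponential(1.0, size=n) - 1.0)

def fit_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Case resampling: draw (x, y) pairs with replacement, refit each time,
# and read the confidence interval off the resulting slope distribution.
boot = np.empty(2000)
for b in range(2000):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit_slope(x[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"bootstrap 95% CI for the slope: ({lo:.3f}, {hi:.3f})")
```

This is the simplest percentile interval; refinements such as BCa intervals exist, but the resampling idea is the same.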

A Practical Workflow

For most analyses, a sensible sequence looks like this. First, fit your regression model and extract the residuals. Plot them in a Q-Q plot and scan for curvature, S-shapes, or isolated outliers. If the sample is large enough for a histogram to be meaningful (roughly 50 or more residuals), check that too. Then run a Shapiro-Wilk test to get a formal result. If both the visual and the test agree that things look fine, move on. If either raises a flag, inspect the Q-Q plot more carefully to understand the nature of the problem, then decide whether a transformation, robust regression, or bootstrapping is the right fix.
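The sequence above can be wrapped in a small helper. `check_residual_normality`, its thresholds, and the report format are illustrative choices, not a standard recipe:

```python
import numpy as np
from scipy import stats

def check_residual_normality(resid, alpha=0.05):
    """One diagnostic pass: Q-Q straight-line fit correlation plus a
    Shapiro-Wilk test. The 0.98 correlation cutoff is a rough,
    illustrative threshold, not a canonical one."""
    (_, _), (_, _, r) = stats.probplot(resid)
    w, p = stats.shapiro(resid)
    return {
        "qq_fit_correlation": round(float(r), 3),
        "shapiro_W": round(float(w), 3),
        "shapiro_p": round(float(p), 4),
        "flag": bool(p < alpha or r < 0.98),
    }

rng = np.random.default_rng(2)
report = check_residual_normality(rng.normal(0, 1, size=100))
print(report)
```

Even with a helper like this, the Q-Q plot itself is still worth looking at whenever the flag fires, for the reasons covered earlier: the numbers tell you that something is off, not what.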

Keep in mind that the normality assumption becomes less critical as your sample grows, because the sampling distribution of your regression coefficients approaches normality regardless, thanks to the central limit theorem. With several hundred observations and only mild skewness, your p-values and confidence intervals are likely still reliable even if a formal test rejects normality. With 20 observations and a clearly S-shaped Q-Q plot, you have a real problem worth fixing.