The most common mathematical method for finding outliers uses the interquartile range (IQR): any data point more than 1.5 times the IQR below the first quartile or above the third quartile is an outlier. This approach works on nearly any dataset, regardless of its shape. But it’s not the only option. Z-scores, formal statistical tests, and multivariate distance measures each handle different situations, and choosing the right one depends on your data’s size, distribution, and number of variables.
The IQR Method: Most Versatile Starting Point
The IQR method, sometimes called Tukey’s fences, works by building a “fence” around the middle of your data. Here’s the process step by step:
- Sort your data from smallest to largest.
- Find Q1 (the median of the lower half) and Q3 (the median of the upper half).
- Calculate the IQR: Q3 minus Q1.
- Set your fences: Lower fence = Q1 − 1.5 × IQR. Upper fence = Q3 + 1.5 × IQR.
- Flag outliers: Any value below the lower fence or above the upper fence is an outlier.
Say your dataset has Q1 = 80 and Q3 = 90. The IQR is 10. Multiply by 1.5 to get 15. Your lower fence is 65 and your upper fence is 105. Values of 62 or 110 would be flagged as outliers; a value of 72 would not.
John Tukey, the statistician who developed this method, also defined a stricter threshold. Using a multiplier of 3 instead of 1.5 identifies what he called “far out” values, essentially extreme outliers. In the example above, the extreme lower fence would be 80 − 30 = 50, and the extreme upper fence would be 90 + 30 = 120. Points beyond these limits are unusual enough to warrant serious investigation. The 1.5 multiplier is the standard default for most analyses and is what box plots use to draw their whiskers.
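The fence calculation can be sketched in a few lines of Python. One caveat: quartile conventions vary, and `statistics.quantiles` with `method="inclusive"` interpolates over the sorted data rather than taking medians of halves, so the two conventions can differ slightly on some datasets (they agree on the example below, which reproduces the Q1 = 80, Q3 = 90 scenario above).

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag values beyond Tukey's fences. k=1.5 is the standard
    multiplier; k=3 gives the 'far out' extreme fences."""
    # 'inclusive' interpolation may differ slightly from the
    # median-of-halves quartiles described in the text.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [x for x in data if x < lower or x > upper]

data = [62, 72, 80, 82, 85, 88, 90, 92, 110]  # Q1 = 80, Q3 = 90
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)       # fences 65 and 105; 62 and 110 flagged
print(iqr_outliers(data, k=3)[:2])  # extreme fences 50 and 120
```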
The IQR method’s biggest advantage is that it doesn’t assume your data follows a bell curve. Because it relies on the median and quartiles rather than the mean and standard deviation, a single extreme value can’t drag the fences outward and mask other outliers. This makes it the safest choice when you don’t know much about your data’s distribution, or when it’s clearly skewed.
The Z-Score Method: Best for Normal Distributions
If your data follows a roughly normal (bell-shaped) distribution, you can use Z-scores. A Z-score tells you how many standard deviations a data point sits from the mean:
Z = (data point − mean) / standard deviation
Under the empirical rule for normal distributions, 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. So a Z-score beyond +3 or −3 means a data point sits outside where 99.7% of values are expected to land. That’s the conventional cutoff for outliers.
Here’s the catch: the mean and standard deviation are themselves sensitive to outliers. One extreme value pulls the mean toward it and inflates the standard deviation, which can shrink the Z-scores of genuinely unusual points. NIST (the National Institute of Standards and Technology) notes this is especially misleading with small samples, where the maximum possible Z-score is mathematically limited by the sample size. In a dataset of 10 values, for instance, no Z-score can exceed about 2.85, so you’d never flag anything at the traditional threshold of 3.
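A short sketch of the calculation, which also demonstrates the small-sample cap: using the sample standard deviation, no Z-score can exceed (n − 1)/√n, which for n = 10 is about 2.85, regardless of how extreme the point is.

```python
import statistics

def z_scores(data):
    """Z-score of each point using the sample mean and
    sample standard deviation (n - 1 denominator)."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mu) / sd for x in data]

# Small-sample cap: with n = 10, the maximum possible Z-score is
# (n - 1) / sqrt(n) ~= 2.846, even for a wildly extreme point.
data = [5] * 9 + [500]
print(max(z_scores(data)))  # ~2.846, below the usual threshold of 3
```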
Modified Z-Scores for More Robust Results
To fix the sensitivity problem, you can use modified Z-scores. Instead of the mean, you use the median. Instead of the standard deviation, you use the median absolute deviation (MAD), which is the median of all the absolute differences between each data point and the overall median. The modified Z-score replaces the parts of the calculation that outliers can distort with parts they can’t. NIST recommends flagging any data point with a modified Z-score whose absolute value exceeds 3.5 as a potential outlier.
Modified Z-scores give you the intuitive appeal of “how many standard deviations away is this?” while being far more resistant to the very outliers you’re trying to detect.
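In the NIST formulation (due to Iglewicz and Hoaglin), the modified Z-score includes a scaling constant of 0.6745, which makes MAD comparable to the standard deviation for normally distributed data. A minimal sketch:

```python
import statistics

def modified_z_scores(data):
    """Modified Z-scores: 0.6745 * (x - median) / MAD.
    The 0.6745 factor scales MAD to match the standard
    deviation when the data is normal."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    if mad == 0:  # degenerate case: over half the values identical
        raise ValueError("MAD is zero; modified Z-scores undefined")
    return [0.6745 * (x - med) / mad for x in data]

data = [10, 11, 12, 13, 14, 100]
flagged = [x for x, m in zip(data, modified_z_scores(data)) if abs(m) > 3.5]
print(flagged)  # [100]
```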
Formal Tests for Small Datasets
When you have a small dataset and need statistical confidence that a suspicious value is truly an outlier (not just normal variation), two formal hypothesis tests are commonly used.
Grubbs’ Test
Grubbs’ test checks whether the single most extreme value in your dataset is an outlier. The test statistic is the largest absolute deviation from the mean, divided by the standard deviation. You then compare this value against a critical threshold based on your sample size and chosen significance level (often 5%). If your test statistic exceeds the critical value, you reject the assumption that no outliers exist. The key limitation: Grubbs’ test assumes your data is approximately normally distributed, and it only tests one outlier at a time. If you suspect multiple outliers, you’d need to run the test iteratively, removing one flagged point and retesting.
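The test statistic itself is easy to compute; the critical value it must be compared against depends on n and your significance level, and should come from a published Grubbs table (or be derived from the t-distribution). A sketch of the statistic:

```python
import statistics

def grubbs_statistic(data):
    """Grubbs' test statistic G: the largest absolute deviation
    from the mean, divided by the sample standard deviation."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    suspect = max(data, key=lambda x: abs(x - mu))
    return abs(suspect - mu) / sd, suspect

# Compare G against the tabulated critical value for your sample
# size and significance level; if G exceeds it, reject the
# no-outlier hypothesis for the suspect point.
data = [10.1, 10.2, 10.4, 10.5, 16.0]
g, suspect = grubbs_statistic(data)
print(round(g, 3), suspect)  # G ~= 1.785 for the suspect 16.0
```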
Dixon’s Q Test
Dixon’s Q test is designed specifically for very small samples, typically between 3 and 10 observations. You calculate a Q ratio by dividing the gap between the suspect value and its nearest neighbor by the overall range of the data. Then you compare this ratio to a table of critical values. For example, with 5 observations at 95% confidence, the critical Q value is 0.710. If your calculated Q exceeds that, the suspect point is statistically an outlier. The smaller your sample, the larger the gap needs to be before Q flags it, which protects you from over-identifying outliers when you simply don’t have enough data to judge.
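The Q ratio can be sketched directly from that description, using the source's critical value of 0.710 for 5 observations at 95% confidence:

```python
def dixon_q(data):
    """Q ratio for the most extreme value: the gap between the
    suspect and its nearest neighbor, divided by the range."""
    s = sorted(data)
    gap_low = s[1] - s[0]      # if the suspect is the smallest value
    gap_high = s[-1] - s[-2]   # if the suspect is the largest value
    rng = s[-1] - s[0]
    if gap_high >= gap_low:
        return gap_high / rng, s[-1]
    return gap_low / rng, s[0]

data = [2.0, 2.1, 2.2, 2.3, 3.5]
q, suspect = dixon_q(data)
print(round(q, 3), q > 0.710)  # Q = 0.8 for the suspect 3.5: flagged
```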
Multivariate Outliers: When You Have Multiple Variables
All the methods above work on a single column of numbers. But outliers can also hide across combinations of variables. Someone who is 6’5″ isn’t unusual. Someone who weighs 280 pounds isn’t unusual. But depending on the population, someone who is 6’5″ and weighs 120 pounds might be a clear outlier, even though neither measurement alone triggers a flag.
Mahalanobis distance handles this by measuring how far a data point is from the center of all your data, while accounting for correlations between variables. Simple straight-line distance (Euclidean distance) treats every variable independently and assumes they’re all on the same scale. Mahalanobis distance adjusts for the fact that your variables may be correlated and may have very different ranges. It uses the covariance matrix of your dataset to weight the calculation, essentially asking “how many standard deviations away is this point, given the shape and spread of the data cloud?”
Larger Mahalanobis distances indicate observations farther from where most points cluster. You typically compare the squared distances against a chi-squared distribution to set a cutoff, with points exceeding the critical value flagged as multivariate outliers. This is the standard approach in regression diagnostics, survey analysis, and machine learning preprocessing when you’re working with multiple variables at once.
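For two variables the whole calculation fits in plain Python, since the 2×2 covariance matrix can be inverted by hand. This sketch uses the chi-squared critical value for 2 degrees of freedom at the 95% level (about 5.991) as the cutoff; the heights and weights are made-up illustration data:

```python
import statistics

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the
    sample mean, using the inverse sample covariance matrix."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # invert the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in points
    ]

# Heights (in) and weights (lb): eleven points on a strong
# height-weight trend, plus one tall point far lighter than
# the trend predicts -- unusual only in combination.
pts = [(h, 5 * h - 180) for h in range(66, 77)] + [(77, 130)]
d2 = mahalanobis_sq(pts)
# chi-squared cutoff for 2 variables at the 95% level: ~5.991
flagged = [p for p, d in zip(pts, d2) if d > 5.991]
print(flagged)  # [(77, 130)]
```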
Which Method to Use
Your choice comes down to three factors: whether your data is normally distributed, how large your sample is, and how many variables you’re analyzing.
The IQR method is the safest general-purpose choice. It works on skewed data, ordinal data, and distributions that aren’t bell-shaped. It’s also the easiest to calculate by hand and the hardest for outliers to game, since they can’t influence the quartiles the way they can shift a mean. If you’re doing exploratory analysis or aren’t sure about your data’s distribution, start here.
Z-scores (or better yet, modified Z-scores) are appropriate when you’re confident your data is approximately normal. They give you an intuitive measure of extremity: “this point is 4.2 standard deviations from center.” For small samples where normality holds, Grubbs’ test adds formal statistical rigor. For very small samples (under 10 observations), Dixon’s Q test is purpose-built.
For datasets with multiple variables, Mahalanobis distance is the standard tool because it catches points that look normal on any single variable but are unusual in combination.
Why Outlier Detection Matters
A single outlier can dramatically distort your analysis. The mean is pulled toward extreme values, while the median stays put. If your mean and median are far apart in data that should be roughly symmetric, that gap itself is a signal that outliers may be present. The standard deviation and range are similarly inflated by extreme points, which affects everything from confidence intervals to regression lines.
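A two-line demonstration of that pull, using made-up numbers:

```python
import statistics

data = [10, 11, 12, 13, 1000]   # one extreme value
print(statistics.mean(data))    # 209.2 -- dragged toward the outlier
print(statistics.median(data))  # 12 -- unaffected
```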
Finding outliers mathematically is only the first step. Once you’ve flagged them, the real question is why they exist. A data entry error (typing 1000 instead of 100) should be corrected. A measurement from a genuinely different population (an adult’s weight mixed into a dataset of children) should be removed. But a legitimate extreme value, like an unusually high income in a salary survey, is real data that might deserve its own analysis rather than deletion. The math tells you which points are unusual. Context tells you what to do about them.

