You should remove outliers only when you have a clear, documentable reason that the data point doesn’t belong, such as a measurement error, equipment malfunction, or data entry mistake. Simply being far from the mean is not enough. Removing outliers without justification inflates your false positive rate and can turn nonsignificant results into misleadingly “significant” ones.
Legitimate Reasons to Remove a Data Point
The strongest case for removal is when you can trace the outlier back to a specific error: a sensor that malfunctioned during recording, a decimal point entered in the wrong place, a test subject who received the wrong treatment, or a sample contaminated by an unrelated process. In each of these cases, the data point doesn’t reflect what you were trying to measure. It’s not an extreme value from your population; it’s a mistake.
Readings outside a physically plausible range also qualify. If you’re measuring human body temperatures and one reading says 180°F, that’s clearly an instrument error. The ARRIVE guidelines for research reporting list equipment miscalibration, sampling errors, and human mistakes like forgetting to switch on recording equipment as scientifically justifiable reasons for exclusion.
The key distinction is between responsible data cleaning and data manipulation. If you can explain why a value is wrong (not just unusual), removal is appropriate. If you’re removing it because it makes your results look better, that’s a problem.
Why Careless Removal Distorts Your Results
Removing outliers without justification doesn’t just clean your data. It systematically biases your statistical tests. Monte Carlo simulations published in the Marshall Journal of Medicine showed what happens when you drop the highest value from one group and the lowest from another before running a t-test. With groups of 10, the test produced a false positive (p < 0.05) over 20% of the time, four times the expected 5% rate. Even with sample sizes in the thousands, the false positive rate remained well above 5%.
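The mechanism is easy to demonstrate. Here is a minimal sketch of that kind of simulation (not the published code): draw two groups of 10 from the same normal population, then compare the pooled t-test rejection rate with and without dropping the highest value from one group and the lowest from the other.

```python
import random
import statistics

def t_statistic(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

random.seed(42)
trials, n = 4000, 10
crit_full = 2.101   # two-sided 5% critical value, df = 18
crit_drop = 2.120   # df = 16 after one point is removed from each group

hits_full = hits_drop = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]   # both groups come from
    b = [random.gauss(0, 1) for _ in range(n)]   # the SAME population
    if abs(t_statistic(a, b)) > crit_full:
        hits_full += 1
    a_drop = sorted(a)[:-1]   # drop the highest value in group A
    b_drop = sorted(b)[1:]    # drop the lowest value in group B
    if abs(t_statistic(a_drop, b_drop)) > crit_drop:
        hits_drop += 1

print(f"false positives, all data kept:    {hits_full / trials:.1%}")
print(f"false positives, extremes dropped: {hits_drop / trials:.1%}")
```

Run it and the full-data rejection rate hovers near the nominal 5%, while the selective-removal rate is several times higher, consistent with the published result.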
This matters because it means selective outlier removal can make two identical populations appear significantly different. The effect isn’t limited to t-tests, either. The same simulations found that nonparametric tests like the Wilcoxon test were less susceptible but still showed elevated false positive rates through sample sizes of about 50 per group.
The practical takeaway: if you remove outliers and then run a standard comparison test, your p-value is probably more optimistic than it should be.
How to Identify Outliers Statistically
Before deciding whether to remove anything, you need a consistent method for flagging unusual values. Three approaches are widely used, each with different strengths.
The IQR Method
This is the simplest and most visual approach. Calculate the interquartile range (the span between the 25th and 75th percentiles), multiply it by 1.5, then subtract that value from the 25th percentile and add it to the 75th. Any point falling outside those fences is flagged as a potential outlier. This is what box plots use to draw their whiskers and plot extreme dots.
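The fence calculation can be sketched in a few lines of standard-library Python. Note that `statistics.quantiles` defaults to the exclusive method, one of several common quartile conventions, so the exact fences may differ slightly from other software.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return (low_fence, high_fence, flagged_points) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [x for x in data if x < low or x > high]

data = [12, 13, 14, 14, 15, 15, 16, 17, 18, 42]
low, high, flagged = iqr_outliers(data)
print(flagged)   # → [42]
```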
Z-Scores and Their Limitations
A Z-score tells you how many standard deviations a point sits from the mean. Values beyond 2 or 3 standard deviations are commonly flagged. The problem is that both the mean and standard deviation are themselves pulled toward outliers, which can mask the very points you’re trying to find. In small samples, the maximum possible Z-score is mathematically limited, making this method especially unreliable when you have fewer than 25 or 30 observations.
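That small-sample ceiling is easy to verify: with the sample standard deviation, no Z-score can exceed (n − 1)/√n, so with n = 10 even a wildly extreme value tops out near 2.85, and a "flag beyond 3" rule can never fire. A quick demonstration:

```python
import statistics

data = [10.0] * 9 + [1000.0]   # one wildly extreme value, n = 10
mean = statistics.mean(data)
sd = statistics.stdev(data)    # sample standard deviation, inflated by the outlier
z_outlier = abs(1000.0 - mean) / sd
bound = (len(data) - 1) / len(data) ** 0.5   # theoretical maximum |z|

print(f"z-score of the outlier: {z_outlier:.3f}")   # → 2.846
print(f"theoretical maximum:    {bound:.3f}")       # → 2.846
```

The outlier sits at the mathematical maximum and still falls short of 3, precisely because it inflated the very standard deviation used to judge it.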
The Modified Z-Score (MAD Method)
A more robust alternative replaces the mean with the median and the standard deviation with the median absolute deviation, or MAD. The median is far less sensitive to extreme values. To calculate the MAD, find the median of your data, take the absolute difference between each observation and that median, then take the median of those absolute differences and multiply by 1.4826 (a scaling constant that makes it comparable to a standard deviation for normal data). Statisticians Iglewicz and Hoaglin recommend flagging any point with a modified Z-score above 3.5 as a potential outlier. NIST endorses this approach over traditional Z-scores.
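A minimal sketch of the modified Z-score, again with only the standard library (the 1.4826 factor is folded into the MAD so the score is directly comparable to an ordinary Z-score):

```python
import statistics

def modified_z_scores(data):
    """Iglewicz-Hoaglin modified Z-scores based on the MAD."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) * 1.4826
    return [(x - med) / mad for x in data]

data = [12, 13, 14, 14, 15, 15, 16, 17, 18, 42]
flagged = [x for x, z in zip(data, modified_z_scores(data)) if abs(z) > 3.5]
print(flagged)   # → [42]
```

On the same ten-point sample where an ordinary Z-score cannot reach 3, the modified score for the extreme value is well above the 3.5 cutoff.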
Influential Points in Regression
In regression analysis, the concern shifts from “is this point extreme?” to “does this point change my results?” A value can be far from others without affecting your regression line much, or it can sit in a position that drags the entire line toward it.
Cook’s Distance is the standard measure for this. It quantifies how much all of your predicted values would change if a single point were removed. Penn State’s guidelines offer a practical rule of thumb: a Cook’s Distance above 0.5 warrants investigation, and above 1.0 the point is quite likely influential. You can also just look for values that visually stand apart from the rest. An influential point doesn’t necessarily need removal, but it does need scrutiny. Run your analysis with and without it and report both results.
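For simple linear regression, Cook's Distance has a closed form built from each point's residual and leverage. The sketch below computes it from scratch for the one-predictor case; in practice you would more likely use a library routine (statsmodels exposes it via `results.get_influence().cooks_distance`). The data here are made up for illustration: seven points near a straight line plus one point far out in x and off the trend.

```python
import statistics

def cooks_distances(x, y):
    """Cook's Distance for each point in a simple linear regression y = a + b*x."""
    n, p = len(x), 2   # p = number of fitted parameters (intercept + slope)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(r ** 2 for r in resid) / (n - p)
    distances = []
    for xi, r in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx   # leverage of point i
        distances.append(r ** 2 / (p * mse) * h / (1 - h) ** 2)
    return distances

x = [1, 2, 3, 4, 5, 6, 7, 20]                     # last point far out in x...
y = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 14.1, 10.0]   # ...and far off the trend
d = cooks_distances(x, y)
print([round(v, 2) for v in d])   # the last point dwarfs the rest
```

The influential point's distance lands far above the 1.0 threshold while every other point stays below it, exactly the pattern the rule of thumb is designed to catch.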
Alternatives to Deletion
Removing data always means losing information. Two common alternatives keep every data point while reducing the influence of extremes.
Trimming discards a fixed percentage from both ends of your data (say, the top and bottom 5%) and calculates the mean from what remains. This is straightforward but reduces your sample size.
Winsorizing replaces those extreme values with the most extreme remaining value instead of dropping them. If you winsorize at 5%, the bottom 5% of values all become equal to the value at the 5th percentile, and the top 5% become equal to the value at the 95th percentile. Simulation studies show that winsorizing produces more efficient and more robust estimates than trimming, with roughly 10% better efficiency for each increment of trimming applied. Because winsorizing preserves your full sample size, it’s generally the better choice when you need a resistant measure of center.
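Both ideas are simple to sketch. With only ten points, trimming or winsorizing "one from each end" corresponds to 10% rather than 5%; the example uses a count to keep the arithmetic explicit.

```python
def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    s = sorted(data)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def winsorized_mean(data, k):
    """Mean after clamping the k extreme values at each end to the nearest kept value."""
    s = sorted(data)
    low, high = s[k], s[len(s) - k - 1]
    return sum(min(max(x, low), high) for x in s) / len(s)

data = [12, 13, 14, 14, 15, 15, 16, 17, 18, 42]
print(sum(data) / len(data))      # → 17.6, ordinary mean dragged up by 42
print(trimmed_mean(data, 1))      # → 15.25, computed from 8 points
print(winsorized_mean(data, 1))   # → 15.3, computed from all 10 points
```

Note that the winsorized mean is computed over the full sample, which is exactly where its efficiency advantage over trimming comes from.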
A third option is to skip removal entirely and use statistical methods that aren’t sensitive to outliers. Nonparametric tests, robust regression, and median-based analyses all handle extreme values without requiring you to justify deleting them.
How to Document Outlier Decisions
Whatever you decide, transparency is what separates good practice from questionable practice. The ARRIVE guidelines, widely adopted in experimental research, offer a clear standard: report every exclusion, the reasoning behind it, and whether your exclusion criteria were defined before data collection began. If you used a software tool’s built-in outlier test (like GraphPad Prism’s ROUT method), state that explicitly.
Ideally, your exclusion criteria are set in your analysis plan before you see the data. Deciding after the fact which points to drop invites bias, even unintentionally. If you’re working with groups, note whether you were blinded to group assignments when making exclusion decisions. A simple table or flowchart showing how many data points started in each group and how many were excluded (with reasons) gives readers everything they need to evaluate your choices.
The strongest approach is to present your primary analysis with all data included, then show a sensitivity analysis with outliers removed. If your conclusions hold either way, the outlier question becomes moot. If they don’t, that discrepancy is itself an important finding.

