Cook’s distance is a statistic that measures how much a single data point influences the results of a linear regression model. It answers a simple but important question: if you removed one observation from your dataset, how much would your regression predictions change? A large Cook’s distance means that one point is pulling your model’s results in a noticeable way.
Introduced by statistician R. Dennis Cook in a 1977 paper published in Technometrics, the measure has become one of the most widely used regression diagnostics. It combines two properties of each data point, its leverage and its residual, into a single number that flags potentially problematic observations.
How Cook’s Distance Works
Every data point in a regression has two characteristics that matter for influence. The first is its residual: how far the point sits from the regression line. Points with large residuals are outliers in the vertical direction. The second is its leverage: how extreme the point’s position is along the horizontal axis (the predictor variables). A point with high leverage sits far from the center of the data and has more potential to tilt the regression line.
Cook’s distance rolls both of these into one measure. A point can have a large Cook’s distance because it has a big residual, because it has high leverage, or both. This is what makes it more useful than looking at residuals or leverage alone. An outlier that happens to fall near the center of your predictor values won’t move the regression line much. A high-leverage point that falls right on the trend line won’t either. But a point that is both far from the line and far from the center of the data can dramatically shift your model’s predictions.
Technically, Cook’s distance for each observation summarizes how much all of the fitted values change when that observation is deleted. It captures the global effect on every prediction your model makes, not just the prediction at that one point.
Outliers, Leverage, and Influence
These three terms get confused often, but they describe different things. An outlier is a point with a large residual: it doesn’t fit the pattern the rest of the data follows. A high-leverage point has unusual predictor values, sitting far from the bulk of the data in the x-direction. An influential point is one that actually changes the regression results when removed.
The key insight is that not all outliers are influential, and not all high-leverage points are influential. A high-leverage point that happens to align perfectly with the overall trend reinforces the regression line rather than distorting it. Cook’s distance identifies points that are genuinely influential, the ones where removing them would shift your regression coefficients and predictions in a meaningful way.
Common Threshold Values
There is no formal statistical test for Cook’s distance. Instead, analysts use rules of thumb. The two most common cutoffs are:
- Cook’s distance greater than 1: This is the more conservative rule. Points above 1 are generally considered highly influential and worth investigating closely.
- Cook’s distance greater than 4/n: Here, n is the number of observations in your dataset. This lower cutoff is more sensitive and flags more points, especially in smaller datasets. In a dataset of 100 observations, any point with a Cook’s distance above 0.04 would be flagged.
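A quick sketch of applying both rules to a hypothetical vector of Cook's distance values (the values here are made up for illustration):

```python
import numpy as np

# Hypothetical Cook's distances for n = 100 observations
rng = np.random.default_rng(0)
cooks_d = rng.exponential(scale=0.01, size=100)
cooks_d[10] = 1.5   # one genuinely extreme point

n = len(cooks_d)
over_one = np.where(cooks_d > 1)[0]      # conservative rule: few points
over_4n = np.where(cooks_d > 4 / n)[0]   # 4/n rule (0.04 here): flags more
print(over_one, over_4n)
```

The 4/n list always contains the D > 1 list, since 4/n is below 1 for any dataset with more than four observations.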
Neither threshold is absolute. A point that exceeds the cutoff isn’t automatically wrong or invalid. It simply deserves closer attention. You might discover it’s a data entry error, an unusual but legitimate case, or a sign that your model doesn’t capture the full picture.
Reading Diagnostic Plots
Most statistical software produces a residuals-versus-leverage plot with Cook’s distance overlaid as contour lines (usually dashed red curves). Points that fall outside these dashed lines have high Cook’s distance values and are considered influential.
The spots to watch are the upper-right and lower-right corners of the plot. Those are where points combine high leverage with large residuals. When no points cross the dashed lines, your data is relatively well-behaved and no single observation is dominating the model. When a point clearly crosses those boundaries, the software typically labels it with its row number so you can investigate.
A common visual pattern: if all your data clusters on the left side of the plot and one labeled point sits far to the right beyond the dashed lines, that point is pulling your regression results. Removing it and refitting the model will show you how much the coefficients change in practice.
Computing Cook’s Distance
In R, fitting a linear model with lm() and calling plot() on the result produces diagnostic plots that include Cook’s distance, and you can extract the values directly with cooks.distance(). In Python, the statsmodels library provides Cook’s distance through the OLSInfluence class (also reachable via a fitted model’s get_influence() method), whose cooks_distance attribute returns the distance for every observation along with an accompanying p-value. Both approaches compute the measure automatically from your fitted model, so there’s no need to calculate it by hand.
What To Do With Flagged Points
Finding a point with a high Cook’s distance is the beginning of an investigation, not an automatic reason to delete data. Start by checking whether the observation is a data entry error or measurement mistake. If it is, correcting or removing it is straightforward.
If the point is legitimate, try fitting the model with and without it and compare the results. If removing one observation changes your conclusions, your model may be fragile. That could mean your sample is too small, your model is missing an important variable, or the relationship you’re modeling isn’t as straightforward as a simple linear regression assumes. In some cases, a single influential point reveals a subgroup in your data that behaves differently from the rest.
Dropping valid data purely because it’s influential is generally bad practice. A better approach is to report your results both ways (with and without the point) and discuss the sensitivity of your findings.
Limitations With Multiple Influential Points
Cook’s distance evaluates one observation at a time. This creates a vulnerability called masking: when multiple influential points cluster together, they can make each other look normal. Each point’s influence is partially absorbed by the others, so none of them individually triggers a high Cook’s distance value. Research on this problem has shown that masking depends on where the outliers are located, how their residuals relate to each other, and how they’re arranged in the predictor space.
A related problem is swamping, where the presence of genuine outliers causes ordinary observations to be incorrectly flagged as influential. Both masking and swamping become more likely when a dataset contains several high-leverage points. If you suspect multiple problematic observations, methods that evaluate groups of points simultaneously (rather than one at a time) are more reliable than relying on Cook’s distance alone.

