What Is an Influential Point in Statistics?

An influential point in statistics is a data point that, if removed, would substantially change the results of a regression analysis. That change might show up in the slope of the regression line, the intercept, the strength of the relationship, or even the conclusions of a hypothesis test. A single influential point can shift your entire model, which makes identifying these points one of the most important steps in any regression analysis.

Understanding influential points requires separating three related but distinct concepts: outliers, leverage, and influence. These terms often get used interchangeably, but they describe different things, and mixing them up leads to real analytical mistakes.

Outliers, Leverage, and Influence

An outlier is a data point whose response value doesn’t follow the general trend of the rest of the data. Think of a scatterplot where most points cluster around a line, but one point sits far above or below it. That point is an outlier in the vertical (y) direction.

A high-leverage point is different. Leverage is about the predictor (x) value, not the response. A data point has high leverage when its x value is far from the other x values in the dataset. Imagine measuring the relationship between hours studied and exam scores for students who studied between 1 and 5 hours. A student who studied 20 hours would be a high-leverage point, simply because their x value is so extreme compared to everyone else’s.
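To make the idea concrete, here is a minimal sketch of how leverage can be computed for a regression with one predictor. The formula used, h_i = 1/n + (x_i - x̄)² / Sxx, is the standard one for simple linear regression. The function name `leverages` and the hours-studied values are made up for illustration.

```python
def leverages(xs):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / Sxx for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical hours studied: five typical students plus one who studied 20 hours.
hours = [1, 2, 3, 4, 5, 20]
h = leverages(hours)

for x, h_i in zip(hours, h):
    print(f"hours={x:>2}  leverage={h_i:.3f}")
```

Notice that the response values never enter the calculation. Leverage depends only on where a point sits along the x axis, and the leverages always add up to the number of regression coefficients (2 here, for the intercept and slope).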

Influence combines both ideas. A point is influential when it actually changes the regression results in a meaningful way. Here’s the key insight: not every outlier is influential, and not every high-leverage point is influential. A point generally needs to be both an outlier and a high-leverage point to exert real influence on the regression line. Penn State’s statistics curriculum walks through four instructive scenarios. Setting aside the baseline case, in which no point is unusual, the other three illustrate this clearly:

  • Outlier but no leverage: A point that falls far from the trend line but sits in the middle of the x range typically won’t pull the slope much. It’s an outlier but not influential.
  • Leverage but not an outlier: A point with an extreme x value that still follows the existing trend won’t change the regression line much either. It has leverage but isn’t influential.
  • Both an outlier and high leverage: A point that sits far from the other x values and far from the trend line has the potential to drag the regression line toward itself. This is where influence happens.

The practical takeaway: you can’t just flag extreme points and assume they’re problems. You have to check whether removing them actually changes your results.
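One way to run that check is to refit the line with and without the suspect point and compare slopes. The sketch below does this for one added point per scenario. All of the data values, the helper `fit_line`, and the scenario names are hypothetical and chosen only to illustrate the pattern.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical baseline: x = 1..10, y roughly 2x + 1 with small noise.
base_x = list(range(1, 11))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
base_y = [2 * x + 1 + e for x, e in zip(base_x, noise)]

# One extra point per scenario (values made up for illustration).
scenarios = {
    "outlier_only": (5.5, 25.0),    # middle of the x range, far above the trend
    "leverage_only": (25.0, 51.0),  # extreme x, but on the trend (2*25 + 1)
    "both": (25.0, 20.0),           # extreme x and far below the trend
}

_, base_slope = fit_line(base_x, base_y)
slope_change = {}
for name, (x_new, y_new) in scenarios.items():
    _, slope = fit_line(base_x + [x_new], base_y + [y_new])
    slope_change[name] = slope - base_slope
    print(f"{name:<14} slope change = {slope_change[name]:+.3f}")
```

In this made-up dataset, only the point that is both an outlier and a high-leverage point moves the slope noticeably. The outlier in the middle of the x range shifts the intercept but leaves the slope essentially untouched.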

How One Point Can Reshape Your Results

The effects of a single influential point can be dramatic. In one textbook example from Penn State’s regression course, including a single problematic data point dropped the R-squared value from 97.32% to 55.19%. That’s the difference between concluding a very strong relationship exists and concluding the relationship is only moderate. One data point changed the entire interpretation.

Influential points also affect the precision of your estimates. In that same example, the standard error of the slope estimate became almost 3.5 times larger when the influential point was included, jumping from 0.200 to 0.686. A larger standard error means wider confidence intervals, which means you’re less certain about the true relationship between your variables. In some cases, a relationship that would be statistically significant without the influential point becomes insignificant with it, or vice versa.
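The same kind of comparison can be sketched in a few lines. The example below does not use the Penn State data. It uses a small hypothetical dataset and a made-up helper, `summarize`, to show how R-squared and the standard error of the slope can be computed with and without a single point.

```python
import math


def summarize(xs, ys):
    """Return slope, R-squared, and the standard error of the slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    sse = syy - slope * sxy   # residual sum of squares
    mse = sse / (n - 2)       # two estimated coefficients
    return {
        "slope": slope,
        "r2": 1 - sse / syy,
        "se_slope": math.sqrt(mse / sxx),
    }


# Hypothetical data: a clean linear trend plus one extreme point at (25, 20).
xs = list(range(1, 11))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

without_point = summarize(xs, ys)
with_point = summarize(xs + [25], ys + [20.0])

for label, s in (("without", without_point), ("with", with_point)):
    print(f"{label:<8} slope={s['slope']:.3f}  R^2={s['r2']:.3f}  SE(slope)={s['se_slope']:.3f}")
```

Here the extra point lowers R-squared and inflates the standard error of the slope, the same direction of effect as in the textbook example, though the magnitudes are specific to this invented data.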

These effects can work in both directions. An influential point might artificially inflate R-squared by creating the appearance of a relationship that doesn’t really exist in the rest of the data. Or it might suppress a genuine relationship by pulling the regression line away from where the bulk of the data points sit.

Measuring Influence With Cook’s Distance

The most widely used tool for quantifying influence is Cook’s Distance. For each data point in your dataset, Cook’s Distance calculates how much all of the predicted values would change if that single point were removed. A large Cook’s Distance means the point is pulling the regression line substantially.
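For readers who want the formula behind that description, the usual textbook definition is sketched below. The notation is an assumption, since the article itself introduces no symbols: \hat{y}_j is the fitted value for observation j using all the data, \hat{y}_{j(i)} is the fitted value for observation j when point i is left out, p is the number of regression coefficients (including the intercept), MSE is the mean squared error of the full fit, e_i is the residual of point i, and h_{ii} is its leverage.

```latex
D_i \;=\; \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \cdot \mathrm{MSE}}
    \;=\; \frac{e_i^{2}}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{\left( 1 - h_{ii} \right)^{2}}
```

The second form makes the earlier point explicit: Cook’s Distance grows with both the size of the residual and the leverage, so a point needs some of each to score high.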

The standard guidelines for interpreting Cook’s Distance are straightforward:

  • Greater than 0.5: The point deserves a closer look. It may be influential.
  • Greater than 1.0: The point is quite likely influential.
  • Stands out from the rest: If one point’s Cook’s Distance is dramatically larger than the others, it’s almost certainly influential, even if it doesn’t cross the 0.5 or 1.0 thresholds.

That third guideline is often the most useful in practice. In many datasets, no points will exceed 1.0, but one or two might have Cook’s Distance values several times larger than the rest. Those relative spikes matter more than any absolute cutoff. Another common rule of thumb you’ll encounter is flagging any point with a Cook’s Distance above 4/n, where n is the sample size. For all but the smallest samples this cutoff is well below 0.5, so it is the stricter screen and tends to flag more points for review.
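As a rough sketch of how these guidelines might be applied, the code below computes Cook’s Distance for a regression with one predictor, using the standard closed-form expression based on residuals and leverage, and then checks each cutoff. The function name `cooks_distances` and the dataset are hypothetical.

```python
def cooks_distances(xs, ys):
    """Cook's Distance for each point in a simple linear regression."""
    n, p = len(xs), 2  # p = number of coefficients (intercept and slope)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]


# Hypothetical data: a clean linear trend plus one extreme point at (25, 20).
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
xs = list(range(1, 11)) + [25]
ys = [2 * x + 1 + e for x, e in zip(range(1, 11), noise)] + [20.0]

d = cooks_distances(xs, ys)
n = len(xs)

for i, d_i in enumerate(d):
    flags = []
    if d_i > 4 / n:
        flags.append("> 4/n")
    if d_i > 0.5:
        flags.append("> 0.5")
    if d_i > 1.0:
        flags.append("> 1.0")
    print(f"point {i:>2}: D = {d_i:7.3f}  {', '.join(flags)}")
```

Printing every value, rather than only the flagged ones, makes it easy to see whether one point stands out from the rest, which is the comparison the third guideline asks for.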

Spotting Influential Points Visually

Many statistical packages produce a Residuals vs. Leverage plot as part of their standard regression diagnostics. This plot puts leverage on the horizontal axis and residuals (usually standardized) on the vertical axis, with dashed curves representing Cook’s Distance thresholds. It’s one of the quickest ways to identify influential points.

The places to watch are the upper right and lower right corners of the plot. Points in those regions combine high leverage (far right on the x-axis) with large residuals (far from zero on the y-axis), which is exactly the combination that produces influence. If a point falls outside the dashed Cook’s Distance lines, it’s influential enough to alter your regression results. When no points are influential, the Cook’s Distance lines may barely be visible because all the data sits comfortably within them.
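The geometry of that plot can be sketched without any plotting library. The code below computes each point’s coordinates (leverage and standardized residual) and the height of a Cook’s Distance contour at a given leverage. It assumes a single predictor and internally standardized residuals. The names `diagnostics` and `cook_contour` and the data are hypothetical.

```python
import math


def diagnostics(xs, ys, p=2):
    """Return (leverage, standardized residual, Cook's Distance) per point."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    rows = []
    for x, e in zip(xs, resid):
        h = 1 / n + (x - x_bar) ** 2 / sxx
        r = e / math.sqrt(mse * (1 - h))          # standardized residual
        rows.append((h, r, r ** 2 / p * h / (1 - h)))
    return rows


def cook_contour(h, d, p=2):
    """|standardized residual| at which Cook's Distance equals d for leverage h."""
    return math.sqrt(d * p * (1 - h) / h)


# Hypothetical data: a clean linear trend plus one extreme point at (25, 20).
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
xs = list(range(1, 11)) + [25]
ys = [2 * x + 1 + e for x, e in zip(range(1, 11), noise)] + [20.0]

rows = diagnostics(xs, ys)
for i, (h, r, d) in enumerate(rows):
    outside = abs(r) > cook_contour(h, 1.0)
    print(f"point {i:>2}: leverage={h:.3f}  std.resid={r:+.2f}  outside D=1 curve: {outside}")
```

Because the contour height shrinks as leverage grows, a point far to the right needs only a modest residual to fall outside the curve, which is why the right-hand corners of the plot are the ones to watch.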

A simple scatterplot of your raw data can also help, especially in a regression with a single predictor. Look for points that sit far from the cluster of other observations in both the x and y directions. If you suspect a point is influential, fit the regression with and without it. If the slope, intercept, or R-squared changes substantially, you’ve confirmed its influence.

What to Do With Influential Points

Finding an influential point doesn’t automatically mean you should remove it. The right response depends on why the point is influential.

If the point reflects a data entry error, a measurement malfunction, or a recording mistake, removing it is straightforward. You’re correcting an error, not manipulating your data. If the point comes from a genuinely different population (for example, a commercial building in a dataset of residential homes), excluding it makes sense because it doesn’t belong to the group you’re studying.

The harder cases are when the influential point is a legitimate observation that just happens to be unusual. Removing real data because it’s inconvenient weakens your analysis. In these situations, the better approach is to report your results both with and without the point, so your audience can see how sensitive your conclusions are to that single observation. If your conclusions hold either way, the influential point isn’t really a problem. If your conclusions flip depending on whether one data point is included, that’s a signal your results aren’t robust enough to draw strong conclusions from.

You can also consider using regression methods that are less sensitive to extreme values. Robust regression techniques downweight influential observations automatically rather than forcing you to make a binary keep-or-remove decision. These approaches give unusual points less say in determining the regression line without discarding them entirely.
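As one illustration of downweighting, the sketch below fits a line by iteratively reweighted least squares with Huber weights. This is only one of several robust methods, and it is not necessarily the one any given software package uses by default. The tuning constant 1.345 and the 0.6745 scaling of the median absolute deviation are conventional choices. The function names and data are hypothetical.

```python
import statistics


def weighted_fit(xs, ys, ws):
    """Weighted least squares for y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def huber_fit(xs, ys, k=1.345, iterations=50):
    """Huber regression via iteratively reweighted least squares."""
    ws = [1.0] * len(xs)
    for _ in range(iterations):
        b0, b1 = weighted_fit(xs, ys, ws)
        resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        med = statistics.median(resid)
        # Robust scale estimate from the median absolute deviation.
        scale = statistics.median([abs(e - med) for e in resid]) / 0.6745
        if scale == 0:
            break
        # Full weight for small residuals, shrinking weight for large ones.
        ws = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in resid]
    return b0, b1, ws


# Hypothetical data: trend y = 2x + 1 plus one vertical outlier at (9, 40).
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
xs = list(range(1, 11)) + [9]
ys = [2 * x + 1 + e for x, e in zip(range(1, 11), noise)] + [40.0]

_, ols_slope = weighted_fit(xs, ys, [1.0] * len(xs))
_, huber_slope, weights = huber_fit(xs, ys)

print(f"OLS slope:   {ols_slope:.3f}")
print(f"Huber slope: {huber_slope:.3f}")
print(f"weight given to the outlier: {weights[-1]:.3f}")
```

The unusual point stays in the dataset but ends up with a small weight, so the fitted slope lands closer to the trend followed by the other points. Huber weighting responds to large residuals, so it is generally less effective against points with extreme x values that have already pulled the line toward themselves.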