What Is Stepwise Regression? Methods, Uses, and Limits

Stepwise regression is an automated method for building a statistical model by adding or removing predictor variables one at a time. Instead of manually deciding which variables belong in your model, the algorithm tests each one against a statistical criterion and keeps only those that improve the model’s fit. The process repeats until no remaining variable meets the threshold for inclusion or removal.

It’s one of the most widely used techniques for narrowing down a large set of potential predictors to a smaller, more manageable model. It’s also one of the most criticized. Understanding how it works, and where it falls short, will help you decide whether it belongs in your analysis.

How the Three Methods Work

There are three main flavors of stepwise regression: forward selection, backward elimination, and bidirectional elimination. They differ in where they start and how they iterate, but all share the same goal of arriving at a model with only the most useful predictors.

Forward Selection

Forward selection begins with an empty model containing no predictors. At each step, the algorithm evaluates every available variable and adds the one whose inclusion produces the biggest improvement in model fit. It then re-evaluates the remaining variables, adds the next best one, and continues until no remaining variable meets the entry criterion. Think of it as building a team by auditioning one player at a time.
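The loop itself is simple enough to sketch. In the Python sketch below, `score` stands in for whatever fit measure you're using (R-squared, negative AIC, and so on), and the function name and `min_improvement` argument are illustrative rather than any particular library's API:

```python
def forward_selection(candidates, score, min_improvement=0.0):
    """Greedy forward selection: start empty, add the best variable each round."""
    selected = []
    best_score = score(selected)  # baseline fit of the empty model
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Audition every remaining variable and keep the one that helps most.
        trials = {c: score(selected + [c]) for c in remaining}
        best = max(trials, key=trials.get)
        if trials[best] - best_score <= min_improvement:
            break  # entry criterion not met: stop
        selected.append(best)
        best_score = trials[best]
    return selected
```

Higher scores are assumed to mean better fit here; with a criterion like AIC, where lower is better, you would negate it.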

Backward Elimination

Backward elimination takes the opposite approach. It starts with every candidate predictor already in the model, then removes the weakest one: the variable contributing the least to model fit. After each removal it re-evaluates what’s left, dropping another variable if warranted. The process stops when every remaining predictor meets the threshold for staying. This is like starting with a full roster and cutting players who aren’t pulling their weight.
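The mirror-image loop can be sketched the same way, again with `score` standing in for the fit measure and the names being illustrative:

```python
def backward_elimination(candidates, score, max_loss=0.0):
    """Backward elimination: start full, repeatedly cut the weakest variable."""
    selected = list(candidates)
    best_score = score(selected)  # fit of the full model
    while selected:
        # Score the model with each variable removed in turn.
        trials = {c: score([f for f in selected if f != c]) for c in selected}
        weakest = max(trials, key=trials.get)  # removal that hurts fit the least
        if best_score - trials[weakest] > max_loss:
            break  # every remaining variable earns its keep
        selected.remove(weakest)
        best_score = trials[weakest]
    return selected
```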

Bidirectional Elimination

Bidirectional elimination combines both strategies. At each step, the algorithm can add a new variable or remove an existing one, depending on which action improves the model most. This flexibility matters because a variable that looked useful early on may become redundant once better predictors enter the model. Bidirectional elimination catches those situations and drops the now-unnecessary variable, something forward selection alone cannot do.

What Decides Whether a Variable Stays or Goes

The algorithm needs a rule for judging each variable, and the most common options fall into two categories: significance-based thresholds and information criteria.

With significance-based thresholds, the algorithm uses p-values to decide. A variable enters the model if its p-value falls below a set “alpha to enter” and gets removed if its p-value rises above an “alpha to remove.” Many statistical software packages default both of these to 0.15, which is deliberately more lenient than the traditional 0.05 cutoff. The looser threshold keeps the algorithm from prematurely excluding variables that might prove useful in combination with others.
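The decision rule itself reduces to a pair of comparisons. A minimal Python sketch, where the function names are illustrative and the p-values would in practice come from fitted models:

```python
ALPHA_ENTER = ALPHA_REMOVE = 0.15  # common software defaults

def entry_candidate(p_values, alpha_enter=ALPHA_ENTER):
    """Return the variable with the smallest p-value if it clears the entry bar."""
    best = min(p_values, key=p_values.get)
    return best if p_values[best] < alpha_enter else None

def removal_candidate(p_values, alpha_remove=ALPHA_REMOVE):
    """Return the variable with the largest p-value if it exceeds the removal bar."""
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > alpha_remove else None
```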

Information criteria take a different approach. The two most common are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Both balance model fit against model complexity: they reward accuracy but penalize every additional variable you include. AIC is the more lenient of the two and tends to keep more predictors. BIC applies a heavier penalty for extra variables, pushing the algorithm toward simpler models. Other criteria, such as adjusted R-squared and Mallows’s Cp, are also used, but AIC and BIC dominate modern practice.
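For an ordinary least-squares model with n observations, k estimated parameters, and residual sum of squares RSS, both criteria can be written (dropping an additive constant) in the form below; the only difference is the penalty term, 2k for AIC versus k·ln(n) for BIC:

```python
import math

def aic(n, rss, k):
    """AIC for a Gaussian OLS fit, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    """BIC for the same fit: the per-variable penalty grows with sample size."""
    return n * math.log(rss / n) + k * math.log(n)
```

Since ln(n) exceeds 2 once n reaches 8, BIC punishes each extra variable harder than AIC in any realistically sized dataset, which is why it tends toward smaller models.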

The choice of criterion matters. Switching from AIC to BIC on the same dataset can produce a noticeably different final model, so it’s worth understanding which criterion aligns with your goal. If prediction accuracy is the priority, AIC is a reasonable default. If you want a more parsimonious, interpretable model, BIC’s stricter penalty is a better fit.

A Step-by-Step Example

Imagine you have a dataset with one outcome variable (say, house price) and ten candidate predictors (square footage, lot size, number of bedrooms, age of the house, and so on). Here’s what forward selection would look like in practice:

  • Step 0: The model is empty. No predictors, just the average house price as the baseline.
  • Step 1: The algorithm fits ten separate one-predictor models, one for each candidate. Square footage produces the largest improvement in fit and meets the entry criterion, so it enters the model.
  • Step 2: With square footage already in the model, the algorithm tests the remaining nine variables by adding each one individually. Number of bedrooms provides the biggest additional improvement, so it joins the model.
  • Step 3: The process repeats. This time, lot size helps the most, so it’s added.
  • Stopping: Eventually, no remaining variable improves the model enough to meet the entry threshold. The algorithm stops, and the final model includes only the predictors that earned their way in.

Backward elimination would run the same logic in reverse, starting with all ten predictors and peeling away the least useful one at each step. Bidirectional elimination would do both, potentially adding lot size at one step and then removing number of bedrooms at the next if it became redundant.

Why Statisticians Are Cautious About It

Stepwise regression has a long list of well-documented problems, and the criticisms are serious enough that many statisticians discourage its use for drawing conclusions about which variables truly matter.

The biggest concern is biased coefficients. A study published in the Journal of Clinical Epidemiology found “considerable overestimation” of regression coefficients for variables selected through stepwise methods. Because the algorithm tests many variables and keeps the winners, the coefficients of those winners get inflated, similar to how a sports tournament’s top scorer looks more dominant than they really are because you’re only watching the highlights. This overestimation gets worse with smaller datasets.

A related problem is underestimated uncertainty. After the algorithm finishes selecting variables, most software calculates confidence intervals and p-values as if you had chosen those variables in advance. It ignores all the testing and comparing that happened behind the scenes. The result is confidence intervals that are too narrow and p-values that are too small, making your results look more precise and significant than they actually are.

There’s also the issue of instability. In small datasets, simulation studies have shown that stepwise methods have limited ability to consistently identify truly important variables. At the same time, there’s a real risk of selecting variables that are essentially random noise, simply because so many comparisons are being made. Remove a handful of data points or add a few new ones, and the algorithm may choose an entirely different set of predictors.

Finally, stepwise regression can overfit the training data. The selected model may perform well on the dataset it was built from but poorly on new data, because it has latched onto patterns specific to that particular sample rather than genuine underlying relationships.

When Stepwise Regression Is Still Useful

Despite the criticisms, stepwise regression has a legitimate role in certain situations, particularly as an exploratory tool rather than a confirmatory one.

If you’re in the early stages of research with dozens of potential predictors and no strong theory about which ones matter, stepwise methods can help you identify a shorter list of candidates worth investigating more carefully. The key distinction is treating the output as a hypothesis to test later, not as a finished conclusion. As one review from the Institute of Education Sciences put it, stepwise methods “can be appropriate for variable evaluation” when used to understand how different predictors behave across multiple models.

To use it responsibly, experts recommend several practices. First, don’t let the computer make the final call on which model is “best.” Use the results alongside your knowledge of the subject matter. Second, compare the stepwise results against a best subsets procedure, which evaluates every possible combination of predictors rather than building the model sequentially. Third, if your dataset is large enough, split it in half, run the stepwise procedure on one half, and test the resulting model on the other half. If the model holds up, it’s more trustworthy. Fourth, consider adjusting the default alpha thresholds rather than accepting the software defaults uncritically.
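The split-sample check in the third recommendation is straightforward to sketch. In the Python sketch below, `select` stands in for whatever stepwise procedure you run and `evaluate` for your out-of-sample metric; both names are illustrative:

```python
import random

def split_half_validate(rows, select, evaluate, seed=0):
    """Run variable selection on one half of the data, score it on the other."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    model = select(train)                # e.g. run the stepwise procedure here
    return model, evaluate(model, test)  # honest, out-of-sample performance
```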

Software Options

In R, the most common tool is the stepAIC() function from the MASS package, which defaults to AIC as its selection criterion. You can switch to BIC by adjusting a single argument. The base R step() function works similarly. For more flexibility across different regression types (logistic regression, Poisson regression, and others), the StepReg package supports forward, backward, bidirectional, and best subset strategies with a range of criteria.

Python’s ecosystem is less unified. The statsmodels library, the closest equivalent to R’s regression tools, does not include a built-in stepwise function. Most Python users write custom loops that mimic the forward or backward procedure using AIC or p-values as the stopping rule. Several third-party packages and code recipes exist, but there’s no single standard implementation the way there is in R.
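Such a custom loop usually looks something like the sketch below. Here `fit_aic` is a stand-in for whatever fitting routine you use (with statsmodels, it would fit an OLS model on the chosen columns and return its AIC); the function name is illustrative:

```python
def forward_by_aic(candidates, fit_aic):
    """Custom forward loop: add the variable that lowers AIC most, stop when
    no remaining variable lowers it at all."""
    selected = []
    current = fit_aic(selected)  # AIC of the intercept-only model
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Fit one model per candidate and sort by AIC, smallest first.
        scored = sorted((fit_aic(selected + [c]), c) for c in remaining)
        best_aic, best_var = scored[0]
        if best_aic >= current:
            break  # no candidate improves AIC: stop
        selected.append(best_var)
        current = best_aic
    return selected
```

The same skeleton serves for backward elimination or a p-value stopping rule by swapping the scoring step and the stopping test.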

In commercial software like Minitab and SPSS, stepwise regression is available through menu-driven interfaces where you specify the method (forward, backward, or stepwise) and set the alpha-to-enter and alpha-to-remove thresholds. Minitab defaults both to 0.15.