Stepwise regression is an automated method for building a regression model by adding or removing predictor variables one at a time, based on statistical tests at each step. Instead of manually deciding which variables belong in your model, the algorithm evaluates candidates and keeps only those that meet a significance threshold. It’s one of the most widely used variable selection techniques in statistics and data science, though it comes with important limitations.
How Stepwise Regression Works
The core idea is simple: rather than testing every possible combination of predictors (which becomes impractical with more than a handful of variables), stepwise regression builds a model incrementally. At each step, it runs statistical tests on the available predictors and decides whether to add one, remove one, or stop. The process repeats until no remaining change improves the model enough to justify another step.
The decision at each step typically relies on p-values from partial F-tests. Many software packages set the default significance threshold at 0.15 for both adding and removing variables. That’s more lenient than the familiar 0.05 cutoff used in hypothesis testing, because the goal here is screening rather than final inference. If a variable’s p-value falls below 0.15, it’s considered a candidate for the model. If its p-value later rises above 0.15 (because another variable entered the model and changed the picture), it gets removed.
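To make the mechanics concrete, here is a minimal numpy-only sketch of the partial F-test behind a single step: compare the residual sums of squares of the model with and without one candidate predictor. The data, variable names, and coefficients are all illustrative.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of an OLS fit on design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

reduced = np.column_stack([np.ones(n), x1])     # model without x2
full = np.column_stack([np.ones(n), x1, x2])    # model with x2
q = 1                                           # number of parameters added

f_stat = ((rss(reduced, y) - rss(full, y)) / q) / (rss(full, y) / (n - full.shape[1]))
# The p-value is the upper tail of an F(q, n - p) distribution, e.g.
# scipy.stats.f.sf(f_stat, q, n - full.shape[1]); if it falls below the
# 0.15 entry threshold, x2 would be added to the model.
print(round(f_stat, 1))   # large here, since x2 genuinely contributes
```

The same comparison, run once per candidate variable, is what drives each add-or-remove decision.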
There are three main flavors of this approach, and the differences come down to where you start.
Forward Selection
Forward selection starts with an empty model containing no predictors at all. In the first step, the algorithm fits a separate one-predictor model for every candidate variable and checks which one has the strongest relationship with the outcome. If that variable’s p-value is below the entry threshold (typically 0.15), it enters the model.
In the second step, the algorithm tries every remaining variable alongside the one already selected, forming two-predictor models. Again, the best-performing addition enters if it meets the threshold. This process repeats, adding one variable per step, until no remaining predictor clears the bar. At that point, the algorithm stops and returns whatever model it has built so far.
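The forward loop above can be sketched in a few lines. This version swaps the p-value entry rule for an AIC comparison (a criterion discussed later in this article) so the example needs only numpy; the data and threshold behavior are illustrative, not a reference implementation.

```python
import numpy as np

def aic(X_cols, y):
    """AIC of an OLS fit with intercept: n*log(RSS/n) + 2*(number of parameters)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_select(cols, y):
    """Greedily add the column that lowers AIC most; stop when none does."""
    selected = []
    best = aic([], y)                    # start from the intercept-only model
    remaining = list(range(len(cols)))
    while remaining:
        score, j = min((aic([cols[i] for i in selected] + [cols[j]], y), j)
                       for j in remaining)
        if score >= best:                # no candidate clears the bar: stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Toy data: y depends on columns 0 and 1; column 2 is pure noise.
rng = np.random.default_rng(0)
n = 200
x0, x1, x2 = rng.normal(size=(3, n))
y = 2.0 * x0 - 1.5 * x1 + rng.normal(scale=0.5, size=n)
chosen = forward_select([x0, x1, x2], y)
print(sorted(chosen))   # the informative columns 0 and 1 should be selected
```

Note the one-way behavior: once a column is appended to `selected`, nothing in this loop ever reconsiders it.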
The limitation of pure forward selection is that it never looks back. A variable that was useful early on might become redundant once other predictors enter, but forward selection won’t remove it.
Backward Elimination
Backward elimination works in the opposite direction. You start with every candidate variable in the model, then remove the weakest one. Specifically, the algorithm identifies the predictor with the highest p-value (the least significant contributor). If that p-value exceeds the removal threshold, the variable is dropped. The model is refit, and the process repeats until every remaining variable is statistically significant enough to stay.
Some implementations use adjusted R-squared instead of p-values as the criterion. In that version, the algorithm removes a variable only if doing so actually increases the adjusted R-squared, meaning the simpler model explains the data better after accounting for its reduced complexity. When no removal improves the score, the algorithm stops.
Backward elimination has a practical advantage: because it starts with all variables in the model, it can detect relationships that only appear when other predictors are already accounted for. The tradeoff is that it requires fitting a full model first, which isn’t always feasible when you have more candidate variables than observations.
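The adjusted R-squared variant of backward elimination described above can be sketched as follows; this is a toy numpy-only illustration, with made-up data and function names, not a production routine.

```python
import numpy as np

def adj_r2(X_cols, y):
    """Adjusted R-squared of an OLS fit with intercept."""
    n, k = len(y), len(X_cols)
    A = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

def backward_eliminate(cols, y):
    """Drop the variable whose removal most improves adjusted R-squared."""
    kept = list(range(len(cols)))
    best = adj_r2([cols[i] for i in kept], y)
    while len(kept) > 1:
        score, j = max((adj_r2([cols[i] for i in kept if i != j], y), j)
                       for j in kept)
        if score <= best:          # no removal improves the score: stop
            break
        best = score
        kept.remove(j)
    return kept

# Toy data: columns 0 and 2 carry signal; columns 1 and 3 are noise.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=n)
cols = [X[:, i] for i in range(4)]
kept = backward_eliminate(cols, y)
print(sorted(kept))   # the signal columns 0 and 2 survive
```

Notice that the very first call fits the full four-variable model, which is exactly the step that fails when candidates outnumber observations.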
Bidirectional (Standard Stepwise)
The standard stepwise method combines both approaches. It generally starts like forward selection, adding the strongest predictor first. But after each addition, it pauses and checks whether any variable already in the model has become non-significant. If so, that variable is removed before the next forward step.
For example, suppose variable X1 enters the model first. In the next step, X2 is added. The algorithm then rechecks X1’s significance in the presence of X2. If X1’s p-value has risen above the removal threshold, it gets dropped. This back-and-forth continues until neither adding nor removing any variable would improve the model. This is the version most people mean when they say “stepwise regression” without further qualification.
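The add-then-recheck cycle can be sketched by combining the two directions in one loop. As before, this hypothetical version uses AIC in place of p-value thresholds so it needs only numpy; the toy data is built so that a redundant proxy variable can enter early and later become removable.

```python
import numpy as np

def aic(X_cols, y):
    """AIC of an OLS fit with intercept: n*log(RSS/n) + 2*(number of parameters)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def stepwise(cols, y):
    """Alternate forward additions with backward rechecks until neither helps."""
    selected = []
    best = aic([], y)
    while True:
        moved = False
        # Forward step: try adding each unused column.
        candidates = [j for j in range(len(cols)) if j not in selected]
        if candidates:
            score, j = min((aic([cols[i] for i in selected] + [cols[j]], y), j)
                           for j in candidates)
            if score < best:
                best, moved = score, True
                selected.append(j)
        # Backward step: recheck every column already in the model.
        if len(selected) > 1:
            score, j = min((aic([cols[i] for i in selected if i != j], y), j)
                           for j in selected)
            if score < best:
                best, moved = score, True
                selected.remove(j)
        if not moved:
            return selected

# x2 is a noisy proxy for x0 + x1: it can enter early on its strong
# marginal correlation with y, then become redundant once x0 and x1 are in.
rng = np.random.default_rng(4)
n = 300
x0, x1 = rng.normal(size=(2, n))
x2 = x0 + x1 + rng.normal(size=n)
y = x0 + x1 + rng.normal(scale=0.5, size=n)
sel = stepwise([x0, x1, x2], y)
print(sorted(sel))   # x0 and x1 end up in the model
```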
Selection Criteria Beyond P-Values
P-values are the traditional gatekeeping tool, but stepwise procedures can also use information criteria to compare models. The two most common are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both measure how well a model fits the data while penalizing complexity, but they differ in how harsh that penalty is.
AIC applies a lighter penalty, so it tends to keep more variables in the final model. BIC penalizes complexity more heavily and generally produces simpler, more parsimonious models. In R, the built-in step() function uses AIC by default. Other criteria include Mallows’s Cp and cross-validation scores, though these are less commonly used in standard stepwise implementations.
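The difference between the two penalties is easy to see numerically. For OLS, both criteria reduce (up to a constant) to n*log(RSS/n) plus a per-parameter penalty: 2 for AIC, log(n) for BIC. A small numpy-only illustration:

```python
import numpy as np

def ic(X, y, penalty):
    """n*log(RSS/n) + penalty * k for an OLS fit with k parameters
    (Gaussian log-likelihood up to an additive constant)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + penalty * A.shape[1]

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = X[:, 0] + rng.normal(size=n)

aic = ic(X, y, 2.0)          # AIC: penalty of 2 per parameter
bic = ic(X, y, np.log(n))    # BIC: penalty of log(n) per parameter
print(aic < bic)             # same fit, heavier BIC penalty: prints True
```

Because log(n) exceeds 2 whenever n > 7 or so, BIC charges more for every extra variable, which is exactly why it tends to return smaller models.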
Why Stepwise Regression Is Controversial
Stepwise regression remains popular in practice, but statisticians have raised serious concerns about it for decades. The problems aren’t minor technicalities. They affect the reliability of whatever model the procedure selects.
The most fundamental issue is inflated significance. Because the algorithm tests many variables across multiple steps, the p-values it produces are misleadingly small. Each step involves fitting and comparing models, but the final p-values aren’t adjusted to reflect all that repeated testing. A variable might appear significant in the final model simply because the algorithm had many chances to find something that looked good by chance.
R-squared values are similarly biased upward. The iterative fitting process optimizes for the training data, so the model’s apparent explanatory power overstates how well it would perform on new data. This is a form of overfitting: the algorithm captures noise specific to your dataset rather than genuine patterns.
Stability is another concern. Small changes in the data, such as adding or removing a few observations, can produce entirely different final models. Methodologists have noted that stepwise regression capitalizes on sampling error and has poor replicability, meaning the variable set it selects in one sample often fails to replicate in another.
Large datasets create their own problems. With enough observations, nearly every predictor will achieve a statistically significant p-value, even if its actual effect is trivially small. In those situations, stepwise regression tends to include variables that are technically significant but practically meaningless.
Sample Size Requirements
Stepwise regression is especially sensitive to the ratio between observations and candidate variables. A commonly cited rule of thumb is that you need at least five observations for every variable in your candidate pool. If you’re considering 50 potential predictors, that means 250 observations at minimum. With fewer data points per variable, the algorithm is more likely to fit random noise in the dataset, producing models that look good on paper but fail to generalize.
This requirement is more demanding than it sounds. Many real-world analyses involve dozens or even hundreds of candidate predictors, and collecting enough data to satisfy a 5-to-1 ratio isn’t always practical. When the ratio falls short, the selected model becomes increasingly unreliable.
Modern Alternatives
Much of the criticism of stepwise regression has driven interest toward regularization methods, particularly Lasso regression. Lasso works differently: instead of adding and removing variables through a series of hypothesis tests, it fits all variables simultaneously but applies a penalty that shrinks weaker coefficients toward zero. Variables that contribute little or nothing get their coefficients set to exactly zero, effectively removing them from the model.
This approach addresses several of stepwise regression’s weaknesses. It handles the variable selection and model fitting in a single step rather than through iterative testing, which avoids the inflated p-value problem. It also tends to produce more stable results across different samples. Ridge regression is a related technique that shrinks coefficients but doesn’t set any to zero, so it improves prediction accuracy without actually selecting a subset of variables.
Stepwise regression selects a sparse model but can sacrifice prediction accuracy. Ridge regression provides stable predictions but doesn’t simplify the model. Lasso was specifically designed to do both: reduce complexity and maintain predictive performance.
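To show the mechanism rather than a library call, here is a minimal lasso sketch via cyclic coordinate descent with soft-thresholding. It assumes standardized predictors and a centered response, uses only numpy, and the data and regularization strength are illustrative; in practice you would use an established implementation such as scikit-learn's Lasso.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within t of zero become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent; assumes each column of X satisfies x'x/n == 1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam)
    return beta

rng = np.random.default_rng(5)
n, p = 500, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize columns
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)
y = y - y.mean()

beta = lasso_cd(X, y, lam=0.5)
print(np.round(beta, 2))   # the three noise columns shrink to exactly zero
```

The exact zeros are the point: the penalty performs the variable selection itself, with no sequence of hypothesis tests and no inflated p-values.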
Running Stepwise Regression in Practice
In R, the step() function handles stepwise selection using AIC as the default criterion. You can specify forward, backward, or bidirectional search by setting the direction argument. The stepAIC() function in the MASS package offers additional flexibility, including the ability to switch between AIC and BIC by adjusting a single parameter.
Python’s major statistics libraries don’t include a classical p-value-based stepwise function. The closest equivalents are sequential feature selectors: scikit-learn provides SequentialFeatureSelector in sklearn.feature_selection, and the mlxtend library offers its own SequentialFeatureSelector with additional options. Both let you specify forward or backward selection, the number of features to keep, and a scoring metric such as R-squared or negative mean squared error; you pair the selector with a regression model from scikit-learn and fit it to your data. Note that these tools select features by cross-validated predictive score rather than by significance tests. It’s more manual than R’s one-line approach, but it gives you fine-grained control over the process.
Regardless of the tool, the output of stepwise regression should be treated as exploratory rather than confirmatory. The selected model is a reasonable starting point, not a definitive answer about which variables matter. Validating the final model on a separate dataset, or using cross-validation, is essential before drawing conclusions from the results.

