Regression analysis estimates the relationship between a dependent variable (the outcome you care about) and one or more independent variables (the factors you think influence it). The process follows a consistent sequence: choose the right type of regression, check that your data meets the necessary conditions, fit the model, evaluate how well it performs, and interpret the results. Here’s how to work through each step.
Choose the Right Type of Regression
The nature of your outcome variable determines which regression method to use. If your outcome is continuous, like revenue in dollars, blood pressure readings, or time spent on a task, linear regression is the standard choice. If your outcome is categorical, like whether a customer churned (yes/no) or whether a patient survived, you need logistic regression instead. Trying to force a categorical outcome into a linear model produces nonsensical predictions, like probabilities below 0 or above 1.
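To see why forcing a categorical outcome into a linear model fails, here's a quick sketch in Python (with made-up churn data): a straight-line fit on a 0/1 outcome produces "probabilities" below 0 and above 1 at the extremes.

```python
import numpy as np

# Hypothetical data: a 0/1 churn outcome against a single predictor.
x = np.arange(10, dtype=float)
churned = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Fit a straight line (intercept + slope) by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, churned, rcond=None)
predictions = X @ coef

# The linear fit produces "probabilities" outside the valid [0, 1] range
# at the extremes -- exactly the nonsensical predictions described above.
print(predictions[0], predictions[-1])
```

Logistic regression avoids this by modeling the log-odds, which maps every prediction back into the 0-to-1 range.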
Within linear regression, you’ll also decide between simple and multiple regression. Simple regression uses a single predictor variable. Multiple regression uses two or more predictors, which is far more common in practice because most outcomes are influenced by several factors at once. The rest of this guide focuses primarily on multiple linear regression, since it’s the most widely used form, but the general workflow applies to other types as well.
Prepare and Screen Your Data
Before fitting any model, spend time with your data. Look for missing values, obvious data entry errors, and variables that need recoding. Categorical predictors (like region or treatment group) need to be converted into dummy variables, where each category gets its own 0/1 column, before they can enter a regression equation.
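Here's what dummy coding looks like in plain Python, using made-up region labels. One category (the reference level) gets no column of its own; it's represented by all dummies being 0.

```python
# A minimal sketch of dummy coding; the region labels are made up.
regions = ["North", "South", "West", "South", "North"]

# One 0/1 column per category except the reference level ("North" here),
# which is encoded as all dummies equal to 0.
levels = ["South", "West"]
dummies = [[1 if r == level else 0 for level in levels] for r in regions]
print(dummies)
```

In practice a library call (for example pandas' `get_dummies`) does the same thing; the point is simply that each category becomes its own 0/1 column.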
Check for outliers that could distort your results. A single extreme data point can pull an entire regression line off course. One useful diagnostic is Cook’s Distance, which measures how much each data point influences the overall model. A Cook’s Distance value greater than 0.5 warrants a closer look, and a value greater than 1 strongly suggests that point is distorting your results. You don’t automatically delete outliers, but you should investigate them. Sometimes they’re errors. Sometimes they represent real but unusual cases that your model needs to account for differently.
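Cook's Distance can be computed directly from the fitted model. A NumPy sketch on toy data with one deliberately planted outlier (everything here is made up for illustration):

```python
import numpy as np

# Toy data: y = 2x exactly, except one planted outlier at the last point.
x = np.arange(10, dtype=float)
y = 2 * x
y[-1] = 50.0  # would be 18 on the true line; a deliberate outlier

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# The hat matrix gives each point's leverage; residuals come from the OLS fit.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
resid = y - H @ y                       # H @ y are the fitted values
mse = (resid ** 2).sum() / (n - p)

# Cook's Distance: (residual^2 / (p * MSE)) * leverage / (1 - leverage)^2
cooks_d = (resid ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2
print(cooks_d.round(2))
```

The planted outlier's Cook's Distance comes out above 1, while every other point stays far below the 0.5 threshold.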
Verify the Assumptions
Linear regression relies on several conditions being true about your data. When these conditions hold, the ordinary least squares (OLS) method produces the best (lowest-variance) estimates among all linear unbiased estimators. When they’re violated, your results can be misleading. Here’s what to check.
Linearity
The relationship between each predictor and the outcome should be roughly linear. Plot each predictor against the outcome and look for straight-line patterns rather than curves. If you see a curved relationship, you may need to transform the variable (taking the logarithm or square root, for example) or add a polynomial term.
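As a quick illustration (with made-up exponential-growth data), a straight line fits the logarithm of a curved outcome far better than the outcome itself:

```python
import numpy as np

# Made-up data following exponential growth: a straight line fits log(y)
# essentially perfectly, but fits y itself noticeably worse.
x = np.linspace(0, 5, 50)
y = np.exp(0.5 * x)

X = np.column_stack([np.ones_like(x), x])

def r_squared(X, target):
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1 - (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()

print(r_squared(X, y))          # well below 1: the line misses the curve
print(r_squared(X, np.log(y)))  # essentially 1: the log straightens it out
```

Remember that transforming a variable changes the interpretation of its coefficient: with a log-transformed outcome, effects become multiplicative rather than additive.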
Independence of Observations
Each data point should be independent of the others. This assumption is violated when you have repeated measurements on the same person, data collected over time where values carry over from one period to the next, or observations clustered within groups (students within classrooms, for instance). If your data has this structure, you’ll need a specialized method like mixed-effects modeling rather than standard regression.
Constant Spread of Residuals
The variability in your outcome should be roughly the same across all levels of your predictors. Statisticians call this homoscedasticity. To check it, plot the model’s predicted values on the horizontal axis against the residuals (the differences between predicted and actual values) on the vertical axis. You want to see a random scatter with no pattern. If the residuals fan out in a cone shape, getting wider as predicted values increase, the spread isn’t constant and your significance tests become unreliable.
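If you'd like a number to accompany the plot, one crude check (not a formal test like Breusch-Pagan) is to correlate the absolute residuals with the fitted values: a clearly positive correlation suggests the fan shape. A sketch with simulated fan-shaped data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated fan-shaped data: the noise scale grows with x, deliberately
# violating the constant-spread condition.
x = np.linspace(1, 10, 200)
y = 2 * x + rng.normal(0, 0.5 * x)

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted

# Crude numeric stand-in for eyeballing the residual plot: if |residual|
# correlates with the fitted values, the spread is not constant.
fan_signal = np.corrcoef(np.abs(resid), fitted)[0, 1]
print(round(fan_signal, 2))
```

On well-behaved data this correlation hovers near zero; here it comes out clearly positive because the noise was built to widen with x.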
Normally Distributed Residuals
The residuals should follow an approximately normal (bell-shaped) distribution. A histogram of residuals or a Q-Q plot, where residuals are plotted against what you’d expect from a normal distribution, will reveal obvious departures. Moderate deviations are tolerable with large samples, but severe skewness or heavy tails can distort confidence intervals.
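A simple numeric companion to the histogram and Q-Q plot is sample skewness, the third standardized moment of the residuals: values near 0 suggest symmetry, while large positive values indicate a long right tail. A sketch with simulated residuals:

```python
import numpy as np

def skewness(resid):
    """Sample skewness: the third standardized moment of the residuals."""
    z = (resid - resid.mean()) / resid.std()
    return (z ** 3).mean()

rng = np.random.default_rng(1)
symmetric = rng.normal(size=2000)          # roughly bell-shaped residuals
skewed = rng.exponential(size=2000) - 1.0  # heavily right-skewed residuals

print(round(skewness(symmetric), 2))  # near 0
print(round(skewness(skewed), 2))     # well above 1
```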
No Multicollinearity
When two or more predictors are highly correlated with each other, the model struggles to separate their individual effects. This is called multicollinearity, and it inflates the uncertainty around your coefficients. The standard diagnostic is the Variance Inflation Factor (VIF), calculated for each predictor. A VIF above 5 signals a potential problem, and a VIF above 10 indicates serious multicollinearity. The fix is usually to drop one of the correlated predictors, combine them into a single composite variable, or use a technique like ridge regression that’s designed to handle correlated predictors.
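VIF is straightforward to compute by hand: regress each predictor on all the others and take 1 / (1 − R²). A NumPy sketch with a deliberately near-duplicate predictor:

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor: regress column j on the other predictors
    (plus an intercept) and return 1 / (1 - R^2)."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1 - (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(0, 0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)               # unrelated to the others
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])
```

The two near-duplicate columns blow past the VIF threshold of 10, while the independent predictor stays near 1.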
Fit the Model
With your data screened and assumptions checked, you’re ready to run the regression. Every major statistics package (R, Python’s scikit-learn or statsmodels, SPSS, Stata, Excel) can fit a regression model. You specify the outcome variable and the predictor variables, and the software estimates the coefficients using OLS, which minimizes the sum of squared differences between predicted and actual values.
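The core computation is compact enough to show with NumPy alone. The data below is made up and generated without noise, so OLS recovers the true coefficients exactly:

```python
import numpy as np

# Noise-free toy data generated from y = 3 + 2*x1 - 1*x2, so OLS should
# recover those coefficients up to floating-point error.
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 3 + 2 * x1 - 1 * x2

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(100), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(3))
```

In practice you'd use a package like statsmodels, which reports standard errors, p-values, and fit statistics alongside the coefficients; the estimates themselves are the same.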
If you have many candidate predictors and aren’t sure which to include, there are a few common approaches. You can start with all predictors and remove the weakest ones (backward elimination), start with none and add the strongest ones (forward selection), or rely on theory and domain knowledge to select variables before looking at the data. The theory-driven approach is generally more trustworthy because automated selection methods can capitalize on noise in your specific dataset.
Evaluate the Model’s Fit
Once the model is fitted, you need to assess how well it explains your data. The primary metric is R-squared, which tells you what proportion of the variation in your outcome is explained by your predictors. An R-squared of 0.65 means the model accounts for 65% of the variation, with the remaining 35% unexplained.
There’s an important catch: standard R-squared can only go up (or stay the same) when you add more predictors, even if those predictors are useless. This makes it tempting to stuff the model with variables to inflate the number. Adjusted R-squared corrects for this by penalizing the addition of predictors that don’t genuinely improve the model. Unlike standard R-squared, adjusted R-squared can decrease when you add a weak predictor. The gap between the two metrics is most pronounced with small samples, many predictors, or both. Always report adjusted R-squared when comparing models with different numbers of predictors.
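Both formulas are short. A sketch with simulated data, including a pure-noise predictor to show that plain R-squared cannot drop when a variable is added:

```python
import numpy as np

def fit_r2(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def adjusted_r2(r2, n, k):
    """Adjusted R^2; k = number of predictors, excluding the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
junk = rng.normal(size=n)        # pure noise, unrelated to the outcome
y = 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, junk])

r2_small, r2_big = fit_r2(X_small, y), fit_r2(X_big, y)
# Plain R^2 never decreases when a predictor is added, even a useless one;
# adjusted R^2 applies a penalty for the extra variable.
print(round(r2_small, 3), round(r2_big, 3))
print(round(adjusted_r2(r2_small, n, 1), 3), round(adjusted_r2(r2_big, n, 2), 3))
```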
Beyond R-squared, examine your residual plots. Plot residuals against fitted values and against each individual predictor. A properly fitted model produces residuals that look like random noise: no curves, no patterns, no funneling. If you see a curved trend in residuals plotted against a predictor, the relationship with that variable isn’t purely linear and you may need to revisit your model specification.
Interpret the Coefficients
The regression output gives you a coefficient (often labeled “b” or “beta”) for each predictor. Each coefficient tells you the expected change in the outcome when that predictor increases by one unit, holding all other predictors constant. For example, if you’re predicting salary and the coefficient for years of experience is 2,400, the model estimates that each additional year of experience is associated with $2,400 more in salary, after accounting for the other variables in the model.
That “one unit” is defined by whatever scale the variable is measured on. A one-unit increase in years of experience is one year. A one-unit increase in a satisfaction score measured 1 to 7 is one point on that scale. This means you can’t directly compare the size of coefficients across predictors measured on different scales. A coefficient of 2,400 for years of experience and 500 for satisfaction score doesn’t necessarily mean experience matters more.
To compare the relative importance of predictors, use standardized coefficients, which most software can produce alongside the raw ones. Standardized coefficients express changes in standard deviations rather than raw units, putting all predictors on a common scale. The predictor with the largest absolute standardized coefficient has the strongest association with the outcome.
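Standardizing amounts to z-scoring every variable and refitting; equivalently, each raw coefficient gets multiplied by sd(x) / sd(y). A sketch echoing the salary example above (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
experience = rng.normal(10, 4, size=n)     # years; hypothetical scale
satisfaction = rng.normal(4, 1.5, size=n)  # 1-to-7 style score; hypothetical
salary = 2400 * experience + 500 * satisfaction + rng.normal(0, 5000, size=n)

def ols(X, y):
    """Fit OLS with an intercept and return the slope coefficients."""
    coef, *_ = np.linalg.lstsq(
        np.column_stack([np.ones(len(X)), X]), y, rcond=None)
    return coef[1:]

X = np.column_stack([experience, satisfaction])
raw = ols(X, salary)

# Standardized coefficients: z-score every variable, then refit. This equals
# multiplying each raw coefficient by sd(x_j) / sd(y).
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (salary - salary.mean()) / salary.std()
standardized = ols(Xz, yz)

print(raw.round(1), standardized.round(3))
```

Here experience dominates on the standardized scale even though the raw coefficients alone don't make that comparison legitimate.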
Test for Statistical Significance
Each coefficient comes with a p-value that tests whether that predictor has a real relationship with the outcome or whether the observed association could be due to chance. The conventional threshold is p < 0.05, meaning there’s less than a 5% probability of seeing an association at least this strong if no real relationship existed. Some fields set stricter thresholds: genome-wide association studies, for example, commonly require p-values below 5 × 10⁻⁸ because of the massive number of comparisons being tested simultaneously.
A few things to keep in mind about p-values. A small p-value tells you the relationship is unlikely to be zero. It does not tell you the relationship is large or practically meaningful. A predictor can be statistically significant but trivially small in effect size. Conversely, a predictor with a p-value of 0.08 isn’t necessarily useless; your sample may simply be too small to detect the effect reliably. Always pair significance testing with the actual size of the coefficient. Ask yourself whether that magnitude of change in the outcome matters in the real world.
The overall model also gets an F-test, which tells you whether your set of predictors, taken together, explains a significant amount of variation in the outcome. If the overall F-test isn’t significant, the model as a whole isn’t performing better than simply using the average of the outcome as your prediction.
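Both the per-coefficient t-tests and the overall F-test fall out of the fitted model. A NumPy sketch on simulated data, using a normal approximation to the t distribution for the p-values (fine at this sample size; real software uses the exact t):

```python
import math
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)              # a genuinely predictive variable
x2 = rng.normal(size=n)              # pure noise
y = 2 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
mse = (resid ** 2).sum() / (n - p)

# Standard errors from the diagonal of MSE * (X'X)^-1, then t statistics.
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t = coef / se

# Two-sided p-values via the normal approximation: p = erfc(|t| / sqrt(2)).
p_values = [math.erfc(abs(ti) / math.sqrt(2)) for ti in t]

# Overall F statistic: does the predictor set beat the mean-only model?
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
f_stat = (r2 / (p - 1)) / ((1 - r2) / (n - p))
print([round(v, 4) for v in p_values], round(f_stat, 1))
```

The real predictor's p-value comes out far below 0.05 and the overall F statistic is large, as you'd expect when the model genuinely outperforms the mean.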
Report Your Results
A complete regression report includes the coefficients (both raw and standardized), their standard errors, p-values, the overall R-squared and adjusted R-squared, the F-statistic, and your sample size. Describe what you did to check assumptions and how you handled any violations. If you removed outliers or transformed variables, say so and explain why.
When describing individual predictors, state the direction and size of the effect in plain language. Rather than “b = 3.2, p = 0.003,” write something like “each additional hour of weekly exercise was associated with a 3.2-point reduction in anxiety scores, holding other factors constant.” Then note whether that effect was statistically significant. This makes your findings accessible to anyone reading the report, not just people comfortable with statistical notation.

