The choice comes down to what you’re trying to predict. If your outcome is a number on a continuous scale (like blood pressure, salary, or temperature), use linear regression. If your outcome is a category (like yes/no, pass/fail, or alive/dead), use logistic regression. That single distinction drives everything else: the math behind each model, how you check whether it’s working, and how you interpret the results.
Start With Your Outcome Variable
The outcome variable, sometimes called the dependent variable, is the thing you’re trying to predict or explain. Linear regression expects that variable to be continuous, meaning it can take any numeric value along a range. Days of hospitalization, lung function measurements, house prices, test scores: these are all continuous. The model draws a best-fit line through your data and uses it to estimate a specific number.
Logistic regression expects your outcome to fall into categories, most commonly two (binary). Did the patient survive or die? Did the customer buy or not buy? Was the email spam or not spam? Instead of predicting a number, logistic regression predicts the probability that an observation belongs to one category versus the other. If the predicted probability crosses a threshold (usually 0.5), the model assigns it to that category.
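That thresholding step is simple enough to sketch directly. Here's a minimal illustration in Python; the probabilities are made up, standing in for what a fitted logistic model would produce:

```python
# Turn predicted probabilities into category labels with a 0.5 cutoff.
# The probabilities here are invented for illustration; in practice they
# would come from a fitted logistic regression model.
def classify(probability, threshold=0.5):
    """Assign the positive category when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0

predicted_probs = [0.92, 0.31, 0.50, 0.07]
labels = [classify(p) for p in predicted_probs]
print(labels)  # [1, 0, 1, 0]
```

Note that 0.5 is only the default; when the costs of the two kinds of error differ, you'd move the threshold accordingly.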
When your outcome has more than two categories without a natural order (red, blue, green, for example), you’d use multinomial logistic regression, an extension of the standard binary version. And if those categories do have a natural ranking (mild, moderate, severe), ordinal logistic regression is the appropriate choice. But the core principle stays the same: categories go to logistic, continuous numbers go to linear.
How Each Model Works Under the Hood
Linear regression fits a straight line (or a flat plane, with multiple predictors) through your data points. The equation is simple: predicted outcome equals a baseline value plus the effect of each predictor multiplied by its coefficient. If you’re predicting house price from square footage, the coefficient tells you the dollar increase for every additional square foot. The output can range from negative infinity to positive infinity, which is fine when your outcome is a continuous number.
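For a single predictor, that best-fit line can be computed in a few lines of plain Python. The square-footage and price numbers below are invented (and perfectly linear on purpose) so the coefficient comes out to a clean value:

```python
# Fit a one-predictor linear regression by ordinary least squares.
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
          / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept, slope

sqft  = [1000, 1500, 2000, 2500]              # invented data
price = [200_000, 275_000, 350_000, 425_000]  # exactly linear on purpose
intercept, slope = fit_line(sqft, price)
print(slope)  # 150.0 -> each extra square foot adds $150 to the prediction
```

Real data won't sit exactly on the line, but the interpretation of the slope is the same: the predicted change in outcome per one-unit change in the predictor.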
That unbounded output becomes a problem when you’re predicting a probability, which has to land between 0 and 1. Logistic regression solves this with what’s called a logit transformation. It takes a linear combination of your predictors (which could be any value) and converts it into a probability between 0 and 1 using an S-shaped curve. Mathematically, it models the log of the odds: the natural logarithm of the probability something happens divided by the probability it doesn’t. This keeps predictions in the valid probability range no matter how extreme the input values get.
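The S-shaped curve and the log-odds are inverses of each other, which is easy to verify numerically. A quick sketch:

```python
import math

# The logistic (sigmoid) function squeezes any real number into (0, 1);
# the logit is its inverse: the natural log of the odds p / (1 - p).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

print(sigmoid(0.0))         # 0.5 -- log-odds of zero means a coin flip
print(sigmoid(10.0))        # ~0.99995 -- extreme inputs still stay below 1
print(logit(sigmoid(2.0)))  # ~2.0 -- the two functions undo each other
```

However large the linear combination of predictors gets, `sigmoid` keeps the prediction strictly between 0 and 1, which is exactly the property a probability needs.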
Assumptions You Need to Check
Linear regression relies on several assumptions about the errors in your model (the gaps between predicted and actual values). Those errors should average out to zero, have roughly constant spread across the range of predictions, and not be correlated with each other. If the spread of errors fans out as your predicted values increase, or if errors follow a pattern instead of looking random, the model’s estimates become unreliable. Linear regression also assumes the relationship between each predictor and the outcome is, well, linear: a one-unit increase in a predictor should shift the outcome by the same amount whether the predictor’s value is small or large.
Logistic regression shares some of these requirements but drops others. It doesn’t assume normally distributed errors or constant error variance. It does assume that observations are independent of each other, that each predictor has a linear relationship with the log-odds of the outcome, and that predictors aren’t too highly correlated with one another. When input variables are strongly correlated (multicollinearity), the model can’t cleanly separate their individual effects, and you should consider dropping one of the redundant predictors.
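A quick way to screen for that kind of redundancy is to compute pairwise correlations between predictors before fitting. Here's a sketch using Pearson correlation; the data and the 0.8 cutoff are illustrative (a common rule of thumb, not a hard rule):

```python
import math

# Flag strongly correlated predictor pairs before fitting.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

height_cm = [160, 170, 180, 190]  # invented data
height_in = [63, 67, 71, 75]      # same information, different units
r = pearson_r(height_cm, height_in)
print(r > 0.8)  # True -- perfectly redundant pair: drop one of them
```

For models with many predictors, variance inflation factors (VIF) catch correlations that pairwise checks miss, but the pairwise scan is a sensible first pass.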
Interpreting the Results
In linear regression, interpretation is straightforward. Each coefficient tells you how much the outcome changes for a one-unit increase in that predictor, holding everything else constant. If salary is measured in thousands of dollars, a coefficient of 3.2 for “years of experience” means each additional year is associated with about $3,200 more in salary.
Logistic regression coefficients are less intuitive because they represent changes in log-odds, not direct changes in the outcome. To make them useful, you convert them into odds ratios by raising the mathematical constant e to the power of the coefficient. An odds ratio of 1 means the predictor has no effect on the outcome. An odds ratio greater than 1 means higher values of the predictor are associated with higher odds of the outcome occurring. An odds ratio less than 1 means higher values of the predictor are associated with lower odds.
For example, if a logistic regression modeling heart disease risk produces an odds ratio of 1.4 for a “smoking” variable, that means smokers have 1.4 times the odds of developing heart disease compared to non-smokers, after accounting for the other variables in the model.
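The conversion itself is one line. Here's a sketch matching the smoking example above; the coefficient value is invented to produce an odds ratio of roughly 1.4:

```python
import math

# Convert a logistic regression coefficient (a change in log-odds)
# into an odds ratio by exponentiating it.
coef_smoking = 0.3365  # invented coefficient, chosen so exp(coef) is ~1.4
odds_ratio = math.exp(coef_smoking)
print(round(odds_ratio, 2))  # 1.4 -- smokers have 1.4x the odds
```

The same trick works in reverse: a reported odds ratio of 1.4 corresponds to a raw coefficient of `math.log(1.4)`, about 0.34.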
Measuring How Well Each Model Performs
Because the two models produce different types of output, they use different metrics to gauge performance. For linear regression, R-squared tells you what proportion of the variation in your outcome is explained by the predictors. An R-squared of 0.75 means the model accounts for 75% of the variation. Mean squared error (MSE) and root mean squared error (RMSE) measure how far off predictions are in the same units as your outcome, which makes them easy to interpret practically.
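All three regression metrics fall out of the residuals directly. A sketch on a toy example (the true and predicted values are invented):

```python
import math

# Regression metrics from scratch on invented data.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

errors = [t - p for t, p in zip(y_true, y_pred)]
mse  = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)  # back in the same units as the outcome

mean_y = sum(y_true) / len(y_true)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot  # share of variation explained

print(round(mse, 3), round(rmse, 3), round(r_squared, 3))
# 0.025 0.158 0.98
```

Notice that RMSE (0.158) reads directly in outcome units, while R-squared (0.98) is unitless: the model explains 98% of the variation in this toy data.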
For logistic regression, accuracy (the percentage of correct classifications) is the most basic metric but can be misleading when categories are imbalanced. If 95% of emails are not spam, a model that always predicts “not spam” gets 95% accuracy while being completely useless. More informative metrics include the area under the ROC curve (AUC), which measures how well the model distinguishes between the two categories across all possible thresholds. An AUC of 0.5 means the model is no better than flipping a coin; 1.0 means perfect separation. Log-loss penalizes confident wrong predictions more heavily, making it useful when you care about the quality of the predicted probabilities, not just the final category assignment.
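The spam example above can be made concrete. In this sketch (labels and probabilities invented), the do-nothing baseline scores 90% accuracy while AUC and log-loss tell the fuller story:

```python
import math

# Classification metrics on an imbalanced toy example.
# y_true: 1 = spam, 0 = not spam; probs: model's predicted P(spam).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
probs  = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.4, 0.9]

# Accuracy of the useless "always predict not spam" baseline.
baseline_acc = sum(1 for y in y_true if y == 0) / len(y_true)
print(baseline_acc)  # 0.9 -- looks good, yet it detects zero spam

# AUC: probability a random positive outranks a random negative (ties count half).
def auc(y_true, probs):
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(y_true, probs))  # 1.0 -- every positive scored above every negative

# Log-loss: penalizes confident wrong probabilities heavily.
def log_loss(y_true, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs)) / len(y_true)

print(log_loss(y_true, probs))
```

Here the model's probabilities rank every spam message above every non-spam message (AUC of 1.0), which the raw accuracy of a trivial baseline completely hides.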
Common Mistakes That Lead to the Wrong Model
The most frequent error is using linear regression on a binary outcome. Technically, you can run the math, but the model will predict values below 0 and above 1, which don’t make sense as probabilities. It also violates the constant-variance assumption because the errors behave very differently near the extremes. The result is a model that looks like it works on the surface but produces unreliable predictions and misleading coefficients.
The reverse mistake, using logistic regression on a continuous outcome, is less common but still happens. People sometimes convert a continuous outcome into categories (above average vs. below average, for instance) just so they can run a logistic regression. This throws away information. You lose the distinction between someone who scored just below average and someone who scored far below average. If your outcome is naturally continuous, keep it that way and use linear regression.
Another subtle pitfall is misidentifying your outcome type. Count data (number of hospital visits, number of defective parts) looks continuous but often violates linear regression assumptions because it can’t go below zero and tends to be skewed. Poisson regression or negative binomial regression typically handles count data better. Similarly, outcomes measured as proportions or rates sometimes need specialized approaches rather than plain linear regression.
Quick Decision Framework
- Predicting a continuous number (revenue, weight, duration): linear regression.
- Predicting a yes/no outcome (churned or stayed, sick or healthy, clicked or didn’t): logistic regression.
- Predicting which of several unordered categories (which product a customer buys, which disease a patient has): multinomial logistic regression.
- Predicting an ordered category (pain level as mild/moderate/severe, satisfaction rating): ordinal logistic regression.
- Predicting a count (number of events in a time window): consider Poisson or negative binomial regression instead of either.
If you’re still unsure, plot your outcome variable. A histogram that looks like a bell curve or a spread of values along a range points to linear regression. A bar chart with two or a few distinct categories points to logistic. The data itself usually makes the answer obvious once you look at it.