What Is Multiple Logistic Regression and How It Works

Multiple logistic regression is a statistical method that predicts a yes-or-no outcome using two or more input variables. If you want to know whether a patient will develop a disease, whether a customer will buy a product, or whether a student will pass an exam, and you have several pieces of information that might influence that outcome, multiple logistic regression is the tool designed for exactly that job. It works by calculating the probability of the outcome falling into one category versus the other, expressed as a value between 0 and 1.

How It Differs From Simple Logistic Regression

Simple logistic regression looks at the relationship between one input variable and a binary outcome. Multiple logistic regression does the same thing but with two or more input variables at once. The critical advantage is that it evaluates the independent effect of each variable on the outcome while adjusting for the influence of all the other variables in the model. This adjustment is what allows researchers to separate real effects from confounding, where one variable only appears important because it’s tangled up with another.

For example, if you’re studying whether smoking predicts lung disease, age is a confounder because older people both smoke more and get lung disease more often. Multiple logistic regression lets you include age, sex, and smoking status simultaneously, isolating each factor’s contribution to the outcome.

The Math Behind the Model

Unlike regular linear regression, which can spit out any number on a continuous scale, logistic regression needs to keep its predictions between 0 and 1 (since it’s predicting a probability). It accomplishes this through something called the logit transformation. The model first builds a standard linear equation combining all the input variables, then converts that equation’s output into a probability.

The linear piece looks familiar: a constant plus each variable multiplied by its own coefficient. What the model actually predicts with this equation is the log odds of the outcome occurring. “Odds” here means the probability of the event happening divided by the probability of it not happening, and “log odds” is just the natural logarithm of that ratio. This log-odds value can range from negative infinity to positive infinity, which makes the linear math work. The model then converts it back into a probability between 0 and 1 using the logistic function: p = 1 / (1 + e^(−z)), where z is the log odds from the linear equation.
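That two-step process (linear equation in log-odds space, then the logistic conversion) is short enough to sketch directly. The coefficients below are invented for illustration, not taken from any fitted model:

```python
import math

def log_odds_to_probability(z):
    """Convert a log-odds value (any real number) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: intercept plus two predictors (coefficients are made up).
b0, b_age, b_smoker = -4.0, 0.05, 1.2

def predict_probability(age, smoker):
    z = b0 + b_age * age + b_smoker * smoker  # the linear piece: log odds
    return log_odds_to_probability(z)

# A 60-year-old smoker: z = -4 + 3 + 1.2 = 0.2, a probability just above 0.5
print(round(predict_probability(60, 1), 3))
```

Note that the linear part (z = 0.2) is unbounded, while the converted probability can never escape the 0-to-1 range, which is the whole point of the logit transformation.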

You don’t need to do this math by hand. Statistical software handles it automatically. But understanding the concept helps you interpret what the model’s coefficients actually mean.

Interpreting Odds Ratios

Each input variable in the model gets a coefficient. On its own, that coefficient represents the change in log odds for every one-unit increase in that variable, holding all other variables constant. Since log odds aren’t intuitive, researchers typically convert coefficients into odds ratios by raising the mathematical constant e to the power of the coefficient.

An odds ratio of 1.0 means the variable has no effect on the outcome. An odds ratio above 1.0 means higher values of that variable increase the odds of the outcome. Below 1.0 means they decrease the odds. For instance, if a model predicting college admission produces an odds ratio of 1.14 for math scores, that means each additional point on the math test multiplies the odds of admission by 1.14, or increases them by 14%.
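The coefficient-to-odds-ratio conversion is a one-liner. A minimal sketch, with a made-up coefficient chosen so the result lands near the 1.14 figure above:

```python
import math

# Hypothetical log-odds coefficient for math score (illustrative, not a real fit).
b_math = 0.131

odds_ratio = math.exp(b_math)             # e^0.131, approximately 1.14
percent_change = (odds_ratio - 1) * 100   # roughly 14% higher odds per extra point
```

The conversion also runs in reverse: the natural log of an odds ratio recovers the coefficient, which is why software often reports both.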

Things get more interesting with interaction terms. If the model includes an interaction between gender and math score, the effect of math score can differ for male and female students. In one UCLA teaching example, the odds ratio for math score was 1.14 for male students but 1.22 for female students. The ratio of those two odds ratios (1.22 / 1.14 = 1.07) matched the exponentiated coefficient for the interaction term, confirming that the gender difference was captured correctly.
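That ratio-of-odds-ratios check can be verified arithmetically: the female odds ratio is the male odds ratio multiplied by the exponentiated interaction coefficient. A sketch with coefficients back-solved from the numbers above (illustrative values, not the actual UCLA fit):

```python
import math

# Coefficients reverse-engineered from the odds ratios in the text (illustrative).
b_math = math.log(1.14)          # math-score effect for the reference group (males)
b_interaction = math.log(1.07)   # additional effect for female students

or_male = math.exp(b_math)                      # 1.14
or_female = math.exp(b_math + b_interaction)    # 1.14 * 1.07, about 1.22
ratio = or_female / or_male                     # equals exp(b_interaction), 1.07
```

Because adding log odds corresponds to multiplying odds ratios, the interaction coefficient on the log-odds scale becomes a multiplicative adjustment on the odds-ratio scale.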

Each coefficient also comes with a p-value and confidence interval. A 95% confidence interval that doesn’t cross 1.0 (for odds ratios) or 0 (for log-odds coefficients) indicates the variable has a statistically significant effect at the conventional 5% level.

Handling Categorical Variables

Continuous variables like age or blood pressure can enter the model directly. Categorical variables like race, education level, or marital status cannot, because their numerical codes (1, 2, 3) don’t represent meaningful quantities. These variables need to be recoded into a set of binary indicators, often called dummy variables.

If a categorical variable has four levels, the software creates three new binary variables (always one fewer than the number of categories). One category serves as the reference group, and each dummy variable compares its category to that reference. For a variable like race with categories White, Black, Hispanic, and Asian, choosing White as the reference means the model produces separate coefficients comparing Black to White, Hispanic to White, and Asian to White. The choice of reference group changes the coefficients but not the model’s overall predictions.

For ordinal variables where the categories have a meaningful order (like education level: high school, bachelor’s, master’s, doctorate), alternative coding systems such as polynomial or difference coding can capture trends across levels rather than just pairwise comparisons.

Assumptions the Model Requires

Multiple logistic regression is more flexible than linear regression in some ways (it doesn’t assume the outcome is normally distributed, for example), but it still has requirements that must be met for results to be trustworthy.

  • Independence of observations. Each data point must be unrelated to the others. Repeated measurements from the same person, or students clustered within the same classroom, violate this assumption.
  • Linearity in the logit. Continuous input variables need to have a linear relationship with the log odds of the outcome. This doesn’t mean the relationship with the probability itself is linear (it isn’t), but the log-odds relationship should be.
  • No multicollinearity. Input variables shouldn’t be too highly correlated with each other. A common diagnostic is the variance inflation factor (VIF). A VIF of 1 means no correlation with other predictors. Values above 4 warrant investigation, and values above 10 signal serious multicollinearity that needs to be addressed, typically by removing or combining variables.
  • No strongly influential outliers. A small number of extreme data points shouldn’t be disproportionately driving the model’s results.
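The VIF diagnostic from the multicollinearity check is easiest to see in the two-predictor case, where each predictor's VIF reduces to 1 / (1 − r²), with r the Pearson correlation between the two. (With more predictors, the r² comes from regressing each predictor on all the others.) A sketch with invented data:

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

age    = [34, 45, 52, 61, 70, 28, 49]          # made-up sample
weight = [70, 80, 85, 88, 95, 62, 83]          # tracks age closely in this sample

r = pearson_r(age, weight)
vif = 1.0 / (1.0 - r ** 2)
# Here the VIF lands well above 10, signaling the two predictors are redundant.
```

In a case like this, the usual remedies are dropping one of the two predictors or combining them into a single composite variable.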

Notably, logistic regression does not require the input variables to be normally distributed and does not assume equal variance across groups, which makes it applicable in situations where linear regression would struggle.

Sample Size: The Events Per Variable Rule

One of the most common mistakes with multiple logistic regression is building a model with too many variables for the available data. The key constraint isn’t the total number of observations but the number of events, meaning occurrences of whichever outcome category is less common.

The traditional guideline requires at least 10 events per variable (EPV) in the model for the coefficients to be estimated accurately. So if you’re predicting a disease that occurs in 40 people in your dataset and you have 5 input variables, you’re at 8 EPV, which is below the threshold. More recent work suggests that 20 EPV provides meaningfully better performance, particularly when you need the model’s predictions to hold up in new data. Below 10 EPV, coefficients become unreliable and the model is likely to overfit, performing well on the data it was built on but poorly on anything new.
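The EPV check itself is just division, applied before any model fitting. A sketch of the scenario from the text (40 events, 5 candidate predictors):

```python
# Events per variable: the count of the rarer outcome divided by predictor count.
def events_per_variable(n_events, n_predictors):
    return n_events / n_predictors

epv = events_per_variable(40, 5)   # 8.0, below the traditional threshold of 10
if epv < 10:
    print(f"EPV = {epv:.0f}: too many predictors for the number of events")
```

Note that each dummy variable and each interaction term counts as its own predictor in this calculation, so a "single" categorical variable with four levels uses up three slots.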

Evaluating Model Performance

A logistic regression model outputs probabilities, so evaluating it requires deciding on a cutoff (say, 0.5) above which you classify someone as “yes” and below which you classify them as “no.” Different cutoffs produce different trade-offs between catching true positives and avoiding false positives.

The ROC curve plots this trade-off across every possible cutoff, graphing the true positive rate against the false positive rate. The area under this curve (AUC) summarizes overall performance in a single number. An AUC of 0.5 means the model is no better than flipping a coin. An AUC of 1.0 means perfect classification. In practice, an AUC between 0.7 and 0.8 is considered acceptable for many applications, and above 0.8 is generally strong. AUC is especially useful for comparing two competing models on the same dataset, though it works best when the outcome categories are roughly balanced in size.
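AUC also has a useful equivalent definition: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, with ties counted as half. A sketch that computes it directly from that definition, using invented scores:

```python
# AUC via pairwise comparison of positive and negative cases.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.65, 0.4, 0.3, 0.2]   # made-up model probabilities
labels = [1,   1,   0,    1,   0,   0]     # made-up true outcomes

# 8 of the 9 positive-negative pairs are ranked correctly, so AUC = 8/9
print(round(auc(scores, labels), 3))
```

This formulation makes clear why AUC is cutoff-free: it evaluates the ranking of cases, not any single classification threshold.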

Real-World Applications

Multiple logistic regression appears across nearly every field that deals with binary outcomes, but it’s particularly common in clinical research. One published study used it to build a diagnostic screening model for COVID-19 based on patient symptoms at presentation. The model incorporated 16 symptoms (coded as present or absent), age, sex, and the total number of symptoms, including interaction terms between variables. The goal was to predict whether a patient was infected based on readily available clinical information, which could support rapid screening when lab testing was delayed or unavailable.

Similar models are used routinely to predict surgical complications, identify patients at risk for readmission, assess the likelihood of treatment response, and flag individuals who might benefit from preventive interventions. Outside of medicine, the same method predicts loan defaults, customer churn, email spam classification, and election outcomes. Any situation where the question boils down to “which of two groups does this case belong to, and what factors predict it” is a natural fit for multiple logistic regression.