When to Use Poisson Regression for Count Data

Poisson regression is the right tool when your outcome variable is a count: the number of times something happened. Hospital visits per year, insurance claims per month, defects per batch, goals per match. If you’re predicting a non-negative integer and the events occur independently, Poisson regression is your starting point. Using it well, though, means checking a few assumptions, because violating them can produce misleading results.

What Makes Count Data Special

The instinct with any prediction problem is to reach for ordinary linear regression. With count data, that instinct leads to trouble. Linear regression treats the outcome as continuous and unbounded, so it can take any value on the number line, including negatives and fractions. Counts can’t be negative, they’re always whole numbers, and they tend to be right-skewed, with most observations clustered near zero and a long tail stretching upward. A linear model fitted to this kind of data can predict negative counts (nonsensical), produce residuals whose spread grows with the predicted value, and generate confidence intervals whose stated coverage can’t be trusted.

Poisson regression handles these problems by modeling the natural log of the expected count rather than the count itself. Because the exponential of any number is always positive, the model’s predictions can never dip below zero. This log link also means the relationship between your predictors and the outcome is multiplicative rather than additive: a one-unit increase in a predictor multiplies the expected count by a fixed factor, which is how count processes typically behave in the real world.
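
A minimal sketch of the log link in action, using made-up coefficients (b0 and b1 are illustrative values, not estimates from any real data). It shows that predictions stay positive even for extreme predictor values, and that each one-unit step in x multiplies the expected count by the same factor, exp(b1).

```python
import math

# Hypothetical coefficients for: log(expected count) = b0 + b1 * x
b0, b1 = 0.5, 0.3

def expected_count(x):
    """Expected count under a log link; exp() keeps it above zero."""
    return math.exp(b0 + b1 * x)

# Multiplicative effect of a one-unit increase in x
rate_ratio = math.exp(b1)

for x in (-10, 0, 1, 2):
    print(f"x={x:>3}  expected count={expected_count(x):.4f}")

# The ratio between neighbouring values of x is always exp(b1)
print("count(1) / count(0):", expected_count(1) / expected_count(0))
print("count(2) / count(1):", expected_count(2) / expected_count(1))
print("exp(b1):            ", rate_ratio)
```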

The Four Conditions to Check

Poisson regression rests on four assumptions, and all four need to hold for the results to be trustworthy.

  • Count outcome. The response variable is a count per unit of time, space, or exposure. Not a continuous measurement, not a proportion, not a ranking.
  • Independence. Each observation must be independent. If you’re counting events for the same person across multiple time periods, those counts are likely correlated, and you’ll need a model that accounts for that clustering.
  • Mean equals variance. This is the signature property of the Poisson distribution. For any given combination of predictor values, the expected count and the variance around it should be roughly equal. When the variance substantially exceeds the mean (overdispersion), the model underestimates your standard errors, making predictors look statistically significant when they aren’t.
  • Log-linear relationship. The log of the expected count should change linearly with your predictors. If the true relationship is curved, you’ll need transformations or polynomial terms.

Of these four, the mean-equals-variance assumption is the one most frequently violated in practice, and it’s the one that causes the most damage when ignored.
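
As a rough first look before fitting anything, you can compare the sample mean and variance of the counts within each level of a predictor. The sketch below uses invented visit counts for two hypothetical clinics. It is only a crude check, since the assumption concerns the variance given all predictors, but a ratio far above 1 in a group is an early warning.

```python
from statistics import mean, variance

# Hypothetical visit counts, grouped by a single categorical predictor
visits = {
    "clinic_a": [2, 3, 1, 4, 2, 3, 2, 1],
    "clinic_b": [0, 0, 9, 1, 0, 14, 0, 2],
}

def variance_to_mean(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    return variance(counts) / mean(counts)

for name, counts in visits.items():
    print(
        name,
        "mean:", round(mean(counts), 2),
        "variance:", round(variance(counts), 2),
        "ratio:", round(variance_to_mean(counts), 2),
    )
```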

Modeling Rates With an Offset

Raw counts are sometimes misleading on their own. If you’re comparing traffic accidents across cities of different sizes, or disease cases across hospitals with different patient volumes, the raw count conflates the thing you care about (the rate) with the exposure (population size, patient-days, observation time). Poisson regression handles this cleanly through an offset term.

An offset accounts for the size, exposure time, or population behind each observation. It enters the model as the natural log of the exposure, with its coefficient fixed at 1 rather than estimated, which turns a model for a raw count into a model for a rate. For example, if your count is the number of infections and your exposure is the number of patient-days, the model estimates the infection rate per patient-day rather than the total number of infections. When you include an offset, each exponentiated regression coefficient is a rate ratio: the factor by which the expected rate is multiplied for a one-unit increase in that predictor. This is how epidemiologists model incidence rates and how engineers model failure rates per thousand operating hours.
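
To make the mechanics concrete, here is a small sketch of the infection example. The coefficients and the icu indicator are hypothetical, chosen so the baseline rate is 0.002 infections per patient-day and ICU units have 1.5 times that rate. The offset is the log of patient-days, added to the linear predictor with no coefficient of its own.

```python
import math

# Hypothetical coefficients for:
#   log(expected infections) = log(patient_days) + b0 + b1 * icu
b0 = math.log(0.002)  # baseline rate per patient-day (assumed value)
b1 = math.log(1.5)    # ICU rate ratio of 1.5 (assumed value)

def expected_infections(patient_days, icu):
    """Expected count: the offset enters with its coefficient fixed at 1."""
    offset = math.log(patient_days)
    return math.exp(offset + b0 + b1 * icu)

def rate_per_patient_day(patient_days, icu):
    """Dividing out the exposure recovers the modelled rate."""
    return expected_infections(patient_days, icu) / patient_days

# Same rate, different exposure: the expected count scales with patient-days
print(expected_infections(1000, icu=0), expected_infections(5000, icu=0))
print(rate_per_patient_day(1000, icu=0), rate_per_patient_day(5000, icu=0))

# exp(b1) is the rate ratio between ICU and non-ICU units
print(rate_per_patient_day(1000, icu=1) / rate_per_patient_day(1000, icu=0))
```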

Common Real-World Applications

Poisson regression shows up anywhere events are being counted. In public health, it’s the standard approach for modeling disease incidence: how many new cases of influenza appeared per county per week, and how do vaccination rates, population density, and climate affect that count. In insurance, actuaries use it to predict the number of claims a policyholder will file in a given year based on age, driving history, and coverage type.

Manufacturing quality control relies on it to model defect counts per production run. Ecologists use it to model the number of species observed at survey sites. Sports analysts model goals, fouls, or home runs per game. In all these cases, the outcome is a non-negative integer, the events are reasonably independent of each other, and the question is which factors drive the count up or down.

Detecting Overdispersion

The most common reason a Poisson model fails is overdispersion, where the variance in your data exceeds the mean. This happens frequently with recurrent events. If you’re counting emergency room visits per patient per year, some patients are heavy users while others never show up. That extra person-to-person variability inflates the variance well beyond what the Poisson distribution expects.

You can spot overdispersion by comparing two goodness-of-fit statistics to their expected values. If the model fits, both the Pearson chi-square statistic and the deviance statistic should be approximately equal to their residual degrees of freedom (the number of observations minus the number of estimated parameters). If either statistic is substantially larger, overdispersion is the likely cause, although omitted predictors or a misspecified functional form can inflate them too. A quicker rule of thumb: divide the deviance by its degrees of freedom. A ratio near 1 suggests a good fit. A ratio of 2 or higher is a red flag.
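
Both statistics are simple to compute once you have observed counts and fitted means. The sketch below uses the standard Pearson and Poisson deviance formulas; the data are invented, and the fitted values come from an intercept-only model (every fitted mean equals the sample mean), which is an assumption made purely to keep the example short.

```python
import math

def pearson_chi2(observed, fitted):
    """Sum of squared residuals, each scaled by the Poisson variance (mu)."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))

def poisson_deviance(observed, fitted):
    """2 * sum[y * log(y / mu) - (y - mu)], with y * log(y / mu) = 0 at y = 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

def dispersion_ratios(observed, fitted, n_params):
    """Each statistic divided by the residual degrees of freedom."""
    df = len(observed) - n_params
    return (
        pearson_chi2(observed, fitted) / df,
        poisson_deviance(observed, fitted) / df,
    )

# Hypothetical counts; intercept-only fit, so one estimated parameter
observed = [0, 0, 9, 1, 0, 14, 0, 2]
fitted = [sum(observed) / len(observed)] * len(observed)

pearson_ratio, deviance_ratio = dispersion_ratios(observed, fitted, n_params=1)
print("Pearson / df: ", round(pearson_ratio, 2))
print("Deviance / df:", round(deviance_ratio, 2))
```

Ratios well above 2, as with these invented counts, would point toward the alternatives discussed in the next section.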

When overdispersion is present, the Poisson model’s standard errors are artificially small. Predictors that look significant may not actually be, and confidence intervals will be too narrow. The negative binomial distribution has an extra dispersion parameter, so it can capture variance that the Poisson model cannot represent, which makes it the more appropriate choice for overdispersed data.

When to Switch to a Different Model

Poisson regression is the starting point for count data, but several situations call for a different approach.

Overdispersed counts: If your variance clearly exceeds the mean, negative binomial regression adds a dispersion parameter that lets the variance exceed the mean instead of forcing the two to be equal. It’s the most common alternative and is available in every major statistical software package.
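
To see what the extra parameter does, compare the two variance functions. This sketch assumes the common "NB2" parameterization, in which the variance is mu + alpha * mu^2; the name alpha and the values used are illustrative, and other parameterizations exist.

```python
def poisson_variance(mu):
    """Poisson: the variance is tied to the mean."""
    return mu

def negbin_variance(mu, alpha):
    """Negative binomial (NB2 form, assumed here): mu + alpha * mu**2."""
    return mu + alpha * mu ** 2

# With alpha = 0 the two agree; larger alpha means more overdispersion
for mu in (1, 4, 10):
    print(
        "mean:", mu,
        "Poisson var:", poisson_variance(mu),
        "NB var (alpha=0.5):", negbin_variance(mu, alpha=0.5),
    )
```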

Excess zeros: Sometimes your data has far more zeros than the Poisson distribution can accommodate. If you’re counting how many cigarettes people smoked last week, a large share of your sample is non-smokers who will always report zero. That’s a different kind of zero than a smoker who happened not to smoke this week. Zero-inflated Poisson models handle this by combining two processes: one that determines whether an observation can have a non-zero count at all, and a standard Poisson model for those that can. Hurdle models take a similar two-part approach but treat all zeros as coming from one process: a binary model separates zeros from non-zeros, and a zero-truncated count model describes the positive counts. The choice between them depends on whether your zeros arise from a single mechanism or two distinct ones, and goodness-of-fit testing on your specific data should guide the decision.
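
The two-process idea behind the zero-inflated Poisson model can be written as a short probability function. In this sketch, pi_zero is an assumed name for the probability that an observation is a structural zero (a non-smoker, in the example above), and lam is the Poisson mean for everyone else; the numbers are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: structural zeros mixed with Poisson counts."""
    from_poisson = (1 - pi_zero) * poisson_pmf(k, lam)
    # A zero can come from either process; a positive count only from Poisson
    return pi_zero + from_poisson if k == 0 else from_poisson

lam, pi_zero = 3.0, 0.4  # hypothetical values
print("P(0) under Poisson:", round(poisson_pmf(0, lam), 4))
print("P(0) under ZIP:    ", round(zip_pmf(0, lam, pi_zero), 4))
```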

Underdispersion: Less common but possible, this is when the variance falls below the mean. Generalized Poisson or Conway-Maxwell-Poisson models handle this, though you’ll encounter it rarely compared to overdispersion.

A Practical Decision Checklist

Before fitting a Poisson regression, run through these questions:

  • Is your outcome a count? If it’s continuous (blood pressure, income), use linear regression. If it’s binary (yes/no), use logistic regression. If it’s a count, proceed.
  • Are observations independent? If you have repeated measures on the same subjects, consider a mixed-effects Poisson model or generalized estimating equations instead.
  • Do you need to model a rate? If your observations have different exposure times or population sizes, include an offset term.
  • Is the mean roughly equal to the variance? Check this before interpreting results. If the variance is much larger, switch to negative binomial regression.
  • Are there too many zeros? If the proportion of zeros far exceeds what a Poisson distribution would predict given your mean, consider a zero-inflated or hurdle model.
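
The last question on the list can be answered with a quick calculation: a Poisson distribution with mean m puts probability exp(-m) on zero. The sketch below compares that figure to the observed share of zeros in some invented data. It ignores predictors, so treat it as a rough screen rather than a formal test.

```python
import math

def zero_check(counts):
    """Return (observed share of zeros, share expected under Poisson)."""
    n = len(counts)
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-sum(counts) / n)  # P(0) for a Poisson with this mean
    return observed, expected

# Hypothetical weekly cigarette counts with many non-smokers
cigarettes = [0, 0, 0, 0, 0, 0, 5, 8, 0, 12, 0, 7]

observed, expected = zero_check(cigarettes)
print("observed share of zeros:", round(observed, 3))
print("expected under Poisson: ", round(expected, 3))
```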

Poisson regression is rarely the final model in a complex analysis, but it’s almost always the right first model for count data. Fit it, check the diagnostics, and let the data tell you whether a more flexible alternative is needed.
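
To show what "fit it, check the diagnostics" involves, here is a bare-bones sketch that fits a one-predictor Poisson regression by Newton's method and then computes the Pearson dispersion ratio. The data are invented, and the hand-rolled fitting routine is for illustration only; in practice you would typically use the GLM routine in your statistics package.

```python
import math

# Hypothetical data: x is a single predictor, y is the observed count
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 1, 2, 2, 4, 5, 8, 11]

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Maximize the Poisson log-likelihood for log(mu) = b0 + b1 * x."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: gradient of the log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), 2 x 2 and symmetric
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 linear system directly
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1

b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]

# Diagnostic: Pearson chi-square over residual degrees of freedom (n - 2)
pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, fitted))
ratio = pearson / (len(y) - 2)

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("multiplier per unit of x:", round(math.exp(b1), 3))
print("Pearson dispersion ratio:", round(ratio, 3))
```

If the dispersion ratio comes out well above 1 on your own data, that is the cue to try one of the alternatives described earlier.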