What Is GLM in Statistics? General vs. Generalized

GLM stands for General Linear Model or Generalized Linear Model, depending on context. Both are statistical frameworks for analyzing relationships between variables, but they make different assumptions about the outcome. The General Linear Model assumes the errors around its predictions follow a bell curve (normal distribution) and includes familiar tools like t-tests, ANOVA, and linear regression. The Generalized Linear Model expands on that idea, letting you analyze outcomes that don’t follow a normal distribution, such as yes/no results or counts of events.

If you’ve encountered “GLM” in a statistics class, a research paper, or a software package, you’re almost certainly looking at one of these two frameworks. Here’s how they work and why they matter.

The General Linear Model

The General Linear Model is the simpler of the two and serves as the foundation. It’s a blanket term for a family of statistical tests you may already know by their individual names: t-tests, ANOVA (analysis of variance), linear regression, and ANCOVA. What these all share is a core set of assumptions: the data points are independent of each other, the variability within groups is roughly equal, the residual errors follow a normal distribution, and the relationship between variables is linear.

When researchers run a standard regression to predict something like blood pressure from age and weight, they’re using the General Linear Model. It works well when the outcome you’re measuring is continuous and roughly normally distributed, like height, test scores, or temperature.
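To make the "blanket term" idea concrete, here is a minimal sketch in plain Python showing that a two-group comparison and a linear regression are the same model underneath. The blood pressure readings and the helper name fit_simple_ols are made up for illustration. Coding group membership as 0 or 1 and fitting a least-squares line gives a slope equal to the difference in group means, which is the quantity a two-sample t-test evaluates.

```python
from statistics import mean

# Hypothetical toy data: a continuous outcome measured in two groups.
group_a = [118.0, 122.0, 125.0, 121.0]  # coded x = 0
group_b = [130.0, 128.0, 135.0, 131.0]  # coded x = 1

x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_simple_ols(x, y)
mean_diff = mean(group_b) - mean(group_a)

print("intercept:", intercept, "mean of group A:", mean(group_a))
print("slope:", slope, "difference in means:", mean_diff)
```

The intercept lands on the mean of the group coded 0, and the slope lands on the gap between the two group means, so testing whether the slope is zero asks the same question as the t-test.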

The Generalized Linear Model

Real-world data often breaks the rules the General Linear Model requires. A patient either survives surgery or doesn’t. A hospital counts the number of infections per month. A survey asks people to rate satisfaction on a 1-to-5 scale (an ordered outcome, usually handled by extensions of the framework such as ordinal logistic regression). None of these outcomes are continuous or normally distributed, so the standard model doesn’t fit.

The Generalized Linear Model (sometimes abbreviated GLIM to distinguish it from the General Linear Model) was developed to handle exactly these situations. Instead of forcing data into a normal distribution, it lets you choose the probability distribution that suits your outcome, such as binomial or Poisson, and then uses a link function to connect your predictors to the expected value of that outcome.

Three Components of a GLM

Every Generalized Linear Model has three building blocks:

  • Random component: This is your outcome variable and the probability distribution you assume it follows. For binary outcomes (yes/no, survived/died), that’s typically a binomial distribution. For count data (number of hospital visits, number of accidents), it’s usually a Poisson distribution.
  • Systematic component: These are your predictor variables, combined in a linear equation. If you’re predicting disease risk from age, smoking status, and BMI, those three predictors form the systematic component.
  • Link function: This is the bridge between the other two components. It transforms the expected value of your outcome so it can be expressed as a straight-line relationship with your predictors, even when the raw outcome isn’t linear.
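The three components above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a fitted model: the coefficients, the example person, and the function names are all hypothetical, and the logit link is used because the outcome is binary.

```python
import math


def linear_predictor(coefs, predictors):
    """Systematic component: eta = b0 + b1*x1 + b2*x2 + ..."""
    intercept, *slopes = coefs
    return intercept + sum(b * x for b, x in zip(slopes, predictors))


def inverse_logit(eta):
    """Link function (inverted): maps eta to an expected probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_log_likelihood(y, p):
    """Random component: log-probability of a 0/1 outcome with mean p."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)


# Hypothetical coefficients: intercept, age, smoker (0/1), BMI.
coefs = [-6.0, 0.04, 0.8, 0.05]
person = [50, 1, 28.0]  # age 50, smoker, BMI 28

eta = linear_predictor(coefs, person)       # straight-line scale
p = inverse_logit(eta)                      # probability scale
ll_if_diseased = bernoulli_log_likelihood(1, p)

print("linear predictor:", eta)
print("predicted risk:", p)
```

Fitting a GLM amounts to searching for the coefficients that make the summed log-likelihood over all observations as large as possible.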

How Link Functions Work

The link function is what makes the Generalized Linear Model flexible. Different types of data get different link functions.

For binary data (did something happen or not), the standard choice is the logit link. This is the basis of logistic regression, one of the most widely used statistical models in health and social science research. The logit link transforms a probability, which is naturally stuck between 0 and 1, into a value that can range across the entire number line. That transformation makes it possible to fit a linear equation to the data.
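A short sketch of that transformation, in plain Python with illustrative function names: logit turns a probability into log-odds, and its inverse turns any real number back into a probability.

```python
import math


def logit(p):
    """Logit link: probability in (0, 1) -> log-odds on the whole number line."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: any real number -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


probabilities = [0.01, 0.25, 0.5, 0.75, 0.99]
log_odds = [logit(p) for p in probabilities]
round_trip = [inverse_logit(eta) for eta in log_odds]

for p, eta in zip(probabilities, log_odds):
    print(f"p = {p:.2f}  ->  log-odds = {eta:+.3f}")
```

A probability of 0.5 maps to log-odds of zero, probabilities near 0 or 1 map to large negative or positive values, and the mapping is symmetric around 0.5.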

For count data, the standard choice is the log link. This underpins Poisson regression, which is commonly used in epidemiology to model things like disease incidence rates or mortality counts. Because counts can’t be negative, the log link keeps the model’s predicted counts positive: the linear equation describes the log of the expected count, and exponentiating any number gives a positive value. Researchers use Poisson regression to estimate the effects of risk factors on incidence rates and to evaluate dose-response relationships for different levels of exposure.
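As a rough sketch of how the log link behaves, the plain-Python snippet below uses made-up coefficients and a hypothetical expected_count helper. It shows two things: the predicted count stays positive even when the linear equation goes negative, and each one-unit increase in exposure multiplies the expected count by exp(slope), which is why Poisson coefficients are often reported as rate ratios.

```python
import math


def expected_count(intercept, slope, exposure_level):
    """Log link: log(mu) = intercept + slope * x, so mu = exp(...)."""
    eta = intercept + slope * exposure_level
    return math.exp(eta)


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -1.0, 0.3

levels = [0, 1, 2, 3]
counts = [expected_count(intercept, slope, x) for x in levels]

# Ratio of expected counts for a one-unit increase in exposure.
rate_ratio = counts[1] / counts[0]

for x, mu in zip(levels, counts):
    print(f"exposure {x}: expected count = {mu:.3f}")
print("rate ratio per unit of exposure:", rate_ratio)
```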

Other link functions exist for more specialized situations, including the probit link (another option for binary data) and the complementary log-log link, but the logit and log links cover the vast majority of practical applications.
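For comparison, here is a minimal sketch of the three binary-data links side by side, using only the Python standard library. The function names are illustrative. The probit link is the inverse of the standard normal cumulative distribution function, and the complementary log-log link is log(-log(1 - p)).

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # mean 0, standard deviation 1


def logit(p):
    return math.log(p / (1.0 - p))


def probit(p):
    """Inverse of the standard normal CDF."""
    return _std_normal.inv_cdf(p)


def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1.0 - p))


for p in [0.1, 0.5, 0.9]:
    print(f"p = {p}: logit = {logit(p):+.3f}, "
          f"probit = {probit(p):+.3f}, cloglog = {cloglog(p):+.3f}")
```

The logit and probit links are both symmetric around a probability of 0.5, while the complementary log-log link is not, which is one reason it is reserved for more specialized situations.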

General vs. Generalized: Key Differences

The naming overlap causes real confusion, so here’s the clearest way to separate them. The General Linear Model is a special case of the Generalized Linear Model. It assumes a normal distribution and uses an identity link function, meaning no transformation is applied. The Generalized version drops the normality requirement and uses maximum likelihood estimation to fit the model, rather than the least-squares approach used in ordinary regression.

If your outcome variable is continuous and roughly bell-shaped, both approaches give you the same coefficient estimates, because maximum likelihood under a normal distribution with an identity link reduces to least squares. The Generalized version only becomes necessary when your outcome is binary, a count, a proportion, or otherwise non-normal.
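The overlap between the two frameworks can be checked numerically. The sketch below, in plain Python with made-up data and illustrative function names, computes the least-squares line and then confirms that nudging the coefficients in any direction lowers the normal (Gaussian) log-likelihood. The error standard deviation is fixed at 1.0 purely as a simplifying assumption; the location of the maximum in the coefficients does not depend on it.

```python
import math
from statistics import mean

# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_least_squares(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def gaussian_log_likelihood(intercept, slope, sigma, x, y):
    """Log-likelihood of the data under normal errors and an identity link."""
    n = len(y)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - rss / (2 * sigma ** 2)


b0, b1 = fit_least_squares(x, y)
best = gaussian_log_likelihood(b0, b1, 1.0, x, y)

# Log-likelihood at eight nearby coefficient pairs.
nudges = [(d0, d1) for d0 in (-0.1, 0.0, 0.1) for d1 in (-0.05, 0.0, 0.05)
          if (d0, d1) != (0.0, 0.0)]
nearby = [gaussian_log_likelihood(b0 + d0, b1 + d1, 1.0, x, y)
          for d0, d1 in nudges]

print("least-squares fit is the likelihood maximum:",
      all(best > value for value in nearby))
```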

Where GLMs Are Used

GLMs show up across a wide range of fields. In clinical research, the distributional assumptions you make about your outcome variable can critically affect the conclusions you draw. Choosing the right GLM family helps researchers correctly test for treatment effects or differences between patient groups. One published example used Generalized Linear Models to compare immune cell counts between two disease groups, where a standard regression would have produced misleading results.

In epidemiology, Poisson regression models are a workhorse for studying how risk factors influence disease rates across populations. They offer several advantages over older standardization techniques for comparing rates between groups.

In neuroscience, the General Linear Model (the simpler version) is the standard tool for analyzing brain imaging data from fMRI scans. Researchers fit the model separately at each voxel (a small three-dimensional unit of the image), testing whether specific regions respond to particular tasks or stimuli.

In ecology, researchers use GLMs to model species counts or the presence and absence of organisms across habitats. In finance, they model the probability of loan defaults. In marketing, they analyze whether customers click on an ad or make a purchase.

Running a GLM in Software

All major statistical software packages have built-in GLM functions. In R, the glm() function handles Generalized Linear Models, while lm() handles the standard linear model. In SAS, PROC GLM fits the General Linear Model and PROC GENMOD fits the Generalized version. In Python, the statsmodels library provides a GLM class, and scikit-learn covers standard linear models along with logistic and Poisson regression. SPSS and Stata also have dedicated GLM procedures.

When setting up a GLM in any of these tools, you typically need to specify three things: your outcome variable, your predictor variables, and the family (which tells the software what distribution to assume and, unless you override it, which link function to use). In R’s glm(), for example, family = binomial gives you logistic regression through its default logit link, family = poisson gives you Poisson regression through its default log link, and family = gaussian gives you standard linear regression, completing the circle back to the General Linear Model.