Multilevel modeling is a statistical technique designed for data that has a natural grouping or nesting structure, like students within schools, patients within hospitals, or repeated measurements within the same person over time. It goes by several names: hierarchical linear modeling, mixed effects modeling, and multilevel linear modeling all refer to the same core approach. The key idea is that standard regression assumes every observation in your dataset is independent of every other observation, and when your data is grouped, that assumption breaks down.
Why Standard Regression Falls Short
Imagine you’re studying how a teaching method affects student test scores across 50 schools. Students in the same school share a teacher, a building, a neighborhood, and a local culture. Their outcomes aren’t independent of each other. Two students at the same well-funded suburban school will likely have scores more similar to each other than to those of a student across town.
Standard regression ignores this. It treats every student as if they exist in a vacuum, unconnected to the school around them. This creates a real problem: the model underestimates the uncertainty in its results. Standard errors shrink artificially, making effects look more statistically significant than they actually are. Research comparing ordinary regression to multilevel approaches has shown that ignoring clustering doesn’t just produce slightly off results. It can give you incorrect estimates of the effect you’re studying, not just imprecise ones.
Multilevel modeling solves this by explicitly accounting for the fact that people (or measurements) within the same group are correlated. It partitions the variation in your outcome into the portion that comes from differences between individuals and the portion that comes from differences between groups.
How the Model Is Structured
The most common setup is a two-level model. Level 1 represents the individual observations: individual students, individual patients, individual time points. Level 2 represents the groups those observations are nested in: schools, hospitals, people. Each level gets its own set of variables. In a study of patient health outcomes, for instance, level-1 variables might include patient age, while level-2 variables might include the treating doctor’s years of experience. The model estimates the effects of both simultaneously.
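In conventional multilevel notation, this two-level structure can be sketched as a pair of linked equations (the variable names here follow the patient/doctor example; the symbols are standard HLM conventions, not taken from a specific source):

```latex
% Level 1 (patients): outcome for patient i treated by doctor j
y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{age}_{ij} + e_{ij}

% Level 2 (doctors): each doctor's intercept depends on experience
\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{experience}_{j} + u_{0j}
\beta_{1j} = \gamma_{10} + u_{1j}
```

The level-2 equations say that each doctor's baseline ($\beta_{0j}$) and age effect ($\beta_{1j}$) are partly explained by doctor-level predictors and partly by doctor-specific random deviations ($u_{0j}$, $u_{1j}$).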
This structure can extend to three or more levels. Students nested in classrooms, classrooms nested in schools, schools nested in districts. Longitudinal data fits naturally into this framework too: individual measurements at specific time points (level 1) are nested within the person being measured over time (level 2).
Random Intercepts vs. Random Slopes
Multilevel models come in two main flavors, and the distinction matters for what questions you can answer.
A random intercept model is the simpler version. It allows each group to have a different baseline value for the outcome, but assumes that the relationship between your predictor and your outcome is the same across all groups. Picture a graph with parallel lines, one for each school, shifted up or down. Some schools produce higher average test scores than others, but the effect of study hours on scores is identical everywhere.
A random slope model (also called a random coefficient model) goes further. It lets both the baseline and the relationship itself vary across groups. Now those lines on your graph aren’t parallel anymore. Study hours might have a strong effect in one school and a weak effect in another. This model captures that variability and lets you investigate what group-level characteristics might explain it. Maybe the effect of study hours differs depending on class size or teacher experience.
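The geometric difference between the two models can be made concrete with a small simulation. This is an illustrative sketch with invented parameter values, not a fitted model: under random intercepts the school lines stay parallel, so the gap between any two schools is constant; under random slopes the gap changes with study hours.

```python
import random

random.seed(42)

# Hypothetical setup: 5 schools, each drawing its own intercept.
FIXED_INTERCEPT = 60.0   # average score at 0 study hours (illustrative)
FIXED_SLOPE = 2.0        # average gain per study hour (illustrative)

schools = range(5)
intercepts = {j: FIXED_INTERCEPT + random.gauss(0, 5) for j in schools}

# Random-intercept model: every school shares the same slope.
ri_slopes = {j: FIXED_SLOPE for j in schools}

# Random-slope model: each school's slope also deviates from the average.
rs_slopes = {j: FIXED_SLOPE + random.gauss(0, 0.8) for j in schools}

def predict(intercept, slope, hours):
    """Predicted test score for a student studying `hours` per week."""
    return intercept + slope * hours

# Parallel lines: the score gap between schools 0 and 1 is identical
# at 0 hours and at 10 hours of study.
ri_gap_0 = predict(intercepts[0], ri_slopes[0], 0) - predict(intercepts[1], ri_slopes[1], 0)
ri_gap_10 = predict(intercepts[0], ri_slopes[0], 10) - predict(intercepts[1], ri_slopes[1], 10)
print(abs(ri_gap_0 - ri_gap_10) < 1e-9)  # True: the lines never converge or diverge

# Non-parallel lines: the gap changes as study hours increase.
rs_gap_0 = predict(intercepts[0], rs_slopes[0], 0) - predict(intercepts[1], rs_slopes[1], 0)
rs_gap_10 = predict(intercepts[0], rs_slopes[0], 10) - predict(intercepts[1], rs_slopes[1], 10)
print(abs(rs_gap_0 - rs_gap_10) > 1e-9)  # True: slopes differ across schools
```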
The Intraclass Correlation Coefficient
Before building a multilevel model, researchers typically check whether the grouping structure actually matters. The intraclass correlation coefficient, or ICC, quantifies how much of the total variation in your outcome is attributable to differences between groups rather than differences between individuals within groups.
The ICC is a ratio: the variance between groups divided by the total variance (between groups plus within groups). An ICC of 0.15 means that 15% of the variation in your outcome can be explained by which group someone belongs to. The higher the ICC, the stronger the case for using a multilevel approach. Even modest values suggest that ignoring the grouping would bias your results. When the ICC is essentially zero, meaning group membership explains none of the variation, a standard regression may work fine.
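The ICC can be estimated directly from grouped data. A minimal sketch, using the classic one-way ANOVA estimator for balanced groups (the group labels and scores below are invented, chosen so that group membership clearly matters):

```python
# Hypothetical balanced data: 4 groups, 5 scores each.
groups = {
    "A": [82, 85, 88, 84, 86],
    "B": [70, 72, 75, 71, 74],
    "C": [90, 93, 91, 94, 92],
    "D": [60, 63, 61, 64, 62],
}

k = len(groups)                       # number of groups
n = len(next(iter(groups.values())))  # observations per group (balanced)

grand_mean = sum(sum(v) for v in groups.values()) / (k * n)
group_means = {g: sum(v) / n for g, v in groups.items()}

# Mean squares from one-way ANOVA.
ms_between = n * sum((m - grand_mean) ** 2 for m in group_means.values()) / (k - 1)
ms_within = sum(
    (x - group_means[g]) ** 2 for g, v in groups.items() for x in v
) / (k * (n - 1))

# ANOVA estimator of the intraclass correlation:
# ICC = (MSB - MSW) / (MSB + (n - 1) * MSW)
icc = (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)
print(round(icc, 3))
```

With these invented scores the groups barely overlap, so nearly all of the variation is between groups and the estimated ICC lands near 1; real educational or clinical data typically produce far smaller values.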
Longitudinal Data and Growth Curves
One of the most common applications of multilevel modeling is tracking change over time. When you measure the same people repeatedly, whether it’s blood pressure readings over months or depression scores over years, those repeated measurements are nested within individuals. Each person has their own trajectory.
In this setup, the time-specific measurements sit at level 1, and the individuals sit at level 2. The model can estimate a unique growth curve for each person, capturing both where they started (their intercept) and how quickly they changed (their slope). This is a major advantage over approaches that only estimate average trends for the entire sample. You can see not just whether people improve on average, but how much variation there is in the rate of improvement, and then link that variation to individual characteristics like age, treatment group, or baseline severity.
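The idea of a per-person growth curve can be sketched in a two-stage way: fit each person's own line, then compare intercepts and slopes. This captures the intuition only; a real multilevel model estimates all trajectories jointly and pools information across people. The measurements below are invented:

```python
# Hypothetical longitudinal data: depression scores for 3 people at
# months 0, 3, 6, 9 (level 1), nested within person (level 2).
measurements = {
    "person_1": [(0, 20), (3, 17), (6, 15), (9, 12)],   # fast improver
    "person_2": [(0, 18), (3, 17), (6, 17), (9, 16)],   # slow improver
    "person_3": [(0, 25), (3, 24), (6, 22), (9, 21)],   # high start, modest change
}

def ols_line(points):
    """Least-squares intercept and slope for one person's trajectory."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope

trajectories = {p: ols_line(pts) for p, pts in measurements.items()}
for person, (intercept, slope) in trajectories.items():
    print(f"{person}: starts at {intercept:.1f}, changes {slope:+.2f} points/month")
```

Person 1 and person 2 start at similar levels but change at very different rates; a multilevel model would quantify exactly this kind of variation in slopes and let you relate it to person-level predictors.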
Key Assumptions
Multilevel models still carry assumptions, and they apply at every level of the model. At the individual level, the residual errors (the gap between what the model predicts and what actually happened) are assumed to be normally distributed with a constant spread. At the group level, the random effects (those group-specific intercepts and slopes) are also assumed to follow a normal distribution with constant variance. The model further assumes that the individual-level errors and the group-level random effects are independent of each other.
Violations of these assumptions don’t necessarily invalidate your analysis, but severe departures, like highly skewed group-level effects or wildly unequal variance across groups, can distort your results. Checking residual plots at both levels is standard practice.
How Many Groups Do You Need?
Sample size planning for multilevel models is trickier than for standard regression because you need to think about size at every level. The old rule of thumb was “30/30”: 30 groups with 30 individuals per group. Current research paints a more nuanced picture.
For estimating the main effects in your model (the fixed effects), at least 50 groups appear to be necessary for accurate results. With only 30 groups, confidence intervals for group-level predictors start to show bias, with coverage rates dropping below the expected 95%. For estimating how much variability exists between groups (the random effects), the bar is even higher: at least 100 groups may be needed to reliably detect significant variation across groups and obtain trustworthy confidence intervals.
A consistent finding across simulation studies is that adding more groups helps more than adding more individuals within each group. If you have a fixed budget for data collection, recruiting 80 schools with 20 students each will generally give you better multilevel estimates than recruiting 40 schools with 40 students each.
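The intuition behind "more groups beats bigger groups" can be quantified with the standard design-effect formula, DEFF = 1 + (m − 1) × ICC, where m is the group size: clustered observations carry less information than independent ones, and larger groups inflate the design effect faster. A sketch, using the 80 × 20 versus 40 × 40 comparison above and an assumed ICC of 0.15:

```python
def effective_sample_size(n_groups, per_group, icc):
    """Effective number of independent observations under clustering,
    using the standard design effect: DEFF = 1 + (m - 1) * ICC."""
    total = n_groups * per_group
    deff = 1 + (per_group - 1) * icc
    return total / deff

# Same total N (1,600 students), same assumed ICC of 0.15:
many_small = effective_sample_size(80, 20, 0.15)  # 80 schools x 20 students
few_large = effective_sample_size(40, 40, 0.15)   # 40 schools x 40 students
print(round(many_small), round(few_large))  # prints: 416 234
```

With identical total sample sizes, the 80-school design behaves like roughly 416 independent observations, the 40-school design like only about 234, which is exactly why the many-small-groups design yields more precise estimates.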
Common Software Options
Five software platforms dominate multilevel modeling in practice: R (using the lme4 package), Stata, SAS, HLM, and Mplus. A simulation study comparing all five found that they generally produce similar results, though they can differ in computational speed and how they handle convergence when models get complex, particularly with multiple random slopes of varying magnitudes.
In R, the lme4 package is the most widely used, with the function lmer for continuous outcomes and glmer for binary or count outcomes. The older nlme package offers some features lme4 doesn’t, like more flexible correlation structures. Stata, SAS, and HLM each use their own syntax but fit the same underlying models. The choice often comes down to what your field uses and what your collaborators are comfortable with, rather than meaningful differences in statistical output.
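For readers working in Python rather than R, the statsmodels library fits the same class of models via its MixedLM interface. A sketch on simulated data (the data-generating values and variable names are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: 30 schools, 25 students each, with a random
# intercept per school (between-school SD of 3, residual SD of 5).
rng = np.random.default_rng(0)
n_schools, n_students = 30, 25
school = np.repeat(np.arange(n_schools), n_students)
school_effect = rng.normal(0, 3, n_schools)[school]
hours = rng.uniform(0, 10, n_schools * n_students)
score = 60 + 2.0 * hours + school_effect + rng.normal(0, 5, n_schools * n_students)
df = pd.DataFrame({"score": score, "hours": hours, "school": school})

# Random-intercept model: score ~ hours, intercepts varying by school
# (the analogue of lme4's  lmer(score ~ hours + (1 | school))).
model = smf.mixedlm("score ~ hours", df, groups=df["school"])
result = model.fit()
print(result.params["hours"])  # fixed-effect slope, near the true value of 2.0

# Random slopes as well: re_formula lets the hours effect vary by school
# (the analogue of  lmer(score ~ hours + (hours | school))).
rs_model = smf.mixedlm("score ~ hours", df, groups=df["school"], re_formula="~hours")
```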