What Is Hierarchical Linear Modeling and How Does It Work?

Hierarchical linear modeling (HLM) is a statistical technique designed for data that has a naturally layered, or “nested,” structure. Think of students grouped within schools, patients grouped within hospitals, or repeated measurements grouped within the same person over time. In each case, individuals within the same group tend to be more similar to each other than to individuals in other groups, and HLM accounts for that similarity in a way that standard regression cannot.

You’ll also see HLM called multilevel modeling, mixed-effects modeling, or random-effects modeling. These terms all refer to essentially the same family of techniques. The core idea is simple: when your data has layers, your statistical model should have layers too.

Why Ordinary Regression Falls Short

Standard regression (ordinary least squares, or OLS) assumes that every observation in your dataset is independent of every other observation. That assumption breaks down the moment your data is clustered. Students in the same classroom share a teacher, a curriculum, and a peer environment. Their test scores aren’t truly independent of one another.

When you ignore that clustering and run a standard regression anyway, the most immediate problem is that your standard errors shrink below where they should be. The dependency between observations within a cluster effectively reduces your true sample size, but OLS doesn’t know that. It treats every observation as a fully independent data point, which makes your results look more precise than they actually are. The practical consequence: you’ll find “statistically significant” effects that aren’t real, inflating your false positive rate.

HLM fixes this by explicitly modeling the variation that exists at each level of the hierarchy. It separates within-group variation (how students differ from each other inside the same school) from between-group variation (how schools differ from one another), giving you accurate standard errors and honest p-values.
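A small simulation makes the standard-error problem concrete. The sketch below (plain Python, no statistics libraries; all numbers are made up for illustration) generates clustered data with a shared group effect, then compares a naive standard error that treats all observations as independent against a cluster-aware one that treats each group mean as a single observation:

```python
import random
import statistics

random.seed(42)

# Simulate clustered data: 20 groups ("schools") of 25 observations each.
# Each group shares a random group effect, so observations within a group
# are correlated (tau = between-group SD, sigma = within-group SD).
n_groups, n_per = 20, 25
tau, sigma = 2.0, 4.0

data = []
for _ in range(n_groups):
    group_effect = random.gauss(0, tau)
    data.append([50 + group_effect + random.gauss(0, sigma) for _ in range(n_per)])

all_obs = [y for group in data for y in group]
n = len(all_obs)

# Naive (OLS-style) standard error of the overall mean: treats all
# n observations as fully independent.
naive_se = statistics.stdev(all_obs) / n ** 0.5

# Cluster-aware standard error: treat each group mean as one independent
# observation -- a rough but honest approximation of what a multilevel
# model accounts for.
group_means = [statistics.mean(g) for g in data]
cluster_se = statistics.stdev(group_means) / n_groups ** 0.5

print(f"naive SE:   {naive_se:.3f}")
print(f"cluster SE: {cluster_se:.3f}")  # noticeably larger than the naive SE
```

The gap between the two numbers is exactly the overconfidence described above: the naive calculation acts as if it had 500 independent data points when the effective sample size is much closer to the number of groups.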

How the Levels Work

The simplest HLM has two levels. Level 1 describes what happens within each group. Level 2 describes what happens between groups. Using the classic education example: Level 1 might model how a student’s socioeconomic status predicts their test score within a given school. Level 2 might model how school-level characteristics (funding per pupil, average class size) shift the overall average score from school to school, or even change the strength of the socioeconomic effect.

What makes HLM distinctive is that the intercepts and slopes from the Level 1 model are allowed to vary across groups. In a standard regression, you get one intercept and one slope for the entire dataset. In HLM, each school can have its own intercept (its own baseline achievement level) and its own slope (its own relationship between socioeconomic status and test scores). Level 2 then tries to explain why those intercepts and slopes differ, using group-level predictors.
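In symbols, a two-level model with one predictor at each level can be sketched as follows. This is a generic formulation using common HLM notation (the X, W, Greek-letter conventions are standard in the literature, not specific to any one dataset):

```latex
% Level 1 (within school j): student i's outcome as a function of a
% student-level predictor X (e.g., socioeconomic status)
Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij},
    \qquad r_{ij} \sim N(0, \sigma^2)

% Level 2 (between schools): each school's intercept and slope become
% outcomes, predicted by a school-level variable W (e.g., funding)
\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}
\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}

% The random effects (u_{0j}, u_{1j}) are assumed to follow a
% multivariate normal distribution with mean zero, independent of the
% Level 1 residuals r_{ij}.
```

Substituting the Level 2 equations into Level 1 yields a single combined equation, which is the form most software actually estimates.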

This is why HLM is sometimes described as “regression where the coefficients are themselves outcomes to be modeled.” The technique assumes these varying coefficients follow a normal distribution across groups and estimates the average value of each coefficient along with how much it varies.

Common Applications

Education research is the field most closely associated with HLM, and for good reason. Educational data is almost always nested: students sit within classrooms, classrooms within schools, schools within districts. A researcher studying whether a new reading program works needs to account for the fact that outcomes will cluster by classroom and school.

Longitudinal research is the other major use case. When you measure the same person repeatedly over time, those repeated measurements are nested within the individual. HLM handles this naturally, with time points at Level 1 and individuals at Level 2. This setup is often called a growth curve model, because it lets you estimate each person’s trajectory of change and then ask what predicts differences in those trajectories.

For instance, researchers have used HLM to track reading and math achievement from kindergarten through fifth grade among students with learning disabilities, comparing their growth trajectories to those of students without disabilities. Analyses of the Special Education Elementary Longitudinal Study, a nationally representative dataset, used this approach to follow achievement growth from ages 7 to 17 across all federal special education categories. These analyses revealed that growth rates for students with learning disabilities generally paralleled those of students not receiving special education, a finding that would have been difficult to isolate without a multilevel framework.

Beyond education, HLM appears regularly in public health (patients within clinics), organizational psychology (employees within companies), and any field where data arrives in clusters.

The Intraclass Correlation Coefficient

Before fitting an HLM, researchers typically check whether the clustering in their data actually matters. The tool for this is the intraclass correlation coefficient, or ICC. The ICC tells you what proportion of the total variation in your outcome sits at the group level rather than the individual level.

An ICC of 0.15 in an education study, for example, means that 15% of the variation in student test scores is attributable to differences between schools, with the remaining 85% due to differences between students within the same school. Even modest ICCs (0.05 to 0.10) can meaningfully distort standard errors if ignored. There’s no universal cutoff that demands HLM, but any nonzero ICC in clustered data is a signal that a multilevel approach will give you more trustworthy results than OLS.
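For balanced groups, the ICC can be estimated from a one-way ANOVA decomposition of the variance. The sketch below (pure Python; the variance components 15 and 85 are invented so that the true ICC is 0.15) simulates data and recovers the ICC from the between- and within-group mean squares:

```python
import random
import statistics

random.seed(7)

# Simulate balanced clustered data with known variance components:
# between-school variance tau2 = 15, within-school variance sigma2 = 85,
# so the true ICC is 15 / (15 + 85) = 0.15.
n_groups, k = 200, 30
tau2, sigma2 = 15.0, 85.0

groups = []
for _ in range(n_groups):
    u = random.gauss(0, tau2 ** 0.5)
    groups.append([500 + u + random.gauss(0, sigma2 ** 0.5) for _ in range(k)])

# One-way ANOVA estimator of the ICC for balanced groups of size k:
#   ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
means = [statistics.mean(g) for g in groups]
grand = statistics.mean(means)
msb = k * sum((m - grand) ** 2 for m in means) / (n_groups - 1)
msw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (n_groups * (k - 1))
icc = (msb - msw) / (msb + (k - 1) * msw)

print(f"estimated ICC: {icc:.3f}")  # should land near the true value of 0.15
```

In practice the ICC usually comes for free from fitting an intercept-only ("null") multilevel model, but the ANOVA estimator above shows what the number is made of.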

Key Assumptions

HLM carries its own set of assumptions, organized by level. At Level 1, the residuals (the leftover variation after the model does its work) should be normally distributed with a mean of zero and constant variance across groups. At Level 2, the randomly varying intercepts and slopes should follow a multivariate normal distribution. And critically, the Level 1 residuals and the Level 2 random effects must be uncorrelated with each other.

In practice, these assumptions are checked with residual plots and distributional tests, much like standard regression diagnostics. The normality assumptions become more important when sample sizes are small. With large enough samples, HLM estimation methods tend to be robust to moderate violations.

How Much Data You Need

Sample size guidance for HLM focuses on two numbers: how many groups you have (Level 2 units) and how many observations sit within each group (Level 1 units). Both matter, but the number of groups tends to be the bigger bottleneck.

Simulation studies have shown that HLM can produce unbiased estimates of main effects (fixed effects) with as few as 10 groups of 10 members each, provided the data is normally distributed. However, the estimates of how much the intercepts and slopes vary across groups (the variance components) become unreliable with fewer than 30 groups, and the standard errors for those variance components tend to be biased downward below that threshold. A general guideline: at least 10 groups to trust your main effects, and 30 or more groups to trust your variance estimates.

Within groups, having fewer than 5 observations per group is generally considered unsafe for reliable estimation. If you have fewer than 10 groups, having at least 30 observations per group can partially compensate. Datasets with only a few dozen total observations, especially when clustered, are not well suited to multilevel modeling.

Software Options

HLM can be run in most major statistical platforms. In R, the lme4 and nlme packages are the most widely used. SAS offers the MIXED and GLIMMIX procedures. Stata fits these models with its mixed command (the user-written gllamm program extends it to more general cases), and SPSS includes a MIXED procedure. Mplus handles multilevel models through its TYPE=TWOLEVEL analysis option. There are also dedicated programs built specifically for multilevel analysis, including a program literally called HLM (developed by Raudenbush and Bryk, two of the method’s primary architects) and MLwiN.

For most researchers, the choice comes down to what software they already know. The underlying statistical engine is the same across platforms, and results should converge to the same estimates. R’s lme4 package has become especially popular because it’s free, well-documented, and handles both simple two-level models and more complex structures with crossed random effects.

Fixed Effects vs. Random Effects

Two terms you’ll encounter constantly in HLM are “fixed effects” and “random effects.” Fixed effects are the average relationships in your model, the overall intercept and the overall slopes that apply across all groups. Random effects capture how much individual groups deviate from those averages.

When you allow only the intercept to vary across groups (a “random intercept” model), you’re saying that groups can differ in their baseline level of the outcome, but the relationship between your predictor and the outcome is the same everywhere. When you also allow slopes to vary (a “random slopes” model), you’re saying the strength of that relationship itself changes from group to group. The random slopes model is more flexible but requires more data to estimate reliably, since you’re now asking the model to learn an entire distribution of slopes rather than a single value.
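One consequence of treating intercepts as random is "shrinkage": each group's estimated intercept is pulled toward the overall average, and small groups are pulled hardest. The sketch below illustrates the idea with the textbook shrinkage formula, under simplifying assumptions not made in a real analysis (the variance components tau2 and sigma2 and the grand mean are taken as known here; a fitted HLM estimates all of them from the data):

```python
import random
import statistics

random.seed(3)

# Assumed-known quantities for illustration only.
tau2, sigma2 = 25.0, 100.0   # between-group and within-group variance
grand_mean = 70.0            # overall average outcome

# Three groups of very different sizes, all drawn from the same model.
sizes = [3, 10, 50]
groups = []
for n in sizes:
    u = random.gauss(0, tau2 ** 0.5)
    groups.append([grand_mean + u + random.gauss(0, sigma2 ** 0.5) for _ in range(n)])

for g, n in zip(groups, sizes):
    raw = statistics.mean(g)
    # Shrinkage weight: how much the group's own mean counts relative to
    # the grand mean. Small n -> small weight -> heavy shrinkage.
    lam = tau2 / (tau2 + sigma2 / n)
    shrunk = grand_mean + lam * (raw - grand_mean)
    print(f"n={n:3d}  raw mean={raw:6.1f}  weight={lam:.2f}  shrunk={shrunk:6.1f}")
```

The n=3 group keeps less than half the distance between its raw mean and the grand mean, while the n=50 group keeps most of it. This borrowing of strength across groups is what makes random-effects estimates stable even for sparsely observed groups.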

Deciding which effects to treat as random is one of the key modeling choices in HLM. A common strategy is to start with random intercepts only, then test whether adding random slopes meaningfully improves the model’s fit to the data.