How to Interpret Structural Equation Model Results

Interpreting structural equation model (SEM) results comes down to evaluating three things in order: whether your model fits the data adequately, whether the measurement model captures your constructs reliably, and whether the path coefficients in the structural model support your hypotheses. Each piece builds on the last, so skipping ahead to path coefficients before confirming model fit can lead you to interpret relationships that don’t hold up.

Start With Model Fit Indices

Your software output will report several fit indices simultaneously. No single index tells the whole story, so you need to look at a combination of them. The most widely cited thresholds come from Hu and Bentler (1999), and while they aren’t iron rules, they remain the standard benchmarks most reviewers expect to see.

The Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI) both compare your model against a baseline model where none of the variables are related. For both, values of 0.95 or higher indicate good fit. These are relative fit indices, meaning they tell you how much better your model is than a model with no relationships at all.

The Root Mean Square Error of Approximation (RMSEA) works differently. It estimates how much error remains per degree of freedom in your model. Values of 0.06 or below are generally considered acceptable, while an RMSEA at or above 0.10 generally indicates poor fit. Most software also reports a 90% confidence interval around the RMSEA, and you want the upper bound of that interval to stay below 0.10 at minimum. The Standardized Root Mean Square Residual (SRMR) should ideally fall below 0.08.

You will also see a chi-square test, which technically tests whether your model perfectly reproduces the observed data. In practice, this test is almost always significant (meaning it rejects your model) once your sample size gets large enough. That’s a known limitation: the test’s power grows with sample size, so even trivial discrepancies between your model and the data become statistically significant in large samples. Because of this, most researchers report the chi-square value and its degrees of freedom but rely on CFI, TLI, RMSEA, and SRMR for their actual fit evaluation.
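If your software reports chi-square and degrees of freedom but you want to sanity-check the RMSEA yourself, it can be computed directly from those quantities plus the sample size. A minimal sketch using the standard formula; the chi-square, df, and N values below are made up for illustration:

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root Mean Square Error of Approximation, from the standard
    formula sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical output: chi-square = 85.0 with df = 40 and N = 300
print(round(rmsea(85.0, 40, 300), 3))  # → 0.061
```

Note that when chi-square is at or below the degrees of freedom, the formula bottoms out at zero, which is why RMSEA can be exactly 0 for well-fitting models.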

Evaluate the Measurement Model First

Before looking at the relationships between your constructs, you need to confirm that your measured variables (survey items, test scores, observed indicators) actually capture the latent constructs they’re supposed to represent. This is the measurement model, and there are two key metrics to check.

Composite reliability (CR) tells you how consistently your set of indicators measures a given construct. It functions like Cronbach’s alpha but is considered less biased. A CR of 0.7 or higher is the accepted threshold. Average variance extracted (AVE) goes a step further by telling you what proportion of variance in the indicators is actually explained by the underlying construct, as opposed to measurement error. An AVE of 0.5 is the minimum acceptable level, meaning the construct accounts for at least half the variance in its indicators. Values above 0.7 are considered very good.

If your CR is below 0.7 or your AVE is below 0.5 for any construct, that’s a sign your indicators aren’t measuring what you think they’re measuring. You would need to revisit your measurement model (perhaps dropping weak indicators) before interpreting any structural paths.
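Both metrics can be computed by hand from standardized factor loadings. A minimal sketch, assuming standardized loadings so that each indicator’s error variance is one minus its squared loading; the four loadings below are hypothetical:

```python
def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    sum_sq = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return sum_sq / (sum_sq + error)

def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

# Hypothetical construct measured by four indicators
loadings = [0.80, 0.75, 0.70, 0.85]
print(round(composite_reliability(loadings), 3))       # → 0.858
print(round(average_variance_extracted(loadings), 3))  # → 0.604
```

With these loadings the construct clears both thresholds (CR above 0.7, AVE above 0.5), so its structural paths would be safe to interpret.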

Checking Discriminant Validity

AVE also plays a role in discriminant validity, which confirms that your constructs are genuinely distinct from each other. The classic test: the square root of each construct’s AVE should be larger than its correlation with any other construct in the model. If two constructs correlate more strongly than the square root of either one’s AVE, they may be measuring the same thing.
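This check (often called the Fornell–Larcker criterion) is easy to run by hand. A minimal sketch, with made-up AVEs and a made-up correlation matrix for three constructs:

```python
import math

def fornell_larcker_ok(ave, corr):
    """True if, for every pair of constructs, the square root of each
    construct's AVE exceeds the absolute inter-construct correlation."""
    k = len(ave)
    for i in range(k):
        for j in range(i + 1, k):
            r = abs(corr[i][j])
            if math.sqrt(ave[i]) <= r or math.sqrt(ave[j]) <= r:
                return False
    return True

ave = [0.62, 0.55, 0.58]            # hypothetical AVEs
corr = [[1.00, 0.41, 0.35],         # hypothetical construct correlations
        [0.41, 1.00, 0.48],
        [0.35, 0.48, 1.00]]
print(fornell_larcker_ok(ave, corr))  # → True
```

If any pairwise comparison fails, it flags the specific pair of constructs that may be measuring the same thing.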

Interpreting Path Coefficients

Path coefficients are the core of your structural model. They represent the direct effects between variables. Your output will typically show both unstandardized and standardized versions, and understanding the difference matters for how you report and compare them.

A standardized path coefficient of 0.81 from Variable A to Variable B means that when Variable A increases by one standard deviation, Variable B is expected to increase by 0.81 standard deviations, holding all other predictors constant. A coefficient of -0.16 would mean Variable B decreases by 0.16 standard deviations for each standard deviation increase in A. These are not correlation coefficients. They represent the unique, direct effect of one variable on another after accounting for all other paths in the model.

Each path coefficient comes with a standard error and a p-value (or critical ratio). A p-value below 0.05 indicates the path is statistically significant. But statistical significance alone doesn’t tell you the effect is meaningful, which is where the size of the coefficient matters.

Standardized vs. Unstandardized Coefficients

Standardized coefficients let you compare the relative strength of different predictors on the same outcome, even when those predictors are measured on completely different scales. If education level and income both predict life satisfaction, and their standardized coefficients are 0.35 and 0.22 respectively, you can say education has a stronger relative effect.

Unstandardized coefficients preserve the original metric of your variables and are essential when you’re comparing effects across groups or across different samples. The reason: standardizing depends on the variance of each variable, and those variances can differ dramatically between groups. Two variables might show identical standardized effects but very different unstandardized effects simply because one variable has ten times the variance of the other. When comparing paths across populations or testing whether two predictors have equal effects in raw terms, unstandardized coefficients give you the more honest picture.
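The two versions are linked by the standard deviations of the variables involved: the standardized coefficient equals the unstandardized one multiplied by the ratio of the predictor’s SD to the outcome’s SD. A quick sketch with made-up numbers:

```python
def standardize(b, sd_x, sd_y):
    """Convert an unstandardized coefficient b to a standardized one:
    beta = b * (SD of predictor / SD of outcome)."""
    return b * (sd_x / sd_y)

# Hypothetical: each additional year of education (SD = 2.8 years)
# raises life satisfaction (SD = 1.4 points) by 0.175 raw points
print(round(standardize(0.175, 2.8, 1.4), 2))  # → 0.35
```

This conversion is exactly why standardized effects can diverge across groups even when the raw coefficient is identical: if one group’s predictor variance is larger, the same b maps to a larger beta.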

R-Squared for Endogenous Variables

Every endogenous variable in your model (any variable with an arrow pointing to it) gets an R-squared value. This tells you how much of that variable’s total variance is explained by its predictors in the model. An R-squared of 0.45 means the predictors collectively account for 45% of the variance, with the remaining 55% attributable to the disturbance term, which represents everything your model doesn’t capture.

The formula is straightforward: R-squared equals one minus the ratio of residual variance to total variance. Low R-squared values on your key outcome variables suggest your model is missing important predictors. There’s no universal threshold for what counts as “good” since this depends entirely on your field and research question, but you should report and discuss these values for all endogenous variables. In behavioral research, R-squared values of 0.20 to 0.40 are common and can be substantively meaningful.
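The formula maps directly to code. A minimal sketch, with hypothetical variance estimates chosen to mirror the 0.45 example above:

```python
def r_squared(residual_var, total_var):
    """R^2 = 1 - (residual variance / total variance)."""
    return 1 - residual_var / total_var

# Hypothetical endogenous variable: total variance 4.0,
# disturbance (residual) variance 2.2
print(round(r_squared(2.2, 4.0), 2))  # → 0.45
```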

Indirect Effects and Mediation

If your model includes mediating variables (where A affects B, and B affects C, creating an indirect path from A to C), you’ll need to test whether that indirect effect is statistically significant. The indirect effect is calculated as the product of the path coefficients along the chain. If A to B is 0.40 and B to C is 0.30, the indirect effect of A on C through B is 0.12.

The standard approach for testing indirect effects is bootstrapping, which resamples your data with replacement (typically 1,000 to 5,000 times) to build an empirical sampling distribution of the indirect effect. The resulting confidence interval tells you whether the indirect effect is meaningfully different from zero. If the 95% confidence interval does not include zero, the indirect effect is significant.

For data that follow a normal distribution, Monte Carlo or profile-likelihood methods also perform well. For data that are not normally distributed, the percentile bootstrap is recommended. Some software also offers bias-corrected (BC) and bias-corrected and accelerated (BCa) bootstrap options. These adjust for skewness in the bootstrap distribution and tend to produce more accurate confidence intervals in smaller samples. Whichever method you choose, report both the point estimate of the indirect effect and its confidence interval.
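A minimal percentile-bootstrap sketch with simulated data; the variable names and the true path values of 0.40 and 0.30 are made up to mirror the A → B → C example above, and this is an illustration of the resampling logic rather than a replacement for your SEM software:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated mediation data: X -> M -> Y (hypothetical variables)
n = 300
x = rng.normal(size=n)
m = 0.40 * x + rng.normal(size=n)              # a-path ~ 0.40
y = 0.30 * m + 0.10 * x + rng.normal(size=n)   # b-path ~ 0.30

def indirect_effect(x, m, y):
    """Product of the a-path (M on X) and the b-path (Y on M, controlling X)."""
    a = np.cov(x, m)[0, 1] / np.var(x, ddof=1)
    design = np.column_stack([np.ones_like(x), m, x])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    b = coefs[1]  # partial slope of Y on M
    return a * b

# Percentile bootstrap: resample cases with replacement
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
point = indirect_effect(x, m, y)
print(f"indirect effect = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With the true indirect effect set to 0.12 here, the 2.5th percentile of the bootstrap distribution stays above zero, so the interval excludes zero and the indirect effect would be reported as significant.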

Sample Size Affects Everything

The trustworthiness of every result described above depends partly on whether your sample is large enough for the complexity of your model. Outdated rules of thumb suggest minimums of 100 to 200 cases, or 5 to 10 observations per estimated parameter, or 10 cases per variable. Research has shown these guidelines are far too simplistic.

In practice, required sample sizes range from as few as 30 to over 460 cases depending on your model’s specifics. A one-factor model with four strong indicators (factor loadings around 0.80) can work with as few as 60 participants. The same model with weaker loadings (around 0.50) needs 190. Move to a two-factor model with three indicators per factor and weak loadings, and you need at least 460. The pattern is clear: more complex models with weaker indicators demand substantially larger samples.

It’s also worth knowing that when sample sizes drop below 250, the combination of fit index cutoffs can over-reject models that actually fit the data adequately. If you’re working with a smaller sample and your fit indices are borderline, this known bias is worth considering before scrapping your model entirely.

Modifying Your Model

When initial fit is poor, most software provides modification indices that suggest specific changes (like adding a path or allowing two error terms to correlate) that would improve fit. Each modification index estimates how much the chi-square value would drop if you made that change.

The danger here is overfitting. Every modification you make should be justifiable on theoretical grounds, not just statistical ones. Adding a correlation between two error terms because the modification index is large, without any substantive reason why those two indicators would share variance beyond their intended construct, capitalizes on sample-specific noise. If you make data-driven modifications, report them transparently and ideally validate the revised model in a separate sample. A model that was theoretically specified from the start carries far more weight than one that was tweaked to achieve a CFI of 0.95.