Difference-in-differences (DiD) estimates deserve cautious trust. They can isolate causal effects of policies and interventions when randomized experiments aren’t possible, but their reliability hinges entirely on assumptions that are untestable in their strictest form. Recent methodological advances have revealed that the most common version of DiD, the two-way fixed effects model, can produce biased or even directionally wrong estimates under conditions that are surprisingly common in applied research. Knowing where DiD breaks down is the key to knowing when to trust it.
The Assumption You Can’t Fully Verify
Every DiD estimate rests on the parallel trends assumption: the treated and comparison groups would have followed parallel outcome trends if the treatment had never happened. This is a claim about a counterfactual, a world we never observe, which means it can never be directly confirmed. You can look at whether trends were parallel before treatment, and that’s helpful, but pre-treatment parallel trends don’t guarantee post-treatment parallel trends. A divergence could have been about to happen for reasons completely unrelated to the policy.
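For concreteness, here is the classic two-group, two-period contrast written out as a minimal pandas sketch (the column names treated, post, and y are assumptions for illustration, not from any particular study). Parallel trends is precisely the claim that the comparison group’s before-after change is a valid stand-in for the change the treated group would have experienced without the policy.

```python
import pandas as pd

def did_2x2(df):
    """Classic 2x2 DiD: the treated group's before/after change
    minus the comparison group's before/after change."""
    m = df.groupby(["treated", "post"])["y"].mean()
    treated_change = m.loc[(1, 1)] - m.loc[(1, 0)]
    comparison_change = m.loc[(0, 1)] - m.loc[(0, 0)]
    return treated_change - comparison_change
```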
A common refinement, the conditional parallel trends assumption, requires parallel trends only among units that share the same values of observed background characteristics. Conditioning on observables often makes the assumption more plausible, and it gives researchers a way to adjust for measured differences between groups, but it still requires believing that no unobserved factor was differentially shifting one group’s trajectory. If you’re reading a DiD study and the treated group was already on a different trajectory before the intervention, or the comparison group experienced its own shock during the study period, the estimate is unreliable no matter how sophisticated the model is.
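One minimal way to see what conditioning buys you: estimate the same 2x2 contrast within strata of an observed characteristic, then average the stratum-level estimates using the treated group’s composition. The sketch below assumes hypothetical columns (treated, post, y, and a discrete covariate) and exact stratification; real applications typically use regression adjustment, matching, or weighting instead.

```python
import pandas as pd

def conditional_did(df, covariate):
    """2x2 DiD within each stratum of `covariate`, averaged with weights
    given by the treated group's stratum shares. Assumes every stratum
    contains treated and comparison units in both periods."""
    def contrast(g):
        m = g.groupby(["treated", "post"])["y"].mean()
        return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])
    effects = df.groupby(covariate).apply(contrast)
    # Weight each stratum by its share of treated observations.
    weights = df.loc[df["treated"] == 1, covariate].value_counts(normalize=True)
    return (effects * weights).sum()
```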
Why the Standard Model Fails With Staggered Rollouts
Many policies don’t hit everywhere at once. States adopt minimum wage increases in different years; countries implement regulations on different timelines. Researchers have long handled this with two-way fixed effects models that include indicators for each unit and each time period. For years this was considered standard practice, but it turns out to have a serious flaw.
The two-way fixed effects estimator produces a weighted average of treatment effects across all treated groups, but the weights are driven by statistical properties like treatment variance rather than anything substantively meaningful. Units treated near the middle of the study period get disproportionately large weights. If the policy’s effect varies across groups (say, early adopters benefit more than late adopters), these unintuitive weights distort the overall estimate in ways the researcher may not realize.
A separate and more alarming problem arises when treatment effects change over time. Under staggered adoption, the model sometimes uses already-treated units as controls for newly-treated units. If the effect in the already-treated group is still growing, that growth gets absorbed into the estimated baseline trend and subtracted off, which can make the treatment look like it has the opposite effect. To illustrate: if the policy increases harm for both groups but the earlier-treated group’s effect is growing faster (a slope of 3 versus 2), the comparison yields a negative number, making a harmful policy appear protective. This isn’t a minor technical nuisance; it can flip the sign of your estimate.
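A small simulation makes the sign flip concrete. The numbers below are hypothetical, chosen to match the slope-of-3-versus-2 illustration above: every true effect is positive, yet the static two-way fixed effects coefficient comes out negative.

```python
import pandas as pd
import statsmodels.formula.api as smf

rows = []
for unit, adopt, slope in [("early", 2, 3), ("late", 7, 2)]:
    for t in range(1, 11):
        d = int(t >= adopt)
        # True effect is always positive and grows with time since adoption.
        y = slope * (t - adopt + 1) if d else 0.0
        rows.append({"unit": unit, "t": t, "d": d, "y": y})
panel = pd.DataFrame(rows)

# Static two-way fixed effects: y ~ d + unit dummies + period dummies.
twfe = smf.ols("y ~ d + C(unit) + C(t)", data=panel).fit()
print(round(twfe.params["d"], 2))  # -5.0: a harmful policy looks protective
```

Every unit-period treatment effect in this data is between +2 and +27, but the regression reports -5 because the early adopter’s growing effect is used as the “control” trend for the late adopter.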
These problems were formalized in work by Andrew Goodman-Bacon, who showed that the two-way fixed effects estimator requires an additional assumption beyond parallel trends: treatment effects must be constant across groups and over time. In practice, constant effects are the exception, not the rule.
Newer Estimators That Address These Problems
The methodological response has been a wave of robust estimators designed for settings with staggered treatment timing and heterogeneous effects. Approaches developed by Callaway and Sant’Anna, Sun and Abraham, and others avoid comparing already-treated units against newly-treated units. Instead, they estimate group-specific and time-specific treatment effects, then aggregate them with transparent, researcher-chosen weights.
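To make “group-specific and time-specific treatment effects” concrete, here is a stripped-down sketch of the ATT(g, t) building block in the spirit of Callaway and Sant’Anna. It is not the actual API of any package: it omits covariates and inference, codes never-treated units as cohort == 0, and uses the period just before adoption as the baseline.

```python
import pandas as pd

def att_gt(df, g, t):
    """ATT(g, t): a 2x2 DiD for cohort g (first treated in period g)
    versus never-treated units (cohort == 0), baselined at period g - 1."""
    cohort = df[df["cohort"] == g]
    never = df[df["cohort"] == 0]
    change_cohort = (cohort.loc[cohort["t"] == t, "y"].mean()
                     - cohort.loc[cohort["t"] == g - 1, "y"].mean())
    change_never = (never.loc[never["t"] == t, "y"].mean()
                    - never.loc[never["t"] == g - 1, "y"].mean())
    return change_cohort - change_never
```

The full estimators add covariate adjustment and proper inference, then aggregate these cells with weights the researcher picks explicitly, by cohort size, calendar period, or event time, rather than the opaque variance-driven weights two-way fixed effects imposes.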
These newer methods require the parallel trends assumption to hold for all periods and all treated groups, which is a stronger version of the assumption than the classic two-group, two-period case demands. But they avoid the mechanical bias that two-way fixed effects introduces. If you’re evaluating a DiD study published after roughly 2020 and it still relies exclusively on a standard two-way fixed effects model with staggered treatment, that’s a reason to be skeptical. The tools to do better now exist and are widely accessible.
What Placebo Tests Can and Can’t Tell You
Credible DiD studies typically include placebo tests, sometimes called falsification tests or negative control tests. These come in several flavors. A researcher might apply the analysis to a fake treatment date (testing whether an “effect” appears before the policy actually started), a placebo outcome that the policy shouldn’t affect, or a placebo population that wasn’t exposed. If these tests show effects where none should exist, something is wrong with the research design.
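As a sketch of the fake-treatment-date variant (the column names here are hypothetical): restrict the panel to pre-treatment periods, pretend the policy began midway through them, and re-estimate. A clearly nonzero placebo coefficient is evidence of diverging pre-trends.

```python
import statsmodels.formula.api as smf

def placebo_fake_date(panel, real_start, fake_start):
    """Re-run the DiD on pre-treatment data with a fabricated start date.
    A significant `d_fake` coefficient signals a design problem."""
    pre = panel[panel["t"] < real_start].copy()
    pre["d_fake"] = ((pre["ever_treated"] == 1)
                     & (pre["t"] >= fake_start)).astype(int)
    fit = smf.ols("y ~ d_fake + C(unit) + C(t)", data=pre).fit()
    return fit.params["d_fake"], fit.pvalues["d_fake"]
```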
These tests are genuinely useful, but they have limits. They can detect certain violations of the parallel trends assumption, and they can flag problems with standard errors (for instance, if the model understates uncertainty). What they can’t do is prove the assumption holds. A study that passes every placebo test could still be biased by an unobserved confounder that happened to coincide with the real treatment. Think of placebo tests as necessary but not sufficient: a study that fails them is clearly flawed, but a study that passes them isn’t automatically right.
When Synthetic Controls May Be More Trustworthy
For studies with only one or a few treated units (a single state passing a law, one country implementing a ban), synthetic control methods offer an alternative worth considering. Instead of assuming parallel trends between the treated unit and some pre-selected comparison group, synthetic controls build a weighted composite of untreated units that closely matches the treated unit’s pre-treatment trajectory. The key advantage is transparency: you can see exactly how well the synthetic control tracks the treated unit before the intervention, and judge for yourself whether the match is convincing.
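In code, the core idea is a small constrained least-squares problem. The sketch below is a simplification (real implementations also match on covariates and tune predictor weights): it finds nonnegative donor weights summing to one that best reproduce the treated unit’s pre-treatment path.

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(y_treated_pre, Y_donors_pre):
    """y_treated_pre: (T0,) treated unit's pre-treatment outcomes.
    Y_donors_pre: (T0, J) pre-treatment outcomes for J donor units.
    Returns nonnegative donor weights that sum to 1."""
    J = Y_donors_pre.shape[1]
    loss = lambda w: np.sum((y_treated_pre - Y_donors_pre @ w) ** 2)
    res = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
                   bounds=[(0.0, 1.0)] * J,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x
```

The transparency advantage described above is then just the residual gap between y_treated_pre and the weighted donor composite: if that gap is large, you can see for yourself that the match failed.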
This data-driven matching approach sidesteps some of the judgment calls involved in choosing a comparison group for DiD. But it has its own limitation. When the pool of potential donor units is small, finding a good pre-treatment match becomes difficult, and a poor match means the post-treatment comparison is unreliable. Neither method dominates the other in all settings. Synthetic controls tend to shine when the number of treated units is very small and a rich donor pool is available. DiD tends to work better when you have many treated and untreated units and a reasonable case for parallel trends.
A Practical Checklist for Evaluating DiD Studies
When you encounter a DiD estimate, whether in an academic paper, a policy brief, or a news article citing research, a few questions can help you calibrate your trust:
- Are pre-treatment trends actually parallel? Look for a graph showing outcome trends before the intervention. Diverging pre-trends are a dealbreaker.
- Is treatment staggered? If yes, did the authors use a robust estimator, or did they rely on standard two-way fixed effects? The latter is a red flag for bias.
- Could something else explain the result? Did another event, policy, or economic shift coincide with the treatment? DiD can’t separate the treatment from simultaneous confounders.
- Did the study include placebo tests? Fake treatment dates, unaffected outcomes, and unexposed populations should all show null results. If they weren’t tested, the study is less convincing.
- How large and stable is the comparison group? A comparison group that’s very different from the treated group, or that experiences its own disruptions during the study, weakens the foundation of the estimate.
DiD remains one of the most widely used tools in policy evaluation for good reason: when its assumptions hold, it provides credible causal estimates from observational data. But “when its assumptions hold” is doing a lot of work in that sentence. The method is only as trustworthy as the setting it’s applied in and the care the researcher takes to probe its vulnerabilities. Treat every DiD estimate as a starting point for scrutiny, not a final answer.

