Causal inference is a set of methods for determining whether one thing actually causes another, rather than simply occurring alongside it. It sits at the heart of questions like “Does this medication reduce heart attacks?” or “Did this policy lower unemployment?” where getting the answer wrong has real consequences. The core challenge is straightforward: you can observe what happened, but you can never directly observe what would have happened under different circumstances.
Why Correlation Isn’t Enough
Two variables are correlated when their values move together. Ice cream sales and drowning deaths both rise in summer, producing a strong statistical correlation, but buying ice cream doesn’t cause drowning. A hidden third factor (hot weather) drives both. Correlation is a symmetric statistic: a number describing the strength and direction of an association between variables. Causation is directional: one event actually produces the other.
This distinction sounds obvious in the ice cream example, but it gets much harder in practice. Smoking is correlated with heavy alcohol use, but smoking doesn’t cause alcoholism. Smoking does, however, cause increased risk of lung cancer. The statistical patterns in the data can look identical in both cases. Causal inference provides the tools to tell these situations apart.
The Fundamental Problem
Imagine you want to know whether a new job training program helps people earn more money. For any individual person, there are two possible realities: one where they go through the program, and one where they don’t. The causal effect is the difference between those two outcomes. The problem is that each person can only live one of those realities. You either took the program or you didn’t. You never get to observe both versions of events for the same person at the same time.
This is formally called the “fundamental problem of causal inference”: you never observe both potential outcomes. Every method in the field is essentially a strategy for working around this limitation, either by clever study design or by using statistical techniques to construct a reasonable comparison.
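The missing-outcome structure can be made concrete with a toy table. Every number here is invented; the point is that only the "observed" column would ever appear in a real dataset:

```python
# Toy potential-outcomes table for the job training example. We posit
# both outcomes for each person, but a real dataset contains only the
# observed one. Earnings figures are invented for illustration.
people = [
    # (earnings without program Y0, earnings with program Y1, took program?)
    (30_000, 34_000, True),
    (25_000, 26_000, False),
    (40_000, 45_000, True),
    (28_000, 29_500, False),
]

for y0, y1, treated in people:
    observed = y1 if treated else y0
    missing = y0 if treated else y1   # the counterfactual: never in the data
    effect = y1 - y0                  # individual causal effect, unknowable in practice
    print(f"observed={observed}  counterfactual(hidden)={missing}  effect(hidden)={effect}")
```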
How Randomized Trials Handle It
Randomized controlled trials (RCTs) are often called the gold standard for causal evidence, and the logic is elegant. If you randomly assign thousands of people to either receive a treatment or not, the two groups will, on average, be identical in every way except the treatment itself. Any difference in outcomes can then be attributed to the treatment rather than to some other factor.
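The logic can be sketched in a few lines: simulate a baseline, assign treatment by coin flip, and compare group means. The true effect of +5 is an invented number:

```python
import random
import statistics

random.seed(1)

# Randomization sketch: because treatment is a coin flip, the two
# groups have the same baseline distribution on average, so a simple
# difference in means recovers the (invented) true effect.
TRUE_EFFECT = 5.0
baseline = [random.gauss(50, 10) for _ in range(20_000)]
flags = [random.random() < 0.5 for _ in baseline]

outcomes = [b + TRUE_EFFECT * t for b, t in zip(baseline, flags)]

treated = [y for y, t in zip(outcomes, flags) if t]
control = [y for y, t in zip(outcomes, flags) if not t]

estimate = statistics.mean(treated) - statistics.mean(control)
print(round(estimate, 2))  # close to 5
```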
But RCTs aren’t always possible. You can’t randomly assign people to smoke for 30 years to study lung cancer. You can’t randomly impose poverty on families to study its effects on children. Many important policy questions involve interventions that have already happened or that would be unethical to randomize. This is where the broader toolkit of causal inference becomes essential, offering ways to draw credible causal conclusions from observational data, where no one controlled who received what.
Confounders, Colliders, and Mediators
Three types of third variables create most of the problems (and solutions) in causal analysis.
A confounder is a variable that influences both the cause and the effect you’re studying. If you’re looking at whether exercise reduces depression, income could be a confounder: wealthier people may exercise more and also have lower depression rates for other reasons. Failing to account for a confounder produces a biased estimate of the causal effect. The usual fix is to adjust for it in your analysis, for example by stratifying on it or including it in a regression model.
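Stratified adjustment can be sketched as follows, with invented numbers: income drives both exercise and the depression score, so the naive comparison overstates the benefit, while comparing within income strata recovers the true effect of −2:

```python
import random
import statistics

random.seed(2)

# Confounder adjustment by stratification. Income influences both
# exercise and the depression score; all coefficients are invented.
TRUE_EFFECT = -2.0  # exercise lowers the depression score by 2 points

rows = []
for _ in range(50_000):
    high_income = random.random() < 0.5
    # wealthier people exercise more often
    exercises = random.random() < (0.7 if high_income else 0.3)
    # income also lowers depression for unrelated reasons
    score = 10 - 3 * high_income + TRUE_EFFECT * exercises + random.gauss(0, 1)
    rows.append((high_income, exercises, score))

def mean_score(pred):
    return statistics.mean(s for h, e, s in rows if pred(h, e))

# Naive comparison mixes the income effect into the exercise effect
naive = mean_score(lambda h, e: e) - mean_score(lambda h, e: not e)

# Adjusted: compare within each income stratum, then average
within = [
    mean_score(lambda h, e, h0=h0: h == h0 and e)
    - mean_score(lambda h, e, h0=h0: h == h0 and not e)
    for h0 in (True, False)
]
adjusted = statistics.mean(within)

print(round(naive, 2), round(adjusted, 2))  # naive is too negative; adjusted ≈ -2
```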
A mediator is a variable that sits on the causal pathway between your cause and effect. Exercise might reduce depression partly by improving sleep quality. Sleep is the mediator. Adjusting for a mediator doesn’t fix bias; instead, it lets you break apart the total effect into the portion that flows through the mediator (the indirect effect) and the portion that doesn’t (the direct effect).
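In a linear structural model this decomposition reduces to simple arithmetic: the indirect effect is the product of the two coefficients along the mediated path, and the total effect is the direct effect plus that product. All coefficients below are invented:

```python
# Linear mediation sketch for exercise -> sleep -> depression.
# Coefficients are invented; the product-of-coefficients rule below
# holds for linear structural models.
a = 1.5        # effect of exercise on sleep quality
b = -2.0       # effect of sleep quality on depression score
direct = -1.0  # direct effect of exercise on depression (not via sleep)

indirect = a * b           # portion of the effect flowing through sleep
total = direct + indirect  # total effect of exercise on depression
print(indirect, total)  # -3.0 -4.0
```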
A collider is the trickiest of the three. It’s a variable that is caused by both the cause and the effect. Unlike confounders, you should not adjust for colliders. Doing so actually creates a false association between your variables where none existed. Colliders are less widely understood than confounders, which means researchers sometimes introduce bias by controlling for variables they shouldn’t.
Two Major Frameworks
The field has two complementary ways of thinking about causation. The first, often called the potential outcomes framework, focuses on comparing what happens under different treatments. For any individual, you define two potential outcomes: the outcome if treated and the outcome if untreated. The causal effect is the difference between them. Since you can only observe one, the framework specifies the assumptions needed to estimate the other from data on similar individuals.
The second framework uses causal diagrams, sometimes called directed acyclic graphs (DAGs). These are visual maps where arrows point from causes to effects. A diagram might show that smoking points to lung cancer, that a genetic factor points to both smoking and lung cancer, and that lung cancer points to a medical diagnosis. The key insight is that missing arrows matter as much as present ones: a missing arrow is a claim that one variable has zero influence on another. These diagrams help researchers identify which variables to adjust for and which to leave alone.
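A DAG of this example can be written as a plain adjacency mapping, with arrows pointing from cause to effect; listing a node's parents is a first step toward reasoning about adjustment sets. A minimal sketch (node names are illustrative):

```python
# Toy DAG as an adjacency mapping: arrows point from cause to effect.
# Any edge absent from this mapping is a claim of zero direct influence
# (e.g. there is no arrow genetics -> diagnosis).
dag = {
    "genetics": ["smoking", "lung_cancer"],
    "smoking": ["lung_cancer"],
    "lung_cancer": ["diagnosis"],
    "diagnosis": [],
}

def parents(node):
    """Direct causes of a node in the diagram."""
    return sorted(p for p, children in dag.items() if node in children)

print(parents("lung_cancer"))  # ['genetics', 'smoking']
```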
This second framework also introduced a mathematical tool for distinguishing observation from intervention. Observing that people who take a drug have better outcomes is different from concluding that giving the drug causes better outcomes, because the people who chose to take it may differ from those who didn’t. The do-operator, written do(X = x), formalizes this separation: it distinguishes the probability of an outcome given that treatment was observed from the probability given that treatment was imposed, allowing researchers to determine when observational data can answer causal questions and when it can’t.
Assumptions That Make It Work
Drawing causal conclusions from non-experimental data requires several assumptions, and if they’re violated, the conclusions can be wrong.
The first is that one person’s treatment doesn’t affect another person’s outcome (the no-interference part of what is often called SUTVA, the stable unit treatment value assumption). If you’re studying a vaccine, this assumption is typically violated, because vaccinating one person changes the disease exposure of the people around them. When it holds, each person’s potential outcomes are stable regardless of what happens to others.
The second assumption is that everyone in the study has some chance of receiving each treatment. If a certain group of people could never possibly receive the treatment, you can’t estimate its effect for that group. This is called positivity or overlap.
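Positivity can be checked mechanically: within each covariate stratum, count how often each treatment level occurs. The strata and records below are invented:

```python
from collections import Counter

# Positivity (overlap) check: does every treatment level occur in every
# covariate stratum? Strata and records are invented for illustration.
records = [
    # (age_group, treated)
    ("under_40", True), ("under_40", False), ("under_40", True),
    ("40_to_65", False), ("40_to_65", True),
    ("over_65", False), ("over_65", False),  # no one over 65 was ever treated
]

by_stratum = {}
for stratum, treated in records:
    by_stratum.setdefault(stratum, Counter())[treated] += 1

violations = [s for s, counts in by_stratum.items()
              if counts[True] == 0 or counts[False] == 0]
print(violations)  # ['over_65'] — no effect estimate is possible for this group
```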
The third, and often the hardest to defend, is that there are no unmeasured confounders. This means that within groups of people who look similar on all the variables you’ve measured, treatment assignment is essentially random. In a randomized trial this holds by design. In observational data, it requires that you’ve measured everything that jointly influences both the treatment and the outcome. There’s no statistical test that can confirm this assumption; it rests on subject-matter knowledge.
Methods for Observational Data
When randomization isn’t possible, researchers use several techniques to approximate experimental conditions.
Propensity score matching pairs treated individuals with untreated individuals who had a similar probability of being treated, based on their observed characteristics. The idea is to create groups that look as alike as possible, mimicking what randomization would have done. One limitation is that when the treated and untreated groups are very different, matching can discard large portions of the data to find comparable pairs, shrinking the sample considerably.
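A minimal sketch of the matching idea, with invented data. For clarity the true propensity score is used directly (it equals the covariate x here); in real work it would be estimated, for example with logistic regression:

```python
import bisect
import random
import statistics

random.seed(4)

# Nearest-neighbor matching sketch. The covariate x drives both
# treatment probability and the outcome, so the naive comparison is
# biased; matching on the propensity removes that bias. Invented data.
TRUE_EFFECT = 3.0

units = []
for _ in range(20_000):
    x = random.random()                  # observed characteristic
    treated = random.random() < x        # propensity to be treated = x
    outcome = 10 * x + TRUE_EFFECT * treated + random.gauss(0, 1)
    units.append((x, treated, outcome))

# Naive comparison is biased: treated units have higher x to begin with
naive = (statistics.mean(y for x, t, y in units if t)
         - statistics.mean(y for x, t, y in units if not t))

# Match each treated unit to the control with the closest propensity
controls = sorted((x, y) for x, t, y in units if not t)
ctrl_x = [x for x, _ in controls]

def nearest_control_outcome(p):
    i = bisect.bisect_left(ctrl_x, p)
    best = min((j for j in (i - 1, i) if 0 <= j < len(controls)),
               key=lambda j: abs(ctrl_x[j] - p))
    return controls[best][1]

est = statistics.mean(y - nearest_control_outcome(x)
                      for x, t, y in units if t)
print(round(naive, 2), round(est, 2))  # naive overstates; matched ≈ 3
```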
Difference-in-differences compares changes in outcomes over time between a group affected by an intervention and a group that wasn’t. If a new healthcare policy is introduced in one state but not a neighboring state, you can compare how outcomes changed in both places before and after the policy took effect. This controls for any stable, unmeasured differences between the groups, because those differences cancel out when you look at changes rather than levels. It’s commonly used to evaluate policy changes, program implementations, and regulatory reforms.
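The estimator itself is just arithmetic on four group means. The outcome values below (an uninsured rate, say) are invented:

```python
import statistics

# Difference-in-differences from four group means. Outcome values are
# invented; the policy state adopts the reform, the neighbor does not.
outcomes = {
    # (state, period): outcomes for sampled units
    ("policy_state", "before"):   [12.0, 11.5, 12.5],
    ("policy_state", "after"):    [9.0, 8.5, 9.5],
    ("neighbor_state", "before"): [14.0, 13.5, 14.5],
    ("neighbor_state", "after"):  [13.0, 12.5, 13.5],
}

mean = {k: statistics.mean(v) for k, v in outcomes.items()}

change_policy = mean[("policy_state", "after")] - mean[("policy_state", "before")]
change_neighbor = mean[("neighbor_state", "after")] - mean[("neighbor_state", "before")]

# Stable differences between the states cancel out of each change;
# the remainder is attributed to the policy (given parallel trends)
did = change_policy - change_neighbor
print(did)  # -3.0 - (-1.0) = -2.0
```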
Instrumental variables use a third variable that affects the treatment but has no direct effect on the outcome except through the treatment. A classic example: distance from a hospital affects whether someone receives a particular surgery, but distance itself doesn’t directly affect health outcomes. The instrument provides a source of variation in treatment that mimics randomization. Good instruments are hard to find, and weak ones produce unreliable estimates.
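With a binary instrument, the simplest version is the Wald estimator: the instrument's effect on the outcome divided by its effect on the treatment. A simulation sketch with invented parameters, in which an unmeasured confounder biases the naive comparison but not the IV estimate:

```python
import random
import statistics

random.seed(5)

# Wald/IV sketch: the instrument z (living near the hospital) shifts
# who gets surgery but affects health only through surgery. The
# unmeasured confounder u (frailty) makes frail people both likelier
# to get surgery and sicker afterwards. All parameters are invented.
TRUE_EFFECT = 4.0

data = []
for _ in range(100_000):
    z = random.random() < 0.5                       # instrument: near hospital?
    u = random.gauss(0, 1)                          # unmeasured confounder
    p = 0.2 + 0.5 * z + 0.1 * max(min(u, 1.0), -1.0)
    surgery = random.random() < p
    health = 50 + TRUE_EFFECT * surgery - 3 * u + random.gauss(0, 1)
    data.append((z, surgery, health))

# Naive comparison is biased by u
naive = (statistics.mean(h for z, s, h in data if s)
         - statistics.mean(h for z, s, h in data if not s))

# Wald estimator: effect of z on outcome / effect of z on treatment
num = (statistics.mean(h for z, s, h in data if z)
       - statistics.mean(h for z, s, h in data if not z))
den = (statistics.mean(s for z, s, h in data if z)
       - statistics.mean(s for z, s, h in data if not z))
iv = num / den
print(round(naive, 2), round(iv, 2))  # naive is biased downward; iv ≈ 4
```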
Causal Inference in Machine Learning
Traditional machine learning excels at prediction: given a patient’s characteristics, what’s the likely outcome? But prediction doesn’t tell you what to do about it. Causal inference is increasingly being integrated into machine learning to answer intervention questions: if you change something, what happens?
One active area combines the two to estimate how treatment effects vary across individuals. A new drug might work well for younger patients but not older ones. Methods like causal forests use machine learning’s pattern-finding ability to discover which subgroups benefit most from an intervention, then apply causal reasoning to ensure those estimates reflect genuine effects rather than spurious patterns. Stanford’s machine learning and causal inference program, for instance, focuses on using these hybrid methods to measure intervention effects, understand who benefits most, and design targeted treatment policies. The combination helps move beyond “what will happen” toward “what should we do,” which is often the question that actually matters.
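A causal forest discovers such subgroups automatically; the underlying idea can be sketched with a hard-coded split. Here treatment is randomized for simplicity, the drug helps patients under 50 and not older ones, and all numbers are invented:

```python
import random
import statistics

random.seed(6)

# Heterogeneous-effects sketch: the true effect is +5 for patients
# under 50 and 0 otherwise. A causal forest would learn this split
# from the data; here it is hard-coded. All numbers invented.
rows = []
for _ in range(40_000):
    age = random.randint(20, 80)
    treated = random.random() < 0.5          # randomized for simplicity
    effect = 5.0 if age < 50 else 0.0
    outcome = 70 - 0.2 * age + effect * treated + random.gauss(0, 2)
    rows.append((age, treated, outcome))

def effect_in(pred):
    """Difference in mean outcomes within the subgroup pred selects."""
    t = [y for a, tr, y in rows if pred(a) and tr]
    c = [y for a, tr, y in rows if pred(a) and not tr]
    return statistics.mean(t) - statistics.mean(c)

overall = effect_in(lambda a: True)          # averages over both subgroups
younger = effect_in(lambda a: a < 50)        # ≈ 5
older = effect_in(lambda a: a >= 50)         # ≈ 0
print(round(overall, 1), round(younger, 1), round(older, 1))
```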

