What Is Simpson’s Paradox and Why Does It Matter?

Simpson’s paradox is a statistical phenomenon where a trend that appears in several groups of data reverses or disappears when those groups are combined. A treatment can work better in every subgroup of patients yet appear worse overall. A company can improve satisfaction in every customer segment yet see its aggregate score decline. The numbers aren’t wrong in either case. The paradox arises because of a hidden variable that changes the mix of data when you pool everything together.

The name comes from Edward Simpson, a British statistician who described the problem in a 1951 paper titled “The Interpretation of Interaction in Contingency Tables.” But the phenomenon had been noticed even earlier: Karl Pearson described a version of it in 1899, Udny Yule another in 1903, and the label “Simpson’s paradox” itself was only coined by Colin Blyth in 1972. What makes it a “paradox” isn’t that it breaks any rule of math. It’s that the results violate our intuition about what the data should tell us, especially when we’re trying to figure out what causes what.

How the Reversal Actually Works

At its core, Simpson’s paradox happens when a third variable, sometimes called a lurking or confounding variable, influences both the outcome you’re measuring and the groups being compared. When you ignore that variable and lump all the data together, the uneven distribution of people across subgroups distorts the overall picture.

Here’s the logic in plain terms. Suppose you’re comparing two hospitals on survival rates. Hospital A treats mostly high-risk patients. Hospital B treats mostly low-risk ones. Hospital A might have better survival rates for both high-risk and low-risk patients individually, but because it handles a far greater share of the difficult cases (which have lower survival rates no matter where they’re treated), its overall survival rate can look worse. The hidden variable, patient severity, skews the aggregate numbers.
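
The arithmetic behind that reversal is easy to verify. Here is a minimal sketch in Python with invented counts (the numbers are hypothetical, chosen only to reproduce the pattern described above):

```python
# Hypothetical (survived, treated) counts per hospital and risk group.
# Hospital A treats mostly high-risk patients; Hospital B mostly low-risk.

def rate(survived, treated):
    return survived / treated

a_high, a_low = (300, 500), (95, 100)   # Hospital A's caseload skews high-risk
b_high, b_low = (50, 100), (440, 500)   # Hospital B's caseload skews low-risk

# Within each subgroup, Hospital A wins:
assert rate(*a_high) > rate(*b_high)    # 60% vs 50% for high-risk patients
assert rate(*a_low) > rate(*b_low)      # 95% vs 88% for low-risk patients

# Pooled, Hospital A looks worse, because most of its patients
# come from the low-survival high-risk group:
a_overall = rate(a_high[0] + a_low[0], a_high[1] + a_low[1])  # 395/600, about 66%
b_overall = rate(b_high[0] + b_low[0], b_high[1] + b_low[1])  # 490/600, about 82%
assert a_overall < b_overall
```

The pooled numbers are just weighted averages, and the weights (each hospital's case mix) are what flip the comparison.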

This is why researchers describe Simpson’s paradox as fundamentally a problem of confounding. The question it forces you to answer is: should you look at the combined data, or the subgroup data? And that answer depends not on the math itself, but on the causal structure of the situation you’re studying.

The UC Berkeley Admissions Case

The most famous example comes from graduate admissions at the University of California, Berkeley in the fall of 1973. Of the 8,442 men who applied, 44 percent were admitted. Of the 4,351 women who applied, only 35 percent got in. On the surface, this looked like clear evidence of gender bias against women.

But when statisticians Peter Bickel, Eugene Hammel, and J. W. O’Connell analyzed the data department by department in 1975, the bias against women disappeared. In several departments, women were actually admitted at higher rates than men. The paradox occurred because women disproportionately applied to departments with low acceptance rates, while men disproportionately applied to departments with high acceptance rates. The lurking variable was department choice. Once you accounted for where people applied, the apparent discrimination in the aggregate data evaporated.
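
The same mechanics can be reproduced with a toy two-department table. (The real analysis covered six departments; the counts below are invented solely to show how the reversal arises.)

```python
# Invented admissions table: (admitted, applied) by gender per department.
applicants = {
    "easy dept": {"men": (480, 800), "women": (70, 100)},   # high acceptance rate
    "hard dept": {"men": (40, 200),  "women": (225, 900)},  # low acceptance rate
}

def overall(gender):
    admitted = sum(d[gender][0] for d in applicants.values())
    applied = sum(d[gender][1] for d in applicants.values())
    return admitted / applied

# Women are admitted at a higher rate within EACH department...
for dept in applicants.values():
    assert dept["women"][0] / dept["women"][1] > dept["men"][0] / dept["men"][1]

# ...yet at a lower rate overall (29.5% vs 52%), because most women
# applied to the department that rejects most applicants of either gender.
assert overall("women") < overall("men")
```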

This case is a useful reminder that aggregate statistics can tell a completely misleading story. The overall numbers were real, but the conclusion people naturally drew from them (that Berkeley was biased against female applicants) was wrong.

Kidney Stone Treatments

A medical example makes the paradox even more concrete. A classic study compared two treatments for kidney stones: open surgery and a less invasive procedure called percutaneous nephrolithotomy. For small stones (under 2 cm), open surgery succeeded 93.1% of the time, compared to 86.7% for the less invasive option. For large stones (2 cm or bigger), open surgery again won: 73.0% versus 68.8%.

Open surgery was better in both subgroups. Yet when you combined all patients regardless of stone size, the less invasive procedure appeared superior: 82.6% overall versus 78.0% for open surgery. The reversal happened because doctors tended to assign patients with large, difficult stones to open surgery and patients with small stones to the less invasive method. Since large stones are harder to treat no matter what, open surgery’s overall numbers were dragged down by its disproportionate share of tough cases.
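
The percentages above follow directly from the patient counts usually reported for this study, and the reversal can be checked in a few lines:

```python
# (successes, patients) by stone size, as commonly reported for the 1986 study.
open_surgery = {"small": (81, 87), "large": (192, 263)}
percutaneous = {"small": (234, 270), "large": (55, 80)}

def success_rate(groups):
    succeeded = sum(s for s, _ in groups.values())
    treated = sum(n for _, n in groups.values())
    return succeeded / treated

# Open surgery wins within each stone-size subgroup...
for size in ("small", "large"):
    assert (open_surgery[size][0] / open_surgery[size][1]
            > percutaneous[size][0] / percutaneous[size][1])

# ...but loses on the pooled numbers: 78.0% vs 82.6%.
assert round(success_rate(open_surgery) * 100, 1) == 78.0
assert round(success_rate(percutaneous) * 100, 1) == 82.6
```

Note that each treatment saw 350 patients in total; only the split between easy and hard cases differed.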

COVID-19 Death Rates Across Countries

Simpson’s paradox showed up during the pandemic in a striking way. When researchers compared case fatality rates between China (as of February 2020) and Italy (as of March 2020), Italy’s overall death rate was higher. But when the data were broken down by age group, Italy’s fatality rate was actually lower in every single age bracket.

The explanation was demographic. Italy’s population is significantly older than China’s, with a median age of 45.4 compared to 38.4. Italy also had a larger share of confirmed cases among elderly people, who face the highest risk from COVID-19 regardless of country. So even though Italy was performing better within each age group, its overall numbers looked worse because a bigger proportion of its cases were concentrated in the most vulnerable age range. Different testing strategies and patterns of social contact between generations likely played a role too, but the age distribution was the primary driver of the paradox.
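
The age-distribution effect can be mimicked with invented numbers. (These are not the real Chinese or Italian figures; they exist only to show the mechanism.)

```python
# Hypothetical age-stratified (deaths, cases) counts. Country B has a
# lower fatality rate in every age band, but far more of its confirmed
# cases fall in the oldest, highest-risk band.
country_a = {"under 60": (60, 60000), "60 to 79": (300, 15000), "80 plus": (500, 5000)}
country_b = {"under 60": (20, 30000), "60 to 79": (500, 30000), "80 plus": (3000, 40000)}

def cfr(country):
    """Overall case fatality rate: total deaths over total cases."""
    return sum(d for d, _ in country.values()) / sum(c for _, c in country.values())

# Country B is safer within every age band...
for band in country_a:
    assert (country_b[band][0] / country_b[band][1]
            < country_a[band][0] / country_a[band][1])

# ...yet its overall fatality rate is higher, because its cases are
# concentrated among the elderly.
assert cfr(country_b) > cfr(country_a)
```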

Why It Matters for Everyday Decisions

Simpson’s paradox isn’t just an academic curiosity. It creates real problems whenever people make decisions based on aggregated numbers without looking at what’s happening underneath.

In business, it can lead to exactly the wrong strategic call. One product manager described finding that, across his teams in aggregate, work that went through a discovery phase had a higher rework rate (33%) than work that skipped it (31%). The instinct was to scrap the discovery phase entirely. But when the data were split by individual team, both teams actually performed better with the discovery phase. The aggregate number was misleading because the teams had different baseline rework rates and used the discovery phase at different frequencies.

A company tracking customer satisfaction might see its overall score drop from 80% to 75% year over year, while every individual customer segment actually improved. This can happen when the company’s customer base shifts toward segments that have historically lower satisfaction, even if those segments are trending upward. Acting on the aggregate decline without checking the subgroups could lead to unnecessary overhauls of a strategy that’s actually working.
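
This mix-shift effect is easy to demonstrate with made-up numbers (the segment names and counts below are hypothetical):

```python
# (satisfied, surveyed) per customer segment, two consecutive years.
# Both segments improve, but the customer base shifts toward the
# segment with historically lower satisfaction.
year1 = {"enterprise": (738, 900), "self_serve": (62, 100)}   # 82% and 62%
year2 = {"enterprise": (420, 500), "self_serve": (330, 500)}  # 84% and 66%

def overall_score(year):
    return sum(s for s, _ in year.values()) / sum(n for _, n in year.values())

# Every segment improved year over year...
for seg in year1:
    assert year2[seg][0] / year2[seg][1] > year1[seg][0] / year1[seg][1]

# ...yet the aggregate score fell from 80% to 75%, purely because
# half the year-two respondents came from the lower-scoring segment.
assert overall_score(year1) == 0.80
assert overall_score(year2) == 0.75
```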

In regulated industries, the stakes can be higher. A company might appear compliant at the aggregate level while failing to meet performance requirements within specific subgroups, or vice versa.

How to Spot and Avoid It

The single most important defense against Simpson’s paradox is to think carefully about what variables might be influencing both the groups you’re comparing and the outcome you’re measuring. Before trusting any aggregate trend, ask: could the mix of cases be different across groups in a way that distorts the combined picture?

Researchers use a few specific approaches. Stratification, which means breaking data into meaningful subgroups before drawing conclusions, is the most straightforward. If the trend holds both within each subgroup and in the aggregate, you’re on solid ground. If it reverses, you’ve found a lurking variable, and you need to work out its causal role before trusting either the pooled or the subgroup numbers.
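
That check can be made mechanical. The sketch below (an illustrative helper, not a library function) compares two groups stratum by stratum and flags a reversal; feeding it the patient counts behind the kidney stone example trips the flag:

```python
def simpson_check(group_a, group_b):
    """Each argument maps stratum -> (successes, trials).
    Returns per-stratum comparisons, the pooled comparison, and
    whether they disagree (a Simpson-style reversal)."""
    def pooled_rate(group):
        return (sum(s for s, _ in group.values())
                / sum(n for _, n in group.values()))
    per_stratum = {
        k: group_a[k][0] / group_a[k][1] > group_b[k][0] / group_b[k][1]
        for k in group_a
    }
    pooled = pooled_rate(group_a) > pooled_rate(group_b)
    reversal = all(per_stratum.values()) and not pooled
    return per_stratum, pooled, reversal

# Kidney stone counts: open surgery vs the less invasive procedure.
per_stratum, pooled, reversal = simpson_check(
    {"small": (81, 87), "large": (192, 263)},    # open surgery
    {"small": (234, 270), "large": (55, 80)},    # percutaneous
)
assert reversal  # better in every stratum, worse pooled
```

A flag like this only tells you the two views disagree; deciding which view to trust still requires the causal reasoning discussed next.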

A more formal approach involves mapping out the causal relationships between variables before running any analysis. The statistician Judea Pearl has shown, using tools called directed acyclic graphs, that sometimes adjusting for a subgroup variable is the right move, but sometimes it can actually introduce new distortions. The correct choice depends on whether the variable is a genuine confounder or something else in the causal chain. In other words, you can’t resolve the paradox with statistics alone. You need a clear theory about what’s causing what.

For practical purposes, a good first step is to clearly define the mechanism you think is at work, determine what level it operates at (individuals, teams, populations), and then check whether your data actually match that level. Software tools now exist, including packages in the R programming language, specifically designed to flag potential instances of Simpson’s paradox when testing relationships between two variables. But the most reliable safeguard is still a habit of skepticism: whenever a single number summarizes a complex situation, look at the subgroups before you act on it.