Propensity score matching (PSM) is the right tool when you have observational data, you want to estimate a causal treatment effect, and randomization wasn’t possible. It works by pairing people who received a treatment with similar people who didn’t, based on their predicted probability of receiving that treatment. But it’s not always the best choice, and using it in the wrong situation can actually increase bias rather than reduce it.
The Core Problem PSM Solves
In a randomized controlled trial, random assignment ensures that the treatment and control groups are similar on average. In observational data, that guarantee vanishes. People who receive a treatment often differ systematically from those who don’t, and those differences can distort your estimate of the treatment’s effect. PSM attempts to recreate the balance you’d get from randomization by matching treated and untreated subjects who had a similar likelihood of being treated, based on their observed characteristics.
The technique reduces a potentially large set of confounding variables down to a single number: the propensity score. Instead of trying to match on dozens of variables simultaneously, you match on this one summary measure. This makes the matching process tractable even when you have many covariates to account for.
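To make this concrete, here is a minimal sketch of how a set of covariates collapses to a single score. In practice the coefficients come from fitting a logistic regression of treatment assignment on the covariates; the coefficients and covariate names below are purely illustrative:

```python
import math

def propensity_score(covariates, coefficients, intercept):
    """Propensity score: P(treated | covariates) under a logistic model.
    Many covariates are reduced to one number between 0 and 1."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical fitted coefficients for [age, severity_score, prior_visits]
coefs, intercept = [0.04, 0.8, 0.1], -4.0

# Two subjects with different covariate profiles each collapse to one score
patient_a = propensity_score([55, 2.0, 3], coefs, intercept)
patient_b = propensity_score([30, 0.5, 1], coefs, intercept)
print(round(patient_a, 3), round(patient_b, 3))
```

Matching then compares these single numbers rather than all three covariates at once.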
When PSM Is the Right Choice
PSM fits best when several conditions align. First, you’re asking a causal question about a treatment or exposure using non-randomized data. Second, you have a clear binary treatment (received it or didn’t). Third, you have a reasonably large pool of untreated subjects to match against, with meaningful overlap between groups.
PSM has a particular advantage over standard regression when the treated and untreated groups look very different from each other. Regression can hide the fact that it’s extrapolating far beyond the data, quietly generating estimates for combinations of covariates that don’t actually exist in your sample. PSM forces you to confront this problem directly: if a treated subject has no plausible match, you see it. In this sense, PSM mimics the design stage of a randomized trial: covariate similarity between groups is optimized before any outcome is examined, rather than adjusted for afterward in an outcome model.
PSM also helps when you have many confounders relative to your number of outcome events. Traditional regression runs into trouble when you have more than roughly one variable per ten outcome events. PSM sidesteps this by modeling treatment assignment rather than the outcome, which typically gives you more statistical room to include covariates.
Two Assumptions That Must Hold
PSM rests on two non-negotiable assumptions. If either one fails, the method can produce misleading results.
The first is the “no unmeasured confounders” assumption: every variable that influences both treatment assignment and the outcome must be measured and included in the propensity score model. If something important is missing, like smoking status in a claims database or blood pressure in a questionnaire-based study, the propensity score becomes a flawed summary. A propensity score built without key confounders functions like a measurement taken with a broken instrument. In theory, conditioning on the propensity score should make exposed and unexposed subjects exchangeable, but only when there is no unmeasured confounding.
The second is the overlap (or “positivity”) assumption: every subject must have a nonzero probability of receiving either treatment. If some people were guaranteed to get the treatment, or guaranteed not to, there’s no valid comparison group for them. You can check this visually by plotting the distribution of propensity scores in both groups. Where the distributions don’t overlap, matching can’t work.
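A minimal numeric version of that visual check: find the region where the two score distributions overlap, and flag treated subjects who fall outside it. The scores below are invented for illustration:

```python
def common_support(treated_scores, control_scores):
    """Region where the two propensity score distributions overlap;
    subjects outside it have no plausible comparator in the other group."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    return low, high

treated = [0.45, 0.60, 0.72, 0.88]
control = [0.10, 0.25, 0.40, 0.55, 0.70]

low, high = common_support(treated, control)
outside = [s for s in treated if not (low <= s <= high)]
print((low, high))  # the common support region
print(outside)      # treated subjects with no comparable control
```

Here the highest-scoring treated subjects sit above every control, so matching cannot produce a valid comparison for them.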
When PSM Is the Wrong Choice
PSM is not appropriate when you suspect important confounders are unmeasured and you have no way to account for them. Claims databases, for instance, often lack clinical details like lab values, lifestyle factors, or disease severity scores. In these cases, PSM can create a false sense of security: the matched groups look balanced on what you measured, but remain imbalanced on what you didn’t.
Small samples also pose problems. PSM discards unmatched subjects, which can substantially reduce your effective sample size and statistical power. If your treatment group is already small, the loss of subjects may make your analysis underpowered. Recent work has also shown that PSM can sometimes increase covariate imbalance after matching, particularly when caliper widths are poorly chosen.
PSM is also a poor fit when the treatment isn’t binary, when you’re interested in dose-response relationships, or when there’s very little overlap in propensity scores between groups. If most treated subjects have scores clustered at one end and most controls at the other, you’ll either lose most of your sample to failed matches or be forced to accept poor-quality matches.
Choosing a Matching Algorithm
Not all matching techniques perform equally. A comparison of 12 matching algorithms found that caliper matching (where a match is only accepted if the propensity scores fall within a specified distance) tended to produce less biased treatment effect estimates than either optimal or nearest neighbor matching. The tradeoff is that caliper matching may leave some treated subjects unmatched if no control falls within the caliper distance, slightly increasing variance.
Nearest neighbor matching without replacement, where each control is used only once, performed comparably to optimal matching in terms of covariate balance. Matching with replacement, where the same control can serve as a match for multiple treated subjects, did not show superior performance and generally produced estimates with greater variability and higher error. The order in which treated subjects were selected for matching had at most a modest effect on results.
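As a sketch of how greedy caliper matching works under the hood (not any particular package's implementation), here is 1:1 nearest-neighbor matching without replacement, accepting a match only inside the caliper. IDs and scores are made up:

```python
def caliper_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement:
    pair each treated score with the closest unused control, but
    only if the distance is within the caliper."""
    available = dict(controls)  # id -> score, consumed as matches are made
    pairs, unmatched = [], []
    for t_id, t_score in treated:
        if not available:
            unmatched.append(t_id)
            continue
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement: control used once
        else:
            unmatched.append(t_id)  # no control within the caliper
    return pairs, unmatched

treated = [("t1", 0.31), ("t2", 0.62), ("t3", 0.90)]
controls = [("c1", 0.30), ("c2", 0.58), ("c3", 0.33)]
pairs, unmatched = caliper_match(treated, controls, caliper=0.05)
print(pairs, unmatched)  # t3 has no control within 0.05 and is dropped
```

A commonly cited rule of thumb sets the caliper at 0.2 times the standard deviation of the logit of the propensity score, though the right width depends on your data.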
For the ratio of controls to treated subjects, research across 96 different scenarios found that 1:1 matching minimized bias in about 68% of cases. Matching up to two controls per treated subject minimized overall error in roughly 84% of scenarios, offering improved precision without a large increase in bias. Going beyond 2:1 matching rarely helps enough to justify the added complexity.
What PSM Estimates
The most common estimand from PSM is the average treatment effect on the treated (ATT): the average difference in outcomes for people who received the treatment, compared to what would have happened if they hadn’t. This is subtly different from the average treatment effect (ATE), which applies to the entire population regardless of who actually received treatment.
The distinction matters. ATT answers “how much did this treatment help the people who got it?” while ATE answers “how much would this treatment help if applied to everyone?” PSM naturally produces the ATT because it finds, for each treated person, a comparable untreated person. The estimate is then calculated as the average difference in outcomes across all matched pairs.
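Once pairs are formed, the ATT computation itself is just an average of within-pair differences. A sketch with invented outcome values:

```python
def att_estimate(matched_pairs):
    """ATT: mean outcome difference (treated minus matched control)
    across all matched pairs."""
    diffs = [y_treated - y_control for y_treated, y_control in matched_pairs]
    return sum(diffs) / len(diffs)

# Each pair: (outcome of a treated subject, outcome of its matched control)
pairs = [(7.0, 5.0), (6.0, 6.5), (9.0, 7.5)]
print(att_estimate(pairs))  # → 1.0
```

Because every pair contains one treated subject, the estimate applies to the treated population, not to everyone.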
Checking Whether the Matching Worked
After matching, you need to verify that the groups are actually balanced. The standard tool is the standardized mean difference (SMD) for each covariate. The most widely used threshold is 0.1: if any covariate has an SMD of 0.1 or greater after matching, balance is considered inadequate. Some researchers accept a more lenient threshold of 0.25, but 0.1 is the most commonly cited benchmark. The goal isn’t to detect tiny, statistically significant imbalances, but to identify imbalances large enough to reflect meaningful confounding.
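A minimal sketch of the SMD check for a single covariate, using the common pooled-standard-deviation formulation (the age values are illustrative):

```python
import math

def standardized_mean_difference(treated_values, control_values):
    """SMD: absolute difference in means divided by the pooled standard
    deviation (simple average of the two group variances)."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((var(treated_values) + var(control_values)) / 2)
    return abs(mean(treated_values) - mean(control_values)) / pooled_sd

treated_age = [54, 61, 58, 49, 63]
control_age = [52, 60, 57, 50, 62]
smd = standardized_mean_difference(treated_age, control_age)
print(round(smd, 3), "balanced" if smd < 0.1 else "imbalanced")
```

Unlike a t-test, the SMD does not shrink with sample size, which is why it is preferred for assessing balance.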
If balance isn’t achieved, you may need to re-specify your propensity score model by adding interaction terms, nonlinear terms, or additional covariates. Simply reporting that you used PSM without demonstrating balance is a red flag in any published analysis.
Testing Robustness Against Hidden Bias
Because the no-unmeasured-confounders assumption can never be fully verified, sensitivity analyses are essential. Rosenbaum’s sensitivity analysis is the most common approach. It asks: how strong would an unmeasured confounder need to be to change your conclusions? The analysis works by systematically increasing a parameter that represents the degree of hidden bias and checking whether your finding remains statistically significant at each level.
If your results hold up even when allowing for substantial hidden bias, the finding is considered robust. If a small amount of unmeasured confounding could flip the result, you should be cautious about drawing causal conclusions. Other approaches, like propensity score calibration, attempt to directly adjust for unmeasured confounders using validation data from a subset of participants where additional variables were collected. However, no existing method can fully address the joint effect of several unobserved confounders simultaneously.
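For matched pairs with a binary "treated did better" indicator, a simplified version of Rosenbaum's bounds can be sketched with the sign test over discordant pairs: under hidden bias of magnitude gamma, the worst-case probability that a pair favors treatment is gamma / (1 + gamma), and the worst-case p-value follows from the binomial tail. The pair counts below are invented:

```python
from math import comb

def rosenbaum_upper_pvalue(n_discordant, n_treated_better, gamma):
    """Worst-case one-sided p-value for the sign test over discordant
    matched pairs, allowing hidden bias of magnitude gamma.
    gamma = 1 recovers the ordinary sign test (no hidden bias)."""
    p_plus = gamma / (1.0 + gamma)  # max probability a pair favors treatment
    return sum(comb(n_discordant, k) * p_plus**k * (1 - p_plus)**(n_discordant - k)
               for k in range(n_treated_better, n_discordant + 1))

# 20 discordant pairs; the treated subject did better in 16 of them
for gamma in (1.0, 1.5, 2.0):
    print(gamma, round(rosenbaum_upper_pvalue(20, 16, gamma), 4))
```

In this example the worst-case p-value crosses 0.05 already around gamma = 1.5, so a moderately strong unmeasured confounder could overturn the finding.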
PSM vs. Other Propensity Score Methods
Matching is just one way to use propensity scores. Alternatives include stratification (grouping subjects into blocks by propensity score), inverse probability weighting (reweighting the sample so treated and untreated groups become comparable), and covariate adjustment (including the propensity score as a variable in a regression model). Each approach has strengths: weighting retains the full sample, stratification is simpler to implement, and covariate adjustment integrates easily into standard models.
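To see how weighting differs from matching, here is a minimal sketch of inverse probability weights for the ATE (the scores are illustrative): subjects who were unlikely to end up in their observed group get larger weights, and no one is discarded.

```python
def ipw_weights(treatments, propensity_scores):
    """Inverse probability weights: treated subjects get 1/e, untreated
    get 1/(1 - e), reweighting each group toward the full sample (ATE)."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treatments, propensity_scores)]

treatments = [1, 1, 0, 0]
scores = [0.8, 0.5, 0.5, 0.2]
print(ipw_weights(treatments, scores))
```

Note that scores near 0 or 1 produce extreme weights, which is the weighting analogue of the overlap problem: in practice, weights are often trimmed or stabilized.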
PSM’s main advantage is transparency. You can inspect each matched pair, visualize the quality of matches, and clearly communicate what population your results apply to. Its main disadvantage is the loss of unmatched subjects, which reduces sample size and can shift the target population in ways that aren’t always obvious. When your control pool is small or overlap is limited, weighting or stratification may preserve more of your data while still addressing confounding.

