What Is Endogeneity? Definition, Causes, and Examples

Endogeneity is a problem in statistical analysis where an explanatory variable in your model is tangled up with the error term, the part of the equation that captures everything you haven’t measured. When this happens, your results become unreliable because the model can’t cleanly separate cause from effect. It’s one of the most common and consequential issues in economics, finance, and social science research.

The Core Problem

Most statistical models try to estimate how one variable affects another. A simple regression might ask: how much does an extra year of education increase someone’s earnings? The model works by isolating the effect of education while bundling everything else it can’t observe into an “error term.” For the math to produce trustworthy estimates, one critical assumption must hold: the explanatory variables (like years of education) cannot be correlated with that error term.

Endogeneity is what researchers call it when that assumption breaks down. If your explanatory variable is systematically related to the unmeasured factors hiding in the error term, your estimates become biased and inconsistent. “Biased” means your estimate is off-target on average. “Inconsistent” means that collecting more data won’t fix the problem. No matter how large your sample gets, you’ll keep converging on the wrong answer.
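To make this concrete, here is a small simulated sketch (all numbers are invented for illustration) in which the regressor is deliberately built to be correlated with the error term. The true effect is 2.0, but the ordinary regression slope settles near 2.47, and a thousand-fold larger sample doesn't help:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

true_beta = 2.0
slopes = {}
for n in (1_000, 1_000_000):
    u = rng.normal(size=n)            # unobserved error term
    x = 0.7 * u + rng.normal(size=n)  # endogenous: x is built partly from u
    y = true_beta * x + u
    slopes[n] = ols_slope(x, y)
    print(f"n={n:>9,}  estimated slope = {slopes[n]:.3f}")
```

Both sample sizes converge on roughly 2.47 rather than 2.0: the estimate is biased, and adding data only makes it more precisely wrong.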

Three Main Causes

Omitted Variable Bias

This is the most intuitive source of endogeneity. It occurs when something important is left out of the model, and that missing factor influences both the explanatory variable and the outcome. The classic example is estimating how education affects wages. People with more schooling tend to earn more, but they also tend to have higher innate ability, more motivation, or stronger family support. If you can’t measure and include those traits, they get absorbed into the error term. Since those same traits also predict how much education someone pursues, your education variable becomes correlated with the error. The result: you overestimate (or sometimes underestimate) the true effect of education on earnings.
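A quick simulation can show the mechanism (the variables and coefficients here are invented): when "ability" drives both schooling and wages but is left out of the model, the estimated return to education lands well above the true value; putting ability back in recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

ability = rng.normal(size=n)                          # unobserved trait
educ = 12 + 2 * ability + rng.normal(size=n)          # ability raises schooling
wage = 1.0 * educ + 3 * ability + rng.normal(size=n)  # true return = 1.0

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(wage, educ)[1]          # ability omitted: badly inflated
b_long = ols(wage, educ, ability)[1]  # ability controlled: near 1.0
print(f"omitting ability: {b_short:.2f}   controlling for it: {b_long:.2f}")
```

The short regression credits education with the wage boost that ability produces, roughly doubling the estimated return.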

Simultaneity

Simultaneity arises when two variables cause each other at the same time. Think about the relationship between a product’s price and its demand. Higher prices reduce demand, but higher demand can also push prices up. If you try to estimate one direction of this relationship with a simple regression, the model can’t tell which direction the influence is flowing. The explanatory variable (price) is being shaped by the outcome (demand) at the same moment, which violates the assumption that the explanatory variable is independent of the error term.
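A simulation makes the point (using hypothetical supply and demand curves): because observed prices and quantities are equilibrium outcomes shaped by shocks to both curves, regressing quantity on price recovers neither curve. Here the true demand slope is -1, yet the naive regression slope is near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Structural model (made-up parameters):
#   demand: q = 10 - 1.0 * p + u_d    (true demand slope = -1.0)
#   supply: q =  2 + 1.0 * p + u_s
u_d = rng.normal(size=n)
u_s = rng.normal(size=n)

# Observed price and quantity are the market-clearing equilibrium,
# so p is shaped by BOTH shocks -- including the demand error u_d.
p = (10 - 2 + u_d - u_s) / 2.0
q = 2 + 1.0 * p + u_s

naive_slope = np.cov(p, q)[0, 1] / np.var(p, ddof=1)
print(f"regression of q on p: {naive_slope:.2f}  (true demand slope: -1.00)")
```

The regression blends the two curves into a slope that belongs to neither of them.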

Measurement Error

When an explanatory variable is measured with error, the gap between its true and recorded values gets folded into the model's error term, which mechanically correlates the observed variable with that term. Even purely random ("classical") measurement error causes endogeneity, typically biasing the estimate toward zero, a pattern known as attenuation bias. The problem deepens when the error is systematic: if you rely on self-reported income in a survey and people who earn more underreport by larger amounts, the mismeasurement itself depends on the true value, strengthening the correlation between your income variable and the errors in your model.
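A small simulation illustrates attenuation (the setup is invented): the true effect is 2.0, but adding purely random noise to the regressor drags the estimate toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(size=n)  # true effect = 2.0
x_noisy = x_true + rng.normal(size=n)  # classical (purely random) mismeasurement

slope = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy, ddof=1)
print(f"slope using the mismeasured variable: {slope:.2f}")
```

With the noise variance equal to the signal variance, the slope is cut roughly in half, to about 1.0, no matter how large the sample grows.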

Research on pricing and consumer behavior has shown that simultaneity alone may not cause severe problems if the model includes all relevant information. But when omitted variables combine with simultaneity, the distortion gets substantially worse.

A Concrete Example

One of the most cited illustrations of endogeneity comes from a 1991 study by Joshua Angrist and Alan Krueger on the returns to schooling. They wanted to know how much an additional year of education actually increases wages. The obvious problem: you can’t observe individual ability, drive, or family background with enough precision to include them in the model. These unmeasured traits push certain people toward both more schooling and higher earnings, creating a textbook omitted variable problem.

Their solution was clever. They used a person’s quarter of birth as a stand-in variable (called an “instrument”) for years of schooling. Because of school-entry rules and compulsory attendance laws, children born later in the year start school younger and therefore must complete slightly more schooling before reaching the legal dropout age, so quarter of birth predicted education levels. But there’s no reason your birth quarter would directly affect your wages decades later. This allowed the researchers to isolate the portion of education that was driven by an essentially random factor, sidestepping the endogeneity problem.

How Researchers Detect It

The standard tool for checking whether endogeneity is present is the Durbin-Wu-Hausman test. It works by comparing two estimates of the same relationship: an ordinary regression estimate, which is only trustworthy if there is no endogeneity, and an instrumental variables estimate, which remains consistent either way. If the two estimates are statistically similar, there’s no evidence of a problem and you can proceed with the simpler method. If they diverge significantly, endogeneity is likely distorting your results.

The test’s null hypothesis is that there is no endogeneity. Rejecting that null tells you the explanatory variable is correlated with the error term, and you need to use a corrective technique. An important detail: the residuals from the first-stage regression in the test represent the “contaminated” portion of the explanatory variable, the part that’s entangled with the error term.
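One common way to run the test is the regression-based (control function) form, sketched below on simulated data: regress the suspect variable on the instrument, save the residuals, add them to the outcome regression, and test whether their coefficient is significant. A large t-statistic rejects the no-endogeneity null.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

u = rng.normal(size=n)                      # structural error
z = rng.normal(size=n)                      # instrument: unrelated to u
x = 1.0 * z + 0.8 * u + rng.normal(size=n)  # endogenous regressor
y = 2.0 * x + u

def ols_fit(y, regs):
    """OLS with classical standard errors; returns (beta, se, residuals)."""
    X = np.column_stack([np.ones(len(y))] + list(regs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid

# Stage 1: regress x on the instrument; keep the residuals v,
# the "contaminated" portion of x.
_, _, v = ols_fit(x, [z])
# Stage 2: add v to the outcome regression; a significant coefficient
# on v is evidence of endogeneity.
beta, se, _ = ols_fit(y, [x, v])
t_stat = beta[2] / se[2]
print(f"t-statistic on first-stage residuals: {t_stat:.1f}")
```

Because x was built to be correlated with u, the t-statistic comes out far beyond any conventional critical value, and the test correctly flags endogeneity.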

Instrumental Variables

The most widely used correction for endogeneity is instrumental variables (IV) estimation. The idea is to find an outside variable, the instrument, that can serve as a kind of filter, separating the clean variation in your explanatory variable from the contaminated part. A valid instrument must satisfy three conditions: it must be correlated with the explanatory variable (relevance), it can only affect the outcome through the explanatory variable (the exclusion restriction), and its relationship with the outcome must not be confounded by other factors (exchangeability).

Only the first condition, relevance, can be directly verified with data. The exclusion restriction is fundamentally unverifiable. You have to argue on logical or theoretical grounds that the instrument has no backdoor path to the outcome. This is why finding a good instrument is considered one of the hardest tasks in empirical research.
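Relevance is checked with the first-stage regression; a conventional rule of thumb asks for a first-stage F-statistic above 10. A sketch on simulated data (the setup is invented):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000

u = rng.normal(size=n)
z = rng.normal(size=n)                      # candidate instrument
x = 1.0 * z + 0.8 * u + rng.normal(size=n)  # endogenous regressor

# First stage: regress x on z. With a single instrument, the squared
# t-statistic on z is the first-stage F, a standard relevance check.
X = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(X, x, rcond=None)
resid = x - X @ beta
s2 = resid @ resid / (n - 2)
se_z = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
F = (beta[1] / se_z) ** 2
print(f"first-stage F-statistic: {F:.0f}")
```

Here the instrument is strong, so F is enormous; a weak instrument (F near or below 10) would make the IV estimate unstable even if the exclusion restriction held.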

How Two-Stage Least Squares Works

The most common IV technique is called two-stage least squares, or 2SLS. One useful way to think about it: your endogenous variable is like a partially rotten apple. Part of its variation is “good” (unrelated to the error term) and part is “bad” (correlated with the error). Ordinary regression uses the whole apple, rot and all. Two-stage least squares uses the instrument as a knife to cut away the bad part.

In the first stage, you regress the problematic explanatory variable on the instrument and any control variables. The predicted values from this regression represent only the variation in your variable that can be explained by the instrument, which is the clean, exogenous portion. In the second stage, you regress the outcome on those predicted values instead of the original variable. Because the predicted values are free of correlation with the error term, the resulting estimate is consistent.
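Here is a minimal hand-rolled sketch of the two stages on simulated data. (In practice, IV software does this in one step and, importantly, corrects the second-stage standard errors, which are wrong if computed naively this way.)

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

u = rng.normal(size=n)
z = rng.normal(size=n)                      # instrument
x = 1.0 * z + 0.8 * u + rng.normal(size=n)  # endogenous regressor
y = 2.0 * x + u                             # true effect = 2.0

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress x on z; the predicted values are the "clean" variation.
a = ols(x, z)
x_hat = a[0] + a[1] * z
# Stage 2: regress y on the predicted values instead of x itself.
beta_2sls = ols(y, x_hat)[1]
beta_ols = ols(y, x)[1]
print(f"OLS: {beta_ols:.2f}   2SLS: {beta_2sls:.2f}   (truth: 2.00)")
```

Ordinary regression uses the whole apple and overshoots; 2SLS, working only with the instrument-driven variation, lands on the true effect.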

Fixed Effects for Panel Data

When you have data that tracks the same individuals, companies, or countries over multiple time periods, fixed effects models offer another way to handle endogeneity caused by time-invariant omitted variables. The logic is straightforward: if the unmeasured factor (say, a person’s innate ability) doesn’t change over time, you can eliminate it by looking at how each individual’s outcomes change relative to their own average.

The technique works by subtracting each individual’s mean values from their observations, a process called “demeaning.” Since a constant trait like ability has the same value in every time period, it drops out entirely when you take deviations from the mean. What’s left is only the within-person variation over time, which is free of that particular source of endogeneity. An equivalent approach for two-period data simply takes the difference between periods for each individual, achieving the same result.
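The demeaning step is short enough to show directly. In this invented panel, a fixed "ability" trait drives both the regressor and the outcome; pooled regression is badly biased, while the within (demeaned) regression recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 2_000, 4                              # 2,000 individuals, 4 periods

ability = rng.normal(size=(N, 1))            # constant over time per person
x = 1.5 * ability + rng.normal(size=(N, T))  # ability drives x
y = 1.0 * x + 2.0 * ability + rng.normal(size=(N, T))  # true effect = 1.0

def slope(x, y):
    x, y = x.ravel(), y.ravel()
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_pooled = slope(x, y)                       # ignores ability: biased
# Demeaning: subtract each person's own mean; the constant ability
# term is identical in every period, so it cancels out exactly.
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
b_fe = slope(x_w, y_w)
print(f"pooled OLS: {b_pooled:.2f}   fixed effects: {b_fe:.2f}   (truth: 1.00)")
```

Everything the two variables share through the fixed trait disappears in the within transformation, leaving only the clean over-time variation.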

The limitation is clear: fixed effects only eliminate omitted variables that are constant over time. If the unobserved factor changes, such as motivation that fluctuates year to year, fixed effects won’t solve the problem.

Newer Approaches Using Machine Learning

Traditional IV methods require researchers to specify their models fairly precisely, choosing which variables to include and in what form. In complex settings with many potential confounders, this creates risk. Debiased machine learning, developed by Victor Chernozhukov and colleagues, addresses this by letting flexible algorithms handle the modeling of confounders.

One version of this approach fits machine learning models for both the outcome and the treatment variable, then combines them in a way that remains valid even if one of the two models is somewhat misspecified. This “doubly robust” property provides a safety net that traditional methods lack. For instrumental variables settings specifically, the approach fits three separate models (predicting the instrument, the treatment, and the outcome) using machine learning, then combines them to estimate the causal effect. This reduces the dependence on getting every modeling choice exactly right, which is especially valuable when dealing with high-dimensional data where dozens or hundreds of potential confounders exist.
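As an illustrative sketch (not the authors' exact procedure), the partialling-out variant of debiased ML fits in a few lines: cross-fit flexible models for the outcome and the treatment on held-out folds, then regress the outcome residuals on the treatment residuals. Here a random forest stands in for the flexible learner, and the data-generating process is invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n = 4_000

# Nonlinear confounding links the treatment d to the outcome y.
X = rng.uniform(-2, 2, size=(n, 3))  # observed confounders
d = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
y = 1.0 * d + 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)  # true effect = 1.0

# Cross-fitting: each fold's residuals come from models trained
# on the other fold, so the learners never score their own training data.
ry, rd = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    m_y = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], y[train])
    m_d = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], d[train])
    ry[test] = y[test] - m_y.predict(X[test])
    rd[test] = d[test] - m_d.predict(X[test])

theta = (rd @ ry) / (rd @ rd)        # residual-on-residual regression
naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
print(f"naive OLS: {naive:.2f}   debiased ML: {theta:.2f}   (truth: 1.00)")
```

The naive regression, which ignores the confounders entirely, lands far above the truth; the cross-fitted residual regression comes out close to it without anyone having specified the confounders' functional form by hand.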