A goodness of fit test measures whether the data you collected matches a pattern or distribution you expected. It does this by comparing your observed counts (what actually happened) to expected counts (what should happen if your assumption is correct), then calculating how large the gap is between the two. If the gap is small, your data fits the expected pattern. If the gap is large, something other than your assumed model is likely driving the results.
How the Test Works
The core logic is straightforward. You start with a hypothesis about how your data should be distributed. Maybe you expect equal proportions across categories, or a specific ratio like 3:1, or a known statistical distribution like a bell curve. You then collect real data and count how many observations fall into each category. Those are your observed counts.
Next, you calculate expected counts: the numbers you’d see in each category if your hypothesis were perfectly true. The test statistic is built by taking each category, subtracting the expected count from the observed count, squaring that difference, dividing by the expected count, and then adding all of those values together. A small total means your data closely matches expectations. A large total means there’s a meaningful discrepancy.
The result is compared against a threshold based on your number of categories and your chosen significance level. If the test statistic exceeds that threshold, you reject the hypothesis that your data follows the expected pattern. If it falls below, you don’t have enough evidence to say the pattern is wrong.
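The whole procedure fits in a few lines of Python. The counts below are made up purely for illustration: four categories give 3 degrees of freedom, and the 0.05 critical value for a chi-square distribution with 3 degrees of freedom is about 7.81.

```python
# Illustrative counts for a four-category example (made-up data).
observed = [48, 35, 15, 2]            # what actually happened
expected = [45.0, 30.0, 20.0, 5.0]    # what the hypothesis predicts

# For each category: (observed - expected)^2 / expected, then sum.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 4 categories -> 3 degrees of freedom; the 0.05 critical value for a
# chi-square distribution with 3 degrees of freedom is about 7.81.
CRITICAL_VALUE = 7.81
print(round(chi_square, 3), chi_square > CRITICAL_VALUE)  # 4.083 False
```

Because 4.083 stays below the threshold, this made-up data would be judged consistent with the expected pattern.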
The Null Hypothesis in Plain Terms
Every goodness of fit test is framed as a question: “Could this data have come from the distribution I assumed?” The null hypothesis says yes, the data follows the assumed distribution. The alternative says no, it doesn’t. You’re not proving the distribution is correct. You’re testing whether your data is inconsistent enough to rule it out. A high p-value means your data is compatible with the expected pattern. A low p-value (typically below 0.05) means the mismatch is too large to chalk up to random chance.
The Chi-Square Test: The Most Common Version
When people say “goodness of fit test,” they usually mean the chi-square version. It’s designed for categorical data, where observations fall into distinct groups or bins. You might use it to test whether a six-sided die lands on each face equally often, whether customer preferences split evenly across five products, or whether genetic traits appear in the ratios predicted by inheritance models.
The test requires a few conditions to work properly. Each observation must be independent of the others. Your sample needs to be large enough that every category has a reasonable expected count, generally at least 5. When expected counts are too small, the math behind the test breaks down and the results become unreliable. If some categories have very few expected observations, you can combine adjacent categories to meet this threshold.
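The merging step can be sketched quickly, using hypothetical expected counts for six ordered bins:

```python
# Hypothetical expected and observed counts for six ordered bins.
expected = [18.0, 12.0, 7.0, 4.0, 2.5, 1.5]
observed = [20, 10, 8, 3, 3, 1]

# The last three bins have expected counts below 5, so merge them
# into a single bin before running the test.
merged_expected = expected[:3] + [sum(expected[3:])]
merged_observed = observed[:3] + [sum(observed[3:])]
print(merged_expected)  # [18.0, 12.0, 7.0, 8.0] -- every bin now >= 5
```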
Degrees of freedom for the test equal the number of categories minus one, minus any parameters you estimated from the data. So if you have 4 categories and didn’t estimate any parameters, you’d use 3 degrees of freedom. This number determines which reference distribution you compare your test statistic against.
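In practice a library routine handles the threshold comparison for you. Here is a sketch using SciPy’s `chisquare`; the roll counts are hypothetical:

```python
from scipy import stats

# Hypothetical counts for 120 rolls of a six-sided die.
observed = [22, 17, 20, 26, 22, 13]

# A fair die predicts 120 / 6 = 20 rolls per face. Degrees of freedom
# are 6 categories - 1 = 5, since no parameters were estimated.
stat, p = stats.chisquare(observed, f_exp=[20] * 6)
print(round(stat, 2))  # 5.1 -- p is well above 0.05, so no evidence of bias
```

If you had estimated a parameter from the data, `chisquare` accepts a `ddof` argument that reduces the degrees of freedom accordingly.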
A Classic Example: Genetics
One of the most well-known applications comes from biology. Gregor Mendel predicted that crossing two organisms that each carry both dominant and recessive versions of two traits (a dihybrid cross) would produce offspring in a 9:3:3:1 ratio across the four possible trait combinations. To verify this, researchers count the actual offspring in each category and compare those observed numbers to the counts predicted by the 9:3:3:1 model.
When researchers tested Mendel’s original pea data with a chi-square goodness of fit test, the p-value came back at 0.92, meaning the observed data was extremely consistent with the predicted ratio. A separate analysis of dominant versus recessive traits across several experiments tested for the expected 3:1 ratio and produced a p-value of 0.74. In both cases, the data fit the theoretical model so closely that there was no reason to doubt Mendel’s laws. The same approach is used today in genetics labs, with students testing ratios in organisms like maize using expected probabilities of 9/16, 3/16, 3/16, and 1/16.
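This check is easy to reproduce. The sketch below uses Mendel’s classic dihybrid seed counts (315 round-yellow, 101 wrinkled-yellow, 108 round-green, 32 wrinkled-green) and SciPy:

```python
from scipy import stats

# Mendel's dihybrid pea counts: round-yellow, wrinkled-yellow,
# round-green, wrinkled-green (556 seeds in total).
observed = [315, 101, 108, 32]
total = sum(observed)

# Expected counts under the predicted 9:3:3:1 ratio.
expected = [total * r for r in (9 / 16, 3 / 16, 3 / 16, 1 / 16)]

stat, p = stats.chisquare(observed, f_exp=expected)
print(round(stat, 2))  # 0.47 -- the p-value lands near 0.92, an excellent fit
```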
Beyond Chi-Square: Other Goodness of Fit Tests
The chi-square test works well for data that falls into categories, but it has limitations with continuous data. If you have measurements on a continuous scale (heights, temperatures, response times) and want to know whether they follow a specific distribution like a normal curve, you’d have to split the data into intervals first. That process loses information, because the test no longer evaluates the full shape of the distribution.
The Kolmogorov-Smirnov test avoids this problem. It compares the entire cumulative distribution of your data against the expected distribution, point by point, and finds the largest gap between the two. Because it uses every data point without binning, it’s more sensitive for continuous data. The tradeoff is that it’s less intuitive to interpret and has its own set of assumptions about sample size and whether you estimated parameters from the data.
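A sketch of the test with SciPy’s `kstest`, using hypothetical response times checked against an assumed normal distribution with mean 1.0 and standard deviation 0.2:

```python
from scipy import stats

# Hypothetical response times in seconds.
data = [0.81, 0.92, 0.95, 1.01, 1.04, 1.08, 1.13, 1.21]

# The KS statistic is the largest gap between the empirical cumulative
# distribution of the data and the CDF of the assumed N(1.0, 0.2).
stat, p = stats.kstest(data, "norm", args=(1.0, 0.2))
print(round(stat, 2))  # 0.22 -- a large p-value, so no evidence against normality
```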
Goodness of Fit in Regression Models
The concept of “goodness of fit” also appears in regression analysis, where it takes on a slightly different meaning. Here, the question is how well a model’s predictions match the actual data points. R-squared is the most familiar measure: it ranges from 0 to 1 and is often described as the percentage of variation in the outcome that the model explains. An R-squared of 0.65 would seem to mean the model captures 65% of what’s going on.
In practice, R-squared is more slippery than it looks. A perfectly correct model can produce a low R-squared if the data has a lot of natural variability. A completely wrong model can produce an R-squared close to 1 if the range of the input variable happens to be wide. R-squared also changes based solely on the spread of your input data, even when the underlying relationship stays exactly the same. It says nothing about whether your predictions are accurate in absolute terms. For these reasons, statisticians at the University of Virginia have argued bluntly that R-squared does not actually measure goodness of fit in any reliable way, despite its widespread use.
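The input-spread effect is easy to demonstrate. In the sketch below (all numbers made up), the model is exactly right in both cases and the noise is identical; only the spread of x changes, yet R-squared swings from about 0.97 to about 0.32:

```python
def r_squared(y, y_pred):
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

noise = [1.5, -0.9, 1.2, -1.8, 0.6, -0.6]    # identical noise in both cases
wide_x = [0, 2, 4, 6, 8, 10]
narrow_x = [4.0, 4.4, 4.8, 5.2, 5.6, 6.0]

results = []
for xs in (wide_x, narrow_x):
    y = [2 * x + e for x, e in zip(xs, noise)]  # true relationship: y = 2x
    y_pred = [2 * x for x in xs]                # model predicts exactly 2x
    results.append(round(r_squared(y, y_pred), 3))

print(results)  # [0.966, 0.321]
```

The residual error is identical in both runs; only the denominator (the spread of the outcomes) changes, which is exactly why R-squared alone can’t certify a good fit.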
Applications in Medical Research
In epidemiology, goodness of fit tests help researchers verify that their statistical models match real-world disease patterns. For example, researchers studying kidney failure referral patterns in Wales used a Poisson model (a distribution that describes how often rare events occur in a given time period) to analyze the data. They then used a goodness of fit test to check whether the Poisson model was appropriate.
Medical data creates a particular challenge because it’s often sparse, with many categories containing very few cases. In those situations, the standard chi-square comparison can give misleading results, producing unusually low values that make a poor model look acceptable. Researchers have found that simulation-based approaches, where you generate thousands of fake datasets from the assumed model and see how your real data compares, provide a more reliable check when working with small or unevenly distributed medical datasets.
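A minimal sketch of that simulation idea, sometimes called a parametric bootstrap, with made-up sparse counts (the data, rate, and discrepancy measure here are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sparse counts, e.g. referrals recorded at 15 small clinics.
observed = np.array([0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 2, 0, 1, 0, 0])
lam = observed.mean()  # Poisson rate estimated from the data

# Chi-square-style discrepancy between counts and the fitted Poisson mean.
def discrepancy(counts):
    return np.sum((counts - lam) ** 2 / lam, axis=-1)

obs_stat = discrepancy(observed)

# Generate thousands of fake datasets from the fitted model and see how
# often they look at least as discrepant as the real data.
sims = rng.poisson(lam, size=(5000, observed.size))
p_value = float(np.mean(discrepancy(sims) >= obs_stat))
print(p_value)  # a very small value would flag the Poisson model as a poor fit
```

Because the reference distribution is built from simulations rather than the chi-square approximation, it stays honest even when many counts are near zero.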
Measuring Effect Size After the Test
A goodness of fit test tells you whether the difference between observed and expected counts is statistically significant, but it doesn’t tell you how large that difference is in practical terms. For that, you need an effect size measure. Cramér’s V is one common option. It rescales the chi-square statistic to a value between 0 and 1, giving you a standardized sense of how strong the discrepancy is.
The general interpretation: a value of 0.2 or below indicates a weak association, meaning the observed data deviates only slightly from expectations. Values between 0.2 and 0.6 suggest a moderate effect. Anything above 0.6 points to a strong departure from the expected pattern. This matters because with a very large sample, even a tiny, practically meaningless difference can be statistically significant. Effect size helps you decide whether the difference actually matters for your purposes.
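For a one-way goodness of fit test, a common form of this rescaling is V = sqrt(chi-square / (n × (k − 1))), where n is the sample size and k the number of categories. A sketch with hypothetical counts:

```python
import math

# Hypothetical counts across four categories, with equal expected counts.
observed = [30, 20, 28, 22]
n = sum(observed)                     # 100 observations
expected = [n / 4] * 4

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cramér's V for a one-way test: sqrt(chi2 / (n * (k - 1))).
k = len(observed)
v = math.sqrt(chi_square / (n * (k - 1)))
print(round(v, 3))  # 0.095 -- a weak effect on the scale described above
```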