Pearson Chi-Square: What It Is and How It Works

The Pearson chi-square test is a statistical method that measures whether the pattern you see in categorical data (counts of things in different groups) is meaningfully different from what you’d expect by chance alone. It’s one of the most widely used tests in research, applied everywhere from medical trials to marketing surveys, whenever you need to know whether two categorical variables are genuinely related or the numbers just happened to shake out that way.

What the Test Actually Does

At its core, the chi-square test compares two sets of numbers: what you observed in your data and what you’d expect to see if nothing interesting were going on. “Nothing interesting” means the variables you’re looking at are completely independent of each other.

Say you’re testing whether a new vaccine reduces infection rates. You give the vaccine to one group and a placebo to another, then count how many people in each group got infected. If the vaccine has zero effect, you’d expect roughly the same infection rate in both groups. The chi-square test quantifies how far your actual results stray from that “no effect” scenario. A large chi-square value means your observed data looks very different from what independence would predict, suggesting a real relationship between the variables.

The formula itself is straightforward: for each cell in your data table, subtract the expected count from the observed count, square that difference, then divide by the expected count. Add up all those values across every cell, and you get the chi-square statistic. In notation, that’s the sum of (O − E)² / E, where O is the observed count and E is the expected count.
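As a concrete illustration, here is that sum computed by hand for a hypothetical 2×2 vaccine trial like the one above (the counts are invented for the example):

```python
# Chi-square statistic for a hypothetical 2x2 vaccine trial.
observed = [[10, 90],   # vaccinated: infected, not infected
            [30, 70]]   # placebo:    infected, not infected

# Expected counts under independence: row total * column total / grand total.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Sum of (O - E)^2 / E over every cell.
chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
print(chi_sq)  # → 12.5
```

With equal group sizes, independence predicts 20 infections in each group here; the observed 10 vs. 30 split produces a large statistic.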

How Expected Values Are Calculated

The expected value for any cell in a table is calculated by multiplying the row total for that cell by the column total, then dividing by the grand total of all observations. This produces the count you’d see in that cell if the row and column variables had absolutely no relationship to each other.

For example, if 60% of your total sample is female and 30% of the total sample chose Option A, then under independence you’d expect 18% of the total (0.60 × 0.30) to be females who chose Option A. The expected count is simply that percentage applied to your sample size. When your observed counts deviate substantially from these expected counts across the whole table, the chi-square value grows larger.
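A minimal sketch of that calculation, using hypothetical totals matching the percentages above (200 people, 120 female, 60 choosing Option A):

```python
n = 200                 # grand total (hypothetical)
female_total = 120      # row total: 60% of the sample
option_a_total = 60     # column total: 30% of the sample

# Expected count under independence: row total * column total / grand total,
# which is the same as 0.60 * 0.30 * n = 18% of the sample.
expected_female_a = female_total * option_a_total / n
print(expected_female_a)  # → 36.0
```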

Two Main Versions of the Test

The Pearson chi-square is used in two distinct situations, and it helps to know which one applies to your question.

The test of independence asks whether two categorical variables are related. You arrange your data in a contingency table (rows for one variable, columns for the other) and test whether the distribution in one variable changes depending on the other. A classic example: does treatment group (drug vs. placebo) relate to outcome (recovered vs. not recovered)?

The goodness-of-fit test asks whether a single variable’s distribution matches some expected pattern. You might test whether the colors of cars in a parking lot match the national sales distribution, or whether dice rolls are evenly distributed across all six faces. Here, you’re comparing one row of observed counts against one row of expected counts rather than working with a full table.
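The goodness-of-fit version reduces to the same (O − E)² / E sum over a single row of counts. A sketch for the dice example, with invented roll counts:

```python
# Goodness-of-fit: are 120 die rolls evenly distributed across six faces?
observed = [25, 17, 15, 23, 24, 16]   # hypothetical roll counts
expected = sum(observed) / 6          # 20 per face under a fair die

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
print(chi_sq)  # → 5.0
```

Degrees of freedom here are the number of categories minus one (five), since the six counts must sum to the known total.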

Degrees of Freedom and the P-Value

Once you calculate the chi-square statistic, you need to know whether it’s large enough to matter. That depends on the degrees of freedom, which reflect how many cells in your table are free to vary. For a contingency table, degrees of freedom equal (number of rows minus 1) multiplied by (number of columns minus 1). A 2×2 table has 1 degree of freedom; a 3×4 table has 6.

With your chi-square value and degrees of freedom in hand, you look up the corresponding p-value. The p-value tells you the probability of seeing results at least as extreme as yours if the variables were truly independent. A small p-value (typically below 0.05) suggests the pattern in your data is unlikely to be a coincidence, and you reject the assumption that the variables are unrelated.
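In practice a library routine such as SciPy's `scipy.stats.chi2.sf` handles this lookup for any degrees of freedom. For the 1-degree-of-freedom case of a 2×2 table, the chi-square survival function also has a closed form, erfc(√(x/2)), which a short stdlib-only sketch can use (the statistic value is hypothetical):

```python
import math

# p-value for a chi-square statistic with 1 degree of freedom (2x2 table,
# so df = (2 - 1) * (2 - 1) = 1). For df = 1, the chi-square variable is a
# squared standard normal, so P(X > x) = erfc(sqrt(x / 2)).
chi_sq = 12.5  # hypothetical statistic
p_value = math.erfc(math.sqrt(chi_sq / 2))
print(round(p_value, 6))  # → 0.000407, far below 0.05: reject independence
```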

Assumptions and the Rule of 5

The chi-square test is an approximation, and it works well only when certain conditions are met. The most important rule: no more than 20% of cells in your table should have an expected frequency below 5, and no cell should have an expected frequency below 1. When expected counts are too small, the approximation breaks down and results become unreliable.

The test also requires that observations be independent of each other. Each person, item, or event should appear in only one cell of the table. If the same person is counted twice or if observations are paired (like before-and-after measurements on the same subject), the standard chi-square test doesn’t apply.

What to Use When Sample Sizes Are Small

When your sample is small and expected frequencies drop below the thresholds above, Fisher’s exact test is the standard alternative. Unlike the chi-square test, Fisher’s exact test doesn’t rely on approximation. It calculates the exact probability of observing your data under the assumption of independence, making it reliable even with very small counts.

Fisher’s exact test is valid for all sample sizes, not just small ones, but it’s computationally intensive with larger datasets, which is why chi-square remains the default for bigger samples. For 2×2 tables with a total sample size under about 40, a middle-ground option called Yates’ continuity correction adjusts the standard chi-square formula by subtracting 0.5 from the absolute difference between observed and expected values before squaring. This shrinks the chi-square statistic slightly, compensating for the fact that discrete counts don’t perfectly follow the smooth theoretical distribution the test assumes. Yates’ correction applies only to 2×2 tables (1 degree of freedom) and should not be used for larger tables.
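Libraries cover both alternatives; in SciPy, for instance, `scipy.stats.fisher_exact` computes the exact test and `scipy.stats.chi2_contingency` applies Yates' correction to 2×2 tables when its `correction` flag is set (the default). The correction itself is simple enough to sketch by hand for a hypothetical small 2×2 table:

```python
# Yates-corrected chi-square for a small 2x2 table (hypothetical counts).
observed = [[8, 12],
            [14, 6]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq_yates = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / grand
        o = observed[i][j]
        # Subtract 0.5 from |O - E| before squaring (continuity correction).
        chi_sq_yates += (abs(o - e) - 0.5) ** 2 / e
print(round(chi_sq_yates, 3))  # → 2.525 (the uncorrected statistic is larger)
```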

Measuring the Strength of a Relationship

A significant chi-square result tells you a relationship exists, but it doesn’t tell you how strong that relationship is. For that, you need an effect size measure. Two common ones are the phi coefficient (for 2×2 tables) and Cramér’s V (for larger tables). Both range from 0 (complete independence) to 1 (complete dependence).

General interpretation guidelines for both measures:

  • Below 0.10: negligible association
  • 0.10 to 0.20: weak association
  • 0.20 to 0.40: moderate association
  • 0.40 to 0.60: relatively strong association
  • 0.60 to 0.80: strong association
  • 0.80 to 1.00: very strong association
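Cramér's V is straightforward to compute once you have the chi-square statistic: V = √(χ² / (n·(k − 1))), where n is the total sample size and k is the smaller of the number of rows and columns. A sketch with hypothetical numbers:

```python
import math

# Cramer's V from a chi-square statistic (hypothetical values).
chi_sq = 12.5    # chi-square statistic
n = 200          # total number of observations
rows, cols = 2, 2

# V = sqrt(chi^2 / (n * (min(rows, cols) - 1))).
# For a 2x2 table this reduces to the (absolute) phi coefficient.
v = math.sqrt(chi_sq / (n * (min(rows, cols) - 1)))
print(v)  # → 0.25, a moderate association by the guidelines above
```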

Effect size matters because statistical significance is partly a function of sample size. With a very large sample, even a trivial difference between groups can produce a significant chi-square result. Reporting Cramér’s V or phi alongside the p-value gives a much clearer picture of whether the relationship is practically meaningful, not just statistically detectable.

What Chi-Square Cannot Tell You

The chi-square test identifies whether an association exists between categorical variables, but it has clear boundaries. It doesn’t establish causation: finding that smoking status and lung disease are associated doesn’t prove, on the basis of the test alone, that one causes the other. It also doesn’t work for continuous data (like height or weight) unless you first group those values into categories, which sacrifices information.

The test treats all categories as unordered. If your categories have a natural ranking (like “low, medium, high”), the chi-square test ignores that ordering entirely. Other tests designed for ordinal data will be more sensitive to trends across ordered groups. Despite these limitations, the Pearson chi-square remains the go-to method for comparing counts across categories, and understanding when and how to use it is one of the most practical statistical skills you can have.