A chi-square test tells you whether there’s a meaningful relationship between two categorical variables, or whether the pattern you see in your data is just random chance. It works by comparing what you actually observed to what you’d expect if nothing interesting were going on. The bigger the gap between observed and expected, the more likely something real is driving the difference.
How the Test Actually Works
Chi-square is built on a simple idea: if two variables have no connection to each other, the data should fall into categories in predictable proportions. The test calculates what those “expected” counts would look like under the assumption of no relationship, then measures how far your real data deviates from that expectation.
Say you’re looking at whether men and women prefer different phone brands. If gender and brand preference were completely unrelated, you’d expect roughly the same proportion of men and women choosing each brand. The chi-square statistic captures the total mismatch between those expected proportions and what you actually counted. A small statistic means your data looks close to what randomness would produce. A large one means something is probably going on.
The test then converts that statistic into a p-value, which is the probability of seeing a gap this large (or larger) if there really were no relationship. A p-value below 0.05 is the conventional threshold for saying the relationship is statistically significant. Crucially, the p-value is not the probability that your result is wrong. It’s the probability of getting data this extreme if the “no relationship” assumption were true.
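The mechanics above can be sketched in a few lines of Python. The phone-brand counts below are invented purely for illustration; the statistic is computed by hand from the observed and expected tables, and scipy is used only to turn it into a p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are gender, columns are phone brands A, B, C.
observed = np.array([
    [45, 30, 25],   # men
    [35, 40, 25],   # women
])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# The chi-square statistic sums (observed - expected)^2 / expected over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()

# Convert to a p-value using the chi-square distribution with
# (rows - 1) * (cols - 1) degrees of freedom.
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2, df)

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
```

With these made-up counts the statistic is small and the p-value is well above 0.05, which is exactly the "data looks close to what randomness would produce" case.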
Two Types of Chi-Square Tests
The term “chi-square” actually covers two distinct tests, and they answer slightly different questions.
The test of independence is the more common one. It takes a two-way table (rows and columns representing two different categorical variables) and asks: are these variables related? For example, is treatment group related to recovery outcome? Is education level related to voting preference? You’re comparing two or more groups against two or more categories.
The goodness-of-fit test compares a single variable’s distribution to some theoretical or expected pattern. You might ask whether the distribution of blood types in your sample matches the known distribution in the general population, or whether a die lands on each face equally often. Here you’re not comparing groups to each other. You’re comparing one group to a benchmark.
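In practice, both tests are one call each in scipy. The counts below are invented to show the calling pattern: `scipy.stats.chi2_contingency` for independence, `scipy.stats.chisquare` for goodness of fit.

```python
import numpy as np
from scipy import stats

# Test of independence: hypothetical 2x2 table of treatment group vs outcome.
table = np.array([[30, 10],    # treatment: recovered, not recovered
                  [20, 20]])   # control:   recovered, not recovered
chi2_ind, p_ind, df, expected = stats.chi2_contingency(table)
print(f"independence: chi2={chi2_ind:.2f}, df={df}, p={p_ind:.4f}")

# Goodness of fit: did a die land on each face equally often over 120 rolls?
rolls = [18, 22, 16, 25, 20, 19]          # observed counts for faces 1-6
chi2_gof, p_gof = stats.chisquare(rolls)  # expected defaults to equal counts
print(f"goodness of fit: chi2={chi2_gof:.2f}, p={p_gof:.4f}")
```

Note that `chi2_contingency` takes a whole table and derives the expected counts itself, while `chisquare` compares one list of observed counts against a benchmark (uniform unless you pass `f_exp`).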
What a Significant Result Does and Doesn’t Mean
When a chi-square test comes back significant, it tells you the two variables are not independent of each other. That’s it. It does not tell you how strong the relationship is, and it does not tell you which specific categories are driving the result. A massive sample can produce a statistically significant chi-square for a relationship so tiny it has no practical importance.
This is why effect size matters. For a simple 2×2 table (two groups, two outcomes), the phi coefficient gives you a single number representing the strength of association. For larger tables, Cramer’s V does the same job. Both range from 0 to 1, where 0 means no association and 1 means a perfect relationship. A Cramer’s V of 0.1 is generally considered small, 0.3 is medium, and 0.5 or above is large. In one textbook example with 200 observations, a chi-square of 32 produced a Cramer’s V of 0.4, interpreted as a medium-to-large effect. Without this step, you’re only getting half the picture.
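The effect-size step is easy to make routine. The helper below is a sketch (the function name is mine); only the formula V = sqrt(chi2 / (n × (min(rows, cols) − 1))) comes from the definition of Cramer's V, and the final lines plug in the textbook numbers quoted above.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramer's V for a contingency table; reduces to phi for a 2x2 table."""
    table = np.asarray(table)
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# The textbook numbers, applied directly to the formula:
# chi-square of 32, n = 200, 2x2 table (so min_dim = 1).
v = np.sqrt(32 / (200 * 1))
print(f"Cramer's V = {v:.1f}")  # 0.4, a medium-to-large effect
```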
Chi-square also can’t tell you about the direction of a relationship or causation. It flags that two things co-occur in a non-random way, but not why.
Degrees of Freedom Shape the Result
Every chi-square test has a number called degrees of freedom that affects how you interpret the statistic. For a goodness-of-fit test, degrees of freedom equal the number of categories minus one. If you’re checking whether a die is fair across six faces, you have 5 degrees of freedom.
For a test of independence using a contingency table, degrees of freedom equal (number of rows minus 1) multiplied by (number of columns minus 1). A 3×2 table has (3-1) × (2-1) = 2 degrees of freedom. This number matters because the same chi-square statistic means different things at different degrees of freedom. A chi-square of 6 with 1 degree of freedom is significant at the 0.05 level, but with 10 degrees of freedom it isn't even close. Statistical software handles this automatically, but understanding it helps you make sense of the output.
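You can see this effect directly by feeding the same statistic into the chi-square distribution's survival function at different degrees of freedom:

```python
from scipy import stats

# The same chi-square statistic evaluated at different degrees of freedom.
for df in (1, 2, 5, 10):
    p = stats.chi2.sf(6, df)       # survival function = upper-tail p-value
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"chi2=6, df={df:2d}: p = {p:.3f} ({verdict})")
```

The identical statistic of 6 drifts from a clearly significant result at 1 degree of freedom to an unremarkable one at 10.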
When Chi-Square Results Are Unreliable
Chi-square tests have a few conditions that need to hold for the results to be trustworthy. The most important one involves expected cell counts. Each cell in your table needs a minimum expected frequency, typically at least 5. When expected counts drop below that threshold, the math behind the chi-square approximation starts to break down, and you can get misleading p-values.
For 2×2 tables with small expected counts, you have two options. One is applying a continuity correction (sometimes called Yates’ correction), which slightly adjusts the chi-square statistic downward. This is most useful when your p-value lands near the 0.05 threshold and you want to be cautious. The other, more reliable option for very small samples is switching to a different test altogether, called Fisher’s exact test, which calculates an exact probability rather than relying on an approximation.
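Both options are available in scipy. The table below is invented so that two of its expected counts fall below 5; note that `chi2_contingency` applies Yates' correction to 2×2 tables by default.

```python
import numpy as np
from scipy import stats

# Invented small 2x2 table; two of its expected counts fall below 5.
table = np.array([[1, 8],
                  [9, 2]])

# Option 1: chi-square with Yates' continuity correction
# (scipy applies the correction to 2x2 tables by default).
chi2, p_corrected, df, expected = stats.chi2_contingency(table, correction=True)
print("expected counts:\n", expected)          # includes cells below 5
print(f"corrected chi-square: p = {p_corrected:.4f}")

# Option 2: Fisher's exact test, which needs no approximation at all.
odds_ratio, p_exact = stats.fisher_exact(table)
print(f"Fisher's exact test:  p = {p_exact:.4f}")
```

Printing the expected counts first is a good habit: it is the quickest way to notice that the approximation's conditions are violated before trusting the p-value.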
The other key requirement is that observations must be independent. Each data point should represent a separate individual or event. If the same person appears in your table twice, or if your observations are paired in some way, a standard chi-square test will give you unreliable results.
Practical Examples
Chi-square tests show up constantly in research and everyday analysis because categorical data is everywhere. A company testing whether customer satisfaction ratings (satisfied, neutral, dissatisfied) differ between two product versions would use a chi-square test of independence. A public health researcher checking whether vaccination rates differ across age groups would do the same.
On the goodness-of-fit side, a geneticist might test whether offspring traits match the proportions predicted by a genetic model. A retail analyst might check whether foot traffic is evenly distributed across days of the week, or whether certain days draw disproportionately more visitors.
In each case, the test answers the same core question: does the pattern in the data reflect a real difference, or could it easily have happened by chance? The overall test delivers only a single verdict, but a significant result invites a useful follow-up: examine the individual cells to see where observed counts most dramatically exceeded or fell short of expected counts. It is this cell-by-cell inspection, not the omnibus statistic itself, that pinpoints which categories are responsible for the result, and it gives you much more to act on than a bare yes-or-no verdict.
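A common way to do that follow-up is with standardized Pearson residuals, (observed − expected) / sqrt(expected); cells with residuals beyond roughly ±2 are the ones pulling the overall statistic up. A sketch with invented satisfaction counts:

```python
import numpy as np
from scipy import stats

# Invented counts: two product versions (rows) x three satisfaction levels (cols).
observed = np.array([[50, 30, 20],
                     [30, 40, 30]])
chi2, p, df, expected = stats.chi2_contingency(observed)

# Pearson residuals: each cell's signed contribution to the mismatch.
# Positive = more counts than independence predicts; negative = fewer.
residuals = (observed - expected) / np.sqrt(expected)
print("residuals:\n", residuals.round(2))
print(f"overall: chi2 = {chi2:.2f}, p = {p:.4f}")
```

Here the largest residuals sit in the first column, which would tell you the "satisfied" category is where the two versions diverge most.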
Chi-Square vs. Other Tests
Chi-square is specifically designed for categorical data, meaning variables that fall into groups or labels rather than numerical measurements. If your data is continuous (heights, weights, test scores), you’d use a t-test or ANOVA instead. Chi-square fills the gap for situations where your variables are things like “yes/no,” “brand A/brand B/brand C,” or “mild/moderate/severe.”
It’s also a non-parametric test, meaning it doesn’t assume your data follows a normal distribution or any other specific shape. This makes it flexible and widely applicable, but it comes with a tradeoff: it’s less powerful than parametric alternatives when those alternatives’ assumptions are met. For ordered categories (like pain levels from 1 to 5), specialized tests that account for that ordering can detect patterns that chi-square might miss.

