Chi-Square Test: When to Use It and When Not To

You use a chi-square test when you’re working with categorical data and want to know whether the patterns you see are statistically meaningful or just due to chance. Categorical data means your variables fall into groups or categories (like yes/no, red/blue/green, or smoker/non-smoker) rather than being measured on a continuous scale like weight or temperature. There are two main versions of the test, each designed for a different question.

The Two Types of Chi-Square Tests

The goodness-of-fit test asks whether the distribution of a single categorical variable matches what you’d expect. You have one variable, one population, and a theoretical distribution to compare against. For example, if a company sells five flavors of ice cream and wants to know whether each flavor sells equally well, a goodness-of-fit test compares the observed sales to a perfectly even split. The null hypothesis is straightforward: the data fits the expected distribution. If the test rejects that hypothesis, at least one category is significantly over- or under-represented.
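The ice cream example can be sketched in Python with SciPy; the sales figures below are hypothetical:

```python
from scipy.stats import chisquare

# Hypothetical weekly sales for the five flavors (1,000 cones total).
observed = [180, 250, 120, 225, 225]
# Under the null hypothesis of equal popularity, each flavor sells 200.
expected = [200] * 5

stat, p = chisquare(f_obs=observed, f_exp=expected)
# A tiny p-value here means at least one flavor deviates from an even split.
```

If you omit `f_exp`, `chisquare` assumes a uniform distribution by default, which is exactly the even-split hypothesis here.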

The test of independence asks whether two categorical variables are related to each other. This is the more common version in practice. You organize your data into a contingency table (rows for one variable, columns for the other) and test whether knowing someone’s category on one variable tells you anything about their category on the other. A hospital might record whether patients received Treatment A or Treatment B, then track whether they recovered or didn’t. The chi-square test of independence determines whether the treatment and the outcome are linked, or whether the differences in recovery rates could be explained by random variation alone.
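The hospital example might look like this with SciPy's `chi2_contingency`; the counts are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Rows: Treatment A, Treatment B. Columns: recovered, did not recover.
# These counts are hypothetical.
table = [[60, 40],
         [45, 55]]

chi2, p, dof, expected = chi2_contingency(table)
# For a 2x2 table, chi2_contingency applies Yates' continuity
# correction by default (correction=True).
```

If `p` falls below your significance level, treatment and outcome are associated; the `expected` array holds the frequencies you would see under independence.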

When Chi-Square Is the Right Choice

Choose a chi-square test when all of the following are true:

  • Your variables are categorical. Both the independent and dependent variables are groups, not numbers on a scale. Gender, disease status, survey responses like “agree/disagree/neutral,” and color preferences all qualify. If your outcome variable is continuous (blood pressure, income, test scores), you need a different test entirely, like a t-test or ANOVA.
  • Your observations are independent. Each person or data point contributes to only one cell in the table. You can’t use chi-square on repeated measures, where the same subjects are tested at multiple time points. You also can’t use it when your groups are naturally paired, like comparing a parent with their child.
  • Your sample was randomly selected. The data should come from a random sampling process, not a hand-picked group.
  • Your expected frequencies are large enough. The test relies on an approximation that breaks down with small samples. The standard rule: no more than 20% of cells should have expected frequencies below 5, and no cell should have an expected frequency below 1. If your data doesn’t meet this threshold, you need Fisher’s exact test instead.

That last point trips people up because it refers to expected frequencies, not observed ones. Expected frequencies are what you’d predict in each cell if there were no relationship between your variables. Most statistical software calculates these automatically, so you can check the assumption after setting up your table.
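You can also check the assumption yourself; a sketch using SciPy's `expected_freq` helper on a hypothetical table:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x3 contingency table of observed counts.
observed = np.array([[12,  5,  9],
                     [18,  7, 11]])

# Expected counts under independence:
# (row total * column total) / grand total for each cell.
expected = expected_freq(observed)

# Rule of thumb: at most 20% of cells below 5, and no cell below 1.
ok = np.mean(expected < 5) <= 0.20 and expected.min() >= 1
```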

Real-World Examples

Chi-square tests show up constantly in research across fields. In medical studies, researchers use them to compare rates of side effects between treatment groups, or to test whether a risk factor (like smoking status) is associated with a disease outcome. In marketing, you might test whether customer preferences for a product differ across age groups. In education, a researcher could examine whether graduation rates depend on the type of financial aid students received.

The common thread is always the same: you’re counting how many observations fall into each combination of categories, then asking whether those counts deviate from what chance alone would produce.

When Chi-Square Won’t Work

Small samples are the most common disqualifier. When your expected cell counts drop below the thresholds, the chi-square approximation becomes unreliable and can produce misleading p-values. Fisher’s exact test is the standard alternative for small samples, particularly with 2×2 tables. It calculates exact probabilities rather than relying on approximation.
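A minimal sketch with SciPy, assuming a small hypothetical 2×2 table that fails the expected-frequency rule:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with counts too small for the
# chi-square approximation to be trustworthy.
table = [[3, 7],
         [8, 2]]

# Fisher's exact test computes the exact p-value from the hypergeometric
# distribution instead of relying on a large-sample approximation.
odds_ratio, p = fisher_exact(table, alternative="two-sided")
```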

Paired or dependent data is another deal-breaker. If you measure the same people before and after an intervention, you need McNemar’s test for 2×2 tables or a related method for larger tables. Chi-square assumes every observation is independent, and violating this assumption inflates your chance of finding a relationship that isn’t real.
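The exact form of McNemar's test is simple enough to sketch directly: only the discordant pairs matter, and under the null hypothesis they split 50/50. The counts below are hypothetical:

```python
from scipy.stats import binom

# Paired before/after data: b pairs changed in one direction,
# c pairs changed in the other (the discordant pairs; hypothetical counts).
b, c = 15, 5

# Exact McNemar test: under H0 the smaller discordant count
# follows Binomial(b + c, 0.5); double the tail for a two-sided p.
p = min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))
```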

Chi-square also can’t tell you about the direction or magnitude of a relationship on its own. A significant result means the variables are associated, but it doesn’t tell you how strongly. For that, you need an effect size measure.

Measuring How Strong the Association Is

A significant chi-square result tells you a relationship exists, but not whether it’s meaningful in practical terms. Effect size fills that gap. For 2×2 tables, the phi coefficient is standard. For larger tables, Cramér’s V is the go-to measure. Both range from 0 (no association) to 1 (perfect association), with these commonly cited benchmarks:

  • Below 0.05: No meaningful association
  • 0.05 to 0.10: Weak
  • 0.10 to 0.15: Moderate
  • 0.15 to 0.25: Strong
  • Above 0.25: Very strong

Reporting effect size alongside your chi-square statistic is considered best practice. A result can be statistically significant with a very large sample but have a tiny effect size, meaning the association is real but practically irrelevant.
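Cramér's V falls out directly from the chi-square statistic; a sketch, using hypothetical counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# For a 2x2 table this formula reduces to the phi coefficient.
v = cramers_v([[60, 40], [45, 55]])  # hypothetical counts
```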

What to Do After a Significant Result

A significant chi-square test on a table with more than four cells (anything larger than a basic 2×2) tells you that the variables are related somewhere in the table, but not where. You know the overall pattern isn’t random, but you don’t yet know which specific combinations of categories are driving the result.

The standard approach is to examine adjusted standardized residuals for each cell. These residuals approximately follow a standard normal distribution, which means any cell with an absolute residual greater than 1.96 is significantly different from what you’d expect at the 0.05 level. If you’re testing multiple cells, you should apply a correction for multiple comparisons. The Bonferroni method is common: divide your significance level (typically 0.05) by the number of cells you’re testing and use the correspondingly stricter cutoff.

For example, in a study examining whether hair color and eye color are associated, a significant chi-square test might prompt you to look at all 16 cells in a 4×4 table. With a Bonferroni correction, the critical residual value rises from 1.96 to about 2.95. Cornell University’s statistical consulting unit demonstrated this approach and found that people with blue eyes and blond hair appeared significantly more often than expected, while people with brown eyes and blond hair appeared significantly less often.
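Adjusted standardized residuals aren’t a one-liner in SciPy, but they follow directly from the formula; a sketch (the table is hypothetical, not the Cornell data):

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def adjusted_residuals(observed):
    """Adjusted standardized residual for each cell:
    (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion))."""
    observed = np.asarray(observed, dtype=float)
    expected = expected_freq(observed)
    n = observed.sum()
    row = observed.sum(axis=1, keepdims=True) / n
    col = observed.sum(axis=0, keepdims=True) / n
    return (observed - expected) / np.sqrt(expected * (1 - row) * (1 - col))

# Cells with |residual| > 1.96 deviate significantly at the 0.05 level
# (use a Bonferroni-adjusted cutoff when scanning many cells).
res = adjusted_residuals([[60, 40], [45, 55]])  # hypothetical counts
```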

How to Report Chi-Square Results

If you’re writing up results for a paper or report, the standard format includes the degrees of freedom, sample size, chi-square value, and exact p-value. In APA style, it looks like this: χ²(1, N = 90) = 0.89, p = .35. Report the exact p-value unless it’s less than .001, in which case you write p < .001. Round the chi-square value to two decimal places, and include an effect size measure like Cramér's V alongside it.
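Assembling that APA string can be sketched in Python; the table and its resulting numbers below are hypothetical, so they won’t match the example figures above:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table with N = 90 observations.
table = [[30, 15],
         [25, 20]]
chi2, p, dof, _ = chi2_contingency(table)
n = sum(sum(row) for row in table)

# APA style: exact p to two decimals without a leading zero,
# or "p < .001" for very small values.
p_text = "p < .001" if p < 0.001 else f"p = {p:.2f}".replace("0.", ".", 1)
report = f"χ²({dof}, N = {n}) = {chi2:.2f}, {p_text}"
```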