Bivariate categorical data is what you get when you measure two categorical variables on the same set of subjects and look at how those variables relate to each other. “Categorical” means the values are groups or labels (like yes/no, male/female, or treatment A/B/C) rather than numbers on a scale. “Bivariate” simply means two variables at once. So if you record both smoking status and biological sex for 226 people, you have bivariate categorical data. The interesting part isn’t the two variables on their own; it’s what happens when you examine them together.
How It Differs From Univariate Analysis
Univariate analysis looks at one variable in isolation. It’s useful for spotting data errors, summarizing a single characteristic, and getting your bearings in a dataset, but it has limited explanatory power. You can report that 17 out of 226 people smoke, and that’s fine, but it tells you nothing about who is more likely to smoke.
Bivariate analysis adds a second dimension. It can explore whether an outcome variable depends on an explanatory variable (an asymmetrical analysis, implying a direction), or it can simply explore the association between two variables without assuming cause and effect (a symmetrical analysis). That shift from describing one thing to examining a relationship between two things is the core idea behind bivariate categorical data.
The Contingency Table
The standard way to organize bivariate categorical data is a two-way contingency table. Rows represent one variable, columns represent the other, and each cell contains a count of how many observations fall into that particular combination. Here’s a small example using smoking status and biological sex for 226 people:
                   Female   Male   Total
    No (nonsmoker)    120     89     209
    Yes (smoker)        7     10      17
    Total             127     99     226
The row totals (209 nonsmokers, 17 smokers) and column totals (127 females, 99 males) are called marginal frequencies. They collapse the table back down to one variable at a time. The individual cell counts (120, 7, 89, 10) are joint frequencies, showing how both variables combine.
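Using the counts above, the joint and marginal frequencies fall out of simple row and column sums. A minimal sketch in Python (assuming NumPy is available):

```python
import numpy as np

# Smoking status (rows: No, Yes) by sex (columns: Female, Male),
# using the counts from the table above.
table = np.array([[120, 89],    # nonsmokers
                  [7,   10]])   # smokers

grand_total = table.sum()        # 226 observations in all
row_totals = table.sum(axis=1)   # marginal frequencies for smoking: [209, 17]
col_totals = table.sum(axis=0)   # marginal frequencies for sex: [127, 99]

print(grand_total)   # 226
print(row_totals)    # [209  17]
print(col_totals)    # [127  99]
```

The individual entries of `table` are the joint frequencies; summing across either axis collapses the table back to one variable, which is exactly what "marginal" means.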
Joint, Marginal, and Conditional Frequencies
Once you have a contingency table, there are three ways to slice the numbers, and understanding the difference is essential.
Joint relative frequency tells you what proportion of the entire sample falls into a specific cell. If 20 out of 200 students scored between 60 and 79% and studied between 21 and 40 minutes, the joint relative frequency for that combination is 10%.
Marginal relative frequency focuses on one variable while ignoring the other. You get it by dividing each row or column total by the grand total. This gives you the overall distribution of one variable independently.
Conditional relative frequency is where things get analytically interesting. It answers questions like: “Among students who studied 41 to 60 minutes, what percentage scored above 80%?” You restrict your view to one column (or row) and calculate percentages within that subset only. If 16 out of 86 students in that study-time group scored 80 to 100%, the conditional relative frequency is about 18.6%. Standard practice is to express conditional distributions as percentages rather than raw counts, because the subgroup sizes often differ.
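The three kinds of relative frequency can be computed from the same table just by choosing what to divide by. A sketch using the smoking-by-sex counts from earlier (the student example's full table isn't given, so the smoking table stands in):

```python
import numpy as np

table = np.array([[120, 89],   # nonsmokers: Female, Male
                  [7,   10]])  # smokers:    Female, Male
n = table.sum()                # 226

# Joint: each cell divided by the grand total.
joint = table / n

# Marginal: a row or column total divided by the grand total.
marginal_smoking = table.sum(axis=1) / n   # [209/226, 17/226]

# Conditional: each column divided by its own column total,
# i.e. the distribution of smoking *within* each sex.
conditional = table / table.sum(axis=0)

print(round(joint[0, 0], 3))        # P(female and nonsmoker) = 120/226 ≈ 0.531
print(round(conditional[1, 1], 3))  # P(smoker | male) = 10/99 ≈ 0.101
```

Note that the conditional proportions within each column sum to 1, which is why percentages within subgroups are comparable even when the subgroups differ in size.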
Visualizing Two Categorical Variables
Bar charts adapt well to bivariate categorical data, but the type of bar chart matters. A segmented (stacked) bar chart divides each bar into colored segments representing the second variable’s categories, shown as percentages. This makes it easy to compare proportions across groups. The tradeoff is that you lose information about sample size: a bar representing 20 infants looks the same width as one representing 120 adults.
A mosaic plot solves that problem by adjusting the width of each bar to reflect how many observations are in that group. If adults make up 60% of the sample, the adult bar takes up 60% of the horizontal space. This means the area of each rectangle is proportional to the number of observations it represents. Mosaic plots convey everything a segmented bar chart does, plus they show you how the sample is distributed across groups.
Side-by-side bar charts are another option, placing bars for each subgroup next to each other rather than stacking them. These work best when you want to compare raw counts rather than proportions.
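The geometry behind a mosaic plot can be sketched without any plotting library: each group's bar width is its share of the sample, each segment's height is a conditional proportion within the group, and rectangle area works out to the joint relative frequency. A plain-Python sketch using the smoking table's counts:

```python
# Counts from the smoking-by-sex table above.
counts = {"Female": {"No": 120, "Yes": 7},
          "Male":   {"No": 89,  "Yes": 10}}

n = sum(sum(cells.values()) for cells in counts.values())  # 226

# Bar width = the group's share of the whole sample
# (the information a segmented bar chart throws away).
widths = {g: sum(c.values()) / n for g, c in counts.items()}

# Segment height = conditional proportion within the group.
heights = {g: {k: v / sum(c.values()) for k, v in c.items()}
           for g, c in counts.items()}

# Area check: width * height recovers the joint relative frequency.
for g in counts:
    for k in counts[g]:
        assert abs(widths[g] * heights[g][k] - counts[g][k] / n) < 1e-12

print({g: round(w, 3) for g, w in widths.items()})  # {'Female': 0.562, 'Male': 0.438}
```

The area identity is the key design property: a mosaic plot encodes joint, marginal, and conditional information simultaneously.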
Testing Whether Two Variables Are Related
The standard statistical test for bivariate categorical data is the chi-square test of independence. Its null hypothesis states that the two variables are independent, meaning knowing one variable tells you nothing about the other. A significant result (typically p less than 0.05) means you reject that hypothesis and conclude the variables are associated.
The test has several requirements. The data must come from random selection. Cell values must be raw counts, not percentages or transformed numbers. Each subject can appear in only one cell, and the two groups must be independent of each other (no paired or repeated measurements on the same people). Both variables must be categorical, though ordinal categories and collapsed interval data are acceptable. Finally, at least 80% of cells should have an expected count of 5 or more, and no cell should have an expected count below 1. When these conditions aren’t met, alternative tests like Fisher’s exact test are used instead.
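As a sketch, the test can be run on the smoking table with SciPy's `chi2_contingency` (assuming SciPy is available); the function also returns the expected counts, which is how you check the sample-size requirement above:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[120, 89],   # nonsmokers: Female, Male
                  [7,   10]])  # smokers:    Female, Male

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
# Check the expected-count condition: every cell should be >= 1,
# and at least 80% of cells should be >= 5.
print(expected.min())   # smallest expected count, about 7.45 here
```

For this table the smallest expected count is about 7.45, so the test's conditions are met; with smaller samples, Fisher's exact test (`scipy.stats.fisher_exact`) is the usual fallback.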
Measuring the Strength of an Association
A significant chi-square result tells you an association exists but says nothing about how strong it is. For that, you need an effect size measure.
For a simple 2×2 table, the phi coefficient works well. A value of 0 means complete independence, and a value of 1.0 means complete dependence. For larger tables with more rows or columns, Cramér’s V serves the same purpose and also ranges from 0 to 1. The standard interpretation scale for both measures:
- Below 0.10: negligible association
- 0.10 to 0.19: weak association
- 0.20 to 0.39: moderate association
- 0.40 to 0.59: relatively strong association
- 0.60 to 0.79: strong association
- 0.80 to 1.00: very strong association
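Cramér's V is a short formula on top of the chi-square statistic: V = sqrt(chi2 / (n × (min(rows, cols) − 1))). A sketch (assuming SciPy is available; for a 2×2 table this reduces to the absolute value of phi):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(r, c) - 1))), in [0, 1]."""
    # Effect sizes are conventionally computed without Yates' correction.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

table = np.array([[120, 89],   # the smoking-by-sex table from earlier
                  [7,   10]])
print(round(cramers_v(table), 3))
```

For the smoking table, V comes out around 0.09, which the scale above classifies as a negligible association, regardless of what the p-value says.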
Odds Ratios and Relative Risk
When your bivariate categorical data forms a 2×2 table comparing an exposure to an outcome (like treatment vs. no treatment crossed with recovered vs. not recovered), two additional measures come into play.
Relative risk compares the probability of the outcome in the exposed group to the probability in the unexposed group. It’s the more intuitive measure, but it requires knowing the total number of people exposed, which is only available in prospective (cohort) studies.
The odds ratio compares the odds of the outcome rather than the probability. It’s the go-to measure for case-control studies, where you start with people who already have the outcome and look backward, making it impossible to calculate true risk. When the outcome is rare (roughly under 10% of the sample), the odds ratio and relative risk produce similar values. As the outcome becomes more common, the two diverge, and they should not be used interchangeably.
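Both measures reduce to arithmetic on the four cells of a 2×2 table. A sketch with hypothetical cohort counts (the numbers below are invented for illustration; note the outcome here is common, so the two measures diverge as described above):

```python
# Hypothetical cohort: treatment (exposed) vs. no treatment,
# crossed with recovered vs. not recovered.
#                recovered  not recovered
exposed   = (60, 40)   # a, b
unexposed = (30, 70)   # c, d

a, b = exposed
c, d = unexposed

risk_exposed   = a / (a + b)                    # 60/100 = 0.60
risk_unexposed = c / (c + d)                    # 30/100 = 0.30
relative_risk  = risk_exposed / risk_unexposed  # 2.0

# Odds ratio: the cross-product ratio of the table.
odds_ratio = (a * d) / (b * c)                  # (60*70)/(40*30) = 3.5

print(relative_risk, odds_ratio)                # 2.0 3.5
```

Here the outcome occurs in 45% of subjects, far from rare, and the odds ratio (3.5) overstates the relative risk (2.0) considerably, which is exactly why the two shouldn't be used interchangeably for common outcomes.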
Real-World Examples
Bivariate categorical data shows up constantly in practice. Any time you’re asking whether one group differs from another on a categorical outcome, you’re working with it. A hospital comparing pneumonia rates between patients who received a vaccine and those who didn’t. A survey cross-tabulating political party affiliation with support for a policy (yes/no). A school district checking whether graduation rates (graduated/did not graduate) differ by free lunch eligibility (eligible/not eligible). In each case, you have two categorical variables measured on the same subjects, and the question is whether the variables are independent or associated.