The right statistical test for comparing two groups depends on three things: what type of data you’re measuring, whether your two groups are independent or related, and whether your data follows a normal distribution. Once you answer those three questions, the choice narrows to one or two options.
This can feel overwhelming if you’re staring at a dataset for the first time, but the decision tree is actually straightforward. Here’s how to work through it step by step.
The Three Questions That Decide Your Test
Before choosing any test, you need to identify three characteristics of your data:
- What type is your outcome variable? Is it continuous (like weight, blood pressure, or test scores), ordinal (like a satisfaction rating from 1 to 5), or categorical (like pass/fail, yes/no, or treatment A vs. treatment B)?
- Are your groups independent or paired? Independent groups have no connection between them, like comparing men to women or a treatment group to a control group made up of different people. Paired groups have a built-in link: the same person measured before and after, twins compared to each other, or subjects deliberately matched on key characteristics.
- Is your continuous data normally distributed? This matters because the most common tests (t-tests) assume your data forms a roughly bell-shaped curve. If it doesn’t, you need a different approach.
Once you have those answers, you can match them to the correct test using the sections below.
Continuous Data With a Normal Distribution
If your outcome variable is continuous and approximately normally distributed, the t-test is your go-to. Which version depends on your group structure.
The independent samples t-test (also called the two-sample t-test or Student’s t-test) compares the average values of two unrelated groups. You’d use this when comparing, say, average exam scores between students who used a study app and students who didn’t. The two groups are made up of entirely different people, and nothing links a specific person in one group to a specific person in the other.
The paired t-test compares two related measurements. The most common scenario is a before-and-after design where the same people are measured twice. It also applies when subjects are naturally linked, like comparing cholesterol levels between siblings, or when a researcher has deliberately matched participants in two groups based on age, sex, or other characteristics. The pairing matters because the measurements within each pair are correlated, and ignoring that correlation would give you the wrong answer.
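Here's how both versions look in Python with SciPy, using made-up data for the two scenarios above (the scores and measurements are simulated for illustration, not from a real study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical exam scores: app users vs. non-users (independent groups,
# different people in each group)
app_users = rng.normal(loc=78, scale=8, size=30)
non_users = rng.normal(loc=72, scale=8, size=30)
t_ind, p_ind = stats.ttest_ind(app_users, non_users)

# Hypothetical before/after blood pressure on the same 25 people (paired)
before = rng.normal(loc=140, scale=12, size=25)
after = before - rng.normal(loc=5, scale=6, size=25)
t_pair, p_pair = stats.ttest_rel(before, after)

print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_pair:.2f}, p = {p_pair:.4f}")
```

Note that `ttest_rel` operates on the within-pair differences, which is exactly how it accounts for the correlation described above; feeding paired data to `ttest_ind` would throw that information away.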
Check Your Variances First
The standard independent samples t-test assumes both groups have roughly equal spread in their data (equal variances). When that assumption is violated, the test becomes unreliable, particularly when the larger group also has the larger variance. In that situation, the standard t-test is biased toward finding no difference even when one exists.
You can check this assumption using Levene’s test, which is built into most statistics software. If Levene’s test returns a p-value below 0.05, your variances are significantly different and you should use Welch’s t-test instead. Welch’s t-test adjusts for unequal variances and gives more accurate results. Many statisticians now recommend using Welch’s version by default, since it performs well even when variances happen to be equal.
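In SciPy, both steps are one-liners. The sketch below uses simulated groups with deliberately different spreads; `equal_var=False` is what turns the standard t-test into Welch's version:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)   # smaller spread
group_b = rng.normal(loc=52, scale=15, size=40)  # larger spread

# Levene's test: a low p-value suggests the variances differ
levene_stat, p_levene = stats.levene(group_a, group_b)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Levene p = {p_levene:.4f}, Welch t-test p = {p:.4f}")
```

If you follow the "Welch by default" advice, you can skip the Levene step entirely and always pass `equal_var=False`.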
Continuous or Ordinal Data Without Normal Distribution
When your data is skewed, contains clear outliers, or comes from a small sample where you can’t confirm normality, nonparametric tests are the safer choice. These tests don’t assume any particular distribution. Instead of comparing means directly, they work with the ranks of your data, which makes them more robust.
For two independent groups, use the Mann-Whitney U test (also called the Wilcoxon rank-sum test). It’s the nonparametric equivalent of the independent samples t-test and works with any data that can be ranked, including ordinal scales like pain ratings or Likert-type survey items.
For two paired or related groups, use the Wilcoxon signed-rank test. This is the nonparametric counterpart of the paired t-test. It compares two sets of related measurements without requiring normality.
These tests are also the standard recommendation for ordinal data, such as 5-point or 7-point rating scales. While some researchers treat Likert scale data as continuous and run t-tests on it, the more conservative and widely accepted approach is to use these rank-based nonparametric methods.
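Both rank-based tests are available in SciPy. This sketch uses simulated skewed data for the independent case and simulated 7-point ratings for the paired case (the numbers are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed data, e.g. hospital length of stay in days
group_a = rng.exponential(scale=4, size=35)
group_b = rng.exponential(scale=6, size=35)
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Hypothetical paired 1-7 ratings before and after an intervention
before = rng.integers(1, 8, size=30).astype(float)
after = np.clip(before + rng.integers(-1, 3, size=30), 1, 7).astype(float)
w_stat, p_w = stats.wilcoxon(before, after)

print(f"Mann-Whitney U: p = {p_mw:.4f}")
print(f"Wilcoxon signed-rank: p = {p_w:.4f}")
```

One detail worth knowing: by default, `wilcoxon` discards pairs whose difference is exactly zero, so tied before/after ratings don't contribute to the test.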
Categorical Data: Counts and Proportions
When your outcome is categorical rather than numerical, you’re comparing proportions instead of averages. Did more people in Group A recover than in Group B? Did a higher percentage of treated patients experience side effects?
For two independent groups, the chi-square test is the standard choice. It compares observed counts against what you’d expect if there were no difference between groups. There’s one important limitation: the chi-square test relies on an approximation that breaks down with small samples. Specifically, if more than 20% of the cells in your table have expected counts below 5, or any cell has an expected count below 1, the approximation isn’t reliable.
When your sample is too small for chi-square, use Fisher’s exact test. It calculates an exact probability rather than relying on approximation, so it works regardless of sample size. Most software will flag when chi-square assumptions aren’t met and offer Fisher’s exact test automatically.
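The expected-count check is easy to automate, since SciPy's `chi2_contingency` returns the expected counts it computed. This sketch, on a hypothetical 2x2 recovery table, applies the rule of thumb from above and falls back to Fisher's exact test when it fails:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: recovered vs. not recovered, by treatment group
table = np.array([[30, 10],   # Group A: 30 recovered, 10 did not
                  [18, 22]])  # Group B: 18 recovered, 22 did not

chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Rule of thumb: >20% of cells with expected count < 5, or any cell < 1,
# means the chi-square approximation isn't reliable
if (expected < 5).mean() > 0.20 or (expected < 1).any():
    odds_ratio, p = stats.fisher_exact(table)
    test_used = "Fisher's exact"
else:
    p = p_chi
    test_used = "chi-square"

print(f"{test_used}: p = {p:.4f}")
```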
For two paired groups with categorical outcomes, use McNemar’s test. This applies when the same subjects are classified into categories at two different time points, or when matched pairs are being compared on a yes/no outcome.
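McNemar's test only uses the discordant pairs, i.e. the subjects whose classification changed between the two time points. Its exact form is equivalent to a two-sided binomial test on those pairs, which makes it easy to sketch with SciPy alone (the counts below are hypothetical; `statsmodels` also provides a ready-made `mcnemar` function):

```python
from scipy.stats import binomtest

# Hypothetical paired yes/no data: symptom present before vs. after treatment.
# Concordant pairs (no change) carry no information; only discordant pairs count.
yes_then_no = 15  # symptom at time 1, gone at time 2
no_then_yes = 5   # no symptom at time 1, present at time 2

# Exact McNemar test = two-sided binomial test on the discordant pairs,
# asking whether changes in each direction are equally likely (p = 0.5)
n_discordant = yes_then_no + no_then_yes
result = binomtest(min(yes_then_no, no_then_yes), n=n_discordant, p=0.5)
print(f"exact McNemar p = {result.pvalue:.4f}")
```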
How to Check for Normality
Since the choice between parametric tests (t-tests) and nonparametric tests (Mann-Whitney, Wilcoxon) hinges on whether your data is normally distributed, you need a way to check. The Shapiro-Wilk test is the most commonly recommended method. It produces a p-value: if that value is above 0.05, your data is consistent with a normal distribution (the test can't prove normality, only fail to detect a departure from it). If it's below 0.05, your data deviates significantly from normal and you should consider nonparametric alternatives.
Don’t rely solely on the formal test, though. With very large samples (200 or more), the Shapiro-Wilk test can flag trivial departures from normality that won’t actually affect your t-test results. With very small samples, it may lack the power to detect real non-normality. Looking at a histogram or a Q-Q plot alongside the formal test gives you a more complete picture. When your sample is small and you’re uncertain, nonparametric tests are the safer default since they sacrifice only a small amount of statistical power compared to t-tests.
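In practice the check is a single SciPy call, shown here on simulated skewed data; `scipy.stats.probplot` generates the Q-Q plot coordinates mentioned above if you want the visual check as well:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical right-skewed measurements (log-normal, so not normal)
sample = rng.lognormal(mean=3, sigma=0.5, size=40)

shapiro_stat, p = stats.shapiro(sample)
if p < 0.05:
    print(f"p = {p:.4f}: deviates from normal; consider Mann-Whitney/Wilcoxon")
else:
    print(f"p = {p:.4f}: consistent with normality; a t-test is defensible")

# For the visual check, stats.probplot(sample, dist="norm") returns the
# Q-Q plot coordinates, which you can pass to matplotlib if it's installed.
```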
Quick Reference Table
- Continuous, normal, independent groups: Independent samples t-test (or Welch’s t-test if variances differ)
- Continuous, normal, paired groups: Paired t-test
- Continuous or ordinal, non-normal, independent groups: Mann-Whitney U test
- Continuous or ordinal, non-normal, paired groups: Wilcoxon signed-rank test
- Categorical, independent groups, adequate sample: Chi-square test
- Categorical, independent groups, small sample: Fisher’s exact test
- Categorical, paired groups: McNemar’s test
Sample Size and Effect Size Matter Too
Choosing the right test is only part of the equation. Your sample size determines whether the test can actually detect a real difference if one exists. A study with too few participants might miss a genuine effect simply because random variation drowns it out. On the other hand, an oversized sample can make trivially small differences appear statistically significant, even when they have no practical importance.
This is why researchers run a power analysis before collecting data. A power analysis tells you how many participants you need in each group to have a reasonable chance (typically 80%) of detecting a meaningful difference. The calculation requires you to specify the effect size you care about. For comparisons of two group means, effect size is usually expressed as Cohen’s d: 0.2 is considered small, 0.5 is medium, and 0.8 is large. Free tools like G*Power can walk you through the calculation for whichever test you’ve selected.
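The core of that calculation fits in a few lines. The sketch below uses the standard normal-approximation formula for a two-sided independent t-test; it slightly underestimates the required n (tools like G*Power use the exact noncentral t distribution and will give an answer a participant or two higher):

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided independent
    t-test, via the normal approximation: n = 2 * ((z_a + z_b) / d)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = norm.ppf(power)           # quantile for the desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

# Roughly 63 per group for a medium effect (d = 0.5) at 80% power
print(round(n_per_group(0.5)))
```

The formula makes the trade-off explicit: halving the effect size you want to detect quadruples the required sample size.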
Reporting effect size alongside your p-value is increasingly expected in published research. A p-value tells you whether a difference is likely real, but Cohen’s d (or an equivalent measure) tells you whether the difference is big enough to care about. A study could find a statistically significant difference in blood pressure between two groups that amounts to only 1 mmHg, which is meaningless in practice. The effect size catches that.
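Cohen's d itself is just the difference in means divided by the pooled standard deviation, which is simple to compute directly. The blood pressure readings below are invented to mirror the 1 mmHg example:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Hypothetical systolic blood pressure readings (mmHg): the groups differ
# by only 1 mmHg on average, so the effect size is small
treated = np.array([118., 122., 125., 119., 121., 124., 120., 123.])
control = np.array([119., 123., 126., 120., 122., 125., 121., 124.])
d = cohens_d(treated, control)
print(f"Cohen's d = {d:.2f}")
```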

