Choosing the right hypothesis test comes down to three things: what type of data you have, how many groups you’re comparing, and whether your data meets certain assumptions about its distribution. Once you can answer those three questions, the correct test usually becomes obvious. The tricky part is knowing which questions to ask, so here’s a practical walkthrough.
The Three Questions That Decide Everything
Before you even think about specific test names, work through these criteria in order:
- What type of data is your outcome variable? Is it continuous (like blood pressure, weight, or test scores), categorical (like yes/no, or mild/moderate/severe), or counts? This single question eliminates most of the wrong options immediately.
- How many groups are you comparing? Are you looking at one group against a known value, two groups against each other, or three or more groups?
- Are your groups independent or paired? Independent means different people in each group. Paired (also called dependent) means the same people measured twice, like before and after a treatment.
Every statistical test is essentially designed for one specific combination of those three answers. The rest of this article maps out which combination points to which test.
Parametric vs. Nonparametric Tests
If your outcome is continuous, you face one more fork in the road: does your data follow a roughly normal (bell-shaped) distribution? Parametric tests like the t-test and ANOVA assume it does. They also assume the spread of data is similar across groups and that the data is measured on a true numeric scale. When one or more of those assumptions fail, you use a nonparametric alternative instead, which makes fewer assumptions about the shape of your data.
How do you know if your data is normal? For sample sizes above 30 or 40, normality matters less, because parametric tests are robust to mild deviations at that size. With hundreds of observations, you can generally stop worrying about the question altogether. For smaller samples, you can run a formal normality test such as the Shapiro-Wilk test. These tests produce a p-value: if it’s above 0.05, your data is consistent with a normal distribution; if it’s below 0.05, it’s not. The catch is that normality tests have low power in small samples, meaning they’ll often say “normal” even when the data isn’t. So with small datasets, it’s worth also looking at a histogram or a Q-Q plot to visually check for obvious skew.
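As a rough sketch of that workflow, here is what a small-sample normality check looks like with SciPy's Shapiro-Wilk implementation (the sample values are simulated, standing in for something like blood pressure readings):

```python
# Small-sample normality check via the Shapiro-Wilk test (scipy.stats.shapiro).
import random
from scipy import stats

random.seed(42)
# Simulated sample of 25 readings drawn from a normal distribution
# (mean 120, sd 15) - illustrative data, not a real study.
sample = [random.gauss(120, 15) for _ in range(25)]

stat, p = stats.shapiro(sample)
# p above 0.05: data is consistent with normality (but remember the
# low-power caveat - also look at a histogram or Q-Q plot).
print(f"Shapiro-Wilk statistic = {stat:.3f}, p = {p:.3f}")
print("consistent with normal" if p > 0.05 else "evidence of non-normality")
```

Since the sample really was drawn from a normal distribution, the test should usually (though not always) report a p-value above 0.05 here.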
Comparing Two Groups
This is the most common scenario, and the t-test family handles it when your data is continuous and approximately normal.
An independent samples t-test compares the means of two separate groups. Think of a study where one group of patients receives a drug and a different group receives a placebo. You need one continuous outcome variable and one categorical grouping variable with exactly two categories.
A paired t-test compares two measurements from the same group. A typical example: measuring patients’ pain scores before and after surgery. Because the same individuals appear in both measurements, the data points are linked, and the test accounts for that.
If your data isn’t normally distributed or is ordinal (ranked categories like “low, medium, high” rather than true numbers), switch to the nonparametric equivalents. The Mann-Whitney U test replaces the independent samples t-test, and the Wilcoxon signed-rank test replaces the paired t-test. They compare the ranks of values rather than the means, so they don’t need normality.
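To make the pairing concrete, here is a sketch of all four two-group tests in SciPy, using invented numbers for a drug/placebo comparison and a before/after measurement (the values are illustrative, not real data):

```python
# The four two-group tests side by side, using scipy.stats.
from scipy import stats

drug    = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 6.2, 5.3]  # independent group 1
placebo = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4, 4.0, 4.6]  # independent group 2

before = [7.0, 6.5, 8.1, 7.4, 6.9, 7.8]  # same patients measured twice
after  = [5.2, 5.0, 6.3, 5.8, 5.1, 6.0]

# Parametric versions (assume roughly normal data):
t_ind = stats.ttest_ind(drug, placebo)   # independent samples t-test
t_rel = stats.ttest_rel(before, after)   # paired t-test

# Nonparametric, rank-based replacements (no normality assumption):
mwu = stats.mannwhitneyu(drug, placebo)  # replaces independent t-test
wsr = stats.wilcoxon(before, after)      # replaces paired t-test

for name, res in [("independent t-test", t_ind), ("paired t-test", t_rel),
                  ("Mann-Whitney U", mwu), ("Wilcoxon signed-rank", wsr)]:
    print(f"{name}: p = {res.pvalue:.4f}")
```

With groups this clearly separated, all four tests agree that the difference is significant; on messier real data, the parametric and rank-based versions can disagree, which is exactly when the normality question matters.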
Comparing Three or More Groups
When you have three or more independent groups, ANOVA (analysis of variance) replaces the t-test. A one-way ANOVA uses one continuous outcome and one categorical variable with at least three categories. If you’re testing the effect of two categorical variables simultaneously (say, drug type and dosage level), that’s a two-way ANOVA.
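A one-way ANOVA is a one-liner in SciPy; the sketch below compares three dosage groups with made-up outcome values:

```python
# One-way ANOVA across three independent groups (scipy.stats.f_oneway).
from scipy import stats

# Illustrative outcome values for three dosage levels:
low    = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]
medium = [13.4, 13.9, 13.1, 13.7, 14.0, 13.5]
high   = [15.2, 15.8, 15.0, 15.5, 16.1, 15.3]

f_stat, p = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.2f}, p = {p:.4g}")
# A small p says only that *some* group differs from the others;
# identifying which pairs differ is the job of a post-hoc test.
```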
For paired or repeated data across three or more time points, repeated measures ANOVA is the right choice. Interestingly, when you only have two time points, repeated measures ANOVA and the paired t-test give identical results, so either works. The repeated measures version only becomes necessary once you add a third measurement, or with more complex designs involving additional factors.
If your data violates normality assumptions, the Kruskal-Wallis test is the nonparametric alternative to one-way ANOVA, and Friedman’s test replaces repeated measures ANOVA.
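Both nonparametric alternatives are also available in SciPy. The sketch below uses skewed, invented data (a few large outliers) where rank-based tests are the safer choice:

```python
# Nonparametric tests for three or more groups (scipy.stats).
from scipy import stats

# Three independent groups with outliers - ranks are safer than means here:
a = [1.2, 1.5, 1.1, 9.8, 1.3, 1.4]
b = [2.8, 3.1, 2.5, 3.0, 12.4, 2.9]
c = [5.5, 6.1, 5.8, 6.0, 5.4, 20.2]
kw_stat, kw_p = stats.kruskal(a, b, c)  # replaces one-way ANOVA

# The same five subjects measured at three time points:
t1 = [8.1, 7.9, 8.4, 8.0, 7.7]
t2 = [6.5, 6.8, 6.2, 6.9, 6.4]
t3 = [5.0, 5.3, 4.8, 5.1, 4.9]
fr_stat, fr_p = stats.friedmanchisquare(t1, t2, t3)  # replaces RM-ANOVA

print(f"Kruskal-Wallis p = {kw_p:.4f}")
print(f"Friedman p = {fr_p:.4f}")
```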
What Happens After ANOVA
ANOVA only tells you that at least one group differs from the others. It doesn’t tell you which ones. To find that out, you run a post-hoc test. The most commonly used options are Tukey’s test (good for comparing all possible pairs of groups), Bonferroni correction (more conservative, better when you have many comparisons), and Dunnett’s test (specifically for comparing every group against a single control group). Choosing among these depends on how many comparisons you’re making and how cautious you want to be about false positives.
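For the all-pairs case, SciPy ships Tukey's HSD directly (as `stats.tukey_hsd`, available in SciPy 1.8 and later); Dunnett's and Bonferroni-corrected comparisons live elsewhere (e.g. in statsmodels). Reusing the three dosage groups from the ANOVA example:

```python
# Tukey's HSD post-hoc test after a significant ANOVA
# (requires SciPy >= 1.8 for stats.tukey_hsd).
from scipy import stats

low    = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]
medium = [13.4, 13.9, 13.1, 13.7, 14.0, 13.5]
high   = [15.2, 15.8, 15.0, 15.5, 16.1, 15.3]

res = stats.tukey_hsd(low, medium, high)
# res.pvalue is a matrix: entry [i, j] holds the adjusted p-value
# for the pairwise comparison of group i against group j.
names = ["low", "medium", "high"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: p = {res.pvalue[i, j]:.4g}")
```

Here every pairwise comparison comes out significant, so the ANOVA's "at least one group differs" resolves to "all three differ from each other."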
When Your Data Is Categorical
If your outcome variable is categorical rather than continuous, you leave the t-test and ANOVA world entirely. The chi-square test is the standard tool here. It works when both your outcome and your grouping variable are categorical, and it tests whether the distribution of categories differs between groups.
A chi-square test of independence checks whether two categorical variables are related. For example: is there a relationship between smoking status (yes/no) and lung disease (yes/no)? It works with both simple two-category variables and more complex multi-category ones. When your expected cell counts are very small (typically below 5), Fisher’s exact test is a more reliable alternative.
A chi-square goodness-of-fit test is slightly different. Instead of comparing groups to each other, it compares the distribution you observed to a distribution you expected. For instance, testing whether the proportion of patients choosing each of four treatment options matches what you’d expect if there were no preference.
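All three categorical tools are in SciPy as well. The counts below are invented to mirror the examples above (smoking vs. lung disease, and patients choosing among four treatments):

```python
# Chi-square independence, Fisher's exact, and goodness-of-fit (scipy.stats).
from scipy import stats

# 2x2 contingency table: smoking (rows) vs lung disease (columns).
table = [[30, 70],   # smokers:    30 with disease, 70 without
         [10, 90]]   # nonsmokers: 10 with disease, 90 without
chi2, p_ind, dof, expected = stats.chi2_contingency(table)

# With very small expected cell counts, prefer Fisher's exact test
# (SciPy's version handles 2x2 tables):
odds_ratio, p_fisher = stats.fisher_exact(table)

# Goodness of fit: did 200 patients split evenly across 4 options?
observed = [62, 48, 45, 45]
expected_counts = [50, 50, 50, 50]  # no-preference expectation
gof_stat, p_gof = stats.chisquare(observed, f_exp=expected_counts)

print(f"independence: p = {p_ind:.4f} (dof = {dof})")
print(f"Fisher's exact: p = {p_fisher:.4f}")
print(f"goodness of fit: p = {p_gof:.4f}")
```

In this made-up data the smoking/disease association is significant under both tests, while the treatment choices are consistent with an even split.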
Measuring Relationships, Not Differences
Not every research question is about group differences. Sometimes you want to know whether two variables move together, or whether one predicts the other. This calls for correlation or regression rather than a comparison test.
Correlation measures the strength of a linear relationship between two continuous variables. Pearson correlation is the standard version and assumes both variables are roughly normally distributed. If that assumption fails, or if your data is ordinal, Spearman correlation works on ranks instead and handles non-normal data well.
Regression goes a step further. While correlation tells you how strongly two variables are linked, regression gives you an equation that lets you predict one variable from the other. If you want to say “for every 10-year increase in age, urea levels increase by X amount,” that’s regression. Use correlation when you just want to describe an association; use regression when you want to predict or quantify the size of an effect.
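The age/urea example can be sketched directly in SciPy; the values below are invented for illustration:

```python
# Correlation vs. regression on the same data (scipy.stats).
from scipy import stats

# Illustrative age (years) and blood urea values:
age  = [25, 32, 41, 48, 55, 63, 70, 78]
urea = [3.1, 3.5, 4.0, 4.6, 5.0, 5.8, 6.1, 6.9]

r, p_pearson = stats.pearsonr(age, urea)       # assumes rough normality
rho, p_spearman = stats.spearmanr(age, urea)   # rank-based alternative

# Regression gives an equation, not just a strength-of-association number:
fit = stats.linregress(age, urea)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
print(f"urea = {fit.intercept:.2f} + {fit.slope:.3f} * age")
print(f"per 10-year increase in age: +{10 * fit.slope:.2f} urea units")
```

Correlation answers "how tightly linked?"; the regression slope answers "by how much?", which is exactly the 10-year-increase statement from the paragraph above.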
A Quick-Reference Decision Path
Working through these choices in sequence gets you to the right test efficiently:
- Continuous outcome, 2 independent groups, normal data: Independent samples t-test
- Continuous outcome, 2 paired groups, normal data: Paired t-test
- Continuous outcome, 3+ independent groups, normal data: One-way ANOVA
- Continuous outcome, 3+ repeated measures, normal data: Repeated measures ANOVA
- Continuous outcome, 2 independent groups, non-normal data: Mann-Whitney U test
- Continuous outcome, 2 paired groups, non-normal data: Wilcoxon signed-rank test
- Continuous outcome, 3+ independent groups, non-normal data: Kruskal-Wallis test
- Categorical outcome, categorical groups: Chi-square test (or Fisher’s exact for small samples)
- Two continuous variables, describing association: Pearson or Spearman correlation
- Two continuous variables, predicting one from the other: Linear regression
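The decision path above is mechanical enough to write down as a lookup table. This is a hypothetical helper, not a standard library function; it covers the comparison rows of the list (correlation and regression are a different kind of question and are left out):

```python
# The decision path as a small lookup function (illustrative helper).
# Keys: (outcome type, number of groups, design, distribution).
TEST_MAP = {
    ("continuous", "2",  "independent", "normal"):     "Independent samples t-test",
    ("continuous", "2",  "paired",      "normal"):     "Paired t-test",
    ("continuous", "3+", "independent", "normal"):     "One-way ANOVA",
    ("continuous", "3+", "paired",      "normal"):     "Repeated measures ANOVA",
    ("continuous", "2",  "independent", "non-normal"): "Mann-Whitney U test",
    ("continuous", "2",  "paired",      "non-normal"): "Wilcoxon signed-rank test",
    ("continuous", "3+", "independent", "non-normal"): "Kruskal-Wallis test",
    ("continuous", "3+", "paired",      "non-normal"): "Friedman's test",
    ("categorical", "2+", "independent", "any"):
        "Chi-square test (Fisher's exact for small expected counts)",
}

def choose_test(outcome, groups, design, distribution="any"):
    """Return the test name for a combination, or None if not covered."""
    return TEST_MAP.get((outcome, groups, design, distribution))

print(choose_test("continuous", "2", "paired", "non-normal"))
# -> Wilcoxon signed-rank test
```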
What the P-Value Actually Tells You
Whichever test you choose, you’ll get a p-value, and it’s worth understanding what it does and doesn’t mean. A p-value measures how incompatible your data is with the assumption that there’s no real effect. A small p-value (traditionally below 0.05) suggests the data would be unlikely if nothing were going on. But it does not tell you the probability that your hypothesis is true, and it does not tell you how large or important the effect is.
The American Statistical Association has made this point explicitly: a p-value alone is not sufficient evidence for a scientific claim. A result with p = 0.03 does not mean there’s a 3% chance the effect isn’t real. It means that if the effect truly didn’t exist, you’d see data this extreme about 3% of the time. The distinction matters because a statistically significant result can still reflect a tiny, practically meaningless difference, especially with large sample sizes. Reporting effect sizes and confidence intervals alongside your p-value gives a much more complete picture than the p-value alone.
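That "if the effect truly didn't exist" interpretation can be checked by simulation. The sketch below runs many experiments where the null hypothesis is true by construction, two groups drawn from the same distribution, and counts how often a t-test still reports p < 0.05; the answer should hover around 5%:

```python
# Simulating the meaning of p < 0.05 under a true null hypothesis.
import random
from scipy import stats

random.seed(0)
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the SAME distribution: no real effect exists.
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(f"Fraction of null experiments with p < 0.05: {rate:.3f}")
```

The roughly 5% rate is the false-positive rate the 0.05 threshold buys you, which is precisely why a single significant p-value is not, on its own, proof of an effect.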

