Choosing the right statistical test comes down to three things: what type of data you have (continuous, categorical, or counts), how many groups you’re comparing, and whether your data points are independent or paired. Once you know those three details, the correct test is almost always obvious. The chart below walks through every common scenario.
The Three Questions That Determine Your Test
Before looking at any chart, answer these questions about your data:
- What type is your outcome variable? Continuous (like blood pressure or weight), categorical (like yes/no or treatment group), or a count (like number of infections per year).
- How many groups are you comparing? Two groups, three or more, or are you looking at a relationship between two variables rather than comparing groups at all?
- Are your observations independent or paired? Independent means different people in each group. Paired means the same people measured twice, like before and after a treatment.
There’s one more factor that matters for continuous data: whether it follows a roughly normal (bell-shaped) distribution. If it does, you use a parametric test, which is more powerful. If it doesn’t, you use the non-parametric equivalent, which makes fewer assumptions but trades away some statistical power.
Comparing Groups With Continuous Data
This is the most common scenario in research, and where most confusion happens. The logic is straightforward once you see the pattern.
Two independent groups, normal distribution: Independent t-test. This is your go-to for comparing means between two separate groups, like a drug group versus a placebo group. It assumes your data is normally distributed and that the spread (variance) in both groups is roughly equal.
Two independent groups, not normal: Mann-Whitney U test (also called the Wilcoxon rank-sum test). This compares the rank order of values rather than means, so it doesn’t care about the shape of your distribution.
Two paired groups, normal distribution: Paired t-test (also called dependent t-test). Use this when the same subjects are measured at two time points, or when subjects are matched in pairs.
Two paired groups, not normal: Wilcoxon signed-rank test. The non-parametric version of the paired t-test.
Three or more independent groups, normal distribution: One-way ANOVA. Compares means across multiple groups simultaneously.
Three or more independent groups, not normal: Kruskal-Wallis test. The non-parametric alternative to one-way ANOVA.
Three or more paired/repeated groups, normal distribution: Repeated-measures ANOVA. For when the same subjects are measured at three or more time points.
Three or more paired/repeated groups, not normal: Friedman’s test. The non-parametric version of repeated-measures ANOVA.
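The decision path above can be sketched with SciPy. The data, group sizes, and the 0.05 threshold below are illustrative assumptions, not prescriptions:

```python
# A minimal sketch of the two-group decision path using SciPy.
# Sample data and the 0.05 cutoff are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
drug = rng.normal(120, 10, size=40)      # e.g. blood pressure, drug group
placebo = rng.normal(126, 10, size=40)   # placebo group

# Step 1: check normality in each group (Shapiro-Wilk).
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (drug, placebo))

# Step 2: pick the matching two-sample test.
if normal:
    result = stats.ttest_ind(drug, placebo)     # independent t-test
else:
    result = stats.mannwhitneyu(drug, placebo)  # Mann-Whitney U

print(f"p = {result.pvalue:.4f}")

# Paired equivalents, for the same subjects measured twice:
#   stats.ttest_rel(before, after)   # paired t-test
#   stats.wilcoxon(before, after)    # Wilcoxon signed-rank
# Three or more groups:
#   stats.f_oneway(g1, g2, g3)       # one-way ANOVA
#   stats.kruskal(g1, g2, g3)        # Kruskal-Wallis
```

The same if/else skeleton extends to the paired and multi-group rows of the table; only the function names change.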
Comparing Groups With Categorical Data
When your outcome is a category rather than a number (survived vs. died, improved vs. no change vs. worsened), you leave the t-test family entirely and move to chi-square territory.
The chi-square test is the default for comparing proportions across groups. It works well with larger samples but relies on an approximation that breaks down with small numbers. Specifically, the chi-square test is unreliable when more than 20% of cells in your table have an expected count below 5, or any cell has an expected count below 1.
When your sample is too small for chi-square, switch to Fisher’s exact test. Despite the name suggesting it’s only for small samples, Fisher’s exact test is technically valid for all sample sizes. It calculates exact probabilities rather than relying on approximation. In practice, most software defaults to it automatically when cell counts are low.
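The expected-count rule of thumb can be checked directly, since SciPy’s chi-square function returns the expected table. The counts below are made up for illustration:

```python
# A sketch of the chi-square vs. Fisher decision on a 2x2 table.
# The cell counts are invented for illustration.
import numpy as np
from scipy import stats

table = np.array([[12, 5],   # treated: improved / not improved
                  [4, 14]])  # control: improved / not improved

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Rule of thumb: if more than 20% of expected counts are below 5,
# or any is below 1, fall back to Fisher's exact test.
small = (expected < 5).mean() > 0.20 or (expected < 1).any()
if small:
    odds_ratio, p = stats.fisher_exact(table)
else:
    p = p_chi2

print(f"p = {p:.4f}  (Fisher used: {small})")
```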
Measuring Relationships Between Two Variables
Sometimes you’re not comparing groups at all. Instead, you want to know whether two variables move together.
Pearson correlation measures the linear relationship between two continuous variables. It’s the standard choice when both variables are roughly normally distributed. A common misconception is that Pearson fails outright on non-normal data; it can still detect relationships there, but it is limited to linear patterns. If the true relationship is curved, Pearson may miss it entirely.
Spearman correlation measures any monotonic relationship (consistently increasing or decreasing, even if not in a straight line) between two variables. Use it when your data isn’t normally distributed, when you have ordinal data (like rankings or Likert scales), or when you suspect a non-linear but consistent trend.
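The difference shows up clearly on a monotonic but curved relationship. The synthetic data below is an illustrative assumption:

```python
# Pearson vs. Spearman on a monotonic but non-linear relationship
# (synthetic data, chosen to make the contrast obvious).
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)
y = np.exp(x / 2)  # strictly increasing, but strongly curved

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

# Spearman works on ranks, so a perfectly monotonic trend gives
# rho = 1.0; Pearson stays below 1 because the trend is not a line.
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```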
Choosing a Regression Model
Regression is for predicting an outcome from one or more variables. The type of outcome variable dictates the model.
If your outcome is continuous (weight, blood pressure, test scores), use linear regression. It assumes a straight-line relationship with the predictors and normally distributed residuals (the prediction errors), not necessarily a normally distributed outcome.
If your outcome is binary (yes/no, alive/dead, success/failure), use logistic regression. It estimates the probability of the outcome rather than predicting a number.
If your outcome is a count of relatively rare events (number of hospital readmissions, number of falls per month), use Poisson regression. It’s designed for non-negative integers where most values cluster near zero.
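A sketch of matching the model to the outcome type. Only the continuous case is run here, using SciPy; the binary and count cases are shown as statsmodels calls for reference (variable names in those formulas are illustrative):

```python
# Linear regression for a continuous outcome (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)  # true slope 2, intercept 1

fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")

# Binary outcome (logistic) and count outcome (Poisson), e.g. with
# statsmodels (assumed installed; column names are hypothetical):
#   import statsmodels.formula.api as smf
#   smf.logit("died ~ age + dose", data=df).fit()
#   smf.poisson("readmissions ~ age + dose", data=df).fit()
```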
How to Check for Normality
Since the parametric vs. non-parametric decision hinges on whether your data is normally distributed, you need a way to check. Two approaches work best when used together: visual inspection (histograms and Q-Q plots) and a formal statistical test.
The Shapiro-Wilk test is the most widely recommended normality test. It has more statistical power than the older Kolmogorov-Smirnov test, meaning it’s better at detecting real departures from normality. Most statistical software includes it by default.
One important nuance: normality tests behave differently depending on sample size. With small samples (under 30), these tests often fail to detect non-normality even when it exists, because they simply lack power. With very large samples (200+), they’ll flag trivially small deviations that wouldn’t actually affect your results. As a practical rule, once your sample exceeds about 30 to 40 observations, parametric tests are robust enough to handle mild departures from normality. With several hundred observations, the distribution of your data matters very little for most parametric procedures.
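In code, the check is one line per sample. The synthetic data below is an illustrative assumption; in practice, pair the p-value with a histogram or Q-Q plot as described above:

```python
# Shapiro-Wilk normality check with SciPy (synthetic samples).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(50, 5, size=100)
skewed_sample = rng.exponential(5, size=100)  # clearly non-normal

p_normal = stats.shapiro(normal_sample).pvalue
p_skewed = stats.shapiro(skewed_sample).pvalue

# Interpretation: p < 0.05 suggests a departure from normality,
# subject to the sample-size caveats discussed above.
print(f"normal data: p = {p_normal:.3f}")
print(f"skewed data: p = {p_skewed:.2e}")
```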
What to Do After a Significant ANOVA
A significant ANOVA result tells you that at least one group differs from the others, but not which ones. You need a post-hoc test to identify the specific group differences, and the choice depends on what comparisons you’re making.
Tukey’s HSD is the most common choice when you want to compare every group to every other group (all pairwise comparisons). It controls the overall error rate simply and effectively. If your groups have unequal sizes, the Tukey-Kramer modification handles that.
Bonferroni correction works best when you have a small number of pre-planned comparisons rather than testing every possible pair. It divides your significance threshold (typically 0.05) by the number of comparisons. The downside is that it becomes overly conservative as the number of comparisons grows, making it harder to detect real differences.
ScheffĂ©’s test is the most flexible option, covering complex comparisons beyond simple pairs (like comparing one group’s mean to the combined average of two other groups). But that flexibility comes at a cost: it’s the most conservative of the three and has less power to detect differences. If you only care about pairwise comparisons, ScheffĂ©’s is generally not recommended.
Repeated Measures and Missing Data
Repeated-measures ANOVA works well when every subject has data at every time point, with measurements taken at equal intervals. In practice, that’s often not the case. Subjects drop out, miss visits, or are measured on slightly different schedules.
Repeated-measures ANOVA handles this poorly. It requires balanced data, meaning any subject with even one missing measurement gets excluded entirely. This shrinks your effective sample size and reduces your ability to detect real effects.
Mixed-effects models (sometimes called multilevel or hierarchical models) solve this problem. They can include subjects with incomplete data, handle unequal timing between measurements, and accommodate more complex study designs. If your data has any missingness or irregular measurement intervals, a mixed-effects model is the better choice.
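A minimal mixed-effects sketch with statsmodels (assumed installed). The study design, column names, and the 15% missed-visit rate are all invented for illustration; the point is that subjects with missing time points still contribute their remaining data:

```python
# Random-intercept mixed model on longitudinal data with missed
# visits (statsmodels assumed available; data is synthetic).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for subj in range(20):
    baseline = rng.normal(50, 5)      # subject-specific intercept
    for week in (0, 4, 8, 12):
        if rng.random() < 0.15:       # ~15% missed visits, kept anyway
            continue
        rows.append({"subject": subj, "week": week,
                     "score": baseline + 0.8 * week + rng.normal(0, 2)})
df = pd.DataFrame(rows)

# Random intercept per subject; incomplete subjects are not excluded,
# unlike complete-case repeated-measures ANOVA.
model = smf.mixedlm("score ~ week", df, groups=df["subject"]).fit()
print(model.params)
```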
Time-to-Event Data
When your outcome is the time until something happens (death, relapse, recovery), standard tests don’t apply. You need survival analysis methods, which are specifically designed to handle the fact that some subjects haven’t experienced the event yet by the end of the study.
Kaplan-Meier curves estimate the probability of surviving past each time point and are primarily descriptive. They give you a visual picture of how survival differs between groups. To formally test whether two Kaplan-Meier curves differ, you use the log-rank test, which works like a chi-square test applied across the entire follow-up period.
Cox proportional hazards regression is the regression equivalent for survival data. Just as logistic regression adjusts for multiple variables when predicting a binary outcome, Cox regression adjusts for multiple variables when predicting time to an event. It estimates how strongly each variable affects the risk of the event occurring. The key assumption is that the relative risk between groups stays constant over time. If one treatment doubles the risk of relapse at month 3, it should still double the risk at month 12.
Use Kaplan-Meier and the log-rank test for simple group comparisons. Use Cox regression when you need to adjust for confounders like age, sex, or disease severity.
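The Kaplan-Meier estimator itself is simple enough to write from scratch, which makes the handling of censoring explicit. The survival times below are invented for illustration:

```python
# A from-scratch Kaplan-Meier sketch (no survival library assumed).
# times: follow-up in months; event=1 means the event occurred,
# event=0 means the subject was censored (event-free at last contact).
import numpy as np

def kaplan_meier(times, events):
    """Return (event_times, survival_probabilities)."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)

    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                   # n_i: still followed
        deaths = np.sum((times == t) & (events == 1))  # d_i: events at t
        surv *= 1.0 - deaths / at_risk                 # S(t) = prod(1 - d_i/n_i)
        out_t.append(t)
        out_s.append(surv)
    return np.array(out_t), np.array(out_s)

times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 0, 1]  # 0 = censored
t, s = kaplan_meier(times, events)
print(dict(zip(t, np.round(s, 3))))
```

In real analyses a survival library (lifelines is a common Python choice) handles this, along with the log-rank test and Cox regression; the hand-rolled version above is just to show where censored subjects enter: they stay in the risk set until their last observed time, but never count as events.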