A repeated measures ANOVA is a statistical test that compares two or more measurements taken from the same subjects under different conditions or at different time points (in practice it is most often used with three or more, since with only two it reduces to a paired t-test). Unlike a standard ANOVA, which compares separate groups of people, this version tracks the same individuals across multiple observations. That distinction makes it one of the most common tools in research designs where participants are measured more than once, such as before, during, and after a treatment.
How It Differs From a Standard ANOVA
In a regular (between-subjects) ANOVA, you recruit different people for each group and compare the group averages. In a repeated measures (within-subjects) ANOVA, every person appears in every condition. If you’re testing whether a drug lowers blood pressure, a between-subjects design would measure one group on the drug and a separate group on a placebo. A repeated measures design would measure the same people at baseline, then again at one month, then again at three months.
This matters because people naturally differ from one another. One person might have higher blood pressure simply because of genetics, not because of anything the treatment did. When you use different people in each group, those pre-existing differences get mixed into your results. A repeated measures ANOVA mathematically removes that individual variation from the error term, leaving you with a cleaner estimate of whether the treatment itself had an effect. In practice, this means repeated measures designs typically need fewer participants to detect the same size effect.
What the F-Ratio Actually Tells You
The test produces an F-ratio, which is simply a fraction:
F = variation between conditions / error variance
The numerator captures how much the group averages differ across your conditions (say, baseline vs. one month vs. three months). The denominator captures the leftover variation that can’t be explained by either the treatment or by individual differences between people. A larger F value means the differences between conditions are large relative to random noise, which points toward a real effect.
The key advantage over a standard ANOVA sits in how the denominator is built. Total variation within each condition comes from two sources: systematic differences between individuals and random error. Because the same people appear in every condition, the repeated measures version can isolate the individual differences and subtract them out. The denominator then reflects only random error, so any genuine treatment effect in the numerator stands out more clearly, giving the test greater sensitivity.
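This partition can be made concrete with a short numpy sketch. The data below are hypothetical (five subjects, three conditions); the code splits the total sum of squares into condition, subject, and error pieces and builds the F-ratio from the first and last:

```python
import numpy as np

# Hypothetical scores: 5 subjects (rows) measured under 3 conditions (columns).
scores = np.array([
    [8.0, 6.0, 5.0],
    [7.0, 6.5, 5.5],
    [9.0, 7.0, 6.0],
    [6.0, 5.0, 4.5],
    [8.5, 7.5, 6.5],
])
n, k = scores.shape
grand_mean = scores.mean()

# Partition the total sum of squares.
ss_conditions = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_error = ss_total - ss_conditions - ss_subjects  # what's left after removing subject differences

# Mean squares = sums of squares divided by their degrees of freedom.
df_conditions = k - 1
df_error = (n - 1) * (k - 1)
F = (ss_conditions / df_conditions) / (ss_error / df_error)
print(f"F = {F:.2f}")
```

Note that `ss_subjects` is computed only so it can be subtracted out of the error term; a between-subjects ANOVA would have no way to remove it, and those individual differences would inflate the denominator.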
A Concrete Example
Imagine a researcher studying whether a new physical therapy program improves knee flexibility after surgery. She recruits 20 patients and measures their range of motion at three time points: the day before therapy starts, four weeks in, and eight weeks in. Each patient provides three measurements, so the data has a built-in correlation: a patient who starts with poor flexibility will probably still have relatively poor flexibility at four weeks.
A repeated measures ANOVA accounts for that correlation. It treats time as a categorical variable (three discrete time points, not a continuous timeline) and asks one question: do the average flexibility scores differ across those three time points more than you’d expect by chance? If the overall F-test is significant, the researcher knows that at least one time point differs from the others, though the test alone doesn’t say which one. That’s where follow-up comparisons come in.
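In Python, one way to run this analysis is statsmodels' `AnovaRM`. The data below are simulated to loosely match the example (the baseline level, the assumed average gains at each time point, and the noise level are all invented for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)

# Simulated long-format data: 20 patients, 3 time points. Each patient's
# scores are correlated because they share a patient-specific baseline.
baseline = rng.normal(90.0, 10.0, size=20)            # per-patient starting flexibility
gains = {"pre": 0.0, "week4": 8.0, "week8": 14.0}     # assumed average improvements

rows = []
for p in range(20):
    for time, gain in gains.items():
        rows.append({"patient": p, "time": time,
                     "rom": baseline[p] + gain + rng.normal(0.0, 3.0)})
df = pd.DataFrame(rows)

# Time is treated as a categorical within-subject factor.
res = AnovaRM(df, depvar="rom", subject="patient", within=["time"]).fit()
print(res.anova_table)
```

The resulting table reports the F value, the degrees of freedom, and the p-value for the overall test; as the article notes, a significant result here still requires follow-up comparisons to say which time points differ.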
Assumptions You Need to Check
Like all parametric tests, repeated measures ANOVA comes with conditions that should hold for the results to be trustworthy.
- Normality. The dependent variable should be roughly normally distributed within each condition. With reasonably large samples, the test is fairly tolerant of mild violations.
- No extreme outliers. Because the same individuals appear in every condition, a single unusual data point can ripple across the entire analysis.
- Sphericity. This is the assumption unique to repeated measures designs. It requires that the variances of the differences between all pairs of conditions are approximately equal. For example, if you have three time points, the spread of (Time 1 minus Time 2) scores should be similar to the spread of (Time 1 minus Time 3) scores and (Time 2 minus Time 3) scores.
Sphericity is the assumption most likely to cause problems in practice, and it only applies when you have three or more levels of the repeated factor (with just two levels, there’s only one pair of differences, so the assumption is automatically met).
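An informal way to eyeball sphericity is to compute the variance of every pairwise difference directly, as the definition above describes. A small sketch with hypothetical data:

```python
import numpy as np
from itertools import combinations

# Hypothetical scores: rows are subjects, columns are three time points.
scores = np.array([
    [8.0, 6.0, 5.0],
    [7.0, 6.5, 5.5],
    [9.0, 7.0, 6.0],
    [6.0, 5.0, 4.5],
    [8.5, 7.5, 6.5],
])

# Sphericity requires these variances to be roughly equal.
for i, j in combinations(range(scores.shape[1]), 2):
    diff = scores[:, i] - scores[:, j]
    print(f"var(T{i + 1} - T{j + 1}) = {diff.var(ddof=1):.3f}")
```

If the printed variances are wildly different, a formal check (Mauchly's test, discussed next) and a correction are likely in order.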
What to Do When Sphericity Is Violated
The standard way to check sphericity is Mauchly’s test. If Mauchly’s test is significant (typically p < .05), the assumption is violated and the regular F-test becomes too liberal, meaning it’s more likely to flag a result as significant when it shouldn’t be.
The fix is straightforward: apply a correction that adjusts the degrees of freedom downward, making the test more conservative. Two corrections are widely used, and the choice between them depends on how severe the violation is. Both produce an epsilon value that ranges from 1/(k − 1), where k is the number of conditions, up to 1, where 1 means perfect sphericity. A common rule of thumb is that when epsilon falls below about 0.75, the more conservative correction (Greenhouse-Geisser) is recommended; when epsilon is above that, the slightly less conservative correction (Huynh-Feldt) is generally preferred. Most statistical software reports both automatically, so you simply pick the appropriate one based on the epsilon value.
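To show what the correction actually does, here is a sketch of the standard Greenhouse-Geisser epsilon estimator (based on the double-centered covariance matrix of the conditions) applied to hypothetical data; `greenhouse_geisser_epsilon` is a helper defined here, not a library function:

```python
import numpy as np

def greenhouse_geisser_epsilon(scores):
    """Estimate Greenhouse-Geisser epsilon from an
    (n subjects x k conditions) score matrix."""
    k = scores.shape[1]
    S = np.cov(scores, rowvar=False)  # covariance of the k conditions
    # Double-center the covariance matrix.
    Sc = (S - S.mean(axis=0, keepdims=True)
            - S.mean(axis=1, keepdims=True) + S.mean())
    return np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc ** 2))

# Hypothetical data: 5 subjects, 3 conditions.
scores = np.array([
    [8.0, 6.0, 5.0],
    [7.0, 6.5, 5.5],
    [9.0, 7.0, 6.0],
    [6.0, 5.0, 4.5],
    [8.5, 7.5, 6.5],
])
n, k = scores.shape
eps = greenhouse_geisser_epsilon(scores)

# The correction shrinks both degrees of freedom of the F-test.
df1, df2 = eps * (k - 1), eps * (n - 1) * (k - 1)
print(f"epsilon = {eps:.3f}, corrected df = ({df1:.2f}, {df2:.2f})")
```

With perfect sphericity epsilon is 1 and the degrees of freedom are untouched; the worse the violation, the smaller epsilon gets and the harder it becomes for the F value to reach significance.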
Follow-Up Comparisons
A significant F-test tells you that not all condition means are equal, but it doesn’t tell you where the differences lie. If you measured flexibility at three time points and got a significant result, you still need to figure out whether the improvement happened between the first and second measurement, the second and third, or both.
This requires pairwise comparisons, essentially a series of paired t-tests between each combination of conditions. The catch is that running multiple comparisons inflates your chance of a false positive. With three time points, you’d run three pairwise tests, and with five time points, you’d run ten. The most common solution is a Bonferroni correction, which divides the significance threshold by the number of comparisons. So if you’re running three tests at the .05 level, each individual test needs to reach .05 / 3, or roughly .017, to count as significant.
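A minimal sketch of Bonferroni-corrected pairwise comparisons, using `scipy.stats.ttest_rel` for the paired t-tests on hypothetical data:

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel

# Hypothetical flexibility scores: 5 subjects at 3 time points.
scores = np.array([
    [8.0, 6.0, 5.0],
    [7.0, 6.5, 5.5],
    [9.0, 7.0, 6.0],
    [6.0, 5.0, 4.5],
    [8.5, 7.5, 6.5],
])
labels = ["baseline", "week4", "week8"]

pairs = list(combinations(range(scores.shape[1]), 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 0.05 / 3 ~= 0.017
for i, j in pairs:
    t, p = ttest_rel(scores[:, i], scores[:, j])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{labels[i]} vs {labels[j]}: p = {p:.4f} ({verdict})")
```

Each comparison must clear the adjusted threshold, not the original .05, which keeps the overall chance of a false positive near the intended level.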
Effect Size
Statistical significance tells you whether an effect probably exists. Effect size tells you how large it is. The most commonly reported measure for repeated measures ANOVA is partial eta squared, which represents the proportion of variance in the outcome that’s explained by the condition after removing individual differences.
General benchmarks for interpretation: 0.01 is considered a small effect, 0.06 a medium effect, and 0.14 a large effect. These are guidelines, not rigid cutoffs, and what counts as “meaningful” depends heavily on the field. A partial eta squared of 0.03 might be trivial in a lab experiment but clinically important in a large public health study.
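The computation itself is simple once the sums of squares are in hand: partial eta squared is the condition sum of squares divided by the condition plus error sums of squares. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical scores: 5 subjects, 3 conditions.
scores = np.array([
    [8.0, 6.0, 5.0],
    [7.0, 6.5, 5.5],
    [9.0, 7.0, 6.0],
    [6.0, 5.0, 4.5],
    [8.5, 7.5, 6.5],
])
n, k = scores.shape
grand = scores.mean()

ss_cond = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_subj = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_err = ((scores - grand) ** 2).sum() - ss_cond - ss_subj

# Subject variance (ss_subj) is excluded from the denominator,
# which is what makes this the *partial* version.
partial_eta_sq = ss_cond / (ss_cond + ss_err)
print(f"partial eta squared = {partial_eta_sq:.3f}")
```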
When a Mixed Model Is a Better Choice
Repeated measures ANOVA works well when your data is complete and tidy: every participant measured at every time point, no dropouts, no missing sessions. Real data is rarely that clean. If even a few participants miss one measurement, a traditional repeated measures ANOVA will drop those individuals entirely, shrinking your sample and potentially biasing your results.
Linear mixed models handle this more gracefully. They can use all available data from each participant, even if some time points are missing, without discarding incomplete cases. They’re also more flexible when the relationship between time and the outcome isn’t linear, or when you need to model complex correlation structures. For longitudinal studies where dropout is common, mixed models are generally the better tool. That said, when your data is complete and your design is straightforward, repeated measures ANOVA gives you a simpler analysis that’s easier to report and interpret.
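To illustrate the difference, here is a sketch of a random-intercept mixed model using statsmodels' formula interface, fit to simulated longitudinal data (subject count, effect sizes, and the dropped sessions are all invented for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Simulated long-format data: 20 subjects, 3 time points, with a few
# sessions missing, as often happens in longitudinal studies.
rows = []
for subj in range(20):
    intercept = rng.normal(50.0, 5.0)  # subject-specific level
    for t in range(3):
        rows.append({"subject": subj, "time": t,
                     "score": intercept + 2.0 * t + rng.normal(0.0, 1.0)})
df = pd.DataFrame(rows).drop(index=[2, 10, 31])  # drop a few sessions

# Random-intercept model: every remaining row is used, even from
# subjects with an incomplete set of time points.
model = smf.mixedlm("score ~ time", df, groups=df["subject"]).fit()
print(model.params["time"])  # estimated change in score per time point
```

A traditional repeated measures ANOVA would discard the three subjects with a missing session entirely; the mixed model keeps their remaining measurements and still estimates the time effect.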

