Testing for homogeneity of variance means checking whether the spread of your data is roughly equal across groups before running tests like ANOVA or t-tests. If you skip this step, your results can be misleading: falsely assuming equal variances can substantially inflate your Type I error rate, especially with small or unbalanced samples, meaning you’re far more likely to find a “significant” difference that doesn’t actually exist. Several formal statistical tests and visual methods can detect this problem, and choosing the right one depends on your data’s distribution and sample size.
Why Equal Variance Matters
Parametric tests like the t-test and ANOVA pool the variances from each group to estimate a single shared population variance. That pooling only makes sense if the groups actually have similar variances. When they don’t (a condition called heteroscedasticity), the math behind these tests breaks down. The F-statistic and p-values you get no longer mean what they’re supposed to mean, and your control over both false positives and false negatives becomes unreliable.
Homogeneity of variance assumes that even if your group means are different, the spread of scores around each mean is roughly the same. Think of it this way: if you’re comparing test scores across three classrooms, the average score can differ, but the range of scores within each classroom should be similar. When one classroom has scores clustered tightly around the mean and another has scores scattered widely, standard ANOVA can’t handle that cleanly.
Levene’s Test
Levene’s test is the most widely used formal test for equal variances and works well across a range of situations. It tests the null hypothesis that all group variances are equal against the alternative that at least one pair of groups has unequal variances. The test works by calculating how far each data point falls from its group mean (using absolute deviations), then running an ANOVA on those distances. If the groups have similar variance, those deviations will be similar in size across groups. If variances differ, one group’s deviations will be systematically larger.
The result is an F-statistic compared against a critical value from the F distribution. A small p-value (conventionally below 0.05, though this threshold is increasingly recognized as arbitrary rather than sacred) suggests the variances are unequal. A large p-value means you don’t have evidence against equal variances, and you can proceed with standard parametric tests.
Levene’s test has an important advantage: it’s relatively robust to non-normal data. Unlike some alternatives, it doesn’t require your data to follow a bell curve to give reliable results, which is why it’s become the default recommendation in most statistics courses and software packages.
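To make this concrete, here is a minimal sketch of Levene’s test using SciPy’s `scipy.stats.levene`. The simulated groups and the 0.05 threshold are illustrative assumptions, not part of the test itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three simulated groups with the same mean but different spreads
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=50, scale=5, size=30)
group_c = rng.normal(loc=50, scale=15, size=30)  # noticeably wider spread

# center="mean" gives the classic Levene's test
# (absolute deviations from each group's mean)
stat, p = stats.levene(group_a, group_b, group_c, center="mean")
print(f"Levene W = {stat:.3f}, p = {p:.4f}")

if p < 0.05:
    print("Evidence of unequal variances")
else:
    print("No evidence against equal variances")
```

Because the third group’s spread is three times the others’, the test should return a small p-value here.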
Brown-Forsythe Test
The Brown-Forsythe test is a close relative of Levene’s test with one key modification: it uses the group median instead of the group mean when calculating deviations. This makes it even more resistant to outliers and skewed distributions.
Consider why this matters. If your data has a few extreme values in one group, those outliers pull the group mean toward them, which distorts the deviation calculations in Levene’s test. The median isn’t affected by extreme values in the same way, so the Brown-Forsythe test gives more stable results with messy, real-world data. If your data is noticeably skewed or contains outliers, the Brown-Forsythe test is the better choice. Many software packages offer both options side by side.
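In SciPy, the Brown-Forsythe test is just Levene’s test with `center="median"`. A quick sketch on simulated skewed data with a few extreme values shows both variants side by side (the data is an illustrative assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Skewed data; group_b contains a few extreme values
group_a = rng.exponential(scale=2.0, size=40)
group_b = np.concatenate([rng.exponential(scale=2.0, size=37),
                          [30.0, 35.0, 40.0]])  # outliers

# Classic Levene's (mean-based) vs Brown-Forsythe (median-based)
w_mean, p_mean = stats.levene(group_a, group_b, center="mean")
w_med, p_med = stats.levene(group_a, group_b, center="median")

print(f"Levene (mean):           W = {w_mean:.3f}, p = {p_mean:.4f}")
print(f"Brown-Forsythe (median): W = {w_med:.3f}, p = {p_med:.4f}")
```

With outliers present, the two versions can give noticeably different statistics; the median-based version is the one to trust.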
Bartlett’s Test
Bartlett’s test is more powerful than Levene’s test when your data is truly normally distributed, meaning it’s better at detecting real differences in variance. But that strength comes with a major caveat: Bartlett’s test is highly sensitive to departures from normality. If your data even slightly deviates from a normal distribution, Bartlett’s test may flag unequal variances when the real issue is non-normality, not heteroscedasticity.
Because of this sensitivity, Bartlett’s test is only appropriate when you’ve already confirmed your data is close to normally distributed (through a Shapiro-Wilk test or Q-Q plot, for example). If you’re unsure about normality, or your data is clearly skewed, stick with Levene’s or Brown-Forsythe.
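A sketch of that workflow with SciPy: check each group for normality with `scipy.stats.shapiro`, then run `scipy.stats.bartlett`. The simulated groups here are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=100, scale=10, size=50)

# Bartlett's test is only trustworthy for near-normal data,
# so check normality first with Shapiro-Wilk on each group
for name, g in [("A", group_a), ("B", group_b)]:
    _, p_norm = stats.shapiro(g)
    print(f"Group {name}: Shapiro-Wilk p = {p_norm:.3f}")

# Only proceed to Bartlett's if neither Shapiro-Wilk p is small
stat, p = stats.bartlett(group_a, group_b)
print(f"Bartlett chi-square = {stat:.3f}, p = {p:.4f}")
```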
Fligner-Killeen Test
When your data clearly violates the normality assumption, the Fligner-Killeen test is a strong non-parametric alternative. It works by ranking the absolute deviations from each group’s median rather than using the raw values. Simulation studies have identified it as one of the most robust tests for homogeneity of variances when data departs from normality.
You’ll encounter this test less often in introductory statistics courses, but it’s worth knowing about for heavily skewed data, ordinal-scale measurements, or situations where outliers are common and can’t be removed.
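SciPy exposes the test as `scipy.stats.fligner`, with the same calling convention as the others. A minimal sketch on simulated, heavily skewed (lognormal) data, chosen so one group has a much larger spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Heavily skewed groups where normality-based tests would mislead
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=60)
group_b = rng.lognormal(mean=0.0, sigma=1.2, size=60)  # much larger spread

# Fligner-Killeen ranks absolute deviations from each group's median
stat, p = stats.fligner(group_a, group_b)
print(f"Fligner-Killeen X^2 = {stat:.3f}, p = {p:.4f}")
```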
Visual Inspection With Residual Plots
Formal tests give you a p-value, but a residual plot often tells you more about what’s actually happening in your data. In regression contexts, you plot residuals (the difference between observed and predicted values) against fitted values. For group comparisons, you can plot the raw data by group using boxplots or dot plots.
What you’re looking for is the shape of the spread. Equal variance looks like a roughly uniform band of points across the plot. Unequal variance shows up as recognizable patterns: a “butterfly” (or bowtie) shape where the spread pinches in the middle and widens toward both ends, a triangle (or funnel) shape where variance grows from left to right, or distinct differences in the height of boxplots across groups. These visual patterns often reveal not just whether variance differs, but how it differs, which helps you decide on the right fix.
Visual inspection is especially useful because formal tests can be overly sensitive with large samples (flagging trivially small variance differences as “significant”) and underpowered with small samples (missing meaningful differences). A plot gives you context that a p-value alone doesn’t provide.
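As a numeric companion to the plot, you can approximate the visual check by comparing residual spread across the fitted range. A minimal NumPy sketch, using simulated regression data whose noise grows with x (a funnel pattern, by construction):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data where noise scale is proportional to x
x = np.linspace(1, 10, 200)
y = 3.0 * x + rng.normal(0, 0.5 * x)

# Fit a line and compute residuals (observed minus predicted)
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
resid = y - fitted

# Compare residual spread in the lower vs upper half of fitted values;
# a large gap is the numeric signature of a funnel shape
low = resid[fitted < np.median(fitted)]
high = resid[fitted >= np.median(fitted)]
print(f"Residual SD, lower half: {low.std():.2f}")
print(f"Residual SD, upper half: {high.std():.2f}")
```

On a boxplot or residual plot of the same data, the widening band would be visible at a glance.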
Choosing the Right Test
Your choice comes down to what you know about your data’s distribution:
- Data is normally distributed: Use Bartlett’s test for maximum power, or Levene’s test if you want a safer option.
- Data is approximately normal with possible outliers: Use the Brown-Forsythe test (Levene’s with medians).
- Data is clearly non-normal or skewed: Use the Fligner-Killeen test.
- You’re not sure about the distribution: Use Levene’s test or Brown-Forsythe as a default. They perform reasonably well across most conditions.
In practice, many researchers run Levene’s test as a first pass and supplement it with a residual plot. If both point to equal variances, you can proceed confidently with standard parametric tests.
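The decision rules above can be folded into a small helper. This function and its name are hypothetical, a sketch of the logic rather than an established API; it assumes you tell it whether normality is plausible:

```python
from scipy import stats


def check_equal_variances(*groups, assume_normal=False, alpha=0.05):
    """Hypothetical helper implementing the decision rules above.

    Uses Bartlett's test when normality is assumed (maximum power),
    otherwise Brown-Forsythe (median-centered Levene's) as the safe default.
    Returns (test_name, statistic, p_value, variances_look_equal).
    """
    if assume_normal:
        stat, p = stats.bartlett(*groups)
        name = "Bartlett"
    else:
        stat, p = stats.levene(*groups, center="median")
        name = "Brown-Forsythe"
    return name, stat, p, p >= alpha


# Example: two groups, normality not assumed
name, stat, p, ok = check_equal_variances([4.1, 5.2, 3.9, 4.8],
                                          [4.0, 5.1, 4.2, 4.7])
print(f"{name}: p = {p:.3f}, proceed with equal-variance tests: {ok}")
```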
Running These Tests in Software
In R, use leveneTest() from the car package for Levene’s and Brown-Forsythe tests (the function defaults to using the median, making it technically a Brown-Forsythe test; set center = mean for the classic Levene version). Bartlett’s test is available as bartlett.test() in base R, and the Fligner-Killeen test as fligner.test().
In Python, the SciPy library provides scipy.stats.levene(), scipy.stats.bartlett(), and scipy.stats.fligner(). The levene() function accepts a center parameter that lets you choose between mean, median, or trimmed mean.
In SPSS, these tests are found under the Analyze menu within Compare Means. Levene’s test is automatically included in the output when you run an independent samples t-test or one-way ANOVA.
What to Do When Variances Are Unequal
If your test comes back significant, you have two main options. The first and often simplest is to use a test that doesn’t assume equal variances. Welch’s t-test (for two groups) and Welch’s ANOVA (for three or more groups) adjust the degrees of freedom to account for unequal variances, and they perform well enough that some statisticians recommend using them by default regardless of what the variance test shows.
The second approach is transforming your data before analysis. Log transformations, square root transformations, or other mathematical conversions can sometimes stabilize variance across groups. This works best when the pattern of heteroscedasticity is predictable, like when groups with higher means also have higher variance (common in count data or reaction time data). The downside is that your results are now on a transformed scale, which can make interpretation less intuitive.
A more general approach is the generalized least squares technique, which explicitly models the unequal variances rather than assuming them away. This is more complex to implement but gives you valid results without transforming your original data. Between Welch’s correction and data transformation, most researchers find a workable solution without needing this level of complexity.