What Is a Student T-Test? Types, Uses, and Assumptions

A Student’s t-test is a statistical method used to determine whether there’s a meaningful difference between the averages of two groups. If you’re comparing the test scores of two classrooms, the blood pressure of patients before and after a medication, or whether a new manufacturing process produces different results than the old one, a t-test tells you whether the difference you see in your data is likely a genuine effect or just random chance. It’s one of the most commonly used tools in statistics, especially when working with small sample sizes.

Why It’s Called “Student’s” T-Test

The name has nothing to do with school. William Sealy Gosset developed the t-test in 1908 while working as a chemist at a brewery in Dublin. He needed a way to make reliable comparisons using the small batches of data available in brewing experiments. His employer didn’t allow staff to publish under their own names, so Gosset used the pseudonym “Student.” The name stuck, and the test has been called Student’s t-test ever since.

When to Use a T-Test Instead of Other Methods

The t-test exists because of a very specific problem: most of the time, you don’t know the true variability of the entire population you’re studying. You only have a sample. When you substitute your sample’s variability for the population’s, you introduce some imprecision, and that imprecision matters most when your sample is small.

If you happen to know the population’s standard deviation (a measure of how spread out values are), you can use a simpler method called a z-test. But that situation is rare. In practice, you’re almost always estimating variability from your sample, which is exactly when the t-test is appropriate. The t-test accounts for the extra uncertainty by using a slightly different probability curve called the t-distribution, which has heavier tails than the normal bell curve. Those heavier tails make it harder to declare a result “significant,” which protects you from drawing conclusions your small dataset can’t support.

Once your sample size exceeds about 30, the t-distribution starts to look nearly identical to the normal distribution, and the t-test and z-test give almost the same results. Below 30, the difference becomes important.
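The heavier tails translate directly into larger critical values, the cutoff your t-statistic must exceed to reach significance. A minimal sketch using scipy, comparing two-tailed 5% cutoffs at a few (arbitrarily chosen) degrees of freedom:

```python
from scipy import stats

# Two-tailed 5% critical value: the cutoff |t| must exceed for significance.
# The normal (z) cutoff is fixed; the t cutoff depends on degrees of freedom.
z_crit = stats.norm.ppf(0.975)

for df in (5, 10, 30, 100):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df={df:>3}: t critical = {t_crit:.3f} (z critical = {z_crit:.3f})")
```

At 5 degrees of freedom the t cutoff is well above the z cutoff of about 1.96; by 30 the gap has nearly closed, which is exactly the convergence described above.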

Three Types of T-Tests

One-Sample T-Test

This compares the average of a single group to a specific known value. For example, a school administrator might test whether the average reading score of students in a particular district differs from the national average of 500. You have one group of data and one fixed number to compare it against.
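A one-sample comparison like this takes only a few lines with scipy; the scores below are made up for illustration:

```python
from scipy import stats

# Hypothetical district reading scores (made-up data for illustration)
scores = [512, 484, 530, 497, 521, 478, 505, 519, 492, 526]

# Does the district mean differ from the national average of 500?
result = stats.ttest_1samp(scores, popmean=500)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```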

Independent Samples T-Test

This compares the averages of two separate, unrelated groups. The people (or items) in one group have no connection to those in the other. A classic example: researchers measured BMI in a group of 10 men and 10 women and used an independent samples t-test to determine whether the average BMI differed between the sexes. In that case, the test showed no statistically significant difference (p = 0.489), meaning the observed gap was small enough to be explained by chance.
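In scipy, an independent samples test is one function call; the BMI values below are invented for illustration, not the study's data:

```python
from scipy import stats

# Hypothetical BMI measurements for two unrelated groups (made-up data)
bmi_men   = [24.1, 26.3, 22.8, 27.5, 25.0, 23.9, 28.1, 24.7, 26.0, 25.4]
bmi_women = [23.5, 25.8, 21.9, 26.4, 24.2, 22.7, 27.3, 23.8, 25.1, 24.6]

# Independent samples t-test: are the two group means different?
result = stats.ttest_ind(bmi_men, bmi_women)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```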

Paired Samples T-Test

This compares two measurements taken from the same group, typically before and after some event or treatment. The key feature is that each data point in one set is linked to a specific data point in the other. In one medical study, researchers measured diastolic blood pressure in 20 patients at baseline and again 30 minutes later. The paired t-test showed the average increase of 4.35 points was statistically significant (p < 0.001), meaning the change was unlikely to be random.
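In code, the paired test differs from the independent one only in the function used, because each value in one list is matched by position to a value in the other. The blood pressure readings below are made up:

```python
from scipy import stats

# Hypothetical diastolic BP for the same patients at two time points (made-up)
baseline  = [78, 82, 75, 90, 85, 79, 88, 76, 83, 81]
follow_up = [82, 85, 80, 93, 88, 84, 91, 79, 88, 86]

# Paired t-test: each follow-up value is linked to the same patient's baseline.
# t is negative here because baseline (the first argument) is the lower set.
result = stats.ttest_rel(baseline, follow_up)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```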

The distinction between independent and paired tests matters because paired data, where each subject serves as their own comparison, naturally controls for individual differences and tends to be more sensitive to real effects.

What the T-Test Actually Calculates

A t-test produces a number called the t-statistic, which is essentially a ratio. The numerator is the difference between the group averages you’re comparing. The denominator is a measure of how much variability exists in your data, adjusted for sample size. A larger t-value means the difference between groups is large relative to the noise in the data.

From the t-statistic, the test generates a p-value. This tells you the probability of seeing a difference at least as large as the one in your data if there were actually no real difference between the groups. By convention, if the p-value falls below 0.05 (a 5% threshold popularized by the statistician Ronald Fisher), the result is considered statistically significant. That 0.05 cutoff means that, when no real difference exists, you’d still falsely declare one about 1 time in 20.

A p-value above 0.05 doesn’t prove the groups are the same. It simply means your data isn’t strong enough to confidently rule out chance as an explanation.
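The "difference over noise" ratio described above can be computed by hand for a one-sample test; the data and the hypothesized mean of 5.0 below are arbitrary:

```python
import math
from scipy import stats

# Hand-computed one-sample t-statistic on made-up data, matching the
# "signal divided by noise" ratio described above.
data = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]
hypothesized_mean = 5.0

n = len(data)
mean = sum(data) / n
# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

t = (mean - hypothesized_mean) / (sd / math.sqrt(n))   # signal / noise
p = 2 * stats.t.sf(abs(t), df=n - 1)                   # two-tailed p-value
print(f"t = {t:.3f}, p = {p:.3f}")
```

The same numbers come out of `stats.ttest_1samp(data, 5.0)`, which is the sanity check worth running whenever you compute a test statistic by hand.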

Degrees of Freedom

Every t-test calculation involves a value called degrees of freedom, which reflects how much independent information your data contains. It determines which version of the t-distribution to use when calculating your p-value. The formulas are straightforward:

  • One-sample t-test: degrees of freedom = n − 1, where n is your sample size
  • Independent samples t-test: degrees of freedom = n1 + n2 − 2, where n1 and n2 are the sizes of your two groups
  • Paired samples t-test: degrees of freedom = n − 1, where n is the number of pairs

Smaller degrees of freedom produce a wider, flatter t-distribution, which makes it harder to reach statistical significance. As degrees of freedom increase, the distribution narrows and approaches the familiar bell curve.

Assumptions Your Data Must Meet

The t-test isn’t appropriate for every dataset. It requires several conditions to produce reliable results:

  • Numeric data: The variable you’re comparing must be measured on a numeric scale (like weight, temperature, or test scores), not categories.
  • Random sampling: Your data should be drawn randomly from the population, so that your sample is representative.
  • Normal distribution: The data in each group should follow a roughly bell-shaped distribution. This assumption becomes less critical as sample sizes grow, because the central limit theorem makes the distribution of sample averages approximately normal even when the raw data isn’t. With very small samples from a non-normal population, the t-test can give misleading results.
  • Equal variance: For the independent samples t-test, the spread of data in both groups should be roughly similar. When group sizes are unequal, violations of this assumption become especially problematic.
  • Independence: Each observation should be independent of the others (except in the paired test, where the pairing is the whole point).
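Some of these assumptions can be checked directly before running the t-test. A sketch using scipy's Shapiro-Wilk test for normality and Levene's test for equal variance, on made-up data:

```python
from scipy import stats

# Made-up data for two independent groups
group_a = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1, 12.5, 13.8]
group_b = [11.9, 12.8, 13.5, 12.2, 14.1, 12.6, 13.0, 12.4]

# Normality check: Shapiro-Wilk (p > 0.05 means no evidence of non-normality)
for name, g in (("A", group_a), ("B", group_b)):
    stat, p = stats.shapiro(g)
    print(f"group {name}: Shapiro-Wilk p = {p:.3f}")

# Equal-variance check: Levene's test (p > 0.05 means no evidence of unequal spread)
stat, p = stats.levene(group_a, group_b)
print(f"Levene's test p = {p:.3f}")
```

Note that with small samples these checks have little power, so a passing result is weak evidence; failing them is a stronger signal that a standard t-test may be inappropriate.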

Welch’s T-Test: A Safer Alternative

The standard independent samples t-test assumes both groups have equal variance. When that assumption is violated and group sizes are unequal, the test can produce false positives at a higher rate than expected. In one simulation, when the ratio of standard deviations between populations was 2 and the smaller group happened to come from the more variable population, the false positive rate jumped from the expected 5% to 8.3%.

Welch’s t-test is a modified version that doesn’t require equal variances. It adjusts the degrees of freedom to account for differences in spread between groups. Many statisticians now recommend using Welch’s t-test by default for independent samples comparisons. When variances are actually equal, Welch’s test is only slightly less powerful than the standard version, making the tradeoff negligible. Most statistical software offers Welch’s test as an option, and some use it as the default.
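In scipy, Welch's version is one parameter away from the standard test; the groups below are invented to have unequal sizes and clearly unequal spreads:

```python
from scipy import stats

# Made-up groups with unequal sizes and clearly unequal spread
small_group = [10.2, 11.5, 9.8, 10.9, 11.1]                       # low variance
large_group = [8.0, 14.5, 10.2, 16.1, 7.4, 12.8, 9.9, 15.3,
               11.0, 13.7, 8.8, 14.9]                             # high variance

# Standard t-test (assumes equal variance) vs Welch's (does not)
standard = stats.ttest_ind(small_group, large_group)              # equal_var=True
welch    = stats.ttest_ind(small_group, large_group, equal_var=False)
print(f"standard: t = {standard.statistic:.3f}, p = {standard.pvalue:.3f}")
print(f"Welch:    t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

The two calls differ only in `equal_var`, which is why switching to Welch's test by default costs almost nothing in practice.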

P-Values Don’t Tell the Whole Story

A common mistake is treating statistical significance as the finish line. A p-value below 0.05 tells you a difference probably exists, but it says nothing about whether that difference is large enough to matter in practice. With a big enough sample, even a trivially small difference can reach statistical significance.

This is where effect size comes in. The most common measure for t-tests is Cohen’s d, which expresses the difference between group averages in terms of standard deviations. Cohen proposed benchmarks that are still widely used: 0.2 is a small effect (real but hard to notice), 0.5 is a medium effect (visible to a careful observer), and 0.8 or above is a large effect. Reporting both the p-value and the effect size gives a much more complete picture of your results than either number alone.
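Cohen's d for two independent groups is straightforward to compute from the pooled standard deviation; the data below are made up:

```python
import math

# Cohen's d for two independent groups, using the pooled standard deviation.
# The data are made up for illustration.
def cohens_d(a, b):
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    # Pooled SD: variances weighted by degrees of freedom
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

group_a = [23.1, 24.5, 22.8, 25.0, 23.7, 24.2]
group_b = [21.9, 22.6, 21.4, 23.0, 22.1, 22.8]
d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in standard deviation units, it can be compared against Cohen's 0.2/0.5/0.8 benchmarks regardless of the original measurement scale.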