A t-test result tells you whether the difference between two groups (or between a group and a known value) is likely real or just due to random chance. The output typically includes a t-value, degrees of freedom, and a p-value. Once you know what each number means and how they relate to each other, the whole output clicks into place.
The Three Numbers in Every T-Test
Whether you’re looking at output from Excel, SPSS, R, or a published study, you’ll see the same core components: a t-value (sometimes called the t-statistic), degrees of freedom (abbreviated “df”), and a p-value. Some outputs also include the mean difference, standard error, and confidence intervals. Here’s what each one actually tells you.
The t-value is a ratio. It takes the difference between two group averages and divides it by the standard error, which measures how much variability exists in your data. A t-value of 1.0 means the groups differ by exactly one standard error. A t-value of 4.0 means the difference is four standard errors wide, far larger than random noise alone would typically produce. The farther the t-value is from zero (in either direction), the stronger the evidence that the difference is real. A t-value close to zero suggests the groups are basically the same.
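The ratio can be computed by hand. A minimal sketch using made-up scores for two hypothetical classrooms; the unpooled standard-error formula here is the one Welch's version of the test uses:

```python
from statistics import mean, stdev

# Hypothetical data: test scores from two classrooms (made-up numbers)
group_a = [78, 85, 90, 72, 88, 81, 79, 94]
group_b = [70, 75, 82, 68, 74, 77, 71, 80]

mean_diff = mean(group_a) - mean(group_b)

# Standard error of the difference (unpooled, as in Welch's t-test)
se = (stdev(group_a) ** 2 / len(group_a)
      + stdev(group_b) ** 2 / len(group_b)) ** 0.5

t_value = mean_diff / se
print(f"mean difference = {mean_diff:.2f}, SE = {se:.2f}, t = {t_value:.2f}")
```

Here the groups differ by 8.75 points, and the t-value tells you how many standard errors wide that gap is.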
Degrees of freedom reflect your sample size. For an independent t-test comparing two separate groups, degrees of freedom roughly equal the total number of observations minus 2. For a paired t-test (where you’re comparing the same people before and after something), it’s the number of pairs minus 1. You don’t need to calculate this yourself; software handles it. But degrees of freedom matter because they determine how the t-value gets converted into a p-value. With a small sample, you need a larger t-value to reach significance. With a large sample, a smaller t-value can be enough.
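You can see this effect directly by asking for the critical t-value at different degrees of freedom (using SciPy's t distribution; the sample sizes are arbitrary):

```python
from scipy.stats import t

# Critical |t| for two-tailed significance at the 0.05 level:
# smaller samples demand a larger t-value to reach significance
for df in (5, 20, 100):
    print(f"df = {df:>3}: |t| must exceed {t.ppf(0.975, df):.2f}")
```

With 5 degrees of freedom the bar is roughly 2.57; with 100 it drops to roughly 1.98.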
The p-value is the number most people jump to first. It tells you the probability of seeing a difference this large (or larger) if there were truly no difference between the groups. A p-value of 0.03 means that if the groups were genuinely identical, a difference this big would appear only about 3% of the time. (Note what it does not say: it is not a 3% chance that the result is noise; that common reading reverses the logic.) The conventional cutoff is 0.05: below it, the result is called “statistically significant,” meaning you reject the idea that the groups are the same. Some fields use a stricter threshold of 0.01.
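In practice, software computes all three numbers at once. A sketch with SciPy and invented sleep data:

```python
from scipy import stats

# Hypothetical data: hours of sleep in a control and a treated group
control = [6.8, 7.1, 6.5, 7.0, 6.9, 7.2, 6.7, 6.6]
treated = [7.4, 7.8, 7.2, 7.9, 7.5, 7.3, 7.6, 7.7]

result = stats.ttest_ind(control, treated)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

The degrees of freedom for this independent test are 8 + 8 − 2 = 14, and SciPy uses them internally to convert the t-value into the p-value.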
What “Statistically Significant” Actually Means
When a t-test produces a p-value below 0.05, you reject what statisticians call the null hypothesis, which is simply the assumption that there’s no real difference between your groups. A p-value above 0.05 means you “fail to reject” that assumption. Notice the careful language: you don’t prove there’s no difference. You just didn’t find strong enough evidence of one.
That said, the 0.05 threshold is a convention, not a law of nature. The American Statistical Association has stated explicitly that this cutoff is arbitrary and that scientific conclusions should not rest on whether a p-value passes a single threshold. A p-value of 0.049 and a p-value of 0.051 are practically identical in what they tell you about the data. Treat p-values as a sliding scale of evidence rather than a pass/fail gate.
Why the P-Value Isn’t Enough
A small p-value tells you a difference exists. It does not tell you whether that difference matters. This is the distinction between statistical significance and practical significance, and it trips up a lot of people.
Consider two studies testing different cancer drugs, both producing a p-value of 0.01. Drug A extends survival by five years. Drug B extends survival by five months. Both results are statistically significant, but they are worlds apart in real-world importance. The p-value is identical; the actual impact is not.
This is why you should always look at the mean difference reported alongside the t-test. It tells you the size of the gap between groups in the original units of measurement, whether that’s pounds, test scores, hours of sleep, or anything else. A statistically significant difference of 0.2 points on a 100-point exam is probably meaningless in practice, even if the p-value is tiny.
How to Use Effect Size
Effect size gives you a standardized way to judge whether a difference is small or large, regardless of the units involved. The most common measure reported with t-tests is Cohen’s d, which expresses the difference between groups in standard deviation units. A Cohen’s d of 0.2 is considered small, 0.5 is medium, and 0.8 or above is large. As the psychologist Jacob Cohen put it, a medium effect is “visible to the naked eye of a careful observer,” while a small effect is noticeably smaller but not trivial.
Not all software outputs include effect size automatically, but you can calculate it from the numbers provided. If your output gives you the mean difference and the pooled standard deviation, dividing one by the other gives you Cohen’s d. Many researchers now argue that reporting effect size should be standard practice alongside p-values, because it answers the question that actually matters: how big is this difference?
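That division is a one-liner. A sketch of the calculation, with made-up exam scores for two hypothetical groups:

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2
                  + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Hypothetical exam scores (made-up numbers)
group_a = [78, 85, 90, 72, 88, 81, 79, 94]
group_b = [70, 75, 82, 68, 74, 77, 71, 80]
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")
```

A result above 0.8 would count as a large effect on Cohen's scale.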
Reading Confidence Intervals
Many t-test outputs include a 95% confidence interval for the difference between the two group means. This interval gives you a range of plausible values for the true difference. If you see “95% CI: 2.1 to 8.4,” it means you can be reasonably confident the real difference falls somewhere between 2.1 and 8.4 units.
The key thing to check is whether the interval includes zero. If zero falls inside the range (for example, “95% CI: -1.3 to 5.7”), the result is not statistically significant at the 0.05 level. Zero represents no difference, and since it’s a plausible value within your interval, you can’t rule it out. If zero falls outside the interval, the result is significant. This gives you the same information as the p-value but in a more informative format, because you can also see the range of likely effect sizes.
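If your output omits the interval, it can be reconstructed from the summary statistics. A sketch assuming equal variances (the pooled version), with invented data in the example call:

```python
from statistics import mean, stdev
from scipy.stats import t

def mean_diff_ci(a, b, level=0.95):
    """Confidence interval for the difference in means (pooled variances)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * stdev(a) ** 2
                  + (n_b - 1) * stdev(b) ** 2) / df
    se = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    crit = t.ppf((1 + level) / 2, df)
    diff = mean(a) - mean(b)
    return diff - crit * se, diff + crit * se

# Hypothetical data; if the interval excludes zero, p < 0.05
lo, hi = mean_diff_ci([78, 85, 90, 72, 88, 81, 79, 94],
                      [70, 75, 82, 68, 74, 77, 71, 80])
print(f"95% CI: {lo:.1f} to {hi:.1f}")
```

Since the lower bound is above zero here, this difference would be significant at the 0.05 level.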
Three Types of T-Tests
The interpretation rules above apply to all t-tests, but the type of t-test determines what comparison is being made.
- Independent (two-sample) t-test: Compares the averages of two separate, unrelated groups. Example: testing whether a new teaching method produces higher scores than the standard method, using two different classrooms.
- Paired t-test: Compares two measurements from the same group. This is common in before-and-after designs, like measuring blood pressure before and after a medication. It also applies when subjects are naturally matched, such as comparing test scores between twins or spouses.
- One-sample t-test: Compares a group’s average to a specific known value. Example: testing whether the average height of students at a school differs from the national average. A paired t-test is essentially a one-sample t-test performed on the within-pair differences, tested against zero.
Knowing which test was used helps you understand what the output is comparing, but the way you read the t-value, p-value, and confidence interval is the same across all three.
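The three tests map onto three SciPy calls. A sketch with hypothetical blood-pressure readings, which also demonstrates that the paired test is a one-sample test on the within-pair differences:

```python
from scipy import stats

# Hypothetical systolic blood pressure readings (made-up numbers)
before = [142, 150, 138, 145, 155, 148]
after_ = [136, 144, 135, 140, 147, 141]
other  = [130, 139, 128, 144, 133, 137]

# Independent: two unrelated groups
t_ind, p_ind = stats.ttest_ind(before, other)
# Paired: the same subjects measured twice
t_rel, p_rel = stats.ttest_rel(before, after_)
# One-sample: one group against a fixed reference value (here, 140)
t_one, p_one = stats.ttest_1samp(before, 140)

# The paired test equals a one-sample test on the differences, against zero
diffs = [b - a for b, a in zip(before, after_)]
t_chk, p_chk = stats.ttest_1samp(diffs, 0)
```

The last two calls produce identical t- and p-values, which is exactly the equivalence noted above.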
When T-Test Results May Be Unreliable
T-tests make several assumptions about your data, and when those assumptions are violated, the results can be misleading. The data should come from a random sample, follow a roughly normal (bell-shaped) distribution, and (for independent t-tests) the two groups should have similar amounts of variability. With large enough samples, the normality requirement becomes less important thanks to the central limit theorem: averages of many observations tend toward a normal distribution even when the individual values are skewed. But with small samples, a heavily skewed distribution can produce unreliable p-values.
If your software output flags unequal variances (sometimes shown as Levene’s test), look for the adjusted version of the t-test, often labeled “equal variances not assumed” or “Welch’s t-test.” Most software provides both versions side by side.
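In SciPy the switch between the two versions is a single flag. With made-up data where one group is far more variable than the other, the p-values diverge because Welch's test adjusts the degrees of freedom:

```python
from scipy import stats

# Hypothetical measurements: similar means, very different spreads
tight = [50.1, 49.8, 50.2, 50.0, 49.9, 50.1, 50.0, 49.8]
loose = [48.0, 55.0, 44.0, 58.0, 51.0, 41.0, 60.0, 47.0]

student = stats.ttest_ind(tight, loose, equal_var=True)   # classic Student's test
welch = stats.ttest_ind(tight, loose, equal_var=False)    # Welch's t-test
print(f"Student's p = {student.pvalue:.3f}, Welch's p = {welch.pvalue:.3f}")
```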
Reading a T-Test in Practice
Here’s a step-by-step approach when you encounter t-test results, whether in software output or a journal article:
- Identify the test type. Are two independent groups being compared, or is this a before-and-after comparison on the same subjects?
- Check the mean difference. What is the actual gap between the groups, in real units? Does this difference seem meaningful for the context?
- Look at the t-value. A larger absolute value means the difference is large relative to the noise in the data.
- Check the p-value. Below 0.05 is conventionally significant, but treat it as a continuous measure of evidence rather than a binary verdict.
- Examine the confidence interval. Does it include zero? How wide is it? A narrow interval means the estimate is precise. A wide interval means there’s a lot of uncertainty.
- Look for effect size. A Cohen’s d of 0.5 or higher suggests the difference is meaningful in practical terms, not just statistically detectable.
The most common mistake in reading t-test results is treating a significant p-value as proof that the finding is important, or treating a non-significant p-value as proof that there’s no difference. Neither is true. A significant result with a tiny effect size may be trivial. A non-significant result from a small study may simply mean you didn’t have enough data to detect a real difference. The full picture comes from reading all the numbers together.
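The checklist above can be wrapped into a single helper that reports every number at once. A sketch for the independent-samples case, assuming equal variances and using invented data in the example call:

```python
from statistics import mean, stdev
from scipy import stats

def summarize_ttest(a, b):
    """All the checklist numbers for an independent t-test (pooled variances)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    diff = mean(a) - mean(b)
    pooled_sd = (((n_a - 1) * stdev(a) ** 2
                  + (n_b - 1) * stdev(b) ** 2) / df) ** 0.5
    se = pooled_sd * (1 / n_a + 1 / n_b) ** 0.5
    crit = stats.t.ppf(0.975, df)
    res = stats.ttest_ind(a, b)  # equal_var=True by default
    return {
        "mean_diff": diff,
        "t": res.statistic,
        "p": res.pvalue,
        "ci_95": (diff - crit * se, diff + crit * se),
        "cohens_d": diff / pooled_sd,
    }

report = summarize_ttest([78, 85, 90, 72, 88, 81, 79, 94],
                         [70, 75, 82, 68, 74, 77, 71, 80])
for key, value in report.items():
    print(key, value)
```

Reading the dictionary as a whole, rather than the p-value alone, is the habit the article recommends.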

