What Is the T-Test? Formula, Types, and Assumptions

A t-test is a statistical method that compares the averages of two groups to determine whether the difference between them is meaningful or just due to random chance. It works by calculating a ratio: how large the difference between the group averages is relative to how spread out the data is within each group. The bigger that ratio, the more likely the difference is real rather than a fluke.

How a T-Test Works

At its core, a t-test answers a simple question: are these two groups actually different, or does the gap I’m seeing just reflect natural variation in the data? Imagine you measure the test scores of students in two classrooms. One class averages 78 and the other averages 83. That five-point gap could reflect a genuine difference in teaching methods, or it could be random noise from the particular students who happened to be in each room. A t-test helps you figure out which explanation is more likely.

The test produces a number called the t-statistic, which is then converted into a p-value. The p-value tells you the probability of seeing a difference this large (or larger) if there were actually no real difference between the groups. Researchers typically set a threshold (commonly 0.05) before running the test. If the p-value falls below that threshold, the result is considered statistically significant, meaning you can reject the assumption that the groups are the same (the null hypothesis). A p-value of 0.01, for example, means that if the groups were truly the same, random variation alone would produce a difference this large only 1% of the time.

The Formula Behind It

The t-statistic for the simplest version of the test (comparing one group to a known value) uses four pieces of information: the sample average, the value you’re comparing it to, the standard deviation of your sample, and the number of observations. The formula divides the difference between your sample average and the comparison value by something called the standard error, which is just the standard deviation divided by the square root of the sample size.
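As a sketch, that formula can be written out directly in code. The five measurements and the comparison value of 100 below are invented for illustration:

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (sample mean - comparison value) / standard error,
    where standard error = sample std dev / sqrt(n)."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses the n-1 form
    return (mean(sample) - mu0) / standard_error

# Do these five measurements differ from a target value of 100?
t = one_sample_t([102, 98, 105, 101, 99], 100)  # t ≈ 0.82
```

The t-statistic alone isn't the final answer; it still has to be compared against the t-distribution with n − 1 degrees of freedom to produce a p-value.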

This structure captures the three things that determine whether a result will be significant. First, the size of the difference you observed. Second, how much variability exists in your data (less variability makes differences easier to detect). Third, how many observations you have (more data gives you more confidence). A large difference, low variability, and a big sample all push the t-statistic higher, making a significant result more likely.

Three Types of T-Tests

There are three common versions of the t-test, and picking the right one depends on what you’re comparing.

  • One-sample t-test: Compares a single group’s average to a specific known value. For example, testing whether the average height of students at a school differs from the national average.
  • Independent samples t-test: Compares the averages of two separate, unrelated groups. For example, comparing blood pressure between patients who received a drug and patients who received a placebo.
  • Paired samples t-test: Compares two measurements taken from the same group, such as a patient’s pain level before and after treatment. Because the same individuals appear in both measurements, the data points are linked, and the test accounts for that connection.

A paired t-test can only compare two related measurements on a continuous outcome (something measured on a numerical scale, like weight or temperature). If you have more than two groups to compare, the t-test is no longer the right tool. You’d need an analysis of variance (ANOVA) instead.
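The paired version can be sketched by reducing it to a one-sample test on the per-subject differences, compared against zero. The before/after pain scores below are made up:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """A paired t-test is a one-sample t-test on the differences,
    which is how the link between the two measurements is accounted for."""
    diffs = [a - b for b, a in zip(before, after)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error

# Hypothetical pain scores for the same five patients
t = paired_t(before=[7, 6, 8, 5, 7], after=[5, 5, 6, 4, 6])  # t ≈ -5.7
```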

When to Use a T-Test Instead of a Z-Test

You may have heard of the z-test, which does something similar. The distinction comes down to two factors: sample size and how much you know about the broader population. Z-tests are designed for large samples (generally 30 or more observations) where the population’s variability is already known. T-tests are more flexible. They work with small samples and don’t require you to know the population’s variability in advance, which is the situation most researchers actually face. In practice, t-tests are far more common because you rarely know the true population variance ahead of time.

Assumptions the Data Must Meet

A t-test doesn’t work on just any dataset. The data needs to meet several conditions for the results to be trustworthy. The outcome you’re measuring must be on a numerical scale (things like temperature, weight, or test scores, not categories like “yes” or “no”). The data should be collected through random sampling, and it should follow a roughly normal distribution, meaning most values cluster near the center with fewer extreme values on either side.

For the independent samples version, there’s an additional requirement: the two groups should have similar amounts of variability (called homogeneity of variance). When that assumption is violated, meaning one group’s data is much more spread out than the other’s, a modified version called Welch’s t-test is used instead. Welch’s version doesn’t require equal variance between the groups, making it a safer default choice in many situations.
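A minimal sketch of Welch's version, using the standard Welch–Satterthwaite approximation for the degrees of freedom; the two groups below are invented, with deliberately different spreads:

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t uses each group's own variance rather than pooling,
    so it does not assume equal spread across groups."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Tight group vs. spread-out group (made-up numbers)
t, df = welch_t([10, 12, 11, 13, 14], [8, 15, 6, 18, 4])
```

Unlike the standard test, the resulting degrees of freedom are usually not a whole number.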

Statistical Significance vs. Practical Importance

A common mistake is treating a statistically significant result as proof that the difference matters in the real world. A t-test can tell you that two groups are different, but it can’t tell you whether that difference is large enough to care about. With a big enough sample, even a tiny, meaningless difference can reach statistical significance.

This is where effect size comes in. One widely used measure is Cohen’s d, which expresses the difference between two groups in standardized units. The general benchmarks: 0.2 is a small effect, 0.5 is moderate, and 0.8 is large. Reporting effect size alongside your p-value gives a much more complete picture. A result might be statistically significant with p = 0.03 but have a Cohen’s d of 0.1, meaning the actual difference between the groups is negligible in practical terms.
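Cohen's d can be sketched with the pooled-standard-deviation form. The classroom scores below are invented and unusually tidy, so the effect comes out very large:

```python
import math
from statistics import mean, variance

def cohens_d(x, y):
    """Difference between group means, in units of pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)

# Toy data with very little spread, so d is unrealistically large
d = cohens_d([83, 85, 81, 84, 82], [78, 80, 76, 79, 77])  # d ≈ 3.16
```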

The Multiple Testing Trap

One of the biggest pitfalls with t-tests is running too many of them on the same dataset. Every time you perform a t-test, there’s a small chance (typically 5%) of getting a false positive, meaning you conclude there’s a difference when there isn’t one. Run one test and that 5% risk is manageable. But if you compare two groups across five different outcomes, the probability of at least one false positive jumps to 23%. Run 20 comparisons and it climbs to 64%.
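Those percentages follow from treating each test as an independent 5% chance of a false positive:

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

family_wise_error_rate(1)   # 0.05
family_wise_error_rate(5)   # ≈ 0.23
family_wise_error_rate(20)  # ≈ 0.64
```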

This happens because each test is an independent roll of the dice. The more tests you run, the more opportunities you create for random chance to produce a “significant” result. Researchers handle this by adjusting the significance threshold when multiple comparisons are involved, or by using statistical methods designed for multiple groups from the start rather than running a series of t-tests.
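One common way to adjust the threshold is the Bonferroni correction (an assumption here, since the text doesn't name a specific method; it is simply the most basic one): divide the significance level by the number of tests:

```python
def bonferroni_threshold(alpha, num_tests):
    """Bonferroni correction: each individual test must clear alpha / m."""
    return alpha / num_tests

# With five comparisons, each p-value must fall below 0.01, not 0.05
threshold = bonferroni_threshold(0.05, 5)  # 0.01
```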

A Concrete Example

Suppose a researcher wants to know whether children and adults have different average sodium levels in their blood. They collect samples from 25 children and 25 adults, measure sodium concentration in each person, and run an independent samples t-test. The test calculates the difference between the two group averages, divides it by the pooled standard error, and produces a t-statistic. That statistic is then evaluated against a reference distribution based on the degrees of freedom (here, 25 + 25 − 2 = 48) to generate a p-value. If the p-value comes in below 0.05, the researcher concludes that sodium levels genuinely differ between children and adults. If it’s above 0.05, they cannot rule out that the difference is just random variation.
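The t-statistic part of that computation can be sketched end-to-end (the final p-value step needs the t-distribution, which Python's standard library lacks). The sodium values below are invented, and trimmed to five per group for brevity:

```python
import math
from statistics import mean, variance

def independent_t(x, y):
    """Student's independent-samples t with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    t = (mean(x) - mean(y)) / standard_error
    return t, nx + ny - 2  # t-statistic and degrees of freedom

# Hypothetical sodium levels (mmol/L), adults vs. children
t, df = independent_t([140, 142, 138, 141, 139],
                      [137, 139, 136, 138, 135])  # t = 3.0 on 8 df
```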

The t-test remains one of the most widely used statistical tools in science, medicine, and business precisely because it answers such a fundamental question: is this difference real? Understanding what it does, what it assumes, and where it falls short puts you in a much better position to interpret the research you encounter every day.