A related samples t-test is a statistical method that compares two sets of measurements taken from the same group of people. If you measure a group’s blood pressure before and after a medication, for example, you’d use this test to determine whether the change was statistically meaningful or likely due to chance. You’ll also see it called a paired samples t-test, dependent samples t-test, or matched pairs t-test. These are all the same procedure.
Why “Related” Samples Matter
The word “related” is the key concept. In this design, every data point in one group has a direct partner in the other group, because both measurements come from the same person. A student who takes a math test in September and again in December contributes one score to each group. Those two scores are linked, or “dependent,” because they share everything about that individual: their background knowledge, test-taking habits, and natural ability.
This is fundamentally different from an independent samples t-test, where two separate groups of people are compared (say, one group receiving a treatment and a different group receiving a placebo). In that design, no subject appears in both groups, and there’s no built-in pairing between any two scores. The related samples approach is more powerful in many situations because each person serves as their own control, which strips away a lot of individual variation that would otherwise cloud the results.
Common Scenarios That Call for This Test
The most intuitive use is a before-and-after design. A physical therapist measures patients’ range of motion before a 6-week program, then measures it again afterward. A psychologist records anxiety scores before and after a mindfulness intervention. A teacher compares exam scores at the start and end of a semester. In each case, the same individuals are measured twice under different conditions.
The test also applies to matched-pairs designs. Imagine a study where researchers can’t measure the same person twice, so instead they carefully match participants into pairs based on age, sex, and health status, then assign one member of each pair to each condition. Because the pairing creates a deliberate link between the two scores, the data is still “related” rather than independent. Both the test-retest and matched-pairs approaches use the same statistical procedure.
What the Test Actually Calculates
The math centers on one simple idea: for each pair of scores, calculate the difference. If someone scored 72 on a pre-test and 81 on a post-test, their difference is 9. Do this for every participant, and you end up with a single column of difference scores. The test then asks whether the average of those differences is meaningfully different from zero.
The formula for the t-statistic is the mean of those differences divided by the standard error (which is the standard deviation of the differences divided by the square root of the sample size). A larger t-value means the average change is large relative to how much individual changes varied. The degrees of freedom equal n − 1, where n is the number of pairs, not the total number of individual measurements.
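That calculation can be sketched in a few lines of plain Python. The pre/post scores here are made-up illustration data, not from any real study:

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Related samples t-test: returns the t-statistic and degrees of freedom."""
    diffs = [b - a for a, b in zip(pre, post)]  # one difference score per pair
    n = len(diffs)                              # number of pairs, not measurements
    se = stdev(diffs) / math.sqrt(n)            # standard error of the mean difference
    return mean(diffs) / se, n - 1              # t-statistic, df = n - 1

# Hypothetical pre/post scores for five participants
pre  = [72, 80, 66, 74, 70]
post = [81, 84, 65, 80, 72]
t, df = paired_t(pre, post)  # difference scores: 9, 4, -1, 6, 2
```

Turning t into a p-value requires the t-distribution’s CDF, which the standard library doesn’t provide; in practice you would call something like `scipy.stats.ttest_rel(pre, post)`, which returns the t-statistic and p-value in one step.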
Comparing the t-statistic against a t-distribution with those degrees of freedom produces a p-value. If the p-value falls below a chosen threshold (typically 0.05), the conclusion is that the difference between conditions is statistically significant, meaning it’s unlikely to have occurred by chance alone.
The Null and Alternative Hypotheses
The null hypothesis states that the true average difference in the population is zero. In plain terms: whatever you’re testing (a treatment, an intervention, a time gap) had no real effect, and any observed change is just random noise.
The alternative hypothesis can take three forms. A two-tailed alternative says the average difference is simply not zero, in either direction. A one-tailed alternative specifies a direction: the difference is greater than zero, or less than zero. Your research question determines which version you use. If you’re testing whether a new drug lowers cholesterol, a one-tailed test makes sense. If you’re exploring whether a classroom exercise changes test scores but aren’t sure which direction, a two-tailed test is appropriate.
Assumptions Your Data Must Meet
The test requires several conditions to produce trustworthy results:
- Continuous outcome variable. The thing you’re measuring needs to be on an interval or ratio scale, like weight in kilograms or a score from 0 to 100. You can’t use it for ranked or categorical data.
- Paired observations. Each measurement in one condition must have a corresponding measurement in the other.
- Normal distribution of the differences. It’s not the raw scores that need to be normally distributed. It’s the difference scores. This is a common point of confusion. When checking for normality and outliers, you work with the difference column, not the original measurements.
- No extreme outliers in the differences. A single participant whose change is wildly different from everyone else’s can distort the results.
- Random sampling. The participants should represent the broader population you want to draw conclusions about.
The test is limited to comparing exactly two conditions. If you have three or more time points or conditions, you’ll need a different method, such as a repeated measures ANOVA.
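Because the normality and outlier checks apply to the difference column, a quick screen for extreme differences is worth running first. This sketch flags differences outside 1.5 × IQR of the quartiles, using made-up data in which one participant changed far more than the rest:

```python
from statistics import quantiles

def outlier_diffs(pre, post):
    """Flag difference scores lying outside 1.5 * IQR beyond the quartiles."""
    diffs = [b - a for a, b in zip(pre, post)]
    q1, _, q3 = quantiles(diffs, n=4)            # quartiles of the differences
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [d for d in diffs if d < lo or d > hi]

# Hypothetical data: the last participant's change (20) dwarfs the others'
pre  = [10, 12, 11, 14, 9, 13, 10]
post = [12, 15, 15, 17, 14, 15, 30]
flagged = outlier_diffs(pre, post)  # [20]
```

For a formal normality check on the same difference column, a test such as `scipy.stats.shapiro` is the usual next step.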
Measuring the Size of the Effect
A statistically significant result tells you that a difference exists, but not how big or meaningful it is. That’s where effect size comes in. The most common measure is Cohen’s d, which expresses the size of the difference in standardized units.
General benchmarks for interpreting Cohen’s d: 0.2 is considered a small effect, 0.5 is moderate, and 0.8 is large. A study might find a statistically significant improvement in test scores (p = 0.002), but if Cohen’s d is only 0.23, the actual size of the improvement is small. Both pieces of information matter. Statistical significance tells you the difference is unlikely to be chance; effect size tells you whether it’s practically important.
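One common form of Cohen’s d for paired data divides the mean difference by the standard deviation of the differences (sometimes written d_z). A minimal sketch, reusing made-up pre/post scores:

```python
from statistics import mean, stdev

def cohens_d(pre, post):
    """Cohen's d for paired data: mean of differences / SD of differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)

d = cohens_d([72, 80, 66, 74, 70], [81, 84, 65, 80, 72])
```

Note that other variants exist (for example, dividing by the pooled standard deviation of the raw scores); reports should say which denominator was used.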
How to Report Results
When writing up findings, standard practice includes the t-statistic, degrees of freedom, p-value, and effect size. A properly formatted result looks like this: t(179) = 3.10, p = .002, Cohen’s d = 0.23. Alongside this, report the means and standard deviations for both conditions so the reader can see the raw numbers. For instance, you might note that participants scored an average of 5.67 (SD = 1.24) in one condition and 5.83 (SD = 1.21) in the other.
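Assembling that reporting line is mostly string formatting. This sketch reproduces the example above, including the APA-style convention of dropping the leading zero from the p-value:

```python
def report(t, df, p, d):
    """Format a result in the conventional t(df) = ..., p = ..., d = ... style."""
    p_str = f"{p:.3f}".lstrip("0")  # APA style: no leading zero on p
    return f"t({df}) = {t:.2f}, p = {p_str}, Cohen's d = {d:.2f}"

line = report(3.10, 179, 0.002, 0.23)
# "t(179) = 3.10, p = .002, Cohen's d = 0.23"
```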
If the p-value is less than .05, you reject the null hypothesis and conclude the difference is statistically significant. If it’s greater than .05, you fail to reject the null hypothesis, meaning the data didn’t provide enough evidence of a real difference.
When the Assumptions Don’t Hold
If your difference scores are clearly not normally distributed, or you’re working with ranked data rather than continuous measurements, the related samples t-test isn’t appropriate. The standard alternative is the Wilcoxon signed-rank test. This nonparametric method doesn’t assume a normal distribution. It works by ranking the absolute values of the differences and then comparing positive and negative ranks, which makes it more robust when your data includes skewed distributions or outliers. A simpler but less powerful option is the sign test, which only looks at whether differences are positive or negative without considering their size. The Wilcoxon is generally preferred because it uses more of the information in your data.
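The ranking step at the heart of the Wilcoxon signed-rank test can be sketched in plain Python; in practice you would reach for `scipy.stats.wilcoxon`, which also supplies the p-value. This hand-rolled version returns the smaller of the positive and negative rank sums:

```python
def wilcoxon_w(diffs):
    """W statistic: rank |differences|, then take min of positive/negative rank sums."""
    d = [x for x in diffs if x != 0]  # zero differences are conventionally dropped
    abs_sorted = sorted(abs(x) for x in d)
    # Assign ranks to absolute values, averaging ranks across ties
    ranks = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        ranks[abs_sorted[i]] = avg_rank
        i = j
    w_plus = sum(ranks[abs(x)] for x in d if x > 0)
    w_minus = sum(ranks[abs(x)] for x in d if x < 0)
    return min(w_plus, w_minus)

w = wilcoxon_w([-2, 4, 3, -1, 5])  # positive ranks 3+4+5=12, negative 1+2=3
```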
Related Samples vs. Independent Samples
Choosing between these two t-tests comes down to one question: are the same people (or deliberately matched people) in both groups? If yes, use the related samples version. If two completely separate groups of participants each provide one measurement, use the independent samples version.
The practical advantage of a related design is statistical power. Because individual differences are controlled for (each person is compared to themselves), you can detect smaller effects with fewer participants. The tradeoff is that repeated testing can introduce its own complications, like practice effects or fatigue, depending on the study design. The degrees of freedom also differ: n − 1 for paired data versus a formula based on both group sizes for independent samples (typically n₁ + n₂ − 2 when variances are assumed equal).
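The two degrees-of-freedom formulas side by side, as a trivial sketch (equal variances assumed for the independent case):

```python
def df_paired(n_pairs):
    """Degrees of freedom for a related samples t-test."""
    return n_pairs - 1

def df_independent(n1, n2):
    """Degrees of freedom for an independent samples t-test, equal variances assumed."""
    return n1 + n2 - 2

df_paired(30)           # 29
df_independent(30, 30)  # 58
```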

