What Is the Goal of a Hypothesis Test?

The goal of a hypothesis test is to use data from a sample to make a reliable inference about an entire population. More specifically, it gives you a structured way to decide whether the patterns you observe in your data reflect something real or are just the result of random chance. If you measure a difference between two groups or spot a relationship between two variables, hypothesis testing helps you determine whether that finding is strong enough to generalize beyond the data you collected.

How the Two Hypotheses Work

Every hypothesis test starts by setting up two competing statements. The null hypothesis represents the status quo: nothing interesting is happening. There’s no difference between groups, no relationship between variables, no effect from the treatment. The alternative hypothesis is the researcher’s actual claim, the thing they suspect is true and want the data to support.

A few examples make this concrete. If you’re studying whether a pay gap exists between male and female factory workers, the null hypothesis says there’s no difference in salary based on gender, while the alternative says there is. If you’re investigating whether job experience affects the quality of a brick mason’s work, the null says experience has no impact, and the alternative says it does. The test never “proves” the alternative hypothesis. Instead, it asks whether the data are so inconsistent with the null hypothesis that you should reject it.

What a P-Value Actually Tells You

The p-value is the number that drives the decision. It measures how incompatible your observed data are with the null hypothesis. A small p-value means that the results you got would be very unlikely if the null hypothesis were true. A large p-value means your data fit comfortably within what you’d expect under the null.
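A tiny illustrative calculation (the coin-flip numbers here are hypothetical, not from the text) shows what "unlikely if the null were true" means. Suppose you flip a coin 100 times and get 60 heads; the one-sided p-value is the probability of seeing a result at least that extreme if the coin were actually fair:

```python
from math import comb

# Hypothetical example: did 60 heads in 100 flips come from a fair coin?
# Under the null hypothesis (a fair coin, p = 0.5), the p-value is the
# probability of observing 60 or more heads purely by chance.
n, observed_heads = 100, 60
p_value = sum(comb(n, k) for k in range(observed_heads, n + 1)) / 2**n

print(f"p-value: {p_value:.4f}")  # small: 60 heads would be surprising from a fair coin
```

A small p-value here doesn't prove the coin is biased; it says only that the observed data sit far out in the tail of what a fair coin would produce.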

Before collecting data, you pick a threshold called the significance level (alpha). Most studies use 0.05, meaning the researcher accepts a 5% chance of rejecting a null hypothesis that is actually true. If the p-value falls below that threshold, the result is called statistically significant and you reject the null hypothesis. If it doesn't, you fail to reject it. Some fields use stricter thresholds: an alpha of 0.01, for instance, demands 99% confidence rather than 95%.
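The decision rule itself is mechanical, and the choice of alpha can flip the verdict. This sketch (with an illustrative p-value of 0.03) shows the same result counted as significant at 0.05 but not at the stricter 0.01:

```python
# Illustrative: a single p-value evaluated against two common alpha levels.
p_value = 0.03  # hypothetical result, chosen to sit between the thresholds

decisions = {}
for alpha in (0.05, 0.01):
    decisions[alpha] = "reject" if p_value < alpha else "fail to reject"
    print(f"alpha = {alpha}: {decisions[alpha]} the null hypothesis")
```

This is why alpha must be fixed before looking at the data: choosing the threshold afterward lets you pick whichever verdict you prefer.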

The American Statistical Association has published six principles clarifying what p-values can and can't do. Among the most important: a p-value does not tell you the probability that the hypothesis is true. It also doesn't measure the size or importance of an effect. And scientific conclusions should never hinge solely on whether a p-value crosses a threshold. These are common misunderstandings, even among researchers, and they matter because misreading a p-value can lead to overconfident or misleading conclusions.

The Five Steps of a Hypothesis Test

Penn State’s widely taught framework breaks hypothesis testing into five steps:

  • Check assumptions and write hypotheses. You verify that your data meet the requirements of the statistical test you plan to use, then formally state the null and alternative hypotheses.
  • Compute the test statistic. This converts your sample data into a single number that measures how far your results are from what the null hypothesis predicts.
  • Determine the p-value. Using the test statistic, you calculate the probability of getting results at least as extreme as yours if the null hypothesis were true.
  • Make a decision. Compare the p-value to your pre-set alpha. If p is smaller, reject the null hypothesis. If not, fail to reject it.
  • State a real-world conclusion. Translate the statistical decision back into plain language about your original research question.
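
The five steps can be walked through end to end in code. This is a sketch with hypothetical numbers (a filling machine targeting 500 g, a known population standard deviation of 12 g); the known-sigma assumption is what makes a one-sample z-test appropriate in step 1:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Step 1: check assumptions and write hypotheses.
# H0: mu = 500 g   H1: mu != 500 g  (two-sided test)
# We assume sigma is known (12 g), so a one-sample z-test applies.
mu0, sigma = 500.0, 12.0
sample_mean, n = 504.5, 40   # hypothetical sample results
alpha = 0.05                 # fixed before collecting data

# Step 2: compute the test statistic.
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Step 3: determine the p-value (two-sided: results at least this extreme).
p_value = 2 * (1 - normal_cdf(abs(z)))

# Step 4: make a decision by comparing p to alpha.
reject = p_value < alpha

# Step 5: state a real-world conclusion.
print(f"z = {z:.2f}, p = {p_value:.4f}")
if reject:
    print("Evidence that the machine's mean fill differs from 500 g.")
else:
    print("Not enough evidence that the mean fill differs from 500 g.")
```

Note that step 5 deliberately says "not enough evidence" rather than "the mean is 500 g," matching the language point below.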

The language here is intentional. You never “accept” the null hypothesis. You either reject it or fail to reject it. Failing to reject simply means you didn’t find enough evidence against it, not that it’s been confirmed.

Two Ways the Test Can Go Wrong

Because you’re drawing conclusions about a whole population from a limited sample, mistakes are always possible. These come in two forms.

A Type I error (false positive) happens when you reject the null hypothesis even though it’s actually true. You conclude there’s an effect when there isn’t one. The probability of making this error is your alpha level. At the standard 0.05 threshold, you accept a 5% chance of a false positive.

A Type II error (false negative) happens when you fail to reject the null hypothesis even though it’s actually false. You miss a real effect. The probability of this error is called beta, and researchers typically set it at 0.20 or lower, giving the test at least 80% power to detect a true effect.
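The claim that alpha *is* the false-positive rate can be checked by simulation. This sketch repeatedly draws samples from a population where the null is exactly true and counts how often a z-test at alpha = 0.05 rejects anyway; those rejections are all Type I errors:

```python
import random
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Simulate a world where the null hypothesis is TRUE (mean 0, sd 1)
# and see how often a test at alpha = 0.05 still rejects it.
random.seed(42)
alpha, n, trials = 0.05, 30, 10_000
false_positives = 0

for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    z = mean / (1 / sqrt(n))          # z-test against mu = 0, sigma = 1 known
    p = 2 * (1 - normal_cdf(abs(z)))  # two-sided p-value
    if p < alpha:
        false_positives += 1          # rejected a true null: Type I error

rate = false_positives / trials
print(f"Type I error rate: {rate:.3f}")  # close to 0.05, as alpha promises
```

Estimating beta works the same way, except you simulate a world where the null is false and count how often the test fails to reject.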

You can reduce both types of errors by increasing your sample size. Larger samples are less likely to differ substantially from the population they represent. But in practice, researchers have to balance these error rates against the cost and feasibility of collecting more data. When the consequences of a false positive are severe, like approving an ineffective drug, researchers use a stricter alpha (0.01 or lower). When missing a real effect would be costly, they prioritize lowering beta instead.

Why Statistical Significance Isn’t the Whole Story

A result can be statistically significant without being practically meaningful. Statistical significance is directly influenced by sample size: with enough data, even a tiny difference will produce a small p-value. Practical significance, by contrast, depends on whether the difference is large enough to matter in the real world.

Consider this example from Penn State. Researchers wanted to know if SAT math scores at a particular college were higher than the national average of 500. They sampled 1,200 students and found a mean score of 506. The p-value was 0.019, well below the 0.05 threshold, so the result was statistically significant. But a 6-point difference on a test with a standard deviation of 100 is just 0.06 standard deviations. In practical terms, that gap is negligible.
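Using the numbers from the SAT example, a short calculation reproduces both the p-value and the standardized effect size side by side (this assumes a one-sided, upper-tail z-test with the stated standard deviation of 100, which matches the reported 0.019):

```python
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# SAT example: n = 1200 students, sample mean 506, national average 500,
# standard deviation 100. One-sided test of "higher than the national average."
mu0, sd, n, sample_mean = 500, 100, 1200, 506

z = (sample_mean - mu0) / (sd / sqrt(n))
p_value = 1 - normal_cdf(z)              # statistically significant...
effect_size = (sample_mean - mu0) / sd   # ...but the standardized effect is tiny

print(f"p = {p_value:.3f}, effect size = {effect_size:.2f} standard deviations")
```

The large sample drives the p-value below 0.05 even though the effect is only 0.06 standard deviations, which is exactly the gap between statistical and practical significance.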

Similarly, imagine a six-month weight-loss program that produces a statistically significant but extremely small average weight loss. The math says the effect is real, but the effect is too small for anyone to care about. This is why researchers increasingly report effect sizes alongside p-values. The hypothesis test tells you whether an effect exists. The effect size tells you whether it matters.

How This Works in Real Research

Hypothesis testing is the backbone of clinical trials, public health studies, and virtually every field that collects data. In one well-known example, the Diabetes Control and Complications Trial enrolled 1,441 people with type 1 diabetes and randomly assigned them to either intensive blood glucose monitoring or conventional treatment. The hypothesis: intensive therapy reduces the risk of microvascular complications like kidney disease and nerve damage. The trial ended in 1993 when hypothesis testing revealed a statistically significant reduction in complications for the intensive therapy group, a finding that changed the standard of care for diabetes management worldwide.

The same logic applies to simpler questions. Does a new fertilizer increase crop yield? Do students in smaller classes score higher on standardized tests? Does a website redesign increase the percentage of visitors who make a purchase? In every case, hypothesis testing provides a formal framework for deciding whether the observed results are strong enough to act on, or whether they could easily have happened by chance.