What Is a Hypothesis Test and How Does It Work?

A hypothesis test is a statistical procedure used to determine whether a difference or pattern observed in data reflects something real or could have occurred by chance alone. It’s the formal method behind most scientific conclusions, from determining whether a new drug works better than an existing one to deciding whether a website redesign actually increases sales. The core logic is surprisingly simple: assume nothing is happening, then check whether your data is weird enough to challenge that assumption.

The Basic Logic

Every hypothesis test starts with a default position called the null hypothesis. The null hypothesis states that there is no real difference, no real effect, no real relationship between the things you’re comparing. A new medication doesn’t work better than a placebo. A teaching method doesn’t improve test scores. Two groups are essentially the same.

The alternative hypothesis is the opposite claim: something IS going on. The drug does reduce symptoms. The teaching method does raise scores. These two hypotheses are mutually exclusive, and the entire test is built around one question: is your data unlikely enough, under the assumption that the null hypothesis is true, to justify rejecting that null hypothesis?

This is a subtle but important distinction. You’re not directly proving the alternative hypothesis. You’re asking whether the evidence is strong enough to rule out the “nothing is happening” explanation. If it is, you reject the null. If it isn’t, you “fail to reject” the null, which is not the same as proving the null is true. It just means you don’t have enough evidence to dismiss it.

What a P-Value Actually Tells You

The p-value is the number that drives most hypothesis testing decisions. It represents the probability of seeing results as extreme as yours (or more extreme) if the null hypothesis were actually true. A p-value of 0.03, for instance, means there’s only a 3% chance you’d see data at least this extreme in a world where nothing real is happening.
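This definition can be made concrete with a simulation. The sketch below uses invented numbers (not from the text above): we observe 60 heads in 100 coin flips and estimate the p-value empirically by asking how often a genuinely fair coin produces a result at least that extreme.

```python
import random

# Illustrative simulation (numbers invented): we observed 60 heads in
# 100 flips and estimate the p-value under the fair-coin null hypothesis.
random.seed(42)

observed_heads = 60
n_flips = 100
n_sims = 20_000

# Simulate experiments in a world where the null is true (p = 0.5) and
# count how often chance alone produces a result at least this extreme.
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):  # two-sided "as extreme"
        extreme += 1

p_value = extreme / n_sims
print(f"empirical p-value: {p_value:.4f}")
```

The exact binomial answer for this example is about 0.057, just above the 0.05 cutoff, so 60 heads in 100 flips would not quite count as statistically significant under the convention described below.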

The conventional cutoff is 0.05, or 5%. If your p-value falls at or below 0.05, the result is considered “statistically significant,” and you reject the null hypothesis. If it’s above 0.05, you don’t. This 0.05 threshold traces back to the early 20th century, when the statistician Ronald Fisher suggested it as a convenient dividing line. Other fields use stricter thresholds: some areas of drug regulation have required p-values below 0.01 or even 0.001 before accepting a single study as strong evidence.

A critical point that’s easy to miss: a p-value does not tell you the probability that your hypothesis is correct. It only tells you how surprising your data would be under the null hypothesis. A large p-value also doesn’t mean the null hypothesis is true. Many other explanations could be equally consistent with the observed data. The American Statistical Association has emphasized that a p-value alone is not a complete analysis, and that other approaches like confidence intervals should accompany it when possible.

The Five Steps of a Hypothesis Test

While specific tests vary in their math, the general procedure follows five steps:

  • State your hypotheses. Define the null hypothesis (no effect) and the alternative hypothesis (there is an effect). Decide on a significance level, typically 0.05.
  • Check your assumptions. Different tests require different conditions. Some need data that follows a bell curve distribution. Others require minimum sample sizes or a certain type of variable.
  • Compute a test statistic. This is a single number that summarizes how far your observed data deviates from what the null hypothesis would predict.
  • Find the p-value. Using the test statistic, calculate the probability of seeing a result this extreme if the null hypothesis were true.
  • Make a decision and state a conclusion. If p is less than or equal to your significance level, reject the null. If p is greater, fail to reject it. Then translate that decision back into plain language that answers the original research question.
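The five steps can be sketched in code. This is a minimal illustration with made-up data, using a normal approximation in place of the exact t-distribution (the Python standard library has no t CDF; a real analysis of a small sample would use a t-test):

```python
import math
import statistics

# Step 1: hypotheses. Null: the true mean is 100. Alternative: it is not.
mu_null = 100.0
alpha = 0.05

# Step 2: assumptions. Hypothetical sample, assumed to be independent
# observations; n = 30 is large enough for a rough normal approximation.
data = [102.1, 99.3, 105.0, 98.7, 101.2, 103.8, 100.5, 104.2,
        97.9, 102.6, 101.1, 99.8, 103.3, 100.9, 102.4, 98.4,
        104.7, 101.7, 100.2, 103.1, 99.5, 102.9, 101.4, 100.8,
        103.6, 99.1, 102.2, 101.9, 100.4, 104.0]

# Step 3: test statistic, the distance between the sample mean and the
# null value measured in standard-error units.
n = len(data)
mean = statistics.fmean(data)
se = statistics.stdev(data) / math.sqrt(n)
z = (mean - mu_null) / se

# Step 4: two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

# Step 5: decision, translated back into a conclusion.
decision = "reject the null" if p_value <= alpha else "fail to reject the null"
print(f"z = {z:.2f}, p = {p_value:.6f}: {decision}")
```

Here the sample mean sits several standard errors above 100, so the p-value is tiny and the null is rejected.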

Type I and Type II Errors

Because hypothesis tests deal in probabilities rather than certainties, two kinds of mistakes are possible. A Type I error (false positive) happens when you reject the null hypothesis even though it’s actually true. You conclude the drug works, but it doesn’t. When the null hypothesis is true, the probability of making this error equals your significance level. At the standard 0.05 threshold, you accept a 5% chance of a false positive.

A Type II error (false negative) happens when you fail to reject the null hypothesis even though it’s actually false. The drug really does work, but your test didn’t pick it up. Study designers typically aim to keep this error probability (called beta) between 10% and 20%, which corresponds to 80% to 90% statistical power. These two errors trade off against each other: making it harder to commit a false positive (by using a stricter cutoff like 0.01) makes it easier to miss a real effect, and vice versa. Choosing where to set these thresholds depends on the stakes. In criminal justice terms, a Type I error convicts an innocent person, while a Type II error lets a guilty person go free.
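The claim that the significance level equals the false-positive rate can be checked by simulation. In this sketch (invented parameters), the null hypothesis is true in every simulated experiment, so every rejection is by definition a Type I error:

```python
import math
import random

random.seed(0)

def two_sided_p(data, mu_null):
    """Normal-approximation p-value for the null: true mean == mu_null."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    z = (mean - mu_null) / math.sqrt(var / n)
    return math.erfc(abs(z) / math.sqrt(2))

n_experiments = 5_000
false_positives = 0
for _ in range(n_experiments):
    sample = [random.gauss(100, 15) for _ in range(50)]  # null is TRUE here
    if two_sided_p(sample, 100) <= 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(f"false-positive rate: {rate:.3f}")
```

The observed rate lands close to 0.05 (slightly above, because this sketch uses a normal rather than a t approximation), matching the significance level chosen for the test.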

Choosing the Right Test

The specific hypothesis test you use depends on your data and your question. The choice comes down to a few key factors: how many groups you’re comparing, what kind of data you have, and whether your observations are independent of each other.

If you’re comparing the average of one group to a known value and your data is numerical, a one-sample t-test is the standard choice. Comparing the averages of two independent groups (say, a treatment group and a control group) calls for a two-sample t-test. If the same people are measured twice, before and after a treatment for example, you’d use a paired t-test instead. For comparing three or more groups at once, ANOVA (analysis of variance) extends the same logic.
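One way to see the two-independent-groups comparison without any distribution tables is a permutation test, which reshuffles the group labels to build the null distribution directly. The measurements below are invented; a classical analysis would run a two-sample t-test instead (for example, scipy.stats.ttest_ind):

```python
import random
import statistics

random.seed(1)

# Hypothetical outcome scores for two independent groups.
treatment = [23.1, 25.4, 22.8, 26.0, 24.5, 25.9, 23.7, 26.3, 24.9, 25.2]
control = [21.0, 22.5, 20.8, 23.1, 21.9, 22.2, 20.5, 23.4, 21.6, 22.8]

observed = statistics.fmean(treatment) - statistics.fmean(control)

# Under the null hypothesis the labels are arbitrary, so shuffle them and
# see how often chance produces a gap as large as the observed one.
pooled = treatment + control
n_treat = len(treatment)
n_perms = 10_000
extreme = 0
for _ in range(n_perms):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[:n_treat]) - statistics.fmean(pooled[n_treat:])
    if abs(diff) >= abs(observed):  # two-sided
        extreme += 1

p_value = extreme / n_perms
print(f"observed difference: {observed:.2f}, permutation p = {p_value:.4f}")
```

With a gap this large relative to the within-group spread, essentially no random relabeling matches it, so the permutation p-value is near zero and the null would be rejected.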

When your data is categorical rather than numerical, you’re counting things that fall into categories rather than measuring quantities. A chi-square test handles this. It’s the go-to for questions like whether the proportion of people who prefer Brand A differs across age groups, or whether a coin lands on heads more often than chance would predict.
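The chi-square statistic itself is simple arithmetic. Below is a hand-computed sketch for the fair-coin question with invented counts; in practice a library routine such as scipy.stats.chisquare does this. With two categories there is one degree of freedom, and the chi-square tail probability reduces to a complementary error function:

```python
import math

# Null hypothesis: the coin is fair, so both counts should be n/2.
observed = {"heads": 62, "tails": 38}
n = sum(observed.values())
expected = {side: n / 2 for side in observed}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((observed[s] - expected[s]) ** 2 / expected[s] for s in observed)

# For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)),
# because a chi-square(1) variable is a squared standard normal.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Here the statistic is 5.76 and the p-value is about 0.016, so 62 heads in 100 tosses would be judged inconsistent with a fair coin at the 0.05 level.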

How It Works in Practice

In a clinical trial, researchers might test whether a new drug reduces symptoms better than an existing one. The null hypothesis is that both drugs perform the same. If 100 patients receive each drug and patients on the existing drug turn out to be 2.1 times as likely to experience symptoms, with a p-value of 0.02, the researchers would reject the null hypothesis at the 0.05 level and conclude there’s a statistically significant difference.

The same framework drives A/B testing in business. A company changes its checkout page design and randomly shows half of visitors the old version, half the new one. The null hypothesis: conversion rates are identical. If the new page converts at a higher rate and the p-value comes in below 0.05, the company has statistical grounds for rolling out the change. If the p-value is 0.15, the observed difference could easily be random noise, and the evidence isn’t strong enough to justify the switch.
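The A/B comparison is often analyzed as a two-proportion z-test. A sketch with invented conversion numbers (5,000 visitors per variant):

```python
import math

# Hypothetical checkout data: conversions and visitors per variant.
conv_old, n_old = 400, 5000   # 8.0% conversion on the old page
conv_new, n_new = 460, 5000   # 9.2% conversion on the new page

p_old = conv_old / n_old
p_new = conv_new / n_new

# Pooled proportion under the null hypothesis that both rates are equal.
p_pool = (conv_old + conv_new) / (n_old + n_new)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p_new - p_old) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

For these numbers the p-value comes out around 0.03, below the 0.05 threshold, so the company would have statistical grounds to roll out the new page.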

In both cases, the quality of the study design matters as much as the p-value itself. A result from a well-designed randomized trial, where bias has been carefully minimized, carries far more weight than the same p-value from a study where the groups weren’t truly comparable. The number is only as trustworthy as the process that produced it.

Common Misinterpretations

The most widespread misunderstanding is treating a p-value of 0.03 as meaning “there’s a 97% chance the alternative hypothesis is true.” That’s not what it means. It means your data would be this unusual only 3% of the time if nothing real were happening. The distinction matters because the actual probability that your hypothesis is correct depends on factors the p-value doesn’t account for, like how plausible the hypothesis was before you tested it.
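A back-of-the-envelope calculation shows why the prior plausibility matters. Under illustrative assumptions (only 10% of tested hypotheses are actually true, 80% power, a 0.05 threshold), a surprisingly large share of "significant" findings are still false positives:

```python
# Invented numbers for illustration: 1,000 hypothesis tests, of which
# only 10% concern effects that actually exist.
n_tests = 1000
true_effects = 100
no_effects = n_tests - true_effects

true_positives = true_effects * 0.80   # 80% power: real effects detected
false_positives = no_effects * 0.05    # Type I errors among the true nulls

significant = true_positives + false_positives
share_false = false_positives / significant
print(f"{share_false:.0%} of 'significant' results are false positives")
```

Even though every individual result cleared p <= 0.05, 45 of the 125 significant findings (36%) are false positives, far more than the naive "5% chance of being wrong" reading would suggest.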

Another common mistake is interpreting “fail to reject the null” as proof that no effect exists. All it means is that your data wasn’t extreme enough to clear the statistical bar. The effect might be real but too small for your sample size to detect. Similarly, statistical significance doesn’t automatically mean practical significance. A study with thousands of participants can produce a highly significant p-value for a difference so tiny it has no real-world consequence. The size of the effect matters just as much as whether the test flagged it.