A statistical hypothesis is a formal, testable claim about a characteristic of a population. It’s the starting point of hypothesis testing, the process researchers use to decide whether the patterns they see in their data reflect something real or are just the result of random chance. Every statistical test revolves around two competing hypotheses, and the goal is to use sample data to figure out which one the evidence supports.
The Null and Alternative Hypotheses
Every hypothesis test has two parts: a null hypothesis and an alternative hypothesis. The null hypothesis is the default assumption that nothing interesting is going on. It states that there’s no difference between groups, no relationship between variables, or no effect from a treatment. The alternative hypothesis is what the researcher actually believes or is trying to demonstrate.
This setup might feel backward. If a scientist thinks a new drug works better than an old one, the alternative hypothesis says “the new drug is better,” while the null hypothesis says “there’s no difference.” The entire test is then designed to see whether the data provides enough evidence to reject that null hypothesis. Think of it like a courtroom: the null hypothesis is “innocent until proven guilty.” The researcher’s job is to present enough evidence to overturn that presumption.
In formal notation, the null hypothesis is written as H₀ and the alternative as Hₐ. For a clinical trial comparing two blood transfusion methods, for instance, H₀ might state that mortality is the same in both groups, while Hₐ states that there is a difference. The two hypotheses must be mutually exclusive: if one is true, the other must be false.
Why Samples Stand In for Populations
A population is the entire group you’re interested in, like all adults with high blood pressure or every student in a country’s school system. Studying every single person in a population is almost never practical, so researchers study a smaller sample instead. A statistical hypothesis is always a statement about the population, but it gets tested using data from the sample.
The numbers that describe a population (the true average blood pressure of all adults, for example) are called parameters, and they’re usually unknown. The numbers calculated from your sample are statistics, and they serve as estimates. Hypothesis testing uses those sample statistics to make inferences about the unknown population parameters. The whole framework is built on probability: how likely is it that the sample data you collected would look this way if the null hypothesis were actually true?
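The parameter–statistic distinction can be made concrete with a short simulation. This sketch (standard library only, with invented numbers) builds a simulated population so that the parameter is knowable, then draws a sample and watches the sample statistic estimate it:

```python
import random
from statistics import mean

random.seed(3)
# Simulated population: systolic blood pressure with a true mean near 128 mmHg.
# In a real study this parameter is unknown; simulating the population lets
# us watch a sample statistic estimate it. All numbers here are hypothetical.
population = [random.gauss(128, 15) for _ in range(100_000)]
true_mean = mean(population)            # the parameter (normally unknowable)

sample = random.sample(population, 100)
estimate = mean(sample)                 # the statistic (what we can compute)
print(f"parameter: {true_mean:.1f} mmHg, statistic (n=100): {estimate:.1f} mmHg")
```

The estimate lands close to the parameter but not exactly on it; that gap, and how large it tends to be, is exactly what the probability machinery of hypothesis testing reasons about.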
What Makes a Hypothesis Testable
Not every claim qualifies as a statistical hypothesis. To be useful, it has to be testable with available methods and data, and it must be possible to design an ethical study around it. Some famous hypotheses meet this bar clearly: Edward Jenner’s eighteenth-century claim that cowpox infection protects against smallpox, or Barry Marshall and Robin Warren’s hypothesis that the bacterium Helicobacter pylori causes stomach ulcers (a discovery that earned them a Nobel Prize in 2005).
Other hypotheses are harder to test in practice. The idea that smoking causes lung cancer, for instance, can’t be tested by randomly assigning people to start smoking, because that would be unethical. Researchers had to rely on observational studies instead. A testable statistical hypothesis needs to be specific enough to produce measurable predictions and ethical enough to study directly or indirectly.
P-Values and Significance Levels
Once you’ve collected your data, the next step is calculating a p-value. A p-value tells you how likely it is that your results (or something more extreme) would occur if the null hypothesis were true. A small p-value means the data would be very surprising under the null hypothesis, which is evidence against it.
Before running the test, researchers set a significance level, called alpha. This is the threshold for deciding whether the p-value is small enough to reject the null hypothesis. If the p-value falls at or below alpha, the null hypothesis is rejected in favor of the alternative. If it’s above alpha, the null hypothesis is not rejected.
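This decision rule can be sketched end to end with a permutation test, which computes a p-value by simulation rather than formula. The group names and numbers below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical data: outcome scores for a control group and a treatment group.
control = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.7, 5.1]
treatment = [5.4, 5.6, 5.3, 5.5, 5.2, 5.7, 5.4, 5.6]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# Permutation test: if H0 ("no difference") is true, the group labels are
# interchangeable, so we reshuffle them many times and count how often a
# difference at least as extreme as the observed one arises by chance.
pooled = control + treatment
n_control = len(control)
n_extreme = 1          # the observed labeling itself counts as one arrangement
n_perms = 10_000
for _ in range(n_perms):
    random.shuffle(pooled)
    diff = (sum(pooled[n_control:]) / len(treatment)
            - sum(pooled[:n_control]) / len(control))
    if abs(diff) >= abs(observed):   # two-tailed: extreme in either direction
        n_extreme += 1

p_value = n_extreme / (n_perms + 1)
alpha = 0.05
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Because the two invented groups barely overlap, almost no reshuffling produces a gap as large as the observed one, so the p-value comes out far below alpha and the null hypothesis is rejected.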
The most common alpha level is 0.05, a convention that dates back to the statistician Ronald Fisher. This means the researcher accepts a 5% chance of rejecting the null hypothesis when it’s actually true. But 0.05 isn’t a universal rule. Some fields use stricter cutoffs like 0.01, 0.005, or even 0.001 to reduce the chance of false positives. A group of 72 researchers published a widely discussed paper arguing that the standard should shift to 0.005. The best alpha level depends on the context, particularly on how common true effects are in the area being studied.
One-Tailed vs. Two-Tailed Tests
When setting up your alternative hypothesis, you choose between a one-tailed and a two-tailed test. A two-tailed test checks for a difference in either direction. If you’re comparing two medications, a two-tailed test asks “is there any difference?” without specifying which drug should come out ahead. A one-tailed test only looks in one direction: “is Drug A better than Drug B?”
One-tailed tests have more statistical power to detect an effect in the specified direction, because they aren’t spending any of their detection ability on the opposite direction. But they come with a strict rule: you must choose the direction before looking at the data. Switching to a one-tailed test after a two-tailed test fails to find significance is considered inappropriate, no matter how close the result was.
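The power difference comes down to how alpha is spent. A quick sketch using Python's standard library shows the arithmetic (the z value of 1.8 is an invented test statistic): the same result that clears the 0.05 bar one-tailed can miss it two-tailed, because the two-tailed test splits its rejection region between both directions.

```python
from statistics import NormalDist

z = 1.8  # hypothetical standardized test statistic from some study

upper_tail = 1 - NormalDist().cdf(z)  # probability of a result this extreme
p_one_tailed = upper_tail             # extreme in the predicted direction only
p_two_tailed = 2 * upper_tail         # extreme in either direction

print(f"one-tailed p = {p_one_tailed:.4f}")  # ≈ 0.036, below alpha = 0.05
print(f"two-tailed p = {p_two_tailed:.4f}")  # ≈ 0.072, above alpha = 0.05
```

This is exactly why the direction must be fixed in advance: a borderline two-tailed result can always be "rescued" by halving the p-value after the fact, which silently doubles the false-positive rate.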
Type I and Type II Errors
Because hypothesis testing is based on probability, it can go wrong in two ways. A Type I error (false positive) happens when you reject the null hypothesis even though it’s actually true. You conclude there’s an effect when there isn’t one. A Type II error (false negative) is the opposite: you fail to reject the null hypothesis when there really is an effect, missing something real.
The courtroom analogy works well here. A Type I error is convicting an innocent person. A Type II error is letting a guilty person go free. The alpha level you set directly controls your Type I error rate: an alpha of 0.05 means you’re accepting up to a 5% chance of a false positive. Reducing alpha makes Type I errors less likely but typically makes Type II errors more likely, since it becomes harder to detect real effects. Researchers balance these two risks based on the consequences of each type of mistake.
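The claim that alpha directly controls the Type I error rate can be checked by simulation. This sketch runs many studies in which the null hypothesis is true by construction (both groups drawn from the same distribution) and counts how often a z-test falsely rejects it; the names and parameters are chosen for illustration:

```python
import random
from statistics import NormalDist, mean

random.seed(1)
alpha = 0.05
n_sims, n = 2_000, 30
false_positives = 0

for _ in range(n_sims):
    # Both groups come from the same distribution, so H0 is true by design:
    # any rejection here is a Type I error.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # z-test on the difference in means (the variance is known to be 1 here,
    # so the standard error of the difference is sqrt(2/n))
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p <= alpha:
        false_positives += 1

rate = false_positives / n_sims
print(f"Type I error rate across {n_sims} simulated studies: {rate:.3f}")
```

The observed rate hovers around 0.05, matching alpha, and lowering alpha in the sketch lowers it proportionally; the cost, not shown here, is that real effects become harder to detect.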
Statistical Significance vs. Practical Significance
A statistically significant result doesn’t necessarily mean the result matters in the real world. Statistical significance tells you whether an effect is likely to be real, but it says nothing about how large or meaningful that effect is. A study with thousands of participants can find a statistically significant difference that’s so tiny it has no practical value.
This is where effect size comes in. Effect size measures the magnitude of a difference or relationship, independent of sample size. A blood pressure drug might lower readings by a statistically significant amount, but if that amount is only 0.5 points, it’s clinically meaningless. The p-value confirms the difference isn’t due to chance; the effect size tells you whether anyone should care. Both pieces of information are essential for interpreting results.
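The divergence between significance and size is easy to demonstrate. The sketch below uses Cohen's d, one common standardized effect size, on invented data where the true difference is a negligible 0.03 standard deviations but each group has 100,000 observations:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2)
n = 100_000  # a very large sample in each group
# Hypothetical scenario: the treatment shifts the outcome by only 0.03
# standard deviations -- a difference far too small to matter in practice.
control = [random.gauss(0.00, 1.0) for _ in range(n)]
treatment = [random.gauss(0.03, 1.0) for _ in range(n)]

diff = mean(treatment) - mean(control)
pooled_sd = ((stdev(control) ** 2 + stdev(treatment) ** 2) / 2) ** 0.5
cohens_d = diff / pooled_sd   # effect size: the difference in SD units

z = diff / (pooled_sd * (2 / n) ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"p = {p_value:.2g}, Cohen's d = {cohens_d:.3f}")
# p comes out tiny even though d sits far below the conventional
# "small effect" benchmark of 0.2.
```

Reporting both numbers tells the full story: the p-value says the difference is almost certainly real, while the effect size says it is almost certainly unimportant.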
The American Statistical Association has emphasized this point directly, noting that a p-value near 0.05 “taken by itself offers only weak evidence against the null hypothesis.” Any effect, no matter how tiny, can produce a small p-value if the sample is large enough. And a large p-value doesn’t prove the null hypothesis is true; it simply means the data didn’t provide enough evidence to reject it. Context, effect size, and study design all matter as much as the p-value itself.

