Conducting a Large Sample Test of H₀

When conducting a large sample test of H₀ (the null hypothesis), you use the standard normal (Z) distribution rather than the t-distribution to determine whether your data provide enough evidence to reject H₀. A common rule of thumb treats a sample size of around 30 or more as “large”: at that point the sampling distribution of the mean is usually close to a normal curve, although heavily skewed or heavy-tailed populations can need more data. This simplifies the math and opens up a straightforward framework for testing claims about population means and proportions.

Why Sample Size of 30 Matters

The Central Limit Theorem is the engine behind large sample testing. It states that as your sample size grows, the distribution of sample means becomes approximately normal, even if the underlying population is skewed or irregular. By a sample size of about 30, the t-distribution (used for smaller samples) is also close to the standard normal distribution: the two-sided 5% critical value is about 2.05 with 29 degrees of freedom, compared with 1.96 for Z. That means you can use Z-scores and normal probability tables to calculate your p-values without worrying much about the shape of the original population.
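A small simulation can make the theorem concrete. The sketch below is only an illustration under stated assumptions: the Exponential(1) population, the seed, and the helper names `skewness` and `sample_means` are choices made here, not part of the original discussion. It compares the skewness of raw draws with the skewness of means of samples of size 30.

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
raw_draws = [rng.expovariate(1.0) for _ in range(20_000)]  # strongly right-skewed
means_n30 = sample_means(30, 5_000, rng)

print("skewness of raw draws:   ", round(skewness(raw_draws), 2))
print("skewness of means (n=30):", round(skewness(means_n30), 2))
```

The raw draws should show a skewness near 2 (the theoretical value for an exponential distribution), while the sample means should be much closer to symmetric, though not perfectly so at n = 30.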

This matters because many real-world datasets aren’t perfectly bell-shaped. Income data skews right, reaction times also skew right (with a long tail of slow responses), and medical measurements can follow all sorts of patterns. The Central Limit Theorem lets you sidestep those complications once your sample is large enough.

How the Test Works

The basic structure of a large sample hypothesis test follows a consistent pattern. You start with a null hypothesis (H₀), which is a default claim about a population parameter, such as “the average blood pressure in this group is 120 mmHg” or “the proportion of defective parts is 3%.” You then collect data and calculate a test statistic that measures how far your sample result falls from what H₀ predicts.

For a large sample test of a population mean, the test statistic is a Z-score: the difference between your sample mean and the hypothesized mean, divided by the standard error. The standard error shrinks as your sample size grows, which means even small departures from H₀ become easier to detect with more data.
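As a rough sketch of that calculation, the function below computes the Z-score and a two-sided p-value using only the standard library. The function name `z_test_mean` and the sample figures (a sample mean of 122 mmHg, a standard deviation of 10, and n = 100) are made up for illustration; only the hypothesized mean of 120 mmHg comes from the example above.

```python
import math

def z_test_mean(sample_mean, mu0, sd, n):
    """Large sample Z-test for a mean; returns (z, two-sided p-value)."""
    standard_error = sd / math.sqrt(n)       # shrinks as n grows
    z = (sample_mean - mu0) / standard_error
    # Two-sided tail area of the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: H0 says the mean is 120 mmHg
z, p = z_test_mean(sample_mean=122.0, mu0=120.0, sd=10.0, n=100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")  # z = 2.00, two-sided p = 0.0455
```

With the same 2 mmHg gap and n = 400, the standard error halves and the Z-score doubles, which is the sense in which more data makes small departures easier to detect.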

For proportions, the logic is the same, but there’s an additional requirement: both np₀ and n(1 – p₀) must be 10 or more, where p₀ is the proportion stated in H₀. This ensures there are enough expected successes and failures for the normal approximation to hold. If you’re testing whether a coin is fair (p₀ = 0.5) with 100 flips, both 50 and 50 easily clear that bar. If you’re testing a rare event where p₀ = 0.01, you’d need at least 1,000 observations.
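The check and the test can be sketched together. The names `normal_approx_ok` and `z_test_proportion` are hypothetical helpers, and the 60-heads-in-100-flips data is invented for the example; the threshold of 10 is the one stated above.

```python
import math

def normal_approx_ok(n, p0, minimum=10):
    """True when both expected successes and failures reach the minimum."""
    return n * p0 >= minimum and n * (1 - p0) >= minimum

def z_test_proportion(successes, n, p0):
    """Large sample Z-test for a proportion; returns (z, two-sided p-value)."""
    if not normal_approx_ok(n, p0):
        raise ValueError("np0 and n(1 - p0) must both be at least 10")
    p_hat = successes / n
    # The standard error uses p0 because the test assumes H0 is true
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 60 heads in 100 flips, H0: the coin is fair
z, p = z_test_proportion(successes=60, n=100, p0=0.5)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")

# A rare event with too few observations fails the check
print(normal_approx_ok(n=500, p0=0.01))  # False: only 5 expected successes
```

Raising an error when the condition fails is one design choice among several; an exact binomial test would be the usual fallback in that situation.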

Setting the Significance Level

Before running the test, you choose a significance level (alpha), which is the probability of rejecting H₀ when it’s actually true. The traditional cutoff is 0.05, meaning you accept a 5% chance of a false positive. Your p-value then tells you the probability of seeing results at least as extreme as yours if H₀ were true. If the p-value falls below alpha, you reject H₀.

In fields that run many simultaneous tests, like genomics, researchers have argued for stricter thresholds. A 72-author paper proposed moving the default to 0.005 instead of 0.05, because the rate of false positives drops substantially at every combination of sample size and effect size. This is especially relevant in large sample contexts where you have the statistical power to detect tiny differences, making false discoveries a bigger practical concern.

The Power Advantage of Large Samples

One of the main reasons researchers pursue large samples is statistical power: the probability of correctly rejecting H₀ when it’s actually false. Power equals 1 minus the probability of a Type II error (failing to detect a real effect). Larger samples reduce the likelihood that your results will differ substantially from the true population values, which means you’re less likely to miss a genuine effect.

In practical terms, a study with 50 participants might lack the power to detect a moderate treatment benefit, producing an inconclusive result. The same study with 500 participants could detect the same benefit with high confidence. This is why clinical trials, national surveys, and policy research typically aim for large samples whenever feasible.
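To put approximate numbers on that comparison, the sketch below uses the normal approximation for a two-sided, two-group test. Several things here are assumptions rather than facts from the text: that “moderate” means a standardized difference of 0.5, that participants are split evenly between two groups, and that the function name `two_sample_power` is just a label for this illustration.

```python
import math
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample Z-test.

    d is the true difference in means divided by the common standard
    deviation (an assumed, standardized effect).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    shift = d * math.sqrt(n_per_group / 2)       # expected value of the Z statistic
    # Probability that the statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for total in (50, 500):
    power = two_sample_power(d=0.5, n_per_group=total // 2)
    print(f"{total} participants: power is about {power:.2f}")
```

Under these assumptions the 50-participant study detects the effect less than half the time, while the 500-participant study detects it almost always.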

The Large Sample Paradox

Here’s where large sample testing gets tricky. A p-value depends on the size of the observed difference, the variability in the data, and the sample size. With enough data, even small differences become statistically significant. Consider a study comparing five-year survival rates of 85% and 90% between two treatment groups. With 100 patients per group, the p-value is 0.39, nowhere near significant. With 1,000 patients per group and the exact same survival rates, the p-value drops to 0.0009, which looks like strong evidence against H₀. The actual difference between groups hasn’t changed at all.
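The two p-values can be reproduced with a short calculation. The text does not say which test produced them; the figures are consistent with a two-proportion Z-test using a continuity (Yates) correction, so that choice is an assumption in the sketch below, as is the function name `two_proportion_p_value`.

```python
import math

def two_proportion_p_value(x1, n1, x2, n2, continuity=True):
    """Two-sided p-value for H0: the two population proportions are equal."""
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(x1 / n1 - x2 / n2)
    if continuity:
        # Yates-style correction (assumed here to match the quoted figures)
        diff = max(0.0, diff - 0.5 * (1 / n1 + 1 / n2))
    z = diff / standard_error
    return math.erfc(z / math.sqrt(2))

# Same 85% vs 90% survival rates, two different sample sizes
p_small = two_proportion_p_value(85, 100, 90, 100)
p_large = two_proportion_p_value(850, 1000, 900, 1000)
print(f"100 per group:  p = {p_small:.2f}")   # 0.39
print(f"1000 per group: p = {p_large:.4f}")   # 0.0009
```

Without the correction the small-sample p-value comes out lower (roughly 0.29), but the qualitative point is the same: only the sample size changed between the two runs.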

This is why the American Statistical Association has cautioned that a p-value does not measure the size of an effect or the importance of a result. Scientific conclusions should not be based only on whether a p-value crosses a specific threshold. As their executive director put it, the p-value “was never intended to be a substitute for scientific reasoning.”

Why Effect Size Matters More Than P-Values

When your sample is large enough, almost any nonzero effect will reach statistical significance. The more useful question shifts from “is there a difference?” to “how big is the difference?” This is where effect size comes in.

One common measure is Cohen’s d, which expresses the difference between two group means in standard deviation units. A d of 0.2 is considered small, 0.5 is medium, and 0.8 or above is large. Unlike a p-value, Cohen’s d does not shrink or grow systematically with sample size; more data only makes the estimate more precise. A study of 10,000 people might find a statistically significant result with a p-value below 0.001, but if Cohen’s d is 0.1, the practical difference between groups is negligible and probably doesn’t justify changing a treatment protocol or business strategy.
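A minimal sketch of the calculation, using the pooled standard deviation form of Cohen’s d. The function names and the summary statistics passed in are invented for illustration, and `approx_z_for_d` assumes two equal-sized groups.

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d using the pooled standard deviation of the two groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def approx_z_for_d(d, n_per_group):
    """Approximate two-sample Z statistic implied by d with equal group sizes."""
    return d * math.sqrt(n_per_group / 2)

# Hypothetical summary statistics for two groups of 50
print(cohens_d(mean1=105.0, mean2=100.0, sd1=10.0, sd2=10.0, n1=50, n2=50))  # 0.5

# The same small effect, d = 0.1, at two sample sizes
for n in (50, 5000):
    z = approx_z_for_d(0.1, n)
    p = math.erfc(abs(z) / math.sqrt(2))
    print(f"n per group = {n}: z = {z:.2f}, p = {p:.2g}")
```

The second loop shows the contrast described above: d stays at 0.1 in both cases, while the p-value moves from clearly non-significant to far below 0.001.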

The best practice when conducting large sample tests is to report both the p-value and an effect size measure. The p-value tells you how surprising your data would be if H₀ were true. The effect size tells you whether the difference is large enough to matter.

Confidence Intervals in Large Samples

Confidence intervals offer another way to interpret large sample results that many statisticians prefer over p-values alone. The formula follows a simple structure: take your point estimate (the sample mean or proportion), then add and subtract the critical Z-value multiplied by the standard error. For a 95% confidence interval, the critical value is 1.96.
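The formula translates directly into code. In this sketch the function name `z_confidence_interval` and the sample figures (mean 122, standard deviation 10) are hypothetical; the critical value is looked up from the standard normal distribution rather than hard-coded.

```python
import math
from statistics import NormalDist

def z_confidence_interval(estimate, sd, n, level=0.95):
    """Large sample interval: estimate +/- z * standard error."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z_crit * sd / math.sqrt(n)
    return estimate - margin, estimate + margin

# Hypothetical sample: mean 122, sd 10, at two sample sizes
for n in (100, 10_000):
    low, high = z_confidence_interval(estimate=122.0, sd=10.0, n=n)
    print(f"n = {n}: ({low:.2f}, {high:.2f})")
```

With n = 100 the interval is roughly 120.04 to 123.96; multiplying the sample size by 100 cuts the width by a factor of 10.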

Large samples produce narrower confidence intervals because the standard error shrinks as n increases. A narrow interval is more informative: it pins down the likely range of the true population value with greater precision. But this also connects back to the paradox. A very narrow confidence interval might exclude the value stated in H₀ by a hair, giving you a significant result for a difference so small it has no real-world meaning. Reading where the interval’s endpoints fall relative to a difference that would matter in practice, not just whether the interval contains the null value, helps you judge practical significance alongside statistical significance.