A “rule of thumb” in statistics is a simplified guideline that helps you make quick decisions about data without running complex calculations. There isn’t just one rule of thumb. Statistics has dozens of them, covering everything from how data spreads out in a bell curve to how large your sample needs to be. These shortcuts are widely taught and used, though most come with the caveat that they’re approximations, not iron laws.
The Empirical Rule (68-95-99.7)
The most commonly referenced rule of thumb in statistics is the empirical rule, sometimes called the 68-95-99.7 rule. It describes how data spreads out in a normal distribution (the familiar bell curve). About 68% of data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations.
This gives you a fast way to judge whether a particular data point is typical or unusual. If someone’s test score is two standard deviations above the mean, you immediately know they’re in roughly the top 2.5% of scores. If a measurement lands more than three standard deviations away from the mean, it’s extremely rare, occurring less than 0.3% of the time. The rule only applies when data follows an approximately normal (bell-shaped) distribution, but since many natural measurements do, it comes up constantly in practice.
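The three percentages can be checked directly with Python's standard library. The mean of 100 and standard deviation of 15 below (an IQ-style scale) are illustrative assumptions, not values from this article.

```python
# Verify the 68-95-99.7 rule using the stdlib's NormalDist.
# mean=100, sd=15 is an invented, IQ-like scale for illustration.
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)

for k in (1, 2, 3):
    lo, hi = 100 - k * 15, 100 + k * 15
    share = dist.cdf(hi) - dist.cdf(lo)   # probability mass between lo and hi
    print(f"within {k} SD: {share:.1%}")
# prints roughly 68.3%, 95.4%, and 99.7%
```

Because the rule is a property of the normal curve itself, the same three percentages come out for any choice of mean and standard deviation.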
The n = 30 Rule for Sample Size
A classic rule of thumb says you need a sample size of at least 30 for the central limit theorem to kick in. The reasoning: at 30 degrees of freedom, the t-distribution (which accounts for the extra uncertainty in small samples) closely approximates the normal distribution. Once your sample hits that threshold, you can treat your sampling distribution as roughly normal regardless of how the underlying population is shaped.
This is genuinely useful as a starting point, but it’s a simplification. Highly skewed populations may need larger samples before the central limit theorem provides a reliable approximation, while symmetric populations may work fine with fewer. Still, 30 remains the go-to benchmark in introductory statistics courses and quick power estimates.
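A small simulation makes the caveat concrete. The sketch below, under the assumption of an exponential population (strongly right-skewed), estimates the skewness of the distribution of sample means at n = 5 and n = 30: the skew shrinks as n grows but hasn't vanished at 30.

```python
# Sketch: how skewed is the sampling distribution of the mean?
# The exponential population is an illustrative choice, not from the text.
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is reproducible

def sample_means(n, trials=10_000):
    # Draw `trials` samples of size n from an exponential population
    # (mean 1, heavily right-skewed) and return the sample means.
    return [mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

for n in (5, 30):
    means = sample_means(n)
    m = mean(means)
    sd = mean((x - m) ** 2 for x in means) ** 0.5
    # Crude standardized skewness of the sampling distribution.
    skew = mean(((x - m) / sd) ** 3 for x in means)
    print(f"n={n:2d}: skewness of sample means ~ {skew:.2f}")
```

Theory predicts the skewness falls off like 2/√n for this population, so n = 30 leaves visible asymmetry (around 0.37) rather than a perfectly normal shape.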
The 1.5 × IQR Rule for Outliers
To flag potential outliers, the standard approach is to calculate the interquartile range (the span between the 25th and 75th percentiles) and multiply it by 1.5. Any data point more than 1.5 × IQR below the 25th percentile or more than 1.5 × IQR above the 75th percentile is flagged as a potential outlier.
For example, if your 25th percentile is 80 and your 75th percentile is 90, the IQR is 10. Multiply by 1.5 to get 15. Your lower fence is 65 (80 minus 15) and your upper fence is 105 (90 plus 15). Anything below 65 or above 105 gets flagged. This method, developed by the statistician John Tukey, is the basis for the “whiskers” in box-and-whisker plots. The 1.5 multiplier is somewhat arbitrary, but it strikes a practical balance between catching genuinely unusual values and not flagging too many ordinary ones.
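The worked example above translates into a few lines of code:

```python
# Tukey's 1.5 x IQR fences, using the quartiles from the worked example.
def tukey_fences(q1, q3, k=1.5):
    """Return the lower and upper outlier fences for Tukey's rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lo, hi = tukey_fences(80, 90)
print(lo, hi)  # 65.0 105.0
```

Passing a larger multiplier (Tukey used 3.0 for "far out" values) widens the fences so only the most extreme points are flagged.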
The p < 0.05 Significance Threshold
Perhaps the most debated rule of thumb in all of statistics is the convention that results with a p-value below 0.05 are “statistically significant.” This threshold traces back to Ronald Fisher in the 1920s, who noted that a p-value of 0.05 corresponds to roughly two standard deviations from the mean and called it a convenient limit for judging significance. The number stuck, becoming the default cutoff across nearly every scientific field for decades.
The threshold is conventional, not magical. A p-value of 0.049 isn’t meaningfully different from one of 0.051, yet only the first one traditionally “passes.” The American Statistical Association took the unusual step of releasing a formal statement in 2016, warning against treating 0.05 as a bright line. Among its key points: a p-value doesn’t measure the size or importance of an effect, scientific conclusions shouldn’t rest on whether a p-value passes a single threshold, and a non-significant result doesn’t mean no effect exists. Some researchers have proposed lowering the default to 0.005, though this remains controversial. The broader push is toward reporting effect sizes and confidence intervals alongside p-values rather than relying on a simple pass/fail cutoff.
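Fisher's "roughly two standard deviations" observation is easy to verify: the two-sided p-value for a standard normal test statistic at z = 1.96 lands almost exactly on 0.05.

```python
# Two-sided normal p-value: P(|Z| >= z) for standard normal Z.
from statistics import NormalDist

def two_sided_p(z):
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(f"{two_sided_p(1.96):.4f}")  # 0.0500
print(f"{two_sided_p(2.00):.4f}")  # 0.0455
```

The continuity of the function is the point: nothing special happens to the evidence as z slides from 1.95 to 1.97, even though the verdict flips from "not significant" to "significant."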
Cohen’s d: Small, Medium, and Large Effects
When you want to describe how big a difference actually is (not just whether it’s statistically significant), Jacob Cohen’s benchmarks are the standard rule of thumb. A Cohen’s d of 0.2 is considered a small effect, 0.5 is medium, and 0.8 is large. Cohen originally proposed the medium effect as one “observable to the naked eye,” with small and large set equidistant from that midpoint.
These benchmarks give you a common language for comparing results across studies, even when the original measurements used different scales. A treatment that produces a Cohen’s d of 0.3 is doing something, but modestly. One that produces a d of 0.9 is making a substantial, easily noticeable difference. Like all rules of thumb, context matters. In some fields, a “small” effect of 0.2 might be clinically meaningful if it applies to millions of people, while in others a “large” effect of 0.8 might not justify the cost of an intervention.
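Cohen's d is simple to compute by hand: the difference between group means divided by a pooled standard deviation. The two groups below are invented purely for illustration.

```python
# Cohen's d with a pooled sample standard deviation.
# The treatment/control data are invented for illustration.
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

treatment = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7]
control   = [4.2, 4.9, 4.1, 5.0, 4.4, 4.6]
print(f"d = {cohens_d(treatment, control):.2f}")  # well past 0.8, i.e. "large"
```

Because the result is in standard-deviation units, the same function works whether the raw scores are test points, blood pressure readings, or reaction times.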
The 80% Power Standard
Statistical power is the probability that your study will detect a real effect if one exists. The standard benchmark is 80%, meaning you accept a 20% chance of missing a true effect (a Type II error). This target reflects a deliberate tradeoff: lowering your risk of false negatives requires either larger sample sizes or accepting a higher risk of false positives, and 80% is the conventional balance point when the significance level is set at 0.05. Some fields, particularly clinical trials with high stakes, aim for 90% power instead.
The practical takeaway is that if you’re designing a study, you should calculate how many participants or observations you need to reach at least 80% power for the effect size you expect. Underpowered studies are one of the most common problems in published research, producing results that are unreliable or fail to replicate.
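A back-of-the-envelope version of that calculation uses the normal approximation n ≈ 2 × ((z_α/2 + z_β) / d)² per group for a two-sample comparison. The medium effect size d = 0.5 below is an illustrative assumption.

```python
# Rough per-group sample size for a two-sample test via the
# normal approximation (exact t-based answers run slightly higher).
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.5))  # 63
```

The formula also makes the tradeoffs in the text visible: asking for 90% power instead of 80%, or chasing a smaller effect size, pushes n up quickly.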
The Expected Frequency Rule for Chi-Square Tests
When using a chi-square test to check whether two categorical variables are related (for example, whether treatment group and recovery rate are connected), a long-standing rule says every cell in your table should have an expected frequency of at least 5. If any cell drops below that, the chi-square approximation becomes unreliable.
This guideline is generally attributed to the statistician William Cochran, who acknowledged that the number 5 was chosen somewhat arbitrarily. Subsequent studies have largely confirmed that the test performs well when expected frequencies are at least 5, though alternatives like Fisher’s exact test exist for situations where your counts are too low.
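The quantities the rule is checked against are the expected frequencies, computed cell by cell as (row total × column total) / grand total. The 2×2 counts below are invented; in this case two expected cells fall below 5, which is exactly the situation where Fisher's exact test is preferred.

```python
# Expected cell frequencies for a 2x2 contingency table.
# The observed counts are invented for illustration.
table = [[12, 3],   # e.g. treated: recovered / not recovered
         [7, 3]]    # control: recovered / not recovered

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

expected = [[r * c / grand for c in col_totals] for r in row_totals]
print(expected)                                   # [[11.4, 3.6], [7.6, 2.4]]
ok = all(e >= 5 for row in expected for e in row)
print("expected-frequency rule satisfied:", ok)   # False
```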
Correlation Strength Benchmarks
Interpreting a correlation coefficient also relies on rules of thumb, though these vary by field. A commonly used scale in psychology classifies correlations of 0.1 to 0.3 as weak, 0.4 to 0.6 as moderate, and 0.7 to 1.0 as strong. Medical research tends to use stricter standards, where a correlation of 0.5 might only qualify as “fair.”
These differences highlight an important point: a correlation of 0.6 might be impressive in one discipline and unremarkable in another. The raw number also doesn’t tell you about causation or practical significance. A correlation of 0.31 between blood pressure and age can be highly statistically significant (with a very low p-value) while still being a weak relationship in practical terms. The strength of the correlation and its statistical significance are separate questions.
The Rule of Three
A lesser-known but elegant rule of thumb applies when you observe zero events in a sample. If you test 300 products and none of them fail, the rule of three says you can be 95% confident that the true failure rate is no higher than 3 divided by your sample size, so 3/300 = 1%. The formula is simply 3/n, where n is the number of observations. It provides a quick upper bound without requiring any complex calculation, and it’s particularly useful in quality control, safety testing, and medical screening where zero observed failures doesn’t guarantee a zero failure rate.
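The rule can be checked against the exact binomial bound it approximates: with zero failures in n trials, the 95% upper limit solves (1 − p)ⁿ = 0.05, giving p = 1 − 0.05^(1/n).

```python
# Rule of three vs. the exact binomial 95% upper bound (zero failures).
n = 300
rule_of_three = 3 / n                 # the quick approximation
exact = 1 - 0.05 ** (1 / n)           # solves (1 - p)**n = 0.05
print(f"rule of three: {rule_of_three:.4%}")  # 1.0000%
print(f"exact bound:   {exact:.4%}")          # ~0.99%
```

The two agree to within a few hundredths of a percentage point, which is why the shortcut is good enough for most quality-control and screening work.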

