When to Use a Binomial Test: 4 Key Conditions

You should use a binomial test when you have two possible outcomes (success or failure, yes or no, heads or tails) and you want to know whether the proportion of successes you observed differs from what you’d expect. It’s the go-to test for binary data when you’re comparing your results against a specific, predefined probability. If you flipped a coin 100 times and got 62 heads, a binomial test tells you whether that result is unusual enough to doubt the coin is fair.

The Four Conditions That Call for a Binomial Test

A binomial test fits your situation when all four of these are true:

  • Two outcomes only. Each observation falls into one of exactly two categories: pass/fail, defective/not defective, clicked/didn’t click.
  • Fixed number of trials. You know in advance how many observations you have (or at least have a definite count).
  • Independent trials. The outcome of one observation doesn’t influence another. Drawing names from a hat without replacement would violate this; flipping a coin would not.
  • Constant probability. The underlying chance of “success” stays the same from trial to trial. If conditions change midway through data collection, the binomial model breaks down.

When all four hold, you’re working with a binomial distribution, and the binomial test gives you an exact way to evaluate whether your observed proportion differs from some expected value.
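These four conditions are exactly what define a binomial distribution, and its probabilities can be evaluated directly; a minimal sketch in Python with SciPy (the 10-flip fair coin is just an illustration):

```python
from scipy.stats import binom

# P(exactly k heads in 10 fair-coin flips) -- the binomial PMF
n, p = 10, 0.5
print(binom.pmf(5, n, p))  # C(10,5) * 0.5^10 = 252/1024 ≈ 0.246

# The probabilities over all possible outcomes sum to 1
total = sum(binom.pmf(k, n, p) for k in range(n + 1))
print(total)
```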

What the Test Actually Does

The binomial test compares what you observed to what you’d expect under a null hypothesis. That null hypothesis always takes the form: “the true proportion equals some specific value.” You choose that value based on your context. Maybe it’s 0.5 (a fair coin), 0.10 (a historical defect rate), or 0.03 (an industry benchmark).

You can test this in three ways. A two-sided test asks whether the true proportion simply differs from the expected value in either direction. A one-sided test asks whether the true proportion is specifically greater than or specifically less than the expected value. The direction you choose depends on your question. If you’re worried a new manufacturing process has increased defects, you’d test whether the proportion is greater than the historical rate. If you’re hoping a new drug outperforms a known response rate, you’d test whether the proportion exceeds that baseline.
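In SciPy, the direction is controlled by the `alternative` argument to `scipy.stats.binomtest`; a sketch with made-up counts (14 defects observed in 100 items, tested against a 10% historical rate):

```python
from scipy.stats import binomtest

k, n, p0 = 14, 100, 0.10  # hypothetical counts, for illustration only

# Two-sided: is the true rate different from 10% in either direction?
two_sided = binomtest(k, n, p0, alternative='two-sided')

# One-sided: has the process *increased* defects above 10%?
greater = binomtest(k, n, p0, alternative='greater')

# One-sided the other way: is the rate *below* 10%?
less = binomtest(k, n, p0, alternative='less')

print(two_sided.pvalue, greater.pvalue, less.pvalue)
```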

The test calculates a p-value by summing the exact probabilities of getting your result (or something more extreme) under the null hypothesis. This is why it’s called an “exact” test: it doesn’t rely on approximations. It uses the binomial probability formula directly, plugging in your sample size, number of successes, and the hypothesized probability to compute how likely your data would be if the null hypothesis were true.
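That summation is simple enough to reproduce by hand and check against the library; a sketch with illustrative numbers (7 heads in 10 flips, testing one-sided whether the coin favors heads):

```python
from scipy.stats import binom, binomtest

n, k, p0 = 10, 7, 0.5  # 7 heads in 10 flips, illustrative numbers

# One-sided p-value: probability of k or more successes under the null
p_manual = sum(binom.pmf(x, n, p0) for x in range(k, n + 1))
print(p_manual)  # (120 + 45 + 10 + 1) / 1024 ≈ 0.172

# binomtest performs exactly this summation
p_exact = binomtest(k, n, p0, alternative='greater').pvalue
print(p_exact)
```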

When to Choose It Over a Chi-Square Test

The chi-square goodness-of-fit test can also compare observed proportions to expected ones, but it’s an approximation. When you have only two categories, the binomial test gives you an exact answer. The chi-square test works well enough with large samples, but it can be unreliable when expected counts in any category drop below about 5. With two categories and a small sample, the exact binomial test is the better choice.
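To see the gap, you can run both tests on the same small sample; a sketch with made-up numbers (1 defect in 12 items against an expected 25% rate, so one expected count is only 3, below the threshold):

```python
from scipy.stats import binomtest, chisquare

k, n, p0 = 1, 12, 0.25  # hypothetical: 1 "success" in 12 trials

# Chi-square goodness of fit on the two categories (approximate)
observed = [k, n - k]              # [1, 11]
expected = [n * p0, n * (1 - p0)]  # [3, 9] -- one count below 5
p_chi2 = chisquare(observed, f_exp=expected).pvalue

# Exact binomial test on the same data
p_exact = binomtest(k, n, p0).pvalue

print(p_chi2, p_exact)  # the approximation is noticeably off here
```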

If you have three or more categories (say, red/blue/green), you’ve moved beyond a binomial situation and the chi-square test or a multinomial test is appropriate. The binomial test is specifically built for two-category problems.

Small Samples vs. Large Samples

For small samples, the exact binomial test is your only reliable option. As your sample grows, the binomial distribution starts to resemble a normal (bell-curve) distribution, and you can use a faster normal approximation instead. The standard rule of thumb: if both np and n(1-p) are at least 5, where n is your sample size and p is the hypothesized proportion, the normal approximation is reasonable.

So if you’re testing whether a coin is fair (p = 0.5) and you have 20 flips, np = 10 and n(1-p) = 10, both above 5, so the approximation works. But if you’re testing whether a rare defect rate exceeds 0.02 and you have 100 items, np = 2, which falls below the threshold. In that case, stick with the exact test.
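The intro’s coin example (62 heads in 100 flips) satisfies the rule of thumb, and a sketch shows how close the normal approximation comes to the exact answer:

```python
import math
from scipy.stats import binomtest, norm

k, n, p0 = 62, 100, 0.5  # 62 heads in 100 flips; np = n(1-p) = 50

# Exact two-sided binomial test
p_exact = binomtest(k, n, p0).pvalue

# Normal approximation: z-test for a proportion
z = (k / n - p0) / math.sqrt(p0 * (1 - p0) / n)  # ≈ 2.4
p_approx = 2 * norm.sf(abs(z))

print(p_exact, p_approx)  # close together, and both below 0.05
```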

With modern software this distinction matters less than it used to, because computers can calculate exact binomial probabilities for very large samples in milliseconds. But understanding the threshold helps you interpret older literature and know when a reported z-test for proportions is essentially doing a binomial test with a normal approximation.

Common Real-World Applications

Quality Control and Manufacturing

Factories routinely use binomial tests to check whether a defect rate stays within acceptable limits. The National Institute of Standards and Technology outlines this as a standard procedure: you inspect a batch, count the defective items, and test whether the observed defect rate exceeds the prescribed limit. In one NIST example, a fabrication facility introduced a new wafer processing method, inspected 200 wafers, and found 26 defective (13%). They tested this against the historical defect rate of 10% and got a test statistic of 1.414, which fell below the critical threshold of 1.645. The conclusion: not enough evidence to say the new process degraded quality.
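The NIST numbers are straightforward to reproduce; a sketch of the z-statistic calculation, plus the exact test SciPy would run on the same counts:

```python
import math
from scipy.stats import binomtest

defective, n, p0 = 26, 200, 0.10  # NIST wafer example
p_hat = defective / n             # 0.13

# One-sided z-statistic against the historical 10% rate
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(z)  # ≈ 1.414, below the 1.645 critical value at alpha = 0.05

# The exact test reaches the same conclusion
p_exact = binomtest(defective, n, p0, alternative='greater').pvalue
print(p_exact)  # above 0.05: no evidence the defect rate increased
```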

Clinical Trials

Drug trials frequently use binomial-based designs to test whether a treatment’s response rate exceeds a minimum threshold. In early-phase oncology trials, for instance, researchers might test whether a drug’s response rate exceeds 5%. The trial’s design is built on the binomial framework: each patient either responds or doesn’t, and the test determines whether the observed response rate is high enough to justify further study.
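A sketch of that kind of check, with entirely hypothetical counts (6 responders out of 40 patients against a 5% null response rate):

```python
from scipy.stats import binomtest

responders, n, p0 = 6, 40, 0.05  # hypothetical trial counts

# One-sided: is the response rate greater than the 5% threshold?
result = binomtest(responders, n, p0, alternative='greater')
print(result.pvalue)  # below 0.05 here: evidence worth further study
```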

Marketing and User Research

If you show visitors a new variant in an A/B test and want to know whether its click-through rate differs from your baseline of 3%, that’s a binomial test. Each visitor either clicks or doesn’t. You have a known baseline. You count successes and compare.
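With made-up traffic numbers (42 clicks from 1,000 visitors), the whole check is one call:

```python
from scipy.stats import binomtest

clicks, visitors, baseline = 42, 1000, 0.03  # hypothetical A/B numbers

result = binomtest(clicks, visitors, baseline)  # two-sided by default
print(clicks / visitors)  # observed click-through rate: 0.042
print(result.pvalue)
```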

Behavioral Research

A related application is the sign test, which is actually a special case of the binomial test. In a before-and-after study, you ignore the size of each change and only record its direction: did each participant improve or not? Ties (participants who didn’t change) are dropped, and you test whether the proportion who improved among the rest differs from 50% (what you’d expect by chance). If 15 children watched a film about sugar and 9 ate fewer candies afterward while only 1 ate more, the sign test uses the binomial distribution to evaluate whether that 9-to-1 split among the 10 who changed is meaningful.
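That evaluation is a single call in SciPy; a sketch (the sign test uses only the children whose candy count changed, so the 5 unchanged are dropped):

```python
from scipy.stats import binomtest

improved, worsened = 9, 1  # from the candy example; 5 ties dropped
n = improved + worsened    # 10 informative observations

# Two-sided: does the proportion who improved differ from 50%?
result = binomtest(improved, n, 0.5)
print(result.pvalue)  # 22/1024 ≈ 0.021: unlikely to be chance alone
```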

Running the Test in Software

In R, the function is binom.test(). You provide the number of successes, the number of trials, and the hypothesized probability. The output gives you a p-value, a confidence interval for the true proportion, and the observed proportion. For example, testing 58 successes out of 100 against a hypothesized probability of 0.5 returns a p-value of 0.133, meaning there’s not enough evidence to conclude the true probability differs from 50%.

In Python, the SciPy library offers scipy.stats.binomtest(), which takes the same inputs and returns the same key outputs. Both tools run the exact test by default.
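A sketch reproducing the R example’s numbers with SciPy:

```python
from scipy.stats import binomtest

# 58 successes in 100 trials, tested against p = 0.5
result = binomtest(58, 100, 0.5)

print(result.pvalue)  # ≈ 0.133: no evidence against p = 0.5
ci = result.proportion_ci(confidence_level=0.95)
print(ci.low, ci.high)  # ≈ 0.477 to 0.678
```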

The numbers to focus on in the output are straightforward. The p-value tells you whether to reject the null hypothesis (typically at the 0.05 threshold). The confidence interval gives you a plausible range for the true proportion. In the R example above, the 95% confidence interval ran from 0.477 to 0.678, meaning the true probability of success likely falls somewhere in that range.

Quick Decision Guide

Use a binomial test when:

  • Your data has exactly two possible outcomes per observation
  • You’re comparing an observed proportion to a single known or hypothesized value
  • Your observations are independent of each other
  • You want an exact result rather than an approximation (especially with small samples or extreme proportions)

Use something else when you’re comparing proportions between two groups (that calls for a two-proportion z-test or chi-square test of independence), when you have more than two outcome categories (chi-square goodness of fit), or when your data involves measurements on a continuous scale rather than binary counts.