A multinomial distribution describes the probability of seeing a particular combination of outcomes when you repeat an experiment multiple times and each trial can land in one of several categories. Think of rolling a die 20 times and asking: what’s the probability of getting exactly five 1s, three 2s, four 3s, two 4s, three 5s, and three 6s? That question is a multinomial problem. It’s the natural extension of the binomial distribution (which handles only two outcomes, like heads or tails) to situations with three or more possible outcomes per trial.
How It Relates to the Binomial Distribution
If you’ve encountered the binomial distribution, you already understand the core logic. A binomial distribution counts successes and failures across repeated trials: flip a coin 10 times, how many heads? There are only two categories. The multinomial distribution removes that two-category restriction. When each trial can produce one of K outcomes (where K is greater than 2), the multinomial distribution gives you the probability of any specific combination of counts across all K categories. When K equals 2, the multinomial distribution collapses back into the familiar binomial.
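To make that collapse concrete, here is a minimal sketch (using scipy.stats, with arbitrary numbers of my own choosing) that checks the two-category multinomial against the binomial:

```python
from scipy.stats import binom, multinomial

# With K = 2 categories, a "multinomial" count is just (successes, failures),
# so its pmf should match the binomial pmf for the same n and p.
n, p = 10, 0.3
for k in range(n + 1):
    b = binom.pmf(k, n, p)
    m = multinomial.pmf([k, n - k], n=n, p=[p, 1 - p])
    assert abs(b - m) < 1e-12
print("K = 2 multinomial agrees with the binomial for every count.")
```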
Conditions That Must Be Met
Four conditions define a multinomial setting:
- Fixed number of trials. You perform the experiment n times, and n is decided in advance.
- Independent trials. The outcome of one trial doesn’t influence any other.
- Mutually exclusive categories. Each trial lands in exactly one of K categories. A die roll can’t be both a 3 and a 5.
- Constant probabilities. The probability of each category stays the same from trial to trial. If the chance of rolling a 3 is 1/6 on the first roll, it’s 1/6 on every roll.
There are also two mathematical constraints baked in. All the category probabilities must add up to exactly 1, because every trial has to land somewhere. And all the category counts must add up to n, because every trial produces exactly one outcome.
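A small sketch of those two constraints (the function name and numbers are mine, not standard library code):

```python
import math

def check_multinomial_setup(counts, probs):
    """Sanity-check a proposed multinomial setting: one probability per
    category, probabilities summing to 1, non-negative counts."""
    if len(counts) != len(probs):
        raise ValueError("need one count and one probability per category")
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("category probabilities must sum to 1")
    if any(c < 0 for c in counts):
        raise ValueError("counts cannot be negative")
    return sum(counts)  # the number of trials n implied by the counts

# Die example from the opening paragraph: six faces, 20 rolls in total.
print(check_multinomial_setup([5, 3, 4, 2, 3, 3], [1/6] * 6))  # 20
```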
The Formula
The probability of observing a specific set of counts (x₁ outcomes in category 1, x₂ in category 2, and so on up to xₖ in category K) across n trials is:
P(X₁ = x₁, X₂ = x₂, …, Xₖ = xₖ) = [n! / (x₁! × x₂! × … × xₖ!)] × p₁^x₁ × p₂^x₂ × … × pₖ^xₖ
The first part, the fraction with all the factorials, counts the number of different orderings that produce the same combination of counts. The second part, the product of probabilities raised to powers, gives the probability of any single ordering. Multiplying them together gives you the total probability of that combination regardless of the order it occurred in.
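As a sketch, the formula translates almost line for line into code (the function and variable names here are my own):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of one specific vector of category counts."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)       # n! / (x1! * x2! * ... * xK!): number of orderings
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x               # p1^x1 * p2^x2 * ... * pK^xK: one ordering's probability
    return coeff * prob

# The die question from the introduction: 20 rolls of a fair die.
print(multinomial_pmf([5, 3, 4, 2, 3, 3], [1/6] * 6))
```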
A Worked Example
Suppose two chess players have played enough games that you know the probabilities: Player A wins 40% of the time, Player B wins 35%, and 25% of games end in a draw. They’re about to play 12 games. What’s the probability that Player A wins exactly 7, Player B wins exactly 2, and the remaining 3 are draws?
Start with one specific sequence that fits this description, say AAABDADAAABD. The probability of that exact sequence is (0.40)⁷ × (0.35)² × (0.25)³, because you just multiply the probability of each game’s result in order. But that’s only one arrangement. Many different orderings of 7 A’s, 2 B’s, and 3 D’s are possible. The number of such orderings is 12! / (7! × 2! × 3!) = 7,920.
So the final probability is:
P = [12! / (7! × 2! × 3!)] × (0.40)⁷ × (0.35)² × (0.25)³
That comes out to roughly 0.0248, or about a 2.5% chance. Not likely for any single combination, which makes sense: with 12 games and three possible outcomes each, there are many possible combinations competing for probability.
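The same answer falls out of scipy.stats.multinomial, which makes a handy cross-check on the hand calculation:

```python
from scipy.stats import multinomial

# 12 games; P(A wins) = 0.40, P(B wins) = 0.35, P(draw) = 0.25.
p = multinomial.pmf([7, 2, 3], n=12, p=[0.40, 0.35, 0.25])
print(round(p, 4))  # ~0.0248
```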
Mean, Variance, and Covariance
Each individual category in a multinomial distribution behaves, on its own, like a binomial distribution. If you only care about how often category i shows up, you can treat every trial as “category i” or “not category i.” This gives clean formulas for the basic statistics.
The expected count for category i is simply n × pᵢ. If you roll a fair die 60 times, you’d expect each face to appear about 10 times. The variance for that category is n × pᵢ × (1 − pᵢ). In the die example, the variance for any one face is 60 × (1/6) × (5/6) ≈ 8.33.
What makes the multinomial distribution distinctive is the relationship between categories. Because every trial must land somewhere, seeing more of one outcome necessarily means seeing less of others. This creates a negative covariance between any two categories: Cov(Xᵢ, Xⱼ) = −n × pᵢ × pⱼ. If you roll more 3s than expected, you’ll tend to roll fewer of something else. The categories aren’t independent of each other, even though the individual trials are.
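A short NumPy sketch makes the die numbers above concrete (one entry per face):

```python
import numpy as np

n = 60                                # rolls of a fair die, as in the text
p = np.full(6, 1 / 6)

expected = n * p                      # n * p_i  -> 10 for every face
variances = n * p * (1 - p)           # n * p_i * (1 - p_i)  -> ~8.33
cov = -n * np.outer(p, p)             # Cov(X_i, X_j) = -n * p_i * p_j  -> ~-1.67
np.fill_diagonal(cov, variances)      # diagonal holds each face's own variance

print(expected)
print(cov)
```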
Applications in Genetics
Population genetics relies heavily on the multinomial distribution. When researchers sample individuals from a population and observe their genotypes at a particular gene, the genotype counts follow a multinomial distribution. For a gene with two variants (call them A and B), there are three possible genotypes: AA, AB, and BB. A sample of individuals produces counts in each category.
A classic use is testing Hardy-Weinberg equilibrium, which predicts genotype frequencies under random mating. If the frequency of allele A in the population is f, then under random mating you’d expect the genotypes AA, AB, and BB to appear with probabilities f², 2f(1−f), and (1−f)². Researchers model the observed genotype counts as a multinomial distribution and test whether the data fit these predicted probabilities. The same approach extends to genes with more variants: ABO blood groups, for instance, involve three alleles (A, B, O) producing four observable blood types, each with a predicted frequency under Hardy-Weinberg assumptions.
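One common way to run that test is a chi-square goodness-of-fit check against the Hardy-Weinberg probabilities. The sketch below uses made-up genotype counts; only the structure of the test is the point, and the allele frequency f is estimated from the sample itself (hence ddof=1):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical genotype counts for AA, AB, BB in a sample of 200 individuals.
observed = np.array([75, 90, 35])
n = observed.sum()

# Estimate the frequency of allele A from the sample: each AA individual
# carries two copies of A, each AB carries one.
f = (2 * observed[0] + observed[1]) / (2 * n)

# Hardy-Weinberg expected counts: n * (f^2, 2f(1-f), (1-f)^2).
expected = n * np.array([f**2, 2 * f * (1 - f), (1 - f)**2])

# ddof=1 because one parameter (f) was estimated from the data.
stat, pvalue = chisquare(observed, expected, ddof=1)
print(f, stat, pvalue)
```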
Applications in Machine Learning
One of the most common uses of the multinomial distribution in practice is text classification. The multinomial Naive Bayes classifier treats a document as a collection of word counts, ignoring word order entirely (a “bag of words” model). Each word position in a document is one trial, and each word in the vocabulary is a possible outcome. The distribution of word frequencies across a document then follows a multinomial distribution.
To classify a new document (as spam vs. not spam, or by topic), the algorithm estimates the probability of seeing that particular combination of words given each possible class. It does this using word frequencies learned from training documents. The class that makes the observed word counts most probable wins. This approach is fast, scales well to large vocabularies, and remains surprisingly effective for tasks like spam filtering, sentiment analysis, and topic categorization.
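Here is a minimal sketch with scikit-learn; the training texts and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny made-up training set: per-document word counts feed the classifier.
texts = [
    "win a free prize now",
    "free money claim your prize today",
    "meeting agenda for monday morning",
    "lunch with the project team",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))         # likely "spam" on this toy data
print(model.predict(["monday meeting with the team"]))  # likely "ham"
```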
Categorical vs. Multinomial Distribution
You’ll sometimes see the term “categorical distribution,” and the distinction is simple. A categorical distribution describes a single trial with K possible outcomes. Roll one die, and the result follows a categorical distribution. A multinomial distribution describes the counts after repeating that same trial n times. Roll a die 20 times and tally the results, and those tallies follow a multinomial distribution. The categorical distribution is the single-trial special case of the multinomial, just as a single coin flip (Bernoulli) is the single-trial special case of the binomial.
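NumPy exposes the same relationship directly: a single draw with n = 1 is the categorical case, and n = 20 gives the tallied multinomial (the seed and numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.full(6, 1 / 6)   # fair die

# One roll: a categorical draw, reported as a count vector with a single 1.
print(rng.multinomial(1, p))

# Twenty rolls tallied: a multinomial draw with n = 20.
print(rng.multinomial(20, p))
```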

