A likelihood function measures how well a statistical model, with a specific set of parameter values, explains the data you actually observed. It uses the same mathematical formula as a probability distribution, but flips the perspective: instead of asking “given these parameters, what data might I see?” it asks “given this data, which parameter values are most plausible?” This simple reversal of perspective is one of the most powerful ideas in statistics, introduced by R.A. Fisher in 1912 and central to nearly every modern method for fitting models to data.
How Likelihood Differs From Probability
The confusion between likelihood and probability is almost universal, so it’s worth being precise. A probability function takes a fixed set of parameters (say, the chance a coin lands heads is 0.5) and asks how probable different outcomes are. You might ask: if a fair coin is flipped 10 times, what’s the probability of getting exactly 7 heads? Here, the coin’s fairness is assumed, and you’re exploring possible outcomes.
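To make the forward direction concrete, here is a minimal Python sketch of that exact coin question (the function name binom_pmf is just an illustrative choice):

```python
from math import comb

# P(exactly k heads in n flips), given a fixed probability of heads p
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fix the parameter (a fair coin, p = 0.5), ask about an outcome:
print(binom_pmf(7, 10, 0.5))  # ≈ 0.117
```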
A likelihood function does the opposite. You already know the outcome: you flipped a coin 10 times and got 7 heads. Now you ask, how plausible is it that the coin’s true probability of heads is 0.3? Or 0.5? Or 0.7? You hold the data fixed and vary the parameter. The likelihood function gives you a value for each possible parameter, letting you compare which ones make your observed data more or less plausible.
One critical distinction: a probability distribution always sums or integrates to 1. A likelihood function does not. It is not a probability distribution over parameters. You can compare likelihood values to each other (this parameter is twice as likely as that one), but the raw number itself doesn’t represent a probability.
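A quick numerical check of this distinction, using a 10-flip coin example with 7 heads (the grid resolution is an arbitrary choice):

```python
from math import comb

# Probability view: fix p, vary the outcome k. Summing over all outcomes gives 1.
def pmf(k, n=10, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

total_over_outcomes = sum(pmf(k) for k in range(11))

# Likelihood view: fix the data (7 heads, 3 tails), vary p.
def likelihood(p, heads=7, tails=3):
    return p**heads * (1 - p)**tails

# A crude Riemann sum over p on [0, 1] -- nowhere near 1.
grid = [i / 10000 for i in range(10001)]
integral_over_p = sum(likelihood(p) for p in grid) / len(grid)

print(total_over_outcomes)  # ≈ 1.0
print(integral_over_p)      # ≈ 0.00076, not a probability distribution over p
```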
The Core Formula
Formally, the likelihood function is written as L(θ) = f(x; θ), where f is the probability function (a density for continuous data, a mass function for discrete data), x is the observed data, and θ is the parameter. The notation emphasizes that you’re treating the same formula as a function of θ rather than a function of x.
When you have multiple independent observations, the likelihood is the product of the individual probability values for each data point. For a sample of n observations, that looks like multiplying together n terms, one for each observation. This product structure is what makes the likelihood both powerful and, computationally, a bit tricky to work with.
A Coin Flip Example
Suppose you flip a coin 10 times and get 7 heads. Each flip follows a simple model where the probability of heads is some unknown value p. The likelihood function for the full experiment simplifies to p^7 × (1 – p)^3: one factor of p for each head and one factor of (1 – p) for each tail.
If you plug in p = 0.5, the likelihood equals about 0.00098. If you plug in p = 0.7, it equals about 0.00222. The value p = 0.7 produces a higher likelihood, meaning it does a better job explaining the data. If you evaluate the likelihood at every possible value of p between 0 and 1 and plot the results, you get a curve that peaks at exactly p = 0.7. That peak is the maximum likelihood estimate, the single parameter value that makes your observed data most plausible.
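That grid evaluation can be sketched in a few lines of Python (the grid resolution is an arbitrary choice):

```python
def likelihood(p, heads=7, tails=3):
    return p**heads * (1 - p)**tails

# Evaluate on a fine grid of candidate values for p and take the peak.
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood)

print(likelihood(0.5))  # ≈ 0.00098
print(likelihood(0.7))  # ≈ 0.00222
print(mle)              # 0.7
```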
Why Statisticians Use the Log-Likelihood
In practice, you’ll almost always see the log-likelihood rather than the raw likelihood. There are two practical reasons for this. First, multiplying together hundreds or thousands of small probabilities (one per data point) produces astronomically tiny numbers that computers struggle to represent accurately. Taking the natural logarithm converts that product into a sum, keeping the numbers in a manageable range with higher numerical precision.
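A small demonstration of the underflow problem (the count of 2,000 observations and the per-point probability of 0.1 are arbitrary):

```python
import math

# Hypothetical: 2,000 observations, each with pmf value 0.1 under some model.
probs = [0.1] * 2000

product = 1.0
for q in probs:
    product *= q
print(product)   # 0.0 -- the true value, 1e-2000, underflows to zero

# Sum of logs stays in a comfortable range.
log_lik = sum(math.log(q) for q in probs)
print(log_lik)   # ≈ -4605.17
```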
Second, sums are far easier to work with mathematically than products. When you need to find the peak of the likelihood function using calculus, differentiating a sum is straightforward. The logarithm is a strictly increasing function, so the parameter value that maximizes the log-likelihood is exactly the same value that maximizes the likelihood itself. Nothing is lost in the translation.
For the coin flip example, the log-likelihood becomes 7 × ln(p) + 3 × ln(1 – p). Taking the derivative with respect to p, setting it to zero, and solving gives you p = 7/10, confirming the intuitive answer.
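Written out step by step, the calculus looks like this:

```latex
\ell(p) = 7\ln p + 3\ln(1 - p), \qquad
\frac{d\ell}{dp} = \frac{7}{p} - \frac{3}{1 - p} = 0
\;\Longrightarrow\; 7(1 - p) = 3p
\;\Longrightarrow\; p = \tfrac{7}{10}.
```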
Maximum Likelihood Estimation
The most common use of the likelihood function is maximum likelihood estimation (MLE). The idea is exactly what it sounds like: find the parameter values that make the observed data as likely as possible. You write down the likelihood (or log-likelihood), take its derivative with respect to each parameter, set those derivatives to zero, and solve.
For a dataset drawn from a normal distribution with unknown mean and variance, the log-likelihood involves two parameters. Maximizing it yields the sample mean as the estimate of the true mean, and the average of the squared deviations (dividing by n rather than the n – 1 used in the unbiased sample variance) as the estimate of the variance. These are the formulas most people learn in introductory statistics, but they emerge naturally from the likelihood framework rather than being arbitrary choices.
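A sketch of those estimates on a small made-up sample, with a numerical sanity check that the log-likelihood really peaks there:

```python
import math

# Hypothetical sample
data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]
n = len(data)

mean_mle = sum(data) / n                              # sample mean
var_mle = sum((x - mean_mle)**2 for x in data) / n    # divide by n, not n - 1

# Normal log-likelihood as a function of the two parameters.
def log_lik(mu, sigma2):
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu)**2 / (2 * sigma2) for x in data)

# Nudging either parameter away from the MLE lowers the log-likelihood.
best = log_lik(mean_mle, var_mle)
assert best >= log_lik(mean_mle + 0.1, var_mle)
assert best >= log_lik(mean_mle, var_mle * 1.1)
```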
MLE is the engine behind many of the models used in modern data analysis. Logistic regression, which predicts binary outcomes like “yes or no” or “click or don’t click,” is fitted by maximizing a likelihood function. Software packages for fitting generalized linear models, including logistic and Poisson regression, rely on maximum likelihood under the hood, and Cox regression for survival data maximizes a closely related partial likelihood. Neural networks trained with cross-entropy loss are also, mathematically, performing a form of maximum likelihood estimation.
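The cross-entropy connection can be verified directly: for binary outcomes, cross-entropy loss is exactly the negative Bernoulli log-likelihood. A sketch (the labels and predicted probabilities are made up):

```python
import math

# Hypothetical labels and predicted probabilities from a binary classifier
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1]

# Negative log-likelihood under the Bernoulli model
nll = -sum(math.log(p) if y == 1 else math.log(1 - p)
           for y, p in zip(y_true, y_prob))

# Cross-entropy loss as written in machine learning texts
ce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
          for y, p in zip(y_true, y_prob))

assert abs(nll - ce) < 1e-12  # the same quantity, written two ways
```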
The Likelihood in Bayesian Statistics
In Bayesian inference, the likelihood function plays a different but equally central role. Bayes’ theorem combines three ingredients: a prior distribution (what you believed about the parameters before seeing data), the likelihood (how well each parameter value explains the data), and the resulting posterior distribution (your updated beliefs after seeing the data). The likelihood is the bridge between prior and posterior.
If you start with a completely flat prior, one that treats every parameter value as equally plausible before seeing any data, the posterior distribution ends up being proportional to the likelihood function itself. In that special case, Bayesian and maximum likelihood approaches give the same answer. As you introduce more informative priors, the posterior shifts away from the pure likelihood, blending prior knowledge with the evidence in the data.
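A grid-approximation sketch of that special case, reusing the 7-heads-in-10-flips coin example (the grid size is arbitrary):

```python
# Flat-prior posterior for p given 7 heads in 10 flips, on a grid.
grid = [i / 1000 for i in range(1, 1000)]   # avoid the endpoints 0 and 1
likelihood = [p**7 * (1 - p)**3 for p in grid]

prior = [1.0] * len(grid)                   # flat prior: every p equally plausible
unnorm = [pr * L for pr, L in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]     # normalized to sum to 1

# With a flat prior, the posterior is just the normalized likelihood,
# so its mode coincides with the maximum likelihood estimate.
mode = grid[posterior.index(max(posterior))]
print(mode)  # 0.7
```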
Sufficient Statistics and Simplification
One useful property of the likelihood function is that you can drop any multiplicative terms that don’t involve the parameter. If part of the formula depends only on the data and not on θ, it’s just a constant that shifts the likelihood up or down uniformly. It doesn’t change which parameter value produces the peak, so you can ignore it.
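A quick check that dropping a data-only constant, here the binomial coefficient, leaves the peak untouched:

```python
from math import comb

def lik_no_const(p):
    return p**7 * (1 - p)**3

def lik_with_const(p):
    # includes the factor comb(10, 7), which depends only on the data
    return comb(10, 7) * p**7 * (1 - p)**3

# Both versions peak at the same parameter value.
grid = [i / 1000 for i in range(1001)]
assert max(grid, key=lik_no_const) == max(grid, key=lik_with_const)
```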
This connects to the concept of sufficiency. A statistic (like the sample mean, or the total number of heads) is “sufficient” for a parameter if it captures all the information the data contains about that parameter. The Fisher-Neyman factorization criterion gives a precise test: if the likelihood can be split into one factor that depends on θ only through the statistic and another factor that doesn’t involve θ at all, then that statistic is sufficient. In the coin example, the total number of heads (7) is sufficient for p. You don’t need to know the exact sequence of heads and tails to compute the likelihood. This idea allows you to compress large datasets down to a few key summary numbers without losing any information relevant to estimating the parameter.
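A sketch of sufficiency in the coin example: two different sequences with the same head count yield the same likelihood (the specific sequences are made up):

```python
import math

def sequence_likelihood(flips, p):
    # 1 = heads; likelihood of this exact ordered sequence
    L = 1.0
    for x in flips:
        L *= p if x == 1 else 1 - p
    return L

# Two different orderings, each with 7 heads out of 10
seq_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
seq_b = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]

# The likelihood depends only on the head count, not the order.
assert math.isclose(sequence_likelihood(seq_a, 0.6),
                    sequence_likelihood(seq_b, 0.6))
```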

