Thompson sampling is a decision-making algorithm that figures out which option is best by balancing two competing goals: trying new things (exploration) and sticking with what already seems to work (exploitation). It does this using probability. Instead of picking the option with the highest known average or randomly experimenting, it maintains a probability model for each option and updates that model every time it gets new information. At each step, it draws a random sample from each model and picks whichever option produced the highest sample. This simple mechanism turns out to be surprisingly effective across a wide range of problems.
The Core Idea: Learning by Sampling
Imagine you’re choosing between three restaurants, and you’ve only eaten at each one a couple of times. You have some sense of how good each one is, but you’re not sure. Thompson sampling captures that uncertainty mathematically. For each restaurant, you maintain a probability distribution that represents your belief about how good it is. Early on, when you have little data, those distributions are wide and spread out, reflecting high uncertainty. As you eat at a restaurant more often and collect more experiences, the distribution tightens around a more confident estimate.
Here’s where the sampling comes in. Each time you need to pick a restaurant, you draw one random number from each restaurant’s distribution and go to whichever one produced the highest number. A restaurant you’re uncertain about has a wide distribution, so it occasionally generates a very high number, which means you’ll try it again. A restaurant you’ve tried many times and found mediocre has a tight distribution centered on a low value, so it rarely wins the draw. The algorithm naturally explores uncertain options and exploits known good ones, without any explicit rule telling it when to switch strategies.
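The restaurant loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the three "true quality" numbers are made up, and each restaurant's belief is a Beta distribution stored as a pair of counts.

```python
import random

# Hypothetical probability that a visit to each restaurant is enjoyable.
true_quality = [0.3, 0.5, 0.8]

# One Beta(1, 1) belief per restaurant: [successes + 1, failures + 1].
beliefs = [[1, 1] for _ in true_quality]

random.seed(0)
visits = [0] * len(true_quality)

for _ in range(2000):
    # Draw one random number from each restaurant's belief distribution...
    samples = [random.betavariate(a, b) for a, b in beliefs]
    # ...and go to whichever produced the highest number.
    choice = max(range(len(samples)), key=samples.__getitem__)
    visits[choice] += 1

    # Observe the outcome and update that restaurant's belief.
    if random.random() < true_quality[choice]:
        beliefs[choice][0] += 1  # success
    else:
        beliefs[choice][1] += 1  # failure

print(visits)  # visits concentrate on the best restaurant over time
```

After a few hundred rounds, the mediocre restaurants' beliefs tighten around low values and rarely win the draw, so most visits go to the best one.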
How the Math Works
Thompson sampling is rooted in Bayesian statistics. You start with a “prior” distribution for each option, representing what you believe before seeing any data. After each observation (a reward, a click, a patient outcome), you combine that observation with the prior to produce a “posterior” distribution. The posterior becomes your updated belief. Next round, you sample from the posterior instead of the prior, so your decisions reflect everything you’ve learned so far.
The most common setup uses the Beta distribution, which is convenient for binary outcomes like “clicked” or “didn’t click.” Each option starts with a Beta(1,1) distribution, which is flat and represents total ignorance. Every success adds 1 to the first parameter, and every failure adds 1 to the second parameter. After 7 successes and 3 failures, for example, the distribution becomes Beta(8,4), which peaks around 0.7. Sampling from this distribution will usually give a number near 0.7, but with some spread that reflects remaining uncertainty.
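The Beta update described above is just counting, which makes it easy to verify directly. This sketch reproduces the 7-success, 3-failure example from the text using only the standard library.

```python
import random
from statistics import mean

# Start from Beta(1, 1): total ignorance about the click rate.
alpha, beta = 1, 1

# Observe 7 successes and 3 failures, in any order.
for outcome in [1] * 7 + [0] * 3:
    if outcome:
        alpha += 1
    else:
        beta += 1

assert (alpha, beta) == (8, 4)  # matches the Beta(8, 4) from the text

# Samples from Beta(8, 4) cluster near its peak (mode = 7/10 = 0.7),
# with spread reflecting the remaining uncertainty.
random.seed(1)
draws = [random.betavariate(alpha, beta) for _ in range(10_000)]
print(round(mean(draws), 2))  # near the distribution's mean, 8/12 ≈ 0.67
```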
For continuous outcomes like revenue or ratings, the algorithm uses other distributions (often Gaussian) and the update rules become more involved, but the principle stays the same: maintain a distribution, observe an outcome, update the distribution, sample from it next time.
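For the Gaussian case, one tractable setup assumes each option's reward noise has a known variance, so the posterior over its mean stays Gaussian with a closed-form update. The class below is a sketch under that assumption; the two true means are illustrative.

```python
import random

NOISE_VAR = 1.0  # assumed known reward-noise variance

class GaussianArm:
    """Gaussian posterior over one option's mean reward."""
    def __init__(self, prior_mean=0.0, prior_var=100.0):
        self.mean = prior_mean  # posterior mean
        self.var = prior_var    # posterior variance (our uncertainty)

    def sample(self):
        return random.gauss(self.mean, self.var ** 0.5)

    def update(self, reward):
        # Conjugate update for a Gaussian mean with known noise variance:
        # precisions (inverse variances) add, and the mean is their
        # precision-weighted average.
        precision = 1.0 / self.var + 1.0 / NOISE_VAR
        self.mean = (self.mean / self.var + reward / NOISE_VAR) / precision
        self.var = 1.0 / precision

random.seed(2)
true_means = [0.0, 1.0]  # illustrative
arms = [GaussianArm() for _ in true_means]
pulls = [0, 0]

for _ in range(500):
    choice = max(range(2), key=lambda i: arms[i].sample())
    pulls[choice] += 1
    arms[choice].update(random.gauss(true_means[choice], NOISE_VAR ** 0.5))

print(pulls)  # pulls shift toward the higher-mean option
```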
Why It Outperforms Simpler Strategies
The most obvious alternative strategies have clear weaknesses. A purely greedy algorithm always picks the option with the best average so far, which means it can lock onto a mediocre option early and never discover a better one. An epsilon-greedy algorithm explores randomly some fixed percentage of the time, but it wastes exploration budget on options that are clearly bad and doesn’t explore more when uncertainty is high.
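For contrast, epsilon-greedy's selection rule fits in a few lines; note that its exploration is a coin flip that ignores uncertainty entirely, so a clearly bad option is explored just as often as a promising one.

```python
import random

def epsilon_greedy_choice(means, epsilon=0.1, rng=random):
    """Pick the best-looking option, except explore uniformly at random
    a fixed epsilon fraction of the time."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))  # blind exploration
    return max(range(len(means)), key=means.__getitem__)
```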
Thompson sampling avoids both traps because the amount of exploration is proportional to uncertainty. When you know very little, you explore a lot. When you’re confident about which option is best, you almost always exploit it. This happens automatically through the shape of the distributions.
Mathematically, Thompson sampling achieves what’s called asymptotic optimality. This means the total “regret” (how much reward you miss out on compared to always picking the best option) grows only logarithmically over time. In practical terms, the penalty for not knowing the best option up front shrinks rapidly as the algorithm learns. Formal regret analyses have confirmed that Thompson sampling achieves this optimal logarithmic bound for a broad family of reward distributions.
Where Thompson Sampling Gets Used
The algorithm’s most visible application is A/B testing and website optimization. Traditional A/B tests split traffic evenly between variants for a fixed period, then pick the winner. Thompson sampling instead shifts traffic toward the better-performing variant as evidence accumulates. This means fewer users see the inferior option, which directly translates to less lost revenue or fewer bad experiences during the test.
Online advertising uses Thompson sampling to decide which ad to show a given user. Each ad has an estimated click-through rate with some uncertainty, and the algorithm balances showing proven ads against testing newer ones that might perform better. Recommendation systems use a similar approach to surface content.
In healthcare, Thompson sampling powers adaptive clinical trial designs, where patients are gradually steered toward the treatment arm that appears more effective. This is ethically appealing because fewer participants receive the inferior treatment. However, multiple simulation studies have found that basic Thompson sampling can underperform in typical clinical trial sizes (hundreds to thousands of patients rather than millions of web visitors). Researchers have developed accelerated versions of the algorithm specifically to improve its short-run performance in these smaller-sample settings, including situations where patients can only be enrolled in batches rather than one at a time.
Thompson Sampling vs. UCB
The main competitor to Thompson sampling is a family of algorithms called Upper Confidence Bound (UCB). UCB takes a deterministic approach: it calculates an optimistic estimate for each option (the average reward plus a bonus for uncertainty) and always picks the option with the highest estimate. Both approaches achieve the same theoretical regret bounds, so the choice between them comes down to practical considerations.
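UCB's "average plus uncertainty bonus" rule can be sketched concretely. This is the classic UCB1 variant; the bonus term sqrt(2 ln t / n) grows for options that have been tried rarely relative to the current round.

```python
import math

def ucb1_choice(counts, totals, t):
    """UCB1 selection rule.

    counts[i]: number of times option i has been tried
    totals[i]: summed reward from option i
    t:         current round number (1-indexed)
    """
    # Try every option once before trusting any average.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Unlike Thompson sampling, this rule is deterministic: given the same counts and totals, it always picks the same option.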
Thompson sampling tends to be easier to implement when reward models are complex, because you just need to be able to sample from a posterior distribution. UCB requires computing confidence bounds, which can be harder to derive for non-standard problems. Thompson sampling also tends to perform better empirically in many real-world benchmarks, partly because its randomized exploration is less predictable and can escape local patterns that trap deterministic methods.
On the other hand, UCB has stronger finite-time guarantees in some settings and is easier to analyze theoretically because it’s deterministic. In practice, many engineering teams default to Thompson sampling for its simplicity and strong empirical results.
Practical Considerations
Thompson sampling works best when you can update your distributions quickly after each observation. In systems with delayed feedback (a customer buys a product three days after seeing an ad), the algorithm may over-explore because it hasn’t yet incorporated recent outcomes. Batched updates and careful handling of delayed rewards are common engineering challenges.
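For Beta posteriors, handling delayed feedback in batches is straightforward because the update is just counting: outcomes that arrive late can be folded into the posterior all at once. A minimal sketch, with made-up numbers:

```python
def apply_batch(alpha, beta, successes, failures):
    """Fold a whole batch of delayed outcomes into a Beta posterior."""
    return alpha + successes, beta + failures

alpha, beta = 1, 1
# e.g. a day's worth of delayed conversions arriving together:
alpha, beta = apply_batch(alpha, beta, successes=12, failures=38)
print(alpha, beta)  # -> 13 39
```

The cost of batching is that decisions made within a batch don't benefit from each other's outcomes, which is exactly why the algorithm can over-explore in the meantime.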
Choosing the right prior matters most when data is scarce. A prior that’s too confident can slow down learning, while an uninformative prior works fine in most cases because the data will quickly dominate. For binary outcomes, Beta(1,1) is the standard starting point. For continuous outcomes, a prior centered on zero with moderate variance is typical.
The algorithm also scales naturally to contextual problems, where the best option depends on who the user is. In contextual Thompson sampling, each option’s reward model takes user features as input, and the posterior is over model parameters rather than a single reward rate. This makes it suitable for personalization tasks where a one-size-fits-all recommendation isn’t enough.
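The contextual idea can be illustrated in its simplest form: a small discrete set of contexts, with a separate Beta belief for each (context, option) pair. Real systems typically use parametric reward models (often linear in the user features) with a posterior over the model parameters, as described above; the context names and click rates here are invented for illustration.

```python
import random

random.seed(3)
contexts = ["mobile", "desktop"]
options = [0, 1]

# One Beta(1, 1) belief per (context, option) pair.
beliefs = {(c, o): [1, 1] for c in contexts for o in options}

# Hypothetical true click rates: option 0 wins on mobile, option 1 on desktop.
true_rate = {("mobile", 0): 0.7, ("mobile", 1): 0.3,
             ("desktop", 0): 0.2, ("desktop", 1): 0.6}

correct = 0
for t in range(2000):
    c = contexts[t % 2]  # alternate user contexts
    # Sample from each option's belief *for this context* and pick the max.
    samples = {o: random.betavariate(*beliefs[(c, o)]) for o in options}
    choice = max(samples, key=samples.get)
    correct += choice == (0 if c == "mobile" else 1)
    # Observe a click or no-click and update the matching belief.
    if random.random() < true_rate[(c, choice)]:
        beliefs[(c, choice)][0] += 1
    else:
        beliefs[(c, choice)][1] += 1

print(correct)  # the context-appropriate option wins most rounds
```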

