What Is a Multi-Armed Bandit and How Does It Work?

A multi-armed bandit is a decision-making framework where you have several options to choose from, each with an unknown payoff, and you need to figure out which one is best while also making the most of what you already know. The name comes from slot machines, which were nicknamed “one-armed bandits.” Imagine standing in front of a row of slot machines, each with different (hidden) odds of paying out. Do you keep pulling the lever on the machine that’s been winning, or try a new one that might pay even better?

That tension sits at the heart of nearly every recommendation engine, ad platform, and adaptive experiment you encounter online. It’s also one of the most studied problems in statistics and machine learning.

The Exploration-Exploitation Trade-Off

Every multi-armed bandit problem comes down to one core dilemma: exploration versus exploitation. Exploitation means sticking with the option that looks best so far. It’s the greedy, short-sighted move. Exploration means trying options you haven’t tested much, hoping to discover something better. It’s the information-gathering, far-sighted move.

Neither extreme works well on its own. Too much exploitation and you miss hidden gems you never bothered to try. Too much exploration and you waste time on duds when you could have been cashing in on a known winner. The entire field of bandit algorithms is essentially about finding the smartest way to balance these two goals.

How the Problem Works

In formal terms, you have a set of K options (called “arms,” keeping the slot machine metaphor). Each arm, when pulled, gives you a reward drawn from some probability distribution you can’t see. One arm might pay out 30% of the time on average, another 45%, another only 10%, but you don’t know any of these numbers in advance. All you can do is pull arms, observe what happens, and update your strategy.

Your goal is to maximize the total reward you collect over a fixed number of rounds. The way researchers measure how well an algorithm performs is through a concept called regret. Regret is the gap between what you actually earned and what you would have earned if you’d known the best arm from the start and pulled it every single time. A perfect algorithm would have zero regret, but that’s impossible since you have to spend some rounds learning. The best algorithms keep regret growing as slowly as possible over time.
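The regret calculation can be made concrete with a small sketch. Assuming Bernoulli arms (each pull pays 1 or 0), the expected regret of a sequence of choices is just the per-round gap between the best arm’s payout rate and the chosen arm’s, summed over rounds; the specific payout rates below are illustrative:

```python
def regret_of_policy(probs, choices):
    """Expected regret of a sequence of arm choices: the gap between
    always pulling the best arm and the arms actually pulled."""
    best = max(probs)
    return sum(best - probs[arm] for arm in choices)

# Hidden payout rates (unknown to the player); arm 1 is best at 45%.
probs = [0.30, 0.45, 0.10]

# Always pulling arm 0 forgoes 0.15 expected reward per round:
print(round(regret_of_policy(probs, [0] * 1000), 2))  # 150.0
# Always pulling the best arm accrues zero regret:
print(regret_of_policy(probs, [1] * 1000))  # 0.0
```

In practice you never know `probs`, so you can’t compute regret during play; it’s a yardstick used in analysis and simulation.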

Common Strategies for Solving It

Several well-known algorithms tackle the bandit problem, each with a different philosophy.

Epsilon-Greedy

The simplest approach. Most of the time (say, 90% of rounds), you pull whichever arm has the highest average reward so far. The remaining 10% of the time, you pick an arm at random. The randomness ensures you keep exploring. It’s easy to implement but not particularly efficient, because it explores blindly rather than targeting the arms where more information would be most useful.
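A minimal sketch of the selection rule, assuming you track a pull count and reward total per arm (the running-average numbers below are made up for illustration):

```python
import random

def epsilon_greedy(counts, sums, epsilon, rng):
    """With probability epsilon pick a random arm (explore);
    otherwise pick the arm with the best observed average (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    def avg(a):
        return sums[a] / counts[a] if counts[a] else 0.0
    return max(range(len(counts)), key=avg)

# After some play, arm 1 has the best observed average (4.5 / 10 = 0.45),
# so a non-exploring round picks it.
counts, sums = [10, 10, 10], [3.0, 4.5, 1.0]
arm = epsilon_greedy(counts, sums, epsilon=0.1, rng=random.Random(0))
```

After each pull you’d update `counts[arm]` and `sums[arm]` with the observed reward; the 10%/90% split from the text corresponds to `epsilon=0.1`.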

Upper Confidence Bound (UCB)

UCB takes a smarter approach: be optimistic about uncertainty. For each arm, the algorithm calculates two things. First, the average reward you’ve seen from that arm. Second, a “bonus” that’s larger when you haven’t tried that arm very often. It then picks whichever arm has the highest combined score. An arm you’ve barely tested gets a big uncertainty bonus, making it attractive to try. An arm you’ve pulled hundreds of times has a tiny bonus, so it only gets chosen if its actual average is high. This naturally balances exploration and exploitation without any randomness, and it comes with strong mathematical guarantees on how slowly regret grows.
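The classic UCB1 variant of this idea can be sketched as follows; the bonus term `sqrt(2 * ln(t) / n)` shrinks as an arm’s pull count `n` grows, and the example counts are invented for illustration:

```python
import math

def ucb1_select(counts, sums, t):
    """UCB1: average reward plus an uncertainty bonus that shrinks
    as an arm is pulled more often. t is the total rounds so far."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # try every arm at least once first
    def score(a):
        bonus = math.sqrt(2 * math.log(t) / counts[a])
        return sums[a] / counts[a] + bonus
    return max(range(len(counts)), key=score)

# Arm 2 has a mediocre average (0.5) but only 2 pulls, so its large
# uncertainty bonus makes it the pick over the well-sampled arms.
arm = ucb1_select([100, 100, 2], [30.0, 45.0, 1.0], t=202)
```

Once every arm is well sampled, the bonuses become nearly equal and the choice is driven by the averages alone, which is the exploitation regime.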

Thompson Sampling

This approach uses probability in a different way. For each arm, the algorithm maintains a belief about what the true reward rate might be, represented as a probability distribution. Each round, it randomly samples from each arm’s distribution and picks the arm whose sample is highest. Arms with uncertain reward rates will occasionally produce high samples (leading to exploration), while arms with consistently high rewards will produce high samples most of the time (leading to exploitation). Thompson Sampling tends to perform extremely well in practice and is widely used in industry.
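For Bernoulli rewards, the standard formulation keeps a Beta distribution per arm, updated from that arm’s win/loss counts. A minimal sketch, with illustrative payout rates:

```python
import random

def thompson_select(successes, failures, rng):
    """Sample a plausible reward rate for each arm from its Beta
    posterior and play the arm whose sample is highest."""
    samples = [rng.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

rng = random.Random(0)
probs = [0.30, 0.45, 0.10]  # hidden payout rates (example values)
wins, losses = [0, 0, 0], [0, 0, 0]
for _ in range(2000):
    arm = thompson_select(wins, losses, rng)
    if rng.random() < probs[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
# Pull counts concentrate on arm 1 as its posterior sharpens.
```

Early on, every arm’s Beta distribution is wide, so every arm occasionally wins the sampling lottery; as evidence accumulates, the best arm’s distribution narrows around a high value and dominates.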

Contextual Bandits

The basic bandit problem assumes every round is identical: same arms, same conditions. Contextual bandits add a crucial layer by incorporating information about the current situation. Before choosing an arm, the algorithm observes some context, like who the user is, what they searched for, or what time of day it is.

A search engine displaying ads is a classic example. Each time a user types a query, the engine has to pick which ad to show. The probability of a click depends on both the ad and the query. Similar queries should produce similar click-through rates for the same ad. The algorithm learns these relationships over time, using context to make better choices from the start rather than treating every new situation as if it knows nothing. Google’s research on this problem formalized the idea that ads and queries exist in a measurable space where nearby points behave similarly, letting the algorithm generalize from past experience.
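To show the structure (though not the generalization across similar contexts that production systems rely on), here is a deliberately simple sketch that keeps separate statistics per (context, arm) pair and runs epsilon-greedy within each context; all names are illustrative:

```python
import random
from collections import defaultdict

def contextual_choice(stats, context, arms, epsilon, rng):
    """Toy contextual bandit: epsilon-greedy over per-(context, arm)
    reward averages. Real systems instead learn a model that lets
    similar contexts share information."""
    if rng.random() < epsilon:
        return rng.choice(arms)
    def avg(arm):
        count, total = stats[(context, arm)]
        return total / count if count else 0.0
    return max(arms, key=avg)

def update(stats, context, arm, reward):
    count, total = stats[(context, arm)]
    stats[(context, arm)] = (count + 1, total + reward)

# e.g. context = the query category, arms = candidate ads
stats = defaultdict(lambda: (0, 0.0))
update(stats, "shoes", "ad_a", 1.0)
update(stats, "shoes", "ad_b", 0.0)
```

The weakness of this per-context table is obvious: a query it has never seen starts from scratch, which is exactly the problem that treating ads and queries as points in a measurable space is meant to solve.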

How Bandits Compare to A/B Testing

If you’ve heard of A/B testing, bandits solve a related but fundamentally different problem. In a traditional A/B test, you split your traffic evenly between two or more variants for the entire duration of the experiment. Even if variant A is clearly outperforming variant B after the first day, you keep sending half your users to B until the test formally ends. This is fine for learning which variant is better, but it costs you conversions while you wait.

A bandit algorithm starts shifting traffic toward the better-performing variant as soon as evidence emerges. It still sends some traffic to the other options (exploration), but increasingly less over time. This makes bandits ideal for time-sensitive scenarios: a flash sale, a limited marketing campaign, or any situation where you can’t afford to keep showing a clearly worse option to half your audience just to reach statistical significance. The trade-off is that bandits are optimized for maximizing reward during the experiment, not for producing the cleanest statistical comparison afterward.

Bandits in Clinical Trials

One of the most consequential applications is in medicine. Traditional clinical trials use fixed randomization, assigning patients to treatments in predetermined proportions regardless of how the treatments are performing. Bandit-based adaptive trials can shift more patients toward the treatment that appears to be working better, while still collecting enough data to draw conclusions.

The results are meaningful. In simulation studies, bandit-based designs using an approach called the Gittins index assigned substantially more patients to superior treatments and improved the number of successfully treated patients by roughly 19% compared to fixed randomization. The catch is statistical power: purely reward-maximizing bandit designs tend to be worse at generating the level of certainty regulators require to approve a treatment. Hybrid approaches that blend bandit allocation with traditional randomization have shown promise, achieving over 80% statistical power while still directing more patients to better treatments during the trial itself.

Non-Stationary Environments

Standard bandit algorithms assume the reward probabilities for each arm stay constant. The best slot machine today is the best slot machine tomorrow. In many real-world applications, that assumption breaks down. User preferences shift, ad performance fluctuates with seasons, and the effectiveness of a recommendation changes as trends evolve.

These shifting conditions are known as non-stationary bandits, and they require algorithms that can detect and adapt to change. The changes can take several forms: sudden shifts where the best option flips overnight, incremental drift where rewards slowly evolve, gradual transitions where the system alternates between old and new patterns before settling, or recurring cycles where old patterns come back. Algorithms designed for these environments typically use techniques like discounting older observations so recent data carries more weight, or using sliding windows that only consider the last N rounds of data. The field is still actively developing better solutions for these harder versions of the problem.
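The sliding-window idea can be sketched in a few lines: keep only the last N rewards per arm, so estimates track the arm’s recent behavior rather than its whole history (the window size and reward sequence below are illustrative):

```python
from collections import deque

class SlidingWindowArm:
    """Track an arm's reward average over only the last `window`
    pulls, so stale observations age out as conditions change."""
    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)  # old entries fall off

    def record(self, reward):
        self.rewards.append(reward)

    def average(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

# An arm that used to pay well but has recently gone cold:
arm = SlidingWindowArm(window=5)
for r in [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]:
    arm.record(r)
print(arm.average())  # window holds only the recent zeros -> 0.0
```

The discounting alternative mentioned above replaces the hard window with an exponentially weighted average, e.g. `estimate = (1 - alpha) * estimate + alpha * reward`, which fades old data smoothly instead of dropping it abruptly.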

Where You Encounter Bandits

Bandit algorithms run behind the scenes in more places than most people realize. News sites use them to choose which headlines to display. Streaming platforms use them to decide which thumbnail image to show for a movie. E-commerce sites use them to pick product recommendations, promotional offers, and page layouts. In all these cases, the platform is continuously experimenting on you in small ways, nudging traffic toward whatever produces the most clicks, purchases, or engagement.

Beyond the web, bandits show up in network routing (choosing the fastest path for data packets), dynamic pricing (adjusting prices to maximize revenue), and resource allocation in computing (deciding how to distribute processing power across tasks). Any scenario where you face repeated choices with uncertain payoffs and want to learn while doing is, at its core, a multi-armed bandit problem.