What Is Bayesian Optimization and How Does It Work?

Bayesian optimization is a strategy for finding the best result when each experiment or evaluation is expensive, slow, or both. Instead of testing thousands of possibilities at random, it uses probability to make smart guesses about where the optimal answer likely sits, then checks only the most promising candidates. The technique is built on two core components: a surrogate model that approximates the thing you’re trying to optimize, and an acquisition function that decides which point to test next.

The Problem It Solves

Many real-world optimization problems are costly to evaluate. Training a machine learning model with a particular set of settings might take hours or days. Running a physical experiment in a lab might cost thousands of dollars. Simulating airflow over a wing design might tie up a supercomputer. In these situations, you can’t afford to try every combination. You need a method that finds a great answer in as few tries as possible.

Traditional approaches like grid search (testing evenly spaced options) or random search (testing options at random) don’t learn from their own results. Bayesian optimization does. After each evaluation, it updates its understanding of the landscape and becomes increasingly precise about where to look next. This makes it especially powerful when you have a limited budget of evaluations, typically somewhere between a handful and a few hundred.

How the Surrogate Model Works

The surrogate model is a cheap-to-evaluate stand-in for the expensive real function. Think of it as a sketch artist working from witness descriptions: with each new piece of evidence, the sketch gets more accurate. The most common surrogate is a Gaussian process, which doesn’t just predict a single value at each point: it produces a best guess together with an uncertainty estimate. In areas where you’ve already tested, the predictions are tight and confident. In areas you haven’t explored, the predictions are wide and uncertain.

A Gaussian process starts with a broad prior assumption about how the function might behave (smooth, wiggly, periodic, etc.), then updates that assumption every time it sees a real data point. The result is a full probability distribution over possible outcomes at every location in the search space. This dual output of “best guess” plus “how sure am I” is what makes the whole system work, because the acquisition function needs both pieces of information to decide where to sample next.
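As a concrete sketch, the posterior mean and uncertainty of a one-dimensional Gaussian process with a squared-exponential kernel can be computed in a few lines. The kernel choice, lengthscale, and toy observations here are illustrative assumptions, not recommendations:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel: covariance decays with distance,
    # encoding the prior assumption that the function is smooth.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Standard GP regression: condition the prior on the observed data.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    K_ss = rbf_kernel(x_query, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Three "expensive" observations of a hidden function (toy values).
x_obs = np.array([1.0, 2.5, 4.0])
y_obs = np.sin(x_obs)

# Query one location we have already tested and one far from any data.
mean, std = gp_posterior(x_obs, y_obs, np.array([1.0, 8.0]))
```

Here `std[0]`, at an observed location, is close to zero, while `std[1]`, far from any data, reverts to the wide prior: exactly the tight-versus-uncertain behavior described above.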

Choosing Where to Sample Next

The acquisition function is the decision-making layer. It takes the surrogate model’s predictions and uncertainties and converts them into a single score for every candidate point, answering the question: “If I could only run one more experiment, where should I run it?”

The most widely used acquisition function is called expected improvement. It computes the average amount by which a candidate point is expected to beat the best result you’ve found so far, counting outcomes that fall short as zero improvement. A point scores high in two ways: if the surrogate model predicts it will probably be good (exploitation), or if the surrogate model is highly uncertain about it (exploration). Expected improvement naturally balances both of these impulses. A point in unexplored territory might turn out to be great, and a point near a known good region might be even better than what you’ve already seen.

A simpler alternative, probability of improvement, only cares whether a point will beat the current best, not by how much. This can cause the optimizer to get stuck making tiny incremental gains rather than discovering a much better region elsewhere. Expected improvement avoids this trap by weighting larger gains more heavily.
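Under a Gaussian surrogate, both acquisition functions have closed forms. The sketch below (written for maximization, with an illustrative exploration margin `xi`) shows how expected improvement rewards uncertainty while probability of improvement ignores the size of the potential gain:

```python
import math

def _pdf(z):
    # Standard normal probability density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal cumulative distribution via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.01):
    # Average amount by which this point beats the incumbent `best`,
    # integrated over the surrogate's predictive distribution.
    if sigma == 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * _cdf(z) + sigma * _pdf(z)

def probability_of_improvement(mu, sigma, best, xi=0.01):
    # Chance of beating the incumbent at all, regardless of margin.
    if sigma == 0.0:
        return float(mu - best - xi > 0.0)
    return _cdf((mu - best - xi) / sigma)

# Two candidates with the same predicted mean, below the current best:
# the uncertain one scores far higher under expected improvement.
confident = expected_improvement(mu=0.5, sigma=0.01, best=1.0)
uncertain = expected_improvement(mu=0.5, sigma=1.0, best=1.0)
```

The confident-but-mediocre candidate gets an expected improvement of essentially zero, while the uncertain one scores around 0.19: the exploration bonus in action.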

The Optimization Loop Step by Step

The full process is iterative and surprisingly compact:

  • Start with a few initial points. You evaluate the real function at a small number of randomly chosen or strategically placed locations to give the surrogate something to learn from.
  • Fit the surrogate model. The Gaussian process (or alternative) is trained on all data collected so far, producing predictions and uncertainty estimates across the entire search space.
  • Maximize the acquisition function. The acquisition function is evaluated (cheaply) over the search space to find the single point most worth testing next.
  • Evaluate the real function. You run the expensive experiment or computation at that chosen point and record the result.
  • Update and repeat. The new data point is added to the dataset, the surrogate is re-fit, and the loop continues until you run out of budget or the improvements become negligible.
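The steps above can be sketched end to end. Everything here is illustrative: a toy objective stands in for the expensive function, a minimal Gaussian-process surrogate uses a fixed kernel, and a brute-force random candidate search replaces a proper acquisition optimizer:

```python
import math
import numpy as np

def objective(x):
    # Stand-in for the expensive function; its maximum is at x = 2.
    return -(x - 2.0) ** 2

def kernel(a, b):
    # Squared-exponential kernel with unit lengthscale (an assumption).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def surrogate(x_train, y_train, x_query, noise=1e-6):
    # Step 2: fit the GP and predict mean/uncertainty at query points.
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = kernel(x_train, x_query)
    mu = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.einsum("ij,ij->j", K_s, np.linalg.solve(K, K_s))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    # Step 3: closed-form EI under the Gaussian predictive distribution.
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

rng = np.random.default_rng(0)

# Step 1: a few initial points give the surrogate something to learn from.
x_data = rng.uniform(0.0, 5.0, size=3)
y_data = objective(x_data)

for _ in range(10):
    # Cheap acquisition search over random candidates in the search space.
    candidates = rng.uniform(0.0, 5.0, size=500)
    mu, sigma = surrogate(x_data, y_data, candidates)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y_data.max()))]
    # Step 4: run the "expensive" evaluation at the chosen point.
    x_data = np.append(x_data, x_next)
    y_data = np.append(y_data, objective(x_next))
    # Step 5: the next iteration re-fits the surrogate on the enlarged dataset.

best_x = x_data[np.argmax(y_data)]
```

After thirteen total evaluations the best point found sits close to the true optimum at 2, far fewer evaluations than a grid over [0, 5] would need for the same precision.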

Each iteration makes the surrogate more accurate, which makes the acquisition function’s recommendations sharper. Early iterations tend to explore broadly. Later iterations increasingly focus on refining the best regions found so far.

Beyond Gaussian Processes

Gaussian processes are the default surrogate, but they have a notable limitation: computational cost scales with the cube of the number of data points. With a few hundred evaluations this is fine, but as datasets grow, the math becomes expensive. Gaussian processes also struggle in very high-dimensional search spaces or when the variables include categories (like choosing between “Adam” and “SGD” as an optimizer) rather than continuous numbers.

An alternative called the Tree-structured Parzen Estimator (TPE) handles these cases more gracefully. Rather than modeling the function directly, TPE splits previous results into “good” and “bad” groups and models the probability of each separately. It scales linearly with the number of evaluations instead of cubically, and it naturally handles mixtures of continuous and categorical variables. This makes TPE a popular choice in machine learning hyperparameter tuning, where search spaces often include both types.
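A stripped-down illustration of the TPE idea follows. The quantile split, bandwidth, and search range are arbitrary choices for the sketch; production implementations such as Optuna’s are considerably more refined:

```python
import math
import random

def tpe_suggest(trials, gamma=0.25, n_candidates=100, bandwidth=0.3,
                lo=0.0, hi=5.0):
    # trials: list of (x, loss) pairs. Split into the "good" fraction
    # (lowest losses) and the "bad" remainder.
    ranked = sorted(trials, key=lambda t: t[1])
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]] or good

    def density(x, centers):
        # Fixed-bandwidth Gaussian kernel density estimate.
        return sum(
            math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers
        ) / (len(centers) * bandwidth)

    # Draw candidates from the "good" density, then keep the one that
    # maximizes the ratio l(x) / g(x): likely good, unlikely bad.
    candidates = [
        min(max(random.gauss(random.choice(good), bandwidth), lo), hi)
        for _ in range(n_candidates)
    ]
    return max(candidates,
               key=lambda x: density(x, good) / (density(x, bad) + 1e-12))

random.seed(0)
# Toy loss with its minimum at x = 2 (an assumption for the demo).
trials = [(x, (x - 2.0) ** 2) for x in (0.5, 2.5, 4.5)]
for _ in range(20):
    x = tpe_suggest(trials)
    trials.append((x, (x - 2.0) ** 2))
best_x = min(trials, key=lambda t: t[1])[0]
```

Note that the sorting and density estimates scale linearly in the number of trials, and because the “good”/“bad” split only needs samples and density ratios, the same idea extends directly to categorical choices.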

Where Bayesian Optimization Is Used

The technique’s sweet spot is any problem where evaluations are expensive and the search space has roughly 20 dimensions or fewer. Machine learning hyperparameter tuning is the most common application: choosing learning rates, layer sizes, regularization strengths, and similar settings that collectively determine how well a model performs. Research has shown that feature selection methods with hyperparameters tuned using Bayesian optimization often yield better recall rates, and in studies on transcriptomic data for Alzheimer’s disease, Bayesian optimization-guided selection improved the accuracy of disease risk prediction models.

Beyond machine learning, Bayesian optimization shows up in drug discovery (finding promising molecular configurations), materials science (identifying alloy compositions with desired properties), robotics (tuning controller parameters), and engineering design (optimizing shapes or process conditions). In each case, the common thread is the same: evaluations are too costly to do thousands of times, so every experiment needs to count.

Software Tools for Getting Started

Several Python libraries make Bayesian optimization accessible without requiring you to implement the math from scratch. Optuna is one of the most popular, using TPE by default and supporting both simple scripts and distributed computing. BoTorch, built on PyTorch, offers more flexibility for researchers who want to customize their surrogate models or acquisition functions. Scikit-Optimize (and its descendants like ProcessOptimizer) provides a scikit-learn-compatible interface that feels familiar to most Python data scientists. The right choice depends on whether you need simplicity, customization, or integration with a particular framework.

Limitations Worth Knowing

Bayesian optimization is not a universal solution. It works best with continuous, relatively low-dimensional search spaces. Once you exceed roughly 15 to 20 dimensions, the surrogate model has a harder time building an accurate picture of the landscape, and the advantages over simpler methods like random search start to shrink. It also assumes that the function you’re optimizing is reasonably smooth. If the true landscape is wildly noisy or discontinuous, the surrogate’s predictions will be unreliable.

The technique is also sequential by nature: each new evaluation depends on all previous results, which makes it difficult to parallelize. Some extensions allow batched evaluations (testing several points at once), but this adds complexity and somewhat reduces efficiency. For problems where evaluations are cheap and plentiful, simpler methods like random search or evolutionary algorithms are often faster in wall-clock time, even if they need more total evaluations to converge.