What Is Bagging and Boosting in Machine Learning?

Bagging and boosting are two strategies for combining multiple weak machine learning models into a single, stronger one. Both fall under the umbrella of “ensemble learning,” but they work in fundamentally different ways: bagging trains models independently and in parallel so that their random errors average out, while boosting trains models one after another, with each new model focusing on the mistakes the previous ones made. Understanding how they differ helps you choose the right approach for a given problem.

How Bagging Works

Bagging, short for bootstrap aggregating, was introduced by Leo Breiman in 1996. The core idea is straightforward: instead of training one model on your full dataset, you create many slightly different versions of the dataset and train a separate model on each one. Then you combine their predictions.

It follows three steps:

  • Bootstrapping. You create multiple new training sets by randomly sampling from the original data with replacement. “With replacement” means the same data point can appear more than once in a given sample, so each sample is a slightly different remix of the original.
  • Parallel training. A separate model is trained on each of these bootstrap samples, independently and at the same time. Because no model depends on another, this step can be run in parallel across multiple processors.
  • Aggregation. The predictions from all models are combined. For a regression task (predicting a number), you take the average of all predictions. For a classification task (picking a category), you use majority voting: whichever class gets the most votes wins.
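The three steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not a production implementation: the dataset, the `bootstrap` and `fit_stump` helpers, and the ensemble size of 25 are all made up for the example, and the "model" is just a one-threshold classifier.

```python
import random
from collections import Counter

random.seed(0)
# toy 1-D dataset: label is 0 below x = 5, and 1 at or above it
data = [(x, 0 if x < 5 else 1) for x in range(10)]

def bootstrap(points):
    # step 1: sample with replacement, so a point can appear more than once
    return [random.choice(points) for _ in range(len(points))]

def fit_stump(train):
    # weak base model: pick the threshold with the fewest training errors
    best_t = min(range(11), key=lambda t: sum((1 if x >= t else 0) != y
                                              for x, y in train))
    return lambda x: 1 if x >= best_t else 0

# step 2: each model depends only on its own bootstrap sample, so in a
# real implementation these fits could run in parallel
models = [fit_stump(bootstrap(data)) for _ in range(25)]

def predict(x):
    # step 3: classification, so aggregate by majority vote
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

print(predict(2), predict(8))  # → 0 1
```

For regression, the only change to step 3 is replacing the vote with an average of the models' numeric predictions.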

The most well-known bagging algorithm is Random Forest, which applies this process using decision trees as its base models. Bagging works best with complex, fully developed decision trees that individually tend to overfit. By averaging many overfitted-but-different models together, the random errors cancel out and you get a more reliable result.

How Boosting Works

Boosting takes the opposite philosophy. Instead of training models independently, it trains them sequentially, where each new model is specifically designed to fix the errors of the ones before it. After each round, the algorithm increases the weight on data points that were misclassified, forcing the next model to pay more attention to the hard cases.

Because of this sequential, error-correcting design, boosting typically uses very simple base models. A common choice is the “decision stump,” a decision tree with just one split. These stumps are individually terrible predictors, but boosting chains dozens or hundreds of them together, each one nudging the overall prediction in the right direction. The final prediction is a weighted combination of all these small corrections.
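One round of that reweighting can be sketched with the AdaBoost update rule. This is a simplified illustration under assumed inputs: `y_true` and `y_pred` are a made-up set of ±1 labels and one weak model's predictions, with the last two points misclassified.

```python
import math

y_true = [1, 1, -1, -1, 1]   # toy labels, encoded as +1 / -1
y_pred = [1, 1, -1, 1, -1]   # one weak learner's output; last two are wrong
weights = [1 / len(y_true)] * len(y_true)  # start with uniform weights

# weighted error of this round's weak learner
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# alpha is this model's "say" in the final weighted combination:
# lower error means a larger alpha
alpha = 0.5 * math.log((1 - err) / err)

# upweight misclassified points (t * p = -1), downweight correct ones,
# then renormalize so the weights sum to 1 for the next round
weights = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]
```

After the update, the misclassified points carry more weight than the correct ones, so the next weak learner is pushed toward exactly the cases this one got wrong.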

The most widely used boosting algorithms are AdaBoost (the original, developed by Freund and Schapire), gradient boosting, XGBoost, and LightGBM. XGBoost and LightGBM in particular dominate structured data competitions and are workhorses in industry applications like fraud detection, recommendation systems, and pricing models.

What Each Method Actually Fixes

Every prediction error in machine learning can be broken into two controllable components: bias (the model is systematically wrong because it’s too simple to capture the pattern) and variance (the model is too sensitive to the specific training data and gives wildly different results on new data), plus irreducible noise that no model can remove. Bagging and boosting target different sides of this tradeoff.

Bagging primarily reduces variance. If you have a model that’s powerful but unstable, one that changes dramatically depending on which data it sees, bagging smooths out that instability by averaging across many versions. It doesn’t do much for a model that’s fundamentally too simple to capture the pattern in the first place.
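The variance-reduction effect can be seen numerically. The sketch below stands in for a high-variance model with an unbiased but noisy predictor; the noise level and ensemble size are arbitrary, and real bagged models are correlated (they share most of their training data), so the reduction in practice is smaller than this idealized independent case.

```python
import random
import statistics

random.seed(1)
TRUE_VALUE = 10.0

def noisy_model():
    # stand-in for one unstable model: right on average, but noisy
    return TRUE_VALUE + random.gauss(0, 3)

# spread of a single model's predictions vs. a 50-model average
single = [noisy_model() for _ in range(2000)]
bagged = [statistics.mean(noisy_model() for _ in range(50))
          for _ in range(2000)]

print(statistics.stdev(single))  # ≈ 3
print(statistics.stdev(bagged))  # ≈ 3 / sqrt(50), roughly 0.42
```

Averaging 50 independent predictors shrinks the standard deviation by a factor of about √50; note that the average is still centered on `TRUE_VALUE`, which mirrors the point that bagging does nothing for bias.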

Boosting reduces both bias and variance for unstable models. Because each new model specifically targets the remaining errors, it can gradually learn complex patterns that a single simple model would miss entirely. However, empirical research has shown that boosting can actually increase variance for models that are already stable, which is one reason it sometimes backfires on certain problems.

Overfitting Risk

This is one of the most important practical differences. Bagging is naturally resistant to overfitting. By training on different random subsets and averaging the results, it smooths out noise in the data. A comparative analysis published in Scientific Reports found that as ensemble size increased from 20 to 200 models on the MNIST image dataset, bagging’s accuracy improved slightly (0.932 to 0.933) and then plateaued, never degrading.

Boosting, on the other hand, showed a stronger improvement on the same data (0.930 to 0.961) but eventually showed signs of overfitting. This makes sense given how it works: by repeatedly focusing on the hardest-to-classify data points, boosting can start memorizing noise and outliers rather than learning genuine patterns. Noisy datasets are particularly dangerous for boosting because mislabeled or anomalous data points get amplified by the reweighting process.

Speed and Scalability

Bagging has a clear structural advantage in training speed. Because each model trains independently, you can distribute the work across multiple processors. Research confirms that bagging’s training time stays nearly constant regardless of how many models you add to the ensemble, as long as you have the hardware to run them in parallel. On a chart, its training time looks like a flat horizontal line.

Boosting is inherently sequential. Each model needs to see the errors from the previous one before it can be built, so you can’t parallelize the core training loop. Training time increases linearly as you add more models. If 50 models take 10 minutes, 200 models take roughly 40 minutes. Modern implementations like XGBoost and LightGBM use clever engineering tricks to speed up individual steps, but the sequential bottleneck remains.

When to Use Which

The choice comes down to diagnosing your problem. If your model is already complex and powerful but gives inconsistent results on new data (high variance, overfitting), bagging is the better tool. It stabilizes predictions without adding complexity. Random Forest is the go-to choice here and works reliably across a wide range of problems with minimal tuning.

If your model is too simple and consistently underperforms (high bias, underfitting), boosting is the way to go. It can coax strong performance out of very weak learners by iteratively correcting errors. This is why gradient boosting methods dominate competitions on structured, tabular data: they squeeze out performance that simpler approaches leave on the table.

A few other practical considerations matter:

  • Noisy data. Bagging handles noise better because it doesn’t amplify individual data points. Boosting can latch onto mislabeled examples and degrade.
  • Tuning effort. Bagging methods like Random Forest are relatively forgiving with default settings. Boosting algorithms like XGBoost have many hyperparameters (learning rate, tree depth, regularization) and require more careful tuning to avoid overfitting.
  • Training resources. If you need fast training on a single machine, bagging’s parallelism helps. If you need the absolute best accuracy and can afford the time, boosting often wins.

A Quick Side-by-Side

  • Training approach. Bagging trains models in parallel on random subsets. Boosting trains models sequentially, each correcting the last.
  • Base models. Bagging works best with complex models like deep decision trees. Boosting works best with simple models like shallow trees or stumps.
  • Error reduction. Bagging reduces variance. Boosting reduces bias and variance.
  • Overfitting. Bagging is resistant. Boosting is prone to it, especially with noise.
  • Popular algorithms. Bagging: Random Forest. Boosting: AdaBoost, XGBoost, LightGBM, gradient boosting.
  • Computation. Bagging scales flat with parallelization. Boosting scales linearly with ensemble size.

In practice, many data scientists try both. Bagging offers a reliable, low-maintenance baseline, while boosting offers higher potential accuracy at the cost of more tuning and greater sensitivity to data quality. Neither is universally better. The right choice depends on whether your problem is driven by variance or bias, how clean your data is, and how much time you have to experiment.