What Is Gradient Boosting and How Does It Work?

Gradient boosting is a machine learning technique that builds a strong predictive model by combining many small, simple models (usually decision trees) in sequence, with each new model trained specifically to correct the errors left behind by the ones before it. It’s one of the most powerful and widely used algorithms for structured data, frequently outperforming deep learning on the kinds of spreadsheet-style datasets common in business, finance, and science.

How Gradient Boosting Works

The core idea is surprisingly intuitive. Imagine you guess someone’s house price and you’re off by $40,000. Instead of starting over, you train a second model whose entire job is to predict that $40,000 gap. If the second model gets within $5,000, you train a third model to close that remaining $5,000 gap. Each round chips away at the error left by everything before it.
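The house-price intuition can be written out as plain arithmetic. In this toy sketch each "model" is just a constant correction standing in for a real decision tree; the specific dollar amounts are illustrative:

```python
# Toy illustration of sequential error correction: each "model" here is
# just a constant correction, standing in for a real decision tree.
true_price = 300_000

prediction = 260_000                          # first guess: off by $40,000
residual_1 = true_price - prediction          # 40_000

correction_1 = 35_000                         # second model predicts most of that gap
prediction += correction_1
residual_2 = true_price - prediction          # 5_000 still left

correction_2 = 4_500                          # third model closes most of what remains
prediction += correction_2

print(prediction)                             # 299_500
print(true_price - prediction)                # 500: each round chips away at the error
```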

More precisely, each new model is trained on the residual errors of the combined prediction so far. For the common squared-error loss, these residuals are exactly the negative gradient of the loss function (a mathematical measure of how wrong your predictions are); other losses yield analogous "pseudo-residuals." By fitting each new model to these gradients, the algorithm is performing a kind of gradient descent, not by adjusting numerical weights like a neural network does, but by adding entirely new models to the ensemble. Researchers describe this as gradient descent in “function space” rather than “parameter space,” which is what makes gradient boosting fundamentally different from most other learning algorithms.
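The residual–gradient connection is easy to verify numerically. For squared-error loss L = ½(y − F)², the derivative with respect to the prediction F is −(y − F), so the negative gradient is the plain residual y − F. A quick finite-difference check (NumPy assumed; the numbers are arbitrary):

```python
import numpy as np

def loss(F, y):
    """Squared-error loss, summed over examples."""
    return 0.5 * np.sum((y - F) ** 2)

y = np.array([3.0, -1.0, 2.5])   # true targets
F = np.array([2.0, 0.5, 2.5])    # current ensemble predictions

# Analytic residuals: what the next tree would be trained on.
residuals = y - F

# Numerical NEGATIVE gradient of the loss w.r.t. each prediction.
eps = 1e-6
neg_grad = np.empty_like(F)
for i in range(len(F)):
    bump = np.zeros_like(F)
    bump[i] = eps
    neg_grad[i] = -(loss(F + bump, y) - loss(F - bump, y)) / (2 * eps)

print(residuals)               # [ 1.  -1.5  0. ]
print(np.round(neg_grad, 6))   # matches the residuals
```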

The individual models in the sequence are intentionally weak. They’re typically decision trees limited to just a few levels of depth (a default of 3 is common). A single shallow tree is a poor predictor on its own, but hundreds of them stacked together, each one focused on what the last one got wrong, produce a highly accurate combined model.
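The whole loop fits in a few lines. Here is a minimal from-scratch sketch using scikit-learn's `DecisionTreeRegressor` as the weak learner (an assumption; any shallow regression tree would do) on a synthetic one-dimensional dataset:

```python
# Minimal gradient boosting for regression: shallow trees fit sequentially
# to the residuals of the ensemble so far. Dataset and settings are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
trees = []
prediction = np.full_like(y, y.mean())            # start from the mean

for _ in range(100):
    residuals = y - prediction                    # negative gradient for squared loss
    tree = DecisionTreeRegressor(max_depth=3)     # intentionally weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X) # add a small correction
    trees.append(tree)

mse = np.mean((y - prediction) ** 2)
print(f"training MSE after 100 rounds: {mse:.4f}")
```

Each tree alone is a poor predictor, but the sum of a hundred small corrections tracks the sine curve closely.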

How It Differs From Other Ensemble Methods

Gradient boosting belongs to a family called ensemble methods, which combine multiple models to get better results than any single model could achieve. But the way it builds that ensemble is distinct.

A random forest, for example, trains many decision trees independently and in parallel, then averages their predictions. Each tree sees a random subset of the data and features, and the diversity among trees reduces overfitting. Gradient boosting takes the opposite approach: trees are built one after another, and each tree depends entirely on the output of the ones before it. This sequential process is slower to train but often more accurate.
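The contrast is visible directly in how the two sklearn estimators are used; everything below is a sketch on a synthetic dataset, not a benchmark, and the settings are illustrative:

```python
# Parallel, independent trees (random forest) vs. sequential, dependent
# trees (gradient boosting) on the same synthetic regression task.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random forest: full-depth trees grown independently, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gradient boosting: shallow trees grown one after another, each fit to the
# residuals of the ensemble built so far.
gb = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

rf_r2 = r2_score(y_te, rf.predict(X_te))
gb_r2 = r2_score(y_te, gb.predict(X_te))
print(f"random forest R²:     {rf_r2:.3f}")
print(f"gradient boosting R²: {gb_r2:.3f}")
```

Which method wins depends on the dataset; the point is that the forest's trees could be trained in any order, while the boosted trees cannot.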

An older boosting method called AdaBoost works sequentially too, but uses a different correction mechanism. AdaBoost adjusts the weights on data points, forcing each new model to pay more attention to the examples the previous model misclassified. Gradient boosting instead fits each new model directly to the errors themselves. This shift from reweighting data to modeling residuals is what makes gradient boosting more flexible, because it can be adapted to virtually any type of prediction problem just by swapping out the loss function.
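Swapping the loss function changes only the pseudo-residuals each new tree is trained on. For example, switching from squared error to absolute error replaces the raw residual with just its sign, which is what makes the absolute loss robust to outliers (NumPy assumed; values are arbitrary):

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5, 0.0])   # true targets
F = np.array([2.0,  0.5, 2.5, 1.0])   # current ensemble predictions

# Squared error: negative gradient is the raw residual y - F, so big
# misses get chased hard.
pseudo_residuals_l2 = y - F

# Absolute error: negative gradient is only the SIGN of the residual
# (taken as 0 where the residual is exactly 0), so each tree chases the
# direction of the error, not its magnitude.
pseudo_residuals_l1 = np.sign(y - F)

print(pseudo_residuals_l2)   # [ 1.  -1.5  0.  -1. ]
print(pseudo_residuals_l1)   # [ 1. -1.  0. -1.]
```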

Why It Dominates Tabular Data

Deep learning has transformed image recognition, language processing, and speech. But for tabular data (rows and columns, like a database or spreadsheet), gradient boosting remains the default recommendation. A study comparing XGBoost (a popular gradient boosting library) against several deep learning architectures found that XGBoost outperformed the neural networks across the board, including on the very datasets those deep models were originally designed to handle. XGBoost also required far less tuning to reach good performance.

Tabular data poses specific challenges that neural networks struggle with: missing values, a mix of numerical and categorical features, no spatial or sequential structure to exploit, and sparse inputs. Tree-based methods handle all of these naturally. They split data based on thresholds, so they don’t care whether a feature is a number, a category, or partially missing. This practical robustness is a major reason gradient boosting has dominated machine learning competitions and real-world applications for over a decade.

Key Hyperparameters and How They Interact

Getting good results from gradient boosting usually comes down to tuning a handful of settings that control how aggressively the model learns.

Learning rate controls how much each new tree contributes to the overall prediction. The default in most libraries is 0.1. A faster rate (like 0.5) lets the model converge quickly but risks overcorrecting, bouncing back and forth past the true value like a car fishtailing. A slower rate (like 0.01) moves more cautiously but may never reach the best answer unless you give it enough trees to work with. In benchmarks comparing these rates over 100 iterations, the fast learner achieved an R² of 0.811 while the slow learner scored just 0.495, a massive gap caused by insufficient iterations.

Number of estimators is how many trees are built in sequence. More trees generally improve accuracy because they give the model more chances to correct its errors, but each additional tree adds training time. The learning rate and number of estimators are tightly linked: a low learning rate needs many more trees. When the slow learner from that same benchmark was given 500 trees instead of 100, it caught up to the others. Typical tuning ranges run from 50 to 1,000 trees.
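The learning-rate/tree-count interaction is easy to reproduce in miniature. This sketch mirrors the shape of the benchmark described above on a synthetic dataset (scikit-learn assumed; the exact scores will differ from the cited numbers):

```python
# A low learning rate needs many more trees: with only 100 trees it
# underfits, but with 500 it catches up.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score(lr, n_trees):
    model = GradientBoostingRegressor(
        learning_rate=lr, n_estimators=n_trees, random_state=0
    ).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

slow_100 = score(0.01, 100)   # cautious steps, too few of them: underfits
slow_500 = score(0.01, 500)   # same rate, five times the trees: recovers
fast_100 = score(0.5, 100)    # aggressive steps converge quickly

print(f"lr=0.01, 100 trees: R² = {slow_100:.3f}")
print(f"lr=0.01, 500 trees: R² = {slow_500:.3f}")
print(f"lr=0.50, 100 trees: R² = {fast_100:.3f}")
```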

Tree depth limits how complex each individual tree can be. The default of 3 is deliberately shallow, and unlike with random forests, making trees much deeper often hurts performance rather than helping it. That said, modest increases (from 3 to around 10 or 13) can improve results depending on the dataset. Going too deep lets individual trees memorize noise in the training data, which undermines the whole point of using weak learners.
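A quick way to see the effect on a given dataset is to cross-validate a few depths; this is a hedged sketch on synthetic data (scikit-learn assumed), and the best depth will vary from problem to problem:

```python
# Cross-validated comparison of a few tree depths for gradient boosting.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

scores = {}
for depth in (1, 3, 8):
    model = GradientBoostingRegressor(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"max_depth={depth}: mean CV R² = {scores[depth]:.3f}")
```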

Overfitting and How to Prevent It

Unlike random forests, which are naturally resistant to overfitting, gradient boosting can memorize training data if left unchecked. This is the tradeoff for its sequential, error-chasing design. Each new tree reduces training error, but at some point, it starts fitting noise rather than real patterns.

Several regularization techniques keep this in check. Shrinkage (another name for a low learning rate) forces each tree to contribute only a small correction, requiring more trees but producing a smoother, more generalizable model. Subsampling randomly selects a fraction of the training data or features for each tree, injecting randomness similar to what makes random forests robust. Early stopping monitors performance on a held-out validation set and halts training once accuracy stops improving, preventing unnecessary trees from degrading the model. Libraries also support L1 and L2 penalties on the loss function, which discourage overly complex splits within each tree.
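Shrinkage, subsampling, and early stopping can all be combined in one estimator. Here is a sketch using scikit-learn's built-in early stopping (`validation_fraction` / `n_iter_no_change`); the dataset and settings are illustrative:

```python
# Regularized gradient boosting: shrinkage + row subsampling + early stopping.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)

# Ask for up to 1000 trees, but hold out 20% of the training data and stop
# once the validation score hasn't improved for 10 consecutive rounds.
model = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,        # shrinkage: each tree contributes a small step
    subsample=0.8,            # stochastic boosting: each tree sees 80% of rows
    validation_fraction=0.2,  # held-out set used only for the stopping check
    n_iter_no_change=10,      # patience before halting
    random_state=0,
).fit(X, y)

# n_estimators_ is how many trees were actually built before stopping;
# typically far fewer than the 1000 requested.
print(f"trees actually built: {model.n_estimators_}")
```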

Major Gradient Boosting Libraries

Three libraries dominate modern gradient boosting, each with a different design philosophy.

  • XGBoost was the first to make gradient boosting fast and scalable. It’s designed for speed and supports parallel computation within each tree-building step, making it practical for large datasets. It remains the most widely used option for both regression and classification.
  • LightGBM, developed by Microsoft, is optimized for large-scale data. It uses a technique called Gradient-based One-Side Sampling to skip data points with small gradients (meaning they’re already well-predicted), significantly reducing computation time without sacrificing accuracy. It also grows trees leaf-wise rather than level-wise, which can produce better results with fewer splits.
  • CatBoost, from Yandex, handles categorical features natively without requiring manual encoding. This makes it particularly convenient for datasets with many text-based or label-based columns.

All three implement the same underlying principle of sequential error correction through gradient-fitted trees. The differences lie in how they optimize the process for speed, memory, and specific data types. For most tabular prediction problems, any of the three will typically outperform a neural network with less effort, and choosing among them often comes down to the specifics of your dataset and infrastructure.