What Is a Regression Tree and How Does It Work?

A regression tree is a type of machine learning model that predicts a numerical value by splitting data into smaller and smaller groups based on a series of yes-or-no questions. Think of it like a flowchart: at each step, the model asks a question about one feature of the data, sends each data point down the appropriate branch, and ultimately lands on a prediction. That prediction is a continuous number (like a price, a temperature, or a cost) rather than a category, which is what distinguishes a regression tree from its close cousin, the classification tree.

How a Regression Tree Makes Predictions

The core idea is surprisingly simple. The algorithm looks at your data and finds the single question that best divides it into two groups where the outcomes within each group are as similar as possible. Then it repeats the process within each group, splitting again and again until it reaches a stopping point. The result is a tree-shaped structure where every endpoint, called a leaf node, contains a cluster of similar data points.

When the tree needs to predict a value for a new observation, it routes that observation down through the questions until it lands in a leaf. The prediction is the average of all the training data points that ended up in that same leaf. For example, if a leaf contains 30 houses that sold for prices ranging from $280,000 to $320,000, the tree would predict the average of those 30 prices for any new house that matches the same profile.
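The leaf-averaging idea can be sketched in a few lines. The example below uses scikit-learn (an assumption, since no library is named above) with made-up square footages and prices: a depth-one tree makes a single split, and a new house's prediction is the mean price of the training houses in its leaf.

```python
# Sketch: a leaf's prediction is the mean of the training targets that
# landed in it. The feature and prices are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one feature (square footage), target is sale price.
X = np.array([[900], [1000], [1100], [2400], [2500], [2600]])
y = np.array([280_000, 300_000, 320_000, 580_000, 600_000, 620_000])

# A depth-1 tree asks a single question, producing two leaves.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# A new small house routes to the low leaf; its prediction is the mean
# of the three low prices: (280k + 300k + 320k) / 3 = 300k.
print(tree.predict([[950]])[0])  # 300000.0
```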

The algorithm’s goal during training is to minimize the total prediction error across all leaves. Specifically, it tries to reduce the sum of squared differences between each data point’s actual value and the average value in its leaf. At every potential split, it tests every possible question on every available feature and picks the one that reduces this error the most.
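The split search described above can be written out directly. This is a minimal single-feature sketch, not a full implementation: it tries every candidate threshold and keeps the one that minimizes the summed squared error of the two groups around their means.

```python
# Sketch of the split search: test every threshold on one feature and
# keep the one that most reduces the sum of squared errors.
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split minimizing the summed
    squared error of the two resulting groups around their means."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    # Candidate thresholds sit midway between consecutive distinct values.
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        thr = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = (thr, sse)
    return best

x = np.array([900.0, 1000, 1100, 2400, 2500, 2600])
y = np.array([280.0, 300, 320, 580, 600, 620])  # prices in $1000s
thr, sse = best_split(x, y)
print(thr)  # 1750.0 -- the gap between the cheap and expensive houses
```

A real tree runs this search over every feature at every node and recurses on the two resulting groups.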

Why Regression Trees Handle Complexity Well

One of the biggest advantages of regression trees is that they don’t force you to assume anything about the shape of the relationship between your inputs and your outcome. A traditional linear regression assumes the relationship is, well, linear. If it’s not, you have to manually add squared terms, interaction terms, or other adjustments. A regression tree doesn’t need any of that. Discontinuous relationships and nonlinear patterns are naturally accommodated by the splitting process.

Interactions between variables are also captured automatically. Imagine you’re predicting student test scores using prior scores and socioeconomic status. A regression tree might first split on prior scores (above or below 50), and then split on socioeconomic status only for the group with higher prior scores. That means the model has detected that socioeconomic status only matters in certain contexts, without you needing to specify that interaction ahead of time. In a linear model, you’d need to know to include that interaction term before fitting the model.
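The interaction example above can be made concrete with fabricated scores (the numbers here are invented purely to produce the described pattern): socioeconomic status shifts outcomes only for students with high prior scores, and the fitted tree discovers that on its own.

```python
# Sketch of the interaction example, with made-up data: SES only
# matters when the prior score is high. The tree finds this itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Columns: prior score, SES indicator (0 = low, 1 = high).
X = np.array([[30, 0], [35, 1], [40, 0], [45, 1],
              [60, 0], [65, 1], [70, 0], [75, 1]])
y = np.array([40, 40, 40, 40, 60, 80, 60, 80])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["prior", "ses"]))
# The printed rules show a first split on "prior"; "ses" appears only
# inside the high-prior branch, mirroring the interaction.
```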

Regression trees also handle categorical variables (like “region” or “product type”) and missing data with relative ease. For missing values, the original algorithm developed by Leo Breiman uses a technique called surrogate splits. For each question in the tree, the algorithm identifies backup variables that produce similar splits. If a data point is missing the value for the primary question, the tree uses the best available backup variable instead. If every backup is also missing, the data point simply goes to whichever branch contains the majority of observations.

The Overfitting Problem

Left unchecked, a regression tree will keep splitting until every leaf contains just one or two data points. This creates a model that memorizes the training data perfectly but performs poorly on new data. A tree with zero restrictions can fit its training set exactly, with every leaf matching its few points, yet explain only, say, 88% of the variation in a test set. This gap between training and real-world performance is the hallmark of overfitting.

The most common fix is called cost complexity pruning. The idea is to grow the full tree first, then work backward, removing branches that don’t improve predictions enough to justify the added complexity. This process is controlled by a single tuning parameter, often called alpha. When alpha is zero, the tree stays fully grown. As alpha increases, more branches get cut, producing a simpler tree that generalizes better to new data. You typically test several alpha values and pick the one that performs best on a held-out validation set.
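The grow-then-prune workflow maps directly onto scikit-learn's cost complexity pruning support (the library choice and the synthetic data are assumptions for illustration): compute the pruning path, fit one tree per alpha, and keep the alpha that scores best on held-out data.

```python
# Sketch of cost complexity pruning: grow candidate trees along the
# pruning path and pick alpha on a held-out validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy signal

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# alpha = 0 keeps the full tree; the path lists the alphas at which
# successive branches would be cut.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)  # R^2 on the held-out split
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen alpha: {best_alpha:.4f}")
```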

Other stopping rules can also prevent overgrowth: setting a minimum number of data points required in each leaf, limiting the total depth of the tree, or requiring that each split improve the prediction error by at least a certain amount.
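Those three stopping rules correspond directly to tree-growing parameters. A sketch using scikit-learn's names (the specific values are arbitrary examples, not recommendations):

```python
# Sketch: the stopping rules above expressed as scikit-learn parameters.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    min_samples_leaf=10,         # at least 10 training points per leaf
    max_depth=5,                 # at most 5 questions deep
    min_impurity_decrease=0.01,  # each split must cut error by this much
)

# Fit on some synthetic data to see the rules take effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
tree.fit(X, y)
print(tree.get_depth())  # never exceeds 5
```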

How Regression Trees Differ From Classification Trees

The structure is identical. Both use the same recursive splitting process and produce the same flowchart-like output. The difference is in what they predict and how they measure the quality of a split. A classification tree predicts a category (spam or not spam, benign or malignant) and measures split quality by how well it separates the classes. A regression tree predicts a number and measures split quality by how much it reduces the squared prediction error. At the leaf level, a classification tree returns the most common category among its training points, while a regression tree returns the average value.
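The leaf-level difference is easy to see side by side. In this toy sketch (data invented for illustration), the same inputs get a numeric target for the regressor and a categorical one for the classifier: one leaf averages, the other takes a majority vote.

```python
# Sketch of the leaf-level difference: the regressor's leaf returns the
# mean target, the classifier's leaf returns the most common class.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y_num = np.array([5.0, 6.0, 7.0, 50.0, 51.0, 52.0])
y_cat = np.array(["low", "low", "low", "high", "high", "high"])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_num)
clf = DecisionTreeClassifier(max_depth=1).fit(X, y_cat)

print(reg.predict([[2.5]])[0])  # 6.0  (mean of 5, 6, 7)
print(clf.predict([[2.5]])[0])  # low  (majority class in that leaf)
```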

Where Regression Trees Are Used

Regression trees show up wherever you need to predict a number and want a model that’s easy to interpret. In healthcare, they’ve been used to predict total health care costs per patient based on diagnostic categories and demographics, and to estimate the cost of inpatient rehabilitation using age, type of impairment, and measures of motor and cognitive functioning. The tree structure makes it straightforward to see which factors drive the prediction, which is valuable when decisions need to be explained to stakeholders or patients.

In practice, single regression trees are often outperformed by ensemble methods that combine many trees together. Random forests build hundreds of trees on randomly sampled subsets of the data and average their predictions. Gradient boosting builds trees sequentially, with each new tree correcting the errors of the previous ones. Both techniques dramatically improve prediction accuracy while sacrificing some of the interpretability that makes a single tree appealing. A single regression tree remains the best choice when you need a transparent, explainable model, or when you’re exploring a dataset to understand which variables matter and how they interact.
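A quick comparison of the three approaches can be sketched on synthetic data (scikit-learn and the toy problem are assumptions; the size of the gap will vary by dataset, though the ensembles usually win).

```python
# Sketch comparing a single tree with the two ensembles named above,
# on a toy nonlinear problem.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    # R^2 on held-out data; higher is better.
    print(name, round(model.fit(X_tr, y_tr).score(X_te, y_te), 3))
```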

Limitations to Keep in Mind

Regression trees predict in steps. Because each leaf returns a single average value, the model’s output looks like a staircase rather than a smooth curve. This means a regression tree can’t extrapolate beyond the range of its training data, and it can produce abrupt jumps in predictions for data points that are very similar but fall on opposite sides of a split boundary.
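The extrapolation limit is easy to demonstrate: train on a simple linear trend, then ask for a prediction far outside the training range. The tree can only repeat the value of its edge leaf (scikit-learn and the toy trend are illustrative assumptions).

```python
# Sketch of the extrapolation limit: beyond the training range the tree
# just repeats the value of the nearest edge leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)  # training inputs in [0, 10)
y = 2 * X[:, 0]                           # a simple linear trend

tree = DecisionTreeRegressor().fit(X, y)

# Inside the range, the staircase tracks the trend exactly...
print(tree.predict([[4.0]])[0])    # 8.0
# ...but far outside it, the prediction is stuck at the last leaf.
print(tree.predict([[100.0]])[0])  # 19.0, not 200.0
```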

Single trees are also sensitive to small changes in the data. Remove a few data points or add some noise, and the first split might change entirely, cascading into a completely different tree structure. This instability is one of the main motivations for ensemble methods, which smooth out this variability by averaging across many trees. If you’re using a single regression tree, pruning helps reduce this sensitivity, but it doesn’t eliminate it.
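The instability point can be probed directly: fit the same tree on the full dataset and on a version with a handful of rows removed, then compare the root node's threshold, which is the tree's very first question (scikit-learn internals used here are illustrative; how much the trees diverge depends on the data).

```python
# Sketch of instability: dropping a few rows can move the first split,
# and everything below it changes with the split.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=60)

full = DecisionTreeRegressor(random_state=0).fit(X, y)
dropped = DecisionTreeRegressor(random_state=0).fit(X[5:], y[5:])

# tree_.threshold[0] is the cut point of the root node's question.
print(full.tree_.threshold[0], dropped.tree_.threshold[0])
```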