Interpreting Random Forest Results: Key Methods Explained

A random forest gives you more than a single accuracy score. It produces feature importance rankings, error estimates, and prediction outputs that each tell you something different about your data and your model. Interpreting these results well means knowing what each output actually measures, where it can mislead you, and which tools to reach for when the default outputs aren’t enough.

Start With Model Performance

Before interpreting what your model learned, confirm it learned something useful. Random forests have a built-in performance estimate called the out-of-bag (OOB) error. Each tree in the forest is trained on a bootstrap sample, a random draw with replacement from your training data. Roughly one-third of your observations get left out of any given tree. The model uses those left-out samples to test each tree’s predictions, then aggregates the results into an overall error rate.

The OOB error is often treated as an unbiased estimate of how well your model will perform on new data, functioning like a free cross-validation. That’s mostly true, but research in PLOS One has shown it can actually be slightly pessimistic: it sometimes overestimates your error rate, meaning your model may perform a bit better on new data than the OOB score suggests. This happens because each OOB prediction only uses a subset of trees (the ones that didn’t see that observation), so it’s working with a weaker version of the full forest.

For classification, look at the OOB error rate alongside a confusion matrix to see which classes the model struggles with. For regression, check the OOB R-squared or mean squared error. If these numbers are poor, the interpretation steps below still apply, but you should treat any patterns the model found with skepticism.
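The OOB check above can be done in a few lines. This is a minimal sketch assuming scikit-learn, with a synthetic dataset standing in for your own:

```python
# Sketch: checking OOB performance before interpreting the model.
# Assumes scikit-learn; the dataset here is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True makes each tree get evaluated on the samples it never saw
model = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
model.fit(X, y)

print(f"OOB accuracy: {model.oob_score_:.3f}")  # 1 - OOB error rate

# OOB class-probability estimates, one row per training sample
oob_pred = model.oob_decision_function_.argmax(axis=1)
print(confusion_matrix(y, oob_pred))  # shows which classes get confused
```

Note that `oob_score_` reports accuracy, so the OOB error rate is one minus this value.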

Feature Importance: What It Measures

The most common output people look at is the feature importance ranking, a bar chart showing which input variables mattered most. There are two main flavors, and they answer slightly different questions.

Impurity-based importance (sometimes called Gini importance for classification or variance reduction for regression) measures how much each feature contributed to splitting the data cleanly across all the trees. Every time a feature is used to make a split, the model records how much that split reduced the “messiness” of the data. These reductions get summed up across the entire forest. Features that appear in many splits and produce large improvements rank highest.

Permutation importance takes a different approach. It randomly shuffles one feature’s values across all observations, breaking the relationship between that feature and the outcome, then measures how much the model’s accuracy drops. A large drop means the model depended heavily on that feature. A small drop means the model could get by without it.

The practical difference matters. Impurity-based importance is fast to compute since it’s calculated during training, but it has a known bias: it tends to inflate the importance of features with many unique values. A continuous variable with thousands of distinct values will look more important than a binary yes/no variable, even if the binary variable is a better predictor. Permutation importance avoids this bias because it directly measures predictive contribution. When in doubt, use permutation importance as your primary ranking and treat impurity-based importance as a quick first look.
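Both flavors are easy to compute side by side. A sketch, again assuming scikit-learn and a synthetic dataset:

```python
# Sketch: comparing impurity-based and permutation importance.
# Assumes scikit-learn; dataset and feature indices are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

# Impurity-based: computed during training, summed split improvements
impurity_imp = model.feature_importances_

# Permutation: shuffle each feature on held-out data, measure the accuracy drop
perm = permutation_importance(model, X_test, y_test, n_repeats=10,
                              random_state=0)
perm_imp = perm.importances_mean

for i in np.argsort(perm_imp)[::-1]:
    print(f"feature {i}: impurity={impurity_imp[i]:.3f}  "
          f"permutation={perm_imp[i]:.3f}")
```

Computing permutation importance on a held-out set, as here, also guards against the forest's tendency to look good on the data it was trained on.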

Why Your Rankings Might Be Unstable

If you train the same random forest twice with a different random seed, you might get a different feature importance ranking. This isn’t a bug. It’s a consequence of randomness in the algorithm, and it’s more pronounced than most people expect.

Research published in BMC Bioinformatics found that with the common default of 500 trees, importance rankings can be remarkably unstable. In one dataset, repeating the model ten times with 500 trees produced a variable importance stability score of just 0.018 on a 0-to-1 scale, and not a single variable was selected consistently across all ten runs. Increasing to 137,000 trees brought stability up to 0.845. In another dataset, 500 trees yielded a stability of 0.612, which only reached 0.983 at 17,000 trees.

The fix is straightforward: increase the number of trees until importance rankings stabilize. A common starting rule is to set the number of trees to ten times the number of variables, then increase from there until the error rate and importance scores stop changing. The optimal number depends on your dataset’s size and complexity, not just the number of features. If you’re publishing results or making decisions based on feature rankings, run the model multiple times and check whether the top features stay consistent.
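One way to run that consistency check is to retrain with several seeds and intersect the top-ranked sets. A sketch, with the number of runs and the cutoff `k` as illustrative choices:

```python
# Sketch: checking whether top-ranked features are stable across random seeds.
# Retrains the forest several times and intersects the top-k feature sets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

k = 5
top_sets = []
for seed in range(5):  # five runs, each with a different seed
    model = RandomForestClassifier(n_estimators=500, random_state=seed)
    model.fit(X, y)
    top_sets.append(set(np.argsort(model.feature_importances_)[-k:]))

# Features that make the top k in every run are the ones you can trust
stable = set.intersection(*top_sets)
print(f"features in the top {k} on all runs: {sorted(stable)}")
```

If the stable set is much smaller than `k`, increase the number of trees and repeat until it stops shrinking.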

How Features Shape Predictions

Importance rankings tell you which features matter, but not how they matter. A feature could have a linear relationship with the outcome, a threshold effect, or a U-shaped curve. To see the shape of the relationship, you need partial dependence plots.

A partial dependence plot (PDP) shows the average predicted outcome as one feature varies across its range, while all other features are held at their observed values. If you’re predicting house prices and want to understand the effect of square footage, the PDP sweeps square footage from its minimum to maximum and plots the model’s average prediction at each point. A steadily rising line means bigger houses predict higher prices. A flat line that suddenly shoots up at 2,000 square feet means the model found a threshold effect.
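The sweep just described is simple enough to compute by hand, which makes the mechanism concrete. A sketch using a scikit-learn regressor on synthetic data:

```python
# Sketch: computing a one-feature partial dependence curve by hand.
# For each grid value, force every observation's feature to that value
# and average the model's predictions. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

feature = 0  # index of the feature to sweep (e.g. square footage)
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

pdp = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value                 # pin the feature everywhere
    pdp.append(model.predict(X_mod).mean())   # average over all observations

for v, p in zip(grid, pdp):
    print(f"{v:8.2f} -> {p:8.2f}")
```

Plotting `pdp` against `grid` gives the curve; scikit-learn's `PartialDependenceDisplay` wraps the same computation with plotting included.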

You can also create two-feature PDPs that show how a pair of features interact, displayed as a heatmap or 3D surface. These are useful for spotting interactions (cases where the effect of one feature depends on the value of another), but they get hard to read beyond two features.

One important limitation: PDPs assume that the feature you’re examining is independent of the other features. If two features are highly correlated (say, income and education level), the plot may show predictions for combinations that don’t actually exist in your data, like someone with a graduate degree earning minimum wage. This produces misleading curves. When your features are correlated, accumulated local effects (ALE) plots are a better alternative. They restrict the analysis to actual observed data ranges and avoid extrapolating into unrealistic combinations.

Explaining Individual Predictions With SHAP

Feature importance and partial dependence plots describe the model as a whole. But you’ll often need to explain a specific prediction: why did the model flag this particular loan application as high risk, or predict this patient’s outcome as poor?

SHAP values decompose any single prediction into the contribution of each feature. The core idea is simple. Your model has an average prediction across all the training data. For any individual observation, the actual prediction differs from that average. SHAP values tell you exactly how much each feature pushed the prediction above or below the average. They satisfy a clean mathematical property: the sum of all SHAP values for a given observation, plus the average prediction, equals that observation’s actual predicted value.

A positive SHAP value for a feature means it pushed the prediction higher for that specific case. A negative value means it pushed it lower. If someone’s predicted house price is $50,000 above average and the SHAP value for “pool” is +$15,000, the pool feature alone accounts for $15,000 of that difference.
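The additivity property can be verified directly with a brute-force Shapley computation on a tiny stand-in model. In practice you would use the `shap` library's tree explainer rather than this exhaustive enumeration, and the model function and data below are invented for illustration, but the sketch shows exactly what the decomposition means:

```python
# Sketch: brute-force Shapley values for a tiny black-box model, to
# illustrate additivity (SHAP values + average prediction = prediction).
# Real tooling (the shap library) is far faster; this enumerates every
# feature coalition, which is only feasible for a handful of features.
from itertools import combinations
from math import factorial

import numpy as np

def model(X):
    # Stand-in for a fitted forest's predict(); any black-box function works
    return 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))  # data the averages are taken over
x = np.array([1.0, 2.0, -1.0])          # the single prediction to explain

def value(subset):
    # Expected prediction when features in `subset` are fixed to x's values
    # and the remaining features vary over the background data
    X_mod = background.copy()
    for j in subset:
        X_mod[:, j] = x[j]
    return model(X_mod).mean()

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi[i] += weight * (value(S + (i,)) - value(S))

base = model(background).mean()   # the model's average prediction
pred = model(x[None, :])[0]       # the actual prediction for x
print(phi, base, pred)
```

After running this, `phi.sum() + base` equals `pred` to floating-point precision, which is the additivity property stated above.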

SHAP also bridges local and global interpretation. By plotting SHAP values for every observation in your dataset, you get a “beeswarm” plot: each row corresponds to one feature, each dot to one observation, and a dot’s position on the horizontal axis is that feature’s SHAP value for that case. Color typically represents the feature’s actual value (high or low). This lets you see, at a glance, both which features are important globally and the direction and magnitude of their effects across your entire dataset.

Using the Proximity Matrix

Random forests produce a less well-known output called the proximity matrix, which measures how similar any two observations are from the model’s perspective. It works by tracking how often two observations land in the same terminal node (the endpoint of a decision path) across all the trees in the forest. Two observations that consistently end up in the same leaf nodes are “close” in the model’s view of the world, meaning they share the feature patterns the model considers important.

This is useful for two things. First, outlier detection: observations that rarely share leaf nodes with other observations are isolated in the model’s feature space and may be anomalies worth investigating. Second, clustering: you can feed the proximity matrix into standard clustering algorithms to find groups of similar observations, with the advantage that “similarity” is defined by the patterns the model actually uses rather than raw Euclidean distance.
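In scikit-learn the proximity matrix isn't exposed directly, but it can be built from the leaf assignments that `apply` returns. A sketch, with synthetic data:

```python
# Sketch: building a proximity matrix from leaf co-occurrence.
# model.apply(X) returns, for each sample, the leaf it lands in per tree;
# proximity[i, j] is the fraction of trees where samples i and j share a leaf.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = model.apply(X)  # shape (n_samples, n_trees)

# Compare every pair of samples across every tree, then average over trees
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Outlier candidates: samples whose average proximity to everyone else is low
avg_prox = (proximity.sum(axis=1) - 1.0) / (len(X) - 1)  # exclude self
print("most isolated samples:", np.argsort(avg_prox)[:5])
```

For clustering, `1 - proximity` can be passed to any algorithm that accepts a precomputed distance matrix. Note the broadcasting comparison builds an array of shape (n_samples, n_samples, n_trees), so for large datasets a loop over trees is kinder to memory.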

Inspecting Individual Trees

Sometimes the most intuitive interpretation comes from looking at actual decision paths. While you can’t practically examine all 500 or 5,000 trees, pulling out a handful and visualizing them can reveal the splitting logic the forest relies on. Most implementations let you export individual trees as flowchart-style diagrams showing which feature is tested at each node, what the threshold is, and how many observations flow down each branch.

This is especially useful early in a project when you’re building intuition about what the model is doing, or when you need to explain the model’s logic to stakeholders who find bar charts of importance scores unconvincing. A single tree won’t represent the full forest’s behavior, but it shows the types of rules the forest is constructing. If the first split in most trees is on the same feature, that’s a strong signal about what drives predictions.
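Both ideas above, dumping one tree and tallying first splits across the forest, are a few lines in scikit-learn. A sketch with made-up feature names:

```python
# Sketch: exporting one tree as text and tallying root-split features
# across the forest. Assumes scikit-learn; feature names are invented.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
names = [f"feature_{i}" for i in range(5)]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Flowchart-style text dump of a single tree's rules (truncated to depth 2)
print(export_text(model.estimators_[0], feature_names=names, max_depth=2))

# Which feature does each tree split on first? A lopsided tally is a
# strong signal about what drives predictions.
roots = Counter(names[t.tree_.feature[0]] for t in model.estimators_)
print(roots.most_common())
```

For graphical output, `sklearn.tree.plot_tree` draws the same structure as a diagram.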

Common Interpretation Mistakes

The biggest mistake is treating feature importance as proof of causation. A random forest tells you which features are useful for prediction, not which features cause the outcome. If your model ranks “number of fire trucks dispatched” as highly important for predicting fire damage, that doesn’t mean sending fewer trucks would reduce damage. Predictive importance reflects association, not direction of effect.

The second mistake is ignoring correlated features. When two features carry similar information, the forest splits its attention between them. Both may appear moderately important when either one alone would rank highly. If you remove one, the other’s importance will jump. Before concluding a feature doesn’t matter, check whether it’s correlated with a feature that does.

Third, don’t over-interpret small differences in importance scores. If feature A scores 0.152 and feature B scores 0.148, they’re effectively tied. Given the instability issues discussed earlier, those positions could easily swap on the next run. Focus on the tiers (which features form the top group, the middle group, and the bottom) rather than exact rankings.