What Is an RBF Kernel and When Should You Use It?

The RBF kernel (Radial Basis Function kernel), also called the Gaussian kernel, is a function that measures the similarity between two data points based on how far apart they are. It’s the most widely used kernel in support vector machines (SVMs) and shows up across machine learning whenever you need to classify data that can’t be separated by a straight line. The core idea: points close together get a high similarity score, points far apart get a score near zero, and the result is a flexible, curved decision boundary that can wrap around complex patterns in your data.

How the RBF Kernel Works

The RBF kernel takes two data points, measures the squared distance between them, and passes that distance through a bell-shaped curve. The formula is:

K(x, x₀) = exp(−γ ‖x − x₀‖²)

Here, ‖x − x₀‖² is the squared distance between the two points, and γ (gamma) is a parameter that controls the width of the bell curve. The output is always a value between 0 and 1. If two points sit right on top of each other, the distance is zero and the kernel returns 1 (perfect similarity). As points move further apart, the output decays smoothly toward zero.
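As a quick illustration, the formula can be computed directly with NumPy (the points and gamma values below are arbitrary, chosen only to show the decay):

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - x'||^2)."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-gamma * sq_dist)

a = np.array([1.0, 2.0])
print(rbf_kernel(a, a))                     # identical points -> 1.0
print(rbf_kernel(a, [1.0, 3.0]))            # squared distance 1 -> exp(-1) ~ 0.37
print(rbf_kernel(a, [1.0, 3.0], gamma=10))  # same distance, larger gamma -> near 0
```

The last two calls use the same pair of points; only gamma changes, which is exactly the "width of the bell curve" effect described above.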

This bell-curve shape is what makes the RBF kernel so useful. Each training example radiates influence outward like a hill, and the gamma parameter controls how quickly that influence drops off. A large gamma creates a narrow, steep hill where only the nearest neighbors matter. A small gamma creates a broad, gentle hill where distant points still have influence.

What Gamma Actually Controls

Gamma is the single most important setting when using an RBF kernel, and it directly controls the bias-variance tradeoff. Think of it as the inverse of the radius of influence of each training point: the larger gamma is, the smaller that radius.

When gamma is large, each training point only influences a tiny region around itself. The model can twist and contort to fit every individual point, which sounds good but typically leads to overfitting. The decision boundary becomes jagged and erratic, memorizing the training data rather than learning general patterns. At the extreme, the area of influence shrinks so much that it only includes the training point itself, and no amount of regularization can rescue the model.

When gamma is small, each training point influences a wide area. The model becomes overly smooth and can’t capture the actual shape of the data. At the extreme, every point influences every other point equally, and the model behaves almost like a linear classifier, too constrained to pick up on meaningful structure.

The sweet spot is somewhere in between, and finding it usually requires a grid search with cross-validation. In scikit-learn, the default setting (gamma='scale') divides 1 by the number of features times the variance of the data, which serves as a reasonable starting point for many datasets.

The Infinite-Dimensional Trick

What makes the RBF kernel mathematically remarkable is that it implicitly maps your data into an infinite-dimensional space. If you expand the exponential function in the kernel formula using a Taylor series, the result can be rewritten as a dot product between two infinitely long vectors. Each original data point gets transformed into this infinite-dimensional representation where classes that were tangled together in the original space become cleanly separable.

The key insight is that you never actually compute those infinite-dimensional vectors. The kernel function gives you the dot product in that space directly, using just the simple distance calculation in the original space. This shortcut is called the “kernel trick,” and it’s why SVMs with RBF kernels can handle wildly nonlinear classification problems without the computational cost of explicitly working in high-dimensional space.
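One way to see this concretely is a sketch for one-dimensional inputs with an arbitrary gamma: truncating the Taylor expansion gives a finite feature map whose dot product approximates the kernel, and keeping more terms tightens the match.

```python
import numpy as np
from math import factorial

gamma = 0.5

def phi(x, n_terms=12):
    # Truncated version of the infinite-dimensional feature map for scalar x:
    # the k-th component is exp(-gamma*x^2) * sqrt((2*gamma)^k / k!) * x^k
    return np.array([
        np.exp(-gamma * x**2) * np.sqrt((2 * gamma) ** k / factorial(k)) * x**k
        for k in range(n_terms)
    ])

def rbf(x, x_prime):
    return np.exp(-gamma * (x - x_prime) ** 2)

x, x_prime = 0.7, -0.3
approx = phi(x) @ phi(x_prime)  # ordinary dot product in the mapped space
exact = rbf(x, x_prime)         # kernel computed directly in the original space
print(approx, exact)            # the two values agree to many decimal places
```

Twelve terms already make the truncation error negligible here; the true feature map simply never stops, which is why the kernel shortcut matters.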

When To Use an RBF Kernel

The RBF kernel is a strong default choice when you don’t know much about the structure of your data. It handles nonlinear relationships naturally, works across a wide range of domains, and has only one kernel-specific parameter to tune (gamma). It’s used extensively in pattern recognition, text classification, image retrieval, genomics, and medical imaging. In one notable application, a modified Gaussian RBF kernel was used to classify prostate cancer status from MRI, CT, and ultrasound imaging data.
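A quick comparison on a synthetic nonlinear dataset (scikit-learn's two-moons generator; the sample size and noise level are arbitrary) shows the kind of gap to expect:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaved half-circles: not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
print(scores)  # the rbf kernel should score noticeably higher on this data
```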

That said, the RBF kernel isn’t always the best option. When your data is already linearly separable, or when you have far more features than samples, a linear kernel is faster and often just as accurate. The RBF kernel also struggles when samples from different classes are extremely similar to each other, as in certain text classification tasks where the overlap between categories is high. In those scenarios, combining kernels or switching to a different approach can yield better results.

Computational Cost

Prediction time for an SVM with an RBF kernel scales as O(n_SV × d), where n_SV is the number of support vectors and d is the number of features. Every time the model classifies a new data point, it computes the kernel function between that point and every support vector. A linear kernel, by contrast, runs at O(d) regardless of how many support vectors exist.

This means RBF kernel models slow down as the dataset grows more complex and the number of support vectors increases. Pruning support vectors (removing the least important ones) speeds up predictions proportionally, since the cost scales linearly with their count. For applications where prediction speed is critical and accuracy can take a small hit, this tradeoff is worth considering.
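The support-vector count, and therefore the prediction cost, is easy to inspect after training. In this sketch (dataset and gamma values are arbitrary), an overly large gamma inflates the count:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

counts = {}
for gamma in (0.01, 1.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # n_support_ holds the per-class support-vector counts;
    # prediction cost grows linearly with their total
    counts[gamma] = int(clf.n_support_.sum())
print(counts)
```

With gamma = 1.0 on 20-dimensional data, nearly every training point ends up as a support vector, so each prediction pays for a kernel evaluation against almost the whole training set.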

Feature Scaling Is Not Optional

Because the RBF kernel measures the distance between data points, features with large numeric ranges will dominate features with small ranges. If one feature varies between 0 and 1,000 while another varies between 1 and 10, the kernel will almost entirely ignore the smaller feature. The decision boundary will be shaped by whichever feature has the biggest numbers, not whichever feature is most informative.

Standardizing your features (rescaling each one to have a mean of 0 and a standard deviation of 1) fixes this problem. After scaling, all features contribute roughly equally to the distance calculation. Skipping this step doesn’t just reduce accuracy; it produces a completely different model that can look nothing like the one trained on properly scaled data.
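A small sketch of the failure mode (the two synthetic features here are invented for illustration): the label depends only on a small-scale feature, while a huge-scale noise feature swamps the distance calculation unless the data is standardized first.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
signal = rng.normal(size=400)             # small range, fully determines the label
noise = rng.normal(scale=1000, size=400)  # huge range, carries no information
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = SVC(kernel="rbf").fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)
print(raw.score(X_test, y_test))     # close to chance: distances are all noise
print(scaled.score(X_test, y_test))  # near perfect after standardization
```

Wrapping the scaler and the SVM in a pipeline also guarantees the same scaling is applied at prediction time.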

Avoiding Overfitting and Underfitting

The RBF kernel has two parameters that interact with each other: gamma (the kernel width) and C (the regularization strength of the SVM itself). Getting both wrong is easy, and the symptoms are predictable.

Overfitting shows up when gamma is too high, C is too high, or both. The model achieves near-perfect accuracy on training data but performs poorly on new data. Visually, the decision boundary looks like it’s wrapping tightly around individual clusters of points, creating islands of classification rather than smooth regions. The classic diagnostic: a large gap between training accuracy and validation accuracy.

Underfitting happens when gamma is too low, C is too low, or both. The model produces an overly simple boundary that misclassifies many training examples. It performs poorly on both training and test data because it can’t capture the complexity of the underlying pattern.

The standard approach is to run a grid search over a range of gamma and C values using cross-validation. Logarithmic scales work well for both parameters. Testing gamma values from 0.001 to 1000 and C values across a similar range will typically bracket the optimal combination.
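That search can be sketched with scikit-learn's GridSearchCV (the grid bounds and dataset are arbitrary here; scaling sits inside the pipeline so each cross-validation fold is standardized independently):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": np.logspace(-3, 3, 7),      # 0.001 ... 1000, logarithmic steps
    "svc__gamma": np.logspace(-3, 3, 7),  # 0.001 ... 1000, logarithmic steps
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

A common refinement is to run a second, finer grid centered on the best coarse values once the first pass has bracketed them.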