A kernel in a support vector machine (SVM) is a function that measures similarity between two data points, allowing the algorithm to find patterns in data that isn't separable by a straight line. Instead of literally transforming your data into a higher-dimensional space (which would be computationally expensive), the kernel directly computes the similarity the two points would have after that transformation, skipping the heavy lifting entirely. This shortcut is known as the kernel trick, and it's the core idea that makes SVMs powerful for complex, non-linear classification problems.
Why SVMs Need Kernels
At its simplest, an SVM draws a line (or a flat plane, in higher dimensions) that separates two classes of data with the widest possible margin. This works beautifully when the data is already neatly separable. But real-world data rarely cooperates. Imagine two classes of points arranged in concentric circles: no straight line can separate them.
The classic solution is to project the data into a higher-dimensional space where a flat separator does exist. A point described by two features might get mapped into a space with dozens or hundreds of features, where the circular boundary becomes a flat plane. The problem is that computing in these enormous spaces is expensive. If you have d original features and map to all possible combinations, you end up with 2ᵈ terms to calculate for every pair of data points.
Kernels solve this by computing the similarity between two points as if they were already in that high-dimensional space, without ever building the full transformed vectors. A sum of 2ᵈ terms collapses into a product of just d terms. The computation drops from exponential to linear, and the SVM learns the same decision boundary it would have found in the expanded space.
How the Kernel Trick Works
SVMs don’t actually need the transformed data points themselves. During training, the algorithm only ever uses dot products between pairs of data points. A dot product is a single number that captures how similar two vectors are. The kernel function takes two original data points and returns the dot product their transformed versions would have produced, without performing the transformation.
Formally, a kernel function K(x, z) equals the dot product of the transformed points: φ(x)ᵀφ(z). The SVM never computes φ(x) or φ(z) individually. It just calls the kernel function whenever it needs a similarity score between two points. This means you can work in an effectively infinite-dimensional space at minimal computational cost.
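The identity K(x, z) = φ(x)ᵀφ(z) is easy to verify numerically. Here is a minimal sketch (using NumPy) for the degree-2 polynomial kernel K(x, z) = (xᵀz)², whose implicit feature map φ contains all d² ordered products of the original features:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: all ordered products x_i * x_j.
    # For d original features, this builds a d**2-dimensional vector.
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    # Degree-2 polynomial kernel, computed entirely in the original
    # d-dimensional space: just one dot product, then a square.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the expanded space
trick = poly_kernel(x, z)          # same number, no expansion needed
print(explicit, trick)             # both equal (1*4 + 2*5 + 3*6)**2 = 1024
```

The two routes produce the same similarity score, but the kernel never materializes the d²-dimensional vectors; that gap only widens as the implicit space grows.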
Not every function qualifies as a valid kernel. A function must satisfy a mathematical property called the Mercer condition, which essentially requires it to be “positive semi-definite.” In practical terms, this guarantees that the function behaves like a real dot product in some space. All the standard kernels used in practice meet this requirement.
Common Kernel Types
Most SVM implementations offer a small set of well-tested kernels, each suited to different data shapes.
- Linear kernel: Computes the plain dot product between two data points with no transformation at all. It’s the fastest option and works well when the data is already close to linearly separable, or when you have many more features than samples (common in text classification, for example).
- Polynomial kernel: Captures interactions between features up to a specified degree. A degree-2 polynomial kernel accounts for all pairwise feature combinations, a degree-3 kernel captures three-way interactions, and so on. Higher degrees fit more complex boundaries but risk overfitting.
- RBF (radial basis function) kernel: Also called the Gaussian kernel, this is the most widely used non-linear kernel and the default in scikit-learn’s SVC classifier. It maps data into an infinite-dimensional space and can model arbitrarily complex boundaries. It measures similarity based on the distance between two points: close points get a high similarity score, distant points get a score near zero.
- Sigmoid kernel: Inspired by the activation function used in neural networks. It’s less commonly used in practice than the others and doesn’t satisfy the Mercer condition for all parameter values, which can make it unreliable.
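Each of the four kernels above can be written as a short function of two vectors. The sketch below uses scikit-learn's parameter names (gamma, degree, coef0); the default values shown are illustrative, not universal:

```python
import numpy as np

def linear(x, z):
    # Plain dot product: no transformation at all.
    return np.dot(x, z)

def polynomial(x, z, degree=3, gamma=1.0, coef0=1.0):
    # Captures feature interactions up to the given degree.
    return (gamma * np.dot(x, z) + coef0) ** degree

def rbf(x, z, gamma=1.0):
    # Gaussian kernel: similarity decays with squared distance.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, gamma=1.0, coef0=0.0):
    # Neural-network-inspired; not a valid kernel for all parameters.
    return np.tanh(gamma * np.dot(x, z) + coef0)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(linear(x, z))  # 0.0 (orthogonal vectors)
print(rbf(x, z))     # exp(-2) ~ 0.135: nearby points would score near 1
```

Note how the RBF kernel depends only on the distance between the points, which is why identical points score exactly 1 and distant points approach 0.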
The RBF Kernel and Its Gamma Parameter
Because the RBF kernel is the default choice in most libraries, understanding its key parameter, gamma, is worth some attention. Gamma controls how far the influence of a single training example reaches. A low gamma value means each point’s influence extends far, producing smoother, broader decision boundaries. A high gamma value means each point only influences its immediate neighborhood, producing tighter, more complex boundaries.
When gamma is too large, the model wraps its decision boundary so closely around individual training points that it essentially memorizes the data. This is classic overfitting: great accuracy on training data, poor performance on anything new. When gamma is too small, the model becomes overly constrained and behaves almost like a linear classifier, unable to capture the true shape of the data. Tuning gamma (often alongside a regularization parameter called C) is one of the most important steps when using an RBF kernel.
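The underfit-to-overfit progression is easy to see empirically. A quick sketch, assuming scikit-learn is installed (the concentric-circles dataset and the specific gamma values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged in concentric circles: not linearly separable.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1.0, 1000.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_tr, y_tr)
    # Too small: near-linear boundary, poor everywhere.
    # Too large: memorizes training points, poor generalization.
    print(f"gamma={gamma:>7}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

Running this typically shows the tiny gamma scoring near chance on both sets, the moderate gamma scoring well on both, and the huge gamma scoring near-perfectly on training data while dropping on the test set.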
Choosing the Right Kernel
Your choice of kernel depends primarily on two things: the structure of your data and its size.
If you have a high-dimensional dataset (many features relative to the number of samples), a linear kernel is usually the best starting point. Text data converted to word-frequency vectors, for instance, often has thousands of features and is already close to linearly separable. Adding a non-linear kernel in these cases adds complexity without improving accuracy.
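The text case can be sketched in a few lines with scikit-learn. The four toy documents and their spam/ham labels below are purely illustrative; the point is that TF-IDF vectors are sparse and high-dimensional, so a linear SVM fits them well:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "cheap pills buy now",        # spam
    "meeting moved to friday",    # ham
    "win cash now click",         # spam
    "lunch at noon tomorrow",     # ham
]
labels = [1, 0, 1, 0]  # toy labels, purely illustrative

# TF-IDF turns each document into a sparse high-dimensional vector;
# LinearSVC then fits a linear separator in that space directly.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["buy cheap cash now"]))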
For lower-dimensional data with clearly non-linear relationships, the RBF kernel is a strong default. It can approximate a wide range of decision boundaries and only requires tuning gamma and C. Polynomial kernels offer more explicit control over the degree of feature interaction, but the RBF kernel generally performs just as well or better across most tasks.
Dataset size matters for a practical reason: SVM training time scales at least quadratically with the number of samples. Scikit-learn’s documentation notes that training may become impractical beyond tens of thousands of samples when using non-linear kernels. For very large datasets, linear kernels with specialized solvers, or kernel approximation methods, are more realistic choices.
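One practical route for large datasets, assuming scikit-learn is available, is kernel approximation: map the data to a modest number of approximate RBF features with the Nystroem method, then train a fast linear SVM on those features instead of an exact kernel SVM. The dataset and parameter values below are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)

# Approximate the RBF feature space with 100 components, then fit a
# linear model; training scales much better than an exact kernel SVM.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    LinearSVC(),
)
model.fit(X, y)
print(model.score(X, y))
```

The approximation trades a little accuracy for training time that stays manageable as the sample count grows.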
Specialized Kernels for Non-Numeric Data
One of the most powerful aspects of the kernel framework is that it isn’t limited to numerical vectors. As long as you can define a valid similarity function between two objects, you can use an SVM on them. This has led to specialized kernels for data types that don’t naturally fit into a spreadsheet.
String kernels measure similarity between sequences of characters or symbols. They’ve been especially influential in bioinformatics, where researchers use them to classify protein sequences, predict gene function, and detect evolutionary relationships between organisms. A string kernel might compare two DNA sequences by counting how many short subsequences they share, capturing structural similarity that would be invisible to a standard numerical kernel. Graph kernels extend the same idea to network-structured data, comparing molecules, social networks, or other objects defined by their connections.
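The subsequence-counting idea can be sketched as a "spectrum" kernel: represent each string by the counts of its length-k substrings and take the dot product of those counts. Real bioinformatics string kernels (spectrum, mismatch, gappy variants) are more elaborate and heavily optimized; this is a minimal illustration:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    # Count every length-k substring (k-mer) in each string.
    s_kmers = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    t_kmers = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Dot product in the implicit k-mer count space: this makes it a
    # valid kernel, since it is a real inner product of count vectors.
    return sum(s_kmers[m] * t_kmers[m] for m in s_kmers)

# The two sequences share the 3-mers GAT, ATT, TTA, TAC, ACA.
print(spectrum_kernel("GATTACA", "GATTTACA", k=3))  # 5
```

Because the function is an inner product of explicit count vectors, it satisfies the Mercer condition and can be plugged straight into an SVM as a precomputed kernel matrix.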
These specialized kernels illustrate the broader principle: the kernel is the part of an SVM that encodes your knowledge about what makes two data points similar. Choosing or designing the right kernel is, in many ways, the most important decision you make when building an SVM.

