L2 regularization is a technique that prevents machine learning models from overfitting by adding a penalty based on the size of the model’s coefficients to the loss function. The penalty is the sum of the squared coefficients, multiplied by a tunable strength parameter. This small addition pushes the model toward smaller, more balanced weights, which typically produces better predictions on new data.
How the Penalty Works
In a standard regression model, the algorithm minimizes the difference between its predictions and the actual values. L2 regularization modifies this by tacking on an extra term: the sum of every coefficient squared, scaled by a parameter usually called alpha (α) or lambda (λ). The full loss function looks like this:
Loss = (prediction error) + α × (sum of squared coefficients)
That second term is the penalty. As any individual coefficient grows larger, its square grows even faster, so the penalty disproportionately discourages big weights. The result is that the optimization process has two competing goals: fit the training data well and keep the coefficients small. The strength parameter α controls the balance. A larger α forces more shrinkage; a smaller α lets the model behave more like unregularized regression.
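The loss above can be written out in a few lines of NumPy. This is just an illustrative sketch (the helper name `ridge_loss` is ours, not a library function):

```python
import numpy as np

def ridge_loss(X, y, w, alpha):
    """Mean squared prediction error plus the L2 penalty alpha * sum(w**2)."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    penalty = alpha * np.sum(w ** 2)
    return mse + penalty

# Toy data: 5 samples, 2 features, chosen so w = (1, 1) fits perfectly.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0], [5.0, 4.0]])
y = np.array([3.0, 3.0, 6.0, 6.0, 9.0])
w = np.array([1.0, 1.0])

print(ridge_loss(X, y, w, alpha=0.0))  # pure prediction error: 0.0
print(ridge_loss(X, y, w, alpha=0.1))  # 0.0 error + 0.1 * (1 + 1) penalty
```

Note that with α = 0 the penalty vanishes and the loss reduces to plain mean squared error, which is why small α behaves like unregularized regression.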
The Bias-Variance Tradeoff
Every predictive model walks a line between two types of error. Bias is the error from oversimplifying, essentially missing real patterns. Variance is the error from being too sensitive to the training data, fitting noise that won’t repeat in new observations. An unregularized model with many features tends to have low bias but high variance: it memorizes the training set and performs poorly elsewhere.
L2 regularization deliberately introduces a small amount of bias by shrinking coefficients away from their optimal training-data values. In exchange, variance drops significantly. As Columbia University course materials on regularized regression put it, ridge estimates “trade off bias for variance.” Increasing the penalty strength α increases bias and decreases variance; decreasing it does the opposite. The sweet spot, typically found through cross-validation, is where total prediction error on unseen data is minimized.
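One common way to find that sweet spot is to fit the model at several values of α and keep the one with the lowest error on held-out data. A minimal sketch using the closed-form ridge solution (the helper name `fit_ridge` and the α grid are illustrative choices, not a prescription):

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]                     # only 3 features carry signal
y = X @ true_w + rng.normal(scale=1.0, size=200)  # noisy targets

X_train, X_val = X[:150], X[150:]                 # simple holdout split
y_train, y_val = y[:150], y[150:]

best_alpha, best_err = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = fit_ridge(X_train, y_train, alpha)
    err = np.mean((X_val @ w - y_val) ** 2)       # error on unseen data
    if err < best_err:
        best_alpha, best_err = alpha, err

print(best_alpha, best_err)
```

In practice you would use k-fold cross-validation rather than a single split (scikit-learn's `RidgeCV` automates this), but the idea is the same: let held-out error, not training error, pick α.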
Why It Helps With Correlated Features
When two or more input features are highly correlated (a situation called multicollinearity), standard regression becomes unstable. Small changes in the training data can cause coefficients to swing wildly in opposite directions because the model can’t distinguish which correlated feature deserves the credit. This inflates the standard errors of the coefficients and makes predictions unreliable.
L2 regularization fixes this by constraining the coefficients to stay small. Rather than assigning a huge positive weight to one correlated feature and a huge negative weight to another, the model distributes weight more evenly across them. Mathematically, the penalty adds α to the diagonal of the matrix that gets inverted during optimization, guaranteeing that the matrix is invertible and the solution stays well-behaved even when features are nearly identical. A larger α produces more shrinkage, reducing the impact of multicollinearity while keeping all features in the model.
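The effect is easy to see with two nearly identical features. In the sketch below (illustrative data, not a benchmark), ordinary least squares can assign wildly imbalanced weights to the correlated pair, while ridge spreads the weight evenly:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.001, size=100)     # nearly identical to x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)   # true weights: (1, 1)

# Ordinary least squares: X^T X is nearly singular, so the two
# coefficients can land far from (1, 1) in opposite directions.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: adding alpha to the diagonal keeps the inversion well-behaved
# and splits the weight roughly evenly across the correlated pair.
alpha = 1.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print(w_ols)    # unstable: large, offsetting coefficients
print(w_ridge)  # both close to 1, roughly equal
```

The ridge coefficients sum to roughly 2, matching the true combined effect, but neither feature gets all the credit.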
L2 vs. L1 Regularization
L1 regularization (also called Lasso) penalizes the sum of the absolute values of the coefficients instead of their squares. This seemingly small difference produces very different behavior.
- L2 (Ridge) shrinks all coefficients toward zero but never actually reaches zero. Every feature stays in the model, just with a reduced influence. This is useful when you believe most features contribute something and you want to dampen noise rather than eliminate variables.
- L1 (Lasso) can push coefficients all the way to exactly zero, effectively removing features from the model. This makes it a built-in feature selection tool, useful when you suspect many inputs are irrelevant.
If you want the benefits of both, Elastic Net combines L1 and L2 penalties in a single model, letting you tune how much of each to apply.
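The qualitative difference shows up even for a single coefficient. For a simple one-parameter problem, the L2 penalty rescales the coefficient toward zero, while the L1 penalty subtracts a fixed amount and clips at zero (the soft-thresholding rule). These are the penalties' proximal operators; the helper names below are ours:

```python
import numpy as np

def shrink_l2(w, lam):
    """Ridge-style shrinkage: rescale toward zero, never exactly zero."""
    return w / (1.0 + lam)

def shrink_l1(w, lam):
    """Lasso-style soft thresholding: small coefficients snap to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.4, -0.2])
print(shrink_l2(w, lam=0.5))  # every entry shrinks but stays nonzero
print(shrink_l1(w, lam=0.5))  # the two small entries land exactly at zero
```

This is the mechanism behind Lasso's built-in feature selection: any coefficient whose magnitude falls below the threshold is eliminated outright, whereas ridge only ever dampens it.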
Feature Scaling Matters
Because L2 regularization penalizes the squared size of coefficients, features measured on different scales get penalized unequally. A feature measured in thousands (like income in dollars) will naturally have a smaller coefficient than one measured in single digits (like number of bedrooms), even if both are equally important. The penalty would hit the income coefficient lightly and the bedroom coefficient heavily, distorting the model.
The fix is straightforward: standardize your features before applying L2 regularization. Subtracting the mean and dividing by the standard deviation puts every feature on the same scale, so the penalty treats all coefficients fairly. Most machine learning libraries make this easy with built-in scalers, and skipping this step is one of the most common mistakes when using ridge regression.
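Standardization takes only a couple of lines by hand (scikit-learn's `StandardScaler` does the equivalent). A sketch with made-up housing numbers:

```python
import numpy as np

# Two features on wildly different scales: income in dollars, bedrooms.
X = np.array([[72000.0, 3.0],
              [51000.0, 2.0],
              [98000.0, 4.0],
              [64000.0, 3.0]])

# Standardize: subtract each column's mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # ~0 for every column
print(X_scaled.std(axis=0))   # 1 for every column
```

One caveat: compute the mean and standard deviation on the training set only, and reuse those same values to transform validation and test data, so no information leaks from the held-out sets.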
L2 Regularization in Deep Learning
In neural networks, L2 regularization is often called “weight decay” because it continuously nudges every weight closer to zero during training. For simple gradient descent, L2 regularization and weight decay are mathematically identical (after adjusting for the learning rate). But this equivalence breaks down with adaptive optimizers like Adam, which scale gradients differently for each parameter.
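The SGD equivalence is easy to verify numerically: adding λw to the gradient before the step gives the same result as first decaying the weight by lr·λ and then stepping on the plain gradient. A quick check with illustrative values:

```python
import numpy as np

w = np.array([0.8, -1.5])
grad = np.array([0.3, 0.1])  # gradient of the unregularized loss
lr, lam = 0.1, 0.01

# L2 regularization: fold lam * w into the gradient, then step.
w_l2 = w - lr * (grad + lam * w)

# Weight decay: shrink the weight directly, then take the plain step.
w_decay = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_decay))  # the two updates coincide for plain SGD
```

Expanding the first expression gives (1 − lr·λ)w − lr·grad, which is exactly the second, so for vanilla SGD the two views are interchangeable.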
The problem is subtle but important. When L2 regularization is combined with Adam, the regularization gradient gets mixed in with the loss gradient before Adam’s adaptive scaling is applied. This means weights that have historically large gradients end up being regularized less than they should be. In practice, the regularization becomes uneven across different weights in the network.
Loshchilov and Hutter's paper "Decoupled Weight Decay Regularization" introduced AdamW, which decouples weight decay from the gradient update. Instead of adding the regularization term to the gradient, AdamW applies weight decay as a separate step after the adaptive update. This ensures every weight is regularized at the same rate, regardless of its gradient history. AdamW has become the default optimizer for training large neural networks, including most transformer-based language models, precisely because it handles regularization more consistently than Adam with standard L2.
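The difference can be sketched as two single-parameter update rules. This is a simplified version that omits Adam's bias correction for brevity (real implementations such as `torch.optim.AdamW` include it); the function names are illustrative:

```python
import numpy as np

def adam_l2_step(w, grad, m, v, lr=0.001, lam=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2: the penalty enters the gradient, so it gets
    rescaled by the per-parameter adaptive denominator."""
    g = grad + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, lr=0.001, lam=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: the adaptive step uses only the loss gradient; weight
    decay is applied as a separate, uniform shrinkage."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * (m / (np.sqrt(v) + eps) + lam * w)
    return w, m, v

# Run both on the same weights and gradients: the trajectories diverge
# because Adam's adaptive scaling distorts the folded-in penalty.
w1 = w2 = np.array([10.0, 0.1])
grad = np.array([1.0, 1.0])
m1, v1 = np.zeros(2), np.zeros(2)
m2, v2 = np.zeros(2), np.zeros(2)
for _ in range(100):
    w1, m1, v1 = adam_l2_step(w1, grad, m1, v1)
    w2, m2, v2 = adamw_step(w2, grad, m2, v2)

print(w1)  # Adam + L2
print(w2)  # AdamW
```

In the Adam + L2 version, the λw term is divided by √v like everything else, so weights with large historical gradients feel less decay; in AdamW the decay term bypasses that scaling entirely.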
When to Use L2 Regularization
L2 regularization is a strong default choice in several common situations. If your model has many features relative to the number of training examples, it’s likely to overfit, and L2 will help. If your features are correlated with each other, L2 will stabilize the coefficients without forcing you to manually remove redundant variables. And if you want a simple, reliable way to improve generalization without much tuning, L2 regularization with cross-validated strength selection is hard to beat.
It’s less ideal when you actually need the model to identify which features matter and discard the rest. Since L2 never zeros out a coefficient, it won’t simplify your model in that way. For feature selection, L1 or Elastic Net is the better tool. In deep learning, favor AdamW over naive L2 regularization when using adaptive optimizers, since the decoupled approach produces more predictable and effective regularization.

