The L2 norm is a way to measure the size, or “length,” of a vector. Also called the Euclidean norm, it works exactly like the distance formula you learned in geometry: square each component of the vector, add those squares together, then take the square root. For a vector with components (3, 4), the L2 norm is √(9 + 16) = 5. If that looks familiar, it’s the Pythagorean theorem extended to any number of dimensions.
The Formula
For a vector with n components (x₁, x₂, …, xₙ), the L2 norm is:
‖x‖₂ = √(x₁² + x₂² + … + xₙ²)
Take a three-dimensional vector like (3, 4, 12). You square each component (9, 16, 144), sum them (169), and take the square root to get 13. The result is a single non-negative number that represents how “big” that vector is. A zero vector is the only vector with an L2 norm of zero.
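The calculation above can be sketched in a few lines of NumPy. The vector here is the (3, 4, 12) example; `np.linalg.norm` defaults to the L2 norm for vectors, so it should agree with the manual square-sum-root computation.

```python
import numpy as np

# The three-dimensional example vector from the text
x = np.array([3.0, 4.0, 12.0])

# Square each component, sum the squares, take the square root
l2 = np.sqrt(np.sum(x ** 2))

# NumPy's built-in norm computes the L2 norm by default
builtin = np.linalg.norm(x)
```

Both `l2` and `builtin` come out to 13.0, matching the hand calculation.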
What It Means Geometrically
The L2 norm gives you the straight-line distance from the origin (the zero point) to the tip of the vector. In two dimensions, this is ordinary Euclidean distance. In three dimensions, it’s the length of an arrow pointing from the center of a room to some point in space. The same logic extends to 100 or 10,000 dimensions, even though you can’t visualize them.
A closely related concept is Euclidean distance between two points. If you have two vectors A and B, the distance between them is the L2 norm of their difference: √(Σ(Aᵢ − Bᵢ)²). The L2 norm itself is just the special case where one of those points is the origin.
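As a quick sketch of that relationship, here are two hypothetical points whose difference vector is (−3, −4); the distance between them is the L2 norm of that difference:

```python
import numpy as np

# Two hypothetical points in the plane
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance = L2 norm of the difference vector
# sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5
dist = np.linalg.norm(a - b)
```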
How It Differs From the L1 Norm
The L1 norm measures vector size differently: instead of squaring and square-rooting, it simply adds up the absolute values of each component. For a vector (3, −4), the L1 norm is |3| + |−4| = 7, while the L2 norm is √(9 + 16) = 5. The two norms answer different questions about the same vector and produce different geometric shapes.
If you draw all the points that have a norm of exactly 1 in two dimensions, the L2 norm traces out a circle. The L1 norm traces out a diamond (a square rotated 45 degrees), with corners at (1, 0), (0, 1), (−1, 0), and (0, −1). There’s also an L-infinity norm, which takes the maximum absolute value of any component, and its unit shape is a square with corners at (±1, ±1). As you increase the “p” in the general Lp norm, the unit shape gradually bulges outward from the L1 diamond toward the L-infinity square, passing through the L2 circle at p = 2.
This geometric difference matters in practice. The L1 norm’s diamond shape has sharp corners that sit on the coordinate axes, which tends to push solutions toward sparse results where some components are exactly zero. The L2 norm’s smooth circle spreads values more evenly, shrinking all components without eliminating any of them entirely.
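The comparison between the norms can be checked directly with `np.linalg.norm`, whose `ord` parameter selects which norm to compute. Using the (3, −4) vector from above:

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.linalg.norm(v, ord=1)         # |3| + |-4| = 7
l2 = np.linalg.norm(v)                # sqrt(9 + 16) = 5
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |-4|) = 4
```

Note the consistent ordering L-infinity ≤ L2 ≤ L1, which holds for any vector.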
Key Mathematical Properties
The L2 norm satisfies four properties that all norms must have:
- Non-negativity: the norm is always zero or positive
- Zero only at zero: the only vector with a norm of 0 is the zero vector itself
- Scaling: multiplying a vector by a constant scales the norm by the absolute value of that constant, so ‖−3x‖ = 3‖x‖
- Triangle inequality: the norm of the sum of two vectors is never greater than the sum of their individual norms, meaning ‖x + y‖ ≤ ‖x‖ + ‖y‖
The triangle inequality is the formal version of “a straight line is the shortest path between two points.” These properties guarantee that the L2 norm behaves like a sensible notion of distance.
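A quick numerical sanity check of the scaling and triangle-inequality properties, using arbitrary random vectors (the specific values are just an illustration):

```python
import numpy as np

# Arbitrary example vectors and an arbitrary (negative) scalar
rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
c = -3.0

# Scaling: ||c*x|| equals |c| * ||x||
scaling_ok = np.isclose(np.linalg.norm(c * x), abs(c) * np.linalg.norm(x))

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
triangle_ok = np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
```

Both checks pass for any choice of `x`, `y`, and `c`, not just these values.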
L2 Norm in Machine Learning
The L2 norm shows up constantly in machine learning, most visibly in two places: loss functions and regularization.
Loss Functions
When a model makes predictions, you need to measure how wrong those predictions are. L2 loss (also called squared error loss) takes the difference between each predicted value and the actual value, then squares it. Averaging those squared errors gives you Mean Squared Error (MSE). Taking the square root of MSE gives you Root Mean Square Error (RMSE), which is essentially the L2 norm of the error vector divided by the square root of the number of data points. RMSE is popular because it’s in the same units as the thing you’re predicting, making it easy to interpret.
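The MSE/RMSE relationship described above can be sketched as follows; the prediction and target values here are made up for illustration:

```python
import numpy as np

# Hypothetical targets and model predictions
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)   # average of the squared errors
rmse = np.sqrt(mse)          # same units as the target variable

# RMSE is the L2 norm of the error vector divided by sqrt(n)
identity_holds = np.isclose(rmse, np.linalg.norm(errors) / np.sqrt(len(errors)))
```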
Regularization and Weight Decay
L2 regularization, commonly called Ridge regularization, adds a penalty to the loss function based on the squared L2 norm of the model’s weights. The regularized loss looks like this: the normal prediction error plus α times the sum of the squared weights, where α controls how aggressively the penalty is applied. This discourages any single weight from growing excessively large and pushes the model toward smaller, more evenly distributed coefficients.
As α increases, the weights shrink closer to zero, gradually reducing the influence of features that might otherwise dominate the model. This fights overfitting because large weights often mean the model is memorizing noise in the training data rather than learning genuine patterns. Unlike L1 regularization (used in Lasso regression), L2 regularization rarely drives weights all the way to zero. It shrinks everything proportionally instead of eliminating features outright.
In neural networks, L2 regularization is often called weight decay. When using standard stochastic gradient descent, the two are mathematically equivalent: the gradient of the L2 penalty term produces the same weight update as the weight decay rule. With more advanced optimizers that use momentum or adaptive learning rates, though, the equivalence breaks down, and the two techniques can produce different results.
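A minimal sketch of that equivalence under plain SGD, with made-up weights, learning rate, penalty strength, and a stand-in gradient of the data loss:

```python
import numpy as np

# Hypothetical weights and hyperparameters (illustrative values only)
w = np.array([0.5, -1.2, 2.0])
lr, alpha = 0.1, 0.01
grad_loss = np.array([0.3, -0.1, 0.4])  # stand-in gradient of the data loss

# SGD with an L2 penalty alpha * ||w||^2 added to the loss:
# the penalty's gradient is 2 * alpha * w
w_l2 = w - lr * (grad_loss + 2 * alpha * w)

# Weight decay: shrink the weights directly, then apply the plain gradient
w_decay = w * (1 - lr * 2 * alpha) - lr * grad_loss

equivalent = np.allclose(w_l2, w_decay)
```

The two updates are algebraically identical here, which is exactly the equivalence that breaks once momentum or adaptive learning rates enter the update rule.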
Why L2 Is the Default in Many Fields
The L2 norm is often the first choice in statistics, physics, and engineering for a practical reason: squaring is smooth and differentiable everywhere, which makes optimization straightforward. The L1 norm has a sharp corner at zero, creating complications for gradient-based methods. The L2 norm also connects directly to the Pythagorean theorem and to the concept of energy in physics, where quantities like kinetic energy depend on the square of velocity. When someone says “distance” without further qualification, they almost always mean the L2 norm.

