What Is Huber Loss and When Should You Use It?

Huber loss is a loss function used in regression that combines the best properties of squared error and absolute error. For small errors, it behaves like mean squared error (MSE), penalizing predictions quadratically. For large errors, it switches to a linear penalty like mean absolute error (MAE), which prevents outliers from dominating the model. The transition point between these two behaviors is controlled by a single parameter called delta.

Introduced by statistician Peter J. Huber in his 1964 paper “Robust Estimation of a Location Parameter,” it was originally designed for robust statistics. Today it’s widely used in machine learning, reinforcement learning, and any regression task where your data might contain outliers.

How Huber Loss Works

Huber loss is defined as a piecewise function. When the prediction error is small (absolute value less than or equal to delta), the loss is one half of the squared error, just like MSE. When the error is large (absolute value greater than delta), the loss is delta times the absolute error, minus one half delta squared: a linear penalty like MAE, with the constant offset chosen so the two pieces meet exactly at the transition point. This means the function is quadratic near zero and linear in the tails.
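The piecewise definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name huber_loss is just for this example.

```python
import numpy as np

def huber_loss(error, delta=1.0):
    """Piecewise Huber loss: quadratic for |error| <= delta, linear beyond."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error**2
    # Linear branch, offset by delta^2/2 so the two pieces meet at |error| = delta.
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

# A small error is penalized quadratically, a large one linearly:
print(huber_loss(np.array([0.5, 3.0]), delta=1.0))  # [0.125 2.5]
```

At error = 0.5 the quadratic branch gives 0.5 × 0.25 = 0.125; at error = 3.0 the linear branch gives 1.0 × (3.0 − 0.5) = 2.5, far below the 4.5 that MSE's half-squared-error would produce.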

The quadratic region near zero is useful because it’s smooth and differentiable, which makes gradient-based optimization stable and efficient. The linear region in the tails is useful because it grows much more slowly than a squared function, so a single extreme outlier can’t blow up your loss and drag the model toward it.

The key insight is that this isn’t just gluing two functions together. The piecewise definition is constructed so the function and its first derivative are continuous everywhere. The gradient transitions smoothly from being proportional to the error (in the quadratic region) to being a constant (in the linear region). This continuity is what makes Huber loss practical for optimization algorithms that rely on gradients.
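That gradient behavior has a compact expression: inside the quadratic region the derivative with respect to the error is the error itself, and in the tails it saturates at plus or minus delta, which is exactly what clipping does. A quick sketch (huber_gradient is an illustrative name, not a library function):

```python
import numpy as np

def huber_gradient(error, delta=1.0):
    # In the quadratic region the gradient equals the error itself;
    # in the linear region it is a constant +/-delta. np.clip covers both.
    return np.clip(error, -delta, delta)

# The gradient approaches delta smoothly from both sides of the transition:
print(huber_gradient(np.array([0.5, 0.999, 1.0, 1.001, 50.0]), delta=1.0))
```

Note there is no jump at the transition point: as the error crosses delta, the gradient arrives at exactly delta from the quadratic side and stays there on the linear side.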

The Role of Delta

Delta is the hyperparameter that controls where Huber loss switches from quadratic to linear behavior. Any prediction error whose absolute value is at most delta gets the MSE-style treatment. Any error whose absolute value exceeds delta gets the MAE-style treatment.

Choosing delta is really about defining what counts as an “outlier” in your problem. A smaller delta means you’re saying that even moderate errors should be treated linearly, making the model more robust to outliers but less sensitive to small differences in predictions. A larger delta means you’re giving the quadratic penalty more room to operate, making the model behave more like standard MSE and respond more aggressively to errors of all sizes.

In scikit-learn’s HuberRegressor, this parameter is called epsilon and defaults to 1.35. That default comes from robust statistics theory, where 1.35 provides roughly 95% of the efficiency of ordinary least squares when the data is actually normally distributed, while still offering meaningful protection against outliers. In practice, you’ll often tune this value using cross-validation based on your specific dataset.
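A short sketch of HuberRegressor in action against ordinary least squares, on synthetic data with a few injected outliers. The data-generation details here are illustrative assumptions, not from the original text:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.3, size=100)  # true slope is 2.0
y[:5] += 30  # inject a few gross outliers

huber = HuberRegressor(epsilon=1.35).fit(X, y)  # epsilon plays the role of delta
ols = LinearRegression().fit(X, y)

print("Huber slope:", huber.coef_[0])  # stays close to the true slope of 2
print("OLS slope:  ", ols.coef_[0])
```

The Huber fit stays close to the bulk of the data, while the least-squares fit is dragged toward the outliers (how much depends on where they land).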

Why Not Just Use MSE or MAE?

MSE squares every error, which means large errors get amplified dramatically. If your training data has even a few outliers, those squared penalties can dominate the total loss and pull your model’s predictions toward the outliers instead of fitting the majority of data points well. MSE works beautifully on clean data, but real-world data is rarely clean.

MAE avoids this problem by treating all errors linearly, so outliers have proportional rather than outsized influence. The tradeoff is that MAE’s gradient is constant (either +1 or -1), which creates two problems. First, the gradient doesn’t shrink as predictions get closer to the target, making it harder for optimization to converge precisely. Second, MAE isn’t differentiable at zero, which can cause instability during training.

Huber loss gives you the convergence benefits of MSE where it matters (near the correct answer) and the outlier robustness of MAE where it matters (far from the correct answer). It’s differentiable everywhere, and its gradient naturally decreases as predictions improve, helping the model settle into a good solution without oscillating.
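The three behaviors can be compared side by side. This quick sketch (the helper names mse, mae, and huber are illustrative) shows how each loss responds as the error grows from small to outlier-sized:

```python
import numpy as np

def mse(e):
    return e**2

def mae(e):
    return np.abs(e)

def huber(e, delta=1.0):
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e**2, delta * (a - 0.5 * delta))

for e in [0.1, 1.0, 10.0]:
    print(f"error={e:5.1f}  MSE={mse(e):7.2f}  MAE={mae(e):5.2f}  Huber={float(huber(e)):5.2f}")
```

At an error of 10, MSE charges 100 while Huber charges only 9.5: the outlier still contributes, but it can no longer dominate the total loss.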

Pseudo-Huber Loss

Standard Huber loss has one minor limitation: while it’s continuous and has a continuous first derivative, its second derivative is discontinuous at the transition points (positive and negative delta), jumping from one to zero. For most applications this doesn’t matter, but some optimization methods that use second-order information benefit from a fully smooth function.

The pseudo-Huber loss solves this by approximating Huber loss with a single smooth formula instead of a piecewise definition. It’s calculated as delta squared multiplied by the quantity: the square root of one plus the error squared over delta squared, minus one. This produces a curve that closely matches Huber loss in both the quadratic and linear regions but transitions between them with no abrupt changes in any derivative. SciPy provides a built-in implementation through its scipy.special.pseudo_huber function.
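The formula translates directly to code, and we can check a hand-rolled version against SciPy’s built-in. The manual function name is illustrative; scipy.special.pseudo_huber is the real API:

```python
import numpy as np
from scipy.special import pseudo_huber

def pseudo_huber_manual(delta, r):
    # delta^2 * (sqrt(1 + (r/delta)^2) - 1): one smooth formula,
    # quadratic near zero and asymptotically linear in the tails.
    return delta**2 * (np.sqrt(1 + (r / delta)**2) - 1)

r = np.array([0.1, 1.0, 10.0])
print(pseudo_huber(1.0, r))         # SciPy's built-in (delta first, then error)
print(pseudo_huber_manual(1.0, r))  # should match
```

Near zero the values track half the squared error (0.1 gives about 0.005), and at an error of 10 the result is about 9.05, close to standard Huber’s 9.5, with no kink anywhere in between.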

When To Use Huber Loss

Huber loss is a strong default choice for regression problems where you suspect your data contains outliers but you don’t want to throw those data points away entirely. Common applications include sensor data (where occasional bad readings are inevitable), financial data (where extreme price movements are real but shouldn’t dominate a model), and reinforcement learning (where reward signals can be noisy or unbounded).

If your data is genuinely clean and normally distributed, MSE will typically perform just as well and is simpler to work with. If your data has heavy-tailed noise and you care more about median-like predictions than mean-like predictions, MAE might be the better choice. Huber loss occupies the practical middle ground, and adjusting delta lets you slide along the spectrum between MSE-like and MAE-like behavior to find the right fit for your problem.