Feature scaling is a data preprocessing step that transforms the values of different features (variables) into a comparable range before feeding them into a machine learning algorithm. Without it, a feature measured in thousands can dominate one measured in single digits, even if the smaller-numbered feature is more important for making accurate predictions.
Why Raw Numbers Mislead Algorithms
Imagine you’re building a model with two features: one ranges from 0 to 1,000 and another ranges from 1 to 10. Many algorithms treat those raw numbers as if they’re on equal footing. When the algorithm calculates the distance between two data points, the feature with the larger range overwhelms the calculation, and the smaller feature gets virtually ignored. A scikit-learn example illustrates this perfectly: in a wine dataset, the variable “proline” (ranging from 0 to 1,000) completely dominated a nearest-neighbor classifier, while “hue” (ranging from 1 to 10) barely influenced the result. Scaling the data produced an entirely different, and more accurate, decision boundary.
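A quick back-of-the-envelope calculation shows how lopsided the distance becomes. This sketch uses two made-up points whose features differ by 100 (on the 0-to-1,000 feature) and by 5 (on the 1-to-10 feature):

```python
import numpy as np

# Two hypothetical data points: feature 0 spans 0-1,000, feature 1 spans 1-10.
a = np.array([500.0, 2.0])
b = np.array([600.0, 7.0])

# Euclidean distance on the raw values.
raw_dist = np.linalg.norm(a - b)

# How much of the squared distance each feature contributes.
contrib = (a - b) ** 2
share_large = contrib[0] / contrib.sum()

print(f"distance: {raw_dist:.2f}")                        # ≈ 100.12
print(f"large-range feature's share: {share_large:.1%}")  # ≈ 99.8%
```

The small feature's difference of 5 is real information, but it accounts for well under 1% of the distance.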
The same problem shows up in principal component analysis (PCA), a common technique for reducing the number of features in a dataset. PCA looks for the directions in which data varies the most. If one feature simply has bigger numbers, PCA will conclude that feature drives the most variance, even when that variance is just an artifact of scale rather than genuine signal.
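A small sketch with synthetic data makes the effect visible (the distributions here are invented purely for illustration; scikit-learn's PCA and StandardScaler are assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0: big numbers but pure noise. Feature 1: small numbers.
big_noise = rng.normal(500.0, 200.0, size=200)
small = rng.normal(5.0, 1.5, size=200)
X = np.column_stack([big_noise, small])

# On raw data, the first component explains nearly all the variance,
# simply because the big-numbered feature has bigger numbers.
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.explained_variance_ratio_)  # first ratio ≈ 1.0

# After standardization, both features contribute comparably.
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(pca_std.explained_variance_ratio_)
```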
Min-Max Normalization
Min-max normalization rescales every value in a feature to fit between 0 and 1. The formula is straightforward: subtract the feature’s minimum value from each data point, then divide by the range (maximum minus minimum). After this transformation, the smallest original value becomes 0, the largest becomes 1, and everything else falls proportionally in between. A variation of this approach maps values to a -1 to 1 range instead.
This method works well when you know your data doesn’t contain extreme outliers and you want a bounded, predictable range. It’s a natural fit for neural networks, which often expect inputs in a narrow range, and for algorithms like k-nearest neighbors where distance calculations need features on equal footing.
The downside: min-max normalization is sensitive to outliers. A single extreme value stretches the range, compressing all the other data points into a tiny portion of the 0-to-1 scale.
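Both behaviors, the proportional mapping and the outlier compression, are easy to demonstrate with scikit-learn's MinMaxScaler on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Clean data: values map proportionally onto [0, 1].
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]

# One extreme value stretches the range: the same first four values
# are now squeezed into a tiny sliver near 0.
x_out = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
scaled_out = MinMaxScaler().fit_transform(x_out)
print(scaled_out.ravel())  # first four land below 0.004; only the outlier hits 1
```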
Standardization (Z-Score Normalization)
Standardization transforms each feature so it has a mean of 0 and a standard deviation of 1. For each value, you subtract the feature’s mean, then divide by its standard deviation. Unlike min-max normalization, the result isn’t bounded to a fixed range. Values can land above 1 or below -1, depending on how far they sit from the average.
This is the most widely used scaling method for algorithms that assume normally distributed data or that rely on gradient-based optimization. Logistic regression, support vector machines (SVMs), and neural networks all converge faster and more reliably on standardized inputs. However, standardization still uses the mean and standard deviation in its calculation, which means outliers can skew the result by pulling those statistics away from where most of the data sits.
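A short sketch with scikit-learn's StandardScaler shows the mean-0, standard-deviation-1 result, including values landing beyond ±1 (the input numbers are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Subtract the mean (30), divide by the standard deviation (~14.14).
z = StandardScaler().fit_transform(x)

print(z.ravel().round(3))  # [-1.414 -0.707  0.     0.707  1.414]
print(z.mean(), z.std())   # mean 0, std 1 by construction
```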
Robust Scaling for Messy Data
When your dataset contains significant outliers, robust scaling offers a better alternative. Instead of using the mean and standard deviation, it centers data around the median and scales it using the interquartile range (the spread between the 25th and 75th percentiles). Because those statistics aren’t influenced by a handful of extreme values, adding or removing outliers from the training set produces roughly the same transformation.
The trade-off is that the resulting values span a wider, less predictable range than other methods. In practice, most transformed values tend to fall somewhere between -2 and 3, but the outliers themselves remain in the data. They’re just no longer distorting the scale for everything else.
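The contrast with standardization is easy to see on a toy feature with one extreme value (scikit-learn's RobustScaler and StandardScaler assumed; the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values plus one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

robust = RobustScaler().fit_transform(x)      # centers on median, scales by IQR
standard = StandardScaler().fit_transform(x)  # centers on mean, scales by std

# Robust scaling keeps a usable spread for the ordinary values...
print(robust[:5].ravel())    # [-1.  -0.6 -0.2  0.2  0.6]
# ...while the outlier drags the mean and std so far that standardization
# compresses those same five values into a span of roughly 0.01.
print(standard[:5].ravel())
```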
Power Transforms for Skewed Distributions
Sometimes the problem isn’t just scale but shape. If a feature’s values are heavily skewed to the right (a long tail of large values), linear scaling methods won’t fix the underlying distribution. Power transforms like Box-Cox and Yeo-Johnson apply a nonlinear mathematical function to make the data more symmetric and closer to a bell curve. The transformation strength is controlled by a parameter that can be automatically tuned to find the best fit.
Box-Cox only works with strictly positive values. Yeo-Johnson generalizes the approach to handle zero and negative values as well. These transforms are particularly useful before applying algorithms that assume roughly normal input distributions, or when even a log transformation isn’t strong enough to correct severe skewness.
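As a sketch, here is scikit-learn's PowerTransformer (Yeo-Johnson) applied to a synthetic right-skewed feature; the before/after skewness is measured with scipy.stats.skew:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# A right-skewed feature: a long tail of large values.
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson handles zeros and negatives too; the strength
# parameter (lambda) is estimated from the data automatically.
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)

print(f"skew before: {skew(x.ravel()):.2f}")    # strongly positive
print(f"skew after:  {skew(x_t.ravel()):.2f}")  # near 0
```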
Which Algorithms Need Scaling
Not every model cares about feature scale. The distinction comes down to how the algorithm makes its decisions.
- Sensitive to scaling: K-nearest neighbors, SVMs, neural networks, logistic regression, and PCA. These algorithms either compute distances between data points, use gradient descent to optimize, or assume features contribute on a comparable scale. A large-scale study on arXiv confirmed that logistic regression, SVMs, neural networks, KNN, and TabNet all showed significant performance changes depending on which scaler was used.
- Generally insensitive: Tree-based models like random forests, XGBoost, CatBoost, and LightGBM. Decision trees split data based on thresholds within a single feature at a time, so the relative scale between features doesn’t affect the split. These ensemble methods demonstrated robust performance largely independent of scaling across both regression and classification tasks.
That said, “insensitive” doesn’t mean “never benefits.” Even with tree-based models, scaling can occasionally help with convergence speed or when combining trees with other techniques in a pipeline.
Choosing the Right Method
Your choice depends on two things: the shape of your data and the algorithm you plan to use.
If your data is roughly normally distributed without major outliers, standardization is the default starting point. It works with the widest range of algorithms and is the most commonly recommended preprocessing step. If your algorithm specifically expects inputs in a fixed range (like a neural network with sigmoid activation), min-max normalization is the better fit.
If your data has significant outliers you can’t or don’t want to remove, robust scaling protects the majority of your data from being compressed by those extreme values. And if your features are heavily skewed rather than just differently scaled, a power transform can reshape the distribution before you apply any linear scaler on top.
One practical detail that’s easy to overlook: always fit your scaler on the training data only, then apply that same transformation to the test data. If you scale the entire dataset before splitting, information from the test set leaks into the training process, giving you an overly optimistic view of how well your model will perform on new data.
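In scikit-learn terms, that discipline looks like this (a minimal sketch with synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(50.0, 10.0, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training split only...
scaler = StandardScaler().fit(X_train)

# ...then apply that same, already-fitted transformation to both splits.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The test split is scaled with *training* statistics, so its mean and
# std won't be exactly 0 and 1 -- and that's exactly what you want.
print(X_train_s.mean(axis=0))
print(X_test_s.mean(axis=0))
```

Wrapping the scaler and the model together in a scikit-learn Pipeline enforces this automatically: during cross-validation, the scaler is refit on each training fold and never sees the held-out data.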

