Standardizing a variable means transforming it so it has a mean of 0 and a standard deviation of 1. You do this by subtracting the mean from each value and dividing the result by the standard deviation. The formula is simple: z = (x − μ) / σ, where x is the original value, μ is the mean of all values, and σ is the standard deviation. The result, called a z-score, tells you how many standard deviations a value sits above or below the average.
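That formula takes only a few lines to sketch in NumPy (the numbers here are made up for illustration):

```python
import numpy as np

values = np.array([4.0, 8.0, 6.0, 2.0])
mu = values.mean()             # μ: the mean of all values
sigma = values.std()           # σ: the (population) standard deviation
z = (values - mu) / sigma      # z-scores: distance from the mean in SD units
```

By construction, the resulting z-scores have a mean of 0 and a standard deviation of 1.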
Why Standardizing Matters
Variables in a dataset often live on completely different scales. One column might range from 0 to 1,000 while another ranges from 1 to 10. When you feed raw data into many algorithms, the variable with the larger range dominates simply because its numbers are bigger, not because it carries more useful information. In a real example using wine data from scikit-learn, a feature called “proline” (ranging up to 1,000) completely overshadowed a feature called “hue” (ranging from 1 to 10) when calculating distances between data points. The model essentially ignored hue, even though it was just as informative.
This problem shows up in any algorithm that relies on distances or directions in data space. K-nearest neighbors classifies points by how close they are to each other, so a feature measured in thousands will swamp one measured in single digits. Principal component analysis looks for the directions of greatest variance, and without standardization, it will mistake a large scale for large importance. Support vector machines and logistic regression also train more reliably when features are on comparable scales: optimization converges faster, and regularization penalties treat coefficients evenly instead of punishing features simply for being measured in small units.
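The distance-domination effect is easy to reproduce. Here is a small sketch with NumPy, using invented numbers shaped like the proline/hue example above:

```python
import numpy as np

# toy data mimicking the wine example: column 0 on a proline-like scale,
# column 1 on a hue-like scale
X = np.array([[1000.0, 2.0],
              [ 600.0, 9.0],
              [ 800.0, 5.0]])

# raw Euclidean distance between the first two rows is dominated by column 0:
# the hue-like difference of 7 barely registers against proline's 400
raw = np.linalg.norm(X[0] - X[1])

# standardize each column, and the same distance weighs both features
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = np.linalg.norm(Z[0] - Z[1])
```

After standardizing, both features contribute comparable amounts to the distance, which is what a distance-based model like k-nearest neighbors needs.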
Tree-based models like random forests and gradient-boosted trees are a notable exception. They split data based on thresholds within individual features, so the absolute scale of a variable doesn’t affect the outcome.
Standardization vs. Min-Max Normalization
Standardization (z-score scaling) and min-max normalization are two different approaches to putting variables on a common scale. Min-max normalization rescales values to a fixed range, typically 0 to 1, by subtracting the minimum and dividing by the range. Standardization centers values around zero with unit variance instead.
Standardization is generally preferred when your data roughly follows a bell-shaped distribution or when you’re using linear models like logistic regression and support vector machines. These algorithms perform better when features resemble a standard normal distribution with zero mean and unit variance, because this helps the model learn appropriate weights for each feature. Min-max normalization is more useful when you specifically need bounded values within a fixed interval, such as when feeding data into a neural network layer that expects inputs between 0 and 1.
Another advantage of standardization is resilience to shifting distributions. If the range of values for a feature changes between datasets (a phenomenon called concept drift), min-max normalization can produce wildly different scaled values. Standardization handles this more gracefully because it anchors to the mean and spread rather than to extreme values.
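A quick sketch of the drift problem, with invented numbers: when one extreme value arrives, the min-max scaling of every existing point changes drastically, because min-max anchors to the minimum and maximum.

```python
import numpy as np

def minmax(v):
    return (v - v.min()) / (v.max() - v.min())

def zscore(v):
    return (v - v.mean()) / v.std()

batch1 = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
batch2 = np.append(batch1, 100.0)   # same feature after one extreme value arrives

# min-max rescales the value 12 from about 0.67 down to about 0.02,
# because the new maximum stretches the range from 3 to 90
```

The z-scores of the original points also shift when the new value arrives, but not nearly as violently, because the mean and standard deviation summarize the whole distribution rather than its two most extreme points.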
Step-by-Step Calculation
Suppose you have five exam scores: 72, 85, 90, 68, and 95. To standardize them:
- Calculate the mean: (72 + 85 + 90 + 68 + 95) / 5 = 82
- Calculate the standard deviation: Find how far each score is from 82, square those differences, average them, and take the square root. The squared differences sum to 538, the variance is 538 / 5 = 107.6, and the standard deviation is approximately 10.4.
- Apply the formula to each value: For the score of 72, the z-score is (72 − 82) / 10.4 ≈ −0.96. For 95, it’s (95 − 82) / 10.4 ≈ 1.25.
A z-score of −0.96 means that score falls about one standard deviation below the mean. A z-score of 1.25 sits about 1.25 standard deviations above. After transformation, the five standardized values will average exactly zero, and their standard deviation will be exactly one.
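The steps above can be checked in a few lines of NumPy (using the population standard deviation, which divides by n, matching the calculation described here):

```python
import numpy as np

scores = np.array([72.0, 85.0, 90.0, 68.0, 95.0])
mu = scores.mean()              # mean: 82.0
sigma = scores.std()            # population SD: roughly 10.4
z = (scores - mu) / sigma       # z-score for each exam score

# the standardized scores average exactly zero with unit standard deviation
```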
Handling Outliers
Standard z-score scaling has a weakness: outliers distort the mean and standard deviation, which throws off the transformation for every other data point. If a few extreme values inflate the standard deviation, the bulk of your data gets squeezed into a narrow range. In one scikit-learn demonstration, outliers caused most of the transformed data to land between −0.2 and 0.2 for one feature while spreading between −2 and 4 for another, defeating the purpose of putting features on a common scale.
A robust alternative uses the median and interquartile range (IQR) instead of the mean and standard deviation. You subtract the median from each value and divide by the IQR (the range between the 25th and 75th percentiles). Because the median and IQR aren’t pulled by extreme values, adding or removing outliers from the dataset produces roughly the same transformation. The outliers themselves remain in the data, but they no longer warp the scaling of everything else. Most transformed values end up in a comparable range across features, which is the whole point.
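A minimal sketch of median/IQR scaling with NumPy, on invented data (scikit-learn's RobustScaler implements the same idea):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier

median = np.median(X)
q25, q75 = np.percentile(X, [25, 75])

X_robust = (X - median) / (q75 - q25)   # subtract the median, divide by the IQR

# the first four values land in [-1, 1]; the outlier stays extreme,
# but it no longer compresses the scaling of everything else
```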
Implementation in Python
Python’s scikit-learn library provides a StandardScaler class that handles the math for you. The typical workflow looks like this:
First, import the scaler and create an instance. Then call fit on your training data to compute the mean and standard deviation of each feature. Finally, call transform to apply the scaling. When working with training data, you can combine the fit and transform steps into a single fit_transform call.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Notice that you only call fit_transform on the training set. For the test set, you call transform alone, using the mean and standard deviation already learned from the training data. This distinction matters, and the next section explains why.
Implementation in R
R has a built-in scale() function that standardizes columns by default. Calling scale(x) subtracts each column’s mean and divides by its standard deviation. Both the center and scale arguments default to TRUE. You can pass custom values instead of using column means and standard deviations by supplying numeric vectors to either argument. For example, scale(x, center = FALSE) skips the mean subtraction and only divides by the root mean square of each column, which is not the same as dividing by the standard deviation unless the data is already centered.
Avoiding Data Leakage
One of the most common mistakes when standardizing is fitting the scaler on the entire dataset before splitting into training and test sets. This leaks information from the test set into the training process, because the mean and standard deviation now reflect data the model isn’t supposed to have seen. The result is an inflated sense of how well your model performs.
The correct procedure: split your data first, then fit the scaler only on the training set. Apply that same transformation (same mean, same standard deviation) to the test set. This way, the test set is scaled using statistics it had no part in creating, which mirrors what happens in the real world when new data arrives. If you’re using cross-validation, the same principle applies within each fold. Automating this through a pipeline prevents accidental contamination.
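Here is a sketch of the pipeline approach in scikit-learn. During cross-validation, the pipeline refits the scaler on each fold's training portion only, so test-fold statistics never leak into training. The wine dataset and logistic regression are just illustrative choices:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# scaling happens inside each fold, never on the full dataset up front
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(pipe, X, y, cv=5)
```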
Interpreting Standardized Coefficients
Standardizing your variables changes how you interpret regression results. With raw, unstandardized variables, a regression coefficient tells you the expected change in the outcome for a one-unit increase in that predictor, and one unit means something different for every variable. After standardization, the coefficient represents the expected change in the outcome for a one-standard-deviation increase in the predictor. This puts all predictors on a common footing, making it straightforward to compare which ones have the strongest relationship with the outcome.
This is especially useful when your predictors use incompatible units. Comparing the effect of income (measured in thousands of dollars) to education (measured in years) is meaningless with raw coefficients. Standardized coefficients let you say which factor has a relatively larger association with the outcome. The apparent strength of a relationship can look small, medium, or large depending entirely on how the variable was scaled, so consistency in scaling choices matters when comparing results across studies.
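The difference can be sketched on simulated data (the income/education setup and the coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50.0, 15.0, n)     # thousands of dollars (simulated)
education = rng.normal(14.0, 2.0, n)   # years (simulated)
outcome = 0.02 * income + 0.5 * education + rng.normal(0.0, 1.0, n)

def z(v):
    return (v - v.mean()) / v.std()

# raw least-squares coefficients: per-unit effects, not comparable across units
X = np.column_stack([income, education, np.ones(n)])
raw_coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

# standardized coefficients: effect of a one-SD increase, directly comparable
Xz = np.column_stack([z(income), z(education), np.ones(n)])
std_coef, *_ = np.linalg.lstsq(Xz, z(outcome), rcond=None)
```

On this simulated data, the raw coefficients (roughly 0.02 per thousand dollars versus 0.5 per year) are in different units and can't be ranked; the standardized coefficients put both predictors on the same per-SD footing, making the comparison direct.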
For skewed variables that don’t follow a symmetric distribution, an alternative approach divides by the interquartile range instead of the standard deviation. The resulting coefficients then represent the change in the outcome across the middle 50% of the predictor’s distribution, which can be more meaningful when extreme values stretch the standard deviation.

