There are already some good answers on when feature scaling is desirable and on when to center vs. when to scale. However, they don't explain which scaling method to use in which situation. For context, assume you have numerical data that you've already decided needs scaling, and that both predictive power and model interpretability matter. Some reasons for scaling could be:
- Preparing the data to create interaction variables
- Ensuring that regularization affects all variables equally
- Speeding up gradient descent
- Making the coefficients easier to interpret
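To make the question concrete, these are the main candidates I'm comparing (a minimal scikit-learn sketch; the lognormal toy data is my own, not from any of the posts):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 2))  # skewed toy data, made up for illustration

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(X)
# Min-max scaling: rescale each feature to the [0, 1] range
minmaxed = MinMaxScaler().fit_transform(X)
# Robust scaling: subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(X)
```

Each of these would satisfy the goals above (comparable scales for regularization, gradient descent, and interaction terms), which is exactly why I'm unsure how to choose between them.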
Jeff Hale has a sensible-looking blog post on different scaling methods. Here are his conclusions:
Do his conclusions make sense, and if not, what is the correct methodology for choosing a scaling method? Ideally, the answer would be backed by scientific evidence as well.
I'll also add Sebastian Raschka's recommendation to use standardization for PCA, because "we are interested in the components that maximize the variance".
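To illustrate that recommendation, here is a small sketch of why scale matters for PCA (the data is hypothetical: two correlated features on very different scales):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
z = rng.normal(size=500)
# Two correlated features on very different scales (hypothetical data)
X = np.column_stack([100 * z + rng.normal(0, 10, 500),
                     z + rng.normal(0, 0.5, 500)])

# Without standardization, the first principal component is dominated
# by whichever feature happens to have the largest raw variance
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# After standardization, both features contribute comparably
pc1_std = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)).components_[0]
```

Without standardization, PCA effectively ranks features by their units, which is presumably not what "the components that maximize the variance" is meant to capture.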