
I was wondering whether or not one should generally scale all numerical features.

For instance, I had a state column with string values like KY, TX, or AL. These I converted to an indexed column with a StringIndexer, which yields values like 2.0 or 18.0.

In other columns there are sums going up to 4K, and there are Yes/No-type columns that I converted to 0 or 1. So all columns are numerical now.

Should I scale all of these (together)? (My intuition says yes but I couldn't find any info)

Edit in response to EdM's comment:

  1. It's a one-time analysis of a static data set. Outliers have been taken care of.
  2. My motivation: I read that [(.4,.2),(0,1)] performs better than [(400,200),(0,1)] because some common algorithms are biased towards features with larger magnitudes. Are you saying this is not the case? If so, can you explain further?
  3. Why I ask: My concern is that there might be some "loss of significance". To illustrate: In the dataset, "gender" has a meaning: 1 is male and 0 is female. But if I scale and the value is 0.7 then what does that mean? Or am I overcomplicating things here?
  4. Let me rephrase the question: could I just scale all continuous values down to [0,1] but leave the binary/category-transformed values as they are?

For completeness: I one-hot encoded the state column.
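
In case it helps, here is roughly what that encoding step looked like (a minimal PySpark sketch, assuming Spark 3.x; the toy data frame just stands in for my real data):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real data: a string-valued state column
    df = spark.createDataFrame([("KY",), ("TX",), ("AL",), ("TX",)], ["state"])

    # StringIndexer maps each distinct string to a numeric label (0.0, 1.0, 2.0, ...)
    indexed = StringIndexer(inputCol="state", outputCol="state_idx").fit(df).transform(df)

    # OneHotEncoder then turns those labels into dummy vectors, so the numeric
    # index is no longer treated as an ordered quantity
    encoded = (OneHotEncoder(inputCols=["state_idx"], outputCols=["state_vec"])
               .fit(indexed).transform(indexed))

    encoded.show()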

Robse
  • Please say more about why you wish to scale predictors at all. In many applications it doesn't help, and can make it more complicated to apply a model to new data having different predictor distributions. In particular, what do you wish to accomplish by scaling the categorical predictors? – EdM Jul 18 '20 at 20:07
    As EdM implies, it is not general. Whether it is useful depends on what you are doing, and in that respect how to do it also depends. You are aware of the competing needs of mathematical appropriateness and interpretability, which is more important in your case? – ReneBt Jul 19 '20 at 03:53
  • No, I will look into it, thank you! – Robse Jul 19 '20 at 10:33
    Your state numbers are not in fact numbers, but labels. So scaling would be meaningless (and might even generate rounding errors). Similarly with gender labels or indicators. – Henry Jul 22 '20 at 17:15

1 Answer


The answer depends on the type of model.

Ordinary least squares and generalized linear regressions don't need scaling (unless you are challenging the floating-point precision of your computer). If you change the scale of a predictor variable, all that happens is that the estimated coefficient has a corresponding change of scale, so the end result is the same. Tree-based models (e.g., random forest, boosted trees) use cutoffs within the range of a continuous predictor or select one level of a categorical predictor, so there shouldn't be any advantage to scaling with tree-based models either. And if you do scale with these approaches and then use your model to predict on new cases, you have to scale the new data in a way that matches your original scaled data. So why scale to start with?
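
As a rough illustration of that invariance (a small scikit-learn sketch with made-up numbers, not something you need to run):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = 400 * rng.normal(size=(100, 1))                  # predictor on a "large" scale
    y = 3.0 * X[:, 0] + rng.normal(size=100)

    fit_raw = LinearRegression().fit(X, y)
    fit_scaled = LinearRegression().fit(X / 400.0, y)    # same predictor, rescaled

    # The coefficient simply absorbs the change of scale ...
    print(fit_raw.coef_[0], fit_scaled.coef_[0] / 400.0)               # essentially equal
    # ... and the fitted values are identical up to floating-point error
    print(np.allclose(fit_raw.predict(X), fit_scaled.predict(X / 400.0)))  # True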

Scaling is important when the modeling method effectively makes direct comparisons among the predictors in some way. For example, clustering based on multi-dimensional Euclidean distances among cases needs scaling so that a predictor whose scale leads to high numerical values doesn't overwhelm the contributions of predictors whose scales lead to numerically small values. That's also the case for principal component analysis, where you need to start with similar variances among the predictors. It's needed for approaches like ridge or LASSO regression, which put a penalty on the sum of the squares or the sum of the absolute values (respectively) of the regression coefficients. Unless all the predictors are on a common scale, you will be differentially penalizing predictors depending on their numerical scales and their corresponding scale-dependent coefficient magnitudes. I believe that is also the case for neural-net methods, although I don't have experience with them.
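
To see why distance-based methods care (a toy NumPy sketch; the numbers are invented): a predictor measured in the thousands dominates Euclidean distances until both predictors are put on a common scale.

    import numpy as np

    # Three cases described by (sum in dollars, 0/1 indicator)
    X = np.array([[4000.0, 0.0],
                  [3900.0, 1.0],
                  [4000.0, 1.0]])

    # On the raw scale, the difference of 100 in the first column swamps
    # the 0/1 difference in the second
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

    # After standardizing each column, both predictors contribute comparably
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    print(np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))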

For these methods that need scaling, however, don't restrict continuous values to a range of [0,1] as you seem to propose. The best way to meet these requirements for each continuous predictor is to calculate the mean and standard deviation, then for each individual value subtract the mean and divide by the standard deviation. This puts each continuous predictor into a scale with mean 0 and standard deviation 1 so that all predictors have the same empirical variance.*
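
In code that standardization is just this (plain NumPy; scikit-learn's StandardScaler does the same thing column by column):

    import numpy as np

    x = np.array([400.0, 200.0, 1000.0, 50.0])   # one continuous predictor

    z = (x - x.mean()) / x.std()                 # subtract mean, divide by SD

    print(round(z.mean(), 10), z.std())          # ~0.0 and 1.0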

If you do have to scale for one of those latter modeling approaches, the handling of binary or multi-level categorical features is a conundrum. Scaling will give different weights among a set of binary predictors depending on the 0/1 class balance of each predictor. For multi-category predictors, the weighting will differ depending on your choice of reference level unless you take special precautions. See this page for more discussion and further links.
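
A toy illustration of the binary-predictor problem (NumPy, with invented class balances): the standard deviation of a 0/1 indicator depends on how unbalanced it is, so standardizing rescales a rare indicator much more strongly than a balanced one.

    import numpy as np

    balanced = np.array([0, 1] * 50)            # 50/50 split
    rare = np.array([0] * 95 + [1] * 5)         # 95/5 split

    # SD of a 0/1 variable is sqrt(p * (1 - p)): 0.5 here versus about 0.22
    print(balanced.std(), rare.std())

    # After standardizing, a 0 -> 1 change is a 2-SD jump for the balanced
    # indicator but about a 4.6-SD jump for the rare one
    print(1 / balanced.std(), 1 / rare.std())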

So I'd say don't scale unless your modeling approach requires it. If it does, use your knowledge of the subject matter to help decide whether or how to scale categorical predictors.


*This is sometimes called "normalization" of the data even though the final values need not follow a normal distribution. Not everyone uses terms like "scaling" or "standardizing" or "normalizing" in the same way, so you have to look into just what was done when you read others' work.

EdM