
I was reading through the Google Machine Learning Crash Course and I can't digest the point below:

  • If a feature set consists of multiple features, then feature scaling provides the following benefits:

    Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.

Could anyone explain, with an example, how the model will pay too much attention to the features having a wider range?

Anu
  • It depends on the model (for trees, no effect); models that rely on a dot product (neural nets, linear SVMs) or on Euclidean distance (k-means) will be affected. Two aspects I can think of: a) weight initialisation is based on the assumption that inputs are scaled/normalised; b) under regularisation, a feature with a large range may well get small weights and so end up effectively unregularised. – seanv507 Oct 31 '18 at 17:09
  • @seanv507, I didn't get your point 'b' completely; could you please explain it a bit more? Also, it would be great if you could provide an example or point me to a blog post that explains both of your points! Thanks :) – Anu Nov 08 '18 at 01:50

1 Answer


K-means uses Euclidean distance, so if one dimension has a much larger range than the others it will dominate the distance, and your clustering will be determined by that dimension alone.
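For concreteness, here is a minimal numpy sketch (with made-up features and values) of how a wide-range feature dominates the Euclidean distance, and how rescaling restores the influence of the narrow-range one:

```python
import numpy as np

# Hypothetical points: [age (0-100), income (0-100000)]
a = np.array([20.0, 50000.0])
b = np.array([80.0, 51000.0])   # very different age, similar income
c = np.array([22.0, 90000.0])   # similar age, very different income

def euclid(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Unscaled: income dominates. The 60-year age gap between a and b adds
# under 0.2% to the distance, so clustering is effectively by income only.
print(euclid(a, b))  # ~1001.8
print(euclid(a, c))  # ~40000.0

# Min-max rescale both features to [0, 1]; now the age gap matters again.
lo = np.array([0.0, 0.0])
hi = np.array([100.0, 100000.0])
def scale(x):
    return (x - lo) / (hi - lo)

print(euclid(scale(a), scale(b)))  # ~0.60  (age difference dominates)
print(euclid(scale(a), scale(c)))  # ~0.40  (income difference dominates)
```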

Weight regularisation (as used by ridge regression, SVMs and neural nets) penalises large weights.

These models all take an inner product of the weights and the inputs. So, for a particular dimension, if you rescale the input by a factor of 10, you need to divide the corresponding weight by 10 to get the same effect on the output. Weight regularisation penalises the size of the weight, so if your input is 10 times larger, your weight is 10 times smaller and you will be penalised less. Dimensions with a large range will therefore tend to be restricted less by regularisation than those with a small range.
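Here is a minimal numpy sketch of that effect (synthetic data, closed-form ridge solution): multiplying an input by 10 makes the fitted weight roughly 10 times smaller, so its contribution to the squared-weight penalty is roughly 100 times smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))                        # feature on its original scale
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=n)  # true weight is 3.0

def ridge_fit(X, y, lam=1.0):
    # closed-form ridge solution: w = (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_orig = ridge_fit(x, y)           # weight for the original scale, ~3.0
w_scaled = ridge_fit(10.0 * x, y)  # same data, input multiplied by 10, ~0.3

print(w_orig, w_scaled)
# The fit is essentially the same, but the wide-range version pays ~100x
# less L2 penalty, so it is effectively regularised less.
print(np.sum(w_orig ** 2), np.sum(w_scaled ** 2))  # ~9 vs ~0.09
```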

One way to think about regularisation is that relatively small inputs should have a relatively small effect on the output.

So the question to ask yourself is: does the relative scale of my inputs matter? For example, in a signal-processing application you would expect all dimensions (sample positions) to potentially have the same scale, and the ones that vary much less are likely noise to be ignored. (So in this case you don't want to normalise your inputs.)

On the other hand, in many applications the measurement scale just depends on arbitrary units (e.g. weight measured in kilos and height in metres). In that case one can guess that effect sizes should be relative to the range of that dimension, so our prior is that a 1-standard-deviation change in one dimension is likely to cause a bigger effect than a 0.1-standard-deviation change in another.
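As a small illustration of that "arbitrary units" case (made-up units and numbers), z-score standardisation puts every dimension on the same standard-deviation scale:

```python
import numpy as np

rng = np.random.default_rng(1)
weight_kg = rng.normal(70, 15, size=500)    # weight measured in kilograms
height_m = rng.normal(1.7, 0.1, size=500)   # height measured in metres

X = np.column_stack([weight_kg, height_m])

# z-score standardisation: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0))      # roughly [15, 0.1] -- spread driven by the chosen units
print(X_std.std(axis=0))  # [1., 1.]          -- both dimensions on a common scale
```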

seanv507