
I observe that in machine learning, scaling is done to make the numeric input fit within the range from -1 to +1. Why not a bigger range like -10 to +10? If a smaller range is better, then why not -0.1 to +0.1?

Why is this numeric range (-1 to +1) preferred? Is there a mathematical reason to choose this range?

– curious
  • Related: [1](https://stats.stackexchange.com/questions/283458/why-is-there-a-performance-difference-before-and-after-scaling-and-normalisation?rq=1) and [2](https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning?noredirect=1&lq=1) – mhdadk Dec 25 '20 at 06:39
  • Floating-point numbers are not continuous: they can represent only a limited set of values, and they have an exponent so that they can also represent very large numbers. One consequence is that the number stored is not always the number you think it is. Double precision, expressed in base 10, gives you about 16 decimal places. The farther your number is from one, the larger the distance between adjacent representable values; the ideal representation lives in the domain between -1 and +1 (illustrated in the sketch after these comments). – EngrStudent Dec 25 '20 at 17:21
  • @EngrStudent But that doesn't give any reason to scale to the range of -1 to 1 instead of, say, -2 to 2 or -0.5 to 0.5. If a number is rounded when you scale it to -1 to 1, it'll be rounded twice as far if you scale it to -2 to 2, and half as far if you scale it to -0.5 to 0.5, so it's totally a wash. (The exception is if you're using some extremely small numbers.) – Tanner Swett Dec 26 '20 at 07:23
  • There are other reasons to go to +/- 1; they are somewhat less compelling, and not required. There are basis functions, the Lego blocks of functions, whose output range lies within that window, and scaling to it allows them to be used. There are also statistical conditioning processes which, though they have their own risks, can be quite helpful, and their outputs tend to be in that range: subtracting the mean and dividing by the standard deviation tends to give a range closer to +/- 3, but it is known to help in the machine learning process. – EngrStudent Dec 26 '20 at 13:35
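
To make the floating-point and standardization points from the comments above concrete, here is a minimal NumPy sketch (double precision assumed; the sample data is made up for illustration):

```python
import numpy as np

# The gap between adjacent representable double-precision values grows with
# the magnitude of the number; np.spacing returns that gap.
for x in [0.001, 1.0, 1000.0, 1e9]:
    print(f"spacing near {x}: {np.spacing(x):.3e}")
# spacing near 1.0 is ~2.2e-16, while near 1e9 it is ~1.2e-07

# Standardizing (subtract the mean, divide by the standard deviation) tends to
# put most values within roughly +/- 3 for bell-shaped data.
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=10_000)
z = (x - x.mean()) / x.std()
print(z.min(), z.max())  # typically around -4 to +4, mostly within +/- 3
```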

1 Answer


There is no direct mathematical reason behind it. If every feature were scaled to the range -10 to 10, nothing would go wrong, provided we made sure that no variable (say A) actually reaches the values -10 or 10 while another one (say B) does not; otherwise feature A would be inherently dominant over feature B in distance-based algorithms such as k-NN (the distance between two data points would depend more on feature A than on feature B).

Deep neural networks may prefer inputs in ranges like (-1, 1) or (0, 1) because such ranges work well with proper parameter initialization. Feeding 'larger' numbers into the network, in conjunction with a saturating nonlinear activation function such as the sigmoid, can lead to vanishingly small, or even zero, gradients being back-propagated sooner than expected, which is antithetical to the whole learning process for NNs.

Using a range such as (-epsilon, epsilon) is cheeky and would not hurt if done properly, but there are things to consider, such as numerical stability; and, to bring up the NN case again, in some cases it could make it harder for the network to converge to a proper weight matrix.
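
As a minimal sketch of the distance-dominance point (the feature values below are made up for illustration):

```python
import numpy as np

# Two features on very different scales: A in [0, 1000], B in [0, 1].
p = np.array([900.0, 0.1])
q = np.array([100.0, 0.9])

# Without scaling, the Euclidean distance is driven almost entirely by A.
print(np.linalg.norm(p - q))                  # ~800.0; feature B barely matters

# After rescaling both features to [0, 1], both contribute equally.
scale = np.array([1000.0, 1.0])
print(np.linalg.norm(p / scale - q / scale))  # ~1.13; both features matter
```

And a sketch of the saturation point: once pre-activations grow large, the gradient of the sigmoid is effectively zero, so almost no signal is back-propagated.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradient of the sigmoid at increasingly large inputs.
for x in [0.5, 5.0, 50.0]:
    s = sigmoid(x)
    print(f"x = {x:>4}: sigmoid'(x) = {s * (1.0 - s):.2e}")
# x = 0.5 -> 2.35e-01, x = 5.0 -> 6.65e-03, x = 50.0 -> ~1.9e-22
```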