Here is my understanding of scaling: "The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges."
Let's take k-nearest neighbours:
If you are performing scaling to bring the measurements into similar ranges, then I can see a gain in computational convenience: you no longer have to work with very large distance values caused by differences in ranges, and the distances are easy to compute because all columns have been scaled in the same manner!
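To make the "dominating" part of my understanding concrete, here is a tiny sketch with made-up numbers (pure NumPy; the attributes and range bounds are invented purely for illustration):

```python
# Made-up example: one attribute with a huge range (annual income) and one
# with a small range (years of experience).
import numpy as np

a = np.array([100000.0, 2.0])   # point A: [income, years]
b = np.array([105000.0, 10.0])  # point B

# Raw Euclidean distance: the income column completely dominates,
# the 8-year experience gap barely registers (result is ~5000.0).
print(np.linalg.norm(a - b))

# After min-max scaling each column to [0, 1] (range bounds invented here),
# both columns contribute to the distance on a comparable footing.
income_range = (50000.0, 150000.0)
years_range = (0.0, 40.0)
a_s = np.array([(a[0] - income_range[0]) / (income_range[1] - income_range[0]),
                (a[1] - years_range[0]) / (years_range[1] - years_range[0])])
b_s = np.array([(b[0] - income_range[0]) / (income_range[1] - income_range[0]),
                (b[1] - years_range[0]) / (years_range[1] - years_range[0])])
print(np.linalg.norm(a_s - b_s))
```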
In layman's terms: for a novice it's harder to compute 100000/5000, but if the expression is scaled down to 100/5 by removing zeros from the numerator and denominator, the division becomes much easier.
But isn't the value the same?
However, with machine learning algorithms there is a noticeable difference in the accuracy measure between an unscaled data set and the same data set after scaling! What is the theory behind this, if it can be explained in layman's terms? As far as I can see, you are more or less just subtracting from, and dividing, every value in all columns by the same factor so that the distances can be computed more easily. But if there is a noticeable difference in the accuracy score, doesn't that mean the scaled data set is now a DIFFERENT one, and no longer the same as the unscaled one?
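For reference, this is roughly the kind of comparison I mean: the same k-NN model fitted once on raw feature values and once on standardised ones (a sketch using scikit-learn's KNeighborsClassifier and StandardScaler; the choice of load_wine and the split parameters are just for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unscaled: columns keep their original ranges (some are in the hundreds,
# others below 1).
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn_raw.predict(X_test))

# Scaled: each column is standardised using statistics from the training set.
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(
    scaler.transform(X_train), y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(scaler.transform(X_test)))

print("accuracy without scaling:", acc_raw)
print("accuracy with scaling:   ", acc_scaled)
```

The two printed accuracies differ, and that difference is exactly what I am asking about.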
I really hope I made the question clear; I'm sorry if I sound dumb.