
When working with data whose dimensions are on different scales, you do not want one dimension to dominate.

This calls for feature scaling! A very intuitive way to do it is min-max scaling, which maps every feature to the interval from 0 to 1.

What I do not understand, and what is not intuitive to me at all, is using the z-score for feature scaling.
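For reference, the two rescalings being compared are usually written as below. The notation is an assumption added here, not taken from the thread: $x_i$ is one value of a feature, $\bar{x}$ is that feature's sample mean, and $s$ is its sample standard deviation.

```latex
x_i^{\text{min-max}} = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j},
\qquad
z_i = \frac{x_i - \bar{x}}{s}
```

Min-max is determined entirely by the two most extreme observations, while the z-score is built from statistics that every observation contributes to.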

Why is the z-score used? What is the motivation for using the z-score instead of min-max? Why is it a good idea to express your data in standard deviations from the mean? What originally motivated using the z-score for scaling? Why is min-max not used all the time? What problem does the z-score solve that min-max does not?

I hope someone can help me and make this clearer.

John Smith
  • One of many possible explanations is that the z-score is the Mahalanobis distance (in one dimension). See https://stats.stackexchange.com/questions/62092 for some explanations of what that means. There are all kinds of reasons *not* to scale by the range, not least of which is that, with potentially unbounded data, the range is one of the least stable statistics one can imagine. Related topics are *correlation,* (univariate) *regression,* and the *68-95-99.7 rule.* – whuber Oct 07 '21 at 21:17
  • Thanks for the response, whuber! (1) Mahalanobis distance: it makes sense to me for detecting outliers, but I do not understand how it motivates feature scaling with the z-score. (2) Range stability: you said that the range is one of the least stable statistics. What do you mean by that? What does it mean for a statistic to be stable? Why is the range not stable, and in what sense? Why are standard deviation units more stable? – John Smith Oct 08 '21 at 15:22
  • Concepts of [robust statistics](https://en.wikipedia.org/wiki/Robust_statistics) will explain all this. – whuber Oct 08 '21 at 16:23
  • Oh, you mean robust as in "robust to outliers"? If that is the case, I still do not see why I should pick the z-score to scale my data. The z-score uses the mean, not the median, and we can show that the mean is not robust to outliers. I still do not see the motivation for using the z-score instead of always picking min-max. If your data has outliers, we can track them down and remove them. And what if you have features that are not all normal, isn't min-max better then? Maybe you have a simple example that shows why it makes sense to use the z-score over min-max. – John Smith Oct 08 '21 at 22:19
  • If you *don't* "track them down and remove them," then the min or max or both will be outliers and using them for your normalization screws up *all* the data. That's the basic problem. – whuber Oct 09 '21 at 13:14
  • I agree with you that tracking down outliers and removing them is needed. But that is not the problem I have. It is still min-max vs. z-score for feature scaling. For me, min-max is the most natural and instinctive way to scale data, but somebody came up with the idea of using the z-score to scale features. I have received a lot of answers without the WHY, just phrases like "z-score handles outliers better" or "z-score is good for unbounded data". I would like to understand WHY these statements are right, what motivated people to use the z-score for scaling, why it works so well, and why you should use it. – John Smith Oct 09 '21 at 18:20
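
One rough way to see the "range is unstable for unbounded data" point from the comments is a small simulation. The sketch below is not from the thread. The standard normal distribution, the seed, the sample sizes, and the names `draws` and `probe` are all illustrative assumptions. It takes nested samples of growing size and tracks the range, the standard deviation, and where one fixed value (1.0) lands under each scaling.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, used only so the run is repeatable

# One long draw from a standard normal distribution (unbounded support).
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]

probe = 1.0  # hypothetical fixed value whose scaled position we track
sizes = [100, 1_000, 10_000, 100_000]
ranges, sds = [], []

for n in sizes:
    sample = draws[:n]  # nested samples: each one extends the previous one
    lo, hi = min(sample), max(sample)
    mu = statistics.fmean(sample)
    sd = statistics.pstdev(sample)
    ranges.append(hi - lo)
    sds.append(sd)

    mm = (probe - lo) / (hi - lo)  # min-max position of the probe value
    z = (probe - mu) / sd          # z-score of the probe value
    print(f"n={n:>7}  range={hi - lo:5.2f}  sd={sd:4.2f}  "
          f"min-max(1.0)={mm:4.2f}  z(1.0)={z:4.2f}")
```

Under these assumptions the range can only grow as more data arrives, because a new extreme replaces the old one, while the standard deviation settles near its population value. The min-max position of the same value therefore depends on how much data you happened to collect, whereas its z-score stays roughly the same. Note that this illustrates stability under sampling only: a single gross outlier still distorts the mean and standard deviation as well, which is the robustness issue raised above.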

0 Answers