Your "heuristic" that needing normalization has to do with conditional probability is wrong.
IMHO a better explanation is that some classifiers have built-in scaling, others don't.
Consider LDA:
During LDA, the data are projected so that the within-class covariance ellipsoid becomes a unit sphere. This whitening projection automatically removes issues with different scales between the variates - in fact, it achieves a bit more than scaling the individual variates could, because it also removes the within-class correlations between them (see the sketch below).
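A minimal sketch of this scale invariance, assuming scikit-learn, the iris data, and made-up rescaling factors:

```python
# Rough sketch assuming scikit-learn and the iris data; the rescaling factors
# are arbitrary, made up only to distort the feature scales.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scale = np.array([1.0, 1000.0, 0.001, 1.0])  # per-feature rescaling

lda_raw = LinearDiscriminantAnalysis().fit(X, y)
lda_scaled = LinearDiscriminantAnalysis().fit(X * scale, y)

# Fraction of identical class assignments; expect ~1.0 up to numerical effects,
# because whitening the within-class covariance absorbs the rescaling.
print(np.mean(lda_raw.predict(X) == lda_scaled.predict(X * scale)))
```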
But it has nothing to do with calculating conditional probabilities.
In fact, you could do $k$-nearest-neighbour classification in the LD score space. Equivalently, you could use the Mahalanobis distance w.r.t. the pooled within-class covariance for your $k$ nearest neighbours, as in the sketch below.
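A minimal sketch of that Mahalanobis variant, again assuming scikit-learn and the iris data:

```python
# Rough sketch assuming scikit-learn and the iris data: k-NN with Mahalanobis
# distance w.r.t. the pooled within-class covariance matrix.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n, p = X.shape
classes = np.unique(y)

# Pooled within-class covariance: weighted average of the per-class covariances.
S_W = np.zeros((p, p))
for c in classes:
    Xc = X[y == c]
    S_W += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
S_W /= n - len(classes)

knn = KNeighborsClassifier(
    n_neighbors=5,
    metric="mahalanobis",
    metric_params={"VI": np.linalg.inv(S_W)},  # inverse covariance for the metric
    algorithm="brute",
)
knn.fit(X, y)
print(knn.score(X, y))  # resubstitution accuracy, just to show the classifier runs
```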
Naive Bayes classifiers all have the built-in behaviour that each variable is treated individually - so if scaling turns out not to be needed, it is for totally different reasons than with the built-in projection of an LDA (illustration below).
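For illustration (same assumptions: scikit-learn, iris data), per-feature standardization leaves a Gaussian naive Bayes classifier essentially unchanged - but only because each feature's class-conditional distribution is modelled separately, not because of any whitening:

```python
# Rough sketch assuming scikit-learn and the iris data: Gaussian naive Bayes
# models each feature's class-conditional distribution separately, so
# per-feature standardization leaves the predictions essentially unchanged.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pred_raw = GaussianNB().fit(X, y).predict(X)
pred_std = GaussianNB().fit(X_std, y).predict(X_std)

# Fraction of identical class assignments; expect ~1.0 up to smoothing/numerical effects.
print(np.mean(pred_raw == pred_std))
```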
The convergence behaviour of other classifiers' training algorithms (e.g. SVM, neural networks) may be sensitive not only to the relative but even to the absolute scale of the features, for purely numeric reasons - see the demo below.
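A quick, informal demonstration of that sensitivity for an RBF-kernel SVM (scikit-learn, iris data, arbitrary distortion factors); the same standardize-first advice applies to gradient-based neural network training:

```python
# Rough sketch assuming scikit-learn and the iris data: an RBF-kernel SVM with
# deliberately distorted feature scales vs. the same SVM with standardization.
# The distortion factors are arbitrary, chosen only to make the effect visible.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_bad = X * np.array([1.0, 1e4, 1.0, 1e-4])  # distorted feature scales
Xtr, Xte, ytr, yte = train_test_split(X_bad, y, random_state=0)

svm_raw = SVC(kernel="rbf").fit(Xtr, ytr)
svm_std = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(Xtr, ytr)

print("no scaling:  ", svm_raw.score(Xte, yte))   # typically much worse
print("standardized:", svm_std.score(Xte, yte))
```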