
According to the documentation of the StandardScaler object in scikit-learn:

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

I should scale my features before classification. Is there any easy way to show why I should do this? References to scientific articles would be even better. I already found one, but there are probably many others.
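
For reference, the scaling step in question is just the following (a minimal scikit-learn sketch; the SVC classifier and pipeline here are only an example of how the scaler is typically wired in):

```python
# Minimal sketch: standardize features before an RBF-kernel SVM.
# Using a pipeline ensures the scaler is fitted on the training data only.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# clf.fit(X_train, y_train); clf.predict(X_test)
```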

scallywag

2 Answers


All kernel methods are based on distance. The RBF kernel function is $\kappa(\mathbf{u},\mathbf{v}) = \exp(-\|\mathbf{u}-\mathbf{v}\|^2)$ (using $\gamma=1$ for simplicity).

Given 3 feature vectors: $$ \mathbf{x}_1 = [1000, 1, 2], \quad \mathbf{x}_2 = [900, 1, 2], \quad \mathbf{x}_3 = [1050, -10, 20]. $$

then $\|\mathbf{x}_1-\mathbf{x}_2\|^2 = 100^2 = 10000$ while $\|\mathbf{x}_1-\mathbf{x}_3\|^2 = 50^2 + 11^2 + 18^2 = 2945$, so $\kappa( \mathbf{x}_1, \mathbf{x}_2) = \exp(-10000) \ll \kappa(\mathbf{x}_1, \mathbf{x}_3) = \exp(-2945)$, that is, $\mathbf{x}_1$ is supposedly more similar to $\mathbf{x}_3$ than to $\mathbf{x}_2$.

The element-wise relative differences between $\mathbf{x}_1$ and the other two vectors, i.e. $|\mathbf{x}_i - \mathbf{x}_1|$ divided component-wise by $|\mathbf{x}_1|$, are: $$ \mathbf{x}_2 \rightarrow [0.1, 0, 0],\quad \mathbf{x}_3 \rightarrow [0.05, 11, 9]. $$

So without scaling, we conclude that $\mathbf{x}_1$ is more similar to $\mathbf{x}_3$ than to $\mathbf{x}_2$, even though the relative differences per feature between $\mathbf{x}_1$ and $\mathbf{x}_3$ are much larger than those of $\mathbf{x}_1$ and $\mathbf{x}_2$.

In other words, if you do not scale all features to comparable ranges, the features with the largest range will completely dominate in the computation of the kernel matrix.
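
As a quick sanity check, here is a minimal sketch (assuming Python with scikit-learn, as in the question) that reproduces the squared distances from this example and recomputes them after standardization:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0,   1.0,  2.0],    # x1
              [ 900.0,   1.0,  2.0],    # x2
              [1050.0, -10.0, 20.0]])   # x3

# Without scaling, the squared distances are dominated by the first feature:
# kappa(x1, x2) = exp(-10000) << kappa(x1, x3) = exp(-2945).
d2 = euclidean_distances(X, squared=True)
print(d2[0, 1], d2[0, 2])  # 10000.0, 2945.0

# After standardization (zero mean, unit variance per feature), x1 is
# closer to x2 (~2.6) than to x3 (~9.6), matching the intuition above.
d2_scaled = euclidean_distances(StandardScaler().fit_transform(X), squared=True)
print(d2_scaled[0, 1], d2_scaled[0, 2])
```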

You can find simple examples to illustrate this in the following paper: A Practical Guide to Support Vector Classification by Hsu, Chang, and Lin (Section 2.2).

Marc Claesen
  • you might also want to discuss regularisation: the scale of the weights depends on the scale of the inputs... – seanv507 May 27 '15 at 09:25
  • The effect of regularization is that different scalings imply different optimal $C$, which is somewhat orthogonal to this particular issue. – Marc Claesen May 27 '15 at 09:28
  • 3
    But it could indeed be that proximity along one dimension is more important. So the goal is not really to have the same variance in all features but to have them scaled such that distances along every feature has about the same importance w.r.t. the task. – isarandi May 27 '15 at 12:44
  • @Marc Claesen, if your variables are of different orders of magnitude, then your weights will also be of different orders of magnitude, and the L2 norm will focus on the inputs which have small variance and correspondingly large weights. Put another way, weight norm regularisation ensures that 'small' inputs have small effects. This only makes sense if 'small' is standardised across your inputs, e.g. by normalising your variables. – seanv507 May 27 '15 at 13:17
  • 1
    @seanv507 that only applies to linear SVM. – Marc Claesen May 27 '15 at 13:18
  • @MarcClaesen, AFAIK it applies equally to nonlinear SVM, just that it's not explicit: the weight regularisation is in the higher-dimensional space, but the size of the weights is still impacted by the original variables. – seanv507 May 27 '15 at 14:31
  • @isarandi thanks for the insight. Would you mind elaborating "distances along every feature have about the same importance w.r.t. the task"? I don't quite get it. – aerin Jan 11 '18 at 23:13
  • How was "relative difference" calculated? Nothing I can think of gives those results. – Rachel Feb 26 '19 at 05:41

It depends on what kernel you are using. By far the most commonly used (apart from linear) is the Gaussian (RBF) kernel, which has the form

$$ f = \exp \left( \frac{- \| x_1 - x_2 \|^2}{2\sigma^2} \right) $$

An SVM uses this function to compare the similarity of an example $x$ to each landmark $l$ (a point taken from the training set); the squared norm in the exponent is the sum of the squared per-feature differences

$$ (x_1 - l_1)^2 + (x_2 - l_2)^2 + \dots + (x_n - l_n)^2, $$

where the subscripts now index the $n$ features of the example $x$ and of the landmark $l$.

If the feature $x_1$ ranges from 0 to 50,000 while the feature $x_2$ ranges from 0 to 0.01, you can see that $x_1$ is going to dominate that sum while $x_2$ will have virtually no impact. For this reason it is necessary to scale the features before applying the kernel.
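
To make this concrete, here is a minimal sketch (synthetic data with exactly those two ranges; the variable names are illustrative) that measures how much each feature contributes to the squared distances inside the kernel, before and after standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack([
    rng.uniform(0, 50_000, size=(100, 1)),  # wide-range feature x1
    rng.uniform(0, 0.01, size=(100, 1)),    # narrow-range feature x2
])

def per_feature_share(X):
    """Average share of each feature in the squared distance between row pairs."""
    sq = (X[:50] - X[50:]) ** 2
    return sq.mean(axis=0) / sq.mean(axis=0).sum()

print(per_feature_share(X))                                  # ~[1.0, 0.0]: x1 dominates
print(per_feature_share(StandardScaler().fit_transform(X)))  # roughly equal shares
```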

If you want to learn more I recommend module 12 (Support Vector Machines) from the Stanford online course in machine learning at Coursera (free and available any time): https://www.coursera.org/course/ml

ralph346526