
I am taking a deep learning class, and the class slides list one of SGD's problems as: "Gradient is scaled equally across all dimensions." I believe this means that, when we have d-dimensional features, the same learning rate multiplies every component of the gradient; we cannot pick a different learning rate for each dimension.

I have two questions about this. First: isn't this also a problem of batch gradient descent? For batch GD too, we use the same learning rate for all of the features.

My second question: while looking for an answer, I came across the scikit-learn documentation, which states on its SGD page that "SGD is sensitive to feature scaling." Is this related to the point in my class slides? I couldn't understand what it means to be sensitive to feature scaling and why it happens with SGD. Any help would be appreciated.


1 Answer


This is not specific to SGD: batch and mini-batch gradient descent are also sensitive to feature scaling. Batching only determines how many training samples are used in one update iteration; it does not change the fact that a single learning rate scales the gradient equally across all dimensions.

... "SGD is sensitive to feature scaling." Is this somehow related to the fact given in my class slides?

Yes, it is. In vanilla SGD we apply a single learning rate, and if the dimensions have very different scales, any learning rate we choose may be too small for some dimensions and too large for others. It is much easier to tune a single learning rate when all dimensions have similar scales.
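
Here is a minimal NumPy sketch (toy data and learning rates chosen for illustration, not from the question) of this effect: with one feature on a scale of ~1 and another on a scale of ~1000, the single learning rate must be kept tiny to stay stable in the large-scale dimension, so the small-scale dimension barely moves; after standardizing the features, one moderate learning rate works for both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: feature 0 has scale ~1, feature 1 has scale ~1000.
n = 200
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
true_w = np.array([2.0, 0.003])
y = X @ true_w + rng.normal(0, 0.1, n)

def sgd(X, y, lr, epochs=20):
    """Vanilla SGD on squared error with one global learning rate."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient magnitude follows the feature scale
            w -= lr * grad                   # same learning rate for every dimension
    return w

# Raw features: lr must be ~1e-7 to avoid diverging in the large-scale
# dimension, which leaves the small-scale weight (target 2.0) nearly untouched.
print("raw:         ", sgd(X, y, lr=1e-7))

# Standardized features: a single moderate lr converges in both dimensions.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("standardized:", sgd(X_std, y, lr=1e-2))
```

Adaptive methods (e.g. AdaGrad, RMSprop, Adam) address the same issue by effectively maintaining a per-dimension step size, which is why they are less sensitive to feature scaling than vanilla SGD.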

gunes