I am taking a deep learning class, and the slides list one of SGD's problems as: "Gradient is scaled equally across all dimensions." As I understand it, this means that with d-dimensional features (and hence a d-dimensional weight/gradient vector), the same single learning rate multiplies every component of the gradient; we cannot pick a different learning rate for each dimension.
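To make my reading concrete, the update I have in mind is the standard SGD step (my own notation, not from the slides):

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t; x_i, y_i),$$

so every coordinate $\theta_j$ is updated with the same scalar $\eta$, regardless of how large or small $\partial L / \partial \theta_j$ typically is.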
I have two questions about this. First: isn't this also a problem for batch gradient descent? Batch GD likewise uses a single learning rate for all dimensions, doesn't it?
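Writing it out the same way (again my own notation), the batch GD update averages the gradient over all $N$ training examples, but it still applies one scalar learning rate to every dimension:

$$\theta_{t+1} = \theta_t - \eta\, \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L(\theta_t; x_i, y_i).$$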
My second question: while looking for an answer, I came across the scikit-learn documentation, which states on the SGD page that "SGD is sensitive to feature scaling." Is this related to the statement in my class slides? I don't really understand what it means for SGD to be sensitive to feature scaling, or why that happens. Any help would be appreciated.
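For reference, the kind of scaling the documentation recommends looks roughly like the sketch below (a minimal example I put together with scikit-learn's StandardScaler and SGDClassifier; X_train and y_train are just placeholders for my own data):

```python
# Minimal sketch: standardize each feature to zero mean / unit variance
# before fitting an SGD-based model, as the scikit-learn docs suggest.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
# model.fit(X_train, y_train)  # placeholder data, not included here
```

Why does removing the differences in feature scale like this matter for SGD in particular?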