How does Feature Scaling help Gradient Descent?

Question

I am following deeplearning.ai's videos on Coursera. I have a couple of questions about feature scaling using the formula:

$$ (x - \mu)/ \sigma $$

Edit: There are similar questions which deal with the same topic, but none of them answer these questions in particular. I have highlighted the question in bold to emphasise.

1.) What is the use of subtracting the mean? My understanding is that dividing by the SD scales the features and subtracting the mean centres the data around zero. But why is centring the data around zero useful?

2.) I understand that the values of mean and SD should be consistent across training and test sets. Are $ \mu $ and $ \sigma $ calculated on the entire dataset (train and test together)? Or are the values calculated on the train set and then applied to the test set?
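
To make the second option in question 2 concrete, here is a minimal sketch of computing $ \mu $ and $ \sigma $ on the train set only and reusing them on the test set. This is a common convention because it keeps information from the test set out of preprocessing. The data, the split, and all variable names are hypothetical and not taken from the course.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(200, 2))
X_train, X_test = X[:150], X[150:]

# Estimate mu and sigma per feature from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same (x - mu) / sigma transform to both sets.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
```

With this approach the scaled train set has mean 0 and SD 1 per feature by construction, while the scaled test set will only be approximately centred, since its own mean and SD were never used.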

Thanks in advance.

Question 1 is addressed by "Neural Networks input data normalization and centering". Please re-read the answers there. I've added some additional duplicates which address Question 2. — Sycorax, Oct 21 '19 at 16:39
Understood. Thank you. That particular question focussed more on min-max normalisation vs normalising by the mean. However, I find there to be a lot of interest in centering around the mean in neural networks. I read that this is one of the reasons tanh activations are preferred over sigmoid. So my question is why is there so much importance for centering around the mean? Does it optimise the performance in any way? — Nitin, Oct 21 '19 at 17:10
The point of centering is to move the center to 0; the reason this is important is described in the linked thread. There are lots of ways to move the center to 0; the mean is nice because it's more robust than the min and the max, which can be strongly influenced by very large or very small values. But remember -- the only reason we care about centering at all is to precondition optimization; most choices of centering and scaling perform similarly (unless something weird is happening, like larger outliers). $\tanh$ being centered at 0 is a tangential issue. — Sycorax, Oct 21 '19 at 17:14
Thank you. Not sure if I should open a new question, but could you elaborate on this statement "tanh being centered at 0 is a tangential issue." — Nitin, Oct 21 '19 at 17:31
It seems like a distinct question from the ones you mention in the post; you don't mention $\tanh$ in any of the drafts. Before you ask a new question, please make use of the search feature because I believe we've addressed this question before. Information on how to use search can be found here: https://stats.meta.stackexchange.com/questions/5549/best-practices-for-searching-cv — Sycorax, Oct 21 '19 at 17:33
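
As a rough illustration of the "precondition optimization" point in the comments, the sketch below compares the condition number of the least-squares Hessian under three choices: raw features, features divided by the SD only, and features that are both centred and divided by the SD. It assumes a linear model with an intercept, which is a simplification and not a neural network, and the data and the helper name `hessian_condition_number` are hypothetical. A smaller condition number generally means gradient descent can take better-balanced steps across parameters.

```python
import numpy as np

# Hypothetical data: two features with non-zero means and different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(500, 2))


def hessian_condition_number(features):
    """Condition number of A^T A, where A is the design matrix [1, features].

    For least squares on a linear model with an intercept, the Hessian of
    the loss is proportional to A^T A.
    """
    A = np.column_stack([np.ones(len(features)), features])
    return np.linalg.cond(A.T @ A)


sigma = X.std(axis=0)
mu = X.mean(axis=0)

cond_raw = hessian_condition_number(X)
cond_scaled_only = hessian_condition_number(X / sigma)
cond_centred_scaled = hessian_condition_number((X - mu) / sigma)

print(f"raw:                {cond_raw:.3g}")
print(f"scaled only:        {cond_scaled_only:.3g}")
print(f"centred and scaled: {cond_centred_scaled:.3g}")
```

In this setup, dividing by the SD alone still leaves each feature strongly correlated with the intercept column, so the problem stays poorly conditioned; subtracting the mean removes that correlation, which is one way to see why centring matters in addition to scaling.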
