
I understand that feature scaling speeds up gradient descent by making the geometry of the cost function more hospitable (i.e. more circular), but why is mean normalization useful? Doesn't it just "move" the cost function closer to the origin?

Tyler

2 Answers


Your wording is a bit unusual, but here is the idea: normalization makes your input commensurate in size with the initializations you pick for your weights/parameters. The mean shift balances your input so that it has both negative and positive values, which matters when your model uses asymmetric functions such as ReLU neurons or the sigmoid of a logistic regression. It also keeps those functions from saturating, i.e., from having vanishing gradients.
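
As a rough illustration (plain NumPy, made-up numbers), compare what a sigmoid sees when fed a raw, large, all-positive feature versus the same feature after centering and scaling:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Hypothetical raw feature: large and strictly positive (e.g. square feet).
    x_raw = rng.uniform(500, 3500, size=1000)
    w = 0.01  # a small weight, as you might draw at initialization

    # Raw input: the pre-activation w*x is large and positive, so the sigmoid
    # sits near 1 and its derivative sigma*(1-sigma) is essentially zero.
    s = sigmoid(w * x_raw)
    print(s.mean(), (s * (1 - s)).mean())          # ~1.0, ~0.0

    # Centered and scaled input: the pre-activation stays near zero, where the
    # sigmoid is roughly linear and its gradient is largest.
    x_std = (x_raw - x_raw.mean()) / x_raw.std()
    s = sigmoid(w * x_std)
    print(s.mean(), (s * (1 - s)).mean())          # ~0.5, ~0.25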

Alex R.

Normalization of data is a tricky topic. You have your terminology a little mixed up, but your intuition is correct. What you refer to as mean normalization is in fact the centering of the data. Centering is beneficial for certain algorithms (PCA, for instance), or, as @Alex R. pointed out, for the activation functions of neural networks. It does not have any bearing on gradient descent itself.

The process of normalization is in fact what you refer to as scaling. As you correctly point out, normalizing the data matrix ensures that the problem is well conditioned, so that the scales of your dimensions do not differ by orders of magnitude. This is what makes optimization with first-order methods (e.g., gradient descent) feasible.
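
To make the conditioning point concrete, here is a small sketch (NumPy, made-up data) running the same plain gradient descent loop on a raw design matrix with wildly different column scales and on its standardized version:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # Two hypothetical features on very different scales.
    X = np.column_stack([rng.uniform(0, 1, n),       # order 1
                         rng.uniform(0, 1000, n)])   # order 1,000
    y = 3 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, n)

    def run_gd(X, y, lr, steps=500):
        """Batch gradient descent on mean squared error, no intercept."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            w -= lr * 2 * X.T @ (X @ w - y) / len(y)
        return np.mean((X @ w - y) ** 2)  # final training MSE

    # Raw matrix: any step size large enough to move the small-scale column
    # blows up in the large-scale one, so with a safely tiny step size the
    # fit is still poor after 500 iterations.
    print(run_gd(X, y, lr=1e-6))

    # Standardized matrix (y centered as well): the columns have comparable
    # variance, a moderate step size is stable, and 500 iterations converge.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    print(run_gd(X_std, y - y.mean(), lr=0.1))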

Your confusion may stem from the fact that the process of normalizing and centering your data is sometimes referred to as standardizing the data.
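
A tiny sketch of the terminology (NumPy, with scikit-learn's StandardScaler shown as one common off-the-shelf implementation, assuming it is installed):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 900.0]])

    X_centered = X - X.mean(axis=0)                        # centering ("mean normalization")
    X_scaled = X / X.std(axis=0)                           # scaling only
    X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)  # centering + scaling = standardizing

    # StandardScaler performs exactly this centering-plus-scaling combination.
    assert np.allclose(X_standardized, StandardScaler().fit_transform(X))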

David Kozak
  • Ok, that makes sense, and clears up my confusion. (I now realize I was confused because in the video by Andrew Ng I was watching on feature scaling and its usefulness in gradient descent, he sort of tacks on centering of data (or what he calls "mean normalization") without giving it context. Video is at https://www.youtube.com/watch?v=cOrx5-9-YjQ ) – Tyler Nov 23 '17 at 04:46