Why do I think a zero-centered activation function is no better than a non-zero-centered one? What's wrong with my understanding?

Question

I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" I can understand that, for a positive activation function, the gradient components along all dimensions have the same sign. But I still have a question:

For zero-centered data, we cannot have a gradient whose components all have the same sign, so there are still some directions in which we cannot move. Is that right?

For instance, the answer says

Say there are two parameters $w_1$ and $w_2$. If the gradients of the two dimensions are always of the same sign, it means we can only move roughly in the direction of northeast or southwest in the parameter space. This may lead to a zig-zag path if the optimal direction is northwest or southeast.

But if the data is zero-centered, the gradients of the two dimensions always have different signs. Then we can only move toward the northwest or southeast, which still leads to a zig-zag path if the optimal direction is northeast or southwest.

So in both cases there are some directions in which we cannot move, leading to a zig-zag path.

What's wrong with my understanding?
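
One way to probe this numerically is to list which sign patterns the gradient can take in each case. The following is only a minimal sketch under stated assumptions: a single linear neuron $z = w_1 x_1 + w_2 x_2 + b$, so that the per-sample gradient with respect to the weights is the upstream gradient $\partial L / \partial z$ times the input $x$. The helper `gradient_sign_patterns`, the random upstream gradient, and the uniform input ranges are all hypothetical choices made for illustration; they are not taken from the linked answer.

```python
import numpy as np

def gradient_sign_patterns(inputs, upstream):
    """Return the set of (sign(dL/dw1), sign(dL/dw2)) seen over all samples.

    Assumes a single linear neuron z = w1*x1 + w2*x2 + b, so the per-sample
    gradient w.r.t. the weights is upstream * x (chain rule).
    """
    grads = upstream[:, None] * inputs        # shape (n_samples, 2)
    signs = np.sign(grads).astype(int)
    return {tuple(row) for row in signs.tolist()}

rng = np.random.default_rng(0)
n = 1000

# Assumed upstream gradient dL/dz: it can be positive or negative.
upstream = rng.normal(size=n)

# Assumed inputs: strictly positive (like sigmoid outputs) versus
# zero-centered (like tanh outputs).
positive_inputs = rng.uniform(0.1, 1.0, size=(n, 2))
centered_inputs = rng.uniform(-1.0, 1.0, size=(n, 2))

pos_patterns = gradient_sign_patterns(positive_inputs, upstream)
cen_patterns = gradient_sign_patterns(centered_inputs, upstream)

print("positive inputs:     ", sorted(pos_patterns))
print("zero-centered inputs:", sorted(cen_patterns))
```

In this sketch, the strictly positive inputs only ever produce the patterns `(1, 1)` and `(-1, -1)`, while the zero-centered inputs produce all four sign combinations. Under these assumptions, zero-centering does not force the two gradient components to have opposite signs; it removes the constraint that they share a sign.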

score 0 · Answer 1 · answered Aug 13 '20 at 05:14

I don't think there is anything wrong with your understanding, and I agree with your point. In my opinion, there is a widespread misunderstanding about this. The explanation given here seems quite reasonable. In other words, multiplication around zero tends to preserve range, which is desirable because every activation function is identical and hence operates on a fixed range.

