
It is already common to do something "like"** z-standardization of the outputs of one neural network layer before passing them to the next (**see the note below). z-standardization would transform the columns of $H_{\ell}W_{\ell} + \beta_{\ell}$ (where $\ell$ denotes a layer and $H$ denotes a "hidden" matrix containing the values of the hidden neurons, or the input data) to have zero mean and unit standard deviation.

**In reality, batch norm is used, which adds learnable scale and shift parameters to the standardization so that the model can "undo" or "modify" the otherwise deterministic z-scaling.
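
For concreteness, here is a rough NumPy sketch of what I mean (my own toy code and variable names; I ignore running statistics and use a small `eps` for numerical stability):

```python
import numpy as np

def z_standardize(A, eps=1e-5):
    """Column-wise z-standardization: each column gets ~0 mean, ~unit std."""
    mu = A.mean(axis=0, keepdims=True)
    sigma = A.std(axis=0, keepdims=True)
    return (A - mu) / (sigma + eps)

def batch_norm(A, gamma, beta_bn, eps=1e-5):
    """Training-mode batch norm: z-standardize, then apply a learnable
    per-column scale (gamma) and shift (beta_bn)."""
    return gamma * z_standardize(A, eps) + beta_bn

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 8))    # batch of 32 hidden vectors (or inputs)
W = rng.normal(size=(8, 4))     # layer weights W_l
beta = rng.normal(size=(1, 4))  # layer bias beta_l

A = H @ W + beta                # the quantity H_l W_l + beta_l from above
Z = z_standardize(A)            # columns now have ~0 mean and ~unit std
```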

(1) I observe that z-scoring is a nonlinear function of $W_{\ell}$, because we must compute the sample standard deviation of $H_{\ell}W_{\ell} + \beta_{\ell}$, which involves a square root. It follows that batch norm will be nonlinear in the previous layer's weights as well.
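
Here is a quick numerical check of (1), reusing the toy sketch above (an illustration with my made-up numbers, not a proof): if the standardized output were linear in $W_{\ell}$, doubling $W_{\ell}$ would have to double the output.

```python
Z1 = z_standardize(H @ W + beta)
Z2 = z_standardize(H @ (2 * W) + beta)

# If the map W -> z_standardize(H @ W + beta) were linear, we would need
# Z2 == 2 * Z1. Instead the per-column statistics absorb the scaling,
# so Z2 is (up to the eps term) equal to Z1, not to 2 * Z1.
print(np.allclose(Z2, 2 * Z1))  # False
print(np.allclose(Z2, Z1))      # True (approximately; eps breaks exact equality)
```

Of course, this only shows that the map is not linear in $W_{\ell}$; whether that nonlinearity is rich enough to replace ReLU is exactly what I am asking.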

(2) So if we use batch norm, we should not need a standard activation function such as ReLU or tanh to prevent the whole "stack" (composition) of affine layers from collapsing into a single affine layer. Meanwhile, the community generally believes batch norm is "good". Why not just use batch norm between layers and free ourselves from choosing between ReLU, ELU, etc.?
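
To make (2) concrete, this is the kind of architecture I have in mind (a PyTorch sketch using the standard `nn.Linear` and `nn.BatchNorm1d` modules; the layer sizes are arbitrary):

```python
import torch
from torch import nn

# Affine layers with batch norm in between and *no* ReLU/tanh/ELU anywhere.
# The question: do the batch-norm layers alone provide enough nonlinearity
# to keep this from collapsing into a single affine map?
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.BatchNorm1d(64),  # learnable gamma/beta, batch statistics in training mode
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.Linear(64, 1),
)

x = torch.randn(32, 16)  # a batch of 32 examples with 16 features
y = model(x)             # shape: (32, 1)
```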

^^ that is the purpose of my question

what follows is an observation about why answering my question might also be useful:

(3) There are many questions online about whether batch norm should be placed before or after the activation function. Couldn't we just use batch norm alone and sidestep that question entirely?

Insights and/or references on this specific topic would be much appreciated!

Thanks

RMurphy
  • I'm not sure what information you're seeking. My best guess is that your question is premised on a misunderstanding of how batch norm works. The point of batch norm is to use running mean and running standard deviation estimates; these estimates are used to compensate for the shifting means and standard deviations of inputs to the norm layer which occur because the network is training. Does this answer your question, or do you need clarification about a different component of batch norm? (What component do you wish to understand in more detail?) – Sycorax Sep 16 '19 at 20:04
  • @Sycorax. In some sense, that's the point of my question. I understand the original motivation behind batch norm, but couldn't it also double as a nonlinearity? I mean, why do we need to compose something like a sigmoid, a relu, an elu etc. etc. with a z-standardization or its glorified cousin batch norm? – RMurphy Sep 17 '19 at 13:59
  • A $z$ score is just a linear transformation of the inputs; if this is unclear, note that you can re-write $\frac{x - \mu}{\sigma}$ as $\frac{1}{\sigma}x - \frac{\mu}{\sigma}$. The duplicate question addresses why neural networks use nonlinear activation functions instead of linear functions: linear functions are closed under composition, so a network of linear functions is simply a linear model. – Sycorax Sep 17 '19 at 14:42
  • @Sycorax, please, it is not a duplicate question. I understand perfectly well why purely linear layers are not enough, in terms of the answer you have given me above. Also, I do not agree that computing a standard deviation is linear in its arguments. The x you have written involves weights from the previous layer, and the standard deviation requires a square root of a quantity that depends on those weights. – RMurphy Sep 17 '19 at 19:24
  • My advice is to use the edit button to rewrite your question to clearly articulate what you know, what you would like to know, and where you are stuck. Right now, I can't make heads or tails of what you're trying to ask and what you would like to know. – Sycorax Sep 17 '19 at 19:33
  • @Sycorax, thank you for unduplicating it. I will edit it. – RMurphy Sep 17 '19 at 19:42
  • What's the question? You made statements but didn't pose a clear question. – Aksakal May 23 '20 at 18:04

0 Answers