Question 1
The video actually explains this, but the explanation involves a lot of hand-waving.
Normalizing to mean 0 and variance 1 has the following effects:
All variables are measured on the same scale, so no single variable can "dominate" the optimization process. This is illustrated nicely at 0:47 in the video. When the scale of one input is large relative to another, the gradient with respect to that input will also tend to be larger, which leads to relatively large weight updates and causes the objective function to "bounce around" instead of converging gradually; the effect is similar to setting the learning rate too high.
It helps prevent the weights or the objective function from taking on values that are extremely large or extremely small. Floating-point arithmetic loses precision at extreme magnitudes and can overflow or underflow, and extreme values can also lead to numerical instability, where the answers diverge wildly given only small perturbations in the inputs.
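As a concrete illustration, here is a minimal sketch of standardizing each column of an input matrix to mean 0 and variance 1 using NumPy (the data values are made up to show two features on very different scales):
import numpy as np

# hypothetical data: 5 samples, 2 features measured on very different scales
X = np.array([[1200.0, 0.002],
              [1500.0, 0.004],
              [ 900.0, 0.001],
              [1100.0, 0.003],
              [1300.0, 0.005]])

# standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # [1, 1] up to floating-point error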
Question 2
"Initialized using the normal distribution" means that a random value from the normal distribution is drawn for each weight.
Question 3
No. The y-axis represents "probability density," which is not the same thing as probability; a probability only comes from integrating the density over an interval, so the density itself can only be interpreted as a relative value. It's not totally wrong (but also not totally right) to think of it this way: values close to a point with high probability density are proportionally more likely to occur than values close to a point with low probability density.
This is a subtle idea, and I have my own vivid memories of confusion around it. User "whuber" explains it better than I do in his answer to "Can a probability distribution value exceeding 1 be OK?".
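To see numerically that a density is not a probability, here is a small sketch using NumPy (the standard deviation of 0.1 is chosen only to make the point that the density can exceed 1 while the total probability still integrates to 1):
import numpy as np

# a normal distribution with a small standard deviation has a peak density above 1
sigma = 0.1
x = np.linspace(-1.0, 1.0, 100001)
pdf = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

print(pdf.max())                    # about 3.99 -- a density, not a probability
print(np.sum(pdf) * (x[1] - x[0]))  # about 1.0 -- probabilities come from integrating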
Question 4
It's the size of the sample, i.e., how many values to draw from the distribution.
To read the help in an IPython or Jupyter console, type:
import numpy.random
?numpy.random.normal
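For example (the parameter values below are just a plausible call, not from the lesson):
import numpy as np

# draw 5 independent values from a normal distribution with mean 0 and standard deviation 1
samples = np.random.normal(loc=0.0, scale=1.0, size=5)
print(samples.shape)  # (5,)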