
In a typical neural network, what is the common way to add regularization?

Assume a regression task, where the error loss is mean squared error.

Then we have two choices of regularization on the weights:

  1. $\lambda \sum \|W\|^2$
  2. $\lambda \cdot \textbf{average}\, \|W\|^2$

I have seen most people use the first option; I'm just curious why.
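For concreteness, here is a minimal NumPy sketch (my own illustration, not from the question; the weight vector and $\lambda$ value are hypothetical) showing that the two penalties differ only by a constant factor, namely the number of weights:

```python
import numpy as np

# Illustrative values; the weight vector and lambda are hypothetical.
rng = np.random.default_rng(0)
w = rng.normal(size=100)   # flattened weight vector of a network
lam = 0.01

penalty_sum = lam * np.sum(w ** 2)    # option 1: sum of squared weights
penalty_avg = lam * np.mean(w ** 2)   # option 2: average of squared weights

# The two differ only by the scalar factor len(w):
assert np.isclose(penalty_sum, len(w) * penalty_avg)
```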

  • depending on what you mean by average, the two should be equivalent as they differ by the scalar value of the number of samples. – meh Jul 10 '18 at 21:02
  • What are you summing/averaging over? It's not clear from your expressions – user20160 Jul 10 '18 at 21:24
  • Is the difference between the two that the second $\lambda$ will be $\lambda/n$ of the original? If so, I'm not sure it really matters much. – Anonymous Emu Jul 10 '18 at 22:14
  • I agree with @AnonymousEmu, it's just a different scale for the $\lambda$ variable. With the average, you just implicitly reduce the value of $\lambda$. – itdxer Jul 17 '18 at 15:22

1 Answer


Using the average implicitly rescales $\lambda$. This means that choosing the average or the sum isn't really consequential, because whatever the optimal $\lambda$ is on the mean scale has an equivalent choice of $\lambda$ on the sum scale, and vice versa: $$ \lambda \sum_i w_i^2 = n\lambda \left[\frac{1}{n}\sum_i w_i^2 \right], $$ so a sum penalty with coefficient $\lambda$ is identical to a mean penalty with coefficient $n\lambda$, where $n$ is the number of weights being penalized.
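To make the rescaling concrete, here is a short numeric check (an illustration under the same notation, not part of the original answer): the gradients of the two penalties coincide once $\lambda$ is rescaled by $n$.

```python
import numpy as np

# Hypothetical weights and regularization strength.
w = np.array([0.5, -1.2, 0.3])
n = len(w)
lam_sum = 0.1
lam_mean = n * lam_sum   # equivalent lambda on the mean scale

# Gradient of lam_sum * sum(w^2) vs. gradient of lam_mean * mean(w^2):
grad_sum = 2 * lam_sum * w
grad_mean = 2 * (lam_mean / n) * w
assert np.allclose(grad_sum, grad_mean)
```

So any gradient-based optimizer sees identical updates from the two penalties once $\lambda$ is rescaled accordingly.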

Sycorax