The conventional definition of the L2-regularization ("weight decay") hyperparameter $\lambda$ takes the form
$$\text{J}(\mathbf{w}\vert \mathbf{X},\mathbf{y})= \text{L}(\mathbf{\widehat{y}}(\mathbf{X}),\mathbf{y}\vert\mathbf{w})+\lambda \mathbf{w}^{T}\mathbf{w}$$
where $\text{J}(.)$ is a loss function "using L2-regularization" and $\text{L}(.)$ is some conventional "underlying loss". This definition has a number of advantages, including:
- a simple way to express its role: "how much weight to place on the aggregate magnitude of the model's $\mathbf{w}$"; and
- its interpretation as the contribution to the MAP estimate made by IID Gaussian priors on the weights, $w_j\sim\mathcal{N}(0,\sigma_w^2)$, in which case $\lambda\propto 1/\sigma_w^2$ (if it is also the case that $p\left(y^{(i)}\mid\mathbf{x}^{(i)},\mathbf{w}\right) = \mathcal{N}\left(y^{(i)};\,\widehat{y}(\mathbf{x}^{(i)};\mathbf{w}),\,\sigma_y^2\right)$, so that $\text{L}(.)$ is the summed squared error, then $\lambda = {\sigma_y^2}/{\sigma_w^2}$; if $p\left(y^{(i)}\mid\mathbf{x}^{(i)},\mathbf{w}\right)$ is a categorical/multinoulli distribution, so that $\text{L}(.)$ is the summed cross-entropy, then $\lambda = 1/(2\sigma_w^2)$). A short derivation is sketched below.
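For concreteness, here is a sketch of that derivation in the Gaussian case (under the assumptions above, with $\text{L}(.)$ taken as the summed squared error over $m$ training examples):
$$-\log p(\mathbf{w}\vert\mathbf{X},\mathbf{y}) = \frac{1}{2\sigma_y^2}\sum_{i=1}^{m}\left(y^{(i)}-\widehat{y}(\mathbf{x}^{(i)};\mathbf{w})\right)^{2} + \frac{1}{2\sigma_w^2}\mathbf{w}^{T}\mathbf{w} + \text{const},$$
and multiplying through by $2\sigma_y^2$ (which leaves the minimizer unchanged) gives $\sum_{i}\left(y^{(i)}-\widehat{y}^{(i)}\right)^{2} + \left({\sigma_y^2}/{\sigma_w^2}\right)\mathbf{w}^{T}\mathbf{w}$, i.e. $\lambda={\sigma_y^2}/{\sigma_w^2}$. If $\text{L}(.)$ is instead the mean squared error, the same algebra gives $\lambda={\sigma_y^2}/(m\,\sigma_w^2)$: exactly the batch-size sensitivity that recurs in the examples below.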
But ML packages and APIs, as well as some discussions, seem to use "$\lambda$" (or terms that one would expect to correspond to it) inconsistently. To cite just a few examples:
- TensorFlow's `tf.nn.l2_loss(w)` is half of `np.dot(w, w)`, so any weight applied to the former needs to be doubled to match $\lambda$ (assuming, as is generally the case in TensorFlow documentation, that all underlying losses are means over the batch size); see the first sketch after this list.
- Stanford's CS231n divides its regularization terms by 2 for "computational convenience", so its "$\lambda$" must be divided by 2 to match $\lambda$.
- Nielsen's (generally excellent) discussion of deep learning treats regularization as a term applied to each training example and follows CS231n (and others) in dividing by 2, so his "$\lambda$" must be divided by twice the batch size (which he conflates with the whole training set) to match $\lambda$.
- TensorFlow's `tf.matrix_solve_ls(m, ...)` does not average over the length of `m`, so its `l2_regularizer` argument must be divided by the batch size to match $\lambda$. (The same is true of the corresponding argument, `alpha`, in scikit-learn's `linear_model.Ridge`; see the second sketch after this list.)
- What the `"L2Regularization"` argument to Mathematica's `Predict` is doing is anyone's guess (as can often be the case with Mathematica).
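To make these conversions concrete, here is a minimal NumPy sketch (the linear model, the random data, and the value of `lam` are purely illustrative assumptions of mine, not anything taken from the packages' documentation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 32, 5                                  # batch size, number of weights
X = rng.normal(size=(m, d))
w = rng.normal(size=d)
y = X @ w + rng.normal(size=m)

lam = 0.1                                     # lambda in the sense defined above
mse = np.mean((X @ w - y) ** 2)               # underlying loss: a mean over the batch

J = mse + lam * np.dot(w, w)                  # the "clean" objective J(w)

# A tf.nn.l2_loss-style penalty carries a 1/2 inside (as does CS231n's
# convention), so its coefficient must be doubled to recover J:
J_half = mse + (2 * lam) * (0.5 * np.dot(w, w))

# An unaveraged (summed) underlying loss, as in tf.matrix_solve_ls or Ridge,
# scales the whole objective by m, so the matching coefficient is m * lam:
J_sum = np.sum((X @ w - y) ** 2) + (m * lam) * np.dot(w, w)

assert np.isclose(J, J_half) and np.isclose(J_sum, m * J)
```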
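And since scikit-learn's `Ridge` documents its objective as $\Vert X\mathbf{w}-\mathbf{y}\Vert^2 + \alpha\Vert\mathbf{w}\Vert^2$ (no averaging), the claim that its `alpha` is a batch-size multiple of $\lambda$ can be checked against the closed-form minimizer of the mean-based objective. A second sketch (my own cross-check, with `fit_intercept=False` so the comparison is exact):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
m, d = 64, 4
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

lam = 0.3                                     # lambda in the sense defined above

# Closed-form argmin of  mean((Xw - y)^2) + lam * w'w :
w_map = np.linalg.solve(X.T @ X + m * lam * np.eye(d), X.T @ y)

# Ridge minimizes  ||Xw - y||^2 + alpha * ||w||^2,  so alpha = m * lam
# recovers the same solution:
w_ridge = Ridge(alpha=m * lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_map, w_ridge))            # expected: True
```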
I understand that ML can get a bit sloppy about (the abundant) terminology, so maybe this is just how things are, but I also wonder if any of this means anything. It certainly results in confusion. Is there a pattern here; some logic to the various approaches; anything really gained by the different meanings for "$\lambda$"? Or is this just a matter of taste?
It seems a shame to stray from the clean and powerful definition above, so I want to be sure I'm not missing some deeper logic to these departures.