
The conventional definition of the L2-regularization "weight decay" hyperparameter $\lambda$ is generally of the form

$$\text{J}(\mathbf{w}\vert \mathbf{X},\mathbf{y})= \text{L}(\mathbf{\widehat{y}}(\mathbf{X}),\mathbf{y}\vert\mathbf{w})+\lambda \mathbf{w}^{T}\mathbf{w}$$

where $\text{J}(.)$ is a loss function "using L2-regularization" and $\text{L}(.)$ is some conventional "underlying loss". This definition has a number of advantages including

  • a simple way to express its implications as "how much weight to place on the aggregate magnitude of the model's $\mathbf{w}$", and
  • its interpretation as the contribution to the MAP estimate made by IID Gaussian priors for each of the $w_j\sim\mathcal{N}(0,\sigma_w)$, in which case $\lambda\propto 1/\sigma_w^2$ (if it is also the case that $p\left(y^{(i)}|\mathbf{x}^{(i)},\mathbf{w}\right) = \mathcal{N}(y^{(i)};\widehat{y}(\mathbf{x}^{(i)};\mathbf{w}),\sigma_y)$ — and thus $\text{L}(.)$ is the mean squared error — then $\lambda = {\sigma_y^2}/{\sigma_w^2}$; if $p\left(y^{(i)}|\mathbf{x}^{(i)},\mathbf{w}\right)$ is a categorical/multinoulli distribution — and thus $\text{L}(.)$ is cross-entropy — then $\lambda = 1/\sigma_w^2$).
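To make the convention concrete, here is a minimal NumPy sketch of $\text{J}(.)$ under this definition (a hypothetical `j_l2` helper, with a linear model standing in for $\widehat{y}$ and MSE as the underlying loss):

```python
import numpy as np

def j_l2(w, X, y, lam):
    """J(w) = L(y_hat(X), y) + lam * w^T w, with L taken to be MSE."""
    y_hat = X @ w                      # linear model, purely for illustration
    mse = np.mean((y_hat - y) ** 2)    # the "underlying loss" L
    return mse + lam * (w @ w)         # the L2 penalty, with lambda as defined above

# Sanity check: with a perfect fit the loss reduces to the penalty alone.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w = rng.normal(size=3)
y = X @ w                              # zero residual by construction
print(j_l2(w, X, y, 0.0))              # ~0: no misfit, no penalty
print(j_l2(w, X, y, 0.1))              # 0.1 * w.w
```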

But ML packages and APIs, as well as some discussions, seem to use "$\lambda$", or terms that one would expect to correspond to it, inconsistently. To cite just a few examples:

  • TensorFlow's tf.nn.l2_loss(w) is half of np.dot(w, w) so any weight applied to the former needs to be doubled in order to match $\lambda$ (assuming, as is generally the case in TensorFlow documentation, all underlying losses are means over batch size).
  • Stanford's CS231n divides its regularization terms by 2 for "computational convenience", so its "$\lambda$" must be divided by 2 to match $\lambda$.
  • Nielsen's (generally excellent) discussion of deep learning treats regularization as a term applied to each training example and follows CS231n (and others) in dividing by 2, so that his "$\lambda$" must be divided by twice the batch size (which he conflates with the whole "training set") to match $\lambda$.
  • TensorFlow's tf.matrix_solve_ls(m, ...) does not average over the length of m, so its l2_regularizer argument must be divided by the batch size to match $\lambda$. (The same is true of the corresponding argument, alpha, in scikit-learn's linear_model.Ridge.)

  • What the "L2Regularization" argument to Mathematica's Predict is doing is anyone's guess (as can often be the case with Mathematica).
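The bookkeeping in the bullets above can be checked with a few lines of NumPy (penalty terms only, with the data term elided; the factor-of-two and batch-size conversions are the point, not any particular library call):

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])
lam = 0.05     # the canonical lambda from the definition above
n = 32         # batch size

target = lam * (w @ w)                 # lam * w^T w, the quantity to match

# tf.nn.l2_loss(w) computes w.w / 2, so its coefficient must be 2 * lam:
tf_style = (2 * lam) * (w @ w / 2)
assert np.isclose(tf_style, target)

# CS231n writes (lam' / 2) * w.w, so lam' = 2 * lam:
cs231n_style = ((2 * lam) / 2) * (w @ w)
assert np.isclose(cs231n_style, target)

# A solver that sums (rather than averages) the n per-example losses scales
# the data term by n, so its regularizer must be n * lam to preserve the
# ratio between the two terms (the tf.matrix_solve_ls / Ridge case, inverted):
summed_style = (n * lam) * (w @ w)
assert np.isclose(summed_style / n, target)

print("all conversions consistent")
```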

I understand that ML can get a bit sloppy about (the abundant) terminology, so maybe this is just how things are, but I also wonder if any of this means anything. It certainly results in confusion. Is there a pattern here; some logic to the various approaches; anything really gained by the different meanings for "$\lambda$"? Or is this just a matter of taste?

It seems a shame to stray from the clean and powerful definition above, so I want to be sure I'm not missing some deeper logic to these departures.

orome
    You might as well ask the same question for the definition of $L$! Since the definition of $J$ matters only up to multiplication by any positive constant, the same holds for its components. – whuber Aug 28 '17 at 20:36
  • @whuber: Well, yes, but the constants need to be consistent. In the cases of MSE and cross-entropy, for example, no constant bleeds from either version of $\text{L}$ into $\lambda$. – orome Aug 28 '17 at 20:40
  • @whuber: Also, it won't be useful (as an answer in this context) to go down that path. This is a practical question about the consistent meaning of $\lambda$ (e.g. in APIs). – orome Aug 28 '17 at 20:42
    Since there is an implicit and unspecified constant in $L$, that makes the question of what constant to absorb in $\lambda$ meaningless. Even entropy is defined only up to a constant (until you specify what units of entropy you want to use) and MSE is just a proxy for quadratic loss--defined up to a constant, of course. As a practical question, the only possible answer would seem to be "read the manual." – whuber Aug 28 '17 at 20:42
  • @whuber: It could be that that's all there is to the answer: nobody makes the effort to reconcile the values of $\lambda$ across different "underlying" loss functions (or, as you say, even to care much about how that function is defined, if that has no effect on the argmax). Presumably (and this may be the answer behind the answer) because in practice $\lambda$ is guesswork anyway, so whatever apparent meaning it has in specific circumstances (e.g. the ratio of variances when $\text{L}$ is MSE) doesn't matter much when choosing useful values of $\lambda$. – orome Aug 28 '17 at 20:53
    Some software even uses $C=\lambda^{-1}$! The Python package sklearn (to pick a prominent example) regularizes by $C$ and only accepts $C>0$; both decisions are somewhere between arbitrary and inscrutable. – Sycorax Aug 28 '17 at 22:48
  • I see a lot of plots use the quantity $\|\hat\beta_{OLS}\|_2^{-1} \|\hat\beta_\lambda\|_2$ to index the tuning parameter, where $\hat\beta_{OLS}$ is the least square estimator and $\hat\beta_\lambda$ is the ridge estimator. This seems to be a more natural scale than any directly related to $\lambda$, which, as you mention, has a somewhat arbitrary scale. – user795305 Aug 29 '17 at 14:41
    @Sycorax I believe that is because sklearn does not use a standard method for solving these regressions, they are based on an SVM solver. Not that this changes the fact that it is inscrutable. – Matthew Drury Sep 06 '17 at 17:54

0 Answers