3

Speaking specifically about Ridge Regression's cost function: since Ridge Regression is based on the $l_2$ norm, we would expect the cost function to be:

$$J(\theta)=MSE(\theta) + \alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}$$

Actual:

$$J(\theta)=MSE(\theta) + \alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$

Siong Thye Goh
Nick L

3 Answers

6

One of the factors to consider is computational simplicity.

Without the square root, the gradient takes a more elegant form.
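To see this, compare the partial derivatives of the two penalty terms with respect to a single coefficient $\theta_j$, writing $\|\theta\|_2=\sqrt{\sum_{i=1}^{n}\theta_i^2}$:

$$\frac{\partial}{\partial\theta_j}\left(\alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2\right)=\alpha\,\theta_j,
\qquad
\frac{\partial}{\partial\theta_j}\left(\alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}\right)=\alpha\,\frac{\theta_j}{\|\theta\|_2}.$$

The first is linear in $\theta$ and defined everywhere; the second requires dividing by the norm and is not differentiable at $\theta=0$.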

Also, minimizing $MSE$ subject to $\|\theta\|_2\le c$ is equivalent to minimizing $MSE$ subject to $\|\theta\|_2^2\le c^2$, since squaring is monotone on $[0,\infty)$ and the two constraints describe the same feasible set.
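As a quick numerical illustration of the gradient comparison above (a minimal numpy sketch; the values of $\alpha$ and $\theta$ are arbitrary):

```python
import numpy as np

alpha = 0.1
theta = np.array([0.5, -1.0, 2.0])

# Gradient of the squared penalty (alpha/2) * sum(theta_i^2): simply alpha * theta.
grad_squared = alpha * theta

# Gradient of the square-rooted penalty alpha * ||theta||_2:
# alpha * theta / ||theta||_2, undefined at theta = 0.
grad_sqrt = alpha * theta / np.linalg.norm(theta)

print(grad_squared)  # [ 0.05 -0.1   0.2 ]
print(grad_sqrt)     # same direction, scaled by 1 / ||theta||_2
```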

h4nek
Siong Thye Goh
2

Why should we expect there to be a square root involved? It would simply amount to a kind of rescaling, meaning the $\alpha$ values would be much larger than without the square root.

The total sum $\sum_{k=1}^K \theta_k^2$ is obviously the same in both forms. Since the $\alpha$ parameter is there to scale the importance of the regularisation term relative to the loss from the model-fitting term, it has no intrinsic interpretation.

Therefore, the square root would just be an additional scaling of $\alpha$. It doesn't matter fundamentally, just as it doesn't matter whether there is a $\frac{1}{2}$ in the second equation or not. Also note that the penalty is a sum over all parameter coefficients, usually indexed by $K$, and not over the observations, usually indexed by $N$.
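Concretely, any constant factor in front of the penalty term can be absorbed into the tuning parameter:

$$\alpha\frac{1}{2}\sum_{k=1}^{K}\theta_k^2 \;=\; \alpha'\sum_{k=1}^{K}\theta_k^2, \qquad \alpha'=\frac{\alpha}{2},$$

so the $\frac{1}{2}$ (or any other constant) merely relabels the regularisation strength.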

Finally, compare this to the L1 norm, which is in fact fundamentally different: it uses the absolute values of the parameters $\theta$ rather than their squares as in the L2 norm (although, again, any scaling would not matter, since for the L1 norm the scale of the $\alpha$ or $\lambda$ parameter is equally devoid of intrinsic interpretation).
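For comparison, the $l_1$-penalised (Lasso) cost function is typically written as

$$J(\theta)=MSE(\theta)+\alpha\sum_{k=1}^{K}\lvert\theta_k\rvert,$$

where the absolute values change the geometry of the penalty itself, not just its scale.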

Mark Verhagen
  • "Why should we expect there to be a square root involved?" because Euclidean distance is calculated that way. But another way to convince yourself of *not* square-rooting is that both the variance and bias are in terms of squared-response-units. You can then imagine that the penalty is per-one-unit of regressor. That is $( \vec{\theta} \cdot \mathbf{1})^2$ has a consistent unit with the MSE. – AdamO Feb 24 '20 at 19:29
0

The squared $l_2$ norm of $\vec{x}$ is $\|\mathbf{x}\|_2^2 = x_1^2+x_2^2+\ldots+x_n^2$, which is the quantity that appears in the objective function of ridge regression. The $\alpha$ in the $\frac{\alpha}{2}$ factor that precedes it is the tuning parameter of ridge regression and is always $\ge0$.
If I'm not mistaken, you are referring to this from Andrew Ng's course; in the lecture where he teaches ridge regularisation, he deliberately chooses to use $\frac{\alpha}{2}$ instead of just $\alpha$, for reasons I do not recall now.
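Written out as code, the objective being discussed looks roughly like this (a minimal sketch assuming a linear model `X @ theta` with no intercept; the names `X` and `y` are illustrative):

```python
import numpy as np

def ridge_cost(theta, X, y, alpha):
    """J(theta) = MSE(theta) + (alpha/2) * ||theta||_2^2."""
    residuals = X @ theta - y
    mse = np.mean(residuals ** 2)               # MSE(theta)
    penalty = 0.5 * alpha * np.sum(theta ** 2)  # (alpha/2) * squared l2 norm
    return mse + penalty
```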

Hope this helps.

Nizam
  • Also, I would suggest you go through this https://stats.stackexchange.com/questions/287920/regularisation-why-multiply-by-1-2m?rq=1 – Nizam Feb 16 '20 at 09:16