
Let $J(w)$ be some cost function. Adding $L^1$ regularization gives $$ \tilde{J}(w) = J(w) + \beta\sum_i|w_i|. $$ To study the effect of $L^1$ regularization on the optimal weights, we can approximate $J(w)$ with a quadratic function: $$ \hat{J}(w) = J(w^*) + \frac{1}{2}(w-w^*)^T H(w^*)(w-w^*), $$ where $w^*$ is the minimizer of $J$, so the gradient there is zero. According to the Deep Learning book I'm reading (see pages 207-208), if we assume that $H = \operatorname{diag}(\gamma_1,\ldots,\gamma_N)$ with each $\gamma_i>0$, then

the solution of the minimum of the $L^1$ regularized objective function decomposes into a system of equations of the form: $$ \tilde{J}(w) = \frac{1}{2}\gamma_i(w_i-w_i^*)^2 + \beta|w_i|. $$ It admits an optimal solution (for each dimension $i$) of the following form: $$ w_i = \operatorname{sign}(w_i^*)\max\left(|w_i^*|-\frac{\beta}{\gamma_i},\,0\right). $$

I don't understand the quoted part. This is my starting point: $$ \tilde{J}(w) = J(w^*) + \frac{1}{2} \sum_i \gamma_i (w_i - w^*_i)^2 + \beta \sum_i |w_i|. $$ Shouldn't we look for points where the gradient is zero? From $$ \frac{\partial\tilde{J}(w)}{\partial w_i} = \gamma_i (w_i - w^*_i) + \beta \operatorname{sign}(w_i) = 0, $$ I get $$ w_i = w^*_i - \frac{\beta}{\gamma_i}\operatorname{sign}(w_i). $$ What am I doing wrong?
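
For what it's worth, the quoted closed form does seem to check out numerically. Here is a quick sketch I wrote to convince myself (the names `soft_threshold` and `brute_force_min` are mine, not from the book):

```python
import numpy as np

def soft_threshold(w_star, beta, gamma):
    """The closed form quoted from the book: sign(w*) * max(|w*| - beta/gamma, 0)."""
    return np.sign(w_star) * max(abs(w_star) - beta / gamma, 0.0)

def brute_force_min(w_star, beta, gamma):
    """Minimize 0.5*gamma*(w - w*)^2 + beta*|w| over a fine grid as an independent check."""
    w = np.linspace(-5.0, 5.0, 1_000_001)  # grid spacing 1e-5
    f = 0.5 * gamma * (w - w_star) ** 2 + beta * np.abs(w)
    return w[np.argmin(f)]

for w_star, beta, gamma in [(0.3, 1.0, 1.0), (2.0, 1.0, 1.0), (-3.0, 0.5, 2.0)]:
    print(soft_threshold(w_star, beta, gamma), brute_force_min(w_star, beta, gamma))
# The two columns agree (up to grid resolution) in every case.
```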

Kiuhnm
  • 1) In the last two lines, $\operatorname{sign}(w_i)$ should be $\operatorname{sign}(w_i^*)$. 2) Consider the situation where $w_i^* = 0.5$, $\beta = \gamma_i = 1$: $w_i = -0.5$ in your formulation, but $0$ in the quote. This is because the derivative changes abruptly (loosely speaking) at $0$. You can see by looking at the original expression that $w_i$ shouldn't get more negative once it has reached zero (coming from the positive direction), as that would increase both terms in $\tilde{J}(w)$. See also http://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection/74569#74569. – jbowman Nov 19 '15 at 20:53
  • @jbowman OK. I found the minima of $J_i$ for $w_i<0$, $w_i>0$ and $w_i=0$. Ignoring $w_i=0$, I get the minimum at $w_i=w_i^*-\operatorname{sign}(w_i^*)\frac{\beta}{\gamma_i}$. The minimum is at $0$ if $|w_i^*|\leq\frac{\beta}{\gamma_i}$. Putting it all together, after a bit of manipulation I get the correct formula! Thank you both! (A quick numeric check of jbowman's example is just below these comments.) – Kiuhnm Nov 20 '15 at 02:19
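
To make the example in the comments concrete, here is a throwaway numeric check (the values $w_i^* = 0.5$, $\beta = \gamma_i = 1$ are the ones from jbowman's comment):

```python
# Per-coordinate objective 0.5*gamma*(w - w*)^2 + beta*|w| with w* = 0.5, beta = gamma = 1.
f = lambda w: 0.5 * (w - 0.5) ** 2 + abs(w)
print(f(-0.5))  # 1.0   <- the point w* - (beta/gamma)*sign(w*) from my derivation
print(f(0.0))   # 0.125 <- the point the quoted formula gives; clearly lower
```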

1 Answer


$w_i=w^*_i-\frac{\beta}{\gamma_i}\operatorname{sign}(w_i)$ is correct for $w_i\ne 0$ (at zero the gradient is undefined).

Either $\operatorname{sign}(w_i)=\operatorname{sign}(w_i^*)$ (which corresponds to $|w^*_i|>\frac{\beta}{\gamma_i}$) and there is a solution to the gradient-equals-zero equation, or there is no solution with zero gradient.

Since $\tilde J(w)$ is continuous at $0$, and its gradient changes sign between $0^-$ and $0^+$ when $|w^*_i|\le\frac{\beta}{\gamma_i}$, you have a minimum at $0$ in that case. Put that all together and you get the answer in the textbook.
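
Written out per coordinate, that case analysis is just (my own summary, not a quote from the book): $$ w_i = \begin{cases} w_i^* - \frac{\beta}{\gamma_i}\operatorname{sign}(w_i^*) & \text{if } |w_i^*| > \frac{\beta}{\gamma_i},\\ 0 & \text{if } |w_i^*| \le \frac{\beta}{\gamma_i}, \end{cases} $$ which is exactly $w_i=\operatorname{sign}(w_i^*)\max\left(|w_i^*|-\frac{\beta}{\gamma_i},\,0\right)$, the formula in the book.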

seanv507
  • More explicitly, $w_i^* = w_i + \frac{\beta}{\gamma_i}\operatorname{sign}(w_i) = \operatorname{sign}(w_i)\left(\operatorname{sign}(w_i)\,w_i + \frac{\beta}{\gamma_i}\right) = \operatorname{sign}(w_i)\left(|w_i|+\frac{\beta}{\gamma_i}\right)$, therefore $\operatorname{sign}(w_i^*)=\operatorname{sign}(w_i)$. – Kiuhnm Nov 20 '15 at 13:03