
I was reading about the coordinate descent procedure for lasso, and I have to write about it. I understand how it works, but what I don't get are the formulas: every source seems to add some terms or write the update slightly differently.

In these slides, on page 14, the $\beta$ update is:

$\beta_i = \frac{S_\lambda}{\left \| X_i \right \|_2^2} \left ( \frac{X_i^T (y-X_{-i}\beta_{-i})}{X_i^T X_i} \right )=\frac{S_\lambda}{\left \| X_i \right \|_2^2} \left ( \frac{X_i^T (y-X_{-i}\beta_{-i})}{\left \| X_i \right \|_2^2} \right )$

What I don't really understand here is why the soft thresholding is divided by $\left \| X_i \right \|_2^2$. Shouldn't the coefficient update then be:

$\tilde{\beta}_j(\lambda)\leftarrow \frac{1}{\left \| X_i \right \|_2^2} S \left (\frac{X_i^T (y-X_{-i}\beta_{-i})}{\left \| X_i \right \|_2^2} ,\lambda\right )$ ?

I would very much appreciate any clarification, or any suggestion of what I could read to understand it better.

Ville
    Based on the slide you pointed to, I think $||X_i||_2^2$ is in the subscript, i.e. $S_{\frac{\lambda}{\left \| X_i \right \|_2^2}}$ instead of $ \frac{S_\lambda}{\left \| X_i \right \|_2^2}$. I previously derived and implemented L1 from scratch [here](http://nbviewer.jupyter.org/github/zyxue/stanford-cs229/blob/master/previous_cs229/ps3/l1-regularization-for-least-squares-and-coordinate-descent.ipynb). – zyxue May 23 '18 at 17:08
  • @zyxue thank you a lot, now it makes more sense. I checked your derivation; what I don't get is the third line of your equation: is there a 2 missing? Should it be $\frac{1}{2} \cdot 2 (X\bar{\theta}+X_i\theta_i-y)^T(X\bar{\theta}+X_i\theta_i-y) + \lambda\ \text{terms}$? – Ville May 23 '18 at 19:45
    No, that first equation just expands $J(\theta)$, not taking its derivative. – zyxue May 23 '18 at 20:03

1 Answer


For the full derivation, have a look at this post.

To answer your question briefly:

> What I don't really understand here is why the soft thresholding is divided by $‖X_i‖_2^2$. Shouldn't the coefficient update then be...

This happens when the data you are working with has not been normalized before running coordinate descent. Going through the derivation, you find that the optimization requires sub-differentials (the L1 penalty is not differentiable at zero) and that the coordinate-wise solution is of the form

\begin{aligned} \begin{cases} \theta_j = \frac{\rho_j + \lambda}{z_j} & \text{for} \ \rho_j < - \lambda \\ \theta_j = 0 & \text{for} \ - \lambda \leq \rho_j \leq \lambda \\ \theta_j = \frac{\rho_j - \lambda}{z_j} & \text{for} \ \rho_j > \lambda \end{cases} \end{aligned}

We recognize this as the soft-thresholding function $\frac{1}{z_j} S(\rho_j , \lambda)$, where $\frac{1}{z_j}$ is a normalizing constant with $z_j = \sum_{i=1}^m (x_j^{(i)})^2 = ‖X_j‖_2^2$ (this $z_j$ is the $‖X_i‖_2^2$ in your notation); it is equal to $1$ when the data is normalized.
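In code, the three cases collapse to the standard compact form $S(\rho, \lambda) = \text{sign}(\rho)\max(|\rho| - \lambda, 0)$. Here is a quick Python sketch of just this operator (my own illustration, the names are arbitrary):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator S(rho, lambda), written as the three cases above."""
    if rho < -lam:
        return rho + lam
    elif rho > lam:
        return rho - lam
    return 0.0

# Sanity check against the compact form sign(rho) * max(|rho| - lambda, 0)
for rho in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert soft_threshold(rho, 1.0) == np.sign(rho) * max(abs(rho) - 1.0, 0.0)
```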

Coordinate descent update rule:

Repeat until convergence or max number of iterations:

  • For $j = 0,1,...,n$
  • Compute $\rho_j = \sum_{i=1}^m x_j^{(i)} \left( y^{(i)} - \sum_{k \neq j} \theta_k x_k^{(i)} \right)$
  • Compute $z_j = \sum_{i=1}^m (x_j^{(i)})^2$
  • Set $\theta_j = \frac{1}{z_j} S(\rho_j, \lambda)$
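As a rough illustration of this update rule, here is a minimal NumPy sketch (my own, not from any particular library; it ignores the intercept, uses a fixed number of sweeps instead of a convergence check, and assumes no column of $X$ is all zeros):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimal sketch of lasso coordinate descent using theta_j = S(rho_j, lambda) / z_j."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_sweeps):
        for j in range(n):
            X_j = X[:, j]
            # Partial residual: y minus the fit from every feature except j
            partial_residual = y - X @ theta + X_j * theta[j]
            rho_j = X_j @ partial_residual   # rho_j = sum_i x_j^(i) (y^(i) - sum_{k != j} theta_k x_k^(i))
            z_j = X_j @ X_j                  # z_j = ||X_j||_2^2
            # Soft-thresholding step, divided by the normalizing constant z_j
            theta[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / z_j
    return theta
```

If you standardize the columns of $X$ so that $z_j = 1$, the division disappears and the update is exactly $\theta_j = S(\rho_j, \lambda)$.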

Now in practice, you will find that most people and most books tell you to normalize the data before you perform lasso with coordinate descent. I will tell you that when I implemented lasso via coordinate descent, I could only make it work with normalized data. But that might just be my own failing.

Xavier Bourret Sicotte
  • Is there any reason that coordinate descent is preferred? Why do people say coordinate descent is fast for the L1 problem? – FantasticAI Sep 28 '20 at 20:56