12

I have programmed a logistic regression using the IRLS algorithm. I would like to apply a LASSO penalization in order to automatically select the right features. At each iteration, the following is solved:

$$\mathbf{\left(X^TWX\right) \delta\hat\beta=X^T\left(y-p\right)}$$

Let $\lambda$ be a non-negative real number. I am not penalizing the intercept, as suggested in The Elements of Statistical Learning. Ditto for the coefficients that are already zero. Otherwise, I subtract a term from the right-hand side:

$$\mathbf{X^T\left(y-p\right)-\lambda\times \mathrm{sign}\left(\hat\beta\right)}$$

However, I am unsure about this modification of the IRLS algorithm. Is this the right way to do it?


Edit: Although I was not confident about it, here is one of the solutions I finally came up with. What is interesting is that this solution corresponds to what I now understand about the LASSO. There are indeed two steps at each iteration instead of merely one:

  • the first step is the same as before: we perform one iteration of the algorithm (as if $\lambda=0$ in the formula for the gradient above),
  • the second step is the new one: we apply soft-thresholding to each component of the vector $\beta$ obtained at the first step (except for the component $\beta_0$, which corresponds to the intercept). This is called the Iterative Soft-Thresholding Algorithm (ISTA); a code sketch is given after the formula below.

$$\forall i \geq 1, \beta_{i}\leftarrow\mathrm{sign}\left(\beta_{i}\right)\times\max\left(0,\,\left|\beta_{i}\right|-\lambda\right)$$
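For concreteness, here is a minimal Python sketch of this two-step scheme: one plain IRLS/Newton step, followed by soft-thresholding of every coefficient except the intercept. The function names, the tiny ridge term and the stopping rule are illustrative additions, not part of the original post. (In a textbook ISTA the threshold would be $\lambda$ times the gradient-step size; the sketch follows the post and applies $\lambda$ directly after the Newton step.)

```python
import numpy as np

def soft_threshold(z, t):
    """Component-wise soft-thresholding: sign(z) * max(0, |z| - t)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def logistic_lasso_irls(X, y, lam, n_iter=100, tol=1e-6):
    """IRLS step + soft-thresholding for L1-penalised logistic regression.

    Assumes the first column of X is a column of ones (the intercept),
    which is not penalised, as in the question.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))              # fitted probabilities
        w = p * (1.0 - p)                                  # IRLS weights, diagonal of W
        # Step 1: ordinary IRLS/Newton step, solve (X^T W X) delta = X^T (y - p)
        XtWX = X.T @ (w[:, None] * X) + 1e-8 * np.eye(d)   # tiny ridge for stability
        delta = np.linalg.solve(XtWX, X.T @ (y - p))
        beta_new = beta + delta
        # Step 2: soft-threshold all components except the intercept beta_0
        beta_new[1:] = soft_threshold(beta_new[1:], lam)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```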

Royi
Wok

4 Answers

13

This problem is typically solved by a fit via coordinate descent (see here). This method is safer and more efficient numerically, algorithmically easier to implement, and applicable to a more general array of models (also including Cox regression). An R implementation is available in the R package glmnet. The codes are open source (partly in C, partly in R), so you can use them as blueprints.
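To illustrate the coordinate-descent idea, here is a minimal Python sketch for the plain lasso objective $\frac{1}{2n}\left\| y - X\beta \right\|_2^2 + \lambda \left\| \beta \right\|_1$; glmnet applies the same coordinate-wise soft-thresholding update to a weighted least-squares problem inside an IRLS loop for logistic models. This is a sketch of the general technique, not the actual glmnet code, and the names and tolerances are illustrative:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for 0.5/n * ||y - X b||^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * ||x_j||^2 for each column
    resid = y - X @ beta                    # residual for the current beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(d):
            # gradient-like quantity built from the partial residual for coordinate j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # coordinate-wise soft-thresholding update
            new_bj = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new_bj != beta[j]:
                resid -= X[:, j] * (new_bj - beta[j])
                max_change = max(max_change, abs(new_bj - beta[j]))
                beta[j] = new_bj
        if max_change < tol:
            break
    return beta
```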

user603
  • @wok Of note, the scikit.learn package also offers efficient implementation in Python for this kind of stuff. – chl Oct 12 '10 at 12:03
  • The coordinate descent algorithm is interesting. Thanks. Still thinking about it. – Wok Oct 12 '10 at 19:53
5

The LASSO loss function has a discontinuity in its gradient at zero along each axis, so IRLS is going to have problems with it. I have found a sequential minimal optimisation (SMO) type approach very effective, see e.g.

http://bioinformatics.oxfordjournals.org/content/19/17/2246

a version with MATLAB software is

http://bioinformatics.oxfordjournals.org/content/22/19/2348

the software is available here:

http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/

The basic idea is to optimise the coefficients one at a time, and to test whether you cross the discontinuity for each coefficient, which is easy as you are performing a scalar optimisation. It may sound slow, but it is actually pretty efficient (although I expect better algorithms have been developed since, probably by Keerthi or Chih-Jen Lin, who are both leading experts in that sort of thing).
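To make the per-coefficient test explicit (this is the standard subgradient optimality condition for an L1 penalty, stated here for illustration rather than taken from the linked papers): with the other coefficients held fixed, the scalar optimum sits exactly at the kink $\beta_i = 0$ whenever

$$\left|\,\frac{\partial L}{\partial \beta_{i}}\Big|_{\beta_{i}=0}\right| \leq \lambda,$$

where $L$ is the unpenalised negative log-likelihood; otherwise the one-dimensional optimisation continues on the side indicated by the sign of that derivative.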

Dikran Marsupial
  • Thanks. I am reading this and thinking about it. However, this would be a huge modification of the current algorithm. – Wok Oct 12 '10 at 19:53
4

You may check the paper Efficient L1-Regularized Logistic Regression, which describes an IRLS-based algorithm for the LASSO. Regarding the implementation, this link may be useful: http://ai.stanford.edu/~silee/softwares/irlslars.htm

0

The IRLS for the LASSO problem is as follows:

$$ \arg \min_{x} \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{1} = \arg \min_{x} \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda {x}^{T} W {x} $$

Where $ W $ is a diagonal matrix - $ {W}_{i, i} = \frac{1}{ \left| {x}_{i} \right| } $.
This comes from $ \left\| x \right\|_{1} = \sum_{i} \left| {x}_{i} \right| = \sum_{i} \frac{ {x}_{i}^{2} } { \left| {x}_{i} \right| } $.

Now, the above is just Tikhonov Regularization.
Yet, since $ W $ depends on $ x $, one must solve it iteratively. (This also cancels the factor of 2 from the Tikhonov term: the derivative of $ \lambda \left\| x \right\|_{1} $ with respect to $ x $ is $ \lambda \operatorname{sign} \left( x \right) = \lambda W x $, which is half of what differentiating $ \lambda {x}^{T} W x $ with $ W $ held constant would give, hence the update below uses $ \lambda W $ rather than $ 2 \lambda W $):

$$ {x}^{k + 1} = \left( {A}^{T} A + \lambda {W}^{k} \right)^{-1} {A}^{T} b $$

Where $ {W}_{i, i}^{k} = \frac{1}{ \left| {x}^{k}_{i} \right| } $.

Initialization can be by $ W = I $.

Pay attention that this doesn't work well for large values of $ \lambda $; you are better off using ADMM or Coordinate Descent.
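A minimal Python sketch of this reweighted-ridge iteration (the small $\epsilon$ guarding the division by $ \left| {x}_{i} \right| $ and the stopping rule are my additions for numerical safety, not part of the derivation above):

```python
import numpy as np

def irls_lasso(A, b, lam, n_iter=100, eps=1e-8):
    """IRLS for 0.5 * ||A x - b||^2 + lam * ||x||_1 via reweighted ridge:
    repeatedly solve (A^T A + lam * W) x = A^T b with W = diag(1 / |x_i|)."""
    m, n = A.shape
    AtA = A.T @ A
    Atb = A.T @ b
    x = np.linalg.solve(AtA + lam * np.eye(n), Atb)     # first pass uses W = I
    for _ in range(n_iter):
        W = np.diag(1.0 / np.maximum(np.abs(x), eps))   # eps guards 1/|x_i| near zero
        x_new = np.linalg.solve(AtA + lam * W, Atb)
        if np.max(np.abs(x_new - x)) < 1e-10:
            return x_new
        x = x_new
    return x
```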

Royi