
Given the Lasso problem $$ \min_\beta\ (Y-X\beta)^\top(Y-X\beta) \quad \text{s.t. } \|\beta\|_1\leq\lambda, $$ and assuming that $X$ is orthonormal so that $X^\top X=I$, we know that the closed-form solution (with $\lambda$ playing the role of the soft-threshold level in the equivalent penalized form) can be written as $$ \hat{\beta}^{Lasso} = \operatorname{sgn}(\hat{\beta}^{LS})(|\hat{\beta}^{LS}|-\lambda)_+, $$ where $\hat{\beta}^{LS}=X^\top Y$.
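For concreteness, here is a minimal numpy sketch of this closed form (the helper names are my own, and $\lambda$ is treated directly as the soft-threshold level):

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise soft-thresholding: sgn(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_orthonormal(X, Y, lam):
    """Closed-form Lasso when X^T X = I: soft-threshold the LS estimate X^T Y."""
    return soft_threshold(X.T @ Y, lam)
```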

However, even if $X$ is not orthonormal, we can use an SVD of $X^\top X$ to generate transformed data $\tilde{X}$ such that $\tilde{X}^\top \tilde{X} = I$, and apply the Lasso to the transformed data. In particular, consider the SVD (equivalently, the eigendecomposition, since $X^\top X$ is symmetric) $$ X^\top X= Q\Delta Q^\top $$ and the transformed matrix $\tilde{X}=XQ\Delta^{-1/2}$, so that $\tilde{X}^\top \tilde{X} = I$.
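Continuing the sketch above, the transformation could look like this (assuming $X^\top X$ has full rank so that $\Delta^{-1/2}$ exists):

```python
def orthonormalize(X):
    """Return Xtilde = X Q Delta^{-1/2} and A = Q Delta^{-1/2}, so that Xtilde.T @ Xtilde = I."""
    eigvals, Q = np.linalg.eigh(X.T @ X)   # X^T X = Q diag(eigvals) Q^T
    A = Q @ np.diag(eigvals ** -0.5)       # A = Q Delta^{-1/2}; requires eigvals > 0
    return X @ A, A
```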

Next, using the transformed data $(Y,\tilde{X})$, we get the orthonormalized estimate $\hat{\tilde{\beta}}^{Lasso} = \operatorname{sgn}(\hat{\tilde{\beta}}^{LS})(|\hat{\tilde{\beta}}^{LS}|-\lambda)_+$, where $\hat{\tilde{\beta}}^{LS} = \tilde{X}^\top Y$.

The non-orthonormalized estimate can be recovered as $\hat{\beta}^{Lasso} = Q\Delta^{-1/2}\hat{\tilde{\beta}}^{Lasso}$.
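Putting the pieces together, the proposed procedure would then be (again continuing the sketch above):

```python
def lasso_via_orthonormalization(X, Y, lam):
    """Orthonormalize, apply the closed-form Lasso, then map the estimate back."""
    Xtilde, A = orthonormalize(X)
    beta_tilde = soft_threshold(Xtilde.T @ Y, lam)   # closed form on transformed data
    return A @ beta_tilde                            # beta_hat = Q Delta^{-1/2} beta_tilde
```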

$\textbf{Question}$: This seems like a valid method of estimating the Lasso, and such an orthonormalization technique is used in the Group Lasso context (e.g., in these lecture notes and in Wei, Huang and Li (2011)). However, I am unable to find anything that discusses this as a way to exploit the closed form solution for the Lasso. I am thinking that perhaps the SVD of a large data matrix might be computationally expensive relative to numerical/iterative methods. Are there any reasons why this is not done?

  • How is it that you have $\hat{\beta}^{Lasso} = Q\Delta^{-1/2}\hat{\tilde{\beta}}^{Lasso}$? – Tim Mak Apr 08 '20 at 09:36
  • You're right, it should be the inverse of that. In other words, $\hat{\beta}^{Lasso}=(Q \Delta^{-1/2})^{-1} \hat{\tilde{\beta}}^{Lasso}$. I have edited the original post. – TheNekoMessiah Apr 08 '20 at 10:22
  • Still, I'm not sure where you get that... – Tim Mak Apr 08 '20 at 10:40
  • Okay, I worked it out again; I think I was correct the first time around. Let $Q\Delta^{-1/2} \equiv A$. The objective function $(Y-X\beta)^\top (Y-X\beta) = (Y-XAA^{-1}\beta)^\top (Y-XAA^{-1}\beta) \equiv (Y-\tilde{X}\gamma)^\top (Y-\tilde{X}\gamma)$. Solve the Lasso problem with $(Y,\tilde{X})$ to get $\hat{\gamma}$. Since $\gamma = A^{-1} \beta$, we have $A \hat{\gamma} = \hat{\beta}$. So, I think I was correct the first time round with: $\hat{\beta}^{Lasso} = Q\Delta^{-1/2} \hat{\tilde{\beta}}^{Lasso}$. – TheNekoMessiah Apr 08 '20 at 10:54
  • If you were doing OLS, that would be correct. But you are doing Lasso. The Lasso $\hat{\beta}$ does not have a closed form representation as far as I know. Perhaps I misunderstood you. You are not trying to do LASSO in another way, but proposing a new form of regularization? – Tim Mak Apr 09 '20 at 02:12
  • The closed form representation for the Lasso can be found [here](https://stats.stackexchange.com/questions/17781/derivation-of-closed-form-lasso-solution?rq=1) or on [Wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics)) (under the Orthonormal covariates section). I am not proposing anything new. We still have $\ell_1$ regularization. In fact, I don't think this method is particularly novel; as I have mentioned, it seems to be used in the Group Lasso context, which yields a very similar closed form solution post-orthogonalization. I was wondering why this is not used in Lasso. – TheNekoMessiah Apr 09 '20 at 02:43
  • Yes, but I thought you were interested in getting a Lasso estimate for the *general* case, not the orthonormal case, for which I don't think there's a closed form solution. What you are proposing is regularizing after an orthogonal transformation and back-transforming, which is fine if that's what you want to do. But it's not the same as doing plain Lasso. Have you tested out your method in a simulation? – Tim Mak Apr 09 '20 at 02:57

1 Answer


I found out the problem with the proposed estimation procedure thanks to suggestions from @Tim Mak.

The logic above essentially treats all of the variables as belonging to one and only one group, so the Lasso estimation above will pick either all of the variables or none of them. You can see this from the fact that the back-transformation $Q\Delta^{-1/2}$ mixes all of the components: if $\hat{\tilde{\beta}}^{Lasso}$ contains at least one non-zero element, $\hat{\beta}^{Lasso}$ will (generically) not have any zero elements, and if $\hat{\tilde{\beta}}^{Lasso} = \mathbf{0}$, then $\hat{\beta}^{Lasso}= \mathbf{0}$.
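Here is a quick, self-contained numerical illustration of this all-or-nothing behaviour (the simulated data and $\lambda = 5$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = 2.0 * X[:, 0] + rng.normal(size=100)          # only the first variable matters

# The procedure from the question: orthonormalize, soft-threshold, back-transform
eigvals, Q = np.linalg.eigh(X.T @ X)
A = Q @ np.diag(eigvals ** -0.5)
beta_tilde = (X @ A).T @ Y
beta_tilde = np.sign(beta_tilde) * np.maximum(np.abs(beta_tilde) - 5.0, 0.0)
beta_hat = A @ beta_tilde

print("zeros in beta_tilde:", np.sum(beta_tilde == 0))
print("zeros in beta_hat:  ", np.sum(beta_hat == 0))  # 0 whenever beta_tilde != 0: no selection
```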

The correct way to approach it would be to treat each variable as belonging to its own group, and to do what is done in the Group Lasso literature.

Let $J$ be the number of variables and $x_j$ be the data column of the $j$-th variable. Transform the column: $\tilde{x}_j = (\sqrt{x_j^\top x_j})^{-1} x_j$, so that $\tilde{x}_j^\top \tilde{x}_j = 1$.

Apply the coordinate descent algorithm in Yuan and Lin (2006) or Wei, Huang and Li (2011) to get $\hat{\tilde{\beta}}_j$, and apply the reverse transformation $\hat{\beta}^{Lasso}_j = (\sqrt{x_j^\top x_j})^{-1} \hat{\tilde{\beta}}_j$ for each variable $j$.
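A sketch of this per-variable rescaling; here sklearn's `Lasso` is only a stand-in for the coordinate descent solvers of those papers (its penalty parameterisation also differs from the $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Lasso   # stand-in for the coordinate-descent solver

def lasso_with_column_scaling(X, Y, lam):
    """Scale each column to unit norm, fit the Lasso, then rescale the coefficients back."""
    norms = np.sqrt(np.sum(X ** 2, axis=0))         # sqrt(x_j^T x_j) for each column
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X / norms, Y)
    return fit.coef_ / norms                        # beta_j = (x_j^T x_j)^{-1/2} * beta_tilde_j
```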

$\textbf{Conclusion}$: This still requires an iterative algorithm to calculate the (group) Lasso estimate, and hence it is not possible to exploit the closed form solution for the Lasso even with orthonormalization.