
If I have a design matrix $X\in\mathbb{R}^{n\times d}$, where $n$ is the number of observations of dimension $d$, what is the complexity of solving for $\hat{\beta}=\text{argmin}_{\beta}\frac{1}{2n} ||X\beta-y||^{2} + \lambda||\beta||_{1}$ with LASSO, with respect to $n$ and $d$? I think the answer should refer to how one LASSO iteration scales with these parameters, rather than how the number of iterations (convergence) scales, unless you feel otherwise.
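(For concreteness, here is a minimal sketch of my own, not tied to any particular solver, of evaluating the objective; the matrix-vector product $X\beta$ alone already costs $\mathcal{O}(nd)$, which is the kind of per-iteration scaling I mean.)

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """(1/(2n)) * ||X @ beta - y||^2 + lam * ||beta||_1.
    The dominant cost is the matrix-vector product X @ beta, which is O(n*d)."""
    n = X.shape[0]
    return 0.5 / n * np.sum((X @ beta - y) ** 2) + lam * np.sum(np.abs(beta))
```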

I have read this previous LASSO complexity question, but it seems at odds with the discussion about glmnet here and here. I am aware that there are many algorithms out there, including glmnet's GLM approach, but I am writing a paper about replacing a LASSO component in a parent algorithm and would like to include a discussion about LASSO complexity in general, especially with respect to $d$ and $n$. I would also like to know glmnet's complexity in the basic non-sparse case, but the referenced paper is a little confusing, as the complexity of the full algorithm is not stated explicitly.

rnoodle
    It's not clear why this answer https://stats.stackexchange.com/a/190717/28666 (in the thread that you linked to) does not answer your question. Can you elaborate? What is at odds with what? – amoeba May 22 '17 at 20:22
  • Page 6 in [pdf][1] states "Thus a complete cycle through all d variables costs $O(dn)$". However, the question you link to states $O(d^{2}n)$. Am I missing a loop here to get the $d^{2}$ complexity? [1]: https://www.jstatsoft.org/article/view/v033i01 – rnoodle May 24 '17 at 11:21
  • @amoeba The link you provide is for the LARS algorithm - I want to know about the GLM approach. – rnoodle Aug 09 '17 at 10:56
  • The references, $\mathcal{O}(d^2n)$ for least angle regression and $\mathcal{O}(dn)$ for coordinate descent, are correct. The difference is that (1) LARS finds an *exact* solution in $\mathcal{O}(d^2n)$ (and does so across the entire path of possible $\lambda$, at a cost comparable to solving the OLS problem for the full data, which also scales as $\mathcal{O}(d^2n)$), while (2) coordinate descent does "only" a single approximation step in $\mathcal{O}(dn)$, converging/'descending' closer to the minimum of the LASSO problem. LARS uses $d$ steps. With coordinate descent... nobody knows. – Sextus Empiricus Aug 09 '17 at 22:47

1 Answer


The answers from the references are correct:

  • $\mathcal{O}(d^2n)$ for least angle regression
  • $\mathcal{O}(dn)$ for coordinate descent


The difference is that

LARS equations are written in closed form, so LARS finds an exact solution

(and it does so across the entire path of possible $\lambda$, at a computational cost that scales the same as solving the ordinary least squares problem, which is also $\mathcal{O}(d^2n)$),

while

coordinate descent is an iterative scheme that approximates the solution. The step referred to above (whose computational cost scales as $\mathcal{O}(dn)$) is "only" a single approximation step, converging/'descending' closer to the minimum of the LASSO problem.
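To make the $\mathcal{O}(dn)$ concrete, here is a minimal NumPy sketch (my own illustration, not glmnet's actual Fortran implementation) of one full coordinate-descent cycle for the LASSO objective above: each of the $d$ coordinate updates costs $\mathcal{O}(n)$, so one cycle costs $\mathcal{O}(dn)$.

```python
import numpy as np

def soft_threshold(z, t):
    # closed-form solution of the one-dimensional LASSO subproblem
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_descent_cycle(X, y, beta, lam):
    """One full cycle over all d coordinates for
    (1/(2n)) * ||X beta - y||^2 + lam * ||beta||_1.
    Each update costs O(n), so the whole cycle costs O(d n).
    Assumes no all-zero columns in X."""
    n, d = X.shape
    residual = y - X @ beta          # O(n d) once, then maintained in O(n) per update
    col_sq = (X ** 2).sum(axis=0)    # ||x_j||^2 for all j, precomputed in O(n d)
    for j in range(d):
        # correlation of x_j with the partial residual that excludes coordinate j: O(n)
        rho = X[:, j] @ residual + col_sq[j] * beta[j]
        beta_j_new = soft_threshold(rho / n, lam) / (col_sq[j] / n)
        residual += X[:, j] * (beta[j] - beta_j_new)   # O(n) residual update
        beta[j] = beta_j_new
    return beta
```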


LARS uses (exactly) $d$ steps to find the solution (with the complexity of the $k$-th step scaling as $\mathcal{O}((d-k)n+k^2)$: the first term for computing the $d-k$ inner products with the inactive set, the second for solving for the new angle among the $k$ active variables). With coordinate descent, the convergence rate and the number of steps required/expected for 'sufficient' convergence are not really known (or at least have not been described well).
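(Summing these per-step costs over the $d$ steps gives $\sum_{k=1}^{d}\mathcal{O}\big((d-k)n+k^{2}\big)=\mathcal{O}(d^{2}n+d^{3})$, which reduces to the $\mathcal{O}(d^{2}n)$ quoted above whenever $n \gtrsim d$.)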

On the other hand, the cost $\mathcal{O}(d^2n)$ grows quickly in high dimensions (while there is no strong reason to expect that coordinate descent's convergence rate deteriorates in the same, linear, fashion as $d$ increases). So intuitively coordinate descent will perform better above a certain value of $d$. This has also been shown by case studies (see also the reference, which shows that glmnet mostly performs better than LARS when $d \gg 100$, while for $d = 100$ the two algorithms perform similarly).
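A rough way to probe that crossover empirically (a sketch only; it uses scikit-learn's coordinate-descent `Lasso` and `LassoLars` as stand-ins for glmnet and the original LARS code, with arbitrary synthetic data and an arbitrary fixed `alpha`):

```python
import time
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(0)
n = 500
for d in (50, 100, 500, 2000):
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    for Model in (Lasso, LassoLars):        # coordinate descent vs. LARS
        t0 = time.perf_counter()
        Model(alpha=0.1).fit(X, y)
        print(f"{Model.__name__:>9}  d={d:<5}  {time.perf_counter() - t0:.3f}s")
```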


Scaling LARS is a problem of computational complexity alone. Scaling coordinate descent is a problem involving both computational complexity and convergence.

Sextus Empiricus