
The BFGS and L-BFGS algorithms are often used as optimization methods for non-linear machine learning problems, such as training neural networks via backpropagation or fitting logistic regression.

My question is: why aren't they used for everything that gradient descent is even remotely related to, LINEAR regression for example?

Is it simply a matter of overkill, an unnecessary piece of machinery in this case (even though it would still theoretically improve training time)? Or is there an actual compatibility issue, where BFGS and L-BFGS need non-linearity in order to work?

AlanSTACK

  • Closely related, possibly not a duplicate: http://stats.stackexchange.com/questions/160179/do-we-need-gradient-descent-to-find-the-coefficients-of-a-linear-regression-mode/164164#164164 – Sycorax Jan 19 '16 at 22:23

1 Answer


With linear regression, BFGS and L-BFGS would be a major step backwards. That's because the solution can be written directly as

$\hat \beta = (X^T X)^{-1} X^T Y$

It's worth noting that using the above equation directly to calculate $\hat \beta$ (i.e. inverting $X^T X$ and then multiplying by $X^T Y$) is itself a poor way to compute $\hat \beta$ numerically.
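
For concreteness, here is a minimal NumPy sketch (the data, sizes, and seed are invented purely for illustration) comparing the explicit inverse with approaches that avoid forming it:

```python
import numpy as np

# Toy data, invented for illustration only.
rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Textbook formula: explicitly invert X^T X (numerically the worst option).
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Better: solve the normal equations as a linear system, no explicit inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Best in practice: a dedicated least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))    # True (same solution, different numerics)
print(np.allclose(beta_solve, beta_lstsq))  # True
```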

Also, gradient descent is only recommended for linear regression in extremely special cases, so I wouldn't say gradient descent is "related" to linear regression.
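
To see why an iterative method is a step backwards here, a sketch (same caveat: invented data, and the step size is an assumption chosen for this example) of plain gradient descent on the least-squares objective; it eventually reaches the same $\hat \beta$ that a direct solve gives in one shot:

```python
import numpy as np

# Toy data, invented for illustration only.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Direct least-squares solution, for comparison.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on (1/2n) * ||y - X beta||^2.
beta = np.zeros(p)
step = 0.1  # assumed step size; must stay below 2/L, where L is the largest
            # eigenvalue of X^T X / n, for the iteration to converge
for _ in range(5000):
    grad = X.T @ (X @ beta - y) / n
    beta -= step * grad

# Thousands of iterations to match what one factorization gives directly.
print(np.max(np.abs(beta - beta_ls)))
```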

Cliff AB
  • Sorry, I think it was poor phrasing on my part. I know it would be a major step back computationally, but my main concern is whether or not it would still work (for educational purposes and semantic understanding). The reason no one uses it is that it is redundant to the point of being pointless, right? Not due to some underlying mathematical incompatibility? – AlanSTACK Jan 19 '16 at 22:27
  • @user3810748: yes, it would work. – Cliff AB Jan 19 '16 at 22:31
  • Also, can you elaborate on what you mean by "Also, gradient descent is only recommended for linear regression in extremely special cases, so I wouldn't say gradient descent is 'related' to linear regression"? Do you mean that linear regression is weak compared to other, better models, or do you mean that gradient descent should not be used to train linear regression? (I'm guessing the former, given the strangeness of the latter.) – AlanSTACK Jan 19 '16 at 22:33
  • @user3810748: gradient descent is a generic algorithm for finding a local critical point of a function (hopefully the minimum!). It can be used in the specific case of linear regression, but except for very extreme cases, it is relatively **very** inefficient in terms of total required computation time. – Cliff AB Jan 19 '16 at 22:53
  • @CliffAB are there experiments/literature regarding the generally worse performance of gradient descent versus the analytical solution? I think that the $(X^T X)^{-1}$ operation could be very costly for moderately large $X$, and in those cases L-BFGS could perhaps be faster. – Jacques Wainer Feb 03 '16 at 06:30
  • @JacquesWainer: first off, as noted in the original answer, even though $(X^TX)^{-1} X^TY$ is the solution, it can be found *without* explicitly calculating $(X^TX)^{-1}$, although the solution still has the same complexity. Also, to be clear, "large $X$" must refer to a very large number of predictors rather than number of samples (complexity grows linearly with $n$). I'm not aware of any literature evaluating the efficiency of sub-optimal solutions. However, I can tell you right now it will be very dependent on the covariance of the predictors: for uncorrelated predictors, SGD... – Cliff AB Feb 03 '16 at 20:22
  • ...and moreover L-BFGS will do relatively well (since ignoring partials will not be so bad), while with heavy correlation they will do very poorly. Finding out the correlation of the predictors will take as much time as finding the LS solution directly. – Cliff AB Feb 03 '16 at 20:24
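
Putting the two points from these comments together, here is a sketch (with made-up, deliberately correlated predictors; it assumes SciPy's `scipy.optimize.minimize` with `method="L-BFGS-B"`): L-BFGS does recover the ordinary least-squares solution, applied to exactly the kind of correlated design the comments flag as hard for gradient-based methods:

```python
import numpy as np
from scipy.optimize import minimize

# Toy correlated design, invented for illustration: each predictor shares a
# common factor Z, so pairwise correlations are roughly rho.
rng = np.random.default_rng(2)
n, p, rho = 500, 10, 0.95
Z = rng.normal(size=(n, 1))
X = np.sqrt(rho) * Z + np.sqrt(1 - rho) * rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def loss(beta):
    r = X @ beta - y
    return 0.5 * (r @ r) / n

def grad(beta):
    return X.T @ (X @ beta - y) / n

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)               # direct LS solution
res = minimize(loss, np.zeros(p), jac=grad, method="L-BFGS-B")  # quasi-Newton fit

print(np.max(np.abs(res.x - beta_ls)))  # close to zero: L-BFGS reaches the OLS solution
print(res.nit)  # iterations used; per the comments above, correlated predictors
                # make gradient-based methods work harder than uncorrelated ones
```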