
Under which conditions is one of the optimization methods offered by VW (SGD, bfgs with/without conjugate gradient, FTRL) expected to be better than others? I am mainly interested in regression and classification problems. Any references would be extremely valuable.

user90772
  • I am not familiar with any of this. It does not seem to relate to topics covered on this site (Cross Validated). – Michael R. Chernick Mar 31 '17 at 11:04
  • I'm voting to close this question as off-topic because the topic is not familiar and seems to be totally unrelated to our site. – Michael R. Chernick Mar 31 '17 at 11:06
  • I think this is a general question regarding optimization methods in regression for online learning models and I found a tag on vowpal wabbit. But I am not an expert so I respect your vote. – user90772 Mar 31 '17 at 11:21
  • @Michael Chernick this is a machine learning question, which according to [this discussion](https://meta.stackexchange.com/questions/130524/which-stack-exchange-website-for-machine-learning-and-computational-algorithms) is best suited to Cross Validated compared to other stack exchange sites. – AaronDefazio Mar 31 '17 at 12:48

1 Answer


In general, the default SGD method will suffice in almost all cases. The LBFGS method is more of a backup: it is mainly there for when you want a high-accuracy solution, typically to help debug issues with SGD. It also provides a useful baseline for comparing methods against.

The stochasticity in SGD helps regularize the solution, so if you use a batch method (LBFGS/CG) you need to be more careful with the L2 regularizer you use. I believe L1 regularization is only supported with the stochastic methods as well. Performance-wise, the LBFGS implementation will take many more epochs to converge than SGD; depending on your data size, this may or may not be a problem. I've used the SGD option for datasets of 100GB+ without issue, which is not practical with LBFGS.
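For concreteness, the optimizers are chosen via flags on the `vw` command line. A rough sketch of typical invocations for a binary classification problem (file names and regularizer values are placeholder assumptions, not recommendations):

```shell
# Default online SGD (single pass over the data unless --passes is given)
vw -d train.vw -f model.sgd --loss_function logistic --l2 1e-6

# L-BFGS batch optimizer: needs a cache file (-c) and multiple passes;
# batch methods lack SGD's implicit regularization, so tune --l2 with care
vw -d train.vw -f model.bfgs --loss_function logistic --bfgs --passes 20 -c --l2 1e-5

# FTRL-proximal, often used for sparse problems; supports L1 regularization
vw -d train.vw -f model.ftrl --loss_function logistic --ftrl --l1 1e-6
```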

The FTRL implementation (technically FTRL-proximal; plain FTRL reduces to SGD in this setting) is mainly there for comparison purposes, so the developers could test their SGD implementation against it. I haven't done a direct comparison, but I believe it's not hugely different in terms of performance. You may see some difference on sparse problems.
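To illustrate why FTRL-proximal suits sparse problems, here is a minimal per-coordinate sketch of the update (following McMahan et al.'s published rules for logistic loss). This is an illustration of the algorithm, not VW's actual implementation; the function name and hyperparameter defaults are my own. Note how the L1 threshold keeps weights at exactly zero for uninformative features:

```python
import math

def ftrl_train(data, alpha=0.1, beta=1.0, l1=1.0, l2=0.1, epochs=10):
    """FTRL-Proximal logistic regression on sparse data.

    data: list of (features, label) pairs, where features is a sparse
    {index: value} dict and label is 0 or 1.
    """
    z, n = {}, {}  # per-coordinate accumulated state

    def weight(i):
        # Lazy closed-form weight: exactly 0 while |z_i| <= l1 (sparsity)
        zi = z.get(i, 0.0)
        if abs(zi) <= l1:
            return 0.0
        sign = 1.0 if zi > 0 else -1.0
        return -(zi - sign * l1) / ((beta + math.sqrt(n.get(i, 0.0))) / alpha + l2)

    for _ in range(epochs):
        for x, y in data:
            # Predict with sigmoid of the sparse dot product (clamped for safety)
            wx = sum(weight(i) * v for i, v in x.items())
            p = 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))
            for i, v in x.items():
                g = (p - y) * v  # gradient of log loss w.r.t. w_i
                ni = n.get(i, 0.0)
                sigma = (math.sqrt(ni + g * g) - math.sqrt(ni)) / alpha
                z[i] = z.get(i, 0.0) + g - sigma * weight(i)
                n[i] = ni + g * g
    return {i: weight(i) for i in z}
```

Only coordinates present in an example are touched per update, so the cost per step scales with the number of nonzero features rather than the total dimension.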

AaronDefazio
  • Can it be that FTRL-proximal is better for sparse problems (high-dimensional categorical predictors) on small (p > n) datasets (on the order of 500-1000 records with 10K dummy variables)? I noticed this in some data (better than batch models) but I could not intuitively understand why – user90772 Apr 05 '17 at 00:06
  • FTRL is designed explicitly to handle sparse problems, so that would make sense. The VW SGD implementation is also designed for sparse problems, but perhaps more for fast run-time than model performance. It's hard to say from the documentation available. The difference shouldn't be large either way. – AaronDefazio Apr 05 '17 at 01:57