4

I have written two solvers for training neural networks: one is based on stochastic gradient descent (SGD), while the other is based on the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm.

I have read a lot of material and find that it is common to use SGD rather than BFGS, but in my experiments BFGS performs better than SGD.

Can anyone tell me why people prefer SGD to BFGS?
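For concreteness, here is a minimal sketch of the kind of comparison I am describing (illustrative only, not my actual solvers): a tiny one-hidden-layer network fit once with full-batch BFGS via `scipy.optimize.minimize` and once with a plain SGD loop. The data, architecture, and learning rate are placeholders.

```python
# Illustrative only: a tiny one-hidden-layer network trained two ways,
# full-batch BFGS via scipy.optimize versus a plain SGD loop.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # simple nonlinear target

def unpack(w):
    W1, b1 = w[:10].reshape(2, 5), w[10:15]
    W2, b2 = w[15:20].reshape(5, 1), w[20:21]
    return W1, b1, W2, b2

def loss(w, Xb, yb):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(Xb @ W1 + b1)                   # smooth activation
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2).ravel()))
    return -np.mean(yb * np.log(p + 1e-9) + (1 - yb) * np.log(1 - p + 1e-9))

w0 = rng.normal(scale=0.1, size=21)

# Quasi-Newton: one minimize() call that sees the full data set each iteration.
res = minimize(loss, w0, args=(X, y), method="BFGS")
print("BFGS final loss:", res.fun)

# Plain SGD on minibatches, using finite-difference gradients for brevity
# (real code would use backpropagation).
w, lr, eps = w0.copy(), 0.5, 1e-5
for step in range(2000):
    idx = rng.integers(0, len(X), size=32)
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e, X[idx], y[idx]) - loss(w - e, X[idx], y[idx])) / (2 * eps)
    w -= lr * g
print("SGD final loss: ", loss(w, X, y))
```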

Glen_b
maple

2 Answers

3

Neural networks are successful when you have huge training sets. In that setting, training time is the main bottleneck, and SGD is much faster than batch methods (and, unlike BFGS, needs essentially no extra memory); see the papers of Léon Bottou. So I think you are seeing good performance on a toy problem, which is not where neural nets excel.
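To put rough numbers on the memory point (illustrative figures only, not from the question): BFGS maintains a dense approximation to the inverse Hessian, which grows quadratically with the number of parameters, while SGD only needs the parameter and gradient vectors.

```python
# Back-of-the-envelope memory comparison (illustrative parameter count).
n_params = 1_000_000                            # a modest fully connected network
bytes_per_float = 8

bfgs_bytes = n_params ** 2 * bytes_per_float    # dense inverse-Hessian approximation
sgd_bytes = 2 * n_params * bytes_per_float      # parameters + gradient

print(f"BFGS inverse Hessian: {bfgs_bytes / 1e12:.0f} TB")   # ~8 TB
print(f"SGD working memory:   {sgd_bytes / 1e6:.0f} MB")     # ~16 MB
```

Limited-memory variants such as L-BFGS avoid storing the dense matrix, but each iteration still requires a pass over the full training set, which is where SGD's per-step cost advantage comes from.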

seanv507
  • How about replacing gradient descent with BFGS in the optimization of each minibatch? And in fact, my training set is 20 GB, so I think it is big enough. – maple Aug 27 '15 at 07:37
0

Do all of the nodes in your network perform smooth operations? One reason for preferring a gradient-based method over a quasi-Newton method is non-differentiability, which occurs with many common activation functions such as ReLU, and also with L1 regularization.
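A quick way to see the kink (a hypothetical check, not code from this answer): the numerical derivative of ReLU jumps from 0 to 1 at the origin, so the smoothness that BFGS's curvature updates assume does not hold there.

```python
# ReLU's derivative jumps at 0; a central-difference estimate makes the kink visible.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

xs = np.array([-1e-3, 0.0, 1e-3])
h = 1e-8
print((relu(xs + h) - relu(xs - h)) / (2 * h))   # [0.  0.5 1. ]
# The 0.5 at the origin is an artifact of the kink: the true derivative is
# undefined there, which is the non-differentiability mentioned above.
```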

Also, I believe there are arguments against using quasi-Newton methods in online training, but I don't know enough about that to say more.

jlimahaverford