In recent years, people have been training huge neural networks with millions of parameters. I have seen many discussions of gradient-based training, but not much about Newton's method or quasi-Newton methods.
Is it true that Newton's method and quasi-Newton methods are not widely used in deep neural network training?
Is this because the Hessian is too large, so that even an approximation of it, such as BFGS, would not work, whereas the gradient can still be approximated cheaply in other ways (e.g., with stochastic mini-batch estimates)?
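To make the "Hessian is too large" intuition concrete, here is a back-of-the-envelope sketch I put together (my own illustration, not from any particular paper) comparing the memory needed to store a dense Hessian versus a gradient for a network with n parameters:

```python
def dense_hessian_bytes(n_params, bytes_per_float=4):
    """Memory for the full n x n Hessian (float32 by default)."""
    return n_params * n_params * bytes_per_float

def gradient_bytes(n_params, bytes_per_float=4):
    """Memory for the gradient vector (float32 by default)."""
    return n_params * bytes_per_float

# A modest network by modern standards: 10 million parameters.
n = 10_000_000
print(f"gradient: {gradient_bytes(n) / 1e9:.2f} GB")       # 0.04 GB
print(f"Hessian:  {dense_hessian_bytes(n) / 1e12:.0f} TB")  # 400 TB
```

So the gradient fits easily in memory while the dense Hessian (and even a dense BFGS approximation, which has the same n x n footprint) does not, which is presumably part of the answer. Limited-memory variants like L-BFGS avoid storing the full matrix, so the storage argument alone may not fully explain their absence.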
Are there any review papers on the optimization methods used in deep neural network training?