Questions tagged [stochastic-gradient-descent]

184 questions
139
votes
5 answers

Batch gradient descent versus stochastic gradient descent

Suppose we have some training set $(x^{(i)}, y^{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}_1 +…
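As a rough illustration of the two update rules this question contrasts, here is a minimal NumPy sketch for the linear hypothesis above; the function names and data layout are hypothetical, not from the question.

```python
import numpy as np

def predictions(X, theta):
    # h_theta(x) = theta_0 + theta_1*x_1 + ..., with a leading column of ones in X
    return X @ theta

def batch_gd_step(X, y, theta, lr):
    # Batch gradient descent: one update from the gradient over all m examples
    m = len(y)
    grad = X.T @ (predictions(X, theta) - y) / m
    return theta - lr * grad

def sgd_epoch(X, y, theta, lr):
    # Stochastic gradient descent: one update per training example, in shuffled order
    for i in np.random.permutation(len(y)):
        x_i, y_i = X[i], y[i]
        grad_i = (x_i @ theta - y_i) * x_i
        theta = theta - lr * grad_i
    return theta
```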
43
votes
2 answers

Who invented stochastic gradient descent?

I'm trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847 (Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538). For more information…
32
votes
4 answers

How does batch size affect convergence of SGD and why?

I've seen a similar conclusion in many discussions: as the minibatch size gets larger, the convergence of SGD actually gets harder/worse, for example this paper and this answer. I've also heard of people using tricks like small learning rates or…
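For concreteness, here is a minimal mini-batch loop in which the batch size is the knob under discussion; the learning-rate scaling in the comment is one commonly cited heuristic, not a guarantee, and all names are placeholders.

```python
import numpy as np

def minibatch_sgd(X, y, theta, base_lr, batch_size, epochs):
    # Larger batches give lower-variance gradient estimates; a common heuristic
    # (by no means universal) is to scale the learning rate with the batch size.
    lr = base_lr * batch_size / 32
    m = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta = theta - lr * grad
    return theta
```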
30
votes
6 answers

For convex problems, does gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value?

Given a convex cost function, using SGD for optimization, we will have a gradient (vector) at a certain point during the optimization process. My question is, given a point on the convex surface, does the gradient only point in the direction in which the…
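A small worked example (my own, not from the question) shows that even the full gradient of a convex function need not point at the minimizer:

```latex
% f is convex with its global minimum at the origin, yet -\nabla f(1,1) is not
% directed towards (0,0):
\[
  f(x_1, x_2) = x_1^2 + 10\,x_2^2, \qquad
  \nabla f(1,1) = (2,\ 20),
\]
\[
  -\nabla f(1,1) = (-2,\ -20) \;\not\parallel\; (0,0) - (1,1) = (-1,\ -1).
\]
% The negative gradient is always a descent direction, but it only points "at"
% the minimizer when the level sets are spherical (or in one dimension).
```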
27
votes
2 answers

How could stochastic gradient descent save time compared to standard gradient descent?

Standard gradient descent would compute the gradient for the entire training dataset:

    for i in range(nb_epochs):
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad

For a pre-defined number of…
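For contrast, the stochastic version of the loop above updates the parameters once per example, so each step costs a single-example gradient rather than a full pass over the data. The names mirror the snippet in the excerpt and are placeholders; this is a sketch, not a definitive implementation.

```python
import numpy as np

def sgd(data, params, loss_function, evaluate_gradient, learning_rate, nb_epochs):
    # Stochastic gradient descent: one cheap, single-example gradient per update,
    # instead of one gradient over all of `data` per update as in batch GD.
    for _ in range(nb_epochs):
        np.random.shuffle(data)          # visit examples in a fresh random order
        for example in data:
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad
    return params
```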
22
votes
2 answers

Why are second-order SGD convergence methods unpopular for deep learning?

It seems that, especially for deep learning, very simple methods for optimizing SGD convergence, like ADAM, dominate - nice overview: http://ruder.io/optimizing-gradient-descent/ They trace only a single direction - discarding information…
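As a one-line illustration of what "tracing only a single direction" discards (my summary, not the question's), compare the first-order and Newton updates for parameters $\theta \in \mathbb{R}^d$:

```latex
% First-order (SGD-style) step: only the gradient direction is used.
\[
  \theta_{t+1} = \theta_t - \eta\, \nabla f(\theta_t)
\]
% Newton step: curvature (the Hessian H) reshapes the step, but storing H costs
% O(d^2) memory and solving the linear system up to O(d^3) time, which is
% prohibitive when d is in the millions, as in deep networks.
\[
  \theta_{t+1} = \theta_t - \eta\, H(\theta_t)^{-1} \nabla f(\theta_t)
\]
```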
19
votes
1 answer

RMSProp and Adam vs SGD

I am performing experiments on the EMNIST validation set using networks with RMSProp, Adam and SGD. I am achieving 87% accuracy with SGD (learning rate of 0.1) and dropout (0.1 dropout prob) as well as L2 regularisation (1e-05 penalty). When testing…
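A minimal Keras sketch of the kind of setup described: the architecture is a placeholder of my own, while the learning rate, dropout probability and L2 penalty are the values quoted in the question.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(optimizer):
    # Placeholder classifier; only the regularisation settings follow the question.
    model = tf.keras.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-5)),
        layers.Dropout(0.1),
        layers.Dense(47, activation="softmax"),  # 47 classes for EMNIST Balanced; other splits differ
    ])
    model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The three optimizers being compared:
sgd_model = build_model(tf.keras.optimizers.SGD(learning_rate=0.1))
adam_model = build_model(tf.keras.optimizers.Adam())
rmsprop_model = build_model(tf.keras.optimizers.RMSprop())
```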
17
votes
4 answers

How can it be trapped in a saddle point?

I am currently a bit puzzled by how mini-batch gradient descent can be trapped in a saddle point. The solution might be so trivial that I don't get it. You get a new sample every epoch, and it computes a new error based on a new batch, so the…
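A standard toy case (not from the question) makes the trapping mechanism concrete:

```latex
% f has a saddle at the origin. Starting anywhere on the line x_2 = 0, the
% gradient's second component is exactly zero, so full-batch gradient descent
% with a suitable step size never leaves that line and converges to the saddle (0,0).
\[
  f(x_1, x_2) = x_1^2 - x_2^2, \qquad
  \nabla f(x_1, 0) = (2x_1,\ 0).
\]
% With mini-batches the gradient estimate is noisy, which usually perturbs x_2
% away from 0 and lets the iterate escape; getting trapped requires the noise to
% vanish along the escape direction as well.
```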
15
votes
2 answers

How to set mini-batch size in SGD in keras

I am new to Keras and need your help. I am training a neural net in Keras and my loss function is the squared difference between the net's output and the target value. I want to optimize this using gradient descent. After going through some links on the net, I have…
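A minimal sketch of how the mini-batch size is typically passed in Keras, assuming a squared-error loss as in the question; the model and data here are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder regression model; only the optimizer, loss and batch_size matter here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mse")                      # squared difference to the target

x = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

# The mini-batch size is set directly in fit(); 32 examples per gradient step here.
model.fit(x, y, batch_size=32, epochs=5, verbose=0)
```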
13
votes
1 answer

How to choose between SGD with Nesterov momentum and Adam?

I'm currently implementing a neural network architecture in Keras. I would like to optimize the training time, and I'm considering using alternative optimizers such as SGD with Nesterov momentum and Adam. I've read several things about the pros and…
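For reference, the two candidates are constructed in Keras roughly like this; the learning-rate and momentum values are common defaults, not recommendations from the question.

```python
import tensorflow as tf

# SGD with Nesterov momentum: usually needs some learning-rate tuning/decay.
sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)

# Adam: per-parameter adaptive steps, often usable with its defaults out of the box.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

# Either can then be passed to model.compile(optimizer=...).
```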
12
votes
3 answers

Gradient descent on non-convex functions

What situations do we know of where gradient descent can be shown to converge (either to a critical point or to a local/global minimum) for non-convex functions? For SGD on non-convex functions, one kind of proof has been reviewed here,…
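For the deterministic full-gradient case, the standard statement that such proofs generalise is the first-order stationarity rate for an L-smooth (possibly non-convex) objective, sketched here under the usual assumptions:

```latex
% Assume f is L-smooth, bounded below by f^*, and the step size satisfies
% \alpha \le 1/L. The descent lemma gives
% f(x_{t+1}) \le f(x_t) - (\alpha/2)\,\|\nabla f(x_t)\|^2;
% summing over t = 0, ..., T-1 and telescoping yields
\[
  \min_{0 \le t < T} \|\nabla f(x_t)\|^2
  \;\le\; \frac{2\big(f(x_0) - f^*\big)}{\alpha\, T},
\]
% i.e. gradient descent finds an approximate critical point, with no claim that
% it is a local or global minimum. SGD versions add variance terms from the
% stochastic gradients.
```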
10
votes
1 answer

What is the difference between VAE and Stochastic Backpropagation for Deep Generative Models?

What is the difference between Auto-encoding Variational Bayes and Stochastic Backpropagation for Deep Generative Models? Does inference in both methods lead to the same results? I'm not aware of any explicit comparisons between the two methods,…
8
votes
1 answer

Does Keras SGD optimizer implement batch, mini-batch, or stochastic gradient descent?

I am a newbie in Deep Learning libraries and thus decided to go with Keras. While implementing a NN model, I saw the batch_size parameter in model.fit(). Now, I was wondering if I use the SGD optimizer, and then set the batch_size = 1, m and b,…
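A sketch of how the three variants fall out of batch_size alone, on my reading of the Keras behaviour: the SGD optimizer simply applies whatever gradient the batch provides. The model and data are placeholders.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

model.fit(x, y, batch_size=1, epochs=1, verbose=0)        # "true" stochastic GD
model.fit(x, y, batch_size=32, epochs=1, verbose=0)       # mini-batch GD (the default size)
model.fit(x, y, batch_size=len(x), epochs=1, verbose=0)   # full-batch GD
```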
8
votes
2 answers

Dealing with small batch size in SGD training

I am trying to train a large model (deep net using caffe) using stochastic gradient descent (SGD). The problem is I am constrained by my GPU memory capacity and thus cannot process large mini-batches for each stochastic gradient estimation. How can I…
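One common workaround for this situation is gradient accumulation: sum the gradients of several small micro-batches and apply a single update, which reproduces a larger effective batch without the memory cost (Caffe exposes this as the solver's iter_size parameter, if memory serves). A framework-agnostic sketch, with hypothetical names:

```python
import numpy as np

def accumulated_sgd_step(micro_batches, params, grad_fn, learning_rate):
    # Average the gradients of several small micro-batches before updating, so a
    # GPU that only fits small batches still takes a "large effective batch" step.
    total_grad = np.zeros_like(params)
    for xb, yb in micro_batches:
        total_grad += grad_fn(xb, yb, params)
    total_grad /= len(micro_batches)
    return params - learning_rate * total_grad
```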
8
votes
2 answers

Comparison of SGD and ALS in collaborative filtering

Matrix factorization is widely applied in collaborative filtering, and briefly speaking, it tries to learn the following parameters: $$\min_{q_u,p_i}\sum_{\{u,i\}}(r_{ui} - q_u^Tp_i)^2$$ And we could apply SGD and ALS as the learning algorithm,…
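For the objective above, one SGD pass looks roughly like this (a NumPy sketch of my own; the regularisation term is omitted to match the formula as quoted, and the factor of 2 from the squared error is absorbed into the learning rate):

```python
import numpy as np

def sgd_epoch(ratings, P, Q, lr):
    """One SGD pass over the observed ratings.

    ratings: list of (u, i, r_ui) triples
    Q: user factors, shape (n_users, k); P: item factors, shape (n_items, k)
    Minimises sum_{u,i} (r_ui - q_u^T p_i)^2 by updating one (q_u, p_i) pair per rating.
    """
    for u, i, r in ratings:
        err = r - Q[u] @ P[i]          # residual r_ui - q_u^T p_i
        q_u = Q[u].copy()              # keep the old q_u for the p_i update
        Q[u] += lr * err * P[i]        # gradient step on q_u
        P[i] += lr * err * q_u         # gradient step on p_i
    return P, Q
```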