Questions tagged [stochastic-gradient-descent]
184 questions
139
votes
5 answers
Batch gradient descent versus stochastic gradient descent
Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…

user20616
- 1,431
- 3
- 11
- 7
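A minimal sketch of the distinction this question asks about, using a linear hypothesis and squared loss; X, y, theta, and the learning rate lr are placeholders, not taken from the question itself:

import numpy as np

def batch_gradient_step(theta, X, y, lr):
    """One batch GD step: the gradient is computed over all m training examples."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m           # full-dataset gradient
    return theta - lr * grad

def stochastic_gradient_epoch(theta, X, y, lr):
    """One SGD epoch: the parameters are updated after every single example."""
    for i in np.random.permutation(len(y)):
        grad_i = X[i] * (X[i] @ theta - y[i])  # gradient of one example's loss
        theta = theta - lr * grad_i
    return theta

Batch GD takes one accurate step per pass over the data; SGD takes m noisy steps per pass.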
43
votes
2 answers
Who invented stochastic gradient descent?
I'm trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847 (Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538). For more information…

DaL
- 4,462
- 3
- 16
- 27
32
votes
4 answers
How does batch size affect convergence of SGD and why?
I've seen a similar conclusion in many discussions: as the minibatch size gets larger, the convergence of SGD actually gets harder/worse, for example in this paper and this answer. Also I've heard of people using tricks like small learning rates or…

dontloo
- 13,692
- 7
- 51
- 80
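For reference, a hedged sketch of how the mini-batch size enters the update: the gradient is averaged over batch_size examples, so larger batches reduce gradient noise while each epoch takes fewer (but more expensive) steps. X, y, grad_fn, and lr are placeholders:

import numpy as np

def minibatch_sgd_epoch(theta, X, y, grad_fn, lr, batch_size):
    """One epoch of mini-batch SGD; grad_fn(theta, X_b, y_b) returns the
    gradient averaged over the mini-batch."""
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta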
30
votes
6 answers
For convex problems, does the gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value?
Given a convex cost function, using SGD for optimization, we will have a gradient (vector) at a certain point during the optimization process.
My question is, given a point on the convex function, does the gradient only point in the direction in which the…

CyberPlayerOne
- 2,009
- 3
- 22
- 30
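A one-dimensional worked example (not from the question itself) illustrating why the answer is "no" for individual stochastic gradients: take the convex objective
$$f(\theta) = \tfrac{1}{2}\big[(\theta - 1)^2 + (\theta + 3)^2\big],$$
whose global minimum is at $\theta^{*} = -1$. At $\theta = 0$ the full gradient is $f'(0) = (0-1) + (0+3) = 2$, so a full gradient step moves toward the minimum, but the single-sample gradient from the first term alone is $(0-1) = -1$, so a step on that sample moves away from it. The stochastic gradient points toward the global minimum only in expectation.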
27
votes
2 answers
How could stochastic gradient descent save time compared to standard gradient descent?
Standard gradient descent would compute the gradient for the entire training dataset:
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a pre-defined number of…

Alina
- 915
- 2
- 10
- 21
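For contrast with the full-batch loop in the excerpt, a sketch of the per-example (stochastic) variant; evaluate_gradient, nb_epochs, data, params, loss_function, and learning_rate follow the same hypothetical pseudocode conventions as the question:

import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)                    # visit the examples in a fresh order each epoch
    for example in data:
        # gradient of the loss on a single example instead of the whole dataset
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Each cheap single-example step is a noisy estimate of the full gradient, which is where the time saving over standard gradient descent comes from.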
22
votes
2 answers
Why are second-order SGD convergence methods unpopular for deep learning?
It seems that, especially for deep learning, very simple methods for optimizing SGD convergence, like ADAM, dominate; a nice overview: http://ruder.io/optimizing-gradient-descent/
They trace only a single direction, discarding information…

Jarek Duda
- 331
- 2
- 14
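For context, the canonical second-order update that the question contrasts with Adam-style first-order methods is the Newton step
$$\theta \leftarrow \theta - \eta\, H^{-1} \nabla_\theta L(\theta), \qquad H = \nabla^2_\theta L(\theta),$$
where the Hessian $H$ has $n^2$ entries for $n$ parameters. For a network with millions of parameters, even storing $H$, let alone inverting it, is impractical, which is the usual cost argument against exact second-order methods in deep learning.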
19
votes
1 answer
RMSProp and Adam vs SGD
I am performing experiments on the EMNIST validation set using networks with RMSProp, Adam and SGD. I am achieving 87% accuracy with SGD (learning rate 0.1) and dropout (0.1 dropout prob) as well as L2 regularisation (1e-05 penalty). When testing…

Alk
- 291
- 1
- 2
- 3
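A hedged sketch of how the three optimizers being compared might be configured in Keras. The architecture, layer sizes, and num_classes below are assumptions for illustration; only the SGD learning rate, dropout probability, and L2 penalty come from the excerpt:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

num_classes = 47  # assumption: e.g. the EMNIST Balanced split; adjust for other splits

model = keras.Sequential([
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-05)),  # L2 penalty from the question
    layers.Dropout(0.1),                                      # dropout prob from the question
    layers.Dense(num_classes, activation="softmax"),
])

# The three optimizers being compared; only the SGD learning rate is given in the excerpt.
sgd = keras.optimizers.SGD(learning_rate=0.1)
rmsprop = keras.optimizers.RMSprop()
adam = keras.optimizers.Adam()

model.compile(optimizer=sgd, loss="sparse_categorical_crossentropy", metrics=["accuracy"])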
17
votes
4 answers
How can it be trapped in a saddle point?
I am currently a bit puzzled by how mini-batch gradient descent can be trapped in a saddle point.
The solution might be so trivial that I don't get it.
You get a new sample every epoch, and it computes a new error based on a new batch, so the…

Fixining_ranges
- 171
- 1
- 1
- 5
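A standard two-dimensional example (not from the question) of the geometry involved: $f(x, y) = x^2 - y^2$ has a saddle at the origin, where the gradient is exactly zero, so a plain gradient step does not move at all; only gradient noise or a perturbation off the $y = 0$ axis lets the iterate escape downhill.

import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    return np.array([2 * x, -2 * y])

theta = np.array([0.0, 0.0])   # exactly at the saddle
print(grad_f(*theta))          # [0. 0.]      -> the update stalls here
print(grad_f(0.0, 1e-3))       # [0. -0.002]  -> any perturbation in y escapes downhill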
15
votes
2 answers
How to set mini-batch size in SGD in keras
I am new to Keras and need your help.
I am training a neural net in Keras and my loss function is the squared difference between the net's output and the target value.
I want to optimize this using Gradient Descent. After going through some links on the net, I have…

Iceflame007
- 161
- 1
- 2
- 5
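A minimal sketch of where the mini-batch size is set in Keras: it is the batch_size argument of model.fit, not a property of the optimizer. The model, data shapes, and hyperparameter values below are placeholders:

import numpy as np
from tensorflow import keras

x_train = np.random.rand(1000, 20)   # placeholder inputs
y_train = np.random.rand(1000, 1)    # placeholder targets

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="mean_squared_error")   # squared-difference loss, as in the question

# batch_size controls how many samples go into each gradient estimate
model.fit(x_train, y_train, epochs=5, batch_size=32)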
13
votes
1 answer
How to choose between SGD with Nesterov momentum and Adam?
I'm currently implementing a neural network architecture in Keras. I would like to optimize the training time, and I'm considering using alternative optimizers such as SGD with Nesterov Momentum and Adam.
I've read several things about the pros and…

Clément F
- 1,717
- 4
- 12
- 13
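A hedged sketch of how the two candidates would be instantiated in Keras; the momentum and learning-rate values are illustrative defaults, not taken from the question:

from tensorflow import keras

# SGD with Nesterov momentum
sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Adam with its usual default hyperparameters
adam = keras.optimizers.Adam(learning_rate=0.001)

# Either object can then be passed to model.compile(optimizer=..., loss=...)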
12
votes
3 answers
Gradient descent on non-convex functions
What situations do we know of where gradient descent can be shown to converge (either to a critical point or to a local/global minimum) for non-convex functions?
For SGD on non-convex functions, one kind of proof has been reviewed here,…

gradstudent
- 271
- 2
- 9
10
votes
1 answer
What is the difference between VAE and Stochastic Backpropagation for Deep Generative Models?
What is the difference between Auto-encoding Variational Bayes and Stochastic Backpropagation for Deep Generative Models? Does inference in both methods lead to the same results? I'm not aware of any explicit comparisons between the two methods,…

Dionysis M
- 794
- 6
- 17
8
votes
1 answer
Does Keras SGD optimizer implement batch, mini-batch, or stochastic gradient descent?
I am a newbie in Deep Learning libraries and thus decided to go with Keras. While implementing an NN model, I saw the batch_size parameter in model.fit().
Now, I was wondering if I use the SGD optimizer, and then set the batch_size = 1, m and b,…

Rajdeep Dutta
- 195
- 2
- 5
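A sketch of the behaviour the question asks about: Keras's SGD optimizer performs mini-batch gradient descent with whatever batch_size is passed to model.fit, so batch_size=1 gives per-example (stochastic) updates and batch_size=len(x_train) gives full-batch gradient descent. The model and data below are placeholders:

import numpy as np
from tensorflow import keras

x_train = np.random.rand(200, 1)
y_train = 3 * x_train + 2 + 0.1 * np.random.randn(200, 1)        # noisy line: m = 3, b = 2

model = keras.Sequential([keras.layers.Dense(1)])                # learns a slope and an intercept
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")

model.fit(x_train, y_train, epochs=5, batch_size=1)               # stochastic (per-example) GD
# model.fit(x_train, y_train, epochs=5, batch_size=len(x_train))  # full-batch GD
# model.fit(x_train, y_train, epochs=5, batch_size=32)            # mini-batch GD (the default)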
8
votes
2 answers
Dealing with small batch size in SGD training
I am trying to train a large model (deep net using caffe) using stochastic gradient descent (SGD).
The problem is that I am constrained by my GPU memory capacity and thus cannot process large mini-batches for each stochastic gradient estimation.
How can I…

Shai
- 258
- 2
- 9
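One common workaround (not necessarily the one given in the answers) is gradient accumulation: run several small forward/backward passes and apply a single update from their averaged gradient, which emulates a larger effective batch; Caffe's solver exposes this idea through its iter_size parameter. A framework-agnostic sketch, where compute_gradient is a hypothetical per-mini-batch gradient function and params is an array of parameters:

def accumulated_sgd_step(params, small_batches, compute_gradient, lr):
    """Average gradients over several GPU-sized mini-batches, then update once,
    emulating one SGD step with a larger effective batch."""
    accum = None
    for batch in small_batches:
        g = compute_gradient(params, batch)            # each pass fits in GPU memory
        accum = g if accum is None else accum + g
    accum = accum / len(small_batches)                 # average over the small batches
    return params - lr * accum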
8
votes
2 answers
Comparison of SGD and ALS in collaborative filtering
Matrix factorization is widely applied in collaborative filtering, and briefly speaking, it tries to learn the following parameters:
$$\min_{q_u,p_i}\sum_{\{u,i\}}(r_{ui} - q_u^Tp_i)^2$$
And we could apply SGD and ALS as the learning algorithm,…

avocado
- 3,045
- 5
- 32
- 45
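A sketch of the SGD side of that comparison for the objective above: iterate over the observed ratings and move $q_u$ and $p_i$ along the gradient of each single-rating squared error (the constant factor 2 from the derivative is absorbed into the learning rate, and no regularization term is added since the excerpt's objective has none):

import numpy as np

def sgd_epoch(ratings, Q, P, lr):
    """One SGD pass over the observed ratings for min sum (r_ui - q_u^T p_i)^2.

    ratings: iterable of (u, i, r_ui) triples; Q[u] and P[i]: latent factor vectors.
    """
    for u, i, r_ui in ratings:
        e = r_ui - Q[u] @ P[i]     # residual of this single observed rating
        Q[u] += lr * e * P[i]      # gradient step for the user factors
        P[i] += lr * e * Q[u]      # gradient step for the item factors
    return Q, P

ALS would instead fix P, solve a least-squares problem for every Q[u] in closed form, then swap the roles and repeat.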