Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, $-\nabla F(\mathbf{a})$. It follows that, if

$$\mathbf{b} = \mathbf{a} - \gamma \nabla F(\mathbf{a})$$

for $\gamma$ small enough, then $F(\mathbf{a}) \geq F(\mathbf{b})$. In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, namely down toward the minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$ and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n),\quad n \geq 0.$$

We then have

$$F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots,$$

so hopefully the sequence $\mathbf{x}_n$ converges to the desired local minimum.

-- Wikipedia
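The update rule above is a one-liner in code. Below is a minimal sketch in Python/NumPy, assuming a fixed step size $\gamma$ and a made-up quadratic objective purely for illustration:

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - gamma * grad_F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = ||x - c||^2 has gradient 2 * (x - c),
# so the iterates should converge to the minimizer c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))  # ~[3., -1.]
```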

903 questions
219 votes • 9 answers

Why is Newton's method not widely used in machine learning?

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes: after reviewing a set of lectures on convex optimization, Newton's method seems to be a far superior algorithm to gradient…
Fei Yang • 2,181
139 votes • 5 answers

Batch gradient descent versus stochastic gradient descent

Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…
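To make the contrast in the question concrete, here is a hedged sketch for a least-squares version of the hypothesis above: batch gradient descent uses the gradient over all $m$ examples per update, while stochastic gradient descent updates once per example. The synthetic data and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]   # columns [1, x] for theta_0 + theta_1 * x
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)
eta = 0.05

# Batch gradient descent: one parameter update per full pass over the data.
theta = np.zeros(2)
for _ in range(200):
    theta -= eta * X.T @ (X @ theta - y) / len(y)

# Stochastic gradient descent: one parameter update per single example.
theta_sgd = np.zeros(2)
for _ in range(200):
    for i in rng.permutation(len(y)):
        theta_sgd -= eta * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta, theta_sgd)   # both should land near [2, 3]
```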
111 votes • 7 answers

Why use gradient descent for linear regression, when a closed-form math solution is available?

I am taking the Machine Learning course online and learnt about Gradient Descent for calculating the optimal values in the hypothesis h(x) = B0 + B1X. Why do we need to use Gradient Descent if we can easily find the values with the below formula?…
Purus • 1,213
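The truncated "below formula" is presumably the normal equation, $\hat{\theta} = (X^\top X)^{-1} X^\top y$ (an assumption; the excerpt cuts it off). A sketch comparing the two routes on synthetic data, using `lstsq` as the numerically stable way to solve the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=50)]      # design matrix for B0 + B1 * x
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

# Closed form: least-squares solution of the normal equations.
theta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the same least-squares objective.
theta_gd, eta = np.zeros(2), 0.1
for _ in range(500):
    theta_gd -= eta * X.T @ (X @ theta_gd - y) / len(y)

print(theta_closed, theta_gd)   # the two should agree closely
```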
96 votes • 2 answers

Solving for regression parameters in closed-form vs gradient descent

In Andrew Ng's machine learning course, he introduces linear regression and logistic regression, and shows how to fit the model parameters using gradient descent and Newton's method. I know gradient descent can be useful in some applications of…
Jeff • 3,525
77 votes • 3 answers

Why do neural network researchers care about epochs?

An epoch in stochastic gradient descent is defined as a single pass through the data. For each SGD minibatch, $k$ samples are drawn, the gradient computed and parameters are updated. In the epoch setting, the samples are drawn without…
Sycorax • 76,417
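The sampling scheme described in the excerpt, minibatches drawn without replacement until the data are exhausted, amounts in code to a shuffle followed by one sweep per epoch. A minimal sketch with a placeholder `update` callback:

```python
import numpy as np

def run_epochs(data, update, k=32, n_epochs=10, seed=0):
    """One epoch = one shuffled pass over `data` in minibatches of size k."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(n_epochs):
        order = rng.permutation(n)            # draw without replacement
        for start in range(0, n, k):
            batch = data[order[start:start + k]]
            update(batch)                     # e.g. compute gradient, step parameters

# Toy usage: prints minibatches of indices 0..9 in shuffled order.
run_epochs(np.arange(10), update=print, k=4, n_epochs=1)
```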
73 votes • 6 answers

Optimization when Cost Function Is Slow to Evaluate

Gradient descent and many other methods are useful for finding local minima in cost functions. They can be efficient when the cost function can be evaluated quickly at each point, whether numerically or analytically. I have what appears to me to…
71 votes • 4 answers

What's the difference between momentum based gradient descent and Nesterov's accelerated gradient descent?

So momentum-based gradient descent works as follows: $v=\beta m-\eta g$, where $m$ is the previous weight update, $g$ is the current gradient with respect to the parameters $p$, $\eta$ is the learning rate, and $\beta$ is a constant. $p_{new} = p…
applecider • 1,175
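In one common formulation (the lookahead form of Nesterov's method; an assumption about the notation the asker has in mind), the two methods differ only in where the gradient is evaluated:

```python
def momentum_step(p, v, grad, eta=0.01, beta=0.9):
    """Classical momentum: gradient evaluated at the current parameters p."""
    v = beta * v - eta * grad(p)
    return p + v, v

def nesterov_step(p, v, grad, eta=0.01, beta=0.9):
    """Nesterov: gradient evaluated at the lookahead point p + beta * v."""
    v = beta * v - eta * grad(p + beta * v)
    return p + v, v
```

With $\beta = 0$ both collapse to plain gradient descent; the lookahead evaluation point is the entire difference.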
58 votes • 6 answers

Adam optimizer with exponential decay

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following: ...build the model... # Add the optimizer train_op =…
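Framework APIs change between releases, so here is a framework-agnostic sketch of what the question combines: an exponentially decaying learning rate fed to the optimizer at each step. The decay constants and the `adam_update` helper are hypothetical:

```python
def decayed_lr(step, lr0=1e-4, decay_rate=0.96, decay_steps=1000):
    """Exponential schedule: lr0 * decay_rate ** (step / decay_steps)."""
    return lr0 * decay_rate ** (step / decay_steps)

# Inside a hypothetical training loop, the schedule replaces the constant 1e-4:
# for step in range(num_steps):
#     params, state = adam_update(params, grads, state, lr=decayed_lr(step))
```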
54 votes • 1 answer

How large should the batch size be for stochastic gradient descent?

I understand that stochastic gradient descent may be used to optimize a neural network using backpropagation by updating each iteration with a different sample of the training dataset. How large should the batch size be?
49 votes • 1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION. While I've definitely got some insights (at least), the paper seems to be too high…
daniel451 • 2,635
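For orientation, the update described in the paper fits in a few lines. This sketch follows Algorithm 1 of Kingma & Ba, with the paper's default hyperparameters:

```python
import numpy as np

def adam_update(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step (t is the 1-based step count)."""
    m = b1 * m + (1 - b1) * g           # biased first-moment estimate
    v = b2 * v + (1 - b2) * g**2        # biased second-moment estimate
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```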
49 votes • 1 answer

Difference between GradientDescentOptimizer and AdamOptimizer (TensorFlow)?

I've written a simple MLP in TensorFlow which models an XOR gate. So for: input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] it should produce the following: output_data = [[0.], [1.], [1.], [0.]] The network has an input layer, a hidden…
49 votes • 5 answers

How does the rectified linear activation function solve the vanishing gradient problem in neural networks?

I found rectified linear unit (ReLU) praised at several places as a solution to the vanishing gradient problem for neural networks. That is, one uses max(0,x) as activation function. When the activation is positive, it is obvious that this is better…
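A quick numerical illustration of the point in the excerpt: backpropagated gradients are products of per-layer derivatives, and the sigmoid's derivative is at most 0.25, while an active ReLU contributes exactly 1. The depth below is an arbitrary choice:

```python
depth = 20
sigmoid_deriv_max = 0.25      # sigma'(z) = sigma(z) * (1 - sigma(z)) <= 1/4
relu_deriv_active = 1.0       # d/dz max(0, z) = 1 for z > 0

print(sigmoid_deriv_max ** depth)   # ~9.1e-13: the gradient all but vanishes
print(relu_deriv_active ** depth)   # 1.0: the gradient passes through intact
```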
47 votes • 1 answer

Neural Networks: weight change momentum and weight decay

Momentum $\alpha$ is used to diminish the fluctuations in weight changes over consecutive iterations: $$\Delta\omega_i(t+1) = - \eta\frac{\partial E}{\partial w_i} + \alpha \Delta \omega_i(t),$$ where $E({\bf w})$ is the error function, ${\bf w}$ -…
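Combining the excerpt's momentum update with an L2 weight-decay term (the $-\eta\lambda\omega_i$ form is one common convention, assumed here):

```python
def momentum_decay_update(w, grad_E, dw_prev, eta=0.01, alpha=0.9, lam=1e-4):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t) - eta * lambda * w."""
    dw = -eta * grad_E + alpha * dw_prev - eta * lam * w
    return w + dw, dw
```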
43 votes • 2 answers

Who invented stochastic gradient descent?

I'm trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847 (Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538). For more information…
DaL • 4,462
38 votes • 5 answers

How is the cost function from Logistic Regression differentiated?

I am doing the Machine Learning Stanford course on Coursera. In the chapter on Logistic Regression, the cost function is given and then differentiated (both shown as images in the question). I tried getting the derivative of the cost function, but I got something completely…
octavian • 909
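Since the equations referenced in the excerpt are images that did not survive extraction, here is the standard result being asked about, assuming the course's cross-entropy cost and sigmoid hypothesis $h_\theta(x) = 1/(1+e^{-\theta^\top x})$:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$

Differentiating with the chain rule, and using $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)\,\left(1-h_\theta(x)\right)x_j$, the terms collapse to

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}.$$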