Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, $-\nabla F(\mathbf{a})$. It follows that, if

$$\mathbf{b} = \mathbf{a} - \gamma \nabla F(\mathbf{a})$$

for $\gamma$ small enough, then $F(\mathbf{a}) \geq F(\mathbf{b})$. In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, namely down toward the minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$ and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n),\quad n \geq 0.$$

We then have

$$F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots,$$

so hopefully the sequence $\mathbf{x}_n$ converges to the desired local minimum.

-- Wikipedia
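The update rule above is a one-liner in code. Below is a minimal sketch in Python/NumPy, assuming a fixed step size $\gamma$ and a made-up quadratic objective purely for illustration:

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - gamma * grad_F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = ||x - c||^2 has gradient 2 * (x - c),
# so the iterates should converge to the minimizer c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))  # ~[3., -1.]
```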

903 questions
219 votes • 9 answers

Why is Newton's method not widely used in machine learning?

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes: after reviewing a set of lectures on convex optimization, Newton's method seems to be a far superior algorithm to gradient…
Fei Yang • 2,181
139 votes • 5 answers

Batch gradient descent versus stochastic gradient descent

Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…
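To make the contrast in the question concrete, here is a hedged sketch for a least-squares version of the hypothesis above: batch gradient descent uses the gradient over all $m$ examples per update, while stochastic gradient descent updates once per example. The synthetic data and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]   # columns [1, x] for theta_0 + theta_1 * x
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)
eta = 0.05

# Batch gradient descent: one parameter update per full pass over the data.
theta = np.zeros(2)
for _ in range(200):
    theta -= eta * X.T @ (X @ theta - y) / len(y)

# Stochastic gradient descent: one parameter update per single example.
theta_sgd = np.zeros(2)
for _ in range(200):
    for i in rng.permutation(len(y)):
        theta_sgd -= eta * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta, theta_sgd)   # both should land near [2, 3]
```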
111 votes • 7 answers

Why use gradient descent for linear regression, when a closed-form math solution is available?

I am taking the Machine Learning course online and learnt about Gradient Descent for calculating the optimal values in the hypothesis h(x) = B0 + B1X. Why do we need to use Gradient Descent if we can easily find the values with the below formula?…
Purus • 1,213
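The truncated "below formula" is presumably the normal equation, $\hat{\theta} = (X^\top X)^{-1} X^\top y$ (an assumption; the excerpt cuts it off). A sketch comparing the two routes on synthetic data, using `lstsq` as the numerically stable way to solve the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=50)]      # design matrix for B0 + B1 * x
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

# Closed form: least-squares solution of the normal equations.
theta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the same least-squares objective.
theta_gd, eta = np.zeros(2), 0.1
for _ in range(500):
    theta_gd -= eta * X.T @ (X @ theta_gd - y) / len(y)

print(theta_closed, theta_gd)   # the two should agree closely
```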
96 votes • 2 answers

Solving for regression parameters in closed-form vs gradient descent

In Andrew Ng's machine learning course, he introduces linear regression and logistic regression, and shows how to fit the model parameters using gradient descent and Newton's method. I know gradient descent can be useful in some applications of…
Jeff • 3,525
77 votes • 3 answers

Why do neural network researchers care about epochs?

An epoch in stochastic gradient descent is defined as a single pass through the data. For each SGD minibatch, $k$ samples are drawn, the gradient computed and parameters are updated. In the epoch setting, the samples are drawn without…
Sycorax • 76,417
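The sampling scheme described in the excerpt, minibatches drawn without replacement until the data are exhausted, amounts in code to a shuffle followed by one sweep per epoch. A minimal sketch with a placeholder `update` callback:

```python
import numpy as np

def run_epochs(data, update, k=32, n_epochs=10, seed=0):
    """One epoch = one shuffled pass over `data` in minibatches of size k."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(n_epochs):
        order = rng.permutation(n)            # draw without replacement
        for start in range(0, n, k):
            batch = data[order[start:start + k]]
            update(batch)                     # e.g. compute gradient, step parameters

# Toy usage: prints minibatches of indices 0..9 in shuffled order.
run_epochs(np.arange(10), update=print, k=4, n_epochs=1)
```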
73 votes • 6 answers

Optimization when Cost Function Is Slow to Evaluate

Gradient descent and many other methods are useful for finding local minima in cost functions. They can be efficient when the cost function can be evaluated quickly at each point, whether numerically or analytically. I have what appears to me to…
71 votes • 4 answers

What's the difference between momentum based gradient descent and Nesterov's accelerated gradient descent?

So momentum-based gradient descent works as follows: $v=\beta m-\eta g$, where $m$ is the previous weight update, $g$ is the current gradient with respect to the parameters $p$, $\eta$ is the learning rate, and $\beta$ is a constant. $p_{new} = p…
applecider • 1,175
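In one common formulation (the lookahead form of Nesterov's method; an assumption about the notation the asker has in mind), the two methods differ only in where the gradient is evaluated:

```python
def momentum_step(p, v, grad, eta=0.01, beta=0.9):
    """Classical momentum: gradient evaluated at the current parameters p."""
    v = beta * v - eta * grad(p)
    return p + v, v

def nesterov_step(p, v, grad, eta=0.01, beta=0.9):
    """Nesterov: gradient evaluated at the lookahead point p + beta * v."""
    v = beta * v - eta * grad(p + beta * v)
    return p + v, v
```

With $\beta = 0$ both collapse to plain gradient descent; the lookahead evaluation point is the entire difference.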
58 votes • 6 answers

Adam optimizer with exponential decay

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following: ...build the model... # Add the optimizer train_op =…
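Framework APIs change between releases, so here is a framework-agnostic sketch of what the question combines: an exponentially decaying learning rate fed to the optimizer at each step. The decay constants and the `adam_update` helper are hypothetical:

```python
def decayed_lr(step, lr0=1e-4, decay_rate=0.96, decay_steps=1000):
    """Exponential schedule: lr0 * decay_rate ** (step / decay_steps)."""
    return lr0 * decay_rate ** (step / decay_steps)

# Inside a hypothetical training loop, the schedule replaces the constant 1e-4:
# for step in range(num_steps):
#     params, state = adam_update(params, grads, state, lr=decayed_lr(step))
```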
54 votes • 1 answer

How large should the batch size be for stochastic gradient descent?

I understand that stochastic gradient descent may be used to optimize a neural network using backpropagation by updating each iteration with a different sample of the training dataset. How large should the batch size be?
49 votes • 1 answer

How does the Adam method of stochastic gradient descent work?

I'm familiar with basic gradient descent algorithms for training neural networks. I've read the paper proposing Adam: ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION. While I've definitely got some insights (at least), the paper seems to be too high…
daniel451 • 2,635
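For orientation, the update described in the paper fits in a few lines. This sketch follows Algorithm 1 of Kingma & Ba, with the paper's default hyperparameters:

```python
import numpy as np

def adam_update(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step (t is the 1-based step count)."""
    m = b1 * m + (1 - b1) * g           # biased first-moment estimate
    v = b2 * v + (1 - b2) * g**2        # biased second-moment estimate
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```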
49 votes • 1 answer

Difference between GradientDescentOptimizer and AdamOptimizer (TensorFlow)?

I've written a simple MLP in TensorFlow which models an XOR gate. So for: input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] it should produce the following: output_data = [[0.], [1.], [1.], [0.]] The network has an input layer, a hidden…
49 votes • 5 answers

How does the rectified linear activation function solve the vanishing gradient problem in neural networks?

I found rectified linear unit (ReLU) praised at several places as a solution to the vanishing gradient problem for neural networks. That is, one uses max(0,x) as activation function. When the activation is positive, it is obvious that this is better…
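A quick numerical illustration of the point in the excerpt: backpropagated gradients are products of per-layer derivatives, and the sigmoid's derivative is at most 0.25, while an active ReLU contributes exactly 1. The depth below is an arbitrary choice:

```python
depth = 20
sigmoid_deriv_max = 0.25      # sigma'(z) = sigma(z) * (1 - sigma(z)) <= 1/4
relu_deriv_active = 1.0       # d/dz max(0, z) = 1 for z > 0

print(sigmoid_deriv_max ** depth)   # ~9.1e-13: the gradient all but vanishes
print(relu_deriv_active ** depth)   # 1.0: the gradient passes through intact
```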
47 votes • 1 answer

Neural Networks: weight change momentum and weight decay

Momentum $\alpha$ is used to diminish the fluctuations in weight changes over consecutive iterations: $$\Delta\omega_i(t+1) = - \eta\frac{\partial E}{\partial w_i} + \alpha \Delta \omega_i(t),$$ where $E({\bf w})$ is the error function, ${\bf w}$ -…
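Combining the excerpt's momentum update with an L2 weight-decay term (the $-\eta\lambda\omega_i$ form is one common convention, assumed here):

```python
def momentum_decay_update(w, grad_E, dw_prev, eta=0.01, alpha=0.9, lam=1e-4):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t) - eta * lambda * w."""
    dw = -eta * grad_E + alpha * dw_prev - eta * lam * w
    return w + dw, dw
```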
43 votes • 2 answers

Who invented stochastic gradient descent?

I'm trying to understand the history of gradient descent and stochastic gradient descent. Gradient descent was invented by Cauchy in 1847 (Méthode générale pour la résolution des systèmes d'équations simultanées, pp. 536–538). For more information…
DaL • 4,462
38 votes • 5 answers

How is the cost function from Logistic Regression differentiated?

I am doing the Machine Learning Stanford course on Coursera. In the chapter on Logistic Regression, the cost function is given and then differentiated (both shown as images in the question). I tried getting the derivative of the cost function, but I got something completely…
octavian • 909
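Since the equations referenced in the excerpt are images that did not survive extraction, here is the standard result being asked about, assuming the course's cross-entropy cost and sigmoid hypothesis $h_\theta(x) = 1/(1+e^{-\theta^\top x})$:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$

Differentiating with the chain rule, and using $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)\,\left(1-h_\theta(x)\right)x_j$, the terms collapse to

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}.$$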