
I am having a hard time understanding the Gradient Descent Rule for learning in a feedforward ANN. In particular, how do we determine the initial weight vector, and how is this weight vector adjusted after each epoch?

From what I've read, I know that we first define some error function depending on the weights, and I think that we choose the initial weight to be the minimizer of this error function. Is this right?

DavidSilverberg

1 Answer


Typically neural network weights are initialized at random (for example: Xavier Initialization - Formula Clarification) while the biases are initialized at 0.
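A minimal sketch of this initialization scheme in NumPy (the uniform-distribution variant of Xavier/Glorot initialization; the layer sizes are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Xavier/Glorot initialization: draw weights uniformly from
    [-limit, limit] with limit = sqrt(6 / (n_in + n_out)), so the
    variance of activations is roughly preserved across layers."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_init(4, 3)   # weight matrix for a layer with 4 inputs, 3 units
b = np.zeros(3)         # biases initialized at 0
```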

Gradient descent applies updates of the form $$x^{(k+1)} = x^{(k)} - \eta \nabla f(x^{(k)})$$ where ${}^{(k)}$ indicates that this is the $k$th iteration of the procedure and $\eta$ is the learning rate. Stochastic gradient descent only uses a fraction of the data to estimate $\nabla f(x^{(k)})$.
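The update rule above can be sketched in a few lines of NumPy; here it is applied to a simple convex function (the quadratic is my example, not from the answer), with `eta` playing the role of $\eta$:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Iterate x^(k+1) = x^(k) - eta * grad_f(x^(k)) for n_steps iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3);
# the iterates move toward the minimizer x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Stochastic gradient descent has the same shape, except `grad_f` would return a gradient estimated from a random subset (mini-batch) of the training data rather than the full dataset.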

Gradient descent is an imperfect tool: the iterates are not guaranteed to reach a global minimum, and a poorly chosen learning rate can stall or diverge. Some of these failure modes are discussed in the links in the comments below.

Sycorax
  • So nu is the learning rate, and the weights generally move toward the minimizing weight? – DavidSilverberg May 24 '19 at 16:30
  • The learning rate is $\eta$ (eta); $\nu$ (nu) doesn't appear in that equation. We hope that the update is closer to the minimum than when we started; however, there are lots of ways that this can go wrong. One example: https://stats.stackexchange.com/questions/367397/for-convex-problems-does-gradient-in-stochastic-gradient-descent-sgd-always-p/367459#367459 Another example: https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive – Sycorax May 24 '19 at 16:37