
I want to update a bias in my Neural Network using the gradient descent optimization algorithm. Unfortunately, the bias has different dimensions than the derivative of the loss function with respect to the bias. For example, the bias in the first hidden layer has dimensions 1 x hidden_size and the delta error (i.e. the derivative of the loss function with respect to the bias) has dimensions train_size x hidden_size. So I can't just subtract one from the other.
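To make the shape mismatch concrete, here is a minimal sketch with made-up sizes; the variable names are just illustrative and not from any particular implementation:

```python
import numpy as np

train_size, input_size, hidden_size = 32, 10, 4
learning_rate = 0.01

X = np.random.randn(train_size, input_size)       # training batch
W1 = np.random.randn(input_size, hidden_size)     # first-layer weights
b1 = np.zeros((1, hidden_size))                   # bias: 1 x hidden_size

# forward pass into the hidden layer
z1 = X @ W1 + b1                                  # train_size x hidden_size

# delta error: derivative of the loss w.r.t. z1, one row per training example
delta = np.random.randn(train_size, hidden_size)  # train_size x hidden_size

# b1 -= learning_rate * delta   # fails: (1, hidden_size) vs (train_size, hidden_size)
```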

I have seen here that the author is summing the delta error over the columns, but I don't understand why.

Could someone help me with it?

1 Answer


A loss function for a statistical or machine learning model almost always averages (or sums, as the difference between a sum and an average is just a multiplicative constant) over all the training data:

$$ L(\beta) = \sum_i (y_i - \hat y_i)^2 $$

So when you take the gradient with respect to any parameter, the derivative can be pushed into the sum:

$$ \nabla L(\beta) = - 2 \sum_i (y_i - \hat y_i) \nabla \hat y_i $$

I think what you have computed is the individual gradients $\nabla \hat y_i$, so you just need to sum to account for the fact that you are minimizing the total loss across all your training data.
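In NumPy terms, sticking with the shapes from the question (the names below are for illustration only, not code from the linked post), the bias update would look something like this:

```python
import numpy as np

train_size, hidden_size = 32, 4
learning_rate = 0.01

b1 = np.zeros((1, hidden_size))                    # bias: 1 x hidden_size
delta = np.random.randn(train_size, hidden_size)   # per-example gradients: train_size x hidden_size

# Sum the per-example gradients over the training axis so the result
# matches the bias shape, then take the gradient descent step.
grad_b1 = delta.sum(axis=0, keepdims=True)         # 1 x hidden_size
b1 -= learning_rate * grad_b1
```

Using `delta.mean(axis=0, keepdims=True)` instead of the sum just rescales the step by 1/train_size, which you can absorb into the learning rate.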

Matthew Drury