
I want to update a bias in my Neural Network using the gradient descent optimization algorithm. Unfortunately, the bias has different dimensions than the derivative of the loss function with respect to the bias. For example, the bias in the first hidden layer has dimensions 1 x hidden_size and the delta error (i.e. the derivative of the loss function with respect to the bias) has dimensions train_size x hidden_size. So I can't just subtract one from the other.
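To make the shape mismatch concrete, here is a minimal sketch with made-up sizes; the variable names are just illustrative and not from any particular implementation:

```python
import numpy as np

train_size, input_size, hidden_size = 32, 10, 4
learning_rate = 0.01

X = np.random.randn(train_size, input_size)       # training batch
W1 = np.random.randn(input_size, hidden_size)     # first-layer weights
b1 = np.zeros((1, hidden_size))                   # bias: 1 x hidden_size

# forward pass into the hidden layer
z1 = X @ W1 + b1                                  # train_size x hidden_size

# delta error: derivative of the loss w.r.t. z1, one row per training example
delta = np.random.randn(train_size, hidden_size)  # train_size x hidden_size

# b1 -= learning_rate * delta   # fails: (1, hidden_size) vs (train_size, hidden_size)
```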

I have seen here that the author is summing the delta error over the columns, but I don't understand why.

Could someone help me with it?

1 Answer


A loss function for a statistical or machine learning model almost always averages (or sums, as the difference between a sum and an average is just a multiplicative constant) over all the training data:

$$ L(\beta) = \sum_i (y_i - \hat y_i)^2 $$

So when you take the gradient with respect to any parameter, the derivative can be pushed into the sum:

$$ \nabla L(\beta) = - 2 \sum_i (y_i - \hat y_i) \nabla \hat y_i $$

I think what you have computed is the individual gradients $\nabla \hat y_i$, so you just need to sum to account for the fact that you are minimizing the total loss across all your training data.
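In NumPy terms, sticking with the shapes from the question (the names below are for illustration only, not code from the linked post), the bias update would look something like this:

```python
import numpy as np

train_size, hidden_size = 32, 4
learning_rate = 0.01

b1 = np.zeros((1, hidden_size))                    # bias: 1 x hidden_size
delta = np.random.randn(train_size, hidden_size)   # per-example gradients: train_size x hidden_size

# Sum the per-example gradients over the training axis so the result
# matches the bias shape, then take the gradient descent step.
grad_b1 = delta.sum(axis=0, keepdims=True)         # 1 x hidden_size
b1 -= learning_rate * grad_b1
```

Using `delta.mean(axis=0, keepdims=True)` instead of the sum just rescales the step by 1/train_size, which you can absorb into the learning rate.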

Matthew Drury