
I've looked at a few threads about this but they've not been exactly what I'm after. When back-propagating the quadratic cost function, you first find the output error from $\delta_L = \nabla_a C \odot \sigma'(z_L)$.

You can then backpropagate this error through the network using $\delta_l = ((w_{l+1})^T\delta_{l+1})\odot\sigma'(z_l)$. The gradients, and therefore how the weights and biases change, are based on these errors: $\frac{\partial C}{\partial b_l} = \delta_l$ and $\frac{\partial C}{\partial w_l} = a_{l-1}\delta_l$.

Source for this
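
For concreteness, here's a minimal NumPy sketch of those formulas for a single training example. The list conventions and names (`weights`, `zs`, `activations`, the helper functions) are just illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_quadratic(weights, zs, activations, y):
    """Backprop for the quadratic cost C = 0.5 * ||a_L - y||^2.

    Illustrative conventions: weights[l] connects layer l to layer l+1,
    activations has length L+1 (index 0 is the input), zs[l] is the
    weighted input of the layer whose error is deltas[l], y is the target.
    """
    L = len(weights)
    deltas = [None] * L
    # Output error: delta_L = grad_a C (.) sigma'(z_L), with grad_a C = a_L - y
    deltas[-1] = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # Backpropagate: delta_l = (w_{l+1}^T delta_{l+1}) (.) sigma'(z_l)
    for l in range(L - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    # Gradients: dC/db_l = delta_l and dC/dw_l = delta_l a_{l-1}^T
    grad_b = deltas
    grad_w = [np.outer(deltas[l], activations[l]) for l in range(L)]
    return deltas, grad_b, grad_w
```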

However, the gradients given for the cross-entropy cost function are not written in terms of $\delta_l$; they are $\frac{\partial C}{\partial b} = \sigma(z)-y$ and $\frac{\partial C}{\partial w_j} = x_j(\sigma(z)-y)$.

Source for this

It's obvious then that the gradient of the bias of the output neuron(s) is $\sigma(z)-y$, and the gradient of the weights connecting the final hidden layer to the output layer is $x_j(\sigma(z)-y)$. However, what would the gradients be for the other layers, please? How do you actually back-propagate this?

Thanks for any help!

GMSL

1 Answer


Firstly, note that $\delta_l$ is nothing other than $\frac{\partial C}{\partial z_l}$, which can be expanded via the chain rule to $\frac{\partial C}{\partial a_l}\frac{\partial a_l}{\partial z_l}$.

Moreover, you know that the derivative of each bias can be computed as $\frac{\partial C}{\partial b_l} = \delta_l$ and the derivative of each weight as $\frac{\partial C}{\partial w_l} = a_{l-1}\delta_l$. This also holds for the last layer: for the cross-entropy cost, $\delta_L$ is $\sigma(z)-y$. For the derivation of $\delta_L = \frac{\partial C}{\partial z_L} = \sigma(z)-y$, see for example this answer.
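
A compact version of that derivation, for a single sigmoid output unit with the binary cross-entropy cost $C = -\left[y\ln a + (1-y)\ln(1-a)\right]$, where $a = \sigma(z_L)$:

$$\frac{\partial C}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z_L} = \sigma'(z_L) = a(1-a),$$

$$\delta_L = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial z_L} = -y(1-a) + (1-y)a = a - y = \sigma(z_L)-y.$$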

You can simply plug it in as $\delta_{l+1}$ to compute any preceding $\delta_l$.
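
As a minimal sketch of that recipe, reusing `numpy`, `sigmoid_prime`, and the illustrative layer conventions from the sketch in the question: the only change from the quadratic-cost version is the output error.

```python
import numpy as np  # sigmoid_prime as defined in the earlier sketch

def backprop_cross_entropy(weights, zs, activations, y):
    """Same (illustrative) layer conventions as backprop_quadratic above."""
    L = len(weights)
    deltas = [None] * L
    # Cross-entropy with a sigmoid output: the sigma'(z_L) factor cancels,
    # leaving delta_L = sigma(z_L) - y = a_L - y
    deltas[-1] = activations[-1] - y
    # Every earlier layer uses the same iterative formula as before
    for l in range(L - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    grad_b = deltas
    grad_w = [np.outer(deltas[l], activations[l]) for l in range(L)]
    return deltas, grad_b, grad_w
```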

Jan Kukacka
  • Thanks for the help. So am I right in saying that the derivations $\frac{\partial C}{\partial b_l}=\delta_l$ and $\frac{\partial C}{\partial w_l}=a_{l-1}\delta_l$ still hold true when using cross-entropy, except that now $\delta_l=\sigma(z)-y$? – GMSL Jun 13 '18 at 08:33
  • $\delta_L = \sigma(z) - y$. To compute any other $\delta_l$, you need to use the iterative formula you posted above. – Jan Kukacka Jun 13 '18 at 08:49
  • So, to backpropagate you first calculate $\delta_L$ from $\delta_L=\sigma(z)-y$. You then calculate $\delta_l$ for each other layer via $\delta_l=((w_{l+1})^T\delta_{l+1})\odot\sigma'(z_l)$. What do you change the weights and bias by then please? Sorry that I'm not getting this. – GMSL Jun 13 '18 at 09:13
  • You update them according to the learning rate $\eta$: $w \leftarrow w - \eta \frac{\partial C}{\partial w}$ – Jan Kukacka Jun 13 '18 at 09:16
  • Ahhh, and we can derive $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ from the chain rule in the way it explains in the source I linked to in the original question. I understand now. Thanks Jan, you have been very patient and helpful! – GMSL Jun 13 '18 at 09:30
  • For other people having the same issue as me: we can change the weights by $w\leftarrow w-\eta\frac{\partial C}{\partial w}$ and the biases by $b\leftarrow b-\eta\frac{\partial C}{\partial b}$. The source in my question shows that, at the output layer, $\frac{\partial C}{\partial w}=x_j(\sigma(z)-y)$. But notice how $\delta_L=\sigma(z)-y$. Therefore we change the weights by $w\leftarrow w-\eta x_j\delta_l$ for each layer. Similarly, $\frac{\partial C}{\partial b} = \sigma(z)-y=\delta_L$ at the output, so we change the biases by $b\leftarrow b-\eta\delta_l$ (a short code sketch of this update follows below). – GMSL Jun 13 '18 at 10:03
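
And a minimal sketch of that update step, assuming gradients from the illustrative `backprop_cross_entropy` above; the learning rate value and variable names are made up, and `zs`/`activations` are taken from a forward pass:

```python
eta = 0.1  # learning rate (illustrative value)
_, grad_b, grad_w = backprop_cross_entropy(weights, zs, activations, y)
# Gradient descent moves against the gradient, hence the minus signs
weights = [w - eta * gw for w, gw in zip(weights, grad_w)]
biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
```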