
I have a neural network with one hidden layer using a ReLU activation, followed by a linear output layer: $$f_k(w,u)(x) = \sum_{i = 1}^{m_1} w_{ki} \max\left(0,\sum_{j=1}^d u_{ij}x_j\right)$$
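For concreteness, here is how I compute the forward pass (a minimal NumPy sketch; the variable names and shapes are my own choice, with `W` of shape $K \times m_1$ and `U` of shape $m_1 \times d$):

```python
import numpy as np

def forward(W, U, x):
    """Compute the logits f_1(w,u)(x), ..., f_K(w,u)(x) for one input.

    W : (K, m1)  output-layer weights
    U : (m1, d)  hidden-layer weights
    x : (d,)     a single input vector
    """
    h = np.maximum(0.0, U @ x)  # hidden layer: ReLU(U x), shape (m1,)
    return W @ h                # logits, shape (K,)
```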

I am supposed to implement gradient descent with two update rules:

$$w \rightarrow w - \alpha \frac 1 B \sum_{i=1}^B \nabla_w L(y_i, f(w,u)(x_i))$$

$$u \rightarrow u - \alpha \frac 1 B \sum_{i=1}^B \nabla_u L(y_i, f(w,u)(x_i))$$
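In code, I picture one update step roughly like this (only a sketch: `grad_w` and `grad_u` are hypothetical helpers returning the per-example gradients $\nabla_w L$ and $\nabla_u L$, and computing them is exactly what I am asking about):

```python
def sgd_step(W, U, batch_x, batch_y, alpha, grad_w, grad_u):
    """One mini-batch gradient descent update with step size alpha.

    grad_w(W, U, x, y) and grad_u(W, U, x, y) are assumed to return the
    gradients of L(y, f(w,u)(x)) with respect to W and U for one example.
    """
    B = len(batch_x)
    gW = sum(grad_w(W, U, x, y) for x, y in zip(batch_x, batch_y)) / B
    gU = sum(grad_u(W, U, x, y) for x, y in zip(batch_x, batch_y)) / B
    return W - alpha * gW, U - alpha * gU
```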

$$L(y,f(w,u)(x)) = -\ln \left( \frac{e^{f_y(w,u)(x)}}{\sum_{k=1}^K e^{f_k(w,u)(x)}} \right) $$

Expanding the logarithm (and using $\ln e = 1$), I get

$$L(y,f(w,u)(x)) = -f_y(w,u)(x) + \ln\left(\sum_{k=1}^Ke^{f_k(w,u)(x)}\right)$$
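This is also the form I evaluate in code (a sketch; subtracting the maximum logit is just the usual trick to keep the log-sum-exp from overflowing and does not change the value):

```python
import numpy as np

def loss(W, U, x, y):
    """Cross-entropy loss L(y, f(w,u)(x)) in the expanded form above."""
    f = W @ np.maximum(0.0, U @ x)  # logits f_k(w,u)(x), shape (K,)
    m = np.max(f)                   # log sum_k e^{f_k} = m + log sum_k e^{f_k - m}
    log_sum_exp = m + np.log(np.sum(np.exp(f - m)))
    return -f[y] + log_sum_exp
```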

Calculating the gradient:

\begin{align} & \nabla_w L(y,f(w,u)(x)) \\[10pt] = {} & -\nabla_w f_y(w,u)(x) + \frac 1 {\sum_{k=1}^K e^{f_k(w,u)(x)}} \, \nabla_w \sum_{k=1}^Ke^{f_k(w,u)(x)} \\[10pt] = {} & -\nabla_w f_y(w,u)(x) + \frac{1}{\sum_{k=1}^Ke^{f_k(w,u)(x)}} \sum_{k=1}^Ke^{f_k(w,u)(x)} \, \nabla_w f_k(w,u)(x) \end{align}
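Abbreviating the softmax probabilities as $p_k = e^{f_k(w,u)(x)} / \sum_{j=1}^K e^{f_j(w,u)(x)}$, the last line collapses to

$$\nabla_w L(y,f(w,u)(x)) = -\nabla_w f_y(w,u)(x) + \sum_{k=1}^K p_k \, \nabla_w f_k(w,u)(x),$$

and the analogous expression holds for $\nabla_u$. So everything reduces to the per-output gradients $\nabla_w f_k(w,u)(x)$ and $\nabla_u f_k(w,u)(x)$.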

However, I am stuck here as I do not know how to calculate the gradients $\nabla_w f_k(w,u)(x)$ and $\nabla_u f_k(w,u)(x)$, since both $u$ and $w$ are matrices.

My main problem lies with $$\nabla_w f_k(w,u)(x)$$ because here I have to differentiate with respect to $w$ inside $f_k$. However, when I flatten the matrix $w$ into a vector to compute the derivative, the result becomes independent of $k$, which does not make sense. So the overall question is: how can I compute the gradient of a function when not all of the (vectorized) variable actually appears in the function, since only the values of $w$ belonging to its $k$-th row are used in $f_k$?
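To make the indexing explicit, what I am after is the $K \times m_1$ matrix of partial derivatives (writing $k'$ for the row index of $w$ so it is not confused with the fixed output index $k$):

$$\left(\nabla_w f_k(w,u)(x)\right)_{k'i} = \frac{\partial}{\partial w_{k'i}} \sum_{i'=1}^{m_1} w_{ki'} \max\left(0,\sum_{j=1}^d u_{i'j}x_j\right),$$

and it is this dependence on the fixed $k$ versus the free row index $k'$ that I cannot reconcile with the vectorization approach.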

As this is a self-study question, I would greatly appreciate pointers rather than an actual solution.

  • Please see http://stats.stackexchange.com/a/257616/919 for a general account of differentiation of functions of matrices, then let us know if there is anything else you need cleared up. – whuber Feb 09 '17 at 14:57
  • @whuber yes, that helped and I was able to get a solution. However, my solution seems off, as the neural net does not converge. Additionally, the gradient matrix for $w$ is the same for all rows except row $y$. This does not seem right, as not all output neurons are equally badly off. Are there any similar problems online I can use as orientation for solving? – Sim Feb 09 '17 at 17:49
  • Part of your problem might be that $f$ is not everywhere differentiable. If your formula for its derivative (with respect to its second argument) does not make that plain, then the formula is likely incorrect. Indeed, since $f$ is piecewise multilinear (in all three arguments $w$, $u$, and $x$) its derivatives will be remarkably like $f$ itself. :) – whuber Feb 09 '17 at 18:09
  • No, that is not really the problem. I figured I would ignore that for now and try the simpler problem first. However, I cannot for the life of me figure out how to derive $\nabla_w f_k(w,u)(x)$, as the subscript $k$ stays and determines which part of $w$ (the variable I want to take the gradient with respect to) is involved. So vectorizing the matrix does not really help, as I do not know how to deal with that $k$, which obviously stays regardless of what I do. – Sim Feb 10 '17 at 00:57

0 Answers