I have a neural network with one hidden layer with ReLU activation and a linear output layer: $$f_k(w,u)(x) = \sum_{i = 1}^{m_1} w_{ki} \max\left(0,\sum_{j=1}^d u_{ij}x_j\right)$$
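For concreteness, this is how I compute the forward pass in NumPy; I am assuming $u$ is an $m_1 \times d$ matrix, $w$ is a $K \times m_1$ matrix, and $x$ is a length-$d$ vector (the function name and shapes below are my own choices, not part of the assignment):

```python
import numpy as np

def forward(w, u, x):
    """Compute all K outputs f_k(w, u)(x) for a single input x.

    u : (m1, d)  hidden-layer weights
    w : (K, m1)  output-layer weights
    x : (d,)     input vector
    """
    h = np.maximum(0.0, u @ x)   # hidden activations: max(0, sum_j u_ij x_j)
    return w @ h                 # f_k = sum_i w_ki h_i, shape (K,)
```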
I am supposed to implement gradient descent with two update rules:
$$w \rightarrow w - \alpha \frac 1 B \sum_{i=1}^B \nabla_w L(y_i, f(w,u)(x_i))$$
$$u \rightarrow u - \alpha \frac 1 B \sum_{i=1}^B \nabla_u L(y_i, f(w,u)(x_i))$$
$$L(y,f(w,u)(x)) = -\ln \left( \frac{e^{f_y(w,u)(x)}}{\sum_{k=1}^K e^{f_k(w,u)(x)}} \right) $$
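This is the (rough) mini-batch update loop I intend to fill in, using the `forward` function above; `grad_w` and `grad_u` are placeholders for the per-example gradients I am trying to derive, and the loss subtracts the maximum logit only for numerical stability:

```python
def loss(w, u, x, y):
    """Cross-entropy loss -ln(softmax(f)_y) for one example (x, y)."""
    f = forward(w, u, x)
    f = f - np.max(f)                        # stabilise the exponentials
    return -f[y] + np.log(np.sum(np.exp(f)))

def sgd_step(w, u, batch, alpha, grad_w, grad_u):
    """One update of w and u on a mini-batch of (x, y) pairs.

    grad_w(w, u, x, y) and grad_u(w, u, x, y) are placeholders for the
    per-example gradients of L that I still need to derive.
    """
    B = len(batch)
    gw = sum(grad_w(w, u, x, y) for x, y in batch) / B
    gu = sum(grad_u(w, u, x, y) for x, y in batch) / B
    return w - alpha * gw, u - alpha * gu
```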
Expanding the logarithm, the loss simplifies to
$$L(y,f(w,u)(x)) = -f_y(w,u)(x) + \ln\left(\sum_{k=1}^K e^{f_k(w,u)(x)}\right)$$
Calculating the gradient:
\begin{align} & \nabla_w L(y,f(w,u)(x)) \\[10pt] = {} & -\nabla_w f_y(w,u)(x) + \frac 1 {\sum_{k=1}^K e^{f_k(w,u)(x)}} \, \nabla_w \sum_{k=1}^K e^{f_k(w,u)(x)} \\[10pt] = {} & -\nabla_w f_y(w,u)(x) + \frac{1}{\sum_{k=1}^K e^{f_k(w,u)(x)}} \sum_{k=1}^K e^{f_k(w,u)(x)} \, \nabla_w f_k(w,u)(x) \end{align}
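If I write $p_k = \frac{e^{f_k(w,u)(x)}}{\sum_{j=1}^K e^{f_j(w,u)(x)}}$ for the softmax probabilities, this should (unless I am mistaken) reduce to
$$\nabla_w L(y,f(w,u)(x)) = -\nabla_w f_y(w,u)(x) + \sum_{k=1}^K p_k \, \nabla_w f_k(w,u)(x),$$
so everything comes down to the gradients of the individual $f_k$.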
However, I am stuck here, as I do not know how to calculate the gradients $\nabla_w f_k(w,u)(x)$ or $\nabla_u f_k(w,u)(x)$, since both $u$ and $w$ are matrices.
My main problem lies with $$\nabla_w f_k(w,u)(x)$$ as here I have to differentiate with respect to $w$ in $f_k$. However, when I flatten the matrix $w$ into a vector to compute the derivative, the result becomes independent of $k$, which does not make sense. So the overall question is: how can I compute the gradient of a function when not the whole vector appears in the function body, or even exists for that function (as only every $k$-th value of $w$ would be used after vectorizing $w$)?
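Once I do have candidate gradients, my plan is to sanity-check them entry by entry with a finite-difference check along the lines of the sketch below (using the `loss` function above; this is only for verification, not part of the assignment):

```python
def finite_diff_grad_w(w, u, x, y, eps=1e-6):
    """Numerically approximate nabla_w L entry by entry (for checking only)."""
    g = np.zeros_like(w)
    for k in range(w.shape[0]):
        for i in range(w.shape[1]):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[k, i] += eps
            w_minus[k, i] -= eps
            g[k, i] = (loss(w_plus, u, x, y) - loss(w_minus, u, x, y)) / (2 * eps)
    return g
```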
As this is a self-study question, I would greatly appreciate pointers rather than actual solutions.