I know from textbooks that finding the weights of a neural network requires gradient descent, because no closed-form solution is available. However, not understanding exactly why the derivative with respect to the weights cannot simply be set to zero and solved led me to try it myself.
Let's consider the traditional sigmoid MLP, with just one layer and just one datapoint $\langle\mathbf{x},t\rangle$. The gradient vector of the MSE loss function w.r.t. the weights is:
$$\frac{\partial}{\partial\mathbf{w}} \frac{1}{2}\left( t - s(\mathbf{w}\cdot\mathbf{x}) \right)^2$$
which becomes:
$$ = -(t - s(\mathbf{w}\cdot\mathbf{x}))s(\mathbf{w}\cdot\mathbf{x})(1-s(\mathbf{w}\cdot\mathbf{x}))\mathbf{x}$$
with $s(h)$ being the sigmoid function:
$$s(h) = \frac{1}{1+e^{-h}}$$
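As a sanity check on the derivation (see the last question below), here is a minimal numerical sketch comparing the analytic gradient above against central finite differences; the values of $\mathbf{x}$, $t$, and $\mathbf{w}$ are arbitrary placeholders chosen only for the check:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def loss(w, x, t):
    # MSE loss 0.5 * (t - s(w . x))^2 for a single datapoint
    return 0.5 * (t - sigmoid(w @ x)) ** 2

def grad(w, x, t):
    # analytic gradient: -(t - s(w.x)) * s(w.x) * (1 - s(w.x)) * x
    a = sigmoid(w @ x)
    return -(t - a) * a * (1.0 - a) * x

# Arbitrary toy values (assumed, just for the check)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
t = 0.7

# Central finite differences along each coordinate direction
eps = 1e-6
num = np.array([
    (loss(w + eps * e, x, t) - loss(w - eps * e, x, t)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(num, grad(w, x, t), atol=1e-8))  # expect True
```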
Now, how does one solve for the zeroes of the gradient expression?
$$-(t - s(\mathbf{w}\cdot\mathbf{x}))s(\mathbf{w}\cdot\mathbf{x})(1-s(\mathbf{w}\cdot\mathbf{x}))\mathbf{x} = 0$$
What I could do is analyze the individual factors and see where each of them vanishes. The factor $s(\mathbf{w}\cdot\mathbf{x})$ goes to zero only in the limit $\mathbf{w}\cdot\mathbf{x} \to -\infty$, e.g. by sending some component of $\mathbf{w}$ to $-\infty$; likewise, $1-s(\mathbf{w}\cdot\mathbf{x})$ goes to zero only in the limit $\mathbf{w}\cdot\mathbf{x} \to +\infty$. This is not useful.
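To make that factor analysis concrete, here is a small numerical sketch in the scalar case, with arbitrary toy values for $x$ and $t$ (assumed only for illustration), evaluating the sigmoid factors and the full gradient over a grid of weights:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Toy scalar case (assumed values, only for illustration)
x, t = 1.0, 0.7

w = np.linspace(-20, 20, 9)           # grid of candidate weights
a = sigmoid(w * x)                    # s(w*x)
grad = -(t - a) * a * (1.0 - a) * x   # the gradient expression from above

for wi, si, gi in zip(w, a, grad):
    print(f"w = {wi:6.1f}   s(wx) = {si:.6f}   1 - s(wx) = {1 - si:.6f}   grad = {gi: .6e}")
# The factors s(wx) and 1 - s(wx) only approach zero as w*x -> -inf or +inf,
# which is the limiting behaviour described above.
```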
A few questions:
- does the gradient expression even have a zero, even if it cannot be found in closed form?
- what would be the canonical procedure to solve for the zeroes of such a function? (I have experience with the linear and polynomial cases, but I am ignorant of more complicated cases such as this one.)
- can it be demonstrated that there is no closed form?
- what about other non-linear activation functions? Might some lead to closed-form zeroes?
- might it be the case that there is a closed form for an individual datapoint but not for an expression that considers multiple datapoints?
- does restricting $\mathbf{x}$ and $\mathbf{w}$ to dimensionality 1 (scalars) make a difference in finding a closed form?
- (are my derivations correct?)
I would be grateful for any answers, even if only tangentially related, and for any corrections to my procedure and terminology.