Setup
I found a paper that describes a variant of the standard auto-encoder (the contractive auto-encoder), whose cost function includes the following regularization penalty:
$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{ij}{\left( \frac{\partial h_j(x)}{\partial x_i} \right)}^2$$
where $\left|\left|\cdot\right|\right|_F$ is the Frobenius norm, $h$ denotes the hidden units, and $x$ is the input. The paper also gives an alternative form of this equation (for the case where a sigmoid is used for $f$):
$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{i=1}^{d_h}(h_i(1-h_i))^2\sum_{j=1}^{d_x}W^2_{ij}$$
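For concreteness, here is a small numpy sketch (my own, not from the paper) that checks the two forms against each other numerically, assuming $h = f(Wx + b)$ with $W$ of shape $d_h \times d_x$, which is what the paper's indexing $W_{ij}$ suggests:

```python
import numpy as np

# Sanity check (mine, not from the paper): compare the brute-force
# squared Frobenius norm of the Jacobian dh/dx against the closed form.
# Assumes h = sigmoid(W x + b) with W of shape (d_h, d_x), matching the
# paper's indexing W_ij with i = 1..d_h, j = 1..d_x.
rng = np.random.default_rng(0)
d_x, d_h = 5, 3
W = rng.normal(size=(d_h, d_x))
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
h = sigmoid(W @ x + b)  # hidden units, shape (d_h,)

# Brute force: finite-difference Jacobian, then sum of squared entries.
eps = 1e-6
J = np.zeros((d_h, d_x))
for j in range(d_x):
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    J[:, j] = (sigmoid(W @ xp + b) - sigmoid(W @ xm + b)) / (2 * eps)
frob_fd = np.sum(J ** 2)

# Closed form from the paper: sum_i (h_i(1-h_i))^2 * sum_j W_ij^2
frob_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

print(frob_fd, frob_closed)  # the two values should agree closely
```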
Question 1
As usual, no actual derivation of the second form is given in the paper. I attempted to derive it myself, but would greatly appreciate it if someone could check my work and let me know what mistakes I might have made.
$$ a(x) = W^T x + b $$ $$ h(x) = f(a(x)) $$ $$ f(x) = \frac{1}{1+e^{-x}} $$
Then, using the chain rule:
$$ \frac{\partial h(x)}{\partial x} = \frac{\partial h(x)}{\partial a(x)} \frac{\partial a(x)}{\partial x} $$
Using the standard derivative of the sigmoid: $$ \frac{\partial h(x)}{\partial a(x)} = f(a(x))(1 - f(a(x))) = h(1- h) $$
and:
$$ \frac{\partial a(x)}{\partial x} = W $$
Thus, finally:
$$ \frac{\partial h(x)}{\partial x} = h(1-h)W $$
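To have something concrete to test this step against, here is a finite-difference sketch (again my own, not from the paper), written under my $a(x) = W^T x + b$ convention. The `J_analytic` line encodes one possible reading of $h(1-h)W$ (each row $i$ of the Jacobian scaled by $h_i(1-h_i)$), which the numerical derivatives can confirm or refute:

```python
import numpy as np

# Finite-difference check of dh/dx (my own sketch). J_analytic encodes
# the element-wise reading: row i of the Jacobian is scaled by h_i(1-h_i).
# Uses my convention a(x) = W^T x + b, so W has shape (d_x, d_h) and the
# Jacobian dh/dx has shape (d_h, d_x).
rng = np.random.default_rng(1)
d_x, d_h = 4, 3
W = rng.normal(size=(d_x, d_h))
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
h = sigmoid(W.T @ x + b)

# Candidate analytic Jacobian under the element-wise reading.
J_analytic = (h * (1 - h))[:, None] * W.T

# Numerical Jacobian by central differences.
eps = 1e-6
J_fd = np.zeros((d_h, d_x))
for j in range(d_x):
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    J_fd[:, j] = (sigmoid(W.T @ xp + b) - sigmoid(W.T @ xm + b)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_fd)))  # should be near zero
```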
Question 2
The problem I run into is that I'm not entirely sure how to take the derivative of $\left(\frac{\partial h(x)}{\partial x}\right)^2$ with respect to $W$ in order to get the gradient. If I square my previous result, I should get:
$$ \frac{\partial \left[h(1-h)\right]^2W^2}{\partial W} $$
which I would think would give:
$$ 2\left[h(1-h)\right]^2 W $$
And I'm having difficulty interpreting this. I am a bit confused about whether $h(1-h)$ is then a dot product. If so, does that just give me a scalar multiplied by $W$? If not, and it's an element-wise multiplication, then I think the dimensionality would be all wrong.
Or perhaps I did all of this incorrectly. Any help would be greatly appreciated!
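To make my question concrete, here is a finite-difference sketch (again my own) that computes the gradient of the penalty with respect to $W$ numerically; whatever the correct closed form is, it should match this output, so my $2\left[h(1-h)\right]^2 W$ guess could be checked against it:

```python
import numpy as np

# Numerical gradient of the penalty ||J_f(x)||_F^2 with respect to W,
# by central finite differences (my own sketch, for checking candidate
# closed forms such as the 2[h(1-h)]^2 W guess above).
rng = np.random.default_rng(2)
d_x, d_h = 4, 3
W = rng.normal(size=(d_x, d_h))  # my convention: a(x) = W^T x + b
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def penalty(W):
    h = sigmoid(W.T @ x + b)
    J = (h * (1 - h))[:, None] * W.T  # Jacobian dh/dx, element-wise reading
    return np.sum(J ** 2)

eps = 1e-6
grad = np.zeros_like(W)
for i in range(d_x):
    for j in range(d_h):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad[i, j] = (penalty(Wp) - penalty(Wm)) / (2 * eps)

print(grad)  # compare against any closed-form candidate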