Setup
I found a paper that describes a variant of the standard auto-encoder (the contractive auto-encoder), whose cost function includes the following regularization penalty:
$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{ij}{\left( \frac{\partial h_j(x)}{\partial x_i} \right)}^2$$
where $\left|\left|\cdot\right|\right|_F$ is the Frobenius norm, $h$ denotes the hidden units, and $x$ is the input. The paper also gives an alternative form of this equation (for the case where a sigmoid is used for $f$):
$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{i=1}^{d_h}(h_i(1-h_i))^2\sum_{j=1}^{d_x}W^2_{ij}$$
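For concreteness, here is a small numpy sketch (my own, not from the paper) that checks the two forms against each other numerically, assuming $h = f(Wx + b)$ with $W$ of shape $d_h \times d_x$, which is what the paper's indexing $W_{ij}$ suggests:

```python
import numpy as np

# Sanity check (mine, not from the paper): compare the brute-force
# squared Frobenius norm of the Jacobian dh/dx against the closed form.
# Assumes h = sigmoid(W x + b) with W of shape (d_h, d_x), matching the
# paper's indexing W_ij with i = 1..d_h, j = 1..d_x.
rng = np.random.default_rng(0)
d_x, d_h = 5, 3
W = rng.normal(size=(d_h, d_x))
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
h = sigmoid(W @ x + b)  # hidden units, shape (d_h,)

# Brute force: finite-difference Jacobian, then sum of squared entries.
eps = 1e-6
J = np.zeros((d_h, d_x))
for j in range(d_x):
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    J[:, j] = (sigmoid(W @ xp + b) - sigmoid(W @ xm + b)) / (2 * eps)
frob_fd = np.sum(J ** 2)

# Closed form from the paper: sum_i (h_i(1-h_i))^2 * sum_j W_ij^2
frob_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

print(frob_fd, frob_closed)  # the two values should agree closely
```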
Question 1
As usual, no actual derivation of the second form is given in the paper. I attempted to derive it myself, but would greatly appreciate it if someone could check my work and let me know what mistakes I might have made.
$$ a(x) = W^T x + b $$ $$ h(x) = f(a(x)) $$ $$ f(x) = \frac{1}{1+e^{-x}} $$
Then, using the chain rule:
$$ \frac{\partial h(x)}{\partial x} = \frac{\partial h(x)}{\partial a(x)} \frac{\partial a(x)}{\partial x} $$
Using the standard derivative of the sigmoid: $$ \frac{\partial h(x)}{\partial a(x)} = f(a(x))(1 - f(a(x))) = h(1- h) $$
and:
$$ \frac{\partial a(x)}{\partial x} = W $$
Thus, finally:
$$ \frac{\partial h(x)}{\partial x} = h(1-h)W $$
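To have something concrete to test this step against, here is a finite-difference sketch (again my own, not from the paper), written under my $a(x) = W^T x + b$ convention. The `J_analytic` line encodes one possible reading of $h(1-h)W$ (each row $i$ of the Jacobian scaled by $h_i(1-h_i)$), which the numerical derivatives can confirm or refute:

```python
import numpy as np

# Finite-difference check of dh/dx (my own sketch). J_analytic encodes
# the element-wise reading: row i of the Jacobian is scaled by h_i(1-h_i).
# Uses my convention a(x) = W^T x + b, so W has shape (d_x, d_h) and the
# Jacobian dh/dx has shape (d_h, d_x).
rng = np.random.default_rng(1)
d_x, d_h = 4, 3
W = rng.normal(size=(d_x, d_h))
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
h = sigmoid(W.T @ x + b)

# Candidate analytic Jacobian under the element-wise reading.
J_analytic = (h * (1 - h))[:, None] * W.T

# Numerical Jacobian by central differences.
eps = 1e-6
J_fd = np.zeros((d_h, d_x))
for j in range(d_x):
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    J_fd[:, j] = (sigmoid(W.T @ xp + b) - sigmoid(W.T @ xm + b)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_fd)))  # should be near zero
```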
Question 2
The problem I run into is that I'm not entirely sure how to take the derivative of $\left(\frac{\partial h(x)}{\partial x}\right)^2$ with respect to $W$ in order to get the gradient. If I square my previous result, I should get:
$$ \frac{\partial \left[h(1-h)\right]^2W^2}{\partial W} $$
which I would think would give:
$$ 2\left[h(1-h)\right]^2 W $$
And I'm having difficulty interpreting this. I am a bit confused about whether $h(1-h)$ is then a dot product. If so, does that just give me a scalar multiplied by $W$? If not, and it's an element-wise multiplication, then I think the dimensionality would be all wrong.
Or perhaps I did all of this incorrectly. Any help would be greatly appreciated!
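To make my question concrete, here is a finite-difference sketch (again my own) that computes the gradient of the penalty with respect to $W$ numerically; whatever the correct closed form is, it should match this output, so my $2\left[h(1-h)\right]^2 W$ guess could be checked against it:

```python
import numpy as np

# Numerical gradient of the penalty ||J_f(x)||_F^2 with respect to W,
# by central finite differences (my own sketch, for checking candidate
# closed forms such as the 2[h(1-h)]^2 W guess above).
rng = np.random.default_rng(2)
d_x, d_h = 4, 3
W = rng.normal(size=(d_x, d_h))  # my convention: a(x) = W^T x + b
b = rng.normal(size=d_h)
x = rng.normal(size=d_x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def penalty(W):
    h = sigmoid(W.T @ x + b)
    J = (h * (1 - h))[:, None] * W.T  # Jacobian dh/dx, element-wise reading
    return np.sum(J ** 2)

eps = 1e-6
grad = np.zeros_like(W)
for i in range(d_x):
    for j in range(d_h):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad[i, j] = (penalty(Wp) - penalty(Wm)) / (2 * eps)

print(grad)  # compare against any closed-form candidate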