
I am learning the GRU model in deep learning and reading this article, where the details of BPTT are explained. Towards the end the author explains the values of the partial derivative $\frac{\partial h_i}{\partial h_{i-1}}$ in 3 cases: (1) $z = 1$; (2) $z = 0, r = 0$; and (3) $z = 0, r = 1$.

Here $z,r$ are update and reset gates respectively.

To clarify the notation and provide some details, the loss gradient at step $t$ with respect to $U_z$, the input-to-hidden weight in the update gate $z_t$, has the form \begin{align} \frac{\partial L_t}{\partial U_z} &= \frac{\partial L_t}{\partial h_t} \sum_{k=1}^t \left\{ \left(\prod_{i=k+1}^t \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial^+ h_k}{\partial U_z} \right\} \end{align} To keep the notation simple I will drop the matrix transposes and Hadamard products in the partial derivative below: \begin{align} \frac{\partial h_i}{\partial h_{i-1}} &= \frac{\partial h_i}{\partial \tilde{h}_i} \frac{\partial \tilde{h}_i}{\partial h_{i-1}} + \frac{\partial h_i}{\partial z_i} \frac{\partial z_i}{\partial h_{i-1}} + \frac{\partial^{+} h_i}{\partial h_{i-1}} \\ &= (1-z_i) \left\{ W_r W_h (1- \tilde{h}_i^2) h_{i-1} r_i (1- r_i) + W_h(1- \tilde{h}_i^2) r_i \right\} \\ &\quad + W_z (h_{i-1}- \tilde{h}_i) z_i (1-z_i) + z_i \end{align} where $h_i$ is the hidden state at step $i$ and $\tilde{h}_i$ is the current memory of the network at step $i$.
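To make the Jacobian above concrete, here is a small numerical sanity check in the scalar case (so the dropped transposes and Hadamard products don't matter). The convention is that $W_\ast$ multiplies $h_{i-1}$ and $U_\ast$ multiplies $x_i$; all parameter and input values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up scalar GRU parameters: W_* multiply h_{i-1}, U_* multiply x_i.
W_z, W_r, W_h = 0.5, -0.3, 0.8
U_z, U_r, U_h = 0.2, 0.7, -0.4

def gru_step(x, h_prev):
    z = sigmoid(U_z * x + W_z * h_prev)               # update gate
    r = sigmoid(U_r * x + W_r * h_prev)               # reset gate
    h_tilde = np.tanh(U_h * x + W_h * (r * h_prev))   # candidate memory
    h = z * h_prev + (1 - z) * h_tilde                # new hidden state
    return h, z, r, h_tilde

def jacobian_formula(x, h_prev):
    """dh_i/dh_{i-1} as written in the expression above (scalar case)."""
    _, z, r, h_tilde = gru_step(x, h_prev)
    return ((1 - z) * (W_r * W_h * (1 - h_tilde**2) * h_prev * r * (1 - r)
                       + W_h * (1 - h_tilde**2) * r)
            + W_z * (h_prev - h_tilde) * z * (1 - z)
            + z)

# Central finite difference in h_{i-1}; the two printed numbers should agree.
x, h_prev, eps = 0.3, -0.6, 1e-6
numeric = (gru_step(x, h_prev + eps)[0] - gru_step(x, h_prev - eps)[0]) / (2 * eps)
print(jacobian_formula(x, h_prev), numeric)
```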

The 3 cases mentioned earlier are explained in more detail here. On slide 4 the author mentions that units with short-term dependencies have active reset gates $r_t$, and units with long-term dependencies have active update gates $z_t$.

I would like some help with:
(1). Understanding the status of the gates in relation to short/long-term dependencies.
(2). In the 2nd case, when $z=r=0$, $\frac{\partial h_i}{\partial h_{i-1}}$ is $0$. Why does $\frac{\partial L_t}{\partial U_z}$ still avoid shrinking to zero?

siegfried

1 Answer


(1). Understanding the status of the gates in relation to short/long-term dependencies.

An active update gate ($z_t$) means that the information inside $h_{t-1}$ is propagated to $h_t$, because the update equation is $h_t=z_t h_{t-1} + (1-z_t)\tilde h_t$. So the information travels through the time steps. This means long-term dependency.
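Concretely, in the scalar notation of the question, setting $z_i = 1$ makes the $(1-z_i)$ and $z_i(1-z_i)$ factors vanish, leaving \begin{align} \frac{\partial h_i}{\partial h_{i-1}} = z_i = 1, \end{align} so both the state and the gradient flowing back through it pass through that step unchanged (this is case (1) in the question).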

An active reset gate ($r_t$) means that $\tilde h_t$ has information from the previous time step integrated into it, due to the equation $\tilde h_t=\tanh(Wx_t+r_tUh_{t-1})$. At the same time, if $z_t$ is not very active, this information will be conveyed to the next steps, but it will die out quickly. That is short-term dependency.
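To see the "die out quickly" part, here is a toy simulation (not from the original article): a scalar GRU with made-up weights, the input held at zero, and the gates clamped to fixed values instead of being computed, so that whatever is left in $h$ is purely the influence of the initial state $h_0$:

```python
import numpy as np

def run_gru(h0, steps, z, r, W_h=0.5):
    """Scalar GRU with both gates clamped to constants z and r (illustration only)."""
    h, trace = h0, [h0]
    for _ in range(steps):
        h_tilde = np.tanh(W_h * (r * h))   # candidate memory; the input is zero
        h = z * h + (1 - z) * h_tilde      # update equation from above
        trace.append(h)
    return trace

long_term = run_gru(1.0, 10, z=0.95, r=0.5)    # active update gate
short_term = run_gru(1.0, 10, z=0.05, r=0.95)  # active reset gate, inactive update gate

print([round(v, 3) for v in long_term])   # h0's influence decays slowly
print([round(v, 3) for v in short_term])  # h0's influence is gone within a few steps
```

With the update gate active, most of $h_0$ survives the 10 steps; with only the reset gate active, $h_0$ is repeatedly squashed through the $\tanh$ and fades after a handful of steps.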

(2). In the 2nd case, when $z=r=0$, $\frac{\partial h_i}{\partial h_{i-1}}$ is $0$. Why does $\frac{\partial L_t}{\partial U_z}$ still avoid shrinking to zero?

The derivative is not zero. Even in your formula, it looks like it involves $W_h$, though I'm not sure of the exact form. More importantly, since $z_t$ and $r_t$ also depend on $h_{t-1}$, the derivative does not vanish, at least not in a simple way.
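As a rough sketch in the scalar notation of the question (so take the exact form with a grain of salt): for $z_i$ and $r_i$ close to, but not exactly, $0$, keeping only the terms of first order in the gate values gives \begin{align} \frac{\partial h_i}{\partial h_{i-1}} \approx W_h (1-\tilde{h}_i^2)\,(1 + W_r h_{i-1})\, r_i + W_z (h_{i-1}-\tilde{h}_i)\, z_i + z_i, \end{align} which is small but not zero. It would vanish identically only if the gates were constants pinned at exactly $0$, and a sigmoid gate never outputs exactly $0$.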

gunes
  • Thanks for the answer.
    I have thought about the reset gate when the update gate is not active. When the reset gate is active, $r_t = 1$, info in $h_{t-1}$ and new info in $x_t$ are both passed on to $\tilde{h}_t$. With $z_t = 0$ they are further passed on to $h_t$. How do they 'die out quickly'?
    When the reset gate is not active, only the new info in $x_t$ is passed on to $\tilde{h}_t$, and $h_t$ receives only the new info. I find this sounds more like 'capturing a short-term dependency'.
    – siegfried Dec 15 '21 at 11:41
  • Dependencies aside, does the GRU alleviate vanishing gradients with the additive partial derivatives instead of the multiplicative ones in BPTT? I have read that in the LSTM, addition is better behaved and the learnable gates will adjust to avoid the loss gradient shrinking to 0. – siegfried Dec 15 '21 at 11:45
  • Depending on the inputs is neither a short- nor a long-term dependency, because it's the input, not the memory. When $z_t=1$, the effect of the past on the current state is much more prevalent than when $z_t=0$. If $z_t$ is active, most of the information is conveyed through the steps, but this is not the case for $z_t=0$. Yes, the past state is blended with the input, but over the time steps it has been blended with the inputs so many times that it's not possible to talk about any long-term dependency. – gunes Dec 16 '21 at 22:10
  • About the learnable gates, yes. If the gates were not learnable, you wouldn't have the second term in your differentiation, since $z_t$ would be constant. And the whole expression would be $0$ if $z_t=0, r_t=1$ (vanilla RNN), but it's not when $z_t$ depends on something. – gunes Dec 16 '21 at 22:12