I thought I knew how RNNs work, but when I tried to actually implement one myself, I ran into some issues. For one, how do we deal with the initial hidden state?
At the very beginning we just create a zero vector of some length, which is then used to compute the next hidden state, and this goes on until we have traversed all timesteps.
Now, this is for one iteration. What happens in the next iterations?
When we get new inputs, should we still be feeding the same vector of zeros to the network? This doesn't seem right, since in the backprop stage it seems we never update h0! I'm confused here.
If we always feed the same zero vector, it just nullifies all the updates we have made to the hidden states so far! So what needs to be done with the initial state?
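To make the setup concrete, here is a minimal sketch of the forward loop I have in mind (plain NumPy; the sizes and names are just illustrative):

```python
import numpy as np

hidden_size, input_size, seq_len = 8, 4, 10
Wxh = np.random.randn(hidden_size, input_size) * 0.01   # input-to-hidden weights
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden-to-hidden weights
bh = np.zeros(hidden_size)

xs = np.random.randn(seq_len, input_size)  # one input sequence

h = np.zeros(hidden_size)                  # h0: the initial hidden state in question
for x in xs:
    h = np.tanh(Wxh @ x + Whh @ h + bh)    # each h_t depends on h_{t-1}
# ...then compute outputs/loss and backprop through the loop
```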
-
You can just learn the initial hidden state. So it should be randomly initialized like all the other weights and updated during backpropagation. – sjw Mar 03 '19 at 16:36
-
There are two common RNN strategies. (1) You have a long sequence that's always contiguous (for example, a language model that's trained on the text of *War and Peace*); because the novel's words all have a very specific order, you have to train it on consecutive sequences, so the hidden state at the *last* hidden state of the *previous* sequence is used as the *initial* hidden state of the *next* sequence. (2) You have lots of related, but distinct sequences (such as discrete tweets); it can make sense to start each sequence with hidden states of all 0s. Which applies to you? – Sycorax Mar 03 '19 at 17:32
-
Thank you very much, guys, I really appreciate it. @Sycorax: it makes sense now, thanks a lot :) So both strategies are OK! I don't have any examples at the moment; I'm just trying to implement one, so that I can run and play with different examples when it's finished. – Hossein Mar 03 '19 at 18:11
1 Answer
There are two common RNN strategies.
You have a long sequence that's always contiguous (for example, a language model that's trained on the text of War and Peace); because the novel's words all have a very specific order, you have to train it on consecutive sequences, so the last hidden state of the previous sequence is used as the initial hidden state of the next sequence.
The way most people do this is to traverse the sequences in order, without shuffling. Suppose you use a mini-batch size of 2. You can cut the book in half: the first sample will always have text from the first half of War and Peace and the second sample will always have text from the second half. Instead of drawing samples at random, the text is always read in order, so the first sample in the first mini-batch has the first words of the text, and the second sample in the first mini-batch has the first words after the mid-point of the text.
Purely abstractly, I suppose you could do something more complicated where you shuffle the data but can compute the initial hidden state for each position in the sequence (e.g. by computing the text up until that point, or else saving & restoring states) but this sounds expensive.
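Here is a rough sketch of that bookkeeping for the contiguous case, assuming PyTorch; the shapes, the truncation length `bptt`, and the loss are placeholders rather than a complete language model:

```python
import torch
import torch.nn as nn

b, l, input_size, hidden_size, bptt = 2, 1000, 16, 32, 35

# One long contiguous series, cut into b parallel streams of length l.
series = torch.randn(b * l, input_size)
data = series.view(b, l, input_size)        # row i holds the i-th contiguous chunk

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
h = torch.zeros(1, b, hidden_size)          # zeros only for the very first mini-batch

for t in range(0, l - bptt, bptt):
    x = data[:, t:t + bptt]                 # consecutive slices, never shuffled
    target = data[:, t + 1:t + bptt + 1]    # e.g. predict the next time step
    out, h = rnn(x, h)
    h = h.detach()                          # carry the state forward, but cut the gradient here
    # loss = criterion(out, target); loss.backward(); optimizer.step(); ...
```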
You have lots of distinct sequences (such as discrete tweets); it can make sense to start each sequence with hidden states of all 0s. Some people prefer to train a "baseline" initial state (sjw's suggestion in the comments). I read an article advocating doing this if your data has lots of short sequences, but I can't find the article now.
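If you want the trained "baseline" initial state, one way to do it (again a sketch, assuming PyTorch; `RNNWithLearnedInit` is just an illustrative name) is to store h0 as a parameter and expand it to the batch size, so it gets gradients like any other weight:

```python
import torch
import torch.nn as nn

class RNNWithLearnedInit(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # The initial hidden state is a weight like any other;
        # backprop updates it along with the recurrent weights.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):                    # x: (batch, seq_len, input_size)
        h0 = self.h0.expand(-1, x.size(0), -1).contiguous()
        return self.rnn(x, h0)
```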
Which strategy is appropriate depends on the problem, and specific choices about how to represent that problem.
From the perspective of developing software, an ideal implementation would somehow expose both options to users. This can be tricky, and different software (PyTorch, TensorFlow, Keras) achieves it in different ways.
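For instance, in PyTorch the hidden state is an explicit input and output of the recurrent module, so the caller chooses between the two strategies; a small illustration (shapes arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
x1 = torch.randn(4, 35, 16)
x2 = torch.randn(4, 35, 16)

# Distinct sequences: omit h_0 and the state starts at zeros.
out, _ = rnn(x1)

# Contiguous sequences: feed the returned state back in.
out, h = rnn(x1)
out, h = rnn(x2, h.detach())
```

Keras, by contrast, bakes the choice into the layer itself via its `stateful` flag.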

-
One more question. When using mini-batches, the second scenario becomes a bit complex. Suppose I have a batch size of 3. At first they all use h0 as a vector of 0s. For the second round, the next 3 inputs will have the h_t of the previous inputs: sample #4 gets the h_t of sample #1, and not #3; sample #5 gets the h_t of sample #2, etc. Doesn't this cause a problem? Should we be interleaving the sequences so they arrive in the right order? What about shuffling? Should we avoid it since it destroys the exact order of the sequences? – Hossein Mar 03 '19 at 19:05
-
In the second scenario, each sequence (e.g. tweet) is its own atomic item, unrelated to the items before and after, so *every* item starts with an initial state that's 0s. – Sycorax Mar 03 '19 at 19:22
-
Yeah, exactly, but I was referring to your 'War and Peace' example, where all of those sequences are closely related. – Hossein Mar 03 '19 at 19:34
-
@Breeze The way most people solve this is that you'll have to traverse the sequences *in order*, and not shuffle. Suppose you use minibatch size of 2. You can cut the book in half, and the first sample will always have text from the first half of *War and Peace* and the second sample will always have text from the second half. Instead of using samples at random, the text is always read in order, so the first sample in the first minibatch has the first words of the text, and the second sample in the first minibatch has the first words *after the mid-point* of the text. – Sycorax Mar 03 '19 at 19:52
-
Purely abstractly, I suppose you could do something more complicated where you shuffle the data but can compute the initial hidden state for each position in the sequence (e.g. by computing the text up until that point, or else saving & restoring states) but this sounds expensive. – Sycorax Mar 03 '19 at 19:55
-
All of this knowledge was garnered because I wanted to make a Twitter bot to automatically generate almost-intelligible tweets. – Sycorax Mar 03 '19 at 19:57
-
"and the second sample in the first mini-batch has the first words after the mid-point of the text." - I THINK you mean the first _sample_ of the _second_ mini-batch has the first words after the mid-point of the text, is that right? – Creatron Mar 07 '19 at 19:33
-
@Creatron No, I mean what I wrote. The way I keep this straight in my head is to imagine that the data are laid out in a matrix $M$ of shape $(b, l)$ where $b$ is the batch and $l$ is the length of each contiguous segment of the time series. If you have batch size 4 and 4 years of data, then $l$ is one year. So you make mini-batches by taking contiguous slices of the columns of $M$, starting at the first column and moving some number of time-steps "to the right." – Sycorax Mar 07 '19 at 19:43
-
@Creatron If it helps to think in terms of pointers, you can think of the series as having total length $L$, but you're shoving it into a matrix with $b \times l \le L$ elements, so you're indexing the time steps using the modulus $l$. – Sycorax Mar 07 '19 at 19:50
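In code, that indexing looks something like this (NumPy, purely illustrative):

```python
import numpy as np

L, b = 12, 4            # total series length and batch size
l = L // b              # length of each contiguous segment
x = np.arange(L)        # stand-in for the time series
M = x.reshape(b, l)     # row i holds time steps i*l .. i*l + l - 1

t = 7
assert M[t // l, t % l] == x[t]   # time step t sits at row t // l, column t % l
```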
-