
To clarify the slightly ambiguous language in the title: I have an RNN (actually two stacked RNN layers) that takes input X of size

X [batch_size, sequence_length, features]

The model uses the sequence_length timesteps to predict a single-value output y_hat of size

y_hat [batch_size]

My sequence_length is set to a fixed size, but some of my sequences are longer than this. I have seen people solve this by splitting the sequence into chunks of size sequence_length and passing them in one after the other, with the RNN state initialized to the final state of the previous chunk. This I understand. What is not clear to me is what the target (y) is for all but the final chunk, how the error is backpropagated, and how the weights are updated.
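For concreteness, here is a minimal PyTorch sketch of the chunked forward pass I mean (the sizes are arbitrary, and the placement of detach() is my assumption about how the state carryover is usually done):

    import torch
    import torch.nn as nn

    batch_size, seq_len, features, hidden = 4, 5, 16, 32   # arbitrary sizes

    rnn = nn.RNN(features, hidden, num_layers=2, batch_first=True)
    head = nn.Linear(hidden, 1)

    x = torch.randn(batch_size, 3 * seq_len, features)  # sequence longer than seq_len

    h = None  # hidden state carried across chunks
    for start in range(0, x.size(1), seq_len):
        chunk = x[:, start:start + seq_len, :]
        out, h = rnn(chunk, h)   # initialized with the previous chunk's final state
        h = h.detach()           # stop gradients at the chunk boundary

    y_hat = head(out[:, -1, :]).squeeze(-1)  # [batch_size], from the final chunk only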

If I give an example just to make this clear, in the case of a language model I may have an input like the following:

hello i am generating an example and our target is interest

If our max length was 5, then we'd split up the sequence into:

chunk 1: X_1 = hello i am generating an, y_1 = ??

chunk 2: X_2 = example and our target is, y_2 = interest

If both the first and the second chunk are given the final word as the target y (interest), wouldn't we effectively be training the network that interest is the next word after the first subsequence above, as well as after the second?

Usherwood

1 Answer


With recurrent neural networks you don't need to split the data into chunks. An RNN processes the sequence step by step, so in principle the model can handle sequences of any (and varying) length. In some cases you can split the data into chunks to simplify the code and for computational reasons, but that does not apply here. Chunking makes sense when your aim is to predict the next value given the previous ones: by processing the data in chunks you merely shorten the history taken into account for each prediction. In your case you seem to be using the whole sequence to predict a single value, so the only thing you could do is truncate the series and ignore the beginning, i.e. use only the final chunk (see the sketch below). So you are right that there is nothing to back-propagate when a chunk has no target variable.
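As a minimal sketch of that truncation, assuming PyTorch and batch-first tensors (the sizes are made up):

    import torch

    batch_size, total_len, features, sequence_length = 4, 23, 16, 5
    x = torch.randn(batch_size, total_len, features)

    # Keep only the last sequence_length steps; the beginning is simply ignored,
    # so the single target y applies to the one chunk the model actually sees.
    x_truncated = x[:, -sequence_length:, :]
    print(x_truncated.shape)  # torch.Size([4, 5, 16])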

Tim
  • Thanks for the reply. I may have oversimplified my example somewhat: I am actually implementing a transfer-learning approach that starts with a language model predicting the next word, then moves on to taking input documents and performing some other classification task (e.g. sentiment analysis). I'm following the work in this paper: https://arxiv.org/abs/1801.06146. – Usherwood Dec 03 '18 at 18:59
  • They mention using what they refer to as BPTT for Text Classification (BPT3C), but give very little detail on the topic, and the maths isn't clear to me. In this use case, only using the last part of the sequence could lead to poor results, since the key words in a long document that carry the sentiment could be near the start. – Usherwood Dec 03 '18 at 18:59
  • @Usherwood how would splitting the data into chunks solve this? If you're going to predict sentiment, then by splitting the data into chunks you'd predict sentiment per chunk while *ignoring* the other chunks. – Tim Dec 03 '18 at 19:23
  • That's the entire crux of the question. I have to pass the sequences in chunks, as that's how the underlying language model is trained. However, the point is that all of the chunks are related, so we want to calculate the error on the final prediction and backpropagate and update the weights accordingly. – Usherwood Dec 03 '18 at 19:32
  • As per the paper: "We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction." (A rough sketch of one reading of this appears after these comments.) – Usherwood Dec 03 '18 at 19:33
  • 1
    @Usherwood I didn't go that deep into ULMFiT. From the quote it follows that the batches are processed in such way that the information propagates. This is not the case if you just splitted your data into chunks and trained like this, since then you'd train on single chunk, so at each step you need information for updating the parameters, so you can't have chunks with no labels. – Tim Dec 03 '18 at 21:01
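A loose PyTorch sketch of the quoted BPT3C procedure (one reading of the passage, not the authors' code; all names and sizes are illustrative): each chunk is fed through the RNN with the hidden state carried over, the per-chunk outputs are kept in the autograd graph, and a single classification loss over the concat-pooled states is backpropagated through every chunk that contributed.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    batch_size, chunk_len, features, hidden, n_classes = 4, 5, 16, 32, 2

    rnn = nn.RNN(features, hidden, num_layers=2, batch_first=True)
    clf = nn.Linear(3 * hidden, n_classes)  # concat pooling: [last; mean; max]

    doc = torch.randn(batch_size, 4 * chunk_len, features)  # one long document
    labels = torch.zeros(batch_size, dtype=torch.long)      # dummy sentiment labels

    h = None
    outputs = []
    for start in range(0, doc.size(1), chunk_len):
        out, h = rnn(doc[:, start:start + chunk_len, :], h)
        outputs.append(out)  # kept in the graph, so gradients can reach this chunk

    full = torch.cat(outputs, dim=1)             # [batch, total_len, hidden]
    pooled = torch.cat([full[:, -1, :],          # final hidden state
                        full.mean(dim=1),        # mean pooling over time
                        full.max(dim=1).values], # max pooling over time
                       dim=1)                    # [batch, 3 * hidden]

    loss = F.cross_entropy(clf(pooled), labels)
    loss.backward()  # one loss; gradients flow back through every stored chunk

In the paper's setting memory limits force detaching the hidden state for all but the most recent batches; the sketch above keeps everything in the graph for simplicity.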