
I've seen statements like this a lot in the literature: "we used x LSTM cells in our implementation". I don't understand the point of using several stacked LSTMs: why isn't a single cell enough, given that it already takes in the cell state and the hidden state from the previous time step?

For example, see page 4 of this paper: https://arxiv.org/pdf/1612.04928.pdf

I can see the advantage of running two cells in parallel, but not of stacking them.

Tiffany

1 Answer


A single layer has only one cell, unrolled across the time steps; for more information, read this. Stacking multiple LSTM layers lets each layer operate on the sequence of hidden states produced by the layer below, so the higher layers can extract progressively more abstract features from the input. I think this question and this answer explain the issue in detail.
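To make the stacking concrete, here is a minimal sketch (assuming PyTorch; the tensor sizes are illustrative and not taken from the paper). With `num_layers=2`, the hidden-state sequence emitted by the first layer becomes the input sequence of the second layer, which is exactly the stacking the question asks about:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the thread).
batch, seq_len, input_size, hidden_size = 8, 20, 32, 64
x = torch.randn(batch, seq_len, input_size)

# Single layer: one LSTM cell unrolled over the 20 time steps.
single = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
out1, _ = single(x)            # out1: (8, 20, 64), hidden states at every step

# Stacked: layer 2 reads the hidden-state sequence that layer 1 emits,
# so it operates on learned features rather than on the raw inputs.
stacked = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
out2, (h_n, c_n) = stacked(x)
print(h_n.shape)               # torch.Size([2, 8, 64]): one final hidden state per layer
```

Note that (dropout aside) `num_layers=2` is equivalent to feeding `out1` into a second `nn.LSTM(hidden_size, hidden_size)`: the depth is across layers at each time step, orthogonal to the recurrence across time within each layer.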

Lerner Zhang
  • Also might want to point to Graves' [seminal paper on stacked LSTMs for speech recognition](https://arxiv.org/pdf/1303.5778.pdf): "If LSTM is used for the hidden layers we get deep bidirectional LSTM, the main architecture used in this paper. As far as we are aware this is the first time *deep* LSTM has been applied to speech recognition, and we find that it yields a dramatic improvement over single-layer LSTM." ([Graves et al., 2013](http://ieeexplore.ieee.org/abstract/document/6638947/)) – fnl Sep 08 '17 at 09:38