I often see statements like this in the literature: "we used x LSTM cells in our implementation". I don't understand the point of stacking several LSTMs: why isn't a single cell enough, given that it already receives the cell state and the hidden state from the previous time step?
For example, see page 4 of this paper: https://arxiv.org/pdf/1612.04928.pdf
I see the advantage of running two cells in parallel, but not the advantage of stacking them.
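
To make concrete what I mean by "stacking", here is a minimal pure-Python sketch of my understanding. It is not taken from the paper: the `LSTMCell` class, the `run_stacked` helper, the sizes, and the random weights are all hypothetical and only for illustration. At each time step, the hidden state of layer k is fed as the input to layer k+1, while every layer keeps its own cell state and hidden state across time.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Toy LSTM cell with random weights (hypothetical, illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and one bias vector per gate:
        # i = input gate, f = forget gate, g = candidate, o = output gate.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifgo"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_stacked(cells, sequence):
    """Run a list of cells as a stack; return the top layer's h at each step."""
    # Each layer carries its own (h, c) pair across time steps.
    states = [([0.0] * cell.hidden_size, [0.0] * cell.hidden_size)
              for cell in cells]
    outputs = []
    for x in sequence:
        layer_input = x
        for k, cell in enumerate(cells):
            h, c = cell.step(layer_input, *states[k])
            states[k] = (h, c)
            layer_input = h  # hidden state of layer k is the input of layer k+1
        outputs.append(layer_input)
    return outputs


rng = random.Random(0)
sequence = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

single = run_stacked([LSTMCell(3, 4, rng)], sequence)
stacked = run_stacked([LSTMCell(3, 4, rng), LSTMCell(4, 4, rng)], sequence)

print(len(single), len(single[0]))
print(len(stacked), len(stacked[0]))
```

In this sketch the single cell and the two-layer stack produce outputs of the same shape, so the only difference is that the second layer sees the first layer's hidden states instead of the raw inputs. That extra layer is what I don't see the benefit of.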