
I was trying to figure out how to estimate the number of parameters in an LSTM layer. What is the relationship between the number of parameters and the number of LSTM cells, the input dimension, and the hidden output-state dimension of the LSTM layer?

If the LSTM input is 512-d (the word-embedding dimension), the output hidden dimension is 256, and there are 256 LSTM units in each of the bidirectional LSTM layers, what are the parameters per cell and in total for the layers?

I came across this link https://stackoverflow.com/questions/38080035/how-to-calculate-the-number-of-parameters-of-an-lstm-network, and it seems to suggest that the hidden output-state dimension equals the number of LSTM cells in the layer. Why is that?

Joe Black
  • Each cell has its own hidden state. – Sycorax May 27 '20 at 18:21
  • @SycoraxsaysReinstateMonica I'm not sure I follow. What does that have to do with the question in the OP? – Joe Black May 27 '20 at 18:26
  • You write "[...] it seems to suggest that `hidden output state dimension = number of lstm cells in the layer`. Why is that?" And I wrote "Each cell has its own hidden state." If each of $n$ cells has its own hidden state, the dimension is $n$ when you collect those states together. – Sycorax May 27 '20 at 18:28
  • @SycoraxsaysReinstateMonica but each cell's hidden state is 256-dimensional, i.e. it has 256 floating-point numbers in it. So if you collected each cell's hidden state, it'd be 256*256, for 256 cells in the layer. It's not `n` but `n*n`. – Joe Black May 27 '20 at 18:33
  • I don't think that's correct. Each cell's hidden state is 1 float. The reason you'd have output dimension 256 is because you have 256 units. Each unit produces 1 output dimension. For example, see https://pytorch.org/docs/stable/nn.html. If we look at the **output**, it has shape `(num_layers * num_directions, batch, hidden_size)`. For 1 layer, 1 direction, and batch size 1, we have `hidden_size` floats in total. – Sycorax May 27 '20 at 18:47
  • I believe this answers your question https://stats.stackexchange.com/questions/226593/how-can-calculate-number-of-weights-in-lstm – Sycorax May 27 '20 at 18:47
  • @SycoraxsaysReinstateMonica "Each cell's hidden state is 1 float." I don't think that's true, and each unit doesn't produce 1 output dimension. All the LSTM literature you see shows the hidden dimension produced by each cell (which in this case is 256-d). Each cell produces a hidden state. Each hidden state is 256-d. – Joe Black Jun 06 '20 at 23:25
  • That’s not what the documentation says. Perhaps you could share a resource that supports your claim. – Sycorax Jun 06 '20 at 23:29
  • @SycoraxsaysReinstateMonica re the pytorch.org/docs/stable/nn.html link: "For 1 layer, 1 direction, and batch size 1, we have hidden_size floats in total" doesn't prove anything, it only illustrates the issue. `"output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t."` The output's total size is `seq_len*num_directions*hidden_size` = `256*2*256`, which doesn't conflict with the question in the OP. – Joe Black Jun 06 '20 at 23:32
  • Another way to state the point of confusion in the OP: why should there be a relationship between the number of cells in the LSTM layer (in this case 256) and the size of the hidden dimension? In other words, one could easily design an LSTM layer with, say, 10,000 cells in the layer and each hidden state only 256-d. What's stopping one from doing that? – Joe Black Jun 06 '20 at 23:35
  • Ok, so your entire question boils down to the fact that you think that sequence length and hidden size are the same thing. They are not. A different input sequence length will yield a different size of the output. This is because each LSTM cell processes one time step at a time, and makes 1 float for each time step. You write "it'd be 256*256, for 256 cells in the layer. It's not n but n*n", but clearly this is solely because both sequence length and number of cells are equal in your particular case, and there's no reason for this to be true in general. – Sycorax Jun 07 '20 at 00:15
  • @SycoraxsaysReinstateMonica No, that's not what I'm saying. There are 3 different factors that I think can be independent of each other: hidden_size (dimension of the hidden output), number_of_units of LSTM cells in the LSTM layer, and third, sequence_length. The issue is: why is `hidden output state dimension = number of lstm cells in the layer`? They should have no relation. You assume `each LSTM cell processes one time step at a time, and make 1 float for each time step.` What's the reason/basis for this? – Joe Black Jun 07 '20 at 02:04
  • Every LSTM cell you see (incl. at the link you posted https://i.stack.imgur.com/MLGXm.png) shows h_t is of size hidden-dim (which in this case is 256). It's not producing 1 float, it's producing hidden-dim floats, i.e. 256 floats. – Joe Black Jun 07 '20 at 02:05
  • Hidden size and the number of units are the same thing: a layer is composed of units, sometimes called "neurons." This is why the pytorch documentation gives the output size as having shape `(num_layers * num_directions, batch, hidden_size)`. It's also why the number of cells is the same as the size of the hidden state's output when you have 1 layer, 1 direction and 1 time-step. // The image uses the symbol $h_t$ to denote the hidden state at time $t$. Nothing in that image describes the size of an input or an output, so I don't understand what bearing the image has on your question. – Sycorax Jun 07 '20 at 03:10
  • The documentation also states that the sequence of LSTM computations are applied at each time step (scroll up), so this is why I state that LSTMs process a sequence 1 step at a time. Alternatively, you can read Schmidhuber's LSTM article. It uses slightly different language, but the material is all there. – Sycorax Jun 07 '20 at 03:16
  • I'm honestly baffled that this is a controversy. The documentation tells you exactly how large the hidden output is. If you don't believe it, that's a personal choice, but it doesn't bear any relation to what an LSTM is or the shapes of its outputs. – Sycorax Jun 07 '20 at 03:49
  • Which class/output are you looking at in "if we look at the output, it has shape (num_layers * num_directions, batch, hidden_size)", as there are several outputs listed at the link? Going back a bit to my main issue: why are "hidden size and the number of units the same thing"? I think one could design an LSTM so they're not the same, so maybe the issue is getting misunderstood. (See the shape-check sketch after this comment thread.) – Joe Black Jun 07 '20 at 04:39
  • My answer explains the matrix arithmetic involved. You’d have to change how an LSTM works to make a unit vs size distinction meaningful. – Sycorax Jun 07 '20 at 04:59
  • Similar questions https://stackoverflow.com/questions/58233567/lstm-layer-output-size-vs-hidden-state-size-in-keras and https://ai.stackexchange.com/questions/15621/what-is-the-relationship-between-the-size-of-the-hidden-layer-and-the-size-of-th – toliveira Nov 12 '20 at 21:04
  • @JoeBlack, have you found an answer? – toliveira Nov 12 '20 at 21:05
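
To make the disagreement in this thread concrete, here is a minimal shape check (a sketch assuming PyTorch's `nn.LSTM`; the sequence length of 10 and batch size of 1 are arbitrary illustration values):

```python
import torch
import torch.nn as nn

# 512-d inputs, 256-d hidden state, one unidirectional layer
lstm = nn.LSTM(input_size=512, hidden_size=256, num_layers=1)

x = torch.randn(10, 1, 512)      # (seq_len=10, batch=1, input_size=512)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([10, 1, 256]) -- one 256-d hidden state per time step
print(h_n.shape)     # torch.Size([1, 1, 256])  -- the hidden state at the last time step
```

Changing the sequence length changes only the first dimension of `output`; changing `hidden_size` changes the trailing dimension of both tensors, independently of the sequence length.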

2 Answers


> I came across this link https://stackoverflow.com/questions/38080035/how-to-calculate-the-number-of-parameters-of-an-lstm-network, and it seems to suggest that the hidden output-state dimension equals the number of LSTM cells in the layer. Why is that?

Each cell's hidden state is 1 float. As an example, the reason you'd have output dimension 256 is because you have 256 units. Each unit produces 1 output dimension.

For example, see this documentation page for Pytorch https://www.pytorch.org/docs/stable/nn.html. If we look at the output entry for an LSTM, the hidden state has shape (num_layers * num_directions, batch, hidden_size). So for a model with 1 layer, 1 direction (i.e. not bidirectional), and batch size 1, we have hidden_size floats in total.

You can also see this if you keep track of the dimensions used in the LSTM computation. At each timestep (element of the input sequence), the layer of an LSTM carries out these operations, which are just compositions of matrix-vector products and activation functions.
$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
We're focused on the hidden state, $h_t$, so look at the operations involving $h_{t-1}$, because this is the hidden state at the previous time step. The hidden-to-hidden connections must have size hidden_size by hidden_size because they're matrices that must be conformable in a matrix-vector product where the vector has size hidden_size. The input-to-hidden connections must have size hidden_size by input_size because this is a matrix-vector product where the vector has size input_size.

Importantly, your distinction between hidden size and number of units never makes an appearance. If hidden size and number of units were different, then this matrix-vector arithmetic would, somewhere, fail to be conformable because the dimensions wouldn't be compatible.
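
To illustrate the conformability argument with the numbers from the question (512-d input, 256-d hidden state), here is a rough dimension check for a single gate, with random matrices standing in for the learned weights (a sketch, not the library's internals):

```python
import torch

input_size, hidden_size = 512, 256            # values from the question

x_t = torch.randn(input_size)                 # current input, 512 floats
h_prev = torch.randn(hidden_size)             # previous hidden state, 256 floats

W_ii = torch.randn(hidden_size, input_size)   # input-to-hidden weights: 256 x 512
W_hi = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights: 256 x 256
b_ii = torch.randn(hidden_size)
b_hi = torch.randn(hidden_size)

# the input gate from the equations above
i_t = torch.sigmoid(W_ii @ x_t + b_ii + W_hi @ h_prev + b_hi)
print(i_t.shape)   # torch.Size([256]) -- the gate (and hence h_t) has hidden_size floats
```

If the "number of units" were anything other than `hidden_size`, one of these matrix-vector products would fail to be conformable.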

As for counting the number of parameters in an LSTM model, see How can calculate number of weights in LSTM (https://stats.stackexchange.com/questions/226593/how-can-calculate-number-of-weights-in-lstm).
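
For the configuration in the original question (512-d input, 256-d hidden state, one bidirectional layer), the count can be checked against the per-gate matrix sizes. This sketch assumes PyTorch's `nn.LSTM`, which stores two bias vectors per gate:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=256, num_layers=1, bidirectional=True)

# Per direction: 4 gates, each with a 256x512 input matrix, a 256x256 hidden matrix,
# and two 256-d biases.
per_direction = 4 * (256 * 512 + 256 * 256 + 2 * 256)   # 788,480
total = sum(p.numel() for p in lstm.parameters())

print(2 * per_direction)   # 1576960
print(total)               # 1576960 -- matches
```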

I believe the confusion arises because OP has confused the hidden output state, which is an output of the model, with the weights of the hidden state. I think this is the case because you insist that the hidden state has shape (n,n). It doesn't, but the hidden weights are square matrices. LSTM cells have memory, which is returned as a part of the output. This is used together with the model weights and biases to yield the prediction for the next time step. The difference between the hidden state output and the hidden weights is that the model weights are the same for all time steps, while the hidden state can vary. This "memory" component is where LSTMs get their name.
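
To see that distinction concretely, the hidden weights are fixed matrices stored on the module (four square 256x256 blocks stacked together in PyTorch's layout), while the hidden state is an activation returned by the forward pass (again a PyTorch-based sketch):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=256)

print(lstm.weight_hh_l0.shape)   # torch.Size([1024, 256]) -- 4 stacked 256x256 hidden weight matrices
print(lstm.weight_ih_l0.shape)   # torch.Size([1024, 512]) -- 4 stacked 256x512 input weight matrices

_, (h_n, _) = lstm(torch.randn(5, 1, 512))   # any sequence length, here 5
print(h_n.shape)                 # torch.Size([1, 1, 256]) -- the hidden *state*, not a weight
```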

Sycorax
  • I posted this in a comment on the OP. I don't think there is any reason for this assumption of 1 float produced by an LSTM cell: "Each cell's hidden state is 1 float. As an example, the reason you'd have output dimension 256 is because you have 256 units. Each unit produces 1 output dimension." i.stack.imgur.com/MLGXm.png shows h_t is of size hidden-dim (which is 256), not 1 float. – Joe Black Jun 07 '20 at 02:07
  • I don't think I'm actually conflating them, and I think we are talking past each other. Looking at the LSTM pytorch link, "h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len" — isn't h_n here the tensor _only_ for the last hidden state in the sequence, _not_ for all hidden states? – Joe Black Jun 07 '20 at 05:21
  • @JoeBlack yes, it's the last time step. But that seems immaterial since we agree sequence length and hidden size are different. Are you sure you're asking about the size of the hidden **output** and not asking about the number of parameters in the weights matrices? – Sycorax Jun 07 '20 at 11:50

Regarding the question "why is the dimension of the hidden state related to the number of cells in an LSTM layer?": as I understand it, a layer of 4 cells would be represented as in the picture I attached.

It is clear from the picture that the state H has dimension 4, which is directly related to the number of cells (hidden states) of the layer. I hope that clarifies the original question; please correct me if I'm wrong.

[Image: LSTM layer with 4 cells]
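
For what it's worth, this is easy to reproduce with a toy layer (a sketch assuming PyTorch's `nn.LSTM`; the 3-d input and sequence length of 6 are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4)   # a layer of 4 units/cells

_, (h_n, _) = lstm(torch.randn(6, 1, 3))      # sequence of length 6, batch of 1
print(h_n.shape)                              # torch.Size([1, 1, 4]) -- the state H has dimension 4
```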