I'm looking to clarify a couple of rules of thumb for exploring neural network architectures, specifically for choosing the number of hidden units in a bidirectional LSTM. Disclaimer: I know that at the end of the day you just have to experiment and test; I just want to do that sensibly.
Consider a natural language sequence labelling task, where each token receives a label.
- Samples (input sequences) = 13,000
- Total labels to be predicted = 39,249
- Inputs = fixed-length sequences of token embeddings from Word2Vec, ELMo or BERT.
- Input shape = (batch, length, n), where length = 37 and n ∈ {300, 1024, 3072}, depending on which kind of embedding is used.
- Output = 72 entity-type labels
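For concreteness, here is a minimal sketch of the kind of model I mean (Keras-style; `units` is the hyperparameter this whole question is about, and the dimensions are the ones listed above):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

length, n, n_labels = 37, 300, 72  # W2V case; n = 1024 for ELMo, 3072 for BERT
units = 128                        # the hyperparameter in question

model = models.Sequential([
    # Inputs are pre-computed embeddings of shape (batch, length, n),
    # so there is no trainable Embedding layer here
    layers.Bidirectional(layers.LSTM(units, return_sequences=True),
                         input_shape=(length, n)),
    # A softmax over the 72 entity-type labels at every timestep
    layers.TimeDistributed(layers.Dense(n_labels, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```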
How would you apply the formula in this answer to this task? If you use the embedding dimension as Ni, the maths doesn't seem to work out: e.g. α = 2, Ni = 300, No = 72 gives an upper limit of only ca. 17 units for the W2V model. Using Ni = length = 37 instead ignores the difference in the amount of information carried by the respective embeddings, which intuitively seems like an important thing to consider. Furthermore, it violates another rule of thumb on the same page, that "the optimal size of the hidden layer is usually between the size of the input and size of the output layers". It makes me wonder: is this heuristic even appropriate for models which draw on the learned knowledge (via embeddings) of other, much bigger models?
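In case it helps, this is how I'm computing it. As I read the linked answer, the formula is Nh = Ns / (α · (Ni + No)), with Ns the number of training samples and α somewhere between 2 and 10. A quick sketch, assuming Ni is the embedding width:

```python
def max_hidden_units(n_samples, n_in, n_out, alpha=2):
    # Nh = Ns / (alpha * (Ni + No)), as I read the linked answer
    return n_samples / (alpha * (n_in + n_out))

for name, n_in in [("W2V", 300), ("ELMo", 1024), ("BERT", 3072)]:
    print(name, round(max_hidden_units(13_000, n_in, 72), 1))
# W2V 17.5, ELMo 5.9, BERT 2.1 -- vanishingly small for a sequence labeller
```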
In yet another answer on the same thread, the second and third rules of thumb make more sense and agree with each other: e.g. the maximum number of hidden units would be 274 and 1,608 for the W2V and ELMo models respectively, and both figures are less than double the respective input layer sizes. But, as one commenter asked, shouldn't we also consider the size of the training data?
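Here is my literal reading of those two rules; depending on what exactly counts as the input size, the figures come out slightly differently from the ones I quoted above, but the point stands either way:

```python
def layer_size_bound(n_in, n_out):
    # Rule 2: 2/3 of the input layer size plus the output layer size,
    # rule 3: strictly less than twice the input layer size
    return min(2 * n_in / 3 + n_out, 2 * n_in - 1)

for name, n_in in [("W2V", 300), ("ELMo", 1024)]:
    print(name, round(layer_size_bound(n_in, 72)))
# Compare with the ~17 and ~6 the data-size formula above gives:
# the two heuristics disagree by an order of magnitude
```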
I've already checked this and this and this and this. A lot of these links repeat the same information, but none of them mentions one thing I thought was also a heuristic: should the number of units be a multiple of the float size?
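To be explicit about what I mean by that last heuristic, something like this hypothetical helper (assuming "float size" means the 32 in float32):

```python
def snap_units(units, base=32):
    # Hypothetical: round a candidate unit count up to the nearest
    # multiple of `base` (32 here, matching float32's bit width)
    return max(base, -(-units // base) * base)  # ceiling division

print(snap_units(17))   # -> 32
print(snap_units(274))  # -> 288
```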