I'm new to deep learning and I'm trying to code a Visual Question Answering network.
I have studied and (I think) understood how RNNs and LSTMs work.
From what I've understood, I need to train my network on sequences of inputs, which in my case are question strings.
So what I should do is set a maximum sequence length (based on the longest question) and encode each word of the sentence as a one-hot vector; my network will then "embed" the text with a proper embedding layer to improve performance, etc.
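To make the pipeline concrete, here is a minimal sketch of what I have in mind, assuming Keras (the sample questions and layer sizes are just placeholders, not my actual setup):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.models import Sequential

# placeholder questions standing in for my real dataset
questions = ["what color is the car", "how many people are in the photo"]

tokenizer = Tokenizer()                    # builds the word -> integer index mapping
tokenizer.fit_on_texts(questions)
sequences = tokenizer.texts_to_sequences(questions)

max_len = max(len(s) for s in sequences)   # max sequence length from the longest question
padded = pad_sequences(sequences, maxlen=max_len)

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

model = Sequential([
    # the Embedding layer takes the integer indices directly; looking up row i
    # of its weight matrix is equivalent to multiplying a one-hot vector by it
    Embedding(input_dim=vocab_size, output_dim=128),
    LSTM(256),
])
```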
The main problem is: can I train my model without knowing the actual size of my input vocabulary?
I have a pretty big dataset, about 250,000 questions of variable length, so (based on what I've learned so far) I should compute all the unique words in my dataset in order to one-hot encode them properly.
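This is roughly what I mean by computing the vocabulary "by hand" (a sketch in plain Python; `questions` is a placeholder for my list of raw question strings):

```python
from collections import Counter

questions = ["what color is the car", "how many people are in the photo"]

counter = Counter()
for q in questions:
    counter.update(q.lower().split())  # crude whitespace tokenization

# assign one integer index per unique word; index 0 reserved for padding
word_to_index = {word: i + 1 for i, (word, _) in enumerate(counter.most_common())}
vocab_size = len(word_to_index) + 1
```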
Is my reasoning right? Are there any better options than calculating the size of my vocabulary "by hand"?