I'm new to deep learning and I'm trying to code a Visual Question Answering network.
I have studied and (I think) understood how RNNs and LSTMs work.
From what I've understood, I need to train my network on sequences of inputs, which in my case are question strings.
So what I should do is set a maximum sequence length (based on the longest question) and encode each word of the sentence as a one-hot vector; my network will then "embed" the text with a proper embedding layer to improve performance, etc.
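To make the pipeline concrete, here is a minimal sketch of what I have in mind, assuming Keras (the sample questions and layer sizes are just placeholders, not my actual setup):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.models import Sequential

# placeholder questions standing in for my real dataset
questions = ["what color is the car", "how many people are in the photo"]

tokenizer = Tokenizer()                    # builds the word -> integer index mapping
tokenizer.fit_on_texts(questions)
sequences = tokenizer.texts_to_sequences(questions)

max_len = max(len(s) for s in sequences)   # max sequence length from the longest question
padded = pad_sequences(sequences, maxlen=max_len)

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

model = Sequential([
    # the Embedding layer takes the integer indices directly; looking up row i
    # of its weight matrix is equivalent to multiplying a one-hot vector by it
    Embedding(input_dim=vocab_size, output_dim=128),
    LSTM(256),
])
```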
The main problem is: can I train my model without knowing the actual size of my input vocabulary?
I have a pretty big dataset, about 250,000 questions of variable length, so (based on what I've learned so far) I should compute all the unique words in my dataset in order to one-hot encode them properly.
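This is roughly what I mean by computing the vocabulary "by hand" (a sketch in plain Python; `questions` is a placeholder for my list of raw question strings):

```python
from collections import Counter

questions = ["what color is the car", "how many people are in the photo"]

counter = Counter()
for q in questions:
    counter.update(q.lower().split())  # crude whitespace tokenization

# assign one integer index per unique word; index 0 reserved for padding
word_to_index = {word: i + 1 for i, (word, _) in enumerate(counter.most_common())}
vocab_size = len(word_to_index) + 1
```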
Is my reasoning right? Are there any better options than calculating the size of my vocabulary "by hand"?