
I have trained a translation seq2seq model. In my model, I kept the vocabulary size at 100,000. This constraint prevents my model from generating any word that is not among those 100,000.

So how do Google Translate or Bing Translate work for any word in their input?

Basically, my question is how to make my model work with an infinite vocabulary.


2 Answers


> Basically, my question is how to make my model work with an infinite vocabulary.

It would be unwise to try (how would you optimize a model over such data?).

But you don't have to. Basically you're asking how to deal with unknown words.

One answer is to use a different representation for words: instead of representing them as one-hot vectors over a fixed vocabulary, you can use subword features such as characters or character n-grams. You can find papers under this terminology; these are also called character-level features.
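For concreteness, here is a minimal sketch of fastText-style character n-gram features (my own illustration; `char_ngrams` and the n-gram range are arbitrary choices, not from any particular library). The point is that a word outside the fixed vocabulary still decomposes into n-grams seen during training:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# An out-of-vocabulary word is no longer a single unknown token:
# it shares n-grams with in-vocabulary words.
print(char_ngrams("stackexchange"))
# ['<st', 'sta', 'tac', ..., 'ange>']
```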

For intuition, you can look to linguistic knowledge: most words aren't completely unrelated to other words; they are built from more basic parts, or morphemes.

Jakub Bartczuk
  • Are you suggesting breaking words into `chars` or similar smaller subunits? For example: `"stackexchange" --> "stack" + "ex" + "change"`? In that case, which segmentation should I use? – pseudo_teetotaler Jan 23 '18 at 07:02

> Basically, my question is how to make my model work with an infinite vocabulary.

In addition to the technique that Jakub Bartczuk mentions in his answer to handle unknown words (subword features), a seq2seq model may use a copy mechanism, which copies a token from the input sequence directly into the output sequence.

Example of seq2seq models using the copy mechanism: {1}
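To make the idea concrete, here is a toy numerical sketch of pointer-generator-style copying (my own illustration, not code from {1}; the numbers and the gate value `p_gen` are made up). A gate interpolates between a "generate" distribution over the fixed vocabulary and a "copy" distribution over the source positions, so an out-of-vocabulary source word can still be emitted:

```python
import numpy as np

vocab = ["<unk>", "the", "cat", "sat"]      # tiny fixed output vocabulary
source_tokens = ["Zorblatt", "sat"]         # input sentence; "Zorblatt" is OOV

p_gen = 0.3                                 # gate: probability of generating
gen_dist = np.array([0.1, 0.2, 0.3, 0.4])   # softmax over the vocabulary
copy_dist = np.array([0.9, 0.1])            # attention over source positions

# Mix the two distributions into one over vocabulary + source tokens.
combined = {}
for w, p in zip(vocab, gen_dist):
    combined[w] = combined.get(w, 0.0) + p_gen * p
for w, p in zip(source_tokens, copy_dist):
    combined[w] = combined.get(w, 0.0) + (1 - p_gen) * p

print(max(combined, key=combined.get))  # "Zorblatt", copied despite being OOV
```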


References:

Franck Dernoncourt