
I have trained a translation seq2seq model. In my model, I kept the vocabulary size at 100,000. This constraint prevents my model from generating any word that is not among those 100,000.

So how do Google Translate or Bing Translate work for any word in their input?

Basically, my question is how to make my model work with an infinite vocabulary.


2 Answers


> Basically, my question is how to make my model work with an infinite vocabulary.

It would be unwise to try (how would you optimize a model over such data?).

But you don't have to. Basically you're asking how to deal with unknown words.

One answer is to use a different representation for words: instead of representing them as one-hot vectors over a fixed vocabulary, you can use subword features such as characters or character n-grams. You can find papers under this terminology; these are also called character-level features.
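For concreteness, here is a minimal sketch of fastText-style character n-gram features (my own illustration; `char_ngrams` and the n-gram range are arbitrary choices, not from any particular library). The point is that a word outside the fixed vocabulary still decomposes into n-grams seen during training:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# An out-of-vocabulary word is no longer a single unknown token:
# it shares n-grams with in-vocabulary words.
print(char_ngrams("stackexchange"))
# ['<st', 'sta', 'tac', ..., 'ange>']
```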

For intuition, you can look to linguistic knowledge: most words aren't completely unrelated to other words; they are built from more basic parts, or morphemes.

Jakub Bartczuk
  • Are you suggesting breaking words into `chars` or similar smaller subunits? For example: `"stackexchange" --> "stack" + "ex" + "change"`? In that case, which segmentation should I use? – pseudo_teetotaler Jan 23 '18 at 07:02

> Basically, my question is how to make my model work with an infinite vocabulary.

In addition to the technique that Jakub Bartczuk mentions in his answer to handle unknown words (subword features), a seq2seq model may use a copy mechanism, which copies a token from the input sequence directly into the output sequence.

Example of seq2seq models using the copy mechanism: {1}
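To make the idea concrete, here is a toy numerical sketch of pointer-generator-style copying (my own illustration, not code from {1}; the numbers and the gate value `p_gen` are made up). A gate interpolates between a "generate" distribution over the fixed vocabulary and a "copy" distribution over the source positions, so an out-of-vocabulary source word can still be emitted:

```python
import numpy as np

vocab = ["<unk>", "the", "cat", "sat"]      # tiny fixed output vocabulary
source_tokens = ["Zorblatt", "sat"]         # input sentence; "Zorblatt" is OOV

p_gen = 0.3                                 # gate: probability of generating
gen_dist = np.array([0.1, 0.2, 0.3, 0.4])   # softmax over the vocabulary
copy_dist = np.array([0.9, 0.1])            # attention over source positions

# Mix the two distributions into one over vocabulary + source tokens.
combined = {}
for w, p in zip(vocab, gen_dist):
    combined[w] = combined.get(w, 0.0) + p_gen * p
for w, p in zip(source_tokens, copy_dist):
    combined[w] = combined.get(w, 0.0) + (1 - p_gen) * p

print(max(combined, key=combined.get))  # "Zorblatt", copied despite being OOV
```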


References:

Franck Dernoncourt