I'm currently working on a Seq2Seq model for a chatbot, and I convert every sentence to numerical vectors using pre-trained word embeddings (GloVe).
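For context, the conversion is essentially an embedding lookup. Here is a simplified sketch of that step (not my exact code; the tokenization here is just whitespace splitting, and unknown words get a zero vector for illustration):

```python
import numpy as np

EMBEDDING_DIM = 200  # matches glove.6B.200d.txt

def load_glove(path):
    """Load GloVe vectors into a dict: word -> 200-d numpy array."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def sentence_to_vectors(sentence, embeddings):
    """Convert a sentence to a list of 200-d vectors; zeros for unknown words."""
    tokens = sentence.lower().split()
    return [embeddings.get(tok, np.zeros(EMBEDDING_DIM, dtype="float32"))
            for tok in tokens]

glove = load_glove("glove.6B.200d.txt")
vectors = sentence_to_vectors("hello , how are you ?", glove)
```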
My problem is that training doesn't progress: with mean squared error as the loss function, the loss starts at around 0.0055 and is still essentially the same at the end of training (e.g. 0.0054).
I was suspicious of the vocabulary used in the dataset, so I checked the first 20,000 sentences for non-conventional words (names, jumbled words like "whaaat", and sound effects like "mmmmmm"). It turns out that around 1,960 of the 14,320 unique words in those sentences are not in the GloVe dictionary (the check I ran is sketched below).
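This is roughly the coverage check I ran (a sketch, reusing `load_glove` from the snippet above; `sentences` stands for the first 20,000 sentences from the corpus, and the real tokenization may differ):

```python
# Count how many unique words in the first 20000 sentences are missing from GloVe.
glove = load_glove("glove.6B.200d.txt")

unique_words = set()
for sentence in sentences[:20000]:
    unique_words.update(sentence.lower().split())

oov = sorted(w for w in unique_words if w not in glove)
print(f"{len(oov)} of {len(unique_words)} unique words are not in GloVe")
# For my data this printed roughly 1960 out of 14320.
```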
Does this ratio (1960:14320) have a significant effect on training, e.g. could it explain the model's inability to learn in my situation?
Also, how do I compensate for such a large amount of out-of-vocabulary words?
Here are some details on the dataset and word-embedding vocabulary I'm using:
- Dataset: Cornell Movie Dialogs Corpus
- Word embedding: glove.6B.200d.txt in the zip file downloaded from this link