I have a model that outputs short sentences and want to compare the quality of its outputs for different configurations by computing their perplexities using another model.
I tried to use the GPT-2 model from https://github.com/huggingface/pytorch-transformers, but I get perplexities of over 1,000, so I am not sure whether these results are meaningful.
I noticed that when I feed in short sentences from Wikitext-2, the perplexities are also very high. It seems that the language model has a hard time predicting the next word when only a small context is available.
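For context on why short inputs score so badly: perplexity is the exponential of the mean negative log-likelihood per token, so with only a few tokens a single hard-to-predict word (e.g. the first content word, seen with almost no context) dominates the average. A minimal sketch of the computation, with made-up per-token log-probabilities purely for illustration:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical natural-log probabilities; -9.0 marks one surprising token.
short = [-2.0, -9.0, -2.5]                                  # 3-token sentence
longer = [-2.0, -9.0, -2.5, -1.5, -1.0, -2.0, -1.8, -1.2]   # 8-token sentence

print(perplexity(short))   # the surprising token dominates the short average
print(perplexity(longer))  # the same token is diluted over more tokens
```

Here the short sentence gets a much higher perplexity even though both contain the same surprising token, which matches the pattern you describe on Wikitext-2 sentences.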
My questions are:
Should I rely on the perplexities from GPT-2, or does anyone know of a language model that works better on short sequences?
Does it make sense to take only outputs of a fixed length into account (say, sentences of 20 words) in order to remove the length bias in the comparison?
I would be happy about any suggestions :)