My question is: how can I compare language model (LM) scores for two sentences of different lengths?

Probabilities are less than 1, and an LM's score for a sentence is the product of its bigram or trigram probabilities (depending on whether it is a bigram or trigram model), so longer sentences will almost always get smaller scores.
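To make the effect concrete, here is a minimal sketch. The probability values are made up for illustration and do not come from any real model; `sentence_log_prob` is a hypothetical helper that works in log space to avoid underflow.

```python
import math

def sentence_log_prob(ngram_probs):
    """Log of the product of the n-gram probabilities (sum of their logs)."""
    return sum(math.log(p) for p in ngram_probs)

# Made-up conditional probabilities, one per word (illustrative only).
short_sentence = [0.2, 0.1, 0.3]
long_sentence = [0.2, 0.1, 0.3, 0.25, 0.15, 0.2]

short_lp = sentence_log_prob(short_sentence)
long_lp = sentence_log_prob(long_sentence)

# Every extra factor is < 1, so the longer sentence scores lower
# even though its first three words are identical.
print(short_lp, long_lp)
```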

So how should I normalize the scores by sentence length?

I am pretty sure almost everyone who has read about LMs has had the same doubt, but I couldn't find much about it on the internet.

I would appreciate any leads on this.

1 Answer

As you noticed, it's a good idea to use some kind of averaging. Since an LM multiplies probabilities together, the geometric mean is a natural fit.

From Speech and Language Processing:

In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words.

$PP(w_1, \ldots, w_N) = \sqrt[N]{\dfrac{1}{P(w_1, \ldots, w_N)}}$
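As a rough sketch of how this formula could be computed, the hypothetical `perplexity` function below takes one conditional probability per word (an assumption about how the model's output is laid out) and averages in log space, which is equivalent to taking the N-th root of the inverse product. The input values are made up.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sentence given one conditional probability per word.

    Equivalent to (1 / product(word_probs)) ** (1 / N), computed in log
    space so that long sentences do not underflow.
    """
    n = len(word_probs)
    if n == 0:
        raise ValueError("need at least one word probability")
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)

# Made-up probabilities: same per-word quality, different lengths.
pp_short = perplexity([0.1] * 3)
pp_long = perplexity([0.1] * 6)

print(pp_short, pp_long)  # both approximately 10
```

Because the score is normalized per word, the two sentences above get the same perplexity despite their different lengths. Lower perplexity means the model finds the sentence more likely, so this value (or equivalently the average log-probability per word) can be compared directly across sentences.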

Vil
Jakub Bartczuk