
This question is about smoothed n-gram language models. When we use additive smoothing on the training set to determine the conditional probabilities, and then calculate the perplexity of the training data, how exactly does this carry over to the test set?

Which of these two things do we do?

  1. Apply the conditional probabilities calculated with smoothing on the training set to the n-grams we see in the test set, and then calculate the perplexity of the test set separately?
  2. Apply smoothing to the test set as well? If that's the case, what's the point of having separate training and test sets?
Janani K

1 Answer


Even though you asked about smoothed n-gram models, your question is more general. You want to know how the computations done in a model on a training set relate to computations on the test set.


Training set computations.
You should learn the parameters of your (n-gram) model using the training set only. In your case, the parameters are the conditional probabilities. For instance, you may find that $p(\text{cat})=\frac{7+\lambda}{1000+\lambda V}$ if your vocabulary size is $V$. These numbers are the ones you’d use to compute perplexity on the training set.
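To make the training step concrete, here is a minimal sketch of add-$\lambda$ (additive) smoothing for a unigram model, mirroring the $\frac{7+\lambda}{1000+\lambda V}$ example above. The names `train_additive_unigram`, `lam`, and `prob` are illustrative, not from any particular library; the same idea extends to conditional n-gram probabilities by smoothing the counts within each context.

```python
from collections import Counter

def train_additive_unigram(train_tokens, lam=1.0, vocab=None):
    """Learn add-lambda smoothed unigram probabilities from training tokens only."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    if vocab is None:
        vocab = set(train_tokens)
    V = len(vocab)

    def prob(w):
        # p(w) = (count(w) + lam) / (total + lam * V), as in (7 + λ) / (1000 + λV)
        return (counts.get(w, 0) + lam) / (total + lam * V)

    return prob, vocab
```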

Test set computations.
When you compute the perplexity of the model on the test set, you reuse the same learned parameters from before. You don’t recompute $p(\text{cat})$. You still use $\frac{7+\lambda}{1000+\lambda V}$, regardless of how often “cat” appears in the test data. (One notable problem to beware of: if a word is not in your vocabulary but shows up in the test set, even the smoothed probability will be 0. To fix this, it’s a common practice to “UNK your data”, which you can look up separately.)
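Continuing the sketch above, the test-set perplexity reuses the `prob` function learned from the training data; nothing is re-estimated from test counts. The `perplexity` function and the `<unk>` fallback are illustrative assumptions (they presume an `<unk>` token was included in the training vocabulary), not a fixed recipe.

```python
import math

def perplexity(prob, vocab, test_tokens, unk="<unk>"):
    """Perplexity of held-out tokens under probabilities learned on the training set."""
    log_prob = 0.0
    for w in test_tokens:
        if w not in vocab:
            w = unk  # "UNK your data": out-of-vocabulary words map to <unk>
        log_prob += math.log(prob(w))
    return math.exp(-log_prob / len(test_tokens))

# Toy usage, reusing train_additive_unigram from the sketch above:
train = "the cat sat on the mat".split() + ["<unk>"]  # explicit <unk> in the training data
prob, vocab = train_additive_unigram(train, lam=1.0)
print(perplexity(prob, vocab, "the dog sat".split()))  # "dog" falls back to <unk>
```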


The point.
The point of this is to see how well your model generalizes. The test data is a surrogate for the real-world data you'll see when deploying your model. You ignore it when fitting the model. You then compute perplexity on the test data as an estimate of how you'd do on that real-world data.

Arya McCarthy