
Suppose we have a corpus

$X = x_1...x_N$

in which every word can be represented as a sequence of subwords (drawn from a fixed-size subword vocabulary)

$x_i = x_{i,1}...x_{i,M(x_i)}$

where $M(x_i)$ is the number of subwords into which the word is divided.

For a word-level language model we would calculate perplexity using the formula:

$\exp\left(\frac{1}{N}\sum_{i=1}^N \log \frac{1}{q(x_i)}\right)$

where $q(x_i)$ is the probability the language model assigns to the word.
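For concreteness, here is a minimal sketch of that formula in Python (the probabilities are made up, standing in for whatever a word-level model would assign):

```python
import math

# Made-up per-word probabilities q(x_i) from a hypothetical word-level LM.
q = [0.1, 0.02, 0.3, 0.05]  # one probability per word, so N = 4

N = len(q)
# exp( (1/N) * sum_i log(1 / q(x_i)) )
word_ppl = math.exp(sum(math.log(1.0 / p) for p in q) / N)
print(word_ppl)
```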

My questions are:

  1. Is it valid to calculate word perplexity on a subword language model, where $q'(x_i)$ would equal $\prod_{j=1}^{M(x_i)}r(x_{i,j})$ and $r(x_{i,j})$ is the probability from the subword language model? The whole formula would look like this (see the sketch after this list):

$\exp\left(\frac{1}{N}\sum_{i=1}^N \log \frac{1}{\prod_{j=1}^{M(x_i)}r(x_{i,j})}\right) = \exp\left(\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^{M(x_i)}\log \frac{1}{r(x_{i,j})}\right)$

  2. If not, is there another way to calculate word perplexity on a subword model, and can two language models with different vocabularies even be compared?

  3. The probability is actually a conditional probability $q(x_i|x_{0...i-1})$; does that change anything here?
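Here is the sketch referenced in question 1, again with made-up numbers standing in for a real subword LM:

```python
import math

# Made-up subword probabilities r(x_{i,j}), grouped by word, as a
# hypothetical subword LM might assign them.
r = [
    [0.4, 0.5],       # word 1 split into two subwords
    [0.2],            # word 2 kept whole
    [0.3, 0.6, 0.7],  # word 3 split into three subwords
]

N = len(r)  # number of *words*, not subwords
# q'(x_i) = prod_j r(x_{i,j}); summing logs over all subwords of all
# words is equivalent, and the normalizer stays the word count N.
total = sum(math.log(1.0 / p) for word in r for p in word)
word_ppl = math.exp(total / N)
print(word_ppl)
```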


1 Answer


No, you don't need to multiply the probabilities of the subwords of a word. All you need to do is treat each subword as a word in its own right, and then the formula is fine if you change the $x_i$ in your first formula to $x_i\in \{generated\_subwords\}$ (so the normalizer becomes the number of subwords rather than the number of words).
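As a sketch of what that change amounts to (the same made-up probabilities as in the question, just flattened into a token stream):

```python
import math

# The question's made-up subword probabilities, flattened: each
# subword is now a token in its own right.
r = [0.4, 0.5, 0.2, 0.3, 0.6, 0.7]

T = len(r)  # normalizer is the number of subword tokens, not words
subword_ppl = math.exp(sum(math.log(1.0 / p) for p in r) / T)
print(subword_ppl)
```

Note that the resulting number is a subword-level perplexity, so it is only directly comparable with other models that use the same subword vocabulary.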

References:

  1. https://twitter.com/lmthang/status/1222398272427347968
  2. https://arxiv.org/pdf/2001.09977.pdf