Accuracy of PMI (Pointwise Mutual Information) calculation from co-occurrence matrix

Question

Background

When calculating PMI or PPMI from a co-occurrence matrix (COM), the formula sums each row (co-occurrences) of the COM, e.g. 2 for pineapple in the formula in the snapshot. This question is about word co-occurrences in a corpus text sequence.

Question

Can the PMI formula compute the correct PMI from a COM? I think the following part of the formula does not correctly reflect the occurrences of the word in row i:

$\sum_{j=1}^Cf_{ij}$
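
For reference, I am assuming the formula in the snapshot is the usual count-based definition (the snapshot is not reproduced here, so the symbols $W$, $p_{ij}$, $p_{i*}$ and $p_{*j}$ are my own notation), where $f_{ij}$ is the count in row $i$ (word) and column $j$ (context) of a COM with $W$ rows and $C$ columns:

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$\mathrm{PMI}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}, \qquad \mathrm{PPMI}_{ij} = \max(\mathrm{PMI}_{ij},\ 0)$$

The row sum $\sum_{j=1}^Cf_{ij}$ is the numerator of $p_{i*}$, which is the term this question is about.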

Example

Create a COM from a text sequence using an N-gram window (N=5), i.e. two context words on each side of the target word.

Each word between or and Q has four co-occurrence words (COW), e.g. (to, be, is, the) for the word that. However, the word matters at the end has only two COW (Q, that).

Therefore, $\sum_{j=1}^Cf_{ij}$ for each word is:

6 for to (i = 1) --> occurrence count cannot be obtained by dividing by COW
7 for be (i = 2) --> occurrence count cannot be obtained by dividing by COW
4 for or (i = 3) --> or occurred (4 / COW) = 1 time
4 for not (i = 4) --> not occurred (4 / COW) = 1 time
7 for that (i = 5) --> occurrence count cannot be obtained by dividing by COW
4 for is (i = 6) --> is occurred (4 / COW) = 1 time
4 for the (i = 7) --> the occurred (4 / COW) = 1 time
4 for Q (i = 8) --> Q occurred (4 / COW) = 1 time
2 for matters (i = 9) --> occurrence count cannot be obtained by dividing by COW

For words at the ends of the corpus, $\sum_{j=1}^Cf_{ij}$ does not represent how many times the word in row i occurred.
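
The row sums above can be reproduced with a small sketch. The corpus string is an assumption on my part: the text sequence itself is not quoted in the question, and I picked a sequence that yields exactly the row sums listed. The names `cooccurrence`, `WINDOW` and `row_sums` are hypothetical as well.

```python
from collections import Counter, defaultdict

# Assumed corpus (not quoted in the question); it reproduces the listed row sums.
corpus = "to be or not to be that is the Q that matters".split()
WINDOW = 2  # N-gram with N=5: two context words on each side of the target


def cooccurrence(tokens, window):
    """Return {word: Counter(context word -> count)} for a symmetric window."""
    com = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)                 # window is truncated at the left end
        hi = min(len(tokens), i + window + 1)   # and at the right end
        for j in range(lo, hi):
            if j != i:
                com[word][tokens[j]] += 1
    return com


com = cooccurrence(corpus, WINDOW)
row_sums = {w: sum(ctx.values()) for w, ctx in com.items()}
counts = Counter(corpus)

for w in row_sums:
    # row sum, true number of occurrences, and row sum divided by COW (= 4)
    print(w, row_sums[w], counts[w], row_sums[w] / (2 * WINDOW))
```

Under this assumption, dividing the row sum by COW recovers the true count only for words whose every occurrence has a full window.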

Workaround

By padding both ends of the corpus with NIL, the number of times a word occurred can be calculated as $\sum_{j=1}^Cf_{ij} / COW$ for each row i. However, a COM does not include counts for such a dummy NIL word.
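
A minimal sketch of the padding idea, again assuming the same corpus as a stand-in for the one in the question. `NIL`, `with_nil` and `without_nil` are hypothetical names. It counts each row sum twice: once including the NIL neighbours and once ignoring them, as an ordinary COM would.

```python
from collections import Counter

# Assumed corpus (not quoted in the question).
corpus = "to be or not to be that is the Q that matters".split()
WINDOW = 2
NIL = "<NIL>"  # hypothetical padding token

padded = [NIL] * WINDOW + corpus + [NIL] * WINDOW
counts = Counter(corpus)

with_nil = Counter()     # row sums if NIL were kept as a context column
without_nil = Counter()  # row sums of an ordinary COM (no NIL column)

for i, word in enumerate(padded):
    if word == NIL:
        continue  # NIL never gets a row of its own
    for j in range(i - WINDOW, i + WINDOW + 1):
        if j == i:
            continue
        with_nil[word] += 1
        if padded[j] != NIL:
            without_nil[word] += 1

for w in counts:
    print(w, with_nil[w] / (2 * WINDOW), counts[w], without_nil[w])
```

With the NIL column kept, every row sum divided by COW equals the true occurrence count. Without it, the row sums for the edge words fall short again.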

So the formula for PMI/PPMI does not seem to be completely accurate for the words at the ends, although the error may be negligible when the corpus or vocabulary is large.
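
To get a feel for the size of the effect, here is a rough sketch comparing the word probability implied by the COM row sums (row sum divided by the sum of all cells) with the plain relative frequency in the corpus. The corpus is the same assumed stand-in as before, and `p_com` and `p_true` are hypothetical names.

```python
from collections import Counter

# Assumed corpus (not quoted in the question).
corpus = "to be or not to be that is the Q that matters".split()
WINDOW = 2

counts = Counter(corpus)
row_sums = Counter()
for i, word in enumerate(corpus):
    lo = max(0, i - WINDOW)
    hi = min(len(corpus), i + WINDOW + 1)
    row_sums[word] += hi - lo - 1  # context words actually inside the window

total = sum(row_sums.values())  # sum over all cells of the COM

p_com = {w: row_sums[w] / total for w in counts}       # estimate from the COM
p_true = {w: counts[w] / len(corpus) for w in counts}  # relative frequency

for w in counts:
    print(f"{w:8s} COM: {p_com[w]:.4f}  corpus: {p_true[w]:.4f}")
```

In this toy case the total is short by the six missing edge slots (42 instead of 48), so the edge word matters is underestimated and the interior words are slightly overestimated. The gap is fixed in absolute terms, which is consistent with it becoming negligible for a large corpus.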

Please help clarify the accuracy of PMI, or point out whether I have misunderstood the COM, i.e. a COM is defined differently from what I describe and there is no such issue.

Will this instruct how to compute PMI? https://stats.stackexchange.com/a/256704/3277 — ttnphns, Jan 05 '21 at 03:03
@ttnphns, sorry, but I am afraid it does not, because my issue is specific to the co-occurrence matrix and the words at both ends of the corpus sequence. Probably my question was not clear enough. I have updated the question in an effort to make it clearer. — mon, Jan 08 '21 at 02:02
