Cross Co-occurrence between two corpora

Asked Mar 11 '16 at 18:52

Active Mar 11 '16 at 18:52

Viewed 459 times

I've looked around quite a bit for a solution to this problem, specifically in nltk, but couldn't find much help on SO or elsewhere.

My problem is as follows:

I have a set of aligned pairs of sentences:

[(p1, q1), (p2, q2), ..., (pn, qn)]

Each p_i and q_i are corresponding sentences with different numbers of words. Typically p_i is much longer than q_i, although this is not critical.

Both p_i and q_i can be split into individual words.

What I want is some sort of co-occurrence probability for word pairs (w, w'), where w is taken from p_i and w' is taken from the aligned q_i.

Ultimately, I am trying to estimate the probability of seeing a word w in p_i given that a word w' was observed in q_i.
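To make the target concrete, here is a minimal sketch of the quantity I mean, in plain Python with no nltk. The function name `cross_cooccurrence`, the `min_count` cut-off, the toy sentences, and the lowercase whitespace tokenization are all placeholders I made up for illustration. Counting each (w, w') at most once per aligned pair is just one possible convention, not necessarily the right one.

```python
from collections import Counter, defaultdict

def cross_cooccurrence(pairs, min_count=1):
    """Estimate P(w in p_i | w' in q_i) from aligned (p, q) sentence pairs.

    A (w, w') pair is counted once per aligned sentence pair in which
    w occurs in p and w' occurs in q (presence, not token frequency).
    Conditioning words w' seen in fewer than `min_count` pairs are dropped.
    """
    q_counts = Counter()          # q_counts[w'] = number of pairs whose q contains w'
    joint = defaultdict(Counter)  # joint[w'][w] = pairs with w' in q and w in p

    for p, q in pairs:
        # Placeholder tokenization: lowercase + whitespace split.
        p_words = set(p.lower().split())
        q_words = set(q.lower().split())
        for wq in q_words:
            q_counts[wq] += 1
            joint[wq].update(p_words)  # each word of p counted once for this pair

    probs = {}
    for wq, n in q_counts.items():
        if n < min_count:
            continue  # hypothetical frequency cut-off
        probs[wq] = {wp: c / n for wp, c in joint[wq].items()}
    return probs

# Toy aligned pairs, invented for illustration only.
pairs = [
    ("the cat sat on the mat", "cat mat"),
    ("the dog sat on the rug", "dog rug"),
    ("the cat chased the dog", "cat dog"),
]
probs = cross_cooccurrence(pairs)
print(probs["cat"]["sat"])  # estimate of P("sat" in p | "cat" in q)
```

Here `probs[w_prime][w]` is a maximum-likelihood estimate with no smoothing, which is exactly the kind of detail I would rather have a library handle.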

How can I do this using nltk?

I know I could code the logic in Python myself, but I would like to know whether nltk has something that already handles edge cases, frequency cut-offs, and related issues.

Thank you!

user1669710

0 Answers