I've looked around quite a bit for a solution to this problem, specifically in nltk, but couldn't find much help on SO or elsewhere.
My problem is as follows:
I have a set of aligned pairs of sentences:
[(p1, q1), (p2, q2), ..., (pn, qn)]
Each p_i and q_i are corresponding sentences with different numbers of words. Typically p_i is much longer than q_i, although this is not critical.
Both p_i and q_i can be split into individual words.
What I want is some sort of co-occurrence probability for word pairs (w, w'), where w is taken from p_i and w' is taken from q_i.
Ultimately, I am trying to estimate the probability of seeing a word w in p_i given that a word w' was observed in q_i.
How can I do this using nltk?
I know I could code the logic in Python myself, but I would like to know whether nltk has something that handles more edge cases, frequency cut-offs, and related issues.
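To make the target quantity concrete, here is a minimal plain-Python sketch of the kind of estimate I mean. It is only one possible reading of the problem: it assumes simple lowercased whitespace tokenisation, counts each word at most once per sentence, and estimates P(w in p_i | w' in q_i) as the fraction of pairs containing w' in q that also contain w in p. The function name `cooccurrence_probs` and the `min_count` cut-off are made up for illustration and are not part of nltk.

```python
from collections import Counter, defaultdict


def cooccurrence_probs(pairs, min_count=1):
    """Estimate P(w in p | w' in q) from aligned (p, q) sentence pairs.

    Hypothetical helper: whitespace tokenisation and the min_count
    cut-off are assumptions made for illustration only.
    """
    joint = defaultdict(Counter)  # joint[w_q][w_p]: pairs where both occur
    q_count = Counter()           # q_count[w_q]: pairs whose q contains w_q

    for p, q in pairs:
        # Use sets so a word repeated within one sentence is counted once.
        p_words = set(p.lower().split())
        q_words = set(q.lower().split())
        for wq in q_words:
            q_count[wq] += 1
            joint[wq].update(p_words)

    # Drop conditioning words seen in fewer than min_count pairs.
    return {
        wq: {wp: c / q_count[wq] for wp, c in counts.items()}
        for wq, counts in joint.items()
        if q_count[wq] >= min_count
    }


pairs = [
    ("the cat sat on the mat", "cat sat"),
    ("the dog sat", "dog sat"),
    ("a cat ran", "cat"),
]
probs = cooccurrence_probs(pairs, min_count=2)
print(probs["cat"])
```

Note that with this definition the values for a given w' do not sum to 1 over w, since several words of p_i can co-occur with the same w'. What I am hoping nltk can provide is a more robust version of this, with proper tokenisation, smoothing, and frequency thresholds.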
Thank you!