I want to compare two empirical distributions of words in two different texts to see if they are reasonably similar. So for each text I perform the usual steps like stopword removal and stemming, and then count word occurrences in both texts to obtain the two discrete distributions.
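For concreteness, the counting step can look roughly like this (a minimal Python sketch; it assumes `stems` is the list of tokens left after stopword removal and stemming):

```python
from collections import Counter

def word_distribution(stems):
    """Turn a list of (already stopword-filtered, stemmed) tokens into
    a discrete distribution {stem: relative frequency}."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: n / total for stem, n in counts.items()}
```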
To get the same region (support) for both distributions, only the intersection of the two underlying regions (word sets) is kept; the rest is discarded. The problem I'm facing now is which method to choose for measuring similarity or distance between my two discrete distributions X and Y.
From my Data Analysis classes I remember the Kolmogorov-Smirnov test, which doesn't seem to be applicable, since my data is discrete (integers!). However, if overly conservative results are acceptable, it could apparently still be used.
To my knowledge, the $\chi^2$ test is not applicable either, since I'm not interested in the dependence between X and Y. I can safely assume that a word's frequency in one text doesn't depend on its frequency in the other text.
What's keeping me from simply using the Kullback-Leibler divergence is that it is not symmetric, and I cannot designate either text's distribution as the model or the data distribution. However, I could perform the KL calculation twice, for $KL(X \| Y)$ and $KL(Y \| X)$, and take the larger/smaller value if I want a more/less conservative similarity estimate. It's not perfect, for the same reasons that a single one-directional computation isn't.
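A rough sketch of that two-way computation (the `eps` smoothing for zero probabilities is my own assumption; any proper smoothing scheme could replace it):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over a shared support, with eps guarding against
    zero probabilities in Q (a crude stand-in for real smoothing)."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def symmetric_kl(p, q):
    """Both directions; take max() for a conservative estimate,
    min() for a lenient one."""
    return kl_divergence(p, q), kl_divergence(q, p)
```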
Then there's the Bhattacharyya distance, which seems to do the job just fine and is free of assumptions regarding the nature of the data. It also seems to be pretty straightforward to implement.
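For what it's worth, this is the straightforward implementation I have in mind (assuming `p` and `q` are the normalized distributions over the common region):

```python
import math

def bhattacharyya_distance(p, q):
    """D_B(P, Q) = -ln( sum_w sqrt(p(w) * q(w)) ); the sum is the
    Bhattacharyya coefficient, which equals 1 for identical distributions."""
    bc = sum(math.sqrt(pw * q.get(w, 0.0)) for w, pw in p.items())
    return -math.log(bc) if bc > 0 else float('inf')
```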
I'm a computer science grad student, and my statistical background is sadly not as strong as I would like right now. Are there any good reasons not to proceed with the Bhattacharyya distance? Or is there another good candidate for the task?
Thank you.
Edit: As was pointed out, my first approach to getting the same region for both distributions is heavily biased. A nice example of this is two distributions that share only one common word (stem): they would come out as 100% similar, which is obviously wrong.
A better idea would be to discard every word stem that occurs fewer than $threshold$ times and to use the union of both remaining word sets as the common region for the two distributions, as sketched below.
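Something along these lines (the default threshold value and helper names are just placeholders):

```python
def common_region(counts_a, counts_b, threshold=2):
    """Drop stems occurring fewer than `threshold` times in a text,
    then use the union of the surviving stems as the shared region."""
    keep_a = {w for w, n in counts_a.items() if n >= threshold}
    keep_b = {w for w, n in counts_b.items() if n >= threshold}
    return keep_a | keep_b

def restrict_and_normalize(counts, region):
    """Project raw counts onto the shared region and renormalize;
    stems absent from this text simply get probability zero."""
    total = sum(counts.get(w, 0) for w in region)
    return {w: counts.get(w, 0) / total for w in region}
```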