I want to compare two empirical distributions of words in two different texts to see if they are reasonably similar. So for each text I perform the usual steps like stopword removal and stemming, and then count word occurrences in both texts to obtain the two discrete distributions.
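For concreteness, the counting step can look roughly like this (a minimal Python sketch; it assumes `stems` is the list of tokens left after stopword removal and stemming):

```python
from collections import Counter

def word_distribution(stems):
    """Turn a list of (already stopword-filtered, stemmed) tokens into
    a discrete distribution {stem: relative frequency}."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: n / total for stem, n in counts.items()}
```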
To get the same region (support) for both distributions, only the intersection of the two underlying regions (word sets) is kept; the rest is discarded. The problem I'm facing now is which method to choose for measuring similarity or distance between my two discrete distributions X and Y.
From my Data Analysis classes I remember the Kolmogorov-Smirnov test, which doesn't seem to be applicable, since my data is discrete (integers!). However, if overly conservative results are acceptable, it could apparently still be used.
To my knowledge, the $\chi^2$ test is not applicable either, since I'm not interested in the dependence between X and Y. I can safely assume that a word's frequency in one text doesn't depend on its frequency in the other text.
What's keeping me from simply using the Kullback-Leibler divergence is that it is not symmetric, and I cannot designate either text's distribution as the model or the data distribution. However, I could perform the KL calculation twice, for $KL(X \| Y)$ and $KL(Y \| X)$, and take the larger/smaller value if I want a more/less conservative similarity estimate. It's not perfect, for the same reasons that a single one-directional computation isn't.
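A rough sketch of that two-way computation (the `eps` smoothing for zero probabilities is my own assumption; any proper smoothing scheme could replace it):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) over a shared support, with eps guarding against
    zero probabilities in Q (a crude stand-in for real smoothing)."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def symmetric_kl(p, q):
    """Both directions; take max() for a conservative estimate,
    min() for a lenient one."""
    return kl_divergence(p, q), kl_divergence(q, p)
```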
Then there's the Bhattacharyya distance, which seems to do the job just fine and is free of assumptions regarding the nature of the data. It also seems to be pretty straightforward to implement.
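For what it's worth, this is the straightforward implementation I have in mind (assuming `p` and `q` are the normalized distributions over the common region):

```python
import math

def bhattacharyya_distance(p, q):
    """D_B(P, Q) = -ln( sum_w sqrt(p(w) * q(w)) ); the sum is the
    Bhattacharyya coefficient, which equals 1 for identical distributions."""
    bc = sum(math.sqrt(pw * q.get(w, 0.0)) for w, pw in p.items())
    return -math.log(bc) if bc > 0 else float('inf')
```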
I'm a computer science grad student, and my statistical background is sadly not as strong as I would like right now. Are there any good reasons not to proceed with the Bhattacharyya distance? Or is there another good candidate for the task?
Thank you.
Edit: As was pointed out, my first approach to getting the same region for both distributions is heavily biased. A nice example of this is two distributions that share only one common word (stem): they would come out as 100% similar, which is obviously wrong.
A better idea would be to discard every word stem that occurs fewer than $threshold$ times and to use the union of both remaining word sets as the common region for the two distributions, as sketched below.
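Something along these lines (the default threshold value and helper names are just placeholders):

```python
def common_region(counts_a, counts_b, threshold=2):
    """Drop stems occurring fewer than `threshold` times in a text,
    then use the union of the surviving stems as the shared region."""
    keep_a = {w for w, n in counts_a.items() if n >= threshold}
    keep_b = {w for w, n in counts_b.items() if n >= threshold}
    return keep_a | keep_b

def restrict_and_normalize(counts, region):
    """Project raw counts onto the shared region and renormalize;
    stems absent from this text simply get probability zero."""
    total = sum(counts.get(w, 0) for w in region)
    return {w: counts.get(w, 0) / total for w in region}
```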