I am trying to compute the significance of the overlap between two subgroups, each taken from one of two related datasets (1 and 2). For instance:
Dataset1 total: 500
Dataset1 subgroup: 100
Dataset2 total: 300
Dataset2 subgroup: 50
Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600
I want to compute how significant the overlap between the subgroups is compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test or the hypergeometric test, but I have some problems putting my data into the analyses.
From what I understand, the contingency table would be:
               Dataset1   Dataset2
In_subgroup         100         50
Not_subgroup        400        250
Total               500        300
Note that the universe comprises 600 unique elements, not 500 + 300 = 800, because some elements appear in both dataset1 and dataset2. Given this, and based on another post, I would do this in R:
phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19
Translating this into Python, I would use scipy:
>>> import scipy.stats
>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0
Can I assume that the difference between the two results is due to numerical error?
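One thing I am unsure about is whether the two calls are even equivalent. `lower.tail = FALSE` in R asks for the upper tail P(X > 24), whereas `cdf` returns the lower tail P(X <= 24), and the two libraries order their arguments differently. Below is a minimal sketch of what I believe the matching scipy call would be, assuming R's `phyper(q, m, n, k)` takes white balls, black balls and draws, and scipy's `hypergeom.sf(k, M, n, N)` takes population size, number of successes and draws. The variable names are my own, and the 2x2 table passed to Fisher's exact test is my guess at the right layout (overlap, subgroup 1 only, subgroup 2 only, neither), with the universe taken as the 600 unique elements.

```python
from scipy.stats import hypergeom, fisher_exact

# Numbers from the example above (variable names are illustrative).
universe = 600    # unique elements across both datasets
subgroup1 = 100   # size of the dataset 1 subgroup
subgroup2 = 50    # size of the dataset 2 subgroup
overlap = 25      # elements found in both subgroups

# Upper tail P(X >= overlap). sf(k) is P(X > k), hence overlap - 1.
# scipy order: sf(k, M, n, N) with M = population size,
# n = number of "success" items, N = number of draws.
p_hyper = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)

# Cross-check with a one-sided Fisher's exact test on the 2x2 table:
#                     in subgroup 2   not in subgroup 2
# in subgroup 1             25               75
# not in subgroup 1         25              475
table = [
    [overlap, subgroup1 - overlap],
    [subgroup2 - overlap, universe - subgroup1 - subgroup2 + overlap],
]
_, p_fisher = fisher_exact(table, alternative="greater")

print(p_hyper, p_fisher)
```

If this mapping is right, the two p-values should agree with each other, and the `0.0` above would come from the argument order and the choice of `cdf` rather than from floating-point precision.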