Computing significance of overlap between two sets in Python

Question

I am computing the significance of the overlap between two subgroups, one in each of two related datasets (1 and 2). For instance:

Dataset1 total: 500
Dataset1 subgroup: 100

Dataset2 total: 300
Dataset2 subgroup: 50

Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600

I want to compute how significant the overlap between the subgroups is, compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test or the hypergeometric test, but I am having trouble putting my data into the analyses.

From what I understand, the contingency table would be:

                Dataset1    Dataset2
In_subgroup     100         50
Not_subgroup    400         250
Total           500         300

Here, note that the universe comprises 600 unique elements, not 500+300 (as some elements appear in both dataset1 and dataset2). Given this, and based on another post, I would do this in R:

phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19

Translating this into Python, I would use scipy:

>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0

Can I assume that the difference between both results is numeric error?

I mean that I have two populations, each with a subgroup. At the population level (and possibly at the subgroup) there are some terms in common between populations. What I want to test is whether the subgroup vs subgroup overlap is statistically significant, given each of the populations' compositions. — Sos, Apr 23 '20 at 08:17

BruceET · Accepted Answer · 2020-04-15T22:18:03.660

Your proposed table does not seem right: first, because it is not clear how it counts overlaps; second, because its grand total of 800 does not match the total of 600 subjects.

Even for a correct table, you cannot expect the P-value of fisher.test (two-sided by default) to match results from phyper, which would be for a one-sided test.
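For reference, here is a minimal sketch of the one-sided tail that phyper(..., lower.tail = FALSE) computes, written with scipy. The numbers are simply the arguments of the R call in the question, and the variable names (white, black, drawn) are illustrative. The main thing to check is the argument order: R takes (q, white, black, drawn), whereas scipy's hypergeom takes (k, population size, successes in population, draws).

```python
from scipy.stats import hypergeom

# R:     phyper(q, m, n, k, lower.tail = FALSE)  ->  P(X > q)
#        m = white balls, n = black balls, k = balls drawn
# scipy: hypergeom.sf(k, M, n, N)                ->  P(X > k)
#        M = population size, n = successes in population, N = draws
q, white, black, drawn = 24, 100, 500, 50  # arguments of the R call in the question

# Upper tail P(X > 24), i.e. P(X >= 25).
p_upper = hypergeom.sf(q, white + black, white, drawn)

# Cross-check: the same tail as an explicit sum of point probabilities.
p_check = sum(
    hypergeom.pmf(x, white + black, white, drawn)
    for x in range(q + 1, drawn + 1)
)

print(p_upper, p_check)
```

Note that hypergeom.cdf returns the lower tail P(X <= q), so it corresponds to phyper with lower.tail = TRUE rather than to the call above.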

Suppose you have 600 subjects, of which 150 are in subgroups and 450 are not. Then I guess you have 25 + 175 = 200 subjects involved in overlaps. Perhaps you want to know if 25 'subgroup' subjects involved in overlaps (1/6 of them) is surprisingly small, compared to the 175 subjects outside of subgroups involved in overlaps ($175/450\approx .39$ of them).

If so, then you want the data table shown below. If not, then please revise your Question to show a data table with the correct grand total, which is directly relevant to what you want to test. And explain how it is relevant.

In or Out \     Unique   Overlap       TOTAL
--------------------------------------------
Subgroup           125        25         150
Remainder          275       175         450
--------------------------------------------
TOTAL              400       200         600
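The arithmetic behind this table can be sketched in Python as follows. It follows the reading assumed in this answer (150 subjects counted as 'subgroup', 200 subjects involved in overlaps); the variable names are illustrative and the counts are the ones given in the question.

```python
# Counts taken from the question.
total_1, total_2, union = 500, 300, 600
sub_1, sub_2, sub_overlap = 100, 50, 25

# Subjects present in both datasets (inclusion-exclusion): 500 + 300 - 600.
overlap_all = total_1 + total_2 - union

# Assumption made in this answer: 100 + 50 = 150 subjects count as "subgroup".
subgroup_total = sub_1 + sub_2
remainder_total = union - subgroup_total

# Overlapping subjects that are not in a subgroup.
remainder_overlap = overlap_all - sub_overlap

# Rows: Subgroup, Remainder; columns: Unique, Overlap.
table = [
    [subgroup_total - sub_overlap, sub_overlap],
    [remainder_total - remainder_overlap, remainder_overlap],
]
print(table)  # [[125, 25], [275, 175]]
```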

DTA = rbind(c(175, 25), c(275,175));  DTA

     [,1] [,2]
[1,]  175   25
[2,]  275  175

fisher.test(DTA, alt="g")  # one-sided test

        Fisher's Exact Test for Count Data

data:  DTA
p-value = 1.493e-12
alternative hypothesis: 
   true odds ratio is greater than 1
...

The 2-sided test has P-value 2.381e-12. A two-sided chi-squared test without the Yates correction gives the following result.

chisq.test(DTA, corr=F)

        Pearson's Chi-squared test

data:  DTA
X-squared = 45.264, df = 1, p-value = 1.722e-11

Output from Minitab 17 statistical software (for comparison):

        C1   C2  All

1      125   25  150
2      275  175  450
All    400  200  600

Cell Contents:      Count

Pearson Chi-Square = 25.000, DF = 1, P-Value = 0.000
Likelihood Ratio Chi-Square = 27.225, DF = 1, P-Value = 0.000

Fisher’s exact test: P-Value =  0.0000003
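Since the question asks about Python, here is a sketch of the same tests with scipy, using the counts from the data table above (125, 25 / 275, 175). With these counts the Pearson statistic works out to 25.0, in agreement with the Minitab output. The variable names are illustrative.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: Subgroup, Remainder; columns: Unique, Overlap.
table = [[125, 25],
         [275, 175]]

# One-sided Fisher exact test (odds ratio > 1), analogous to
# fisher.test(..., alt="g") in R.
odds_ratio, p_one_sided = fisher_exact(table, alternative="greater")

# Two-sided Fisher exact test (scipy's default alternative).
_, p_two_sided = fisher_exact(table, alternative="two-sided")

# Pearson chi-squared test without the Yates continuity correction.
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

print(f"sample odds ratio    = {odds_ratio:.3f}")
print(f"Fisher one-sided p   = {p_one_sided:.3g}")
print(f"Fisher two-sided p   = {p_two_sided:.3g}")
print(f"chi-squared = {chi2:.3f}, df = {dof}, p = {p_chi2:.3g}")
```

The one-sided direction matters here: alternative="greater" asks whether the odds ratio of the table as laid out exceeds 1, so swapping rows or columns flips which tail is tested.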
