I have a large dataset of around 20 million 2x2 contingency tables, as below:
      Y=1    Y=0
E=1    a      b
E=0    c      d
I want to measure the effect that exposure (E) has on the expression of a trait (Y). The trait under test is unique to each exposure. For this I'm calculating the attributable risk (AR), i.e. the risk difference between exposed and unexposed, for each contingency table:
$$ AR = \frac{a}{a+b} - \frac{c}{c+d} $$
Here's a sample of the data:
group      a      b      c      d     AR
A          3      0      1  55559   .999
B        566   1799   1683  51515   .208
C         33     55     85  55390   .373
D          9      5     13  55534   .643
E       1155   4282   3596  46540   .141
F          1      0      1  55561   .999
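For reference, a minimal sketch of how the AR column is computed (Python and pandas are used here purely for illustration; the column names simply mirror the cell labels above):

```python
import pandas as pd

# The six sample rows from the question; a, b, c, d follow the 2x2 layout above.
df = pd.DataFrame({
    "group": ["A", "B", "C", "D", "E", "F"],
    "a": [3, 566, 33, 9, 1155, 1],
    "b": [0, 1799, 55, 5, 4282, 0],
    "c": [1, 1683, 85, 13, 3596, 1],
    "d": [55559, 51515, 55390, 55534, 46540, 55561],
})

# AR = risk among the exposed minus risk among the unexposed (risk difference)
df["AR"] = df["a"] / (df["a"] + df["b"]) - df["c"] / (df["c"] + df["d"])
print(df.round(3))
```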
The problem I'm having can be seen clearly in groups A and F. These groups obtain an extremely high AR despite a very small exposed sample (a + b). Manual checking shows that these groups should be considered noise and obtained their very high AR by chance. Of course, with ~20 million tables this is bound to happen.
To resolve this issue, I've tried using Fisher's exact test and Pearson's/Neyman's chi-squared tests to obtain p-values for the null hypothesis that there is no association between exposure and expression, the reasoning being that with such a small exposed sample we should not be able to reject that null with any confidence. I then apply the Benjamini–Hochberg procedure to control the false discovery rate (FDR), obtaining q-values, and keep only the groups with a q-value below 0.05.
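For concreteness, a minimal sketch of this testing/FDR step, using scipy.stats.fisher_exact and statsmodels' multipletests on the df from the snippet above (again illustrative only; a per-row loop like this would be far too slow for 20 million tables in practice):

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Two-sided Fisher's exact test per 2x2 table; [1] extracts the p-value.
df["p"] = df.apply(
    lambda r: fisher_exact([[r["a"], r["b"]], [r["c"], r["d"]]])[1], axis=1
)

# Benjamini-Hochberg FDR correction across all groups;
# reject[i] is True where the q-value falls below the 5% FDR threshold.
reject, qvals, _, _ = multipletests(df["p"], alpha=0.05, method="fdr_bh")
df["q"] = qvals
kept = df[reject]  # groups that survive the FDR filter
```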
Unfortunately, this does not resolve the issue: I would have to set my alpha much lower than 0.05 to suppress this 'noise', which seems like invalid practice. I do know the distributions of a, b, c and d, but I'm at a loss as to whether I can use them, and if so, how.
So, the question is: how do I control for groups with a small exposed fraction, and make sure they either don't obtain a high AR by accident or are filtered out?
Thanks!