I have a large dataset of around 20 million 2x2 contingency tables, as below:
      Y=1    Y=0
E=1    a      b
E=0    c      d
I want to measure the effect that exposure (E) has on the expression of a trait (Y). The trait under test is unique to each exposure. For this I'm calculating the attributable risk (AR), i.e. the risk difference between exposed and unexposed, for each contingency table:
$$ AR = \frac{a}{a+b} - \frac{c}{c+d} $$
Here's a sample of the data:
group      a      b      c      d     AR
A          3      0      1  55559   .999
B        566   1799   1683  51515   .208
C         33     55     85  55390   .373
D          9      5     13  55534   .643
E       1155   4282   3596  46540   .141
F          1      0      1  55561   .999
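For reference, a minimal sketch of how the AR column is computed (Python and pandas are used here purely for illustration; the column names simply mirror the cell labels above):

```python
import pandas as pd

# The six sample rows from the question; a, b, c, d follow the 2x2 layout above.
df = pd.DataFrame({
    "group": ["A", "B", "C", "D", "E", "F"],
    "a": [3, 566, 33, 9, 1155, 1],
    "b": [0, 1799, 55, 5, 4282, 0],
    "c": [1, 1683, 85, 13, 3596, 1],
    "d": [55559, 51515, 55390, 55534, 46540, 55561],
})

# AR = risk among the exposed minus risk among the unexposed (risk difference)
df["AR"] = df["a"] / (df["a"] + df["b"]) - df["c"] / (df["c"] + df["d"])
print(df.round(3))
```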
The problem I'm having can be seen clearly in groups A and F. These groups obtain an extremely high AR despite a very small exposed sample (a + b). Manual checking shows that these groups should be considered noise and obtained their very high AR by chance. Of course, with ~20 million tables this is bound to happen.
To resolve this issue, I've tried using Fisher's exact test and Pearson's/Neyman's chi-squared tests to obtain p-values for the null hypothesis that there is no association between exposure and expression, the reasoning being that with such a small exposed sample we should not be able to reject that null with any confidence. I then apply the Benjamini–Hochberg procedure to control the false discovery rate (FDR), obtaining q-values, and keep only the groups with a q-value below 0.05.
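For concreteness, a minimal sketch of this testing/FDR step, using scipy.stats.fisher_exact and statsmodels' multipletests on the df from the snippet above (again illustrative only; a per-row loop like this would be far too slow for 20 million tables in practice):

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Two-sided Fisher's exact test per 2x2 table; [1] extracts the p-value.
df["p"] = df.apply(
    lambda r: fisher_exact([[r["a"], r["b"]], [r["c"], r["d"]]])[1], axis=1
)

# Benjamini-Hochberg FDR correction across all groups;
# reject[i] is True where the q-value falls below the 5% FDR threshold.
reject, qvals, _, _ = multipletests(df["p"], alpha=0.05, method="fdr_bh")
df["q"] = qvals
kept = df[reject]  # groups that survive the FDR filter
```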
Unfortunately, this does not resolve the issue: I would have to set my alpha much lower than 0.05 to suppress this 'noise', which seems like invalid practice. I do know the distributions of a, b, c and d, but I'm at a loss as to whether I can use them, and if so, how.
So, the question is: how do I control for groups with a small exposed fraction, and make sure they either don't obtain a high AR by accident or are filtered out?
Thanks!