I have a dataset of 492 samples. For each sample and each gene X, I know whether the gene carries a germline mutation and whether it carries a somatic mutation. I would like to test for co-occurrence of germline and somatic mutations in the same gene.

For each gene, I counted the number of patients harboring only a germline mutation, only a somatic mutation, both, or neither, and then computed Fisher's exact test on those counts.

Here is my final dataset:

# A tibble: 716 x 8
   germ_gene somatic_gene only_germ only_som germ_and_som no_germ_no_som f.odd f.pvalue
   <chr>     <chr>            <int>    <int>        <int>          <int> <dbl>    <dbl>
 1 A2M       A2M                 73        9            2            408 1.24    0.519 
 2 ABCA1     ABCA1               89       12            1            390 0.366   0.930 
 3 ABCA12    ABCA12              39       12            4            437 3.72    0.0424
 4 ABCA13    ABCA13              58       23            4            407 1.22    0.450 
 5 ABCA2     ABCA2               68        9            1            414 0.677   0.783 
 6 ABCB11    ABCB11              16       10            0            466 0       1     
 7 ABCB5     ABCB5               47        9            1            435 1.03    0.645 
 8 ABCC1     ABCC1               45        8            2            437 2.42    0.246 
 9 ABCC2     ABCC2               36       14            0            442 0       1     
10 ABCC3     ABCC3               32       11            0            449 0       1     

Fisher's exact test is computed as follows (example with the gene A2M):

contingency <- matrix(c(germ_and_som,only_som,only_germ,no_germ_no_som),nrow=2,ncol=2,dimnames=list(c("germ","no_germ"),c("som","no_som")))

#            som no_som
# germ         2     73
# no_germ      9    408    

f.test <- fisher.test(contingency,alternative = "greater")

#   Fisher's Exact Test for Count Data

# data:  
# p-value = 0.5191
# alternative hypothesis: true odds ratio is greater than 1
# 95 percent confidence interval:
#  0.1890338       Inf
# sample estimates:
# odds ratio 
#   1.241402 

Is this strategy correct? If so, is it correct to set alternative = "greater", or should I keep the default two-sided test?

Thank you

1 Answer

The statement "I would like to test co-occurrence of germline and somatic mutation in the same gene" doesn't imply a direction for the alternative to the null hypothesis of independence, so a two-sided alternative seems more appropriate on the face of it. You should also consider adjusting the p-values for multiple hypothesis testing, since you are testing 716 genes at the same time according to the tibble above. One way to do this is with stats::p.adjust.
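A minimal sketch of both suggestions, using only the ten rows printed in the question. The data frame name `res`, the helper `fisher_two_sided`, and the choice of the Benjamini-Hochberg method are assumptions made for illustration; any other method accepted by `p.adjust` (for example "holm" or "bonferroni") could be used instead.

```r
# Counts copied from the first 10 rows of the tibble in the question.
# `res` is a hypothetical name for that table.
res <- data.frame(
  gene           = c("A2M", "ABCA1", "ABCA12", "ABCA13", "ABCA2",
                     "ABCB11", "ABCB5", "ABCC1", "ABCC2", "ABCC3"),
  only_germ      = c(73, 89, 39, 58, 68, 16, 47, 45, 36, 32),
  only_som       = c(9, 12, 12, 23, 9, 10, 9, 8, 14, 11),
  germ_and_som   = c(2, 1, 4, 4, 1, 0, 1, 2, 0, 0),
  no_germ_no_som = c(408, 390, 437, 407, 414, 466, 435, 437, 442, 449)
)

# Two-sided Fisher's exact test on one gene's 2x2 table.
# The matrix is filled column by column, as in the question:
# rows = germ / no_germ, columns = som / no_som.
fisher_two_sided <- function(germ_and_som, only_som, only_germ, no_germ_no_som) {
  tab <- matrix(c(germ_and_som, only_som, only_germ, no_germ_no_som), nrow = 2)
  fisher.test(tab, alternative = "two.sided")$p.value
}

# One raw p-value per gene.
res$p_two_sided <- mapply(fisher_two_sided,
                          res$germ_and_som, res$only_som,
                          res$only_germ, res$no_germ_no_som)

# Adjust for multiple testing across all genes tested together
# (Benjamini-Hochberg is an assumed choice here).
res$p_adj <- p.adjust(res$p_two_sided, method = "BH")

print(res[, c("gene", "p_two_sided", "p_adj")])
```

On the real data, the adjustment should be applied once to the full vector of 716 p-values rather than to a subset, because the correction depends on the number of tests.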

dipetkov
  • OK, thank you. One more question: most of my contingency tables are imbalanced, i.e. one cell (no_germ_no_som) contains most of the events. Is that an issue for the test? – Nicolas Rosewick Feb 19 '19 at 08:25
  • That shouldn't be a problem. Fisher's exact test can be used (to test for independence between two categorical variables) on a 2-by-2 contingency table in which some of the cells contain 0s or small numbers. – dipetkov Feb 19 '19 at 09:23
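To illustrate the point about sparse tables, here is a small sketch using the ABCB11 counts from the question, where the germ/som cell is 0. The variable names `tab`, `ft` and `expected` are chosen for illustration only.

```r
# ABCB11 counts from the question: germ_and_som = 0, only_som = 10,
# only_germ = 16, no_germ_no_som = 466.
tab <- matrix(c(0, 10, 16, 466), nrow = 2,
              dimnames = list(c("germ", "no_germ"), c("som", "no_som")))

# Expected cell counts under independence: (row total * column total) / n.
# A very small expected count in the germ/som cell is the situation where
# large-sample approximations are least reliable.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(expected)

# Fisher's exact test still runs on this table, zero cell included.
ft <- fisher.test(tab)   # two-sided by default
print(ft$p.value)
print(ft$estimate)       # conditional estimate of the odds ratio
```

Because the test conditions on the table margins and computes exact hypergeometric probabilities, it does not rely on the cell counts being large.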