5

My question is similar to this other one, but I would like to know if it is possible to correct the p-values for multiple testing (and how).

I have expression values for 45 genes, which I have correlated with 25 clinical variables. So I end up with a 45x25 matrix of correlation coefficients and p-values. Is it possible to correct the p-values for the multiple comparisons done?

My understanding is that the FDR is for family-wise corrections, and in this case I have tested each gene against 25 clinical variables, but at the same time I have tested each clinical variable against 45 genes. In this case, if I use p.adjust from R, which n should I set? If I leave it as the default it corrects for more tests than I have done. I can select either 45 or 25, but I don't have any special reasoning except that selecting 45 would be more conservative (although not as conservative as setting n = 45*25).

llrs
  • I know that there are specialized methods used in genetics. You will have to do some research in the literature. You can always use Bonferroni bounds, correlated or not. – David Smith Jun 28 '17 at 15:17
  • What are the Bonferroni bounds? I can't use p.adjust with "BH" because I don't know which family of tests each p-value belongs to. – llrs Jun 28 '17 at 21:18
  • Bonferroni doesn't work well with correlation matrices because the correlation estimates are strongly correlated whereas Bonferroni assumes they are not. – whuber Mar 21 '18 at 14:09
  • @whuber Yes, I was precisely reading this. – llrs Mar 21 '18 at 14:14
  • @whuber Bonferroni does not assume the estimates are uncorrelated, but Bonferroni corrections are conservative when the tests are correlated; i.e. they do not reject the null often enough. – AdamO Mar 21 '18 at 14:17
  • @AdamO Yes, that's the sense in which I meant "assumption." All quibbling aside, the point is that a Bonferroni correction is *far* too conservative when applied to correlation matrices to be of much use. – whuber Mar 21 '18 at 14:19

2 Answers

3

The biggest problem you have is that your analysis of pairwise correlations could severely limit your ability to understand the underlying issues. Seldom in biomedical research does one variable depend on exactly one other without influences from any of the rest. Yet that is all your pairwise correlations will illustrate. Also, if you have RNAseq data you presumably have information on expression of about 20,000 genes, not just 45. It might not be wise to throw out the information about the other 19,955. In terms of your intended applications, it is seldom true that a drug will affect expression of only a single gene, and altering/curing a clinical phenotype will similarly affect expression of hundreds of genes.

There is a large and well-developed set of methods for assessing the relations of biologic phenotypes with large-scale gene-expression data, dating back to the dark ages of microarray methods. For binary phenotypes, Gene Set Enrichment Analysis (GSEA), in the nearly 15 years since its introduction, has received a large amount of attention with respect to statistical significance testing in terms of FDR and FWER. For relations of continuous biologic variables or survival data to gene expression, LASSO, ridge regression, or their combination in elastic net can be very useful, with many questions about those approaches answered on this site.

If you for some reason need to restrict analysis to these 45 genes, consider multiple-predictor rather than pairwise approaches. For example, set up a separate model for each of your 25 phenotypes with all 45 genes considered as predictors in a linear or logistic regression or another model structure appropriate to the nature of the phenotype. Apply your favorite FDR or FWER criterion to the set of 25 models and their overall p-values to restrict false-positive models/phenotypes. Then focus on validating the models of the phenotypes that pass that first testing hurdle.
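As a rough sketch of the last step only (correcting the 25 overall model p-values), the following uses Holm's step-down procedure as one possible FWER criterion. The function names `holm_adjust` and `significant_models`, the phenotype labels, and the p-values are all made up for illustration; fitting the models themselves is not shown.

```python
from typing import Dict, List


def holm_adjust(pvalues: List[float]) -> List[float]:
    """Holm step-down adjusted p-values (controls the FWER)."""
    m = len(pvalues)
    # Indices from smallest p-value to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for pos, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - pos) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def significant_models(model_pvalues: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Names of the models whose Holm-adjusted p-value is at most alpha."""
    names = list(model_pvalues)
    adjusted = holm_adjust([model_pvalues[name] for name in names])
    return [name for name, p_adj in zip(names, adjusted) if p_adj <= alpha]


# Hypothetical overall p-values, one per phenotype model (illustrative only).
example = {"phenotype_A": 0.0004, "phenotype_B": 0.03, "phenotype_C": 0.4}
print(significant_models(example))
```

With all 25 phenotypes, the dictionary would simply have 25 entries; an FDR criterion could be substituted for `holm_adjust` without changing the rest.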

I understand that this doesn't address the question that you asked, but sometimes on this site the best solution to the underlying problem is to suggest a different approach rather than to answer the question as posed.

EdM
  • Thanks for your feedback. However, I arrived at those 45 genes (in this example) by using one of the multivariate techniques you describe. I am being pressed to focus on fewer genes, and to show the relationship between genes and other variables with correlations, because that is easier for my advisors to understand. – llrs Mar 26 '18 at 20:29
  • @Llopis then use the multiple-predictor approach I suggest in the 3rd paragraph of my answer. I think your advisers should be able to understand that multiple regression is a better approach than single-variable correlations. – EdM Mar 27 '18 at 01:24
  • I'll propose one multiple regression, but I don't think my advisor will understand why it is better. – llrs Mar 27 '18 at 07:18
1

There are many lines of reasoning one could take with this type of analysis. What makes a "family of tests" is vague, and for good reason! In some cases, a "family" can be grouped in terms of tests with a common regressor or exposure. In other cases, a "family" can be grouped in terms of tests with a common outcome. I think you are right in saying it does not make sense to consider all $25 \times 45$ tests as having equal bearing to each other.
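To make the choice of family concrete, here is a minimal sketch, assuming the p-values are stored as a genes-by-variables nested list. It applies a Benjamini-Hochberg adjustment within each gene, within each clinical variable, or across the whole matrix. The names `bh_adjust` and `adjust_by_family` and the toy matrix are hypothetical; the `n` argument is meant to play the same role as `n` in R's p.adjust (the size of the family when only part of it is passed in).

```python
from typing import List, Optional


def bh_adjust(pvalues: List[float], n: Optional[int] = None) -> List[float]:
    """Benjamini-Hochberg adjusted p-values for one family of tests.

    n is the number of tests in the family. It defaults to len(pvalues)
    and may be larger when only some of the family's p-values are given.
    """
    m = len(pvalues)
    n = m if n is None else n
    if n < m:
        raise ValueError("n must be at least the number of p-values")
    # Indices from largest p-value to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for pos, idx in enumerate(order):
        rank = m - pos  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def adjust_by_family(pmat: List[List[float]], family: str = "all") -> List[List[float]]:
    """Adjust pmat[g][v] (gene g vs. clinical variable v) by the chosen family."""
    n_genes, n_vars = len(pmat), len(pmat[0])
    if family == "gene":  # each row is a family of n_vars tests
        return [bh_adjust(row) for row in pmat]
    if family == "variable":  # each column is a family of n_genes tests
        cols = [bh_adjust([pmat[g][v] for g in range(n_genes)]) for v in range(n_vars)]
        return [[cols[v][g] for v in range(n_vars)] for g in range(n_genes)]
    if family == "all":  # every test in the matrix is one family
        flat = bh_adjust([p for row in pmat for p in row])
        return [flat[g * n_vars:(g + 1) * n_vars] for g in range(n_genes)]
    raise ValueError("family must be 'gene', 'variable' or 'all'")


# Toy 2 genes x 3 variables matrix of made-up p-values (illustrative only).
toy = [[0.001, 0.02, 0.5],
       [0.04, 0.03, 0.9]]
for choice in ("gene", "variable", "all"):
    print(choice, adjust_by_family(toy, choice))
```

With the real data the matrix would be 45 by 25, so the "gene" choice corrects each p-value for 25 tests, the "variable" choice for 45, and "all" for 1125. Which of these is appropriate is exactly the judgment call about families discussed here; the code does not settle it.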

Methods for multiple testing control either the family-wise error rate (FWER) or the false discovery rate (FDR). FWER methods tighten the significance level of each individual test. FDR methods rank tests by their p-values and control the expected proportion of false positives among the statistically significant findings. Each has its pluses and minuses.
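For reference, the two error rates can be written out as follows. The notation is my own shorthand, not anything from the question: $m$ is the number of tests in the family, $V$ the number of true null hypotheses that get rejected, $R$ the total number of rejections, and $\alpha$ the target level.

$$
\mathrm{FWER} = \Pr(V \ge 1), \qquad
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right].
$$

The standard procedure for each, taking Bonferroni for FWER and Benjamini-Hochberg for FDR as the usual examples, with ordered p-values $p_{(1)} \le \dots \le p_{(m)}$:

$$
\text{Bonferroni: reject } H_i \text{ if } p_i \le \frac{\alpha}{m}, \qquad
\text{BH: reject } H_{(1)}, \dots, H_{(k)} \text{ where } k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha \right\}.
$$

If all $25 \times 45 = 1125$ tests were treated as one family at $\alpha = 0.05$, the Bonferroni threshold would be $0.05 / 1125 \approx 4.4 \times 10^{-5}$, which gives a sense of how much the choice of family matters.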

Joining the statistical reasoning with the medical/clinical application is the key here. One question you might ask: who is the audience and what will they do with this information? If they are geneticists, they may be keen to know the impact of a single gene on any number of clinical variables. I think of the APOE allele and its overarching impact on cardiovascular health. On the other hand, clinicians may be interested in knowing whether one or more genes are associated with a certain trait so that, e.g., they can use family history or risk counseling to assess the patient's capacity to manage exposure through noninvasive means.

Unfortunately, I think that's the best I can answer with the information given. There's some detail about experimental design, which is important, but the choice of FWER/FDR control and what makes a family of tests boils down to the study's purpose.

AdamO
  • Many thanks for your answer. The purpose of this kind of analysis is usually to find (true) relationships between the two groups of variables. I personally wouldn't care whether the correction is FWER or FDR, but at least I would like to perform a better correction than just providing the raw p-values. What other kind of information do you need to give a more concrete answer? – llrs Mar 21 '18 at 14:51
  • @Llopis I think a statement of how the information will be used. Put very plainly: is the study more about the genotype or about the phenotype? – AdamO Mar 21 '18 at 15:09
  • About both! In some cases we are interested in modifying a gene's expression (if relevant) via a drug, in others in modifying a clinical variable such as diet (if relevant) in order to cure a disease. Here I am using RNA-seq data, not genomes (otherwise I couldn't compute a correlation). – llrs Mar 21 '18 at 15:48
  • @Llopis well consider for the moment it may be too ambitious. The less targeted we are with the tests, the less confirmatory our data analysis becomes. On the other hand, a free-for-all might be desirable. In that case, consider just applying the FDR to the $25 \times 45$ tests. Think how many false-positives you'd expect if there were no association at all, then FDR will cherry pick the top, say, 5 (or more or less) tests. – AdamO Mar 21 '18 at 16:01
  • Why is it too ambitious? That's the kind of analysis I am tasked to do; I'm merely trying to do my best. I'll think about the false positives I'd get if there weren't any associations. – llrs Mar 21 '18 at 16:14