I am looking into how differences in brain tumor genetics affect patient survival. I have a gene dataset with around 4600 predictors, many of which are strongly correlated with each other. I now want to fit a Cox model using R's survival package that combines the best genes for predicting overall survival. How should I approach the feature elimination process? So far I have thought about using PCA or clustering as a preprocessing step. However, is there perhaps an established method similar to ridge/LASSO for Cox proportional hazards models?

So far I have found https://pubmed.ncbi.nlm.nih.gov/17661175/, but there seems to be no R implementation.

Nick Cox
florian
  • Don't select features; the process is too noisy. Instead, use some principal components to determine whether anything in the genetic data can help with prediction. – Demetri Pananos Oct 11 '21 at 23:57
  • Does this answer your question? [How to choose the best combination of covariates in Cox multiple regression?](https://stats.stackexchange.com/questions/411804/how-to-choose-the-best-combination-of-covariates-in-cox-multiple-regression) – EdM Oct 12 '21 at 07:50
  • @EdM thank you, this is certainly relevant and provides some good pointers, however, the questions do not fully overlap. – florian Oct 12 '21 at 08:26

1 Answer

This is covered somewhat in this answer. Chapter 4 of Frank Harrell's class notes provides much more useful advice on working with multiple predictors.

If you want to evaluate all genes together, ridge regression is a useful choice. You can think of this like PCA in that correlated predictors tend to be in the same principal components, but the components are weighted continuously instead of selected in-versus-out.

If you want to identify a small subset of genes, LASSO will tend to select one out of a set of correlated predictors. Yes, that's a very noisy process in that the particular gene selected from a correlated set might vary from data sample to data sample. But that can work OK in practice for prediction, and it allows you to do things like find genes to develop practical tests that are less expensive than whole-transcriptome analysis. There's also a hybrid between ridge and LASSO called the elastic net. Chapter 6 of An Introduction to Statistical Learning provides background on these and other methods.
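To make the relationship between the three penalties concrete, here is a sketch of the penalized objective, written in the scaling convention that (to my understanding) glmnet uses for Cox models; the symbols $\ell$, $\lambda$ and $\alpha$ are my notation, not taken from the question. With $\ell(\beta)$ the Cox log partial likelihood for $n$ patients and $p$ genes:

$$\hat\beta = \arg\max_{\beta}\left[\frac{2}{n}\,\ell(\beta) \;-\; \lambda\left(\alpha\sum_{j=1}^{p}|\beta_j| \;+\; \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}\right)\right]$$

Setting $\alpha = 0$ gives ridge (all coefficients shrunk, none exactly zero), $\alpha = 1$ gives LASSO (some coefficients exactly zero), and $0 < \alpha < 1$ gives the elastic net. The overall penalty strength $\lambda$ is usually chosen by cross-validation.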

You would not fit these models directly with the R survival package. They are implemented, for example, in the glmnet package, which supports a wide variety of regression models including Cox.
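As a minimal sketch of what that can look like, the following fits ridge and LASSO Cox models with `cv.glmnet()`. The data are simulated stand-ins (50 hypothetical "genes" instead of about 4600), and the variable names are mine; substitute your own expression matrix and follow-up data.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 50  # small stand-in for the ~4600 genes in the question

# Simulated expression matrix with hypothetical gene names
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("gene", seq_len(p))

# Simulated survival: hazard depends on the first 3 genes only
lp <- as.vector(x[, 1:3] %*% c(0.8, -0.6, 0.5))
event_time <- rexp(n, rate = exp(lp))
cens_time <- rexp(n, rate = 0.3)
y <- cbind(time = pmin(event_time, cens_time),
           status = as.numeric(event_time <= cens_time))

# alpha = 0 is ridge, alpha = 1 is LASSO, values in between are elastic net
cv_ridge <- cv.glmnet(x, y, family = "cox", alpha = 0)
cv_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1)

# Coefficients at the cross-validated penalty
b_ridge <- as.matrix(coef(cv_ridge, s = "lambda.min"))
b_lasso <- as.matrix(coef(cv_lasso, s = "lambda.min"))

# Genes retained by LASSO (nonzero coefficients)
selected <- rownames(b_lasso)[b_lasso[, 1] != 0]
```

Rerunning the LASSO fit on bootstrap resamples of the patients is one simple way to see how much the selected set varies from sample to sample.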

Finally, make sure to include relevant clinical predictors along with gene expression in your model. There's a risk that your gene-expression values will just be serving as a proxy for clinical status as it's evaluated in the standard of care. Thus you need to show that the gene-expression data add something useful for prognostication or for understanding disease progression or therapy resistance.
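One way to do this in glmnet, sketched here on simulated data, is to put the clinical covariates in the same matrix as the genes and use the `penalty.factor` argument to leave them unpenalized, so that the genes only enter if they add something beyond the clinical variables. The covariates `age` and `grade` are hypothetical placeholders, not variables from the question.

```r
library(glmnet)

set.seed(2)
n <- 200
p <- 50  # small stand-in for the full gene set

genes <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("gene", seq_len(p))))
# Hypothetical clinical covariates
clinical <- cbind(age = rnorm(n, mean = 60, sd = 10),
                  grade = rbinom(n, 1, 0.5))

# Simulated survival driven by both clinical variables and one gene
lp <- 0.03 * (clinical[, "age"] - 60) + 0.7 * clinical[, "grade"] +
  0.6 * genes[, 1]
event_time <- rexp(n, rate = exp(lp))
cens_time <- rexp(n, rate = 0.3)
y <- cbind(time = pmin(event_time, cens_time),
           status = as.numeric(event_time <= cens_time))

x <- cbind(clinical, genes)

# Penalty factor 0 = never penalized (always kept); 1 = usual penalty
pf <- c(rep(0, ncol(clinical)), rep(1, ncol(genes)))

cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 1, penalty.factor = pf)
b <- as.matrix(coef(cv_fit, s = "lambda.min"))

# Clinical coefficients are always in the model
clinical_coefs <- b[colnames(clinical), 1]
# Genes that survive the penalty after adjusting for clinical covariates
genes_kept <- setdiff(rownames(b)[b[, 1] != 0], colnames(clinical))
```

Comparing the cross-validated performance of this fit with a clinical-only Cox model is then a direct check on whether the expression data add prognostic information.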

EdM
  • Feature selection is a bad idea in general, and when there are collinearities it's a disaster due to instability. – Frank Harrell Oct 12 '21 at 13:40
  • @FrankHarrell a counter-example is if gene-expression analysis is a first step in developing a practical test based on a few dozen genes, as in [breast cancer](https://www.cancer.org/cancer/breast-cancer/understanding-a-breast-cancer-diagnosis/breast-cancer-gene-expression.html). The particular genes selected for Oncotype DX, MammaPrint, and Prosigna tests might well have been selected arbitrarily from sets of correlated genes, but insofar as they represent underlying biological processes they work in practice. Claims that one has found the "most important genes" this way are of course wrong. – EdM Oct 12 '21 at 14:01
  • Not so fast. If you could show that the best available non-parsimonious method yields $R^{2}=0.2$ and the small set of genes yields $R^{2}=0.03$ then you're fooling yourself about the value of the small gene set. And if the gene set is arbitrary isn't that also a problem? – Frank Harrell Oct 12 '21 at 20:58
  • Thanks, I appreciate the discussion. "There's a risk that your gene-expression values will just be serving as a proxy for clinical status as it's evaluated in the standard of care." yes, this is one of the reasons I am actually using a cox model and don't simply look into correlations for example. – florian Oct 13 '21 at 13:20
  • That's a great point. The probability that the gene set adds new information not already measured by clinical parameters is adversely affected by poor quality of statistical analysis to find that gene set, and by attempts at parsimony. – Frank Harrell Oct 13 '21 at 20:39
  • @FrankHarrell the Prosigna clinical breast cancer test shows the potential of well-implemented feature selection, not as an end in itself but as a first step. In 2001, a Cox-model score-based method was used to find genes whose [expressions in 8000-gene microarrays were associated with survival](https://dx.doi.org/10.1073%2Fpnas.191367098). This was condensed into a [50-gene list](https://dx.doi.org/10.1200%2FJCO.2008.18.1370), which was refined into a [NanoString-based](https://dx.doi.org/10.1186%2Fs12920-015-0129-6), FDA-approved test used to improve prognostication in early-stage disease. – EdM Oct 14 '21 at 00:41
  • I challenge you to demonstrate that this result does not fall into the "$R^{2}=0.03$" vs. "$R^{2}=0.2$" setting described above. Furthermore, it may be possible to demonstrate that the non-"selected" features in total have more predictive information than the 50-gene list, if parsimony is not forced upon the non-selected features. – Frank Harrell Oct 14 '21 at 11:41
  • @FrankHarrell if it's $R^2 = 0.2$ versus $R^2=0.03$, that's one thing. If it's $R^2 = 0.2$ versus $R^2=0.17$, it's another. If it hasn't been done already, The Cancer Genome Atlas (TCGA) should have enough breast cancer gene-expression and clinical data to evaluate all ~20000 genes against the few dozen genes used in each of the 3 approved clinical tests. So, challenge accepted, although it might take a few months of spare time to do it. My fear is opposite yours: that the extra genes in these tests add little to what's available from the few routinely evaluated by immunohistochemistry. – EdM Oct 14 '21 at 13:49