Method for selecting training data and identifying predictive genes in a survival model

Question

I am performing a coxph survival analysis and want to split my data into a training set and a test set. During training I would like to identify the "best" genes that predict survival. Since the test set has to be completely blind, how should I select the training set? I can randomly sample rows for training, then use the rows that were not sampled as the test set. However, if I repeat the random sampling several times, all the rows in the dataset will eventually be used for training, and the test set is no longer unique. Any suggestions on how to do this?

df.t <- structure(list(hsa_miR_105_5p = c(3.58497328179801, 5.73145238130165, 
1.19037294682376, -1.28586123284671, 1.27004401721869, 0.958088884635556
), hsa_miR_17_3p = c(1.21345556145455, 4.71642723353062, 5.87616915208789, 
0.776249937585565, 4.86437477300888, 1.71876771352689), hsa_miR_3916 = c(6.74863569372315, 
3.23155618956527, -0.105259761381448, -1.28586123284671, 4.60953338597123, 
2.95060221832751), hsa_miR_1295a = c(-1.35668910756094, 0.147551018264645, 
2.44220202218853, -1.28586123284671, 5.47367734142336, -0.135507425889107
)), row.names = c("86", "175", "217", "394", "444", "618"), class = "data.frame")

Time <- structure(c(1796, 1644.04166666667, 606.041666666667, 1327.04166666667, 
665, 2461), class = "difftime", units = "days")

Status <- c(0L, 0L, 1L, 0L, 1L, 0L)


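library(survival)  # provides coxph() and Surv()

# Fit one univariate Cox model per miRNA and capture the printed summaries
cox.out <- capture.output(for (i in colnames(df.t)) {
  print(summary(coxph(as.formula(paste0("Surv(Time, Status) ~ ", i)), data = df.t)))
})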

score 0 · Answer 1 · answered Dec 05 '20 at 15:46

Unless you have about 15,000 or more cases, a single fixed test/train split poses dual problems of (1) not using all of your data to identify the "best" genes* and (2) using small test sets that inherently provide high-variance estimates of model performance. You seem to sense this, given your interest in doing repeated test/train splits to take advantage of all your data. What you describe is related to cross validation, in which all of the data are used ultimately for both training and test sets. There's nothing wrong with that, but it's important to acknowledge, as you do, the limitations of any data modeling process that you use.

With data sets of the size typical of gene-expression studies, a better approach than a fixed test/train split is to use all of your data to build the original model, then use a resampling scheme to evaluate the reliability of the entire modeling process. Cross validation is one resampling method for that purpose. If you choose cross validation, you should do multiple rounds with different data splits, to get more reliable error estimates than you would with a single 5-fold or 10-fold cross validation.
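As a rough illustration of repeated cross validation applied to the whole modeling process, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the simulated `dat`, the hypothetical helper `select_gene()` (which mimics a one-gene-at-a-time search by picking the smallest univariate Wald p-value), and the choice of the concordance index as the quality measure. The point it tries to show is that gene selection is redone inside every training fold, so the held-out fold never influences which gene is chosen.

```r
library(survival)

set.seed(42)
n <- 150; p <- 20
genes <- as.data.frame(matrix(rnorm(n * p), n, p))
names(genes) <- paste0("gene", seq_len(p))
gene_names <- names(genes)

# Simulated outcome (assumption): only gene1 affects the hazard
event_time <- rexp(n, rate = exp(0.7 * genes$gene1) / 1000)
cens_time  <- runif(n, 500, 3000)
dat <- data.frame(time = pmin(event_time, cens_time),
                  status = as.integer(event_time <= cens_time),
                  genes)

# Hypothetical helper: return the gene with the smallest univariate Wald p-value
select_gene <- function(d) {
  pvals <- sapply(gene_names, function(g) {
    fit <- coxph(as.formula(paste0("Surv(time, status) ~ ", g)), data = d)
    summary(fit)$coefficients[1, "Pr(>|z|)"]
  })
  names(which.min(pvals))
}

k <- 5       # folds per round
n_rep <- 10  # rounds of cross validation, each with a fresh random split

# One round: select and fit on k-1 folds, score concordance on the held-out fold
cv_once <- function() {
  fold <- sample(rep(seq_len(k), length.out = n))
  per_fold <- sapply(seq_len(k), function(j) {
    train <- dat[fold != j, ]
    test  <- dat[fold == j, ]
    g   <- select_gene(train)
    fit <- coxph(as.formula(paste0("Surv(time, status) ~ ", g)), data = train)
    test$lp <- predict(fit, newdata = test, type = "lp")
    # reverse = TRUE because a higher linear predictor means shorter survival
    concordance(Surv(time, status) ~ lp, data = test, reverse = TRUE)$concordance
  })
  mean(per_fold)
}

cindex <- replicate(n_rep, cv_once())
summary(cindex)
```

The spread of `cindex` across rounds gives a sense of how much a single 5-fold estimate can vary with the particular split.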

I prefer bootstrap resampling for that evaluation, as bootstrapping directly mimics the process of taking repeated independent samples from the full population. This is one outline of how to use bootstrapping for model evaluation.
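For a sense of what bootstrap evaluation can look like in practice, the sketch below uses `cph()` and `validate()` from the `rms` package on simulated data. The data, the three hypothetical gene columns, the number of resamples, and the use of backward selection (`bw = TRUE`) as a stand-in for a gene-selection step are all assumptions for illustration, not a recommendation of that selection method. `validate()` repeats the fit (and the selection) on each bootstrap sample and reports optimism-corrected indexes.

```r
library(survival)
library(rms)

set.seed(7)
n <- 200
d <- data.frame(gene1 = rnorm(n), gene2 = rnorm(n), gene3 = rnorm(n))

# Simulated outcome (assumption): only gene1 affects the hazard
event_time <- rexp(n, rate = exp(0.6 * d$gene1) / 1000)
cens_time  <- runif(n, 500, 3000)
d$time   <- pmin(event_time, cens_time)
d$status <- as.integer(event_time <= cens_time)

# x = TRUE, y = TRUE store the design matrix and response needed for resampling
fit <- cph(Surv(time, status) ~ gene1 + gene2 + gene3, data = d,
           x = TRUE, y = TRUE)

# Bootstrap validation; bw = TRUE repeats backward selection in every resample
val <- validate(fit, method = "boot", B = 100, bw = TRUE, dxy = TRUE)

dxy_apparent  <- val["Dxy", "index.orig"]
dxy_optimism  <- val["Dxy", "optimism"]
dxy_corrected <- val["Dxy", "index.corrected"]
```

The optimism estimate is the average amount by which models built on bootstrap samples look better on their own sample than on the full data; subtracting it from the apparent index gives the corrected value.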

*Your one-gene-at-a-time approach to finding the "best" genes might not be the best choice; it's typically better to identify sets of genes that together are related to outcome. LASSO is one principled way to find such sets of genes, and can be implemented for survival models with the R glmnet package, for example. The rest of the answer applies regardless of which method you use to find the "best" genes.
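A minimal sketch of LASSO for a Cox model with `glmnet` might look like the following. The simulated expression matrix `x`, the two "true" genes, and the choice of `lambda.min` are assumptions made for illustration only; with real data the rows of `x` would be samples and the columns genes or miRNAs.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Simulated outcome (assumption): only gene1 and gene2 affect the hazard
lp <- 0.8 * x[, "gene1"] - 0.8 * x[, "gene2"]
event_time <- rexp(n, rate = exp(lp) / 1000)
cens_time  <- runif(n, 500, 3000)

# glmnet's Cox family accepts a two-column response named "time" and "status"
y <- cbind(time = pmin(event_time, cens_time),
           status = as.integer(event_time <= cens_time))

# Cross-validated LASSO path; the penalty is chosen by partial likelihood deviance
cvfit <- cv.glmnet(x, y, family = "cox", nfolds = 10)

# Genes with non-zero coefficients at the deviance-minimizing penalty
b <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
selected <- names(b)[b != 0]
selected
```

Note that the cross validation inside `cv.glmnet()` only tunes the penalty. Evaluating the resulting model still calls for an outer resampling loop that repeats this whole procedure.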

Thanks for the detailed answer, appreciated. I think an x-fold cross validation is a way to start. Do I need a cross validation method developed for coxph models, or will any do the job? Any function to suggest? - user2300940, Dec 05 '20 at 17:25
@user2300940 either cross validation or bootstrapping requires a measure of model quality. So the function needs to provide a measure appropriate for a Cox model. For general information on this, I strongly recommend looking into the [R `rms` package](https://cran.r-project.org/package=rms), which includes several measures for Cox model validation and calibration. The [`glmnet`](https://cran.r-project.org/package=glmnet) package uses partial likelihood deviance as a measure for hyper-parameter selection in LASSO, a modeling approach I strongly recommend for your application. - EdM, Dec 05 '20 at 17:38
I looked at the censboot function. I don't know if it is suited, but it seems to be developed for this purpose. - user2300940, Dec 05 '20 at 17:42
@user2300940 you have to write code for your own evaluation `statistic` if you use that function. I sense that you are moving very quickly in looking for particular functions. Based on my own experience, I think that it will be in your best interest to step back a bit and first devote some serious study to the issues of evaluating Cox models and resampling. Then you will be better equipped to determine just what functions you need to implement to meet the goals of your project. You will also be able to respond intelligently to questions from reviewers when you go to publish. - EdM, Dec 05 '20 at 17:56
