I am performing a Cox proportional hazards (coxph) survival analysis and want to split my data into a training set and a test set. During training I would like to identify the "best" genes for predicting survival. Since the test set has to remain completely blind, I wonder how I should select the training set. I can randomly sample rows for training and use the rows that were not sampled as the test set. However, if I repeat the random sampling several times, all rows in the dataset will eventually be used for training, and the test set is no longer unique. Any suggestions on how to do this?
df.t <- structure(list(hsa_miR_105_5p = c(3.58497328179801, 5.73145238130165,
1.19037294682376, -1.28586123284671, 1.27004401721869, 0.958088884635556
), hsa_miR_17_3p = c(1.21345556145455, 4.71642723353062, 5.87616915208789,
0.776249937585565, 4.86437477300888, 1.71876771352689), hsa_miR_3916 = c(6.74863569372315,
3.23155618956527, -0.105259761381448, -1.28586123284671, 4.60953338597123,
2.95060221832751), hsa_miR_1295a = c(-1.35668910756094, 0.147551018264645,
2.44220202218853, -1.28586123284671, 5.47367734142336, -0.135507425889107
)), row.names = c("86", "175", "217", "394", "444", "618"), class = "data.frame")
Time <- structure(c(1796, 1644.04166666667, 606.041666666667, 1327.04166666667,
665, 2461), class = "difftime", units = "days")
Status <- c(0L, 0L, 1L, 0L, 1L, 0L)
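library(survival)  # provides Surv() and coxph()
# Fit one univariate Cox model per miRNA column and capture the printed summaries
cox.out <- capture.output(for (i in colnames(df.t)) {
  print(summary(coxph(as.formula(paste0("Surv(Time, Status) ~ ", i)), data = df.t)))
})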
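
The single random split described in the question could be sketched as below. This is only a minimal illustration: `split_train_test`, the 70/30 fraction and the seed value are assumptions, not part of the original code. Fixing the seed means the same rows end up in the test set on every run, so the test set stays unique as long as the split is done once and never redrawn.

```r
# Hypothetical helper (not from the original post): split row indices once,
# reproducibly, into a training set and a held-out test set.
# Note: set.seed() changes the global RNG state.
split_train_test <- function(n, train_frac = 0.7, seed = 42) {
  set.seed(seed)                       # fixed seed: same split on every run
  train <- sort(sample(seq_len(n), size = floor(train_frac * n)))
  list(train = train, test = setdiff(seq_len(n), train))
}

idx <- split_train_test(n = 6)         # 6 = nrow(df.t) in the example data

# Gene selection would then use only the training rows, e.g.
#   df.t[idx$train, ], Time[idx$train], Status[idx$train]
# while the rows in idx$test are used once, for the final evaluation.
```

If several random splits are wanted instead, that is a different design (repeated hold-out or cross-validation), and the gene selection would need to be redone inside each split so that no test row influences it.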