Training, testing, validating in a survival analysis problem

Question

I've been browsing various threads here, but I don't think my exact question is answered.

I have a dataset of ~50,000 students and their time to dropout. I am going to be performing proportional hazards regression with a large number of potential covariates. I am also going to do logistic regression on dropout/stay in. The main goal will be prediction for new cohorts of students, but we have no reason to believe they will vary much from last year's cohort.

Usually I don't have such a luxury of data and fit models with some sort of penalization. This time I thought of splitting the data into training and test sets, doing the variable selection on the training set, and then using the test set to estimate parameters and predictive capacity.

Is this a good strategy? If not, what is better?

Citations welcome but not necessary.

score 8 · Answer 1 · answered Apr 05 '14 at 12:57

With a similar outcome frequency, I have found that data splitting can work if $n > 20{,}000$. It provides an unbiased estimate of model performance, properly penalizing for model selection, provided you use the test sample only once. (That is if you really need model selection; penalization is still more likely to result in a better model.) BUT don't use the test sample for any re-estimation of parameters. Data splitting relies on the model built from the training sample being put into "deep freeze" and applied to the test sample without tweaking.
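As a minimal sketch of the "deep freeze" idea, the following splits once with a fixed seed, does all estimation on the training rows, and applies the frozen result to the test rows a single time. The data are simulated, and the per-group dropout rate is only a toy stand-in for a Cox or logistic model; the names `split_once`, `fit_group_rates`, and `brier_score` are hypothetical helpers, not part of any library.

```python
import random

def split_once(n_rows, test_fraction=0.2, seed=2014):
    """Return disjoint (train_idx, test_idx) from a single seeded shuffle."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_rows * test_fraction))
    return idx[n_test:], idx[:n_test]

def fit_group_rates(rows):
    """Toy 'model': dropout rate per group, estimated only from the rows given."""
    counts = {}
    for group, dropped in rows:
        n, d = counts.get(group, (0, 0))
        counts[group] = (n + 1, d + dropped)
    return {g: d / n for g, (n, d) in counts.items()}

def brier_score(model, rows):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((model[g] - y) ** 2 for g, y in rows) / len(rows)

# Simulated students (assumption: three groups with made-up dropout rates).
rng = random.Random(0)
true_rate = {"A": 0.1, "B": 0.3, "C": 0.5}
data = []
for _ in range(50_000):
    g = rng.choice("ABC")
    data.append((g, 1 if rng.random() < true_rate[g] else 0))

train_idx, test_idx = split_once(len(data))
train = [data[i] for i in train_idx]
test = [data[i] for i in test_idx]

# All selection and estimation happens on the training rows only.
frozen_model = fit_group_rates(train)

# The frozen model is applied to the test rows once, with no refitting.
test_brier = brier_score(frozen_model, test)
print(round(test_brier, 4))
```

The point of the design is that nothing computed from `test` ever flows back into `frozen_model`.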

Frank Harrell

Thanks. Would you recommend 80-20? 90-10? Something else? Any references on this? – Peter Flom Apr 05 '14 at 13:04
I have not kept up with the literature regarding optimum split configuration. But some general principles apply. For the validation sample you need $n$ large enough that you can estimate the calibration curve with great precision; then you need to see that what's left is more than adequate for reliable model fitting (using, say, a 20:1 ratio of events:candidate parameters if you don't penalize). – Frank Harrell Apr 05 '14 at 13:22
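The 20:1 rule of thumb in the comment above can be turned into a quick arithmetic check. This is only a sketch: the total of 7,500 dropout events is a made-up figure (the question does not give the dropout rate), events are assumed to split in proportion to the sample, and `max_candidate_parameters` is a hypothetical helper name.

```python
def max_candidate_parameters(n_total, n_events, test_fraction, events_per_parameter=20):
    """Candidate parameters the training set can support without penalization,
    assuming events are spread proportionally between training and test rows."""
    n_train = n_total - round(n_total * test_fraction)
    train_events = n_events * n_train // n_total  # integer arithmetic, rounds down
    return train_events // events_per_parameter

# Assumed figures: 50,000 students, 7,500 observed dropouts (hypothetical).
for frac in (0.1, 0.2, 0.5):
    print(frac, max_candidate_parameters(50_000, 7_500, frac))
```

For a proportional hazards model, `n_events` would be the number of uncensored dropouts, not the number of students.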

score 5 · Answer 2 · answered Apr 05 '14 at 02:22

I've been looking at this paper myself for the similar task of cross-validating survival prediction. The good bits start at Chapter 2.

Cam.Davidson.Pilon

This appears to compare 5 fold to model CV based estimation (and it concludes that 5 fold is better). But I was more interested in just splitting the data in 2 parts and using one to validate the other. – Peter Flom Apr 05 '14 at 11:16
The take-away I found from this, and why I was originally attracted to this paper, was how to deal with censoring in survival predictions, i.e. what loss function to use (though rereading your question, you may not have censoring). – Cam.Davidson.Pilon Apr 05 '14 at 17:40
I do have censoring and the dissertation is interesting, but it's not an answer to my question, I don't think. – Peter Flom Apr 05 '14 at 18:35

score 2 · Answer 3 · answered Apr 05 '14 at 13:33

I have since found this paper, which not only answers my question but also provides a method for figuring out the optimal split for particular data sets. I found it thanks to @FrankHarrell's use of the term "optimum split configuration", which I then Googled.

Peter Flom

Peter, I think that paper used an improper scoring rule. Different results may be obtained when using proper scoring rules. Also, the paper did not address "volatility" of the analysis. With the small total sample sizes considered there, repeating the process using a different random split will result in much different models and much different accuracy when compared to the first split. I see that as very undesirable. – Frank Harrell Apr 05 '14 at 14:09
@FrankHarrell: I see your point and it is indeed a very good point. What then do you recommend doing? Perform Monte Carlo runs of train/test splits and then on each run do i x k-fold CV (or bootstrapping)? But then this would contaminate the entire dataset.... I see no better solution than to find an appropriate way to split the dataset into train and test sets (what would the criteria be?). I'm just not comfortable using the whole dataset to train and validate (using CV or bootstrapping) the models, from which one (or several) will be used to predict unknown output values based on some input data. – jpcgandre Sep 13 '14 at 02:17
I addressed that in the post you just put on another topic page. – Frank Harrell Sep 13 '14 at 13:32
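The volatility concern raised in the comments above can be illustrated with a small simulation. This is a hedged sketch, not the analysis from either paper: outcomes are simulated with an assumed 30% event rate, the "model" is just the training-sample event rate, and accuracy is measured with the Brier score, which is a proper scoring rule. The names `simulate_outcomes` and `brier_over_splits` are hypothetical.

```python
import random
import statistics

def simulate_outcomes(n, rate=0.3, seed=0):
    """Simulated 0/1 outcomes with an assumed event rate."""
    rng = random.Random(seed)
    return [1 if rng.random() < rate else 0 for _ in range(n)]

def brier_over_splits(outcomes, n_splits=50, test_fraction=0.2, seed=1):
    """Test-set Brier score of an intercept-only model over repeated random splits."""
    rng = random.Random(seed)
    n = len(outcomes)
    n_test = int(round(n * test_fraction))
    scores = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test = [outcomes[i] for i in idx[:n_test]]
        train = [outcomes[i] for i in idx[n_test:]]
        p_hat = sum(train) / len(train)  # fitted on training rows only
        scores.append(sum((p_hat - y) ** 2 for y in test) / len(test))
    return scores

# Spread of the test estimate across splits, for a small and a large sample.
for n in (200, 20_000):
    scores = brier_over_splits(simulate_outcomes(n))
    print(n, round(statistics.stdev(scores), 4))
```

With the small sample, the test-set estimate moves noticeably from one random split to the next; with the large sample the spread is much narrower, which is consistent with the advice that a single split is only reasonable when $n$ is large.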
