I've been browsing various threads here, but I don't think my exact question is answered.
I have a dataset of ~50,000 students and their time to dropout. I am going to fit a proportional hazards regression with a large number of potential covariates. I am also going to fit a logistic regression on the binary outcome (dropped out vs. stayed in). The main goal is prediction for new cohorts of students, and we have no reason to believe they will differ much from last year's cohort.
Usually I don't have the luxury of this much data, so I fit models with some sort of penalization. This time, though, I was thinking of splitting the data into training and test sets, doing the variable selection on the training set, and then using the test set for estimating parameters and predictive capacity.
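To make the plan concrete, this is roughly the split I have in mind. It is only a minimal Python sketch: the function name `split_indices`, the 50/50 fraction, and the seed are illustrative assumptions, not choices I have settled on, and the selection and fitting steps are only indicated in comments.

```python
import random


def split_indices(n_rows, test_fraction=0.5, seed=0):
    """Randomly partition row indices 0..n_rows-1 into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # Seeded shuffle so the same split can be reproduced later.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


# ~50,000 students, as in my data; the 50/50 fraction is an assumption.
train_idx, test_idx = split_indices(50_000, test_fraction=0.5, seed=0)

# Step 1 (training half): variable selection for the Cox and logistic models.
# Step 2 (test half): estimate parameters of the selected model and
#                     assess its predictive capacity.
print(len(train_idx), len(test_idx))
```

Each student lands in exactly one of the two halves, so nothing used for variable selection would be reused in the second step.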
Is this a good strategy? If not, what is better?
Citations welcome but not necessary.