Objective: Build a screening tool to identify people at risk of X.
Approach: Using data from contexts A and B, we explored logistic regression models to predict X. We performed forward and backward variable selection and chose the model based on AIC and BIC (using an elbow plot). For ease of use in the field, we converted the model coefficients into an integer-valued scoring tool. The final tool has 4 variables, chosen from roughly 15-20 candidates. There are about 800 samples from each context, and the prevalence of X is around 10%.
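For concreteness, here is a minimal sketch of that kind of pipeline (not the authors' actual code): greedy forward selection by AIC with statsmodels, followed by rounding coefficients into integer points. The function names, the scaling constant, and using AIC alone (rather than AIC and BIC together) are illustrative assumptions on my part.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select_aic(X: pd.DataFrame, y: pd.Series) -> list[str]:
    """Greedy forward selection: repeatedly add the candidate that lowers AIC most."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.Logit(y, np.ones(len(y))).fit(disp=0).aic  # intercept-only baseline
    improved = True
    while improved and remaining:
        improved = False
        aics = {}
        for var in remaining:
            design = sm.add_constant(X[selected + [var]])
            aics[var] = sm.Logit(y, design).fit(disp=0).aic
        best_var = min(aics, key=aics.get)
        if aics[best_var] < best_aic:
            best_aic = aics[best_var]
            selected.append(best_var)
            remaining.remove(best_var)
            improved = True
    return selected

def integer_scores(fitted_model, scale: float = 1.0) -> dict[str, int]:
    """Round (coefficient / scale) to integers for a paper-and-pencil score.
    `scale` is often the smallest absolute coefficient; 1.0 here is just a placeholder."""
    coefs = fitted_model.params.drop("const")
    return {name: int(round(beta / scale)) for name, beta in coefs.items()}
```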
Validation: In addition to internal validation (performance on the data from A and B), we assessed performance on data from another context, C; it was slightly lower but similar. The performance metrics were AUC, plus sensitivity and specificity at various score thresholds.
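As a reference point, the reported metrics could be computed along these lines (a sketch only; the outcome array `y` and integer `score` array are hypothetical names):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def screening_performance(y: np.ndarray, score: np.ndarray) -> None:
    """Print AUC, then sensitivity/specificity at each candidate score cut-off."""
    print(f"AUC: {roc_auc_score(y, score):.3f}")
    for cutoff in np.unique(score):
        pred = (score >= cutoff).astype(int)
        tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        print(f"score >= {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```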
Questions: With the goal of getting the best screening performance (sensitivity/specificity trade-off) in the field, shouldn't we make better use of the data from context C? Specifically, should we:
- Do leave-one-context-out cross-validation and check (a) whether the same variables are selected, and (b) whether performance changes? (A sketch of this check appears after this list.)
- For the final tool, build the model from all available data? Yes, we can no longer estimate external validity (though point 1 gives us a sense of it), but we can still estimate internal validity. And isn't it highly unlikely that the true performance would decrease by adding more data? Given the number of variables, the number of data points, and the model type, I would think overfitting is unlikely.
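For the leave-one-context-out idea in point 1, I imagine something like the sketch below; the DataFrame layout, column names, and the `select_variables` placeholder are my assumptions, and the actual forward/backward AIC-BIC selection would need to be rerun inside each fold for the variable-stability check to be meaningful.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def select_variables(train: pd.DataFrame, candidates: list[str], outcome: str) -> list[str]:
    # Placeholder: rerun the actual forward/backward AIC-BIC selection here.
    return candidates

def leave_one_context_out(df: pd.DataFrame, candidates: list[str], outcome: str = "y") -> dict:
    """Hold out each context in turn, reselect variables, refit, and score held-out AUC."""
    results = {}
    for held_out in df["context"].unique():
        train = df[df["context"] != held_out]
        test = df[df["context"] == held_out]
        variables = select_variables(train, candidates, outcome)
        model = sm.Logit(train[outcome], sm.add_constant(train[variables])).fit(disp=0)
        pred = model.predict(sm.add_constant(test[variables], has_constant="add"))
        results[held_out] = {"variables": variables,
                             "auc": roc_auc_score(test[outcome], pred)}
    return results
```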
Note: I'm a supporting author, so my understanding of the exact methodological details to date is not perfect.
Thanks,