I am building a logistic regression (glm function in R) to predict whether a patient is diseased or not. The sample I have now contains 111 diseased and 682 non-diseased patients. From here, I understand that it is not a good idea to separate the 793 patients I have into training and validation sets when the number of cases (diseased) is this small. My model is likely to have at least five continuous variables and three categorical variables (with 2, 4 and 5 levels respectively). I will probably need to collapse some levels because of small counts. How should I go about validating and testing the model under these circumstances?
tatami
1 Answer
Without having looked at your data, the sample size and number of predictors do not look too small, assuming you will be fitting a regularized model instead of vanilla logistic regression.
You can first try stratified 10-fold CV, which keeps the proportion of diseased patients roughly equal in every fold and should give good coverage. If this does not work (the model fits poorly across all folds), you can try leave-one-out cross-validation, although its estimates are known to have high variance.
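As a rough illustration of stratified 10-fold CV, here is a minimal base R sketch. It uses an unpenalized `glm` fit and AUC, since those are what the question mentions. The data are simulated stand-ins, and the names `x1`, `x2`, `g`, `stratified_folds` and `auc` are made up for this example; none of them come from the original post.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): 793 patients with a minority of cases
n <- 793
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(-2.2 + 0.8 * d$x1 - 0.5 * d$x2))

# AUC from the Mann-Whitney rank-sum statistic (ties get average ranks)
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Assign fold labels separately within each outcome class, so that every
# fold holds a similar share of diseased and non-diseased patients
stratified_folds <- function(y, k) {
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  fold
}

k    <- 10
fold <- stratified_folds(d$y, k)

# Fit on k - 1 folds, score the held-out fold
cv_auc <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x1 + x2 + g, family = binomial, data = d[fold != i, ])
  p   <- predict(fit, newdata = d[fold == i, ], type = "response")
  auc(d$y[fold == i], p)
})

mean_cv_auc <- mean(cv_auc)
print(round(cv_auc, 3))
print(mean_cv_auc)
```

With only about ten cases per held-out fold, the per-fold AUCs will vary a lot, so the spread across folds is worth looking at alongside the mean. Any level collapsing or variable selection would need to be repeated inside each fold for the estimate to stay honest.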

won782
- I will be using vanilla logistic regression (the glm function in R) to fit the model. Would CV still work in this case? Interpretation is also of interest, even though it is a predictive model. AUC would also be used to assess accuracy. I am not sure whether a regularised model would affect interpretation. What are the benefits of using a regularized model? – tatami Jan 20 '18 at 17:34
- CV can work if you do 100 repeats of 10-fold CV. Otherwise it's too imprecise. Bootstrapping is better, with fewer than 1000 model re-fits. You are right to avoid data splitting. – Frank Harrell Jan 20 '18 at 19:25
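One common way to carry out the bootstrap validation mentioned in the last comment is the optimism-corrected bootstrap. The base R sketch below is one possible reading of that suggestion, not necessarily the exact procedure the commenter had in mind. The data are simulated, and `x1`, `x2`, `auc` and the choice of `B = 200` resamples are assumptions made for illustration.

```r
set.seed(2)

# Simulated stand-in data (hypothetical), roughly the size described in the question
n <- 793
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-2.2 + 0.8 * d$x1 - 0.5 * d$x2))

# AUC from the Mann-Whitney rank-sum statistic (ties get average ranks)
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

f <- y ~ x1 + x2

# Apparent performance: model fitted and evaluated on the same full sample
fit      <- glm(f, family = binomial, data = d)
apparent <- auc(d$y, fitted(fit))

# For each bootstrap resample: refit the model, then compare its AUC on the
# resample with its AUC on the original data. The difference estimates optimism.
B <- 200  # assumed number of resamples, chosen only to keep the example quick
optimism <- replicate(B, {
  b  <- d[sample(n, replace = TRUE), ]
  fb <- glm(f, family = binomial, data = b)
  auc(b$y, fitted(fb)) -
    auc(d$y, predict(fb, newdata = d, type = "response"))
})

# Optimism-corrected estimate of the AUC
corrected <- apparent - mean(optimism)
print(c(apparent = apparent, corrected = corrected))
```

Every modelling step that depends on the data (collapsing factor levels, selecting variables) would have to be redone inside each resample. The `rms` package provides a `validate` function that automates this kind of bootstrap validation for models fitted with `lrm`, which may be more convenient than hand-rolled code.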