Cross-validation for low event rate

Question

I have 1000 observations with 20 events (2%). Splitting them into 10 folds leaves only 2 events per fold. Splitting the training folds into 10 sub-folds for model building and optimisation leaves even fewer events per sub-fold.

Questions:

Is it correct to say that 10-fold cross-validation is not appropriate for these data?
What are the alternatives? Is repeating 2-fold validations 1000 times a better option?

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

Is it correct to say that 10-fold cross-validation is not appropriate for these data?

No, but you have to tweak the procedure; an example is linked below.

What are the alternatives? Is repeating 2-fold validations 1000 times a better option?

You can look at stratified cross-validation here. Since the data are highly imbalanced, also consider resampling options such as SMOTE. But be sure to resample only the training data and not the test data, i.e., resample within each fold, as described here. The links below have some good discussion:

a. Dealing with imbalanced data: undersampling, oversampling and proper cross-validation

b. 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
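
To make the "resample within each fold" point concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the method from the linked articles: the helper names `stratified_folds` and `oversample_minority` are hypothetical, and plain random oversampling stands in for SMOTE to keep the example dependency-free. In practice you would more likely use an existing implementation such as scikit-learn's `StratifiedKFold`.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign indices to folds so each fold gets a near-equal share of every class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate minority-class indices at random until the classes are balanced.

    This is plain random oversampling, a simpler stand-in for SMOTE.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx in train_idx:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(indices) for indices in by_class.values())
    resampled = []
    for indices in by_class.values():
        resampled.extend(indices)
        resampled.extend(rng.choices(indices, k=target - len(indices)))
    return resampled


# Toy labels matching the question: 1000 observations, 20 events.
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, n_folds=10)
events_per_fold = [sum(labels[i] for i in fold) for fold in folds]

splits = []
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Resample the training part only; the test fold keeps its original event rate.
    balanced_train = oversample_minority(train_idx, labels, seed=k)
    splits.append((balanced_train, test_idx))
    # Fit the model on balanced_train and evaluate it on test_idx here.
```

With these toy labels, every fold holds exactly 2 events and 98 non-events, and each test fold stays untouched at the original 2% event rate while only its training counterpart is balanced. Resampling before splitting would instead let copies of the same event land in both training and test data, which inflates the performance estimate.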
