
I have 1000 observations with 20 events (2%). Splitting them into 10 folds gives me only about 2 events per fold. Splitting the training folds into 10 sub-folds for model building and optimisation leaves even fewer events per sub-fold.

Questions:

  1. Is it correct to say that 10-fold cross-validation is not appropriate for these data?

  2. What are the alternatives? Is repeating 2-fold validations 1000 times a better option?

ayol

1 Answer


Is it correct to say that 10-fold cross-validation is not appropriate for these data?

No, but you have to tweak the procedure; an example is described in the links below.

What are the alternatives? Is repeating 2-fold validations 1000 times a better option?

You can look at stratified cross-validation, which keeps the event rate roughly the same in every fold. Since the data are highly imbalanced, also consider resampling options such as SMOTE. But be sure to resample only the training data and not the test data, i.e., resample within each fold. The links below have some good discussion:

a. Dealing with imbalanced data: undersampling, oversampling and proper cross-validation

b. 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
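
To make the "resample within each fold" point concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the labels are synthetic (20 events, 980 non-events, matching the question), the helper names `stratified_folds` and `oversample_minority` are made up for this example, and plain random oversampling stands in for SMOTE. The model fitting step is left as a comment.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Assign indices to k folds so each fold gets a near-equal share of every class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class out round-robin, like cards, across the folds.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate random minority-class indices until both classes are equal in size.

    This is simple random oversampling, used here as a stand-in for SMOTE.
    """
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    minority = min(counts, key=counts.get)
    n_extra = max(counts.values()) - counts[minority]
    minority_idx = [i for i in train_idx if labels[i] == minority]
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return train_idx + extra


# Synthetic labels matching the question: 20 events among 1000 observations.
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, k=10)

events_per_test_fold = []
train_balance = []
for f, test_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # Resample the training part only; the test fold keeps its original 2% rate.
    train_resampled = oversample_minority(train_idx, labels, seed=f)
    # ... fit a model on train_resampled, evaluate on test_idx ...
    events_per_test_fold.append(sum(labels[i] for i in test_idx))
    train_balance.append(Counter(labels[i] for i in train_resampled))

print(events_per_test_fold)
print(train_balance[0])
```

With stratification every test fold holds exactly 2 of the 20 events, so no fold ends up without any. Because the duplicates are drawn only from the training indices, no copy of a test observation leaks into training.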

discipulus