
I'm currently working on a classification problem. The target variable Y is 0 in 70% of cases and 1 in 30% of cases.

Does my validation set have to have this same proportion?

I ask because, after training a random forest model, I get these metrics on the training set:

               precision    recall  f1-score   support

         0       0.92      0.92      0.92      1485
         1       0.88      0.88      0.88       949

avg / total       0.91      0.91      0.91      2434

and these on the validation set:

            precision    recall  f1-score   support

         0       0.88      0.68      0.77       890
         1       0.09      0.25      0.13       110

avg / total       0.79      0.64      0.70      1000

That is to say, the model overfits on label 1. The only explanation that occurs to me is a badly built validation set. I have already tried modifying the hyperparameters, but the same phenomenon always appears.
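For reference, this is a quick way to check the label proportions of each partition (a minimal sketch: the NumPy arrays y_train and y_val below are placeholders standing in for my actual partitions):

    import numpy as np

    # Hypothetical label arrays standing in for the actual partitions.
    y_train = np.random.choice([0, 1], size=2434, p=[0.7, 0.3])
    y_val = np.random.choice([0, 1], size=1000, p=[0.7, 0.3])

    # Compare the class counts and proportions of each partition.
    for name, y in [("train", y_train), ("validation", y_val)]:
        counts = np.bincount(y)
        print(name, counts, counts / counts.sum())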


1 Answer


Unbalanced data (specifically, unbalanced labels) can certainly be a problem, but whether it is in your case depends on your application and what you are trying to do. Here is a nice discussion from another SO thread on when it can be a problem. Potential strategies include weighting, stratification, and evaluation metrics that balance precision and recall, but the right choice will depend on your objective.
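For instance, here is a minimal sketch of class weighting with scikit-learn (your reports look like output from sklearn.metrics.classification_report); the synthetic data and hyperparameters below are assumptions for illustration, not your actual setup:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic 70/30 data mimicking the question's class balance.
    X, y = make_classification(n_samples=3434, weights=[0.7, 0.3],
                               random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=1000, stratify=y, random_state=0)

    # class_weight='balanced' reweights classes inversely to their
    # frequency, so errors on the rare label 1 cost more during training.
    clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                                 random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_val, clf.predict(X_val)))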

  • Thank you for your answer. My first strategy will be to adjust the validation set. Should the validation set have the same ratio of categories as the whole dataset? – ozo Dec 24 '18 at 20:50
  • Yes, it is likely that you should stratify on your label when you create your partitions for training and validation (see the sketch below). But there could still be worries about external validity to think about as well. – Noah Hammarlund Dec 25 '18 at 01:13
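A minimal sketch of such a stratified partition, assuming scikit-learn; X and y are placeholder arrays standing in for the real features and labels:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder features and 70/30 labels standing in for the real data.
    X = np.random.rand(3434, 10)
    y = np.random.choice([0, 1], size=3434, p=[0.7, 0.3])

    # stratify=y preserves the overall label ratio in both partitions.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    print(y_train.mean(), y_val.mean())  # both close to 0.30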