I have labelled data composed of 10000 positive examples and 50000 negative examples, giving a total of 60000 examples. Obviously, this data is imbalanced.
Now let us say I want to create my validation set, and I want to use 10% of my data to do so. My question is as follows:
Should I make sure that my validation set is ALSO imbalanced (as a nod to the true distribution of the training set), or should I make sure that my validation set is balanced? So, for example, should my validation set be made from:
- 10% of the positives + 10% of the negatives, giving 1000+ and 5000- examples. (This validation set reflects the original data imbalance.)
- Or 10% of the positives, giving 1000+, and 2% of the negatives (10%/5 = 2%), also giving 1000- examples. (This validation set is balanced.)
(Same question for the test set).
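To make the two options concrete, here is a minimal sketch of how I would build each split. The arrays are toy stand-ins for my real data, and the stratified case assumes scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for my real data: 10000 positives, 50000 negatives.
X = np.random.randn(60000, 20)
y = np.concatenate([np.ones(10000, dtype=int), np.zeros(50000, dtype=int)])

# Option 1: stratified 10% split -- the validation set keeps the 1:5
# imbalance (1000 positives, 5000 negatives).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Option 2: balanced validation set -- 1000 positives and 1000 negatives
# (i.e. 10% of the positives but only 2% of the negatives); the rest
# goes to training.
rng = np.random.default_rng(0)
pos_idx = rng.permutation(np.where(y == 1)[0])
neg_idx = rng.permutation(np.where(y == 0)[0])
val_idx = np.concatenate([pos_idx[:1000], neg_idx[:1000]])
train_idx = np.concatenate([pos_idx[1000:], neg_idx[1000:]])
X_train2, y_train2 = X[train_idx], y[train_idx]
X_val2, y_val2 = X[val_idx], y[val_idx]
```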
There seem to be plenty of methods for training with imbalanced data, but nowhere can I find best practices on whether my validation set should ALSO reflect the original imbalance. Finally, I am NOT doing cross-validation; I will be using a single validation set and a neural network.
Thanks!