I have labelled data composed of 10000 positive examples and 50000 negative examples, giving a total of 60000 examples. Obviously, this data is imbalanced.
Now let us say I want to create my validation set, and I want to use 10% of my data to do so. My question is as follows:
Should I make sure that my validation set is ALSO imbalanced (as a nod to the true distribution of the training set), or should I make sure that my validation set is balanced? So, for example, should my validation set be made from:
- 10% of the positives + 10% of the negatives, giving 1000+ and 5000- examples. (This validation set reflects the original data imbalance.)
- Or 10% of the positives, giving 1000+, and 2% of the negatives (10%/5 = 2%), also giving 1000- examples. (This validation set is balanced.)
(Same question for the test set).
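To make the two options concrete, here is a minimal sketch of how I would build each split. The arrays are toy stand-ins for my real data, and the stratified case assumes scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for my real data: 10000 positives, 50000 negatives.
X = np.random.randn(60000, 20)
y = np.concatenate([np.ones(10000, dtype=int), np.zeros(50000, dtype=int)])

# Option 1: stratified 10% split -- the validation set keeps the 1:5
# imbalance (1000 positives, 5000 negatives).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Option 2: balanced validation set -- 1000 positives and 1000 negatives
# (i.e. 10% of the positives but only 2% of the negatives); the rest
# goes to training.
rng = np.random.default_rng(0)
pos_idx = rng.permutation(np.where(y == 1)[0])
neg_idx = rng.permutation(np.where(y == 0)[0])
val_idx = np.concatenate([pos_idx[:1000], neg_idx[:1000]])
train_idx = np.concatenate([pos_idx[1000:], neg_idx[1000:]])
X_train2, y_train2 = X[train_idx], y[train_idx]
X_val2, y_val2 = X[val_idx], y[val_idx]
```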
There seem to be plenty of methods for training with imbalanced data, but nowhere can I find best practices on whether my validation set should ALSO reflect the original imbalance. Finally, I am NOT doing cross-validation; I will be using a single validation set and a neural network.
Thanks!