I have an imbalanced dataset (90% class 0, 10% class 1). Should I first split it into a training and a test set and then balance only the training set (so my test set stays imbalanced), or should I randomly downsample the majority class in the whole dataset first and then split into training and test sets (the test set would still contain no observations used in training, because of the downsampling)? I am getting very different results for each approach.
1 Answer
I would say neither of the options you suggest: use all the data you have and make sure the class distributions of the training and test sets match. That gives you the most realistic assessment of the model's performance. If your classification algorithm has trouble dealing with imbalanced data, use a different one.

In short: split into training and test sets, keeping both of them imbalanced, and don't throw data away by subsampling.
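For illustration, here is a minimal sketch of that kind of split using scikit-learn's `train_test_split` with `stratify` (the library choice and the toy data below are my own, not part of the original answer). A stratified split preserves the 90/10 class ratio in both the training and test sets, so nothing is downsampled away:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 900 samples of class 0, 100 of class 1 (hypothetical example).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]

# stratify=y keeps the same class distribution in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep roughly the original 90/10 imbalance.
print("train class fractions:", np.bincount(y_train) / len(y_train))
print("test class fractions: ", np.bincount(y_test) / len(y_test))
```

If the classifier itself struggles with the imbalance, you can then address that on the modelling side (e.g. class weights) rather than by discarding majority-class data.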

Marc Claesen
Hi Marc, would appreciate any insight you could give [here](http://stats.stackexchange.com/questions/258853/training-data-is-imbalanced-but-should-my-validation-set-also-be). – Spacey Jan 30 '17 at 05:57