I have an imbalanced dataset (90% class 0, 10% class 1). Should I first split it into a training and a test set and then balance only the training set (so my test set stays imbalanced), or should I randomly downsample the majority class in the whole dataset first and then split into training and test sets (the test set would still contain no observations used in training, because of the downsampling)? I am getting very different results for each approach.
1 Answer
I would say neither of the options you suggest: use all the data you have and make sure the class distributions of the training and test sets match. That gives you the most realistic assessment of the model's performance. If your classification algorithm has trouble dealing with imbalanced data, use a different one.

In short: split into training and test sets, keeping both of them imbalanced, and don't throw data away by subsampling.
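For illustration, here is a minimal sketch of that kind of split using scikit-learn's `train_test_split` with `stratify` (the library choice and the toy data below are my own, not part of the original answer). A stratified split preserves the 90/10 class ratio in both the training and test sets, so nothing is downsampled away:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 900 samples of class 0, 100 of class 1 (hypothetical example).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]

# stratify=y keeps the same class distribution in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep roughly the original 90/10 imbalance.
print("train class fractions:", np.bincount(y_train) / len(y_train))
print("test class fractions: ", np.bincount(y_test) / len(y_test))
```

If the classifier itself struggles with the imbalance, you can then address that on the modelling side (e.g. class weights) rather than by discarding majority-class data.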

Marc Claesen
Hi Marc, would appreciate any insight you could give [here](http://stats.stackexchange.com/questions/258853/training-data-is-imbalanced-but-should-my-validation-set-also-be). – Spacey Jan 30 '17 at 05:57