I have a dataset of 250k samples, and my minority class has only 5,000 of them, so it's quite imbalanced.
I did not apply any sampling in my model, and it turns out that when I split the data into train, test, and validation sets, fewer than 300 minority-class samples ended up in the training set.
Would it make sense to split my data into two groups, class 0 and class 1, shuffle each randomly, take 60% from each group (0.6 * 245k and 0.6 * 5k) and concatenate them for the training set, then 20% from each for validation, and the last 20% from each for the test set? In other words, a stratified split.
What are some drawbacks to this approach?
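For concreteness, here is a minimal sketch of the split I have in mind, using scikit-learn's `train_test_split` with `stratify` (where `X` and `y` are placeholders for my features and labels):

```python
from sklearn.model_selection import train_test_split

# First split off 60% for training, stratified so both classes
# keep their original proportions in every subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=42
)

# Split the remaining 40% in half: 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
```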
I am also considering built-in sampling methods such as SMOTE-Tomek, but it takes a very long time to run on a dataset this size.
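What I've tried looks roughly like the sketch below (imbalanced-learn's `SMOTETomek`, applied only to the training data); the `fit_resample` call is the slow part:

```python
from imblearn.combine import SMOTETomek

# Resample only the training set; validation and test stay untouched.
smt = SMOTETomek(random_state=42)
X_train_res, y_train_res = smt.fit_resample(X_train, y_train)
```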
I've also read online that the cross-validation split should be done prior to sampling, i.e., resampling should only ever be applied to the training folds. While that makes sense to avoid leakage, how would one then validate accuracy after sampling? Would it require nested cross-validation?
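For example, is something like the following enough, or is full nested CV still needed? This is a sketch using imbalanced-learn's `Pipeline`, which as I understand it applies the sampler only when fitting on each training fold, leaving the validation fold untouched (the `RandomForestClassifier` is just a placeholder model):

```python
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The sampler runs only during fit, i.e. on each training fold;
# scoring happens on the untouched validation fold.
pipe = Pipeline([
    ("resample", SMOTETomek(random_state=42)),
    ("model", RandomForestClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="f1")
```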