I am working on a binary random forest classifier in R. My data set consists of 300 cases of class 1 and 2100 cases of class 0. I plan to evaluate my model using its predictions and the AUC, and for that I need a test data set. I could create a test data set of approximately 300 samples, but all of them would belong to class 0. Would this affect the performance evaluation of my model?
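To make the concern concrete, here is a minimal base R sketch of the AUC computed through the rank-sum (Mann-Whitney) identity. The helper name `auc_rank` and the simulated scores are made up for illustration and are not from my real model. The point is only that the formula divides by the product of the two class counts, so a test set with no class 1 cases gives 0/0.

```r
# AUC via the rank-sum (Mann-Whitney) identity, base R only.
# `labels` are 0/1 and `scores` are predicted probabilities for class 1.
# `auc_rank` is a hypothetical helper name, not a library function.
auc_rank <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so ties are handled
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
# Made-up scores, only to illustrate the behaviour.
scores_neg <- runif(300)          # 300 class 0 cases
scores_pos <- runif(90, 0.3, 1)   # 90 class 1 cases

# Test set with class 0 only: numerator and denominator are both 0.
auc_only_neg <- auc_rank(rep(0, 300), scores_neg)
is.nan(auc_only_neg)

# Test set with both classes: the AUC is a number between 0 and 1.
auc_both <- auc_rank(c(rep(0, 300), rep(1, 90)),
                     c(scores_neg, scores_pos))
auc_both
```

With only class 0 in the test set there is no pair of one positive and one negative case to rank, which is what the AUC measures.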
Another approach is to create a training data set with 70% of the class 1 cases and 70% of the class 0 cases, and then, when running my random forest code, use the sampsize argument to draw the same number of samples from each class for every tree and so eliminate the imbalance problem. The vector would be c(210, 210). My test set for evaluating the model and calculating the AUC would be the remaining 30% of the class 1 cases (90 samples) plus 90 unseen cases from class 0.
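Roughly, the plan would look like the sketch below. The data frame `dat` and the predictors `x1` and `x2` are simulated stand-ins for my real data, and the `strata` argument is my assumption about how to make `sampsize` apply per class. The `randomForest` call only runs if the package is installed.

```r
set.seed(42)
# Simulated stand-in for the real data: 300 class 1, 2100 class 0.
# Predictors x1 and x2 are hypothetical.
dat <- data.frame(
  y  = factor(c(rep(1, 300), rep(0, 2100)), levels = c(0, 1)),
  x1 = rnorm(2400),
  x2 = rnorm(2400)
)
dat$x1 <- dat$x1 + ifelse(dat$y == 1, 1, 0)  # give class 1 some signal

# Stratified split: sample 70% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(dat)), dat$y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, round(0.7 * length(i)))
}))
train <- dat[train_idx, ]
held_out <- dat[-train_idx, ]  # 630 class 0 and 90 class 1 cases

# Test set: all 90 held-out class 1 cases plus 90 held-out class 0 cases.
neg_rows <- sample(which(held_out$y == 0), 90)
test <- rbind(held_out[held_out$y == 1, ], held_out[neg_rows, ])

table(train$y)
table(test$y)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Each tree is grown on 210 cases drawn from each class.
  rf <- randomForest::randomForest(
    y ~ ., data = train,
    strata   = train$y,
    sampsize = c(210, 210),
    ntree    = 500
  )
  prob1 <- predict(rf, newdata = test, type = "prob")[, "1"]

  # AUC on the test set via the rank-sum identity.
  n1 <- sum(test$y == 1)
  n0 <- sum(test$y == 0)
  auc <- (sum(rank(prob1)[test$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  print(auc)
}
```

In this sketch 540 of the 630 held-out class 0 cases are never used, which is the data I am worried about wasting.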
I am not sure about this, as I would be wasting a lot of data. But I think it is better than creating a test set with 0 instances of class 1.