Creating a test set with imbalanced data

Question

I am working on a binary random forest using R. My data set consists of 300 cases of class 1 and 2100 cases of class 0. I am planning to evaluate my model using the model predictions and the AUC, and for that I need a test data set. I would be able to create a test data set with approximately 300 samples, but all of the samples would belong to class 0. Would this influence the performance evaluation of my model?

Another approach to this problem is to create a training data set with 70% of the class 1 cases and 70% of the class 0 cases, and then, when running my random forest code, use the sampsize argument to draw the same number of samples per class for each tree and so eliminate the imbalance problem. The vector would be c(210, 210). My test set for evaluating the model and calculating the AUC would be the remaining 30% of the class 1 cases (90 samples) plus 90 unseen cases from class 0. I am not sure about this, as I am wasting a lot of data. But I think it is better than creating a test set with 0 instances of class 1.

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

If your data set is unbalanced and the separation is not perfect, your RF model will very sensibly predict all samples to be of the most prevalent class in the training set. Your RF model essentially implements a 'belief' that the class distribution of future predictions equals that of your training set, 2100:300. If you have reason to disagree (domain knowledge), or if the costs of false predictions are very unequal between the classes, you may correct this belief. See the first thread linked below, where this 'belief' is depicted graphically for a 3-class problem.

You are not "throwing data away" by setting sampsize = c(210, 210), as all samples will be present in at least some trees. Perhaps increase ntree to 1000 to make sure that happens. sampsize = c(300, 300) will also work fine, because the bootstrap samples with replacement. As you are modelling the difference between the two classes, a large surplus of examples from one class alone will not provide extra information.
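As a rough check on the claim that every sample ends up in at least some trees, the arithmetic can be sketched as follows. This is plain Python used only as a calculator, not randomForest itself. It assumes each tree draws 210 class 0 cases uniformly and with replacement from all 2100; that sampling model and the helper names are assumptions made for illustration.

```python
def p_in_one_tree(n_class, n_drawn):
    """Probability that a given case is drawn at least once for a single tree,
    assuming n_drawn uniform draws with replacement from n_class cases."""
    return 1.0 - (1.0 - 1.0 / n_class) ** n_drawn


def p_never_used(n_class, n_drawn, n_trees):
    """Probability that a given case is in-bag for none of n_trees trees,
    assuming the trees are sampled independently."""
    return (1.0 - p_in_one_tree(n_class, n_drawn)) ** n_trees


# Majority class from the question: 2100 cases, 210 drawn per tree.
p_tree = p_in_one_tree(2100, 210)
p_never_500 = p_never_used(2100, 210, 500)
p_never_1000 = p_never_used(2100, 210, 1000)

print("P(in-bag for one tree):       ", p_tree)
print("P(never in-bag, 500 trees):   ", p_never_500)
print("P(never in-bag, 1000 trees):  ", p_never_1000)
```

Under these assumptions a given class 0 case is in-bag for roughly 9.5% of the trees, so the chance that it is never used across several hundred trees is negligible, and raising ntree only makes it smaller.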

When using stratification (sampsize) or class weights (classwt), you are probably only slightly improving prediction performance as measured by the AUC of the ROC curve. Instead, you essentially modify the prior belief about the class distribution, so that the rare class wins more majority votes at the expense of many (not all) of these predictions being wrong.

To validate: compute the AUC and plot the ROC curve of the out-of-bag vote distribution against the true class. To compare with other models, embed all models in a repeated 10-fold CV and compute the AUC and ROC for each model.
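To make the validation step concrete, here is a minimal sketch of an AUC computed from vote fractions and true classes. It is written in plain Python rather than R, uses the pairwise (Mann-Whitney) definition of the AUC, and runs on made-up numbers; the function name and the toy data are hypothetical. It also shows why a test set containing only class 0 cannot yield an AUC: there are no positive/negative pairs to rank.

```python
def auc_from_votes(votes_pos, labels):
    """AUC as the probability that a randomly chosen class 1 case receives a
    higher vote fraction than a randomly chosen class 0 case (ties count 0.5).

    votes_pos: fraction of votes for class 1, one value per case.
    labels:    true class (0 or 1), one value per case.
    """
    pos = [v for v, y in zip(votes_pos, labels) if y == 1]
    neg = [v for v, y in zip(votes_pos, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up vote fractions, for illustration only.
toy_votes = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_auc = auc_from_votes(toy_votes, toy_labels)
print("toy AUC:", toy_auc)

# A test set with class 0 only gives no pairs to compare.
try:
    auc_from_votes([0.2, 0.1, 0.4], [0, 0, 0])
except ValueError as err:
    print("single-class test set:", err)
```

In the setting of this answer, the vote fractions would be the out-of-bag votes for class 1 from the fitted forest rather than predictions on the training data themselves.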

Here's a thread on the coding and plotting.

Here's a thread on class weights, which I do not cover.

Soren, your answer is fab. Many thanks. Do you think that I don't need a test set to evaluate the performance of the model? You mentioned that c(300, 300) will work just fine. I am not sure if it is a good idea to use the predictions from my training data to calculate the AUC. Would this not cause overfitting? — Rita A. Singer, Aug 21 '15 at 11:43
I ran the code above and, as I expected, the AUC is very close to 1. I have decided to keep the test set for the performance evaluation, but my question is: should I use the probabilities from the training and test sets together to calculate the AUC? Or is there any way to evaluate my test set using the probabilities from the training set? I would really appreciate any feedback on these two approaches. — Rita A. Singer, Aug 22 '15 at 10:25
Thx :) You could use OOB-CV or an outer test-set validation, as you like. If you also perform some feature selection or signal processing, you should wrap your entire pipeline in a CV of some kind. Otherwise OOB is just fine. Check out the definition of the output value "votes" in randomForest; these are cross-validated already. I would plot these in a ROC plot, as in the last lines of the referenced code example. Such a plot depicts your expected trade-off between specificity and sensitivity. An AUC could be calculated as well. — Soren Havelund Welling, Aug 22 '15 at 18:59
