
I am dealing with an imbalanced dataset using the R package randomForest. Someone suggested that I bootstrap my data, over-sampling the rare class and under-sampling the common class. But I found that as the resampling size increases, the OOB error decreases towards zero, which indicates severe overfitting. Why does this happen?
The same thing happens with a single tree model (rpart).

Here is an example. Although the data are balanced, it still illustrates the effect of the resampling size:

require(randomForest)
set.seed(0)
iris500=iris[sample(1:nrow(iris),size=500,replace=TRUE),]
iris2000=iris[sample(1:nrow(iris),size=2000,replace=TRUE),]
formula="Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width"
(rf0=randomForest(as.formula(formula),data=iris)) #OOB estimate of error rate: 4%
(rf1=randomForest(as.formula(formula),data=iris500)) #OOB estimate of error rate: 0.4%
(rf2=randomForest(as.formula(formula),data=iris2000)) #OOB estimate of error rate: 0%

table(iris[["Species"]]) 
#setosa versicolor virginica 
#  50       50        50
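For reference, a quick check (added for illustration, using the same seed as above) shows how many of the 150 original rows actually appear in each resampled set — the rest of each set is pure duplication:

```r
set.seed(0)
n <- nrow(iris)                        # 150 original rows
idx500  <- sample(1:n, size = 500,  replace = TRUE)
idx2000 <- sample(1:n, size = 2000, replace = TRUE)
# how many of the 150 original rows appear at least once in each resample?
length(unique(idx500))
length(unique(idx2000))
```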
Karel Macek
earclimate
  • Can you show the code you used to resample (augment) your rare class? I would say that generally you should not be altering your data set too much, assuming that it reflects reality. – Tim Biegeleisen Jul 31 '15 at 04:02
  • How many instances do you have in `iris`? Is it possible that the same instance exists in both training and testing set because of the resampling? – Ping Jin Jul 31 '15 at 08:39
  • 1
    There is 50 instances for each class, may be that is a reason,but it did result a 100% accuracy. if so, how should I balance my data through Bootstrap Resampling? –  Jul 31 '15 at 09:09
  • You will basically have every sample both in-bag and out-of-bag. Instead, use the sampsize= and strata= parameters during training to form a correct bootstrap. Avoid over-sampling. – Soren Havelund Welling Jul 31 '15 at 17:40
  • 1
    This example shows how to down-sample correctly: http://stats.stackexchange.com/questions/157714/r-package-for-weighted-random-forest-classwt-option/158030#158030 – Soren Havelund Welling Jul 31 '15 at 17:47
  • Many thanks! I have tested those examples. It looks like the sampsize and strata parameters are for down-sampling, but in reality I have no more than 25 instances even for the majority class, so maybe I should look at the classwt parameter instead. Thanks for your suggestion. – earclimate Aug 01 '15 at 00:07
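The stratified bootstrap suggested in the comments can be sketched as follows (a minimal example on the original iris data; the per-class sampsize of 30 is illustrative, not prescribed):

```r
require(randomForest)
set.seed(0)
# instead of inflating the data set before training, let each tree
# draw a fixed number of in-bag cases from every class
rf_strat <- randomForest(Species ~ ., data = iris,
                         strata = iris$Species,
                         sampsize = c(30, 30, 30))  # 30 in-bag cases per class
print(rf_strat)
```

Because the bootstrap is drawn per tree from the unduplicated data, no observation can be simultaneously in-bag and out-of-bag for the same tree, and the OOB estimate stays honest.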

1 Answer


I assume you introduce the problem by sampling with replacement in the first step: many observations will appear several times in the resampled set, while others will not be included at all.

There are 50 instances of each class (3 classes $\rightarrow$ 150 samples). Let $n = 150$ be the number of original samples and $k$ the number of "new samples" obtained by bootstrapping.

The probability that a given original sample is included at least once in the new set of size $k$ (drawn with replacement) is: $$ 1 - \left(1 - \frac{1}{n}\right)^{k} $$

For $k = 500$ this probability equals $1 - (149/150)^{500} \approx 0.965$.

For $k = 2000$ this probability equals $1 - (149/150)^{2000} \approx 0.9999985$.

So for large $k$ nearly every original observation makes it into the new set, and most of them appear several times: on average each observation is duplicated $k/n$ times, i.e. about 3.3 times for $k = 500$ and about 13.3 times for $k = 2000$.

These duplicates are what destroy the OOB estimate. Each tree in the forest draws its own bootstrap from the already-resampled data, so identical copies of the same observation end up in-bag for some trees and out-of-bag for others. The "out-of-bag" cases are then points the forest has effectively already trained on, so the OOB error is no longer an honest estimate of generalization error and collapses towards zero as $k$ grows. The duplicated points also carry more weight, and as a consequence they dominate what the classifier learns.
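The inclusion probability $1 - (1 - 1/n)^{k}$ can also be checked by a quick simulation (added for illustration, not part of the original derivation):

```r
set.seed(1)
n <- 150
for (k in c(500, 2000)) {
  # fraction of the n originals appearing at least once in a size-k resample,
  # averaged over 1000 resamples
  sim <- mean(replicate(1000, length(unique(sample(n, k, replace = TRUE))) / n))
  cat(sprintf("k = %4d: simulated %.4f, closed form %.4f\n",
              k, sim, 1 - (1 - 1/n)^k))
}
```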

Kirill