I am working with an imbalanced dataset using the R package randomForest. Someone suggested bootstrapping the data while over-sampling the rare class and under-sampling the common class. However, I found that as the resampling size increases, the OOB error decreases to zero, which suggests severe overfitting. Why does this happen?
This also happens with a single tree model (rpart).
Here is an example. The data here are balanced; it is only meant to test the effect of the resampling size:
library(randomForest)
set.seed(0)
# Resample the 150 iris rows with replacement, up to 500 and 2000 rows
iris500 <- iris[sample(seq_len(nrow(iris)), size = 500, replace = TRUE), ]
iris2000 <- iris[sample(seq_len(nrow(iris)), size = 2000, replace = TRUE), ]
f <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
(rf0 <- randomForest(f, data = iris))      # OOB estimate of error rate: 4%
(rf1 <- randomForest(f, data = iris500))   # OOB estimate of error rate: 0.4%
(rf2 <- randomForest(f, data = iris2000))  # OOB estimate of error rate: 0%
table(iris[["Species"]])
#     setosa versicolor  virginica
#         50         50         50
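
One thing that can be checked directly is how often a row that is out-of-bag for a tree has an exact copy in that tree's in-bag sample once the data have been resampled with replacement. The following is a minimal sketch in base R, assuming a single plain bootstrap draw is a fair stand-in for what one tree sees; it does not inspect randomForest's internals, and the names `idx500`, `inbag`, `oob`, and `leak` are made up for illustration.

```r
set.seed(0)
n_orig <- nrow(iris)  # 150 distinct observations

# Original row id of each of the 500 resampled rows (same idea as iris500)
idx500 <- sample(seq_len(n_orig), size = 500, replace = TRUE)

# One bootstrap draw over the 500 resampled rows, standing in for one tree
inbag <- sample(seq_along(idx500), replace = TRUE)
oob <- setdiff(seq_along(idx500), inbag)

# Fraction of out-of-bag rows whose original observation is also in-bag
leak <- mean(idx500[oob] %in% idx500[inbag])
leak
```

If `leak` comes out close to 1, most rows scored as out-of-bag were effectively seen during training, so the OOB error on the resampled data would no longer be measuring performance on unseen observations.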