Hello Cross Validated!
I have a question that I can't figure out.
I am building a classifier in R for a dichotomous outcome (0/1), using the random forest algorithm from the randomForest package.
The outcome variable is heavily imbalanced: the "1" class makes up only 7% of the cases, with n = 12000. The fitted model misclassifies 97% of the "1"s but under 0.1% of the "0"s.
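For reference, here is a minimal sketch of the baseline fit. The data frame `dat` and its predictors are simulated placeholders matching the numbers above (n = 12000, roughly 7% "1"s), not my real data:

```r
library(randomForest)

## Simulated stand-in for the real data: n = 12000, ~7% positives
set.seed(1)
n   <- 12000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(rbinom(n, 1, plogis(-3 + dat$x1)))

## Baseline random forest on the imbalanced outcome
fit <- randomForest(y ~ ., data = dat, ntree = 500)
fit$confusion  # OOB confusion matrix with per-class error rates
```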
Since random forest is essentially a form of bootstrapping, I tried stratifying the sampling on the outcome to get better classifications; a sketch of what I did follows below. The misclassification rates changed to 60% for the "1"s and 20% for the "0"s.
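This is roughly how I did the stratification, reusing `dat` from the sketch above. The `strata` and `sampsize` arguments of randomForest draw a stratified in-bag sample for each tree; the equal class counts here are my own choice for illustration:

```r
## Balanced per-tree sampling: draw as many "0"s as there are "1"s
n1 <- min(table(dat$y))  # size of the minority ("1") class
fit_strat <- randomForest(y ~ ., data = dat,
                          strata   = dat$y,       # stratify on the outcome
                          sampsize = c(n1, n1),   # equal draws per class
                          ntree    = 500)
fit_strat$confusion  # OOB confusion matrix after stratification
```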
My gut feeling tells me that stratifying on the outcome is not good practice, but I can't find anything specific on the subject.
This answer is related, but the situation there is different: Most interesting statistical paradoxes
That answer explains a "reverse" Simpson's paradox, which arises when one unknowingly stratifies on the outcome; here I would be stratifying deliberately.
So the question is: are there any negative effects of stratifying on the outcome when using a bootstrap-based method like random forest?