
I have to build a classification model on a dataset of 697 observations, of which only 18 belong to the group of interest. As usual, I split the data into a training and a test set, stratified by the positive class.
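For concreteness, the split step looks roughly like this (a minimal sketch in scikit-learn, with synthetic data standing in for my real dataset; all names and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 697 rows, roughly 18 positives.
X, y = make_classification(n_samples=697, n_features=10,
                           weights=[1 - 18 / 697], random_state=0)

# Stratify on y so the rare-class proportion is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```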

I tried 10-fold CV with SMOTE on the training data to select the best model, but on average none of the candidates performed better than chance across the CV folds. Now I'm left wondering what the best approach to this problem is, and I've thought of a few options:

  1. Use the bootstrap instead of CV; however, I've read that I might need a large number of repetitions, and given the size of my data I worry that the resamples would be too similar to each other;

  2. Skip resampling altogether and fit a complex model on the whole training set;

  3. Reframe the problem, perhaps as anomaly detection using a one-class SVM.

Are any of these alternatives valid, or is there a more established approach to this situation?
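For reference, this is roughly how the CV step can be wired so that SMOTE is fit only within each training fold (a sketch using imbalanced-learn's pipeline; the random forest and the F1 scoring are just placeholders, and X_train/y_train come from the split sketch above):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Keeping SMOTE inside the pipeline means it is refit on each training fold
# and never sees the corresponding validation fold.
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=3, random_state=0)),  # small k: few positives per fold
    ("clf", RandomForestClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```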

luizgg
  • You say "none were better than chance". How are you measuring _better_? If you are using accuracy, you get 97.4% correct by simply saying everything is the majority class. If what you really care about is identifying the smaller class, you might consider a different metric such as the F-score. – G5W Jan 05 '18 at 00:37
  • @G5W I reran the modeling process, now looking at precision, recall and F1, and the problem is that on some folds the metrics cannot be computed because the model fails to predict the positive class entirely. Anyway, I think you make a great point, but I believe that at this point my modeling strategy is wrong. – luizgg Jan 05 '18 at 02:27
  • Yes, I do believe your strategy is wrong. Another option: some classification methods allow you to assign weights to the instances, so you could use accuracy if you weighted the positive class more heavily. Also, you might want to use _stratified_ cross validation to eliminate really bad CV folds. – G5W Jan 05 '18 at 02:42
  • I am using stratified CV. Also, I'll look into some of these methods. Thanks! – luizgg Jan 05 '18 at 03:11
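A minimal sketch of the class-weighting idea from the comments, assuming the same X_train/y_train as in the question (the logistic regression is just an example; any estimator that accepts class_weight would do):

```python
from sklearn.linear_model import LogisticRegression

# "balanced" weights each class by n_samples / (n_classes * class_count), so
# misclassifying one of the 18 positives costs far more than a majority-class error.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```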

1 Answer


I have a short simulation study paper that I wrote for a class on this issue. You can find a link to the .pdf as well as the GitHub here. I think there are some good citations there that you can check out if nothing else. But I think you are on the right path:

  1. I agree your resamples will likely be too similar.
  2. See the link above—I do not think this will work. If it does, I worry that it will be due to overfitting.
  3. With a positive class this small, I agree that anomaly detection would perhaps be the best approach. However, it is going to be difficult with only 18 cases.

Is there any way you can get more data?
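If you do go the anomaly-detection route, a minimal sketch with scikit-learn's OneClassSVM might look like the following (fitting on the majority class only; nu and the kernel are placeholders you would need to tune, and X_train/y_train/X_test follow the question's split):

```python
from sklearn.svm import OneClassSVM

# Train on the "normal" (majority) class only; the rare positives are never
# shown to the model and should be flagged as outliers at prediction time.
oc_svm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc_svm.fit(X_train[y_train == 0])

# predict() returns +1 for inliers and -1 for outliers; map -1 to the positive class.
y_pred = (oc_svm.predict(X_test) == -1).astype(int)
```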

Mark White
  • How did you set the classification thresholds in your study? I've done a similar thing, using AUC to compare models, with the result that resampling methods have no effect on the AUC (really no effect on the ROC curves themselves). – Matthew Drury Jan 04 '18 at 20:22
  • Took a look at your paper and the results gave me some ideas; I might read it again to fully understand it. Finally, unfortunately, I can't acquire more data. – luizgg Jan 04 '18 at 20:50
  • @MatthewDrury probability thresholds were all at .50. So if it was a predicted probability of .500001 in the positive class, it was classified positively. If predicted probability was the same for the negative class, it was classified negatively. – Mark White Jan 04 '18 at 21:07
  • Yah, so the resampling is really just adjusting the prior class probability so that the 0.5 threshold is a bit more appropriate. I would argue that resampling is a poor solution to the thresholding issue. – Matthew Drury Jan 04 '18 at 21:09
  • Do you mean resampling as in bootstrap resampling, or sampling techniques like under/over/SMOTE? – Mark White Jan 04 '18 at 21:40
  • Sampling techniques like under/over/SMOTE. https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve – Matthew Drury Jan 04 '18 at 21:53
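To illustrate the thresholding point made in these comments, a sketch of moving the cutoff instead of resampling (the 0.1 cutoff and the logistic model are purely illustrative, reusing X_train/y_train/X_test from the question):

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Lower the decision threshold below the default 0.5 so that more of the rare
# class is flagged; in practice the cutoff would be chosen from the ROC or
# precision-recall curve on held-out data, not hard-coded.
y_pred = (proba >= 0.1).astype(int)
```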