
I have a dataset of continuous features and 4 classes. The class counts are 1793, 246, 103 and 102, and adding more data is quite difficult at this point. I've run a random forest classifier on the entire dataset and got F1 scores of 0.97, 0.67, 0.69 and 0.86 for the respective classes, which is acceptable in my case since the mistakes mostly went to an adjacent class. The train and test sets had similar class proportions.

However, I then tried balancing the class counts by dropping instances of the dominant class. Training a random forest after keeping only every 8th instance of the first class gave me an F1 score above 0.9 on every class. After this I also tried cutting down more instances to completely balance all 4 classes, which gave slightly lower scores than the second attempt.
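The comparison described above can be sketched roughly as follows. This is a minimal, self-contained illustration on a synthetic dataset (`make_classification` stands in for the real data; the class weights only approximate the 1793/246/103/102 counts), not the actual experiment. Note that the undersampling is applied to the training split only, and both models are scored on the same untouched test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (4 imbalanced classes).
X, y = make_classification(
    n_samples=2244, n_classes=4, n_informative=6,
    weights=[0.80, 0.11, 0.045, 0.045], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Undersample the training set: keep only every 8th instance of class 0.
keep = np.ones(len(y_tr), dtype=bool)
idx0 = np.where(y_tr == 0)[0]
keep[idx0] = False
keep[idx0[::8]] = True

# Fit on full vs. undersampled training data, evaluate on the same test set.
for name, (Xs, ys) in {"full": (X_tr, y_tr),
                       "undersampled": (X_tr[keep], y_tr[keep])}.items():
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xs, ys)
    print(name, f1_score(y_te, rf.predict(X_te), average=None).round(2))
```

An alternative worth trying before throwing data away is `class_weight="balanced"` in `RandomForestClassifier`, which reweights classes instead of discarding majority-class instances.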

Which of the three approaches is the way to go? The one that got the best scores, or is there something I'm missing?

Noam_I
  • [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jan 07 '21 at 03:26
  • Well, that is kind of my intuition: more data is good, so I should use it all and the model will deal with it well. Seeing the better results on the undersampled data surprised me. Should I trust it and go with the undersampled-data model, or will the full-data model be more reliable when I use it in real time on new data? – Noam_I Jan 07 '21 at 09:31
  • I would be very careful about any conclusions you draw from the F1 score, especially in the context of imbalanced classes, because all the criticisms against accuracy at [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) apply equally to the F1 score. I recommend that you use probabilistic class membership predictions instead, and assess these using proper scoring rules. More info at my answer at that thread. – Stephan Kolassa Jan 07 '21 at 10:30
  • Isn't the prediction of the random forest based on the class membership probability? How can I adjust those predictions using different thresholds for the probabilities? – Noam_I Jan 07 '21 at 10:48
  • Random Forests can be used as probabilistic classifiers. [I do not think that thresholds are useful in a classification context](https://stats.stackexchange.com/a/312124/1352), only in the subsequent *decision* step. – Stephan Kolassa Jan 07 '21 at 13:04
  • If in the end I need a decision, what benefit would it be to use the Random Forest as a probabilistic classifier and then make the decision? How can it help me evaluate the model? – Noam_I Jan 07 '21 at 13:21
  • Have you looked at my answer in the linked thread on thresholds? I advocate separating the probabilistic *model* from the *decision*. A probabilistic model and prediction can be evaluated on its own (using proper scoring rules). The decision should use the predictions - but also other inputs, mainly costs. Someone may have a low probability of having a disease, but the best (lowest expected cost) course of action may still be to treat him as if he *did* have the disease. But this decision uses inputs (like costs) that are irrelevant to the prediction model. – Stephan Kolassa Jan 07 '21 at 18:35

0 Answers