I'm running random forests for imbalanced multiclass classification. Because of this, I'm trying several variations of RF: basic RF, balanced RF, weighted RF, undersampled RF and SMOTE RF (oversampled). Most of these methods are meant to compensate for the class imbalance in one way or another.
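For reference, this is roughly how I set up the five variants (just a sketch using scikit-learn and imbalanced-learn; the hyperparameter values are placeholders, not the ones I actually tuned):

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

models = {
    # plain RF, no correction for imbalance
    "basic_rf": RandomForestClassifier(n_estimators=500, random_state=0),
    # each tree is grown on a bootstrap sample balanced by undersampling
    "balanced_rf": BalancedRandomForestClassifier(n_estimators=500, random_state=0),
    # class weights in the split criterion instead of resampling
    "weighted_rf": RandomForestClassifier(
        n_estimators=500, class_weight="balanced", random_state=0
    ),
    # undersample the majority classes, then fit a plain RF
    "undersampled_rf": Pipeline([
        ("under", RandomUnderSampler(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ]),
    # oversample the minority classes with SMOTE, then fit a plain RF
    "smote_rf": Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ]),
}
```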
I'm evaluating performance with recall, precision, precision-recall (average precision in Python) and ROC AUC, both per class (I have 3 classes: 0, 1 and 2) and as micro and macro averages.
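This is more or less how I compute those metrics (a sketch on synthetic data just so it runs on its own; the three classes and the imbalance only mimic my real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# synthetic stand-in for my data: 3 imbalanced classes
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)
y_te_bin = label_binarize(y_te, classes=[0, 1, 2])  # one-vs-rest targets

# per-class precision / recall (average=None gives one score per class)
print(precision_score(y_te, y_pred, average=None))
print(recall_score(y_te, y_pred, average=None))

# per-class average precision and ROC AUC from the one-vs-rest columns
for k in range(3):
    print(k,
          average_precision_score(y_te_bin[:, k], y_proba[:, k]),
          roc_auc_score(y_te_bin[:, k], y_proba[:, k]))

# micro and macro averages
for avg in ("micro", "macro"):
    print(avg,
          precision_score(y_te, y_pred, average=avg),
          recall_score(y_te, y_pred, average=avg),
          average_precision_score(y_te_bin, y_proba, average=avg),
          roc_auc_score(y_te_bin, y_proba, average=avg))
```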
What I see in the results is that on the train sets I get values around 0.999 every time, but on the test set performance is around 0.85 for the less frequent classes and around 0.99 for the most frequent class. This holds for pretty much all methods.
I already tried accounting for the imbalance with the different models, and I also tuned each of them with GridSearchCV, so each model should be using its best parameter combination, cross-validated with GroupKFold (since I have clustered observations, to prevent data leakage). A sketch of that tuning setup is below.
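The tuning setup looks roughly like this (again a sketch: the grid, the scorer and the synthetic cluster ids are placeholders, and I show only the plain RF as the estimator; I do the same for each variant):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# synthetic stand-in: features, labels and a cluster id per observation
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
groups = np.random.default_rng(0).integers(0, 100, size=len(y))

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",          # placeholder scorer; I also look at recall / ROC AUC
    cv=GroupKFold(n_splits=5),   # whole clusters stay together in each fold
    n_jobs=-1,
)
search.fit(X, y, groups=groups)  # groups are passed through to GroupKFold
best_model = search.best_estimator_
```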
So my question is... does this still count as overfitting, or not? I read that whenever you get better performance on the train set than on the test set, the model is overfitting. I cannot think of any more methods to close the train-test gap... isn't this enough, and maybe this is all that can be squeezed out of the data?