I'm running random forests for imbalanced multiclass classification. Because of this, I'm trying several variations of RF: basic RF, balanced RF, weighted RF, undersampled RF and SMOTE RF (oversampled). Most of these methods are meant to compensate for the class imbalance in one way or another.
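For reference, this is roughly how I set up the five variants (just a sketch using scikit-learn and imbalanced-learn; the hyperparameter values are placeholders, not the ones I actually tuned):

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

models = {
    # plain RF, no correction for imbalance
    "basic_rf": RandomForestClassifier(n_estimators=500, random_state=0),
    # each tree is grown on a bootstrap sample balanced by undersampling
    "balanced_rf": BalancedRandomForestClassifier(n_estimators=500, random_state=0),
    # class weights in the split criterion instead of resampling
    "weighted_rf": RandomForestClassifier(
        n_estimators=500, class_weight="balanced", random_state=0
    ),
    # undersample the majority classes, then fit a plain RF
    "undersampled_rf": Pipeline([
        ("under", RandomUnderSampler(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ]),
    # oversample the minority classes with SMOTE, then fit a plain RF
    "smote_rf": Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ]),
}
```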
I'm evaluating performance with recall, precision, precision-recall (average precision in Python) and ROC AUC, both per class (I have 3 classes: 0, 1 and 2) and as micro and macro averages.
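This is more or less how I compute those metrics (a sketch on synthetic data just so it runs on its own; the three classes and the imbalance only mimic my real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# synthetic stand-in for my data: 3 imbalanced classes
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)
y_te_bin = label_binarize(y_te, classes=[0, 1, 2])  # one-vs-rest targets

# per-class precision / recall (average=None gives one score per class)
print(precision_score(y_te, y_pred, average=None))
print(recall_score(y_te, y_pred, average=None))

# per-class average precision and ROC AUC from the one-vs-rest columns
for k in range(3):
    print(k,
          average_precision_score(y_te_bin[:, k], y_proba[:, k]),
          roc_auc_score(y_te_bin[:, k], y_proba[:, k]))

# micro and macro averages
for avg in ("micro", "macro"):
    print(avg,
          precision_score(y_te, y_pred, average=avg),
          recall_score(y_te, y_pred, average=avg),
          average_precision_score(y_te_bin, y_proba, average=avg),
          roc_auc_score(y_te_bin, y_proba, average=avg))
```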
What I see in the results is that on the train sets I get values around 0.999 every time, but on the test set performance is around 0.85 for the less frequent classes and around 0.99 for the most frequent class. This holds for pretty much all methods.
I already tried accounting for the imbalance with the different models, and I also tuned each of them with GridSearchCV, so each model should be using its best parameter combination, cross-validated with GroupKFold (since I have clustered observations, to prevent data leakage). A sketch of that tuning setup is below.
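The tuning setup looks roughly like this (again a sketch: the grid, the scorer and the synthetic cluster ids are placeholders, and I show only the plain RF as the estimator; I do the same for each variant):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# synthetic stand-in: features, labels and a cluster id per observation
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
groups = np.random.default_rng(0).integers(0, 100, size=len(y))

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",          # placeholder scorer; I also look at recall / ROC AUC
    cv=GroupKFold(n_splits=5),   # whole clusters stay together in each fold
    n_jobs=-1,
)
search.fit(X, y, groups=groups)  # groups are passed through to GroupKFold
best_model = search.best_estimator_
```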
So my question is... does this still count as overfitting, or not? I read that whenever you get better performance on the train set than on the test set, the model is overfitting. I cannot think of any more methods to close the train-test gap... isn't this enough, and maybe this is all that can be squeezed out of the data?