Huge overfitting with Random Forests and Boosted Trees?

Question

In the following picture, the boxplots represent a performance metric (the closer to 1, the better) recorded for 50 runs of cross-validation, and the black filled circles are the training values of the models (performance of the models on the full data set).

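[Figure: boxplots of the cross-validated performance metric for each model and case, with the training values overlaid as black filled circles]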

It seems to me that the "white model" (Random Forests) strongly overfits in case #2 (and possibly in case #3). Can I reasonably infer that simply by comparing the training and testing values?

On a separate note, it seems that the "grey model" (Boosted Trees) has a slight tendency to underfit (in light of cases #1 and #3). If so, does that mean this model is not optimal and that, therefore, my best-model selection procedure is also not optimal?

PS: in case it matters, the models are used for multiclass classification, the performance metric is the Ranked Probability Skill Score (RPSS), and each run of cross-validation randomly leaves out 5% of the observations (~150 observations).
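
For reference, a common textbook form of the metric is sketched below. This is an assumption: the question does not say which variant, normalization, or reference forecast was used. Here $K$ is the number of ordered classes, $p_{ij}$ is the predicted probability of class $j$ for observation $i$, $o_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and the reference forecast (often the class frequencies) is a placeholder.

$$
\mathrm{RPS}_i = \sum_{k=1}^{K} \left( \sum_{j=1}^{k} p_{ij} - \sum_{j=1}^{k} o_{ij} \right)^2,
\qquad
\mathrm{RPSS} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} \mathrm{RPS}_i}{\frac{1}{n}\sum_{i=1}^{n} \mathrm{RPS}_i^{\mathrm{ref}}}
$$

Under this form a perfect forecast gives $\mathrm{RPSS} = 1$, and a forecast no better than the reference gives $\mathrm{RPSS} = 0$, which is consistent with "the closer to 1, the better".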

For RFs, one thing that might inflate the test-set performance is that I used the data=data.train argument in the R command (see: http://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting?rq=1) That does not change the meaning of my question, but might explain why the differences between train/test are so extreme for RFs — Antoine, Apr 23 '15 at 07:47
As per my answer to the linked question, `predict(model, data.train)` gives a meaningless result for random forests. Use `predict(model)` to get honest predictions for your training data. — Hong Ooi, Apr 23 '15 at 10:07
there's a typo in my comment above: it should read ..."inflate the **train-test** performance"... — Antoine, Jun 20 '15 at 08:40
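
The distinction drawn in the comments can be illustrated with a minimal sketch. It assumes the `randomForest` package and uses the built-in `iris` data purely as a stand-in; the variable names and the accuracy metric are illustrative and are not the asker's model, data, or RPSS computation.

```r
library(randomForest)

set.seed(1)  # make the illustration reproducible

# Stand-in data and model (assumption: NOT the asker's data or settings)
fit <- randomForest(Species ~ ., data = iris)

# Resubstitution: each row is scored by trees that saw it during training,
# so this estimate is optimistically biased for random forests.
pred_resub <- predict(fit, iris)

# Out-of-bag (OOB): with no new data supplied, predict() returns predictions
# where each row is scored only by trees that did not include it in their
# bootstrap sample.
pred_oob <- predict(fit)

acc_resub <- mean(pred_resub == iris$Species)
acc_oob   <- mean(pred_oob == iris$Species)

cat("Resubstitution accuracy:", acc_resub, "\n")
cat("Out-of-bag accuracy:    ", acc_oob, "\n")
```

If the training values in the plot were computed in the resubstitution way, the gap between them and the cross-validation boxplots would be expected to look large for Random Forests regardless of how well the model generalizes; the OOB figure is the one that is comparable to the cross-validated values.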

0 Answers