Whether and how cross-validation makes sense for Random Forests is debated, I suspect because there is no blueprint for how to use CV with Random Forests.
If we stick to what Breiman says:
In random forests, there is no need for cross-validation or a separate
test set to get an unbiased estimate of the test set error. It is
estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the
original data. About one-third of the cases are left out of the
bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the
kth tree to get a classification. In this way, a test set
classification is obtained for each case in about one-third of the
trees. At the end of the run, take j to be the class that got most of
the votes every time case n was oob. The proportion of times that j is
not equal to the true class of n averaged over all cases is the oob
error estimate. This has proven to be unbiased in many tests.
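For concreteness, scikit-learn exposes this OOB estimate directly. Below is a minimal sketch; the synthetic dataset and the hyperparameter values are placeholders, not part of the original question.

```python
# Minimal sketch of Breiman's OOB estimate in scikit-learn
# (synthetic data and hyperparameters are placeholders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the accuracy computed only from out-of-bag predictions,
# so 1 - oob_score_ plays the role of Breiman's OOB error estimate.
print("OOB error estimate:", 1 - rf.oob_score_)
```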
However, CV might still make sense if we apply it to random forests with the goal of tuning some hyperparameters.
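As a hedged sketch of what that tuning could look like, here CV is used only to pick hyperparameters, not to estimate the final error; the grid values are arbitrary examples, not a recommendation.

```python
# CV used for hyperparameter tuning rather than error estimation
# (placeholder data; the grid below is just an example).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"max_features": ["sqrt", "log2", None],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```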
That said, regarding your specific question:
The test results I obtain are quite satisfying but validation on
training set brings even better scores which surprises me as I
expected both training and test scores to be rather similar.
I believe that, although similar in principle, there is no guarantee that the out-of-bag error estimate provided by the random forest will be arbitrarily close to a k-fold CV estimate. In other words, what we have here are two estimators of the generalization error, each affected by its own bias and variance, and those biases may differ. How close they are depends, among other things, on the choice of k: in the estimation of the generalization error, a large k means less (pessimistic) bias. Here we see that the OOB error is roughly equivalent to two-fold cross-validation.
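If you want to see the two estimators side by side, a minimal sketch along these lines (synthetic data as a placeholder, arbitrary values of k) is:

```python
# Compare the OOB estimate with k-fold CV estimates for a few values of k
# (placeholder data; k values chosen only for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
print("OOB accuracy:", rf.fit(X, y).oob_score_)

for k in (2, 5, 10):
    scores = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                             X, y, cv=k)
    print(f"{k}-fold CV accuracy: {np.mean(scores):.3f}")
```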
If we opt for the validation set approach, where you simply split the data into two chunks (one for training and one for validation), train the model once on the training chunk, and then evaluate it once on the hold-out set, the variability of the generalization error estimate becomes even more of a concern (you have only one shot; you are not averaging over CV experiments). In the validation set approach:
the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in
the training set and which observations are included in the validation
set (source)
and
In the validation approach, only a subset of the observations—those
that are included in the training set rather than in the validation
set—are used to fit the model. Since statistical methods tend to
perform worse when trained on fewer observations, this suggests that
the validation set error rate may tend to overestimate the test error
rate for the model fit on the entire data set (source)
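To make that variability concrete, here is a hedged sketch that scores the same model on several random train/validation splits; the synthetic data and the number of repetitions are placeholders.

```python
# Variability of the validation-set estimate: same model, different splits
# (placeholder data; five splits chosen only for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=seed)
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    score = model.fit(X_tr, y_tr).score(X_val, y_val)
    print(f"split {seed}: validation accuracy = {score:.3f}")
```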
All of the above concerns Random Forests. In your example you use ExtraTrees. I think everything said about random forests carries over to ExtraTrees if, and only if, their implementation bootstraps the data the same way Random Forests do.
... Because the splits are selected even more randomly than for
regular RF, bootstrap is not needed as much to decorrelate trees. (source)
In sklearn, as you correctly did, the bootstrap
parameter needs to be explicitly set to True;
hence, I have nothing to add regarding your configuration.
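For reference, a minimal sketch of that configuration (the data and the other hyperparameter values are placeholders, not your actual setup):

```python
# In scikit-learn, ExtraTrees do not bootstrap by default, so both flags
# must be set explicitly to get an OOB estimate (placeholder data).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

et = ExtraTreesClassifier(n_estimators=500,
                          bootstrap=True,   # off by default for ExtraTrees
                          oob_score=True,
                          random_state=0)
et.fit(X, y)
print("OOB accuracy:", et.oob_score_)
```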
Bottom line: there is inherent variability in any estimate of an algorithm's performance. Is the difference between the OOB and test-set scores an issue? Possibly.
I am not yet able to interpret the report you produced in detail, but I would try the following experiments:
- run the extra-trees model on the full dataset
- alternatively, rather than relying on a single test-set score, run a k-fold cross-validation and check the distribution of the generalization error (see the sketch at the end of this answer)
Alongside visual inspection, the two strategies above might help identify the potential impact of outliers.
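A possible version of that second experiment is sketched below: instead of one test-set score, look at the spread of scores across folds. The synthetic data and the choice of 10 folds are placeholders; plug in your own X and y.

```python
# Distribution of the generalization error across CV folds
# (placeholder data; replace X, y with your own arrays).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

et = ExtraTreesClassifier(n_estimators=500, bootstrap=True, random_state=0)
scores = cross_val_score(et, X, y, cv=10)

print("per-fold accuracy:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```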