Whether and how cross-validation makes sense for Random Forests is debated, I suspect because there is no blueprint for how to use CV with Random Forests.
If we stick to what Breiman says:
In random forests, there is no need for cross-validation or a separate
test set to get an unbiased estimate of the test set error. It is
estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the
original data. About one-third of the cases are left out of the
bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the
kth tree to get a classification. In this way, a test set
classification is obtained for each case in about one-third of the
trees. At the end of the run, take j to be the class that got most of
the votes every time case n was oob. The proportion of times that j is
not equal to the true class of n averaged over all cases is the oob
error estimate. This has proven to be unbiased in many tests.
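For concreteness, scikit-learn exposes this OOB estimate directly. Below is a minimal sketch; the synthetic dataset and the hyperparameter values are placeholders, not part of the original question.

```python
# Minimal sketch of Breiman's OOB estimate in scikit-learn
# (synthetic data and hyperparameters are placeholders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the accuracy computed only from out-of-bag predictions,
# so 1 - oob_score_ plays the role of Breiman's OOB error estimate.
print("OOB error estimate:", 1 - rf.oob_score_)
```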
However, CV might still make sense if we apply it to random forests with the goal of tuning some hyperparameters.
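As a hedged sketch of what that tuning could look like, here CV is used only to pick hyperparameters, not to estimate the final error; the grid values are arbitrary examples, not a recommendation.

```python
# CV used for hyperparameter tuning rather than error estimation
# (placeholder data; the grid below is just an example).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"max_features": ["sqrt", "log2", None],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```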
That said, regarding your specific question:
The test results I obtain are quite satisfying but validation on
training set brings even better scores which surprises me as I
expected both training and test scores to be rather similar.
I believe that, although similar in principle, there is no guarantee that the out-of-bag error estimate provided by the random forest will be arbitrarily close to a k-fold CV estimate. In other words, what we have here are two estimators of the generalization error, each affected by its own bias and variance, and those biases may differ. How close they are depends, among other things, on the choice of k: in the estimation of the generalization error, a large k means less (pessimistic) bias. Here we see that the OOB error is roughly equivalent to two-fold cross-validation.
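If you want to see the two estimators side by side, a minimal sketch along these lines (synthetic data as a placeholder, arbitrary values of k) is:

```python
# Compare the OOB estimate with k-fold CV estimates for a few values of k
# (placeholder data; k values chosen only for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
print("OOB accuracy:", rf.fit(X, y).oob_score_)

for k in (2, 5, 10):
    scores = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                             X, y, cv=k)
    print(f"{k}-fold CV accuracy: {np.mean(scores):.3f}")
```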
If we opt for the validation set approach, where you simply split the data into two chunks (one for training and one for validation), train the model once on the training chunk, and then evaluate it once on the hold-out set, the variability of the generalization error estimate becomes even more of a concern (you have only one shot; you are not averaging over CV experiments). In the validation set approach:
the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in
the training set and which observations are included in the validation
set (source)
and
In the validation approach, only a subset of the observations—those
that are included in the training set rather than in the validation
set—are used to fit the model. Since statistical methods tend to
perform worse when trained on fewer observations, this suggests that
the validation set error rate may tend to overestimate the test error
rate for the model fit on the entire data set (source)
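To make that variability concrete, here is a hedged sketch that scores the same model on several random train/validation splits; the synthetic data and the number of repetitions are placeholders.

```python
# Variability of the validation-set estimate: same model, different splits
# (placeholder data; five splits chosen only for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=seed)
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    score = model.fit(X_tr, y_tr).score(X_val, y_val)
    print(f"split {seed}: validation accuracy = {score:.3f}")
```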
All of the above concerns Random Forests. In your example you use ExtraTrees. I think everything said about random forests carries over to ExtraTrees if, and only if, their implementation bootstraps the data the same way Random Forests do.
... Because the splits are selected even more randomly than for
regular RF, bootstrap is not needed as much to decorrelate trees. (source)
In sklearn, as you correctly did, the bootstrap
parameter needs to be explicitly set to True;
hence, I have nothing to add regarding your configuration.
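For reference, a minimal sketch of that configuration (the data and the other hyperparameter values are placeholders, not your actual setup):

```python
# In scikit-learn, ExtraTrees do not bootstrap by default, so both flags
# must be set explicitly to get an OOB estimate (placeholder data).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

et = ExtraTreesClassifier(n_estimators=500,
                          bootstrap=True,   # off by default for ExtraTrees
                          oob_score=True,
                          random_state=0)
et.fit(X, y)
print("OOB accuracy:", et.oob_score_)
```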
Bottom line: there is inherent variability in any estimate of an algorithm's performance. Is the difference between the OOB and test-set scores an issue? Possibly.
I am not yet able to interpret the report you produced in detail, but I would try the following experiments:
- run the extra-trees model on the full dataset
- alternatively, rather than relying on a single test-set score, run a k-fold cross-validation and check the distribution of the generalization error (see the sketch at the end of this answer)
Alongside visual inspection, the two strategies above might help identify the potential impact of outliers.
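A possible version of that second experiment is sketched below: instead of one test-set score, look at the spread of scores across folds. The synthetic data and the choice of 10 folds are placeholders; plug in your own X and y.

```python
# Distribution of the generalization error across CV folds
# (placeholder data; replace X, y with your own arrays).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

et = ExtraTreesClassifier(n_estimators=500, bootstrap=True, random_state=0)
scores = cross_val_score(et, X, y, cv=10)

print("per-fold accuracy:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```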