In Random Forest, each tree is grown in parallel on a unique bootstrap sample of the data. Because each bootstrap sample is expected to contain about 63% of the unique observations, roughly 37% of the observations are left out and can be used to test that tree.
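As a quick sanity check on that 63%/37% split (my own illustration, not something from Ridgeway's text), here is a small simulation; the sample size and seed are arbitrary:

```python
# Simulate the ~63% / ~37% split for a bootstrap sample drawn with replacement.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                  # number of observations
sample = rng.integers(0, n, size=n)         # bootstrap sample (with replacement)
unique_frac = np.unique(sample).size / n

print(f"in-bag (unique) fraction: {unique_frac:.3f}")    # ~0.632
print(f"out-of-bag fraction:      {1 - unique_frac:.3f}")  # ~0.368
```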
Now, it seems that in Stochastic Gradient Boosting, there is also an $OOB_{error}$ estimate similar to the one in RF:
> If bag.fraction is set to be greater than 0 (0.5 is recommended), gbm computes an out-of-bag estimate of the improvement in predictive performance. It evaluates the reduction in deviance on those observations not used in selecting the next regression tree.
Source: Ridgeway (2007), section 3.3 (page 8).
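For concreteness, here is what such a per-iteration OOB improvement looks like in scikit-learn's GradientBoostingRegressor, whose subsample parameter plays the role of gbm's bag.fraction. I am using it only as an analogous implementation to illustrate the idea, not gbm itself; the dataset and settings below are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=300,
    subsample=0.5,        # counterpart of gbm's bag.fraction = 0.5
    random_state=0,
)
model.fit(X, y)

# oob_improvement_[m]: reduction in loss at iteration m, measured on the
# observations left out of that iteration's subsample (only set when subsample < 1).
cumulative_oob = np.cumsum(model.oob_improvement_)
best_n_trees = int(np.argmax(cumulative_oob)) + 1
print("number of trees suggested by the cumulative OOB curve:", best_n_trees)
```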
I have trouble understanding how this works or why it is valid. Say I am adding a tree to the sequence. I grow this tree on a random subsample of the original data set, and I could certainly test this single tree on the observations that were not used to grow it. Agreed. BUT, since boosting is sequential, the prediction for those left-out observations actually comes from the entire sequence of trees built so far, and there is a high chance that many of the preceding trees have already seen these observations. So, unlike in RF, the model is not really being tested on unseen observations at each round, right?
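To make my confusion precise, here is a rough sketch of how I picture the per-iteration OOB computation, using squared-error loss for simplicity; the function name, parameters and details are mine, not gbm's actual code:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_improvements(X, y, n_trees=100, bag_fraction=0.5, nu=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    F = np.full(n, y.mean())          # current ensemble prediction F_{m-1}(x)
    improvements = []
    for m in range(n_trees):
        in_bag = rng.choice(n, size=int(bag_fraction * n), replace=False)
        oob = np.setdiff1d(np.arange(n), in_bag)

        residuals = y - F                       # gradient for squared-error loss
        tree = DecisionTreeRegressor(max_depth=3, random_state=seed)
        tree.fit(X[in_bag], residuals[in_bag])

        # Deviance of the *whole ensemble so far* on this round's OOB points,
        # before and after adding the new tree -- which is exactly my concern:
        # those points were likely in-bag for earlier trees in the sequence.
        before = np.mean((y[oob] - F[oob]) ** 2)
        F = F + nu * tree.predict(X)
        after = np.mean((y[oob] - F[oob]) ** 2)
        improvements.append(before - after)
    return np.array(improvements)
```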
So how come this is called an "out-of-bag" error estimate? To me, it does not appear to be "out" of any bag, since the model has already seen these observations.