Differences between "in-bag" and "out-of-bag" empirical risks in the R package "mboost"

Question

Currently, I am using the mboost R package to estimate some additive models. When using the function gamboost(), you can control the boosting hyper-parameters through boost_control(). One of its arguments, risk, determines how the empirical risk is computed, with three alternatives: "inbag", "oob" and "none". The following is the code I have used:

    model <- gamboost(income, data = datafit,
                      control = boost_control(mstop = 10000, nu = 0.1, trace = TRUE),
                      weights = datafit$factor,  # from survey design
                      family = QuantReg(tau = 0.05))

My questions are:

1. I would like to know the differences between using "inbag", "oob" and "none", and when each one is recommended.
2. In the case of "oob", you can supply an extra vector, oobweights, with the out-of-bag weights. What should this vector look like? I have seen applications with only 0's and 1's, but I have not found any documentation on what proportion of 0's and 1's the vector should have, nor on whether I should also include the weights from the sampling design here (so far, I have passed that information through the weights argument).

score 1 · Answer 1 · answered Jan 12 '19 at 22:40

In-bag (IB) and out-of-bag (OOB) errors can be understood as follows: when we train a model, we can either use all the data we have available or split the data so that a hold-out set is not used during training. After training the learner, we want an indication of its performance. If we trained the learner on all the available data, then the error we report is effectively the learner's fit on the training data - this is what we call the in-bag error. If we kept a hold-out set during training, we use the unseen data of the hold-out set to estimate the performance of our learner - this is what we call the out-of-bag error. Generally, the in-bag error is considered an optimistic indicator of the learner's performance: it rewards over-fitting and says little about how well the learner generalises. This is true for any model, boosting-based or not.
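To make the distinction concrete, here is a minimal sketch in base R. It is not taken from mboost; the simulated data, the 70/30 split and the deliberately flexible polynomial model are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data (an assumption for illustration): noisy sine curve.
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hold out 60 of the 200 rows; the model never sees them while fitting.
holdout <- sample(n, size = 60)
train <- dat[-holdout, ]
test  <- dat[holdout, ]

# A deliberately flexible learner: a degree-15 polynomial.
fit <- lm(y ~ poly(x, 15), data = train)

mse <- function(obs, pred) mean((obs - pred)^2)

# In-bag error: evaluated on the same rows used for fitting.
inbag_error <- mse(train$y, fitted(fit))

# Out-of-bag error: evaluated on the held-out rows only.
oob_error <- mse(test$y, predict(fit, newdata = test))

print(c(inbag = inbag_error, oob = oob_error))
```

With a flexible model like this, the in-bag value will typically be the smaller of the two, which is the optimism described above.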

Particular to boosting, we train our model iteratively and are thus prone to over-fitting as the number of iterations grows. Because of this, we may want to monitor the performance of the overall ensemble during training. This is where the in-bag/out-of-bag error comes into play. In addition, certain loss functions allow the inclusion of weights, i.e. a way to indicate that certain observations should be considered more relevant to our task than others. We should not set a weight to $0$ merely to down-weight a point, as this would fully exclude that sample point from any weighted calculation. The OOB weights allow a different weighting to be used for the OOB sample than the one used for the function gradient. Particular to mboost, if we set an entry of the weights vector to $0$ and do not specify an oobweights vector, then that point is used in the OOB error calculations (when the risk is set to "oob").
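As a rough sketch of how this looks in mboost, the example below marks about 30% of the observations as out-of-bag by giving them weight 0 and requests the OOB risk. The simulated data, the Gaussian default family and the 70/30 split are assumptions for illustration and are not the setup from the question; oobweights is written out explicitly even though, as described above, the zero-weight points would be used anyway.

```r
library(mboost)

set.seed(1)

# Simulated data (an assumption for illustration).
n <- 300
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.5)

# 1 = used for fitting (in-bag), 0 = held out (out-of-bag).
inbag <- rbinom(n, size = 1, prob = 0.7)

mod <- gamboost(y ~ bbs(x), data = dat,
                weights = inbag,          # zero weight: excluded from fitting
                oobweights = 1 - inbag,   # held-out points enter the OOB risk
                control = boost_control(mstop = 500, nu = 0.1, risk = "oob"))

# Empirical risk path, here evaluated on the out-of-bag observations.
oob_risk <- risk(mod)

# Position of the smallest OOB risk along the path.
which.min(oob_risk)

plot(oob_risk, type = "l", xlab = "position in risk path", ylab = "OOB risk")
```

The position of the minimum gives a candidate stopping point; before turning it into an mstop value, check whether the risk path also contains the risk of the starting value. For a survey design, one possibility (an assumption to validate against the design, not something the package documentation prescribes) is to multiply the design weights by the in-bag indicator for weights and by its complement for oobweights.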

As a general rule: avoid in-bag error measures, as they are misleading. I would suggest using out-of-bag errors unless there is a specific reason not to (e.g. the methodology followed explicitly says that it needs the in-bag error for some computation). There is an excellent CV thread on this: "How can I help ensure testing data does not leak into training data?"; it is worth a read if one is interested in learning more about this subject.

Thank you, I will take a look at the link you sent. – Johny Arm, Jan 14 '19 at 09:24
