You are correct to assume that when using early stopping, following a train-validation split of our data, we will potentially estimate the optimal number of estimators $M$ as being lower than the one that would be optimal when training on the full dataset, $M_{full}$. In a sense, that is only natural: when we utilise a larger dataset to train our algorithm we should be able to learn a richer set of rules without necessarily over-fitting.
To be clear: the number of iterations $M$ is the one we computed when using early stopping. For XGBoost, assuming we use `train` with early stopping, that can be found under the attribute `best_iteration`.
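For reference, a minimal sketch of where `best_iteration` comes from; the data, hyper-parameters and 80/20 split below are made up purely for illustration:

```python
# A minimal sketch, assuming xgboost and scikit-learn are installed.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    {"objective": "binary:logistic", "eta": 0.1},
    dtrain,
    num_boost_round=5000,               # an upper bound; early stopping picks M
    evals=[(dval, "validation")],
    early_stopping_rounds=50,
    verbose_eval=False,
)

M = booster.best_iteration              # the early-stopping estimate of M
```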
I have not come across a general rule or a research paper on how to accurately estimate the final number of iterations when training on the full training set following a CV procedure. I have come across a rough approximation where, if we use $P\%$ of our data in our validation set and we get $M$ iterations as the optimal number, we can approximate the number of iterations when training with the full dataset as $M_{full} = \frac{M}{1-0.01P}$. This is for example put forward by some experienced Kaggle competitors (competition masters or grand masters) here and here. Similarly, another experienced Kaggle competitor also suggests here multiplying $M$ by a fixed factor close to $1.1$ to get the number $M_{full}$ to be used when training the final model. From personal experience (I am not an experienced Kaggle competitor), I have found that using a slightly increased number of iterations (about 3-10% more than the one suggested by early stopping) indeed improves my leader-board position; i.e. it helps the model trained on the full dataset to have better generalisation performance.
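As a concrete illustration of the approximation above (the numbers are hypothetical):

```python
# A 20% validation split (P = 20) and M = 800 rounds chosen by early stopping.
P = 20
M = 800
M_full = M / (1 - 0.01 * P)   # = 800 / 0.8 = 1000 rounds for the full-data model
M_alt = 1.1 * M               # the "fixed factor" rule of thumb gives ~880 rounds
```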
Note that if we use a cross-validation schema instead of a fixed validation set, each fold might have a different number of optimal iterations $M$. In that case, we need to be careful not to over-simplify things. We should check that the number of optimal iterations per fold is "ball-park the same", e.g. within 10% of the mean of $M$ across all folds; otherwise the per-fold estimates are probably too variable to reasonably average. In that case it would be prudent to make the per-fold performance more stable before continuing (e.g. by stratifying our response variable and/or by increasing the regularisation parameters used).
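A sketch of that sanity check, again with made-up data, hyper-parameters and a 10% tolerance, could look like this:

```python
# Collect per-fold optimal iteration counts and check they are "ball-park
# the same" before averaging them.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data

best_iters = []
for tr_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    dtr = xgb.DMatrix(X[tr_idx], label=y[tr_idx])
    dva = xgb.DMatrix(X[val_idx], label=y[val_idx])
    bst = xgb.train(
        {"objective": "binary:logistic", "eta": 0.1},
        dtr,
        num_boost_round=5000,
        evals=[(dva, "validation")],
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    best_iters.append(bst.best_iteration)

best_iters = np.array(best_iters)
mean_M = best_iters.mean()

# Flag folds whose optimal iteration count is more than ~10% away from the mean.
if np.any(np.abs(best_iters - mean_M) > 0.10 * mean_M):
    print("Per-fold optimal iterations are quite variable:", best_iters)
    # consider stronger regularisation / stratification before averaging
else:
    M = int(round(mean_M))   # a reasonably stable estimate to (optionally) scale up
```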
The above being said, even if we decide not to scale the "early-stopping" optimal number of iterations, we should re-train our model using the full dataset. This has been covered multiple times on CV.SE; see for example the relevant threads for more details.
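For completeness, a sketch of that final step (placeholder data; `M_full` stands for the, optionally scaled-up, number of rounds estimated above):

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data
M_full = 1000                                                    # hypothetical value

dfull = xgb.DMatrix(X, label=y)
final_model = xgb.train(
    {"objective": "binary:logistic", "eta": 0.1},
    dfull,
    num_boost_round=int(M_full),   # no evals / early_stopping_rounds this time
)
```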