
If I use early stopping with an evaluation set for training, what is the best approach when I have to train the model for the final evaluation? Generally I'd train the model on the full dataset, but in this case I can't use the early stopping feature since I have no validation set. Is there a way to obtain the proper n_estimators value from the training run with the evaluation set and then use it as a parameter? Or is it better to use, even for the final result, only the partially trained model obtained with early stopping and the evaluation set?

Thank you

Davide

1 Answer


You are correct to assume that when using early stopping, following a train-validation split of our data, we will potentially estimate the optimal number of estimators $M$ as being lower than the one that would be optimal when training on the full dataset, $M_{full}$. In a sense, that is only natural: when we utilise a larger dataset to train our algorithm, we should be able to learn a richer set of rules without necessarily over-fitting. To be clear: the number of iterations $M$ is the one we computed when using early stopping. For XGBoost, assuming we use `train` with early stopping, it can be found under the attribute `best_iteration`.
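For concreteness, here is a minimal sketch (placeholder data and hypothetical parameter values, illustrative variable names only) of obtaining $M$ from `xgb.train` with early stopping:

```python
# Minimal sketch: estimate the number of boosting rounds M via early stopping.
# The data and the parameter values below are placeholders.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "binary:logistic", "eta": 0.05, "max_depth": 4}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,            # deliberately large upper bound
    evals=[(dvalid, "valid")],
    early_stopping_rounds=50,
    verbose_eval=False,
)

M = booster.best_iteration + 1       # best_iteration is the 0-based index
print("Early-stopping estimate M =", M)
```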

I have not come across a general rule or a research paper on how to accurately estimate the final number of iterations when training on the full training set following a CV procedure. I have come across a rough approximation: if we use $P\%$ of our data in our validation set and early stopping gives $M$ iterations as the optimal number, we can approximate the number of iterations when training with the full dataset as $M_{full} = \frac{M}{1-0.01P}$. This is, for example, put forward by some experienced Kaggle competitors (competition masters or grandmasters) here and here. Similarly, another experienced Kaggle competitor suggests here multiplying $M$ by a fixed factor close to $1.1$ to get the number $M_{full}$ to be used when training the final model. From personal experience (I am not an experienced Kaggle competitor), I have found that using a slightly increased number of iterations (about 3-10% more than the one suggested by early stopping) indeed improves my leaderboard position; i.e. it helps the model trained on the full dataset to have better generalisation performance.
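Purely as an illustration, and reusing the hypothetical `params`, `X`, `y` and `M` from the sketch above (with a 20% validation split as the example value of $P$), the rescaling heuristics would look roughly like:

```python
# Rough rescaling heuristics for the final model, continuing the sketch above.
P = 20                                    # percent of data held out for validation
M_full = int(round(M / (1 - 0.01 * P)))   # M / 0.8, i.e. roughly 1.25 * M
# or the fixed-factor variant:
# M_full = int(round(1.1 * M))

# Re-train on the full dataset with a fixed number of rounds (no early stopping).
dfull = xgb.DMatrix(X, label=y)
final_model = xgb.train(params, dfull, num_boost_round=M_full)
```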

Note that if we use a cross-validation schema instead of a fixed validation set, each fold might have a different optimal number of iterations $M$. In that case we need to be careful not to over-simplify things: it is worth checking that the per-fold optima are in the same ball-park, e.g. within 10% of the mean of $M$ across all folds. Otherwise the estimates are probably too variable to reasonably average, and it would be prudent to make the per-fold performance more stable before continuing (e.g. by stratifying our response variable and/or by increasing the regularisation parameters used).
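A rough sketch of that per-fold sanity check, again reusing the hypothetical `params`, `X`, `y` from the earlier snippets, might look like:

```python
# Per-fold early stopping and a rough "within 10% of the mean" stability check.
from sklearn.model_selection import KFold

best_rounds = []
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    dtr = xgb.DMatrix(X[tr_idx], label=y[tr_idx])
    dva = xgb.DMatrix(X[val_idx], label=y[val_idx])
    bst = xgb.train(params, dtr, num_boost_round=5000,
                    evals=[(dva, "valid")], early_stopping_rounds=50,
                    verbose_eval=False)
    best_rounds.append(bst.best_iteration + 1)

best_rounds = np.asarray(best_rounds)
mean_M = best_rounds.mean()
if np.max(np.abs(best_rounds - mean_M)) > 0.10 * mean_M:
    print("Per-fold optima vary by more than 10% of their mean; consider "
          "stratifying and/or stronger regularisation before averaging.")
M = int(round(mean_M))                   # averaged estimate, if folds are stable
```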

The above being said, even if we decide not to scale the "early-stopping" optimal number of iterations, we should re-train our model using the full dataset. This has been covered multiple times on CV.SE; see for example the related threads there for more details.

usεr11852
  • Thank you for the clear answer, I have just one more problem: how can I obtain the $M$ for my training set? I searched the documentation thoroughly but found no reference. – Davide Mar 11 '20 at 08:26
  • I am happy to help, please consider upvoting the answer if it is useful (or accepting if it clarifies the issue raised). The number of iterations $M$ is the one we computed when using early stopping. For XGBoost, assuming we use `train` with early stopping, that can be found under the attribute [`best_iteration`](https://xgboost.readthedocs.io/en/latest/python/python_intro.html?highlight=best%20iteration). – usεr11852 Mar 11 '20 at 08:46
  • Thank you, absolutely exhaustive. Have a good day – Davide Mar 11 '20 at 09:10
  • I just tried and I noticed that if I use the parameter $M$ found, instead of early stopping and a validation set, I obtain a lower score. Why? Shouldn't it be the same? – Davide Mar 11 '20 at 10:28
  • Apologies, I don't understand the question. $M$ is the one found from early stopping. The working assumption is that for the final model we use $M_{final}$ where $M_{final}$ is say equal to $\rho M$, $\rho$ being approximately $1.05$. – usεr11852 Mar 11 '20 at 10:46
  • To test that `best_iteration` in fact returns the optimal $M$ for my training set and model, I first trained XGB with an evaluation set and early stopping, then took the corresponding optimal $M$ and trained the model again, this time with fixed `n_estimators = M`. I expected to find the same result on the validation set, but that was not the case. (`X_train` and `X_valid` were obviously the same in both cases.) – Davide Mar 11 '20 at 10:57
  • I don't think we can test the optimality of $M$ that way. We need a separate test set. In addition, maybe there were some other differences (e.g. sampling seeds). The cause of the difference is a separate question from the one posted originally, so it might be worth making a new question. – usεr11852 Mar 11 '20 at 11:08