You are correct to assume that when using early stopping, following a train-validation split of our data, we will potentially estimate the optimal number of estimators $M$ as being lower than the one that would be optimal when training on the full dataset, $M_{full}$. In a sense, that is only natural: when we utilise a larger dataset to train our algorithm we should be able to learn a richer set of rules without necessarily over-fitting.
To be clear: the number of iterations $M$ is the one we computed when using early stopping. For XGBoost, assuming we use `train` with early stopping, that can be found under the attribute `best_iteration`.
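For reference, a minimal sketch of where `best_iteration` comes from; the data, hyper-parameters and 80/20 split below are made up purely for illustration:

```python
# A minimal sketch, assuming xgboost and scikit-learn are installed.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    {"objective": "binary:logistic", "eta": 0.1},
    dtrain,
    num_boost_round=5000,               # an upper bound; early stopping picks M
    evals=[(dval, "validation")],
    early_stopping_rounds=50,
    verbose_eval=False,
)

M = booster.best_iteration              # the early-stopping estimate of M
```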
I have not come across a general rule or a research paper on how to accurately estimate the final number of iterations when training on the full training set following a CV procedure. I have come across a rough approximation where, if we use $P\%$ of our data in our validation set and we get $M$ iterations as the optimal number, we can approximate the number of iterations when training with the full dataset as $M_{full} = \frac{M}{1-0.01P}$. This is for example put forward by some experienced Kaggle competitors (competition masters or grand masters) here and here. Similarly, another experienced Kaggle competitor also suggests here multiplying $M$ by a fixed factor close to $1.1$ to get the number $M_{full}$ to be used when training the final model. From personal experience (I am not an experienced Kaggle competitor), I have found that using a slightly increased number of iterations (about 3-10% more than the one suggested by early stopping) indeed improves my leader-board position; i.e. it helps the model trained on the full dataset to have better generalisation performance.
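As a concrete illustration of the approximation above (the numbers are hypothetical):

```python
# A 20% validation split (P = 20) and M = 800 rounds chosen by early stopping.
P = 20
M = 800
M_full = M / (1 - 0.01 * P)   # = 800 / 0.8 = 1000 rounds for the full-data model
M_alt = 1.1 * M               # the "fixed factor" rule of thumb gives ~880 rounds
```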
Note that if we use a cross-validation schema instead of a fixed validation set, each fold might have a different number of optimal iterations $M$. In that case, we need to be careful not to over-simplify things. We should check that the number of optimal iterations per fold is "ball-park the same", e.g. within 10% of the mean of $M$ across all folds; otherwise the per-fold estimates are probably too variable to reasonably average. In that case it would be prudent to make the per-fold performance more stable before continuing (e.g. by stratifying our response variable and/or by increasing the regularisation parameters used).
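A sketch of that sanity check, again with made-up data, hyper-parameters and a 10% tolerance, could look like this:

```python
# Collect per-fold optimal iteration counts and check they are "ball-park
# the same" before averaging them.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data

best_iters = []
for tr_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    dtr = xgb.DMatrix(X[tr_idx], label=y[tr_idx])
    dva = xgb.DMatrix(X[val_idx], label=y[val_idx])
    bst = xgb.train(
        {"objective": "binary:logistic", "eta": 0.1},
        dtr,
        num_boost_round=5000,
        evals=[(dva, "validation")],
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    best_iters.append(bst.best_iteration)

best_iters = np.array(best_iters)
mean_M = best_iters.mean()

# Flag folds whose optimal iteration count is more than ~10% away from the mean.
if np.any(np.abs(best_iters - mean_M) > 0.10 * mean_M):
    print("Per-fold optimal iterations are quite variable:", best_iters)
    # consider stronger regularisation / stratification before averaging
else:
    M = int(round(mean_M))   # a reasonably stable estimate to (optionally) scale up
```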
The above being said, even if we decide not to scale the "early-stopping" optimal number of iterations, we should re-train our model using the full dataset. This has been covered multiple times on CV.SE; see for example the relevant threads for more details.
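For completeness, a sketch of that final step (placeholder data; `M_full` stands for the, optionally scaled-up, number of rounds estimated above):

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # placeholder data
M_full = 1000                                                    # hypothetical value

dfull = xgb.DMatrix(X, label=y)
final_model = xgb.train(
    {"objective": "binary:logistic", "eta": 0.1},
    dfull,
    num_boost_round=int(M_full),   # no evals / early_stopping_rounds this time
)
```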