
I want to optimize hyperparameters of XGBoost using cross-validation. However, it is not clear how to obtain the model from xgb.cv. For instance, I call objective(params) from fmin. Then the model is fitted on dtrain and validated on dvalid. What if I want to use KFold cross-validation instead of training on dtrain?

from hyperopt import fmin, tpe, hp
import xgboost as xgb

params = {
             'n_estimators' : hp.quniform('n_estimators', 100, 1000, 1),
             'eta' : hp.quniform('eta', 0.025, 0.5, 0.025),
             'max_depth' : hp.quniform('max_depth', 1, 13, 1)
             #...
         }

def objective(params):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    model = xgb.train(params, dtrain, num_boost_round,
                      evals=watchlist, feval=myFunc)
    # objective should return the validation loss for fmin to minimize
    # xgb.cv(params, dtrain, num_boost_round, nfold=5, seed=0,
    #        feval=myFunc)

best = fmin(objective, space=params, algo=tpe.suggest)
Klausos
  • I suggest shap-hypetune to industrialize parameter tuning (and also feature selection) with xgboost and hyperopt (https://github.com/cerlymarco/shap-hypetune) – Marco Cerliani Dec 27 '21 at 14:30

2 Answers


This is how I have trained an XGBoost classifier with 5-fold cross-validation, using randomized search for hyperparameter optimization to maximize the F1 score. Note that X and y here should be pandas DataFrames.

import numpy as np
from scipy import stats
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import f1_score

clf_xgb = XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 500),
              'learning_rate': stats.uniform(0.01, 0.07),
              'subsample': stats.uniform(0.3, 0.7),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.45),
              'min_child_weight': [1, 2, 3]
             }
clf = RandomizedSearchCV(clf_xgb, param_distributions = param_dist,
                         n_iter = 25, scoring = 'f1', error_score = 0,
                         verbose = 3, n_jobs = -1)

numFolds = 5
folds = KFold(n_splits = numFolds, shuffle = True)

estimators = []             # best estimator found in each outer fold
results = np.zeros(len(X))  # out-of-fold predictions for the whole dataset
score = 0.0
for train_index, test_index in folds.split(X):
    X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_test = y.iloc[train_index].values.ravel(), y.iloc[test_index].values.ravel()
    # inner CV: randomized search over param_dist on this outer training fold
    clf.fit(X_train, y_train)

    estimators.append(clf.best_estimator_)
    results[test_index] = clf.predict(X_test)
    score += f1_score(y_test, results[test_index])
score /= numFolds

At the end, you get a list of trained classifiers in estimators, a prediction for the entire dataset in results constructed from out-of-fold predictions, and an estimate for the $F_1$ score in score.
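If you need a single prediction for unseen data, one option (as I mention in the comments below) is to use the five estimators as an ensemble, for example by averaging their predicted probabilities. A minimal sketch, assuming X_new is a hypothetical DataFrame of unseen samples with the same columns as X:

# X_new is assumed to be unseen data with the same columns as X
# average the predicted positive-class probability over the 5 fold models
proba = np.mean([est.predict_proba(X_new)[:, 1] for est in estimators], axis=0)
y_new_pred = (proba >= 0.5).astype(int)  # threshold averaged probabilities at 0.5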

darXider
  • How does this code manage num_boost_round and early_stopping_rounds? – mfaieghi Mar 18 '20 at 19:53
  • for whoever reading it, do not use the code above - the logic behind it is wrong – Sergey Leyko Jan 06 '21 at 09:11
  • @SergeyLeyko thanks for your input. care to elaborate why the logic is wrong? – darXider Jan 06 '21 at 18:42
  • @darXider, sure. 1 - you have trained 5 models instead of one; the topic starter Klausos asked "However, it is not clear how to obtain the model from xgb.cv." - he wanted a single model, so it's not clear which model to use for unseen data and with what parameters. 2 - you optimize hyperparameters for each fold, which is already strange. In addition, you are doing CV inside of another CV. – Sergey Leyko Jan 13 '21 at 14:29
  • @SergeyLeyko Yes, please read my other comments in this thread. This procedure is called "nested cross-validation," and it's a way to remove (or lower) the upward bias in performance estimation from regular cross-validation. All 5 models obtained here are equivalent (so no preference at all), but you can use all 5 to form an ensemble model. The fact that you find this strange or that you haven't heard of this doesn't mean that it's "wrong." I suggest you read up on nested cross-validation. – darXider Jan 13 '21 at 18:45
  • @darXider I've read your comment and about nested cross-validation. Still, as a result you have 5 hyper-tuned models: ensembling models of the same nature is not a good idea, not to mention using it in prod. You don't use early stopping while finding hyperparameters, so your 'inner' models are not optimal. You have to use the xgboost cv implementation because its CV logic is right (it evaluates scores after each boosting round). And the nested CV approach is not proven until tested on an unseen test set. You can always 'go deeper' hoping you are improving something. – Sergey Leyko Jan 14 '21 at 10:49

I don't have enough reputation to comment on @darXider's answer, so I am adding an "answer" to make a few comments.

Why do you need the for train_index, test_index in folds.split(X) loop, since clf is already doing cross-validation to pick the best set of hyper-parameter values?

In your code, it looks like you perform CV for each of the five folds (a "nested" CV) to pick the best model for that particular fold. So in the end, you will have five "best" estimators. Most likely, they don't have the same hyper-parameter values.

Correct me if I am wrong.
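If I understand the logic correctly, the outer loop amounts to standard nested cross-validation, which could also be written more compactly with scikit-learn's cross_val_score. A sketch of the idea, reusing clf, X and y from the answer above:

from sklearn.model_selection import KFold, cross_val_score

# outer 5-fold CV; each outer fold re-runs the full randomized search (inner CV) on its training part
outer_cv = KFold(n_splits=5, shuffle=True)
nested_scores = cross_val_score(clf, X, y.values.ravel(), cv=outer_cv, scoring='f1')
print(nested_scores.mean(), nested_scores.std())

The difference is that cross_val_score only returns the fold scores, whereas the explicit loop in the answer also keeps the fitted estimators and the out-of-fold predictions.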

panc
  • Yes, by default RandomizedSearchCV uses 3-fold CV to determine the params. It can be changed to any other number of folds if required. – Satwik Bhattamishra Jul 25 '18 at 07:51
  • This is, as you noticed, a nested cross-validation scheme, and you are right that the five "best" models don't have the same hyper-parameters. However, in the end, you get 5 equivalent "best" models (and you can use them in an ensemble, for example) to do your predictions. Moreover, what this scheme accomplishes is that it gives you predictions for the entire dataset (as I mentioned in my answer, by combining the out-of-fold predictions of each model). In addition, it also gives you an estimate for the spread in the score (as opposed to just one value). – darXider Nov 22 '18 at 18:16