
I am trying to control overfitting in xgboost in R using eta, but when I compare the xgb.cv readout to the xgb.train readout, I don't understand why xgb.cv doesn't seem to overfit while xgb.train does. How can I get the same nice downward progression of mlogloss in xgb.train? I balanced my classes before running the model.

[1] "###########          i is 1 and j 1            ##################"
[1] "Creating cv..."
# this part is good -------------------
[0] train-mlogloss:1.609325+0.000006    test-mlogloss:1.609315+0.000009
[100]   train-mlogloss:1.601508+0.001238    test-mlogloss:1.602480+0.001071
[200]   train-mlogloss:1.594359+0.002151    test-mlogloss:1.596278+0.001812
[300]   train-mlogloss:1.587120+0.002100    test-mlogloss:1.589944+0.001546
[400]   train-mlogloss:1.580558+0.001839    test-mlogloss:1.584062+0.001251
[1] "Took 160 seconds to cv train with 500 rounds..."

[1] "Creating model..."
# this part is bad -------------------
[0] train-mlogloss:1.609341 test-mlogloss:1.609383
[100]   train-mlogloss:1.602439 test-mlogloss:1.609435
[200]   train-mlogloss:1.594991 test-mlogloss:1.609580
[300]   train-mlogloss:1.587814 test-mlogloss:1.609732

My code for cv and train, with my parameters, is:

param = list("objective" = "multi:softprob"
             , "eval_metric" = "mlogloss"
             , 'num_class' = 5
             , 'eta' = 0.001)

bst.cv = xgb.cv(param = param
                , data = ce.dmatrix
                , nrounds  = nrounds
                , nfold = 4
                , stratified = T
                , print.every.n = 100
                , watchlist = watchlist
                , early.stop.round = 10
)
bst = xgb.train(param = param
                , data = ce.dmatrix
                , nrounds  = nrounds
                , print.every.n = 100
                , watchlist = watchlist
                # , early.stop.round = 10
)
Moderat
    Possible duplicate of [Cross-Validation in plain english?](http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english) – Sycorax Apr 10 '16 at 22:12

3 Answers


I just lost a couple of days on perhaps the same issue. TL;DR: are you sure your watchlist has the same number and order of columns as your ce.dmatrix?

In the current implementation of xgb.cv, any watchlist argument you pass in is ignored. xgb.cv ends up calling xgb.cv.mknfold, which forcibly sets the watchlist for each fold, as below:

for (k in 1:nfold) {
  dtest <- slice(dall, folds[[k]])
  didx <- c()
  for (i in 1:nfold) {
    if (i != k) {
      didx <- append(didx, folds[[i]])
    }
  }
  dtrain <- slice(dall, didx)
  bst <- xgb.Booster(param, list(dtrain, dtest))
  # the watchlist is rebuilt from the fold itself; any user-supplied one is discarded
  watchlist <- list(train=dtrain, test=dtest)
  ret[[k]] <- list(dtrain=dtrain, booster=bst, watchlist=watchlist, index=folds[[k]])
}

This is reasonable, since, as others have said, passing a watchlist to xgb.cv doesn't make much sense. The upshot is that the "test" shown in your cv output is not the same data set as the "test" shown in your xgb.train output.
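
As an aside, here is a minimal sketch of building a watchlist for xgb.train whose columns are guaranteed to match the training data: carve both sets out of a single feature matrix. The names ce.matrix and ce.labels are hypothetical stand-ins for the question's data.

library(xgboost)

# hypothetical stand-ins for the question's data
set.seed(42)
ce.matrix <- matrix(rnorm(1000 * 20), nrow = 1000,
                    dimnames = list(NULL, paste0("f", 1:20)))
ce.labels <- sample(0:4, 1000, replace = TRUE)

# split the rows of ONE matrix, so train and test share columns by construction
test.idx <- sample(nrow(ce.matrix), 250)
dtrain <- xgb.DMatrix(ce.matrix[-test.idx, ], label = ce.labels[-test.idx])
dtest  <- xgb.DMatrix(ce.matrix[test.idx, ], label = ce.labels[test.idx])
watchlist <- list(train = dtrain, test = dtest)

Built this way, the test-mlogloss that xgb.train reports comes from a genuine held-out set with exactly the columns the model expects.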

xgb.train calls xgb.iter.eval to evaluate the chosen metric on the in-sample and watchlist data. xgb.iter.eval does the actual computation like this:

msg <- paste("[", iter, "]", sep="")
for (j in 1:length(watchlist)) {
  w <- watchlist[j]
  if (length(names(w)) == 0) {
    stop("xgb.eval: name tag must be presented for every elements in watchlist")
  }
  preds <- predict(booster, w[[1]])
  ret <- feval(preds, w[[1]])
  msg <- paste(msg, "\t", names(w), "-", ret$metric, ":", ret$value, sep="")
}

So it calls predict() using the booster handle. Since this is the same booster handle class returned by xgb.train, it is equivalent to calling predict() with your finished model.

Somewhere in the bowels of the C++ implementation of Booster, it appears that predict() does not verify that the column names of the data you pass in match the column names of the data your model was built from. It doesn't even check that the number of columns is correct. You can see this for yourself by examining the output of the following calls:

# note: the column subsetting below assumes a plain feature matrix; an xgb.DMatrix
# generally cannot be subset by column, so use the matrix ce.dmatrix was built from
head(predict(bst, newdata = ce.dmatrix))
# predict using only the first 10 columns; missing values default to 0
head(predict(bst, newdata = ce.dmatrix[, 1:10]))
# predict using the wrong columns, because column names are ignored
head(predict(bst, newdata = ce.dmatrix[, sample(ncol(ce.dmatrix))]))

So if your watchlist "test" set is defined incorrectly, you will see exactly the kind of odd behavior you are seeing. You can check whether the column sets match with something along these lines:

colnames(ce.dmatrix)[!(colnames(ce.dmatrix) %in% colnames(watchlist[[1]]))]
colnames(watchlist[[1]])[!(colnames(watchlist[[1]]) %in% colnames(ce.dmatrix))]

In my case, I was cleaning my test and training data separately, and because some factor levels showed up in training but not in test, my test data had the wrong number of columns, with columns in the wrong places.
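
One way to guard against this (a sketch with hypothetical data frames train.df and test.df) is to encode train and test together, so every factor level produces the same dummy columns in both, and only then split them back apart:

# hypothetical data frames that were cleaned separately
train.df <- data.frame(color = c("red", "green", "red"), x = 1:3)
test.df  <- data.frame(color = c("blue", "red"), x = 4:5)

# encode jointly so the dummy columns always line up, then split back
n.train <- nrow(train.df)
all.df  <- rbind(train.df, test.df)
X <- model.matrix(~ . - 1, data = all.df)
X.train <- X[seq_len(n.train), , drop = FALSE]
X.test  <- X[-seq_len(n.train), , drop = FALSE]
stopifnot(identical(colnames(X.train), colnames(X.test)))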

Hope that helps.

Jake

The documentation is a bit nebulous to me, but the whole point of cross-validation is to select the best hyperparameters and so avoid overfitting. So xgb.cv uses cross-validation to tune the parameters before testing, thereby avoiding overfitting.
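
Concretely, the usual workflow is to let xgb.cv estimate out-of-sample error, pick the number of rounds where the held-out mlogloss bottoms out, and refit the final model with xgb.train. A sketch reusing the question's objects and the older argument names (the test.mlogloss.mean column is an assumption about this version's return value):

# cross-validate to find where held-out mlogloss bottoms out
bst.cv <- xgb.cv(param = param
                 , data = ce.dmatrix
                 , nrounds = 2000
                 , nfold = 4
                 , stratified = T
                 , early.stop.round = 10)
best.n <- which.min(bst.cv$test.mlogloss.mean)  # assumed readout column name
# refit on all the data with the chosen number of rounds
bst <- xgb.train(param = param
                 , data = ce.dmatrix
                 , nrounds = best.n)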

rep_ho
  • `xgb.cv` actually doesn't do any hyperparameter tuning. It performs cross validation to estimate model error. – Arthur Sep 26 '21 at 03:43

Why would you need a watchlist in the CV method? The respective CV folds ARE the watchlist! I don't know the R command, but in Python verbose_eval=True returns the per-iteration output you are looking for. My guess is that since CV is only used for hyperparameter tuning and doesn't return a model by itself, the watchlist parameter somehow interferes with the proper triggering of early.stop.round.
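
For the R package of that era, the closest equivalent appears to be the verbose and print.every.n arguments; a sketch reusing the question's objects (whether verbose existed under exactly that name in this version is an assumption):

bst.cv <- xgb.cv(param = param
                 , data = ce.dmatrix
                 , nrounds = nrounds
                 , nfold = 4
                 , stratified = T
                 , verbose = T           # assumed name; prints fold means each round
                 , print.every.n = 100)  # from the question; thins the printout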

P.S.: Your eta parameter is very low. I've never used an eta value lower than 0.01...

Anonymous