I'm using xgboost to fit a logistic model with a binary outcome on a set of training data. When I apply the fitted model to a new, already-classified test set, some of the predicted probabilities are greater than 1 or less than 0. Predictions on the training data show the same problem, but only marginally (the out-of-range values are around 1.0001 and -0.001), whereas on the test data I'm getting probabilities > 1.2 and < -0.1.
What would cause predictions from an xgboost model to fall outside [0, 1]? Here are the parameters I'm passing, the call that builds the model, and the call I use for prediction:
# Select xgboost parameters
parameters = list(eta = 0.005,
                  max_depth = 15,
                  subsample = 0.5,
                  colsample_bytree = 0.5,
                  seed = 1,
                  eval_metric = "auc",
                  objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, parameters,
                      nround = which.max(cvreduced$evaluation_log$test_auc_mean))

# Load a file without also loading the name of the R object that the file is storing
load.to.prompt = function(file_str){
  tmp = load(file_str)
  return(get(tmp))
}

# Do prediction
classifier = load.to.prompt(model.file)
bst = xgb.load(classifier$rawbst)
preds = data.frame(issue_id = issue_ids, td_prob = round(predict(bst, newdata = xdat), 3))
Here are some example outputs (the highest and lowest predicted values):
id prob
295 1.257
240113 1.256
576589 1.254
509199 1.245
367088 1.24
479162 1.24
462561 1.225
367201 1.223
227231 1.191
433179 1.186
...
11789 -0.073
37729 -0.073
58831 -0.073
99241 -0.073
422522 -0.073
419522 -0.082
420048 -0.082
526770 -0.082
87704 -0.089
409858 -0.089
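For context on why these values look wrong to me: as I understand it, with objective = "binary:logistic" each prediction should be the logistic transform of a raw margin, and that transform always lies strictly between 0 and 1. A minimal base-R sketch of that expectation (the `sigmoid` helper and the margin values are made up purely for illustration, not taken from my model):

```r
# Logistic (sigmoid) transform: maps any real-valued margin into (0, 1)
sigmoid = function(margin) 1 / (1 + exp(-margin))

# Hypothetical raw margins, chosen only for illustration
margins = c(-10, -1, 0, 1, 10)
probs = sigmoid(margins)

# Every transformed value stays strictly inside the unit interval
stopifnot(all(probs > 0 & probs < 1))
print(range(probs))
```

So values like 1.257 or -0.089 make me suspect that the logistic objective is not actually being applied somewhere between training and prediction, but I can't see where.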