
I'm currently using xgboost to try and fit a logistic model with a binary outcome on a set of training data, but when I use the model that I get from this training data on a new set of classified test data, the predictions I'm getting back give me probabilities that are greater than 1 and less than 0. I've noticed that predictions on the training data are also prone to this problem, but on the test data I'm getting probabilities > 1.2 and < -0.1, whereas the training data's predicted values that exceed 1 and 0 are something like 1.0001 and -0.001.

What would cause predicting with the xgboost model to do this? Here are the parameters I'm passing, the line that builds the model, and the line that does the prediction:

# Select xgboost parameters
parameters = list(eta = 0.005,
          max_depth = 15,
          subsample = 0.5,
          colsample_bytree = 0.5,
          seed = 1,
          eval_metric = "auc",
          objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, parameters,
                  nround = which.max(cvreduced$evaluation_log$test_auc_mean))

# Load a file without also loading the name of the R object that the file is storing
load.to.prompt = function(file_str){
    tmp = load(file_str)
    return(get(tmp))
}

# Do prediction
classifier = load.to.prompt(model.file)
bst = xgb.load(classifier$rawbst)
preds = data.frame(issue_id = issue_ids, td_prob = round(predict(bst, newdata = xdat), 3))

Here are some example outputs:

id      prob
295     1.257
240113  1.256
576589  1.254
509199  1.245
367088  1.24
479162  1.24
462561  1.225
367201  1.223
227231  1.191
433179  1.186
...
11789   -0.073
37729   -0.073
58831   -0.073
99241   -0.073
422522  -0.073
419522  -0.082
420048  -0.082
526770  -0.082
87704   -0.089
409858  -0.089
Alex Zhao
    This is almost always because you are looking at log-odds, not probabilities. – Matthew Drury Jun 23 '17 at 16:34
  • Is there anything in particular I should change about my code that would actually give me probabilities in the prediction? I tried a manual adjustment (e^value/(1+e^value)) for the values above but that doesn't seem right either since then my probabilities roughly go from 0.5 to 0.7 – Alex Zhao Jun 23 '17 at 20:21
  • There may be a problem in the way that params is being passed as a positional argument instead of a named one: https://stats.stackexchange.com/questions/290538/bugs-in-xgboost-logistic-regression – zkurtz Jul 09 '17 at 00:11
  • @MatthewDrury: Thanks. But I don't know why the tree needs to return log-odds. Any ideas why? Is it related to the optimization for regression? Thanks – Catbuilts Apr 02 '19 at 04:38

2 Answers


This was an issue with the code: the parameter list wasn't being passed to xgboost correctly. It was supplied positionally rather than through the named `params` argument, so the settings in it, including `objective = "binary:logistic"`, never reached xgboost, and the model produced unbounded scores instead of probabilities. The wrong code (from above, with asterisks around the problem area):

parameters = list(eta = 0.005,
      max_depth = 15,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, **parameters**,
nround=which.max(cvreduced$evaluation_log$test_auc_mean))

The correct code (passing the list through the named argument, params = parameters):

parameters = list(eta = 0.005,
      max_depth = 15,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, params = parameters,
nround=which.max(cvreduced$evaluation_log$test_auc_mean))
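The difference comes down to R's argument matching: supplied positionally, the list is matched to the next unmatched formal argument of the function rather than to `params`, so everything in it is silently misapplied. A minimal sketch of the mechanism using a hypothetical function (`fit` and its signature are illustrative only, not xgboost's actual formals):

```r
# A toy function whose signature is shaped like xgboost()'s:
# `data` first, other formals next, `params` further down the list.
fit <- function(data, label = NULL, params = list()) {
  list(label = label, params = params)  # report where the arguments landed
}

my_params <- list(objective = "binary:logistic", eta = 0.005)
x <- matrix(rnorm(20), nrow = 10)

# Positional: the list is matched to `label`, and `params` stays empty.
wrong <- fit(x, my_params)

# Named: the list reaches `params` as intended.
right <- fit(x, params = my_params)

length(wrong$params)  # 0 -- the objective was silently dropped
length(right$params)  # 2
```

Because nothing errors out, the misrouted list is easy to miss; the only symptom is a model trained with default settings.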
Alex Zhao

The parameter outputmargin must be set to FALSE in the predict function for it to return probabilities; otherwise the output values are the log-odds (the raw margin).

To convert log-odds to probabilities you can calculate $p(x) = e^{x} / (1 + e^{x})$; the resulting values lie in the interval $(0,1)$.
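In R this transform is available as `plogis()` (the logistic CDF, exactly $e^{x}/(1+e^{x})$), so the margins can be converted without writing the formula by hand:

```r
margins <- c(1.257, 0, -0.089)  # a few of the raw outputs from the question
probs <- plogis(margins)        # logistic transform, maps the real line to (0, 1)
probs                           # every value now lies strictly between 0 and 1
plogis(0)                       # exactly 0.5: a margin of 0 is an even split
```

As the comment thread notes, though, this only yields meaningful probabilities if the model was actually trained with the binary:logistic objective in the first place.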

cdutra
  • I've rerun it with that option and the output hasn't changed. It also looks like the predict function when used with xgboost models has that on by default. – Alex Zhao Jul 03 '17 at 14:12