
I'm currently using xgboost to try and fit a logistic model with a binary outcome on a set of training data, but when I use the model that I get from this training data on a new set of classified test data, the predictions I'm getting back give me probabilities that are greater than 1 and less than 0. I've noticed that predictions on the training data are also prone to this problem, but on the test data I'm getting probabilities > 1.2 and < -0.1, whereas the training data's predicted values that exceed 1 and 0 are something like 1.0001 and -0.001.

What would cause predicting with the xgboost model to do this? Here are the parameters I'm passing, the line that builds the model, and the line that does the prediction:

# Select xgboost parameters
parameters = list(eta = 0.005,
          max_depth = 15,
          subsample = 0.5,
          colsample_bytree = 0.5,
          seed = 1,
          eval_metric = "auc",
          objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, parameters,
                  nround = which.max(cvreduced$evaluation_log$test_auc_mean))

# Load a file without also loading the name of the R object that the file is storing
load.to.prompt = function(file_str){
    tmp = load(file_str)
    return(get(tmp))
}

# Do prediction
classifier = load.to.prompt(model.file)
bst = xgb.load(classifier$rawbst)
preds = data.frame(issue_id = issue_ids, td_prob = round(predict(bst, newdata = xdat), 3))

Here are some example outputs:

id      prob
295     1.257
240113  1.256
576589  1.254
509199  1.245
367088  1.24
479162  1.24
462561  1.225
367201  1.223
227231  1.191
433179  1.186
...
11789   -0.073
37729   -0.073
58831   -0.073
99241   -0.073
422522  -0.073
419522  -0.082
420048  -0.082
526770  -0.082
87704   -0.089
409858  -0.089
Alex Zhao
    This is almost always because you are looking at log-odds, not probabilities. – Matthew Drury Jun 23 '17 at 16:34
  • Is there anything in particular I should change about my code that would actually give me probabilities in the prediction? I tried a manual adjustment (e^value/(1+e^value)) for the values above but that doesn't seem right either since then my probabilities roughly go from 0.5 to 0.7 – Alex Zhao Jun 23 '17 at 20:21
  • There may be a problem in the way that params is being passed as a positional argument instead of a named one: https://stats.stackexchange.com/questions/290538/bugs-in-xgboost-logistic-regression – zkurtz Jul 09 '17 at 00:11
  • @MatthewDrury: Thanks. But I don't know why the tree needs to return log-odds. Any ideas why? Is it related to the optimization for regression? Thanks – Catbuilts Apr 02 '19 at 04:38

2 Answers


This was an issue with the code: the parameter list wasn't being passed to xgboost correctly. It was supplied positionally rather than through the named `params` argument, so the settings in it, including `objective = "binary:logistic"`, never reached xgboost, and the model produced unbounded scores instead of probabilities. The wrong code (from above, with asterisks around the problem area):

parameters = list(eta = 0.005,
      max_depth = 15,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, **parameters**,
nround=which.max(cvreduced$evaluation_log$test_auc_mean))

The correct code (passing the list through the named argument, params = parameters):

parameters = list(eta = 0.005,
      max_depth = 15,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      objective = "binary:logistic")

# Fit a model
xgb_reduced = xgboost(data = xgbtrain_reduced, params = parameters,
nround=which.max(cvreduced$evaluation_log$test_auc_mean))
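The difference comes down to R's argument matching: supplied positionally, the list is matched to the next unmatched formal argument of the function rather than to `params`, so everything in it is silently misapplied. A minimal sketch of the mechanism using a hypothetical function (`fit` and its signature are illustrative only, not xgboost's actual formals):

```r
# A toy function whose signature is shaped like xgboost()'s:
# `data` first, other formals next, `params` further down the list.
fit <- function(data, label = NULL, params = list()) {
  list(label = label, params = params)  # report where the arguments landed
}

my_params <- list(objective = "binary:logistic", eta = 0.005)
x <- matrix(rnorm(20), nrow = 10)

# Positional: the list is matched to `label`, and `params` stays empty.
wrong <- fit(x, my_params)

# Named: the list reaches `params` as intended.
right <- fit(x, params = my_params)

length(wrong$params)  # 0 -- the objective was silently dropped
length(right$params)  # 2
```

Because nothing errors out, the misrouted list is easy to miss; the only symptom is a model trained with default settings.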
Alex Zhao

The parameter outputmargin must be set to FALSE in the predict function for it to return probabilities; otherwise the output values are the log-odds (the raw margin).

To convert log-odds to probabilities you can calculate $p(x) = e^{x} / (1 + e^{x})$; the resulting values lie in the interval $(0,1)$.
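In R this transform is available as `plogis()` (the logistic CDF, exactly $e^{x}/(1+e^{x})$), so the margins can be converted without writing the formula by hand:

```r
margins <- c(1.257, 0, -0.089)  # a few of the raw outputs from the question
probs <- plogis(margins)        # logistic transform, maps the real line to (0, 1)
probs                           # every value now lies strictly between 0 and 1
plogis(0)                       # exactly 0.5: a margin of 0 is an even split
```

As the comment thread notes, though, this only yields meaningful probabilities if the model was actually trained with the binary:logistic objective in the first place.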

cdutra
  • I've rerun it with that option and the output hasn't changed. It also looks like the predict function when used with xgboost models has that on by default. – Alex Zhao Jul 03 '17 at 14:12