xgboost prediction threshold

Question

I am trying to classify the data set "Insurance Company Benchmark (COIL 2000) Data Set" which can be found in Dataset.

I am using XGBoost in R for the classification (I am new to the XGBoost algorithm), and the code I have come up with is as follows:

D <- read.csv("ticdata2000.csv", header = TRUE)

# dim(D)    # O/P- 5823 86

# Make training and testing splits-
train_indices <- sample(seq_len(nrow(D)), floor(0.7 * nrow(D)), replace = FALSE)

training <- D[train_indices, ]
testing <- D[-train_indices, ]


library(xgboost)

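# Train/Fit model (classifier)-
# objective = "binary:logistic" makes predict() return probabilities in (0, 1);
# without it, xgboost falls back to its default regression objective.
model_classifier <- xgboost(data = as.matrix(training[-86]), label = training$C86, nrounds = 100, eta = 0.1, gamma = 1, objective = "binary:logistic")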


# Make predictions using trained model-
preds <- predict(model_classifier, as.matrix(testing[-86]))


# Convert floating-point values to either 0 or 1 according to 'C86' column
# (vectorized; no loop needed)-
# preds <- ifelse(preds < 0.1, 0, 1)


length(preds)       # O/P- 1747
length(unique(preds))   # O/P- 408

For XGBoost, the predictions are floating-point values, and they need to be converted to class labels (for classification) at whichever threshold is appropriate for the model. How do I decide which threshold is appropriate for my model? The final prediction has to be either 0 or 1.

The minimum and maximum values in the "preds" variable are as follows:

min(preds)    # O/P- 0.03360531
max(preds)    # O/P- 0.3086071

Of course, these values are bound to change as I have not used a seed value.
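
For reference, a minimal sketch of how fixing the seed makes the random train/test split repeatable. The toy data frame `D_toy` and the seed value 42 are arbitrary assumptions for illustration, not part of the original code:

```r
# Toy stand-in for the real data (hypothetical; 10 rows, 2 columns)
D_toy <- data.frame(x = 1:10, y = rep(0:1, 5))

# Draw the same 70% split twice, resetting the seed before each draw
set.seed(42)
idx1 <- sample(seq_len(nrow(D_toy)), floor(0.7 * nrow(D_toy)))

set.seed(42)
idx2 <- sample(seq_len(nrow(D_toy)), floor(0.7 * nrow(D_toy)))

# With the seed reset, both draws select exactly the same rows
same_split <- identical(idx1, idx2)
```

Calling `set.seed()` once before `sample()` in the original script should have the same effect on the split there.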

Any help is appreciated!

Thanks

Matthew Drury · Answer 1 · 2018-08-02T18:06:49.257

4

You need to define what your goal is in making these classifications.

In the dataset, you are trying to predict whether a customer is going to purchase an insurance policy. Ostensibly, the model's predictions are going to be used to intervene in the sales process in some way. The correct way to set a threshold is to answer the following questions (and possibly more in the same vein):

How much will the intervention in the sales process cost?
How often will the intervention result in a successful purchase?
How often will the intervention fail to result in a successful purchase?
How much profit will the company make if the customer buys the insurance policy?

Given this type of information, you can calculate the profit to the company at each possible threshold. For example, if the threshold is 0.1, we will intervene with this set of customers; this many of them will purchase the product and this many won't, so we will make this much money. Now set the threshold to maximize the profit.
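
The calculation described above might be sketched as follows. Everything numeric here is an assumption made up for illustration: `probs` and `actual` are simulated stand-ins for held-out predictions and outcomes, `cost_per_contact` and `profit_per_sale` are invented figures, and the sketch simplifies by treating every contacted customer with `actual == 1` as a sale.

```r
# Hypothetical inputs: none of these numbers come from the original post
set.seed(1)
n      <- 1000
probs  <- runif(n, 0.03, 0.31)   # stand-in for predicted purchase probabilities
actual <- rbinom(n, 1, probs)    # stand-in for observed 0/1 outcomes

cost_per_contact <- 5    # assumed cost of intervening with one customer
profit_per_sale  <- 60   # assumed profit when a contacted customer buys

# Profit if we contact every customer whose score is at or above the threshold
profit_at <- function(threshold) {
  contacted <- probs >= threshold
  sum(actual[contacted]) * profit_per_sale - sum(contacted) * cost_per_contact
}

# Evaluate a grid of candidate thresholds and keep the most profitable one
thresholds     <- seq(0, 0.35, by = 0.01)
profits        <- vapply(thresholds, profit_at, numeric(1))
best_threshold <- thresholds[which.max(profits)]
```

Under these assumptions, contacting a customer pays off in expectation when probability times profit exceeds the contact cost, so with well-calibrated probabilities the break-even threshold would be `cost_per_contact / profit_per_sale` (5 / 60, about 0.083); the grid search should land near that value, up to sampling noise.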

edited Aug 02 '18 at 18:06

answered Aug 02 '18 at 18:00

Matthew Drury


Matthew Drury, nice insight. I am, however, just trying to build an "xgboost" model as a newbie and don't have answers to the above-mentioned questions. Do you still have some pointers? – Arun Aug 02 '18 at 18:23
5

Deciding the threshold needs to take the costs of wrong decisions into account. It is therefore not a *statistical or ML* aspect, but a *decision theoretic* aspect. Without knowing the costs, you can't decide on the optimum threshold. Talk to whoever will use your model to make decisions. If you are just doing self-study, either read up on the cost structure in your application, or declare the project done when you have probabilistic predictions, and move on to the next project. – Stephan Kolassa Aug 02 '18 at 19:07
2

I agree with Stephan here, you're outside the domain of pure stats and ML and into decision theory. If you're really just playing around with xgboost as a newbie, make something up! Treat it like a real problem, and imagine a scenario that would use your model. – Matthew Drury Aug 02 '18 at 19:33
2

Agree with the others. Take the model's estimated probabilities as is and let business considerations dictate how they're used. It may be as simple as "we've budgeted x dollars for this effort, spend the whole budget on the most promising prospects." – dsaxton Aug 02 '18 at 23:01
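
The budget rule in the last comment could be sketched roughly like this. The scores, the budget, and the per-contact cost are all hypothetical values chosen for illustration:

```r
# Hypothetical inputs: simulated scores and made-up budget figures
set.seed(1)
probs            <- runif(1000, 0.03, 0.31)   # stand-in for model scores
budget           <- 500                       # assumed total budget
cost_per_contact <- 5                         # assumed cost per prospect

# Number of prospects the budget can cover
n_contacts <- min(length(probs), floor(budget / cost_per_contact))

# Rank prospects from most to least promising and take the top n_contacts
ranked   <- order(probs, decreasing = TRUE)
selected <- ranked[seq_len(n_contacts)]

# The lowest selected score acts as the threshold implied by the budget
implied_threshold <- min(probs[selected])
```

Here the threshold is a by-product of the budget rather than something chosen up front, which is the point the comment makes.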
