I am trying to classify the "Insurance Company Benchmark (COIL 2000) Data Set", which can be found in Dataset.
I am using XGBoost in R (I am new to the XGBoost algorithm) for the classification, and the code I have come up with is as follows:
D <- read.csv("ticdata2000.csv", header = TRUE)
# dim(D) # O/P- 5823 86
# Make training and testing splits-
train_indices <- sample(1:nrow(D), floor(0.7 * nrow(D)), replace = FALSE)
training <- D[train_indices, ]
testing <- D[-train_indices, ]
library(xgboost)
# Train/Fit model (classifier)-
model_classifier <- xgboost(data = as.matrix(training[-86]), label = training$C86, nrounds=100, eta = 0.1, gamma = 1)
# Make predictions using trained model-
preds <- predict(model_classifier, as.matrix(testing[-86]))
# Convert floating-point values to either 0 or 1 according to 'C86' column-
# for (i in 1:length(preds))
# {
# preds[i] <- ifelse(preds[i] < 0.1, 0, 1)
# }
length(preds) # O/P- 1747
length(unique(preds)) # O/P- 408
XGBoost returns floating-point values, and for classification they need to be converted to class labels at a threshold that suits the model. How do I decide which threshold is appropriate for my model? The final prediction has to be either 0 or 1.
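To make the question concrete, here is a minimal sketch of one approach I could imagine: scan candidate cutoffs over the predicted scores and keep the one that gives the best F1 score. The helper name `pick_threshold`, the choice of F1 as the metric, and the toy values are all assumptions for illustration, not output from my model. The scan works on any real-valued score, though the scores are only probabilities if the model is trained with `objective = "binary:logistic"`, which my call above does not set.

```r
# Hypothetical helper: try each candidate cutoff and return the one with the best F1.
# 'scores' are the predicted values, 'labels' are the true 0/1 classes.
pick_threshold <- function(scores, labels, cutoffs = sort(unique(scores))) {
  f1 <- vapply(cutoffs, function(t) {
    pred <- as.integer(scores >= t)       # predict 1 at or above the cutoff
    tp <- sum(pred == 1 & labels == 1)    # true positives
    fp <- sum(pred == 1 & labels == 0)    # false positives
    fn <- sum(pred == 0 & labels == 1)    # false negatives
    if (tp == 0) return(0)                # treat F1 as 0 when nothing is caught
    2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  list(threshold = cutoffs[which.max(f1)], f1 = max(f1))
}

# Toy values for illustration only (not taken from the data set)
toy_scores <- c(0.04, 0.06, 0.09, 0.12, 0.20, 0.25, 0.30)
toy_labels <- c(0, 0, 0, 1, 0, 1, 1)
best <- pick_threshold(toy_scores, toy_labels)
print(best)
```

With the real model this would be called as `pick_threshold(preds, testing$C86)`, ideally on a separate validation split rather than the test split, so that the threshold is not tuned on the data used for the final evaluation.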
The minimum and maximum values in the "preds" variable are as follows:
min(preds) # O/P- 0.03360531
max(preds) # O/P- 0.3086071
Of course, these values are bound to change as I have not used a seed value.
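For reference, a small sketch of how fixing a seed would make the random split repeatable. The seed value 123 and the toy size `n` are arbitrary assumptions, and this only covers the `sample()` call, not any randomness inside training.

```r
# Fixing the RNG seed makes sample() return the same indices on every run.
# The seed value 123 and the toy size n are arbitrary choices for illustration.
n <- 10
set.seed(123)
first_draw <- sample(1:n, 7, replace = FALSE)
set.seed(123)
second_draw <- sample(1:n, 7, replace = FALSE)
identical(first_draw, second_draw)  # TRUE
```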
Any help is appreciated!
Thanks