I have a binary classification problem with two classes, 0 and 1. To train an XGBoost classification model, I use a balanced training set (50% 0's, 50% 1's).
In reality, 1's are much more abundant than 0's. When I apply the newly trained model to realistically distributed test data, I see very solid recall but poor precision for the less abundant 0-class.
To mitigate this, I tried other optimization metrics; in particular, I was interested in optimizing for precision, F1, or ROC.
For ROC I used the following code:
# caret drives the tuning; xgboost is the underlying engine
library(caret)
library(xgboost)

# load in training data (50/50); the target is a factor whose levels are
# valid R variable names (e.g. "class0"/"class1"), as classProbs = TRUE requires
training <- readRDS("Train_Data.rds")

# implement 3-fold CV; twoClassSummary provides the ROC metric
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                           classProbs = TRUE, savePredictions = TRUE,
                           summaryFunction = twoClassSummary)
# Set-up grid search for hyperparameter tuning
tune_grid <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 100),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4, 5, 6, 10),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
# create model
xGFit1 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = fitControl, metric = "ROC")
When optimizing for F1 or precision instead, I replaced twoClassSummary with prSummary and changed the metric argument of train() to "F" or "Precision", respectively.
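That is, something like this (same data and grid as above; only the summary function and metric change):

# same set-up, but with caret's precision/recall-based summary
fitControl2 <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                            classProbs = TRUE, savePredictions = TRUE,
                            summaryFunction = prSummary)
xGFit2 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = fitControl2,
                metric = "F")  # or metric = "Precision"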
Unfortunately, while the absolute numbers in my confusion matrix vary a little, the recall and precision values remain unchanged (when rounded to whole percentages).
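For reference, the recall and precision numbers come from roughly this (the test-file name, the class label "class0", and the default probability cut-off are placeholders for my actual set-up):

# apply the model to realistically distributed test data
test <- readRDS("Test_Data.rds")
pred <- predict(xGFit1, newdata = test)  # class predictions at the default 0.5 cut-off
confusionMatrix(pred, test$target, positive = "class0")  # "class0" = my 0-class label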
Did I do something wrong?
When optimizing for ROC, F1, or precision, is it better to use a realistically distributed training set in order to see an effect on the test data?