
I have a binary classification problem with two classes, 0 and 1. For training an XGBoost classification model, I use a balanced data set (50% 0's, 50% 1's).

In reality, 1's are much more abundant than 0's. After applying my newly trained model to realistically distributed test data, I see very solid recall but poor precision for the less abundant 0 class.

To mitigate this effect, I tried other optimization metrics. In particular, I was interested in optimizing for precision, F1 or ROC.

For ROC I used the following code:

library(caret)   # xgboost must also be installed for method = "xgbTree"

# load in training data (50/50 balanced)
training <- readRDS("Train_Data.rds")

# classProbs = TRUE needs factor levels that are valid R names,
# so the classes "0"/"1" are relabelled (to "X0"/"X1")
levels(training$target) <- make.names(levels(training$target))

# implement 3-fold CV; twoClassSummary provides the ROC metric
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                           classProbs = TRUE, savePredictions = TRUE,
                           summaryFunction = twoClassSummary)

# Set-up grid search for hyperparameter tuning
tune_grid <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 100),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4, 5, 6, 10),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# create model, tuning hyperparameters by cross-validated ROC
xGFit1 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = fitControl, metric = "ROC")

When using F1 or precision, I replaced twoClassSummary with prSummary and changed the metric in the train() call to either "F" or "Precision", roughly as below.
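For reference, that variant looks roughly like this (fitControlPR and xGFit2 are just placeholder names; prSummary needs the MLmetrics package installed):

# prSummary reports AUC, Precision, Recall and F across the CV folds
fitControlPR <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                             classProbs = TRUE, savePredictions = TRUE,
                             summaryFunction = prSummary)

# tune by cross-validated F1; use metric = "Precision" to tune by precision instead
xGFit2 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = fitControlPR, metric = "F")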

Unfortunately, while the absolute numbers in my confusion matrix vary a little, the recall and precision values remain unchanged (when rounded to whole percentages).
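For completeness, this is roughly how I read precision and recall off the realistic test set; the file name is a placeholder, and the class label "X0" follows the relabelling above:

# realistically distributed test data (file name is a placeholder)
testing <- readRDS("Test_Data.rds")
levels(testing$target) <- make.names(levels(testing$target))

# hard class predictions at the default 0.5 threshold
pred <- predict(xGFit1, newdata = testing)

# precision and recall for the rare class (originally "0", relabelled "X0")
confusionMatrix(data = pred, reference = testing$target,
                positive = "X0", mode = "prec_recall")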

Did I do something wrong?

When optimizing for ROC, F1 or precision, is it better to use a realistically distributed training set so that the optimization actually has an effect on the test data?

Arne
  • I believe that [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) addresses your question, because every problem with accuracy that thread examines also holds for F1, precision and most other KPIs. (ROC is a slightly different topic.) I would in particular say that [my advice there](https://stats.stackexchange.com/a/312787/1352) would be helpful. On balancing, see [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Apr 30 '20 at 14:16
  • Alright, this is very good reasoning for why my metrics do not deliver different results. As a short follow-up: do you have a go-to way to address a classification problem in which you try to maximize precision (as well as recall) for imbalanced data, leaving aside any manipulation of the threshold value? – Arne May 05 '20 at 06:24
  • To be honest, I have never thought about this, because it seems like optimizing the wrong KPI. We should aim for calibrated probabilistic predictions, not for high precision. I like analogies: maximizing precision instead of improving probabilistic predictions seems to me like keeping a patient's temperature at *precisely* 37°C, regardless of his overall health - even if we have to heat the corpse to do so. – Stephan Kolassa May 05 '20 at 17:20
  • Sorry for yet another follow-up: are you an R user? Doesn't R optimize for exactly such simple KPIs that you do not see as fit for model evaluation? In case you are using R / caret, how do you define your train() function - what is your evaluation metric? – Arne May 20 '20 at 06:09
  • Yes, I use R. I do little classification, but I am not aware of any base functions that optimize for F1 score or accuracy - classifiers usually optimize the likelihood, which is a form of the log score, which *is* a proper scoring rule (and which I would also use in training). User-contributed packages can, of course, do something *cough* different. – Stephan Kolassa May 20 '20 at 07:50
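A minimal sketch of the approach suggested in the comments, i.e. tuning on the log loss (a proper scoring rule) with caret's built-in mnLogLoss summary, assuming the same training data and tune_grid as in the question (object names are placeholders):

# tune hyperparameters by cross-validated log loss instead of a thresholded metric
fitControlLL <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                             classProbs = TRUE, savePredictions = TRUE,
                             summaryFunction = mnLogLoss)

xGFitLL <- train(target ~ ., data = training, method = "xgbTree",
                 tuneGrid = tune_grid, trControl = fitControlLL,
                 metric = "logLoss", maximize = FALSE)

# the fitted model returns class probabilities (testing as loaded above),
# which can then be assessed or thresholded separately from model training
probs <- predict(xGFitLL, newdata = testing, type = "prob")

caret should already minimize "logLoss" by default; maximize = FALSE just makes that explicit.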

0 Answers