
I'm running an XGBoost model to try to find important predictors for a disease from a list of almost 1000 covariates. The prevalence of the disease in my cohort is about 10%.

Given the imbalanced data, would the precision-recall AUC or the log loss be a more appropriate metric to assess the model fit? Is it appropriate to use log loss when classes are not balanced?

Also, while playing with hyperparameter tuning, it seems like adding scale_pos_weight is beneficial, but should I avoid doing this if I use log loss?

Thank you!

dean

1 Answer


Yes, the log loss is appropriate. It's a proper scoring rule (see the tag wiki for more info). It can indeed be used with "unbalanced" data. Precision and recall are improper, so don't use them.

More information here (admittedly boilerplate):

  • Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
  • Do not use accuracy to evaluate a classifier: Why is accuracy not the best measure for assessing classification models?
  • Is accuracy an improper scoring rule in a binary classification setting?
  • Classification probability threshold

The same problems apply to sensitivity and specificity, and indeed to all evaluation metrics that rely on hard classifications. Instead, use probabilistic classifications, and evaluate these using proper scoring rules.
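As a small illustration of the point (my own toy sketch, not from the linked threads): at the question's 10% prevalence, a classifier that outputs a constant calibrated probability of 0.10 and one that outputs a constant miscalibrated 0.40 both predict "no disease" for everyone at a 0.5 threshold, so both score 90% accuracy. Log loss, being a proper scoring rule, separates them:

```python
import math

def log_loss(p, prevalence):
    """Expected log loss of a constant predicted probability p
    under the given true prevalence."""
    return -(prevalence * math.log(p) + (1 - prevalence) * math.log(1 - p))

prev = 0.10  # ~10% disease prevalence, as in the question

# Two constant classifiers: one calibrated, one not.
ll_calibrated = log_loss(0.10, prev)  # approx. 0.325
ll_miscal     = log_loss(0.40, prev)  # approx. 0.551

# At a 0.5 threshold both classify every case as negative,
# so both reach 90% accuracy -- accuracy cannot tell them apart,
# while log loss clearly favors the calibrated probabilities.
print(ll_calibrated, ll_miscal)
```

The same logic carries over to fitted models: log loss rewards well-calibrated probabilities regardless of how imbalanced the classes are.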

Stephan Kolassa
  • Thank you Stephan for the answer! So just as a quick follow-up: my understanding is that with unbalanced data you should indeed not use accuracy, partly because with severely unbalanced data the model can be very accurate even if it always predicts the negative class. But I thought that the same should not hold when using the precision-recall AUC, which specifically focuses on the positive class only, and is also insensitive to the specific probability threshold for classification. Is that not accurate? – dean Jun 29 '21 at 14:40
  • And just to confirm, despite the class imbalance, you suggested that the logloss is an appropriate scoring rule (so, in other words, logloss is OK to use when you have unbalanced data?) – dean Jun 29 '21 at 14:41
    AUC is somewhat less bad than just plain accuracy, [take a look at this thread](https://stats.stackexchange.com/q/339919/1352). And yes, log loss can be used in any kind of situation. Predicting rare classes is hard, and it's even more important to use proper scoring rules in this setup. – Stephan Kolassa Jun 29 '21 at 15:29
  • Thanks for all of this! So final question: I read that over/under sampling is actually not needed in most situations (https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he). Yet, when tuning the hyperparameters through CV, it seems like adding weights to make the gradient of a positive sample more influential improves the prediction (https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets). Should this not be done however? – dean Jun 29 '21 at 17:41
  • "Improves the prediction" in what sense? I only skimmed the thread, and I don't see what KPI is used. I have a hard time imagining this improves the probabilistic predictions as measured by proper scoring rules. Conversely, I could easily imagine it "improves" accuracy, by biasing predictions towards the majority class. But that is not an argument, since [accuracy is not a good evaluation measure](https://stats.stackexchange.com/q/312780/1352). There is a lot of misinformation floating around about how to evaluate classifiers. – Stephan Kolassa Jun 30 '21 at 06:31
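A toy calculation (my own sketch, not part of the thread) of why up-weighting positives tends to hurt the log loss: for an intercept-only model, minimizing the weighted log loss with weight w on the positive class gives the constant prediction p = wπ/(wπ + 1 − π). With π = 0.1 and the common scale_pos_weight heuristic w = (number of negatives)/(number of positives) = 9, that optimum is 0.5 rather than the calibrated 0.1, and the unweighted log loss gets worse:

```python
import math

def expected_log_loss(p, pi):
    # Unweighted expected log loss of a constant prediction p at prevalence pi.
    return -(pi * math.log(p) + (1 - pi) * math.log(1 - p))

pi = 0.10            # prevalence
w = (1 - pi) / pi    # scale_pos_weight heuristic: negatives / positives = 9

# Minimizer of the weighted log loss for an intercept-only model:
# argmin_p -(w*pi*log(p) + (1-pi)*log(1-p)) = w*pi / (w*pi + 1 - pi)
p_weighted = w * pi / (w * pi + 1 - pi)   # = 0.5: probabilities are inflated

ll_calibrated = expected_log_loss(pi, pi)         # approx. 0.325
ll_weighted   = expected_log_loss(p_weighted, pi) # approx. 0.693

print(p_weighted, ll_calibrated, ll_weighted)
```

This is only the intercept-only extreme, but it shows the mechanism: the weighting deliberately biases predicted probabilities upward, which can look like an improvement on threshold-based metrics while degrading a proper scoring rule like log loss.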