
I have an imbalanced dataset (60% class 1, 40% class 0). I trained a model and computed accuracy, F1, ROC-AUC, and PR-AUC. I want to compare them to chance-level performance. Obviously the chance level for accuracy is 60%. How do I calculate the chance level for F1, ROC-AUC, and PR-AUC? Thanks

okuoub
  • What do you mean by "chance-level"? If we randomly assign labels to instances (e.g., based on the original prevalences), we will end up with accuracy far lower than 60% (specifically, about 0.6 × 0.6 + 0.4 × 0.4 = 52%). We achieve 60% by assigning all instances to the majority class. But I would not call that "chance-level", so I'm unsure what you have in mind for the other metrics. Do you mean the maximum that is achievable with a trivial model? – Stephan Kolassa Nov 10 '21 at 12:29
  • Also: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) and [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) - all the criticisms there apply equally to F1. (AUROC is a semi-proper scoring rule, so slightly less bad.) – Stephan Kolassa Nov 10 '21 at 12:29
  • @StephanKolassa You are correct, and I am not really interested in accuracy; that's the reason I asked the question. I want to understand whether the AUC of a "dummy" model should always be 0.5 and, if so, why, even for an imbalanced dataset – okuoub Nov 10 '21 at 12:32
  • 1
    None of the accuracy measures listed are very good ones. See https://www.fharrell.com/post/class-damage/ – Frank Harrell Nov 10 '21 at 13:13
  • @FrankHarrell Thanks, very nice, I will add them as well. However, as I compare my model to an existing benchmark, I must also provide PR-AUC and ROC-AUC. Can you please specify how to check chance-level performance for them? (Also, how to do it for the Brier score?) – okuoub Nov 10 '21 at 13:24
  • Most statisticians don't compute chance-level performance but rather use resampling to estimate the overfitting-corrected (bias-corrected) performance. But if you want chance level you can just run this a few times and average: randomly permute Y to associate the wrong responses with the features; repeat all analysis steps and compute performance measures of interest. P.S. Not fruitful to talk about balance; just use good measures. [A code sketch of this permutation procedure appears after the comment thread below.] – Frank Harrell Nov 10 '21 at 14:11
  • @FrankHarrell OK, so let's say I compute the Brier score and get a certain result. How can I get an idea of how good this number actually is? How much better is it than a dummy classifier, and how significant is the difference? – okuoub Nov 10 '21 at 14:41
  • Brier score is excellent but as you allude to it's not easy to judge; we know that smaller is better but how small is good enough? By randomly permuting the outcome variable Y you can see what the base level is. You can also simulate data to get a range of Brier scores. – Frank Harrell Nov 10 '21 at 14:44
  • @FrankHarrell OK, so regardless of the measure, the best way to assess performance is to permute n times, get a sorted vector of randomized performances, find the rank of my actual model in it, and treat that as the p-value? – okuoub Nov 10 '21 at 14:46
  • 2
    That's close to a p-value but is useful. But keep it a bit more simple in addition to looking at the whole distribution over restarts to the permutation as you are doing. Show the mean Brier score over all the restarted permutations and compare it to the observed Brier score. You could call the difference in those two the chance-corrected Brier score. Bootstrap or 100 repeats of 10-fold cross-validation to get an overfitting-corrected Brier score is slightly better than doing that. – Frank Harrell Nov 10 '21 at 16:28
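To make the permutation recipe from the comments concrete, here is a minimal sketch of how it could look with scikit-learn. Everything specific in it is an illustrative assumption: `X` and `y` stand for your feature matrix and 0/1 labels, `LogisticRegression`, the single stratified train/test split, the 0.5 threshold for F1, and the 100 permutations are stand-ins for whatever pipeline you actually use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, roc_auc_score,
                             average_precision_score, brier_score_loss)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def run_pipeline(X, y, seed=0):
    """Repeat all analysis steps: split, fit, predict, score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # placeholder model
    p = model.predict_proba(X_te)[:, 1]                        # predicted P(class 1)
    return {
        "f1": f1_score(y_te, (p >= 0.5).astype(int)),  # 0.5 threshold is arbitrary
        "roc_auc": roc_auc_score(y_te, p),
        "pr_auc": average_precision_score(y_te, p),
        "brier": brier_score_loss(y_te, p),
    }

# Observed performance with the real labels.
observed = run_pipeline(X, y)

# Chance level: permute Y so the responses are decoupled from the features,
# rerun the whole pipeline, and collect the same metrics each time.
n_permutations = 100
chance = [run_pipeline(X, rng.permutation(y), seed=i) for i in range(n_permutations)]

for metric in observed:
    null_scores = np.array([c[metric] for c in chance])
    print(f"{metric:8s} observed={observed[metric]:.3f} "
          f"chance mean={null_scores.mean():.3f}")

# Rank-based "p-value" for a larger-is-better metric such as ROC-AUC
# (for the Brier score the comparison flips, since smaller is better).
null_roc = np.array([c["roc_auc"] for c in chance])
p_roc = (np.sum(null_roc >= observed["roc_auc"]) + 1) / (n_permutations + 1)
print("permutation p-value for ROC-AUC:", p_roc)
```

Averaging the permuted scores gives the chance level described in the comments, and the rank of the observed score within the permuted distribution gives the rough p-value discussed above.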
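For the overfitting-corrected Brier score mentioned in the last comment, one way to implement "100 repeats of 10-fold cross-validation" with scikit-learn is sketched below. Again, `X`, `y`, and `LogisticRegression` are placeholder assumptions; note this fits the model 1,000 times.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# X, y are assumed to be your feature matrix and 0/1 labels.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)

# scikit-learn exposes the Brier score as the negated scorer "neg_brier_score".
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="neg_brier_score")

brier_cv = -scores.mean()   # flip the sign back: smaller is better
print(f"overfitting-corrected (cross-validated) Brier score: {brier_cv:.4f}")
```

This estimate can then be compared with the mean Brier score from the permutation runs above to get the chance-corrected difference described in the comments.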

0 Answers