When using undersampling to compensate for unbalanced data, what should you use for a testing dataset?

AngusE
  • Don't use undersampling at all. Better to use scoring rules to evaluate *probabilistic* classifications: https://stats.stackexchange.com/a/312787/1352 – Stephan Kolassa Dec 10 '17 at 09:59
  • So, are you saying there's no reason for anyone to use undersampling with unbalanced data? I'm not disagreeing with you, but now I need to reconcile this advice with my professor's recommendation to use undersampling plus the many tedious academic papers I keep finding about the subject. Unfortunately, none of the papers I have found address my question about training and testing datasets. – AngusE Dec 10 '17 at 17:09
  • Yes, I am saying exactly that. Assessing probabilistic predictions via scoring rules instead of accuracy (or similar KPIs) obviates the entire "problem". Yes, you will need more data to come to solid conclusions if you have rare but important classes, but you will need this whether or not you use accuracy. Yes, I know that even people who should know better argue for accuracy and for oversampling to solve "problems" they shouldn't have to begin with. – Stephan Kolassa Dec 10 '17 at 17:16
  • Thanks, Stephan. Again, what you are saying makes sense, but I am surprised that there seem to be many variations on this approach. One example that is easily found is here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3946903/. It's somewhat concerning that this technique is often used in medical research. Maybe someone should reach out to these individuals to let them know their approach could be flawed. – AngusE Dec 10 '17 at 17:27
  • I agree. Statisticians have been trying to do this for a long time (e.g., Frank Harrell's blog posts). Unfortunately, simple KPIs like accuracy are deceptively easy to "understand", and bad ways of dealing with problems propagate faster than the original misunderstandings can be rooted out. I like to point out that the miasma theory of disease (https://en.wikipedia.org/wiki/Miasma_theory) is also easy to "understand" and dominated medical discourse for centuries, but that didn't make it correct. – Stephan Kolassa Dec 13 '17 at 08:01
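
For readers who want to see what the comments mean by scoring rules versus accuracy, here is a minimal sketch. It is not taken from the thread: the data, the two sets of predicted probabilities, and the helper names (`brier_score`, `log_loss`, `accuracy`) are all made up for illustration. It assumes a test set left at its original class ratio (95 negatives, 5 positives) and compares two hypothetical classifiers that have identical accuracy but different quality of probabilistic predictions.

```python
import math

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Share of cases classified correctly after thresholding the probability."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

# Hypothetical test set at the original class ratio: 95 negatives, 5 positives.
y_test = [0] * 95 + [1] * 5

# Hypothetical model A: ignores the features and always predicts the base rate.
p_constant = [0.05] * 100

# Hypothetical model B: assigns clearly higher probability to the true positives,
# although still below the 0.5 threshold.
p_informative = [0.02] * 95 + [0.40] * 5

for name, p in [("constant", p_constant), ("informative", p_informative)]:
    print(name, accuracy(y_test, p), brier_score(y_test, p), log_loss(y_test, p))
```

Both hypothetical models label every case as negative at the 0.5 threshold, so accuracy cannot tell them apart, while the Brier score and the log loss both favor the model whose probabilities separate the rare class. In this toy setup no resampling of the test data is involved.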

0 Answers