
I am working on an imbalanced binary classification problem and am wondering at what step I should find the optimal threshold (cut-off point). When I classified the dataset with the default probability threshold, I got no true positives (failures), but when I lowered the threshold, the model started producing some true positives.
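
To illustrate the behaviour described above, here is a minimal sketch. It assumes scikit-learn, a synthetic dataset with about 5% positives, a logistic regression, and a default threshold of 0.5; none of these come from my actual project, and `count_true_positives` is just a hypothetical helper for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def count_true_positives(y_true, proba, threshold):
    """Count actual positives that are predicted positive at the given cut-off."""
    predicted = proba >= threshold
    return int(np.sum(predicted & (y_true == 1)))


# Synthetic imbalanced data (about 5% positives); a stand-in for the real dataset.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Lowering the cut-off can only keep or increase the number of true positives.
for threshold in (0.5, 0.3, 0.1):
    tp = count_true_positives(y_test, proba, threshold)
    print(f"threshold={threshold}: true positives={tp} of {int(y_test.sum())}")
```

The fitted model is the same in every iteration of the loop; only the rule that turns probabilities into class labels changes.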

What confuses me is at which step of the workflow the optimal threshold should be determined. I have the following questions in mind:

  • Do I need to find the optimal threshold before model development?

  • Do I use cross-validation for it?

  • Do I need a procedure for finding the optimal threshold that is separate from model development? That is, would I run two procedures, one for the optimal probability threshold and the other for model development?
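
To make the last two bullets concrete, here is a sketch of what a separate, cross-validated threshold step might look like. It assumes scikit-learn, synthetic data, and F1 as a stand-in selection criterion; `pick_threshold` and the candidate grid are hypothetical names and choices for illustration, not an established recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split


def pick_threshold(y_true, proba, candidates):
    """Return the candidate cut-off with the highest F1 score (assumed criterion)."""
    scores = [
        f1_score(y_true, (proba >= t).astype(int), zero_division=0)
        for t in candidates
    ]
    return float(candidates[int(np.argmax(scores))])


# Synthetic imbalanced data (about 5% positives); a stand-in for the real dataset.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Procedure 1: out-of-fold probabilities on the training data only,
# so the threshold is never tuned on rows the model was fitted to.
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_proba = cross_val_predict(
    model, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]
candidates = np.linspace(0.05, 0.95, 19)
threshold = pick_threshold(y_train, oof_proba, candidates)

# Procedure 2: fit the final model on all training data and apply the
# chosen threshold once to the untouched test set.
model.fit(X_train, y_train)
test_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
test_f1 = f1_score(y_test, test_pred, zero_division=0)
print(f"chosen threshold={threshold:.2f}, test F1={test_f1:.3f}")
```

In this sketch the threshold is chosen after the probabilistic model is specified but before the final evaluation, and the test set plays no part in choosing it. The F1 criterion could be swapped for an explicit cost of false positives versus false negatives.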

I hope to hear some advice or answers.

StoryMay
  • Do not use [classification thresholds](https://stats.stackexchange.com/q/312119/1352) unless you understand the trade-offs of your *decision*. Unbalanced classes are almost certainly not a problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Sep 18 '19 at 08:03
  • I don't think I am asking the same question. My question is not about how to find the optimal threshold or what the optimal threshold is. My question is about "at what step of model development I should do this and how it should be done". – StoryMay Sep 18 '19 at 08:09
  • And my argument is that you should not do so at all. [See also this earlier answer of mine.](https://stats.stackexchange.com/a/312787/1352) Thresholds and related topics (like accuracy) have been discussed many times at CV, so it's hard to find the "best" duplicate. [See also this.](https://stats.stackexchange.com/a/359936/1352) – Stephan Kolassa Sep 18 '19 at 08:16
  • Thank you for providing references but they don't appear to be directly answering my questions. Maybe I haven't looked at them enough, yet. – StoryMay Sep 18 '19 at 08:21
  • They are a bit long, and they are a frame challenge. You are asking how to set your threshold. I am arguing that [the entire premise of the question is wrong](https://stats.meta.stackexchange.com/a/5004/1352) - you should not be setting thresholds in the first place. Please do take the time to digest those arguments. – Stephan Kolassa Sep 18 '19 at 08:24
  • I am not asking how to set the threshold. I am asking "at what step of model development should I find the optimal threshold for an imbalanced binary classification dataset?" – StoryMay Sep 18 '19 at 08:24
  • @ChangheeKang What Stephan is trying to tell you is that the whole methodology is flawed, including "at which step should I find the optimal cut-off". If you read through the links he posted, you will find that metrics that require a cut-off to make a discrete prediction are not the best. In any case, if you are fixed on this path, you should use either CV or a test set to find the "best" threshold. – user2974951 Sep 18 '19 at 09:08
  • Yeap, I am going through the readings @Stephan suggested. Thank you for the short note but it still helps me. – StoryMay Sep 18 '19 at 09:21

0 Answers