
I have data with "valid" and "invalid" classes and lots of predictors (over 15). Only 5% of the data set is valid (success, class 1); 95% is invalid (class 0). The number of invalids is skewing my model: it can classify the invalids accurately, but it's bad at classifying valids.

I oversampled the valids to get a logistic model that doesn't have too many false negatives, but that gave too many false positives. I then changed the probability cutoff to 0.65, which lowered the false positives but now gives too many false negatives.
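
For concreteness, here is a minimal sketch of the oversampling step as described above. The file name, column name `valid`, and other names are made up for illustration:

```python
# Minimal sketch: oversample the minority ("valid") class, then fit a logistic regression.
# All names here (records.csv, the 'valid' column) are hypothetical.
import pandas as pd
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("records.csv")          # predictors plus a 0/1 'valid' label
valids   = df[df["valid"] == 1]          # ~5% of rows
invalids = df[df["valid"] == 0]          # ~95% of rows

# Sample the valids with replacement until both classes are the same size.
valids_up = resample(valids, replace=True, n_samples=len(invalids), random_state=42)
balanced  = pd.concat([invalids, valids_up])

X = balanced.drop(columns="valid")
y = balanced["valid"]

model = LogisticRegression(max_iter=1000).fit(X, y)
```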

I found that a probability cutoff of 0.65 is the pivot point between too many false negatives and too many false positives. Does it make sense to use different probability cutoffs with the same model, i.e., cutoff 0.5 for accurately classifying 1's and cutoff 0.65 for accurately classifying 0's? Any other ideas to classify better? I tried other types of classifiers and had the same issue.
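
Here is a rough sketch of the cutoff comparison I mean, continuing from the oversampling sketch above (`model`, `X`, `y` are the hypothetical objects from that sketch): anything below the cutoff is called 0 (invalid), anything at or above is called 1 (valid).

```python
# Sketch: compare several cutoffs on the same fitted model.
# Assumes `model`, `X`, `y` from the oversampling sketch above.
from sklearn.metrics import confusion_matrix

probs = model.predict_proba(X)[:, 1]          # predicted P(valid) for each observation

for cutoff in (0.50, 0.575, 0.65):
    preds = (probs >= cutoff).astype(int)     # below cutoff -> 0, at/above -> 1
    tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
```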

I trimmed the many predictors down to a few using p-values and best subsets.
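
A rough sketch of the p-value trimming part (best subsets not shown); the 0.05 threshold and the use of statsmodels here are just illustrative assumptions, and `df` is the hypothetical DataFrame from the first sketch:

```python
# Sketch: fit a full logistic model and keep predictors with small p-values.
# The 0.05 threshold is an assumption; `df` is the hypothetical DataFrame above.
import statsmodels.api as sm

X_all = sm.add_constant(df.drop(columns="valid"))
full_fit = sm.Logit(df["valid"], X_all).fit(disp=0)

# Keep predictors whose coefficients have p-values below the threshold.
kept = full_fit.pvalues[full_fit.pvalues < 0.05].index.drop("const", errors="ignore")
print("predictors kept:", list(kept))
```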

One clarification point: I've trained on portions of the data set and then validated on the full data set to get the false positive/negative accuracy metrics.
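
A sketch of that evaluation as I've described it (the 70% training fraction is an assumption; `df` is the hypothetical DataFrame from the first sketch):

```python
# Sketch: train on a portion of the data, then count false positives/negatives
# over the full data set, as described above. Names and the 70% fraction are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

train_part = df.sample(frac=0.7, random_state=42)          # portion used for training
X_train, y_train = train_part.drop(columns="valid"), train_part["valid"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validate on the full data set (training rows included) and count errors.
preds = model.predict(df.drop(columns="valid"))
tn, fp, fn, tp = confusion_matrix(df["valid"], preds).ravel()
print(f"full-data-set check: false positives={fp}, false negatives={fn}")
```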

  • How would you classify something with a probability of $0.575$? – Dave Mar 27 '21 at 03:19
  • Hi Dave, I don't understand your question. Where did 0.575 come from? – logisticnightmare Mar 27 '21 at 04:49
  • It’s halfway between $0.50$ and $0.65$. – Dave Mar 27 '21 at 04:58
  • Oh, understood. 0.575 leads to a high number of false negatives and false positives. 0.65 is the switch (tipping point) where the big change happens, according to the accuracy – logisticnightmare Mar 27 '21 at 05:00
  • So what do you do with an observation that gives you a probability of $0.575$? – Dave Mar 27 '21 at 05:02
  • 1
    Also, do you know the arguments against accuracy as a scoring rule? Here are some of my favorites, especially ham vs spam email. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Mar 27 '21 at 05:04
  • For my probability cutoff, if the logistic result is < 0.5, I call it 0. If I adjust it to 0.575 or 0.65 or x, I classify anything below that point as 0 and anything above as 1 – logisticnightmare Mar 27 '21 at 05:05
  • Thank you, I will read those. I'm still a novice, learning – logisticnightmare Mar 27 '21 at 05:05

0 Answers