
I am building a classifier for imbalanced data (~2% positive class). I am using LightGBM for the time being, but I guess the question could apply to any binary classifier. The returned probabilities do not cover the entire [0,1] range; they only reach around 0.6. Does that mean something is wrong with the classifier? Or the data? Or the learning process in general (i.e. the classifier is not learning)?

  • You might want to read up on [calibration](https://scikit-learn.org/stable/modules/calibration.html) and its related discussions (e.g. [this Q&A](https://stats.stackexchange.com/questions/372327/probability-calibration-from-lightgbm-model-with-class-imbalance)). – B.Liu Jul 15 '21 at 12:59

1 Answer


No, a restricted range of predicted probabilities does not necessarily indicate a problem with the data or the classifier. It could just be that the data has too much natural variability to predict the event with any more certainty, or equivalently that the covariates included do not explain enough of the variation. Nor does a restricted range imply the returned probabilities are inaccurate.
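A useful check here is whether the probabilities are calibrated, rather than how large their maximum is. Below is a minimal sketch of that idea using a synthetic ~2%-positive dataset from `make_classification` as a stand-in for your data (all dataset and model parameters are illustrative, not taken from your setup): it fits an `LGBMClassifier`, prints the maximum predicted probability, and compares predicted probabilities against observed event rates with scikit-learn's `calibration_curve`.

```python
import lightgbm as lgb
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~2% positives) as a stand-in for the real dataset.
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.98, 0.02], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = lgb.LGBMClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print(f"max predicted probability: {proba.max():.3f}")

# Reliability check: if the predicted probabilities track the observed event
# frequencies bin by bin, they are accurate even if they never get close to 1.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10, strategy="quantile")
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.3f} -> observed {obs:.3f}")
```

If the predicted and observed values stay close to each other, the model's probabilities are trustworthy even though they top out around 0.6; if they diverge, post-hoc calibration (as linked in the comments) is worth trying.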

In fact, for many problems you'd be doing fairly well if the model is able to reliably provide such strong predictions for rare events. Predicting a patient has a 60% chance of having cancer would be very useful, for example.

David Luke Thiessen