I have a model that predicts a rare disease (1% prevalence) with good discrimination (AUC). However, the predictions it produces cannot be interpreted as probabilities.
I want to recalibrate the predictions according to the prevalence in the population. I don't need high resolution; 5 bins (levels) of probability, from "very low" to "high", are enough (so I probably don't need sophisticated isotonic regression or Platt scaling), but I do need a calibrated probability for every bin.
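To make the binning idea concrete, here is roughly what I have in mind (a simplified Python sketch, not my actual code; the bins come from score quantiles, i.e. the equal-samples-per-bin variant, and `scores_train` / `y_train` are placeholder names for the raw model outputs and the 0/1 labels):

```python
import numpy as np

def fit_bin_calibration(scores_train, y_train, n_bins=5):
    """Learn a calibrated probability for each score bin.

    Bins are defined by score quantiles (equal number of samples per bin);
    the calibrated probability of a bin is the observed event rate of the
    training samples that fall into it.
    """
    # interior cut points: n_bins - 1 quantiles give n_bins bins
    cuts = np.quantile(scores_train, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = np.digitize(scores_train, cuts)          # values 0 .. n_bins-1
    bin_probs = np.array([y_train[bin_idx == b].mean() for b in range(n_bins)])
    return cuts, bin_probs

def apply_bin_calibration(scores, cuts, bin_probs):
    """Map raw model scores to the calibrated probability of their bin."""
    return bin_probs[np.digitize(scores, cuts)]
```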
I tried 'learning' the probabilities for these bins from a 'train' set representing the population and then testing whether the model is calibrated on a 'validation' set using the Hosmer-Lemeshow statistic.
However, the statistic is very unstable: the p-value varies from 0.0012 to 0.28 when I repeat the experiment with different random splits. Why is this happening?

The entire population is 4,763,615 people, of whom 48,606 are cases. I'm dividing it 70%/30% between the train and validation sets. I have tried defining the bins in several ways (equal number of samples per bin, equal number of cases per bin, equal-width intervals of the original predictions, ...). Is my entire approach reasonable, or is there something else I'm missing?
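For reference, the repeated experiment looks roughly like this (again a simplified sketch, not my actual code; `scores` and `y` stand for the full-population model outputs and 0/1 outcomes, the two helper functions are the ones sketched above, and the n_bins − 2 degrees of freedom are just the conventional Hosmer-Lemeshow default, which may itself be debatable here):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, bin_idx, n_bins):
    """Hosmer-Lemeshow chi-square statistic on pre-defined bins."""
    stat = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        n_b = mask.sum()
        obs = y[mask].sum()    # observed cases in the bin
        exp = p[mask].sum()    # expected cases = sum of predicted probabilities
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_b))  # assumes 0 < exp < n_b
    return stat, chi2.sf(stat, n_bins - 2)  # conventional g - 2 degrees of freedom

# repeat the 70%/30% split to see how much the p-value moves around
n_bins = 5
for seed in range(5):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(scores))
    n_train = int(0.7 * len(scores))
    tr, va = idx[:n_train], idx[n_train:]

    cuts, bin_probs = fit_bin_calibration(scores[tr], y[tr], n_bins=n_bins)
    p_va = apply_bin_calibration(scores[va], cuts, bin_probs)
    bin_va = np.digitize(scores[va], cuts)

    stat, pval = hosmer_lemeshow(y[va], p_va, bin_va, n_bins)
    print(f"seed={seed}  HL={stat:.1f}  p-value={pval:.4f}")
```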
(related to Evaluating logistic regression and interpretation of Hosmer-Lemeshow Goodness of Fit, but not specific to logistic-regression. Also related to How to choose optimal bin width while calibrating probability models?).