I have a model that predicts a rare disease (1% prevalence) with good discrimination (AUC). However, the predictions it produces cannot be interpreted as probabilities.
I want to recalibrate the predictions according to the prevalence in the population. I don't need high resolution; 5 bins (levels) of probability, from "very low" to "high", are enough (so I probably don't need sophisticated isotonic regression or Platt scaling), but I do need a calibrated probability for every bin.
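To make the binning idea concrete, here is roughly what I have in mind (a simplified Python sketch, not my actual code; the bins come from score quantiles, i.e. the equal-samples-per-bin variant, and `scores_train` / `y_train` are placeholder names for the raw model outputs and the 0/1 labels):

```python
import numpy as np

def fit_bin_calibration(scores_train, y_train, n_bins=5):
    """Learn a calibrated probability for each score bin.

    Bins are defined by score quantiles (equal number of samples per bin);
    the calibrated probability of a bin is the observed event rate of the
    training samples that fall into it.
    """
    # interior cut points: n_bins - 1 quantiles give n_bins bins
    cuts = np.quantile(scores_train, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = np.digitize(scores_train, cuts)          # values 0 .. n_bins-1
    bin_probs = np.array([y_train[bin_idx == b].mean() for b in range(n_bins)])
    return cuts, bin_probs

def apply_bin_calibration(scores, cuts, bin_probs):
    """Map raw model scores to the calibrated probability of their bin."""
    return bin_probs[np.digitize(scores, cuts)]
```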
I tried 'learning' the probabilities for these bins from a 'train' set representing the population and then testing whether the model is calibrated on a 'validation' set using the Hosmer-Lemeshow statistic.
However, the statistic is very unstable: the p-value varies from 0.0012 to 0.28 when I repeat the experiment with different random splits. Why is this happening?

The entire population is 4,763,615 people, of whom 48,606 are cases. I'm dividing it 70%/30% between the train and validation sets. I have tried defining the bins in several ways (equal number of samples per bin, equal number of cases per bin, equal-width intervals of the original predictions, ...). Is my entire approach reasonable, or is there something else I'm missing?
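For reference, the repeated experiment looks roughly like this (again a simplified sketch, not my actual code; `scores` and `y` stand for the full-population model outputs and 0/1 outcomes, the two helper functions are the ones sketched above, and the n_bins − 2 degrees of freedom are just the conventional Hosmer-Lemeshow default, which may itself be debatable here):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, bin_idx, n_bins):
    """Hosmer-Lemeshow chi-square statistic on pre-defined bins."""
    stat = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        n_b = mask.sum()
        obs = y[mask].sum()    # observed cases in the bin
        exp = p[mask].sum()    # expected cases = sum of predicted probabilities
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_b))  # assumes 0 < exp < n_b
    return stat, chi2.sf(stat, n_bins - 2)  # conventional g - 2 degrees of freedom

# repeat the 70%/30% split to see how much the p-value moves around
n_bins = 5
for seed in range(5):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(scores))
    n_train = int(0.7 * len(scores))
    tr, va = idx[:n_train], idx[n_train:]

    cuts, bin_probs = fit_bin_calibration(scores[tr], y[tr], n_bins=n_bins)
    p_va = apply_bin_calibration(scores[va], cuts, bin_probs)
    bin_va = np.digitize(scores[va], cuts)

    stat, pval = hosmer_lemeshow(y[va], p_va, bin_va, n_bins)
    print(f"seed={seed}  HL={stat:.1f}  p-value={pval:.4f}")
```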
(related to Evaluating logistic regression and interpretation of Hosmer-Lemeshow Goodness of Fit, but not specific to logistic-regression. Also related to How to choose optimal bin width while calibrating probability models?).