As a discussion from last year about spam/ham email classification shows, just because a model achieves perfect classification accuracy does not mean that it really knows what it is doing. In that example, every email with $P(\text{spam}) < 0.49$ is ham. That is ridiculous. If an email has a $49\%$ chance of being spam, sure, it is more likely to be ham than spam, but it should not be surprising to see some such emails turn out to be spam. In fact, that should happen almost half of the time.
Phrased in terms of baseball, a $0.300$ hitter probably won't get a hit in any given at-bat, but he does get a hit $30\%$ of the time. If your supposed $0.300$ hitter keeps failing to get hits, perhaps he is not really a $0.300$ hitter.
I have a model that I know has good calibration: I generated the data in a simulation and verified with rms::calibrate that the predicted probabilities almost perfectly match the true probabilities.
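To make the setup concrete, here is a minimal sketch of the kind of thing I mean; it is not my exact simulation, and the seed, sample size, and coefficients are placeholders. Outcomes are drawn from a known logistic model, the model is refit with rms::lrm, and the fit is checked with rms::calibrate.

```r
library(rms)

set.seed(2023)                           # placeholder seed
N <- 10000                               # placeholder sample size
x <- rnorm(N)
p_true <- plogis(-0.5 + 1.2 * x)         # true probabilities; coefficients are made up
y <- rbinom(N, size = 1, prob = p_true)

fit <- lrm(y ~ x, x = TRUE, y = TRUE)    # x = TRUE, y = TRUE are needed by calibrate()
cal <- calibrate(fit, B = 200)           # bootstrap calibration curve
plot(cal)                                # predicted vs observed probabilities
```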
However, when I try to verify the calibration myself, I fail. I cannot show that the correct proportion of $1$s lies below the various thresholds: $30\%$ should lie below a cutoff of $0.3$, $80\%$ below a cutoff of $0.8$, and so on.
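Concretely, the tabulation I am attempting looks roughly like this (continuing the sketch above, so `fit` and `y` are the same objects): for each cutoff, the observed fraction of $1$s among the observations on either side of it.

```r
p_hat <- predict(fit, type = "fitted")    # predicted probabilities from the lrm fit

cutoffs <- seq(0.1, 0.9, by = 0.1)
tab <- t(sapply(cutoffs, function(cc) c(
  cutoff          = cc,
  prop_ones_below = mean(y[p_hat < cc]),  # estimate of P(y = 1 | p_hat < c)
  prop_ones_above = mean(y[p_hat > cc])   # estimate of P(y = 1 | p_hat > c)
)))
round(tab, 3)
```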
I have reasoned through the problem using Bayes' rule, where $c$ is the cutoff:
$$ P\big(y = 1 \vert \hat y > c\big) = \dfrac {P\big(\hat y > c \vert y = 1\big)P\big(y = 1\big)} {P\big(\hat y > c\big)} $$
I figure that this, as a function of $c$, should equal the cutoff $c$.
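In code, the check I have in mind is roughly the following (again continuing the snippets above): estimate each term on the right-hand side from the simulated data and compare the resulting conditional probability to the cutoff $c$.

```r
bayes_check <- function(cc, p_hat, y) {
  lhs   <- mean(y[p_hat > cc])            # P(y = 1 | p_hat > c), estimated directly
  lik   <- mean(p_hat[y == 1] > cc)       # P(p_hat > c | y = 1)
  prior <- mean(y)                        # P(y = 1)
  evid  <- mean(p_hat > cc)               # P(p_hat > c)
  c(cutoff = cc, direct = lhs, bayes = lik * prior / evid)
}

t(sapply(cutoffs, bayes_check, p_hat = p_hat, y = y))
```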
Where have I gone awry?
(Perhaps my logic is sound and I just made a coding error. A Dave can dream, right? But let's focus on the logic; I'll take another shot at the code once I have the math figured out.)