The val.prob
function in the rms
R
package has similarities to the calibrate
function discussed in another question of mine, but a key difference in that val.prob
has no notion of a probability model. Indeed, I can run val.prob
on random data.
library(rms)
set.seed(2022)
N <- 100
p <- rbeta(N, 1, 1)
y <- rbinom(N, 1, p)
rms::val.prob(rbeta(N, 1, 1), y)
Unsurprisingly, the results show the random numbers between $0$ and $1$ to be unrelated to the binary y
.
I have some idea of how val.prob
works.
Draw a bootstrap sample of the
p-y
pairs.Fit a model to the
p-y
pairs.Repeat, repeat, repeat...
Use some kind of average to say what the true probability of $Y=1$ is for each given probability
p
.
In R
code:
set.seed(2022)
N <- 500
B <- 1000
pr <- sort(rbeta(N, 1, 1))
y <- rbinom(N, 1, pr)
m <- matrix(NA, B, N)
for (i in 1:B){
idx <- sample(seq(1, N, 1), N, replace = T)
pb <- pr[idx]
yb <- y[idx]
Lb <- glm(yb ~ pb, family = binomial)
m[i, ] <- 1/(1 + exp(-predict(Lb, data.frame(pb = pr)) ))
if (i %% 50 == 0 | i < 6 | B - i < 6){
print(i/B*100)
}
}
pL <- apply(m, 2, mean)
plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1), xlab = "Asserted Probability", ylab = "True Probability")
abline(0, 1)
However, I simulated my data! I know that pr
generated y
, so the calibration should be pretty good, not curved like it is. Even with a large sample size of $50000$, that curve remains when I do my own implementation. When I use rms::val.prob
, even for a sample size of $500$, I have no such issue.
What step(s) am I missing in my implementation?