I have a dataset with more than 20 predictors and a single binary response variable. With only $n=181$ observations (64 deaths, 117 survivors), I decided to use penalized logistic regression with all predictors included (so that I avoid the problems associated with model selection). Nevertheless, I also have to produce a "simpler" model (i.e. one simple enough for a nomogram-style hand calculation in a clinical setting). To that end, I intend to use rms's fastbw.
To illustrate my questions, I'll use the support dataset from Hmisc:
library( rms )
getHdata( support )
fit <- lrm( hospdead ~ rcs( age ) + sex + rcs( meanbp ) + rcs( crea ) + rcs( ph ) + rcs( sod ), data = support, x = TRUE, y = TRUE )
fit
First, I apply penalization:
p <- pentrace( fit, seq( 0, 10, by = 0.01 ) )
plot( p )
fitPen <- update( fit, penalty = p$penalty )
fitPen
I hope I'm correct up to this point.
Next, I validate the model and compute its calibration curve. If I understand correctly, I shouldn't validate/calibrate the simpler model itself; rather, I have to run the necessary functions on the original model, but with bw=TRUE. That is:
validate( fitPen, B = 1000, bw = TRUE )
plot( calibrate( fitPen, B = 1000, bw = TRUE ) )
Question #1: Am I correct in this? I.e., is it true that to get the simpler model's validation/calibration I have to run these functions not on the simpler model but on the original one (with bw=TRUE)? And will the results then pertain to the simpler model, even though I haven't run validation/calibration on the simpler model itself?
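To make Question #1 concrete, here is a minimal sketch of how I would inspect what bw=TRUE does during validation. It rests on several assumptions of mine: I restrict the data to complete cases (the data frame d is my own helper), I use the unpenalized fit and only B = 50 resamples to keep it quick, and I assume the variables retained in each resample are stored in the "kept" attribute of the result.

```r
library(rms)
getHdata(support)

# Complete cases only, so that every step works on the same rows (my assumption)
vars <- c("hospdead", "age", "sex", "meanbp", "crea", "ph", "sod")
d <- support[complete.cases(support[, vars]), vars]

# Unpenalized full model, used here only to keep the sketch short
fit <- lrm(hospdead ~ rcs(age) + sex + rcs(meanbp) + rcs(crea) + rcs(ph) + rcs(sod),
           data = d, x = TRUE, y = TRUE)

set.seed(1)
# With bw = TRUE the backward elimination is repeated in every bootstrap resample
v <- validate(fit, B = 50, bw = TRUE)
v

# Assumed location of the per-resample retention indicators
kept <- attr(v, "kept")
```

If this reading is right, the printed summary also lists the factors retained in each resample, which would show how stable the variable selection is.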
Next, I try to obtain the simpler model explicitly. Interestingly, (Harrell, 1998) uses a method based on calculating the logits for the observations, modeling them with OLS, and then narrowing this model with fastbw. This is surely a gap in my statistical knowledge, but I simply can't understand why this is necessary.
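For reference, this is how I understand that approximation procedure; it is a sketch under my own assumptions, not necessarily the exact code from the reference. The complete-case data frame d, the coarser penalty grid, and the object names lp, fitOls and s are mine; sigma = 1 and aics = 10000 are the settings I believe are intended for this kind of use.

```r
library(rms)
getHdata(support)

# Complete cases only, so that the logits line up with the rows (my assumption)
vars <- c("hospdead", "age", "sex", "meanbp", "crea", "ph", "sod")
d <- support[complete.cases(support[, vars]), vars]

# Full penalized model, as above but with a coarser penalty grid for speed
fit <- lrm(hospdead ~ rcs(age) + sex + rcs(meanbp) + rcs(crea) + rcs(ph) + rcs(sod),
           data = d, x = TRUE, y = TRUE)
p <- pentrace(fit, seq(0, 10, by = 0.5))
fitPen <- update(fit, penalty = p$penalty)

# Step 1: predicted logits (linear predictor) from the full model
d$lp <- predict(fitPen)

# Step 2: model the logits with OLS on the same terms; sigma = 1 fixes the
# residual standard deviation, since the full set of terms reproduces lp
fitOls <- ols(lp ~ rcs(age) + sex + rcs(meanbp) + rcs(crea) + rcs(ph) + rcs(sod),
              sigma = 1, data = d, x = TRUE)

# Step 3: backward elimination on the OLS approximation; the large aics
# value makes fastbw delete every variable so the whole sequence is shown
s <- fastbw(fitOls, aics = 10000)
s
```

The printed deletion sequence should show how much of the full model's logit is still explained after each variable is dropped, which is where one would choose the cut-off for the simpler model.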
Question #2: Why can't we use fastbw directly on the logistic regression model? For example:
fastbw( fitPen )
fitApprox <- lrm( as.formula( paste( "hospdead ~", paste( fastbw( fitPen )$names.kept, collapse = "+" ) ) ), data = support, x = TRUE, y = TRUE )
Finally, I am not completely sure where in the whole process I should apply penalization.
Question #3: Should I penalize the original model, then run fastbw (see above), and then re-penalize the resulting model? I.e.:
p <- pentrace( fitApprox, seq( 0, 10, by = 0.01 ) )
plot( p )
fitApproxPen <- update( fitApprox, penalty = p$penalty )
fitApproxPen
Or do I not need to re-penalize the narrowed model? Or is it unnecessary to penalize the original model, so that penalizing only the simpler one is sufficient? (I suspect that the first option is the correct one, but I'm not entirely sure.)