I have a Cox proportional hazards model with 6 covariates for predicting OS. I am now trying to simplify this model by dropping some of these covariates. The model is intended for a wide audience, so I'm assuming not everyone will have all 6 variables available. I would like to inform the user of the cost of not entering all the variables by giving them some sort of "% of loss" or "accuracy" figure. Here is what I did.
I fit the whole model and several one-missing-variable models, then computed the RMSE of several quantities. The problem is that I don't know how informative or "correct" this is.
I computed RMSE for:
A) Predictions of both models (so I obtain the mean error of the partial model relative to the whole model):
predsfull <- predict(cph(Surv(months, cens) ~ rcs(var1, 4) + rcs(var2, 4) + rcs(var3, 4) + var4 + var5 + rcs(var6, 4), data = df, x = TRUE, y = TRUE, surv = TRUE))
predspart <- predict(cph(Surv(months, cens) ~ rcs(var1, 4) + rcs(var2, 4) + rcs(var3, 4) + var4 + var5, data = df, x = TRUE, y = TRUE, surv = TRUE))  ## var6 is now gone
sqrt(mean((predspart - predsfull)^2))
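A minimal self-contained sketch of step A, on simulated data rather than my real df (the variable names, effect sizes, and censoring mechanism below are made up for illustration; only var1 and var2 are used to keep it short):

```r
## Requires the rms package (which loads survival).
library(rms)

set.seed(1)
n  <- 500
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n))
## Simulated survival times whose hazard depends on both covariates
df$months <- rexp(n, rate = exp(0.5 * df$var1 + 0.3 * df$var2) / 50)
df$cens   <- rbinom(n, 1, 0.7)          # 1 = event, 0 = censored

dd <- datadist(df); options(datadist = "dd")

full <- cph(Surv(months, cens) ~ rcs(var1, 4) + var2, data = df, x = TRUE, y = TRUE)
part <- cph(Surv(months, cens) ~ rcs(var1, 4),        data = df, x = TRUE, y = TRUE)

## RMSE between the two linear predictors. Note predict() on a cph fit
## returns the (centered) linear predictor, i.e. log relative hazard,
## not a probability.
rmse_lp <- sqrt(mean((predict(part) - predict(full))^2))
rmse_lp
```

One caveat I'd flag about this comparison: each Cox linear predictor is only defined up to an additive constant, so the RMSE between two separately centered linear predictors is not on an absolute scale.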
B) The same as A, but instead of predict I computed the RMSE between survest(fullmodel)$surv and the corresponding survival estimates from the partial model.
C) The RMSE of the bootstrap-corrected C-statistic given by the validate function of the rms package.
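An alternative I've seen suggested for quantifying what a submodel loses is the "adequacy index": the fraction of the full model's likelihood ratio chi-square that the submodel retains. A hedged, self-contained sketch on fabricated data (names and effects are invented; my real model has 6 covariates with splines):

```r
library(rms)

set.seed(1)
n  <- 500
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n))
df$months <- rexp(n, rate = exp(0.4 * df$var1 + 0.4 * df$var2) / 50)
df$cens   <- rbinom(n, 1, 0.7)

full <- cph(Surv(months, cens) ~ var1 + var2, data = df, x = TRUE, y = TRUE)
part <- cph(Surv(months, cens) ~ var1,        data = df, x = TRUE, y = TRUE)

## Fraction of the full model's LR chi-square captured by the nested
## submodel; 1 means no predictive information is lost.
adequacy <- as.numeric(part$stats["Model L.R."] / full$stats["Model L.R."])
adequacy
```

Since the submodel is nested in the full model, this ratio lies between 0 and 1, which makes it easy to report as a "% of information retained".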
Is my approach correct? I am somewhat confused by the quantities I am obtaining: since I'm using RMSE, I get the same units as the response, so in all scenarios I seem to be getting probabilities (or probability-scale quantities such as the C-statistic). I am therefore reporting something like "you are sacrificing 0.023 of discriminative power (on the C-statistic scale) using a partial model vs. the full model".
Does any of that make sense?
PS: The code is just an example; I'm doing this for all 6 covariates, and even removing 2 covariates at a time. So the fact that some of them are non-linear has nothing to do with dropping a covariate; I just wrote the closest form of my actual model that I could.