Comparing Cox Proportional Hazards Models (variable selection)

Question

I am using a Cox proportional hazards model to run a survival analysis in R on a number of non-nested, distinct covariates such as Age, Blood Type, Cancer, etc.:

 A, B, C, D, E

When I run the model on the omnibus null hypothesis:

surv ~ A + B + C + D

The effects of all of the covariates are non-significant because the number of subjects that have measurements for every covariate is relatively small. However, when I isolate single covariates or other combinations of covariates in different Cox models:

surv ~ A    
surv ~ A + C
surv ~ B + D

I see significant effects because the sample is larger (i.e. the model discards fewer observations).

What I'm having difficulty understanding is how to do the following:

1. Comparing the different Cox models for the best fit, i.e. is surv ~ A + B + D a better model than surv ~ A + C? Should I be comparing the likelihood ratio, Wald, or score (logrank) statistics?
2. Is it possible to run every possible combination of covariates to determine the best model? I have about 15 covariates.
3. More broadly, is this tactic the best approach to optimizing for both significant covariates and overall model "cost"? I will be attaching a cost to each distinct Cox model, i.e. using covariates A + B + C in the model costs \$100, while using covariates A + B costs \$75 and using only covariate A costs \$10. I'd like to look at the cost for each combination of covariates vs. the accuracy of each Cox model.

Thanks very much for your help!

score 5 · Accepted Answer · answered Aug 28 '14 at 03:34

In general there is no reason to do variable selection. The model uncertainty and bias resulting from it are problematic. Insignificant variables are not tragic. And the data are incapable of telling which variables are "really" important. But if you have true costs of measuring variables, you can fit a well-defined sequence of models by adding variables in ascending order of cost, and stop when you have the best model for the money. There is little model uncertainty when using an a priori ordering of variables.
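As a rough illustration of the cost-ordered sequence described above, here is a minimal R sketch. It is not taken from the thread: the `lung` data set shipped with the `survival` package stands in for the asker's data, the per-variable dollar costs are invented, and total cost is assumed to be the sum of the per-variable costs. All models are fitted to the same complete cases, so that differences in fit are not driven by differences in which subjects were dropped.

```r
library(survival)  # coxph(), Surv(), and the lung example data

# Hypothetical per-variable measurement costs (made-up numbers),
# sorted so that variables enter in ascending order of cost.
costs <- sort(c(age = 5, sex = 10, ph.ecog = 25, wt.loss = 60))
vars  <- names(costs)

# One fixed set of complete cases, so every model uses the same subjects.
d <- na.omit(lung[, c("time", "status", vars)])

# Fit the pre-specified nested sequence: cheapest variable first,
# then the two cheapest, and so on.
steps <- lapply(seq_along(vars), function(k) {
  rhs <- paste(vars[1:k], collapse = " + ")
  fit <- coxph(as.formula(paste("Surv(time, status) ~", rhs)), data = d)
  data.frame(
    model      = rhs,
    total_cost = sum(costs[1:k]),
    # fit$loglik holds the log partial likelihood at the initial (null)
    # and final coefficients; twice the difference is the LR chi-square.
    lr_chisq   = 2 * diff(fit$loglik),
    df         = length(coef(fit))
  )
})
curve <- do.call(rbind, steps)
print(curve)
```

Plotting the result, for example with `plot(curve$total_cost, curve$lr_chisq, type = "b")`, gives a cost-versus-fit curve to inspect. The ordering comes only from the costs, never from the fitted results.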

Frank Harrell

Thank you. When you say "best model", how are you determining "best"? – BeginnersMindTruly Aug 28 '14 at 21:54

This might be a point on the curve (x=total cost, y=likelihood ratio chi-square) where things start to flatten. – Frank Harrell Aug 28 '14 at 22:21
Again, thanks. But why would you use the likelihood ratio as opposed to the Wald test or the score (logrank) test? In looking at model accuracy, the coxph function in R presents (for example): – BeginnersMindTruly Aug 28 '14 at 22:57
`Likelihood ratio test = 10.47 on 2 df, p=0.005316. Wald test = 10.02 on 2 df, p=0.006686. Score (logrank) test = 10.24 on 2 df, p = 0.005966` vs. (for a 13 covariate example): `Likelihood ratio test = 54.06 on 13 df, p=5.897e-07. Wald test = 41.12 on 13 df, p=9.115e-05. Score (logrank) test = 47.1 on 13 df, p = 9.304e-06`. Is it valid to say one model is more accurate than the other and if so, per your prior comment, I'm assuming you would recommend doing so based on the likelihood ratio? – BeginnersMindTruly Aug 28 '14 at 23:03
The likelihood ratio $\chi^2$ statistic has the best properties, and since it is a simple function of the $-2$ log likelihood, it is more likely to order the models correctly. Be sure that you do not use any $\chi^2$ or $P$-values to select the model sequence. – Frank Harrell Aug 29 '14 at 12:19
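
For reference, the relationship mentioned in the last comment can be written out. The notation is assumed here rather than taken from the thread: $\ell(\cdot)$ is the log partial likelihood, $\hat\beta$ the fitted coefficients, and $\ell(0)$ the value for the null model with all coefficients zero.

$$
\text{LR}\ \chi^2 \;=\; 2\,\bigl[\ell(\hat\beta) - \ell(0)\bigr] \;=\; \bigl[-2\,\ell(0)\bigr] - \bigl[-2\,\ell(\hat\beta)\bigr]
$$

When two models are fitted to the same subjects, $\ell(0)$ is identical for both, so ranking them by LR $\chi^2$ is the same as ranking them by $-2\,\ell(\hat\beta)$. In R, both log-likelihood values are stored in the `loglik` component of a `coxph` fit.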