One problem with your overall plan is that the error estimates in your later Cox models won't account for the fact that you used this same data sample, via random forest, to select the predictors entering those Cox models. There's also a danger that the process will yield a model that works well on your particular data set but generalizes poorly to new samples from the same underlying population.
Before you go any further, look at Harrell's course notes on regression modeling strategies and the other resources linked from the associated web site. Pay particular attention to Chapter 4 of the course notes, on Multivariable Modeling Strategies. Instead of jumping blindly into automated predictor selection, it's usually best to apply your knowledge of the subject matter and of your data to develop (without looking at the outcomes) a set of candidate predictors whose size is appropriate to the scale of your data. For example, in a Cox model you should generally limit yourself to about 1 candidate predictor per 15 events in your data set, as in the sketch below.
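As a rough illustration of that budget (assuming a data frame df with an event indicator status coded 1 for an event, which are just placeholder names):

```r
## Hypothetical data frame 'df' with event indicator 'status' (1 = event, 0 = censored)
n_events <- sum(df$status == 1)          # count events, not cases
max_candidates <- floor(n_events / 15)   # ~1 candidate predictor per 15 events
max_candidates
```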
Also, unless you have thousands of cases, you shouldn't be splitting the data into separate train/test sets (as you imply to be your approach in a comment), as that costs you precision in the model and power in testing. Evaluating the model-building process via bootstrapping is a much more efficient use of your data. See Chapter 5 of Harrell's course notes; a sketch of bootstrap validation follows.
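As a minimal sketch of what that looks like with Harrell's rms package (the variable names time, status, age, sex, and biomarker are placeholders for whatever you end up using):

```r
library(rms)

dd <- datadist(df); options(datadist = "dd")

## Keep the design matrix and response (x = TRUE, y = TRUE) so that
## validate() can refit the model on bootstrap resamples.
fit <- cph(Surv(time, status) ~ age + sex + biomarker,
           data = df, x = TRUE, y = TRUE, surv = TRUE)

## Refit on bootstrap resamples and report optimism-corrected indices
## (Dxy, calibration slope, etc.)
validate(fit, method = "boot", B = 200)
```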
Then, if you like trees, why not use a gradient-boosted model directly? The gbm package in R can handle Cox models, estimate the baseline hazard (smoothed, if you wish), and return predictions of log-hazards for new cases; those log-hazard values are the same type of prediction you get from a standard Cox model. (Starting from a gbm Cox model might require some extra calculations on your part, though.) A sketch is below.
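For illustration, here is a minimal sketch with gbm, again using placeholder variable names; the tuning values (n.trees, shrinkage, interaction.depth, cv.folds) are assumptions you would need to choose for your own data:

```r
library(gbm)
library(survival)

## Boosted Cox model; distribution = "coxph" uses the Cox partial likelihood
fit <- gbm(Surv(time, status) ~ age + sex + biomarker,
           data = df, distribution = "coxph",
           n.trees = 3000, interaction.depth = 2,
           shrinkage = 0.01, cv.folds = 5)

best <- gbm.perf(fit, method = "cv")    # number of trees chosen by CV

## Predicted log-hazards (relative to baseline) for new cases
lp_new <- predict(fit, newdata = newdf, n.trees = best)

## Smoothed baseline cumulative hazard at the observed times; survival
## curves can then be built by hand from lp_new and H0
lp_train <- predict(fit, newdata = df, n.trees = best)
H0 <- basehaz.gbm(t = df$time, delta = df$status, f.x = lp_train,
                  t.eval = sort(unique(df$time)), smooth = TRUE)
```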
If you have thousands of predictors and need to do large-scale predictor selection, use a principled method like LASSO. The R glmnet package can handle Cox models and provide predicted survival curves, based on the model, for specified covariate values; see the sketch below.
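Here is a minimal sketch, assuming a numeric predictor matrix and a reasonably recent glmnet version (one that accepts a Surv response directly and supplies a survfit method for Cox fits); the column names are placeholders:

```r
library(glmnet)
library(survival)

x <- as.matrix(df[, c("age", "sex", "biomarker")])   # placeholder numeric predictors
y <- Surv(df$time, df$status)

## LASSO Cox model; penalty chosen by cross-validated partial-likelihood deviance
cvfit <- cv.glmnet(x, y, family = "cox", type.measure = "deviance")

newx <- as.matrix(newdf[, c("age", "sex", "biomarker")])

## Predicted linear predictors (log relative hazards) at the CV-chosen penalty
lp_new <- predict(cvfit, newx = newx, s = "lambda.min")

## Predicted survival curves for the new covariate values; the training
## x and y must be supplied so the baseline hazard can be estimated
sf <- survfit(cvfit, s = "lambda.min", x = x, y = y, newx = newx)
plot(sf)
```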
To answer your original question: Cox survival models are fit by maximizing the partial likelihood, so measures based on partial-likelihood deviance are generally best. That's what gbm uses for its gradient evaluation and what glmnet uses by default for cross-validation. The latter package offers concordance (the fraction of pairs of observations in which the predicted and actual event order agree) as an alternative, but Harrell (who introduced the concordance C-index to survival analysis) recommends against concordance for model comparison; see this page, for example.

Once you have developed a model, there are many measures of discrimination and calibration available for evaluating it; see Chapters 20 and 21 of Harrell's course notes for those aspects of Cox models, and the brief sketch below. If your models can't properly be compared on deviance, you could evaluate their performance with those measures instead, for example by building the competing models on multiple bootstrap samples of the data and evaluating the measures on their application to the full data set.
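As one concrete example on the calibration side, here is a sketch with rms, assuming the same placeholder variables as above and an evaluation time point of 365 (also a placeholder you should replace with something clinically relevant):

```r
library(rms)
library(survival)

dd <- datadist(df); options(datadist = "dd")

## calibrate() needs the stored design matrix and response (x, y), stored
## survival estimates (surv), and the evaluation time point (time.inc)
fit <- cph(Surv(time, status) ~ age + sex + biomarker, data = df,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = 365)

## Bootstrap overfitting-corrected calibration curve for predicted
## survival probability at t = 365
cal <- calibrate(fit, method = "boot", B = 200, u = 365)
plot(cal)
```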