Measuring Cox PH predictions

Question

I'm running a Cox PH model in Python using the lifelines package.

The two performance measures this package offers are the log-likelihood and the concordance index.

I am aware that the log-likelihood is not well suited to measuring performance on its own; it is more useful for comparing two or more models.

I've also seen mixed comments regarding the C-index: some say it is the correct way to evaluate predictions from survival models, others say it's not good because it is essentially a rank correlation and does not take precision into account.
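
To make the "rank correlation, not precision" point concrete, here is a minimal sketch of Harrell's C-index in plain Python. It is a simplified version (pairs with tied observed times are skipped) and not the lifelines implementation; the four observations and both sets of predictions are made up for illustration.

```python
from itertools import combinations


def harrell_c(durations, predicted_times, events):
    """Simplified Harrell's C: share of comparable pairs ranked correctly."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(durations)), 2):
        # Reorder so that i is the observation with the shorter observed time.
        if durations[j] < durations[i]:
            i, j = j, i
        # A pair is comparable only if the shorter time is an observed event.
        # Tied observed times are skipped here for simplicity.
        if durations[i] == durations[j] or not events[i]:
            continue
        comparable += 1
        if predicted_times[i] < predicted_times[j]:
            concordant += 1.0
        elif predicted_times[i] == predicted_times[j]:
            concordant += 0.5
    return concordant / comparable


# Hypothetical data: observed times, event flags (0 = censored).
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]

close_predictions = [1, 3, 5, 7]        # within one time unit of the truth
far_predictions = [100, 300, 500, 700]  # same ordering, wildly wrong scale

c_close = harrell_c(durations, close_predictions, events)
c_far = harrell_c(durations, far_predictions, events)
print(c_close, c_far)
```

Both sets of predictions rank the observations identically, so they get the same C-index even though one is far from the observed times. That is the sense in which concordance ignores precision.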

In particular, in this package I can call predict_median, which returns the median time to cure/survive, and inf or a very large number if the observation is not expected to cure. Here is an example to make it clear:

from lifelines import CoxPHFitter

# fit on all rows except the last 10
daten2 = daten.iloc[:-10]

cph = CoxPHFitter(penalizer=0.05)

cph.fit(daten2, "length_of_arrears", event_col='cured')
Out[269]: <lifelines.CoxPHFitter: fitted with 14080 total observations, 4573 right-censored observations>

d_data = daten.iloc[0:10,:]

cph.predict_median(d_data)
Out[271]: 
0    612.0
1    579.0
2    104.0
3      3.0
4      4.0
5      4.0
6      4.0
7      7.0
8      9.0
9      4.0
Name: 0.5, dtype: float64

d_data.length_of_arrears
Out[272]: 
0    287.0
1    196.0
2     75.0
3      3.0
4      8.0
5      3.0
6      3.0
7     72.0
8     27.0
9      3.0
Name: length_of_arrears, dtype: float64

d_data.cured
Out[273]: 
0    0.0
1    0.0
2    0.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
Name: cured, dtype: float64

I would like to get an estimate of precision, that is, how many days off the predicted median survival time is. Is there anything like this?

There's a limit though: if the true event time was censored, how do you measure how "off" you were? This is why lifelines doesn't have a typical "L-d" measure (like squared-error). Not in lifelines, but something that _should_ be, is the Brier score, which may solve your problem. – Cam.Davidson.Pilon, Nov 28 '20 at 21:37
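
The censoring limit can be seen on the ten rows from the question. The sketch below computes "days off" only for rows where the cure was observed, and for censored rows checks only whether the predicted median contradicts what is known (a predicted median shorter than the censoring time). This is an illustration, not a recommended metric: dropping censored rows biases the error toward short durations.

```python
from statistics import mean, median

# Values copied from the question's output.
predicted = [612.0, 579.0, 104.0, 3.0, 4.0, 4.0, 4.0, 7.0, 9.0, 4.0]
observed = [287.0, 196.0, 75.0, 3.0, 8.0, 3.0, 3.0, 72.0, 27.0, 3.0]
cured = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

rows = list(zip(predicted, observed, cured))

# Absolute error is only defined where the event time was actually observed.
abs_errors = [abs(p - o) for p, o, c in rows if c == 1]
mean_abs_error = mean(abs_errors)
median_abs_error = median(abs_errors)

# For censored rows the true time is only known to exceed the observed time,
# so the most we can say is whether the prediction falls below that bound.
contradicted = [p < o for p, o, c in rows if c == 0]

print(abs_errors, mean_abs_error, median_abs_error, contradicted)
```

On these rows the errors for the seven cured accounts are 0, 4, 1, 1, 65, 18 and 1 days, and none of the three censored predictions falls below its censoring time. How far off those three are cannot be determined from the data.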

score 0 · Answer 1 · answered Nov 28 '20 at 21:19

A resampling-based approach could provide what you need, with an emphasis on how well your model might perform on a new sample from the population. See this page for the rationale, how to proceed, and links to more information. In brief, you repeat the entire modeling process on multiple bootstrap samples from your data, and test the multiple models' performance on the full data set. With median survival as your measure, the distribution of test performance results among the multiple models would give estimates of bias and variability in model performance.
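
The procedure described above can be sketched roughly as follows. This is a generic outline under stated assumptions, not code from the linked page: `bootstrap_performance`, `fit` and `score` are hypothetical names, and a straight-line fit with mean absolute error stands in for the Cox model so that the example runs with NumPy alone. For the model in the question, `fit` would wrap the whole modeling process (including `CoxPHFitter(...).fit(...)`) and `score` would be whatever metric you settle on.

```python
import numpy as np


def bootstrap_performance(data, fit, score, n_boot=200, seed=0):
    """Refit on bootstrap samples and score each refit on the full data set."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Apparent performance: model fitted and scored on the same full data.
    apparent = score(fit(data), data)
    test_scores = np.empty(n_boot)
    optimism = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        sample = data[idx]                # a DataFrame would use .iloc[idx]
        model = fit(sample)               # repeat the entire modeling process
        test_scores[b] = score(model, data)
        # Difference between performance on the sample used for fitting
        # and performance on the full data set.
        optimism[b] = score(model, sample) - test_scores[b]
    return apparent, test_scores, optimism


# Toy stand-in for the survival model (synthetic data, assumption only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)
toy = np.column_stack([x, y])


def fit_line(d):
    return np.polyfit(d[:, 0], d[:, 1], deg=1)


def mean_abs_error(coefs, d):
    return float(np.mean(np.abs(np.polyval(coefs, d[:, 0]) - d[:, 1])))


apparent, test_scores, optimism = bootstrap_performance(toy, fit_line, mean_abs_error)
print("apparent:", apparent)
print("mean optimism (bias estimate):", optimism.mean())
print("spread of test performance:", test_scores.std())
```

The mean of `optimism` estimates the bias of the apparent performance, and the spread of `test_scores` gives a sense of its variability.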

answered Nov 28 '20 at 21:19

EdM

What metric would I use in that case? Is concordance fine? – amestrian Nov 30 '20 at 13:39
@amestrian think about how you want to use your model and choose a corresponding metric. If what you care about is predicting the correct rank order among events, concordance is fine. It's not sensitive for distinguishing among competing models but is OK for characterizing a particular model in that respect. If you care more about times-to-events, then calibration with respect to outcome probabilities at one or more times of interest would be better; we've discussed that [here](https://stats.stackexchange.com/a/498313/28500). – EdM Nov 30 '20 at 13:52
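
As a rough illustration of calibration at a single time of interest: group observations by their predicted probability of having had the event by time t, then compare the mean prediction in each group with one minus the Kaplan-Meier estimate for that group. The sketch below is a simplified NumPy version under stated assumptions; `kaplan_meier_at` and `calibration_table` are hypothetical helper names, and the data are synthetic.

```python
import numpy as np


def kaplan_meier_at(durations, events, t):
    """Kaplan-Meier estimate of the survival probability at time t."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Step through the distinct event times up to and including t.
    for u in np.unique(durations[events & (durations <= t)]):
        at_risk = np.sum(durations >= u)
        n_events = np.sum((durations == u) & events)
        surv *= 1.0 - n_events / at_risk
    return float(surv)


def calibration_table(pred_event_prob, durations, events, t, n_bins=2):
    """Return (mean predicted, observed) event probability at t per bin."""
    pred_event_prob = np.asarray(pred_event_prob, dtype=float)
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    order = np.argsort(pred_event_prob)
    table = []
    for idx in np.array_split(order, n_bins):
        observed = 1.0 - kaplan_meier_at(durations[idx], events[idx], t)
        table.append((float(pred_event_prob[idx].mean()), observed))
    return table


# Synthetic example: predicted probability of the event by t = 4.5.
durations = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
pred = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4]

table = calibration_table(pred, durations, events, t=4.5, n_bins=2)
for mean_pred, observed in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}")
```

A well-calibrated model has predicted and observed values close to each other in every group. Because Kaplan-Meier handles censoring within each group, this avoids needing a true event time for censored observations.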
