How to do ROC-analysis in R with a Cox model

Question

I've created a few Cox regression models and I would like to see how well these models perform and I thought that perhaps a ROC-curve or a c-statistic might be useful similar to this articles use:

J. N. Armitage och J. H. van der Meulen, ”Identifying co‐morbidity in surgical patients using administrative data with the Royal College of Surgeons Charlson Score”, British Journal of Surgery, vol. 97, num. 5, ss. 772-781, Maj 2010.

Armitage used logistic regression but I wonder if it's possible to use a model from the survival package, the survivalROC gives a hint of this being possible but I can't figure out how to get that to work with a regular Cox regression.

I would be grateful if someone would show me how to do a ROC-analysis on this example:

library(survival)
data(veteran)

attach(veteran)
surv <- Surv(time, status)
fit <- coxph(surv ~ trt + age + prior, data=veteran)
summary(fit)

If possible I would appreciate both the raw c-statics output and a nice graph

Thanks!

Update

Thank you very much for answers. @Dwin: I would just like to be sure that I've understood it right before selecting your answer.

The calculation as I understand it according to DWin's suggestion:

library(survival)
library(rms)
data(veteran)

fit.cph <- cph(surv ~ trt + age + prior, data=veteran, x=TRUE, y=TRUE, surv=TRUE)

# Summary fails!?
#summary(fit.cph)

# Get the Dxy
v <- validate(fit.cph, dxy=TRUE, B=100)
# Is this the correct value?
Dxy = v[rownames(v)=="Dxy", colnames(v)=="index.corrected"]

# The c-statistic according to the Dxy=2(c-0.5)
Dxy/2+0.5

I'm unfamiliar with the validate function and bootstrapping but after looking at prof. Frank Harrel's answer here on R-help I figured that it's probably the way to get the Dxy. The help for validate states:

... Somers' Dxy rank correlation to be computed at each resample (this takes a bit longer than the likelihood based statistics). The values corresponting to the row Dxy are equal to 2 * (C - 0.5) where C is the C-index or concordance probability.

I guess I'm mostly confused by the columns. I figured that the corrected value is the one I should use but I haven't really understood the validate output:

      index.orig training    test optimism index.corrected   n
Dxy      -0.0137  -0.0715 -0.0071  -0.0644          0.0507 100
R2        0.0079   0.0278  0.0037   0.0242         -0.0162 100
Slope     1.0000   1.0000  0.2939   0.7061          0.2939 100
...

In the R-help question I've understood that I should have "surv=TRUE" in the cph if I have strata but I'm uncertain on what the purpose of the "u=60" parameter in the validate function is. I would be grateful if you could help me understand these and check that I haven't made any mistakes.

I would probably take a look at the [rms](http://cran.r-project.org/web/packages/rms/index.html) package and its `cph()` command. — chl, Oct 24 '11 at 15:00
`index.corrected` is what should be emphasized. These are estimates of likely future performance. `u=60` is not needed in `validate` since you have no strata. If you had strata, survival curves can cross, and you need to specify a particular time point for getting the generalized ROC area. — Frank Harrell, Nov 01 '11 at 12:34

score 5 · Answer 1 · answered Oct 27 '11 at 22:56

5

Depending on your needs, embedding a model inside a bigger model and doing a "chunk" likelihood ratio $\chi^2$ test for the added value of the additional variables will give you a powerful test. My book talks about an index arising from this approach (the "adequacy index").

answered Oct 27 '11 at 22:56

Frank Harrell

74,029
5
148
322

+1 for guiding me in the right direction. I just finished doing the C-statistic and the more detailed score that I'm looking at had a C-statistic of 0.4365081 while the other had 0.4414625 (I guess that I should be counting 0.5-Dxy/2 in my case). I took quite a while doing the calculation on my 140 000 sample; I had to lower the bootstraps to 10 and I'm not certain what the impact of that is. I'm looking forward to reading your book (it's in the mail) and hopefully it will help me understand the methodology better and compare the C-statistic with the adequacy index. – Max Gordon Nov 03 '11 at 16:50
Good. It is not easy to tell whether .44 vs. .43 means much without looking at distributions of predicted values. – Frank Harrell Nov 03 '11 at 20:47
I understand it's hard to comment on numbers like that. I'll try to look into the distribution. My main interpretation of the outcome is that very little is explained by my model and even though there is a small difference it's probably not vastly important. It would be interesting what to expect in a survival setting - reaching a value of .8 as they did in the analysis that I referenced in my question seems pretty far away... but then again my survival is the survival of a implanted prosthesis and not patient survival. They also used logistic regression that perhaps changes the estimate. – Max Gordon Nov 05 '11 at 14:15
Logistic regression would not work if time is important or follow-up time varies across subjects. Back to the original question, predicted risks will have a narrow distribution if very little variation is explained by the model. – Frank Harrell Nov 05 '11 at 14:39
Just got your book... I've had a quick lock at the survival part but when I try out your Case study in chapter 20 but I get an error on the impute(w, sz) part: 'variable sz does not have a names() attribute'. I followed chapt. 8: loaded the dataframe with getHdata(prostate) (couldn't find the website in the book), did the w – Max Gordon Nov 09 '11 at 17:38
I think I got it to work with impute(sz, object=w). Somehow I can't get quite the same result as your book, for instance the code on p. 163 "cph(S ~ sz + sg + ap + sbp + dbp + age + wt + hg + ekg + pf + bm +hx)" has an L.R. 83.7 & d.f. 14 and on p. 511 the first example doesn't run while the second gives a lower L.R. than the 126 your book suggested. Am I missing something or have the calculations changed since the publishing? – Max Gordon Nov 09 '11 at 21:56
http://biostat.mc.vanderbilt.edu/rms has R code for some of the book. – Frank Harrell Nov 09 '11 at 23:08

score 2 · Accepted Answer · answered Oct 25 '11 at 14:33

2

@chl has pointed to a specific answer to your question. The 'rms' package's cph function will produce a Somers-D which can be transformed trivially into a c-index. However, Harrell (who introduced the c-index to biostatistical practice) thinks this is unwise as a general strategy for assessing prognostic measures, because it has low power for discrimination among alternatives. Instead of relying on the surgical literature for your methodological guidance, it would be wiser to seek out the accumulated wisdom in Harrell's text, "Regression Modeling Strategies" or Steyerberg's "Clinical Prediction Models".

answered Oct 25 '11 at 14:33

DWin

7,005
17
32

7

Thanks for the note. I think that $D_{xy}$ and $C$ are not bad for describing the predictive discrimination of a single pre-specified model. But as you said, they lack power for doing more than that. – Frank Harrell Oct 25 '11 at 17:01
Thank you for your answers, my situation is that I have three different scores that I want to compare and see how they perform. I haven't had the time to look into the Somers-D part and I'll get back once I've had the time (I had a quick look and didn't find anything useful). I've also ordered @FrankHarrell book, "Regression Modeling Strategies", ISBN 13: 978-0387952321, and hopefully it will guide me in my choices. – Max Gordon Oct 27 '11 at 17:49
2

Since Dxy = 2*(c- 0.5) the calculation of c given Dxy should be trivial. – DWin Oct 27 '11 at 18:16

How to do ROC-analysis in R with a Cox model

Update

2 Answers2

Linked