Interpretation of an output from a SURVIVAL ANALYSIS - COX

Question

I am struggling to understand the output of my survival analysis. Here are my code and outputs:

library(survival)
library(survminer)

summary(coxph(Surv(A_diffdate, Arrythmia) ~ AST, data = df))


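# res.cut is not defined in the post; presumably it was created with surv_cutpoint(), as in the linked reference
res.cut <- surv_cutpoint(df, time = "A_diffdate", event = "Arrythmia", variables = "AST")
plot(res.cut, "AST", palette = "npg") # cut-off = 25.4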

res.cat <- surv_categorize(res.cut)
head(res.cat)


df$AST_cat <- ifelse(df$AST < 25.4, 'Healthy', 'High_AST')
fit <- survfit(Surv(A_diffdate, Arrythmia) ~ AST_cat, data = df)
ggsurvplot(fit, data = df, risk.table = TRUE, conf.int = TRUE, pval = TRUE)

res.cox <- survfit(coxph(Surv(A_diffdate, Arrythmia) ~ 1, data = df))
ggsurvplot(res.cox, data = df, title = "AST (Arrythmia)", conf.int = FALSE, risk.table = "absolute")

res.info <- coxph(Surv(A_diffdate, Arrythmia) ~ AST + sex + age + bmi, data = df)
summary(res.info)

It would be very helpful to get a comprehensive interpretation of these results, which I created based on this web page: https://rpkgs.datanovia.com/survminer/reference/surv_cutpoint.html

Thank you so much in advance! Best

score 2 · Accepted Answer · answered Apr 21 '21 at 18:58

There are several ways in which you could improve your modeling of these data. Much of what's below is based on what you can learn from Frank Harrell's course notes or book in terms of multiple-regression modeling in general (especially Chapter 4) and Cox survival models in particular (Chapters 20 and 21).

First, do not categorize a continuous predictor based on its association with outcome. See this thread for why categorization is seldom a good idea. Although your p-value of 0.0094 after dichotomizing AST looks very "significant" (graph of survival curves based on the cutoff of 25.4), that p-value is uninterpretable as it doesn't take into account the fact that you used the data to pick the cutoff.*

If you choose a cutoff based on an association with outcome in one particular data set, that "optimal" cutoff will probably not apply to a new data set. Thus your results will not generalize reliably to new data sets, which is typically what one wants to accomplish. You could demonstrate this for yourself: try the automated cutoff choice on multiple bootstrapped samples of your data and see how much the "optimal" cutoffs vary.
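A minimal sketch of that bootstrap check. It is not run on the data from the question: it assumes the `lung` data set shipped with the survival package as a stand-in, with `age` playing the role of AST. The helper `best_cutoff()` is hypothetical and simply picks the candidate cutoff with the largest log-rank chi-square, which mimics the idea behind surv_cutpoint() but is not its exact algorithm.

```r
library(survival)

# Stand-in data (assumption): replace with your own data frame and columns.
d <- na.omit(lung[, c("time", "status", "age")])

# Hypothetical helper: return the candidate cutoff with the largest
# log-rank chi-square, skipping cutoffs that leave a very small group.
best_cutoff <- function(dat, candidates, min_n = 10) {
  chisq <- vapply(candidates, function(cutoff) {
    dat$high <- dat$age >= cutoff
    if (min(sum(dat$high), sum(!dat$high)) < min_n) return(NA_real_)
    survdiff(Surv(time, status) ~ high, data = dat)$chisq
  }, numeric(1))
  if (all(is.na(chisq))) return(NA_real_)
  candidates[which.max(chisq)]
}

# Candidate cutoffs: a grid of quantiles of the original marker values.
candidates <- unique(unname(quantile(d$age, probs = seq(0.2, 0.8, by = 0.05))))

# Repeat the "optimal" cutoff search on bootstrap resamples.
set.seed(1)
boot_cuts <- replicate(100, {
  idx <- sample(nrow(d), replace = TRUE)
  best_cutoff(d[idx, ], candidates)
})

# How often was each cutoff chosen?
table(boot_cuts)
```

If the chosen cutoffs are spread widely across the resamples, that is a sign the single "optimal" value from the full data set is not stable.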

Second, a flexible spline model is often the best way to incorporate a continuous predictor into a model when there is no theoretical basis for some fixed functional form. The pspline() function in the R survival package and the rcs() function in the rms package provide different ways to do that.

Such modeling lets the data tell you the functional form of how a continuous predictor is associated with outcome. For example, none of the models using AST as a continuous predictor show it to be significantly associated with outcome, but it's treated in those models as linearly associated with log-hazard. If it has a true association with outcome that is more complicated, you could miss it unless you go beyond the linear fit. A spline fit can demonstrate the actual form of its relationship to outcome.
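As a rough illustration of the difference between a linear and a spline fit, here is a sketch using pspline() from the survival package. The `lung` data set and its `age` and `sex` columns are stand-ins chosen for this example, not the data from the question, and `df = 4` is an arbitrary choice of flexibility.

```r
library(survival)

# Stand-in data (assumption): replace with your own data frame and columns.
d <- na.omit(lung[, c("time", "status", "age", "sex")])

# Linear in the log-hazard: one coefficient for age.
fit_linear <- coxph(Surv(time, status) ~ age + sex, data = d)

# Penalized spline for age: the data decide the shape of the relationship.
fit_spline <- coxph(Surv(time, status) ~ pspline(age, df = 4) + sex, data = d)

# The printed table splits the spline term into linear and nonlinear parts.
print(fit_spline)

# Plot the estimated contribution of age to the log-hazard with standard errors.
termplot(fit_spline, terms = 1, se = TRUE, rug = TRUE)
```

The plot is usually the most informative part: a roughly straight line suggests the linear fit was adequate, while clear curvature suggests it was not.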

Third, you seem to be starting with a bottom-up approach to modeling, starting with a single predictor of interest (AST) and then adding other predictors (age, sex, BMI). Regression model building better starts with an overview of the data and all the predictors, more of a top-down approach. With survival modeling, look at the number of events as a guide to how complex a model you can build. Then use your knowledge of the subject matter and things like associations among the variables (without considering their associations with outcome in your data) to come up with a set of predictors of a scale consonant with the scale of your data. For survival modeling without overfitting, that's usually about 1 predictor (including interaction terms, extra coefficients for spline fits, etc) per 15 events.

With 425 events in your data set, you might be able to develop a much more complex model. You could use flexible continuous fits not just for AST but also for age and BMI. You could include interactions of those continuous predictors with sex, interactions among the continuous predictors, and additional predictors associated with outcome.
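The arithmetic behind that statement, as a small sketch. The variable names are made up for illustration, and the 15 events per coefficient is the rule of thumb quoted above, not a hard limit.

```r
n_events <- 425        # number of events mentioned above
events_per_coef <- 15  # rule-of-thumb events needed per estimated coefficient

# Approximate number of coefficients the data can support.
max_coef <- floor(n_events / events_per_coef)
max_coef  # 28

# With your own data you could count events directly, e.g.
# sum(df$Arrythmia == 1), assuming 1 codes an event (an assumption here).
```

Every spline basis coefficient and every interaction term counts against that budget, not just each named predictor.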

Including as many predictors associated with outcome as possible, without overfitting, is generally a good strategy particularly in survival modeling. In survival modeling, as with logistic regression, omitting any predictor associated with outcome runs a risk of biasing the coefficients of included predictors toward lower than their true magnitudes.
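A small simulation can illustrate that attenuation. Everything here is made up for the illustration: two independent standard-normal predictors `x1` and `x2`, exponential event times with log-hazard `x1 + x2`, and no censoring.

```r
library(survival)

set.seed(42)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)                      # independent of x1, so not a confounder

# Exponential event times with hazard exp(x1 + x2); every subject has an event.
time <- rexp(n, rate = exp(x1 + x2))
status <- rep(1, n)

full    <- coxph(Surv(time, status) ~ x1 + x2)
reduced <- coxph(Surv(time, status) ~ x1)    # x2 omitted

b_full    <- unname(coef(full)["x1"])
b_reduced <- unname(coef(reduced)["x1"])
c(full = b_full, reduced = b_reduced)
```

Even though `x2` is unrelated to `x1`, leaving it out pulls the estimated coefficient for `x1` toward zero, which differs from what happens in ordinary linear regression.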

Finally, it's important to document the discrimination and calibration of your model. The concordance index is one measure of discrimination between case outcomes; it's the fraction of pairs of comparable cases in which the model-predicted and observed order of events agrees. (As noted in another answer, that's not very good for your models thus far, as 0.5 is what you get just by chance.) Calibration shows how well things like predicted and observed event probabilities agree over the range of the data. The Harrell reference and the rms package provide tools for evaluating calibration and additional measures of discrimination.
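For the discrimination part, a minimal sketch of how to pull the concordance index out of a fitted Cox model. The `lung` data set and the chosen predictors are stand-ins for this example only. Calibration is not shown here; the rms package provides tools for that.

```r
library(survival)

# Stand-in data (assumption): replace with your own data frame and model.
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)

# Concordance index and its standard error, as stored in the model summary.
c_index <- summary(fit)$concordance
c_index
```

Note that the concordance computed on the same data used to fit the model tends to be optimistic, so some form of resampling-based validation is usually worthwhile.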

*Given the strong association of age with outcome in the final model, I'm particularly concerned that your 2 AST groups simply differ in average age, so that your AST categories are just a proxy for age.

score 1 · Answer 2 · answered Apr 20 '21 at 15:51


First of all, you should know that the Cox model fits a hazard function h(X, t); that is, it tries to predict the hazard h based on the covariates X at time t. To evaluate the predictive capacity of the model, the concordance index is used. Your concordance index is close to 0.5, which is pretty bad (the concordance index must be at least 0.6).

The pseudo R-squared is also used, and it's not shown in the summary, which indicates that it does not make sense to show it, so I assume that it must be very bad too (for Cox models it is enough that the R^2 is greater than 0.3; yours should be below 0.1, if I'm not mistaken). You can see the pseudo R-squared with summary(…)$rsq.

More important than the above, you must test the proportional hazards assumption. You can do this using the cox.zph() function. The p-values of all variables must be greater than 0.05 for the model to be valid.
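A minimal sketch of that check, using the `lung` data set from the survival package as a stand-in for the data in the question. Small p-values in the output point to a violation of proportional hazards for that term.

```r
library(survival)

# Stand-in data (assumption): replace with your own data frame and model.
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)

# Test of proportional hazards: one row per term plus a GLOBAL row.
zph <- cox.zph(fit)
print(zph)

# Scaled Schoenfeld residuals over time for the first term; a flat
# smooth curve is consistent with proportional hazards.
plot(zph, var = 1)
```

The residual plots are worth a look even when the tests are not significant, since they show the direction and timing of any departure.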

Sorry for the bad news.


Marcos Pérez


I have substituted AST with another blood biomarker that has rsq = 0.01023002 and maxrsq = 0.82335777 (which one of these is valid?). The p-values from cox.zph are ALT (the new marker) = 0.13, sex = 0.57, age = 0.44, bmi = 0.72, GLOBAL = 0.54. This looks good, doesn't it? I would also be interested in how you interpret the first density plot. Thanks – Lili Apr 20 '21 at 16:10
The p-values from cox.zph are good. The density plot is a usual density plot comparing two populations. The **log-rank test** compares the hazard functions of two populations, and since the p-value of the test is less than 0.05 (0.0094), there is evidence that the hazard functions of the two populations differ. So it's a good partition. You should use the variable `AST_cat` in your model. – Marcos Pérez Apr 20 '21 at 17:06
This is so helpful, thank you! I have three more questions: (1) What does the graph with the dots and cutpoint below the density mean? (2) What is the difference between the first plot with the two lines and the second with just one line? Does this mean that the first is univariate and the second is multivariate? (3) Regarding the R-squared in my previous comment, rsq = 0.01023002 and maxrsq = 0.82335777, which one of these is valid? Thank you so much!! – Lili Apr 21 '21 at 08:27
You are welcome. In response to your questions: (1) Every point in that graph is a statistic resulting from one standardized log-rank test. We select the largest to determine the optimal cut-off point, in your case 25.4. (2) The first plot compares the Kaplan-Meier estimates of two populations, and includes a log-rank test to compare the hazard functions. The second one is just a plot of the Kaplan-Meier estimate for the whole population. (3) The first one is your R-squared and the second one is the maximum possible R-squared. – Marcos Pérez Apr 23 '21 at 17:03
Thank you! understood! – Lili Apr 26 '21 at 09:21
