
I'm using survminer to try to create a survival formula for a phenotype data set.

library("survival", "survminer")
data <- read.csv(file = "Study.cleaned.csv", header = TRUE)
basic.cox <- coxph(formula = Surv(Observation_time, Event_Detected) ~ Patient_Age + Sex+BMI, data = data)
basic.assump <- cox.zph(basic.cox)
sink('/home/varieties/variables.tsv')
print(basic.assump)
sink()

This produces a file that looks like this:

            rho  chisq      p
A       0.00477 0.0987 0.7534
B      -0.02352 2.5383 0.1111
C       0.02513 3.2104 0.0732
GLOBAL       NA 5.6162 0.1319

I have tried about 2,000 different combinations of the factors A through L in the model formula, e.g. A + B + C, A + B + D + E, and so on, but none of the resulting summaries have all p-values >= 0.05, which is a requirement of the study.

Also, sometimes factors will split, and I don't know why. The output file then looks like this:

C<100       0.207766 9.59e-09 0.999922
C<50        0.188145 2.07e-08 0.999885
C<69        0.124370 9.49e-09 0.999922
C>325       0.138520 1.05e-08 0.999918
C100        0.004437 9.01e-02 0.764040
C100.0      0.189391 1.14e-08 0.999915
C101       -0.012151 6.76e-01 0.410877...

Why do some factors split like C in the example above? Is there any guidance on determining the model formula?


1 Answer


The cox.zph() function* has a terms argument. If that is set to TRUE, all levels of a categorical factor variable are tested together as a single term; if set to FALSE, you get a separate test for each level versus the reference level. The default is now TRUE, but my sense is that it might once have been FALSE. See whether that explains the splitting of your factor variables.
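
For illustration, assuming survival version 3.0 or later (where the terms argument is available), you can compare the two settings directly on the basic.cox model fit shown in the question:

library(survival)

# one row per model term: the levels of a factor are pooled into a single test (current default)
cox.zph(basic.cox, terms = TRUE)

# one row per coefficient: each dummy-coded factor level is tested separately
cox.zph(basic.cox, terms = FALSE)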

The bigger problem is that your attempt to solve the proportional hazards (PH) problem by adding combinations of predictor variables is unlikely to work. It's more likely that there are problems with the model assumptions that need to be dealt with.

One possible problem is mis-specification of the model. For example, if the form of the association between a predictor and outcome is incorrectly specified (that is, the log hazard is not linearly related to the predictor value), then that can show up as a violation of PH. See this page for an example. Log transformations of things like gene-expression values, or flexible modeling of continuous predictors with restricted cubic splines, might improve both the model and the PH results.
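
As a rough sketch of what flexible modeling could look like with the variables from your model (whether splines are appropriate for Patient_Age and BMI is an assumption about your data):

library(survival)
library(splines)  # ns() provides natural (restricted) cubic splines

# refit with flexible terms for the continuous predictors, then re-check PH
flex.cox <- coxph(Surv(Observation_time, Event_Detected) ~ ns(Patient_Age, df = 4) + Sex + ns(BMI, df = 4), data = data)
cox.zph(flex.cox)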

There's no a priori reason why PH must hold. Another approach is to use a parametric survival model that does not assume PH, like a log-normal model. That replaces a PH test with corresponding tests of parametric model adequacy.
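
A minimal sketch with survreg() from the survival package, reusing the variables from your model:

library(survival)

# accelerated failure time model with a log-normal distribution; no PH assumption is made
lognorm.fit <- survreg(Surv(Observation_time, Event_Detected) ~ Patient_Age + Sex + BMI, data = data, dist = "lognormal")
summary(lognorm.fit)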

It's also possible that you truly have time-dependent hazards associated with certain phenotypic variables and that you will need to deal with them directly. For example, the association of some variables with outcome might be very important at early times but insignificant at later times. The survival package vignette called "Using Time Dependent Covariates" goes into ways to deal with time-dependent hazard ratios in Section 4.
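
One sketch of that approach uses the time-transform facility of coxph(); which covariate gets a time-varying coefficient, and the log(t) form of the interaction, are assumptions made only for illustration:

library(survival)

# let the BMI coefficient change with log(time); the tt(BMI) term estimates that drift
td.cox <- coxph(Surv(Observation_time, Event_Detected) ~ Patient_Age + Sex + BMI + tt(BMI), data = data, tt = function(x, t, ...) x * log(t))
summary(td.cox)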

It's also possible, particularly with large data sets, that you may get a "statistically significant" violation of PH that is of minimal practical importance. Look at the curves showing the residuals versus time provided by cox.zph() and apply your knowledge of the subject matter. Without proportional hazards a hazard ratio is a type of rough average, characterized in this answer as "a ratio of average hazards," so some deviations from the ideal PH might be acceptable.
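
For example, with the basic.assump object from the question (ggcoxzph() is survminer's version of the same diagnostic plot):

library(survival)
library(survminer)

plot(basic.assump)      # base-graphics panels of scaled Schoenfeld residuals versus time
ggcoxzph(basic.assump)  # ggplot2 version of the same plots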


*You evidently used cox.zph() in a version of survival prior to 3.0, as your output includes a rho column, the correlation of the scaled Schoenfeld residuals with (transformed) time. In version 3.0 and later, the test is the score test that the earlier correlation test approximated. You will get different results with later versions of the package, so be warned.
