1

I have a dataset of patient survival times that I am trying to predict based on n covariates X1...Xn (patient co-morbidities, other diagnoses, lab results) using a Cox proportional hazard model. I'm using the survival and survminer libraries in R to implement the model.

I have

test.cox = coxph(Surv(Elapsed_Time,outcome) ~ Hypertension + Diabetes + Age + Gender, data = StageMat)

where outcome = 1,0 for death/survival, I get for coefficients (making some numbers up) 10 hypertension, 8 diabetes, 0.1 Age, 0.8 GenderMale, i.e. b1 = 10, b2 = 8, b3 = 0.1, b4 = 0.8. This corresponds to a hazard model

h(t) = h0(t)exp[b1*x1 + b2*x2 + b3*x3 + b4*x4]

where h0(t) is an (unknown) baseline hazard function assuming mean values for each covariate. If I then use these estimated values to fit new data, I implement

test.fit = survfit(test.cox, newdata = eval_data[1,])

My question is, what is the closed form expression that R uses to calculate time-dependent probabilities in survfit given that h0(t) is an unknown? There is an approximation based on the mean of the hazard function (or the hazard function of the mean) that I can't seem to get at.

The general form of a survival function is

 S(T) = Exp[- Integral_0_T h(t)]

where Integral_0_T h(t) is the definite integral of h(t) from 0 to T

I would like to write down a closed form exponential function for S(T) given b1...b4 from my model, but because h0(t) is unknown, I'm not sure what this expression would be.

For a proportional hazard model, S(T) can be rewritten as

S(T) = S0(T)^Exp[Sum_i bi*Xi]

but as for h0(t), the baseline S0(T) is unknown. I've seen some papers where S0(t) is replaced by the mean value of S(T) in the dataset to get a closed-form expression, but I don't think this is generally correct because the mean of a function is not necessarily equal to the function of the mean.

Any clarification of how to deal with baseline estimates to get a closed-form expression for the survival (and hazard) functions would help.

Max
  • 123
  • 4

1 Answers1

1

The (Aalen-Breslow) estimator of the baseline cumulative hazard is

$$\hat\Lambda_0(t)=\sum_{s\leq t} \frac{dN(s)}{\sum_{i}Y_i(s)\exp(Z_i(s)\beta)}$$ where $dN(s)$ is the number of deaths at time $s$, $Y_i(s)$ is 1 if individual $s$ is under observation at time $s$, and $Z_i(s)$ is the covariate vector for individual $i$ at time $s$. The sum is over times that an event occurs, because $dN$ is zero at other times

Heuristically (according to Breslow), it comes from Aalen's cumulative hazard estimator $$\hat\Lambda(t)=\sum_{s\leq t} \frac{dN(s)}{\sum_{i}Y_i(s)}$$ but in the Cox model each individual contributes $e^{Z\beta}$ rather than 1 to the denominator.

In fact, what's used in the computation is the increments of $\hat\Lambda$ $$d\hat\Lambda_0(t)=\frac{dN(s)}{\sum_{i}Y_i(s)\exp(Z_i(s)\beta)}$$

The survival for individuals is estimated by first computing the cumulative hazard for individuals $$\Lambda(t;z)=\sum_{s\leq t} \exp(Z_i(s)\beta)\, d\hat\Lambda_0(t)$$ and then using $S(t)=e^{-\Lambda(t)}$.

Finally, if you want the predicted survival curve for a cohort, you compute $S(t;z)$ for each individual and then average them. As you note, this is not the same as the survival curve for an average individual, though in practice the difference is typically not all that big.

Thomas Lumley
  • 21,784
  • 1
  • 22
  • 73
  • Question: is the discretized version of the above completely equivalent, i.e. if I have discrete time s (months, for example) and in place of continuous function N(s) and its differential dN(s), I have n(s) = n(s1)....n(s_max)? – Max May 05 '21 at 16:36
  • Additionally, do you happen to know if the basehaz() function in R uses the estimator that you describe above? Thank you again for the detailed response. – Max May 05 '21 at 17:54
  • I believe that tied event times due to coarsening of time still lead to the same formula for the baseline hazard (it's just the partial likelihood that has options for how to handle ties). The `basehaz` function just calls `survfit()`, so it does the same as `survfit` – Thomas Lumley May 05 '21 at 22:28
  • Does R then have a function in the survival library that computes the baseline hazard function? – Max May 19 '21 at 14:42
  • According to the documentation, basehaz() with center=TRUE calculates the hazard function for mean covariate values, while center = FALSE for covariates = 0. – Max May 19 '21 at 17:04
  • (The documentation also explains why `center=TRUE` is the one you want, even if you don't think you do.) – Thomas Lumley May 19 '21 at 22:14