3

I have some survival data from different cohorts and can fit Kaplan Meier curves and cox proportional hazard models. I want to be able to simulate data that resembles the data found in say the KM curves. What I would like to know is there any way to simulate data from the KM curves or the cox proportional hazard model?

RustyStatistician
  • 1,709
  • 3
  • 13
  • 35

2 Answers2

4

Certainly.

Remember that the KM curves are estimates of the survival function, ie $\hat S_{km}(t)$. As such, you can simulate draws from this distribution by using the inverse survival function, i.e.

$u_i \sim $ uniform(0,1)

$T_i = S_{km}^{-1}(u_i)$

Then $T_i$ will be distributed according the distribution in the KM curve. Note that the KM curve will be a discrete distribution, while the original data generating process may be continuous. Also, this does not take into account the uncertainty in the KM curves.

Here's how one could do this in R:

library(survival)
y <- rexp(100)
isCen <- y > 2
y[isCen] <- 2
fit <- survfit(Surv(y, !isCen) ~ 1 )
quantile(fit, probs = runif(1) )$quantile

Very technically speaking, a Cox-PH model does not necessarily include an estimated baseline survival distribution, which is required to sample from. However, this can be easily remedied by using the KM curves as the baseline survival estimate. Of course, the following method can be used for any proportional hazards model and baseline survival estimate.

From here, we use the relation that if $\eta_i = X^T \beta$ (i.e. the linear predictor for subject i), the estimated proportional hazards survival curve will be $S(t | \eta_i)$ = $S_o(t)^{\exp(\eta_i)}$ where $S_o(t)$ is the baseline survival curve.

Being extra clever, we can then see that

$S^{-1}(u | \eta_i) = S_o^{-1}(u^{\exp(-\eta_i)})$

So given $\hat S_o^{-1}$ and $\eta_i$, we can take draws via

$u_i \sim$ uniform(0,1)

$T_i = \hat S_o^{-1}(u^{\exp(-\eta_i)})$

Cliff AB
  • 17,741
  • 1
  • 39
  • 84
1

@CliffAB has provided a very good answer (+1). Let me add a couple of additional possibilities:

  1. You can try various parametric survival models to see if you can find a distribution that is a reasonable fit. Then you can sample from that distribution using standard techniques. (For a simple example of generating censored survival data from a Weibull distribution, see my answer here: How to simulate censored data.)
  2. You can also bootstrap your data, which will take your sample distributions as estimates of the population distributions. This will be more informative if your samples are larger and better behaved. (The results here should be essentially the same as @CliffAB's strategy. That is, this gives you another way of understanding the strategy he suggested.)
gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
  • For the record, I actually advise your suggestion (1), *provided one can find a parametric distribution that agrees well with the KM curves* over directly sampling from the KM curves. This is mainly due to the issues of the KM curves being discrete, where as we often think of the generating process being continuous. – Cliff AB Mar 04 '16 at 20:40
  • Smooth fits are likely to be just as good or better than K-M estimates if model assumptions are reasonably well satisfied. I wouldn't worry so much about imitating a step function. When only given KM curves (say from published papers) I digitize the curves and fit a regression spline to the log cumulative hazard (- log KM estimates) for simulation. – Frank Harrell Mar 04 '16 at 20:45
  • @CliffAB, did you suggest that? I'm not seeing it. I can delete this if it is just duplicative. – gung - Reinstate Monica Mar 04 '16 at 20:52
  • 1
    @FrankHarrell, I was thinking along those lines, but it isn't fully worked out in my head. (I was thinking of a lowess fit to the survival curve & using that analogously to CliffAB's answer.) Why not write that up as an official answer? – gung - Reinstate Monica Mar 04 '16 at 20:55
  • @gung: I wasn't claiming that I had said that, but rather endorsing your suggestion over my own; my answer was directly answering the question, but not necessarily the best approach. – Cliff AB Mar 04 '16 at 21:04
  • 1
    Frank Harrell's suggestion is also quite good as well. On the other hand, fitting a lowess to the survival curve can be a little complicated; you need to assure that it is constrained to [0,1], non-decreasing, etc. These types of issues also exist for cumulative hazard fits, but more work (and available code) exists for smoothly fitting the cumulative hazards. – Cliff AB Mar 04 '16 at 21:05
  • @CliffAB, yeah I knew my idea was half-baked. I just had it as an intuition that needed more development to be workable. – gung - Reinstate Monica Mar 04 '16 at 21:12
  • Thank you for the wonderful answer. In regards to option 1) is it reasonable to fit a cox PH model to the data at hand, and then use my estimates of the coefficients from my cox PH to follow the idea proposed by @CliffAB in his post of 2)? – RustyStatistician Mar 04 '16 at 21:43
  • @RustyStatistician, yes Cliff's answer is good. He discusses the possibility of using a Cox model. – gung - Reinstate Monica Mar 04 '16 at 21:46
  • @gung Thank you for the clarification, I guess I was more so confused if that was exactly what he was suggesting (which clearly he was). In regards to your point 2) about bootstrapping. If I have 4 groups, so 4 KM curves, would be reasonable to gather bootstrap samples where each group is sampled with a different proportion? For example, if I had Group A, B, and C would it be ok to bootstrap sample, say, 10% from group A, 30% from group B, and 60% from group C? As in does weighted bootstrap sampling exist? – RustyStatistician Mar 04 '16 at 22:00
  • @RustyStatistician, that should probably be a new question. I'm not aware of that. The standard bootstrap is to draw samples (w/ replacement) of the same size as the original. – gung - Reinstate Monica Mar 04 '16 at 22:06
  • @gung One last question. When bootstrapping I understand that it is best to take a large number of bootstrap samples, but is there a rule of thumb for the size of each bootstrap sample. As in, if my total sample size is 287, should each bootstrap sample size be of size 287 or larger or.... – RustyStatistician Mar 04 '16 at 22:20
  • @RustyStatistician, if N = 287, then each bootsample has size 287. – gung - Reinstate Monica Mar 04 '16 at 22:24