I am attempting to model individuals in the Rossi dataset. The output of predict_survival_function is a DataFrame whose index is the time point and whose columns each hold one individual's survival probability. If I plot this, I assume I get each individual's survival probability over time.

When I predict the survival function for individuals in the same dataset for whom an event occurred, the curve does not drop to 0 at the time point they 'survived until'. Here is an example of what I mean:

from lifelines.datasets import load_rossi
rossi = load_rossi()
rossi.head()

index   week    arrest  fin age race    wexp    mar paro    prio
0       20      1       0   27  1       0       0   1       3
1       17      1       0   18  1       0       0   1       8
2       25      1       0   19  0       1       0   1       13
3       52      0       1   23  1       1       1   1       1
4       52      0       0   19  0       1       0   1       3

We can see that individuals 0, 1, and 2 were arrested at weeks 20, 17, and 25, respectively.

I then fit the model:

from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(rossi, duration_col='week', event_col='arrest')

Then I predict the survival function for the same dataset and plot the curves for the first few individuals:

import matplotlib.pyplot as plt
survival_func_plot = cph.predict_survival_function(rossi)
plt.plot(survival_func_plot.loc[:, 0:5])
plt.legend(labels=survival_func_plot.loc[:, 0:5].columns, loc ="lower left")

I get a plot that does not seem to reflect the fact that individuals 0, 1, and 2 have actually been arrested. Even after week 30, every plotted curve is still above 60% survival.

What could explain this? As a second question, can this survival probability be interpreted as the risk a person is at to be arrested? As far as I can tell, this function should tell us the survival probability for this individual at time point t. I apologize if this question is basic; I am new to this area. Thank you.

Kevin

1 Answer

Second question first:

> can this survival probability be interpreted as the risk a person is at to be arrested?

The survival function $S(t)$ represents the probability that the event has not yet happened by time $t$. Standard survival models assume that all individuals will ultimately experience the event (arrest, in this case); the question is how soon the event will occur. The instantaneous risk (hazard) at time $t$, given that one has survived until $t$, is the negative of the ratio of the slope of the survival curve to its value at time $t$.
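In symbols, the relationship just described can be written as follows. This is a sketch that assumes $S(t)$ is differentiable with $S(0) = 1$; the names $h(t)$ for the hazard and $H(t)$ for the cumulative hazard are notation introduced here, not taken from the question.

$$
h(t) = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t),
\qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right) = \exp\bigl(-H(t)\bigr).
$$

Under these assumptions, $1 - S(t)$ is the cumulative probability of arrest by time $t$, whereas $h(t)$ is a rate that is conditional on not having been arrested before $t$.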

> I get a plot that does not seem to reflect the fact that individuals 0, 1, and 2 have actually been arrested... What could be the possible explanation?

Survival predictions can be tricky at first. A predicted survival curve can be thought of as representing what would happen if you had a large cohort of individuals sharing the same set of covariate values. What fraction of those individuals would not yet have experienced the event by the time $t$?
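The cohort interpretation can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a constant weekly hazard (so event times are exponential), a made-up hazard value that is not estimated from the Rossi data, and a hypothetical helper named `simulate_survival_fraction`.

```python
import math
import random


def simulate_survival_fraction(hazard, times, cohort_size=100_000, seed=0):
    """Fraction of a simulated cohort that is still event-free at each time."""
    rng = random.Random(seed)
    # Assumption: constant hazard, so each event time is exponentially distributed.
    event_times = [rng.expovariate(hazard) for _ in range(cohort_size)]
    return [
        sum(event_time > t for event_time in event_times) / cohort_size
        for t in times
    ]


weekly_hazard = 0.006  # hypothetical value, NOT estimated from the Rossi data
weeks = [10, 20, 30, 52]
simulated = simulate_survival_fraction(weekly_hazard, weeks)

for week, fraction in zip(weeks, simulated):
    exact = math.exp(-weekly_hazard * week)  # S(t) = exp(-hazard * t)
    print(f"week {week}: simulated {fraction:.3f}, exact {exact:.3f}")
```

Every member of this simulated cohort shares the same hazard, and some of them experience the event before week 20, yet the curve at week 20 is the fraction still event-free, which stays well above 0. The same logic applies to a predicted curve for one set of covariate values.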

In the model, the early event times of the 3 cases that you note were presumably counterbalanced by other cases having similar covariate values but longer "survival" without an arrest, perhaps even beyond the 52 weeks at which follow-up seems to have stopped. The predicted curve puts together information from all cases, including those "censored" at last follow-up.
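To see how censored cases hold a curve up, here is a minimal sketch that uses a hand-rolled Kaplan-Meier estimate rather than the Cox model from the question. The ten-person cohort is hypothetical (three arrests at the weeks mentioned above plus seven cases censored at week 52), and `kaplan_meier` is an illustrative helper, not a lifelines function.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time (Kaplan-Meier)."""
    curve = []
    survival = 1.0
    event_times = sorted({d for d, e in zip(durations, events) if e == 1})
    for t in event_times:
        # Everyone whose follow-up lasted at least until t is still at risk,
        # including cases that are censored later.
        at_risk = sum(d >= t for d in durations)
        n_events = sum(d == t and e == 1 for d, e in zip(durations, events))
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical cohort: 1 = arrested, 0 = censored at the end of follow-up.
durations = [17, 20, 25, 52, 52, 52, 52, 52, 52, 52]
events = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

km_curve = kaplan_meier(durations, events)
for week, survival in km_curve:
    print(f"week {week}: estimated survival {survival:.2f}")
# week 17: estimated survival 0.90
# week 20: estimated survival 0.80
# week 25: estimated survival 0.70
```

Even though three individuals were arrested, the estimated curve ends at 0.70 because the seven censored cases remain in the at-risk set throughout. A Cox model additionally adjusts for covariates, but it pools information across cases in a similar way.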

EdM