I am attempting to model individuals in the Rossi dataset. The output of predict_survival_function is a DataFrame whose index is the time point and whose columns hold each individual's survival probability. If I plot this, I assume I get each individual's survival probability over time.
When I predict the survival function on the same dataset, the curves for individuals who actually experienced the event do not drop to 0 at the time point they 'survived until'. Let me illustrate exactly what I am trying to explain:
from lifelines.datasets import load_rossi
rossi = load_rossi()
rossi.head()
index  week  arrest  fin  age  race  wexp  mar  paro  prio
0      20    1       0    27   1     0     0    1     3
1      17    1       0    18   1     0     0    1     8
2      25    1       0    19   0     1     0    1     13
3      52    0       1    23   1     1     1    1     1
4      52    0       0    19   0     1     0    1     3
We can see that individuals 0, 1, and 2 were arrested at weeks 20, 17, and 25, respectively.
I then fit the model:
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(rossi, duration_col='week', event_col='arrest')
Then I predict the survival function on the same dataset:
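import matplotlib.pyplot as plt
# Rows are time points (weeks); columns are the individuals' index labels.
survival_func_plot = cph.predict_survival_function(rossi)
# .loc slices by label and is inclusive, so this selects individuals 0 through 5.
first_six = survival_func_plot.loc[:, 0:5]
plt.plot(first_six)
plt.legend(labels=first_six.columns, loc="lower left")
plt.show()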
I get a plot that does not seem to reflect the fact that individuals 0, 1, and 2 were actually arrested. Looking at the plot, after week 30 everyone is still above 60% survival.
What could be the explanation? As a second question, can this survival probability be interpreted as the risk that a person will be arrested? As far as I can tell, this function should tell us the survival probability for an individual at time point t. I apologize if this question is basic; I am new to this area. Thank you.