How to choose a model for survival regression when data does not fit assumptions?

Question

I am trying to perform survival regression (prediction) on a dataset of lifetimes, which is highly concentrated around 1, with a significant right skew. The below photo is how it looks when log-transformed.

I tested a Cox PH model, and several AFT models: Weibull, Log-Logistic, and Log-Normal. Of the 4, the Log-Normal model gives the best log-likelihood and concordance score (about 0.59), but I'm concerned that it does not fit assumptions.

By "does not fit assumptions," I mean that I'm not sure that this distribution qualifies as lognormal. The transformed data in the photo is not especially close to normality and still shows a significant right skew. I also don't know how to check the residuals for this sort of model, as it is my first time doing survival analysis and I'm at the mercy of what the lifelines package in Python has available. As far as I can tell, checking assumptions is not implemented for AFT models.

Inferences about variables are not important to us, but the predictions produced by the model will be used in a business context, so I'm not sure how to proceed. I'd be grateful for an answer to any of the following:

How do I finish checking the assumptions for this AFT model using Python?
If the assumptions are not met, are the predictions completely useless for a business context, despite a concordance of 0.59? (Typical range is 0.5 to 0.7 from what I've read.)
If so, what should I do then?
Should I just reformulate the problem into a different one? (Just now I fit a classifier to predict whether an observation's lifetime would be >= 6, excluding the censored data; sadly it only achieves AUC-ROC of 0.55.)

Maybe use a different survival analysis model? E.g., a logit hazard, probit hazard, or complementary log-log discrete-time event history model? — Alexis, Jun 07 '19 at 19:00
Thanks for the recommendation! I actually wouldn't know how to implement those in Python. At the moment I'm restricted to whatever is available in `lifelines`, which includes only the four I listed. — conveniencesample, Jun 07 '19 at 19:08

score 3 · Accepted Answer · answered Jun 07 '19 at 20:29

Generally, it's useless to plot the distribution of your outcome and hope to learn which distribution to use. This is especially so for survival analysis, because some values are censored, so the empirical distribution of the observed durations != the actual outcome distribution. Furthermore, most models do not assume that Y has any particular form, but that Y | X has some form. That can only be checked after fitting the regression.
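To make the censoring point concrete, here is a small simulation sketch using only the standard library. Everything in it is an assumption for illustration: the log-normal lifetimes, the exponential censoring times, the sample size, and the helper name `kaplan_meier_median`. It compares the median of the uncensored rows alone with a Kaplan-Meier estimate that accounts for censoring.

```python
import random
import statistics


def kaplan_meier_median(durations, observed):
    """Kaplan-Meier estimate of the median survival time.

    durations: time to event or to censoring, whichever came first
    observed:  True if the event was seen, False if the row was censored
    """
    # Sort by time; at tied times, handle events before censored rows.
    rows = sorted(zip(durations, observed), key=lambda r: (r[0], not r[1]))
    at_risk = len(rows)
    survival = 1.0
    for t, event in rows:
        if event:
            survival *= 1.0 - 1.0 / at_risk  # one event among those still at risk
        at_risk -= 1
        if survival <= 0.5:
            return t
    return float("inf")  # the curve never drops to 0.5


rng = random.Random(0)
n = 5000

# Assumed data-generating process: log-normal lifetimes, independent censoring.
true_times = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
censor_times = [rng.expovariate(0.5) for _ in range(n)]

durations = [min(t, c) for t, c in zip(true_times, censor_times)]
observed = [t <= c for t, c in zip(true_times, censor_times)]

true_median = statistics.median(true_times)
# Naive approach: throw away censored rows and summarise what is left.
naive_median = statistics.median([d for d, e in zip(durations, observed) if e])
km_median = kaplan_meier_median(durations, observed)

print(f"true median lifetime:        {true_median:.3f}")
print(f"median of uncensored rows:   {naive_median:.3f}")
print(f"Kaplan-Meier median:         {km_median:.3f}")
```

Long lifetimes are the ones most likely to be censored, so the uncensored rows over-represent short lifetimes. A histogram of them (log-transformed or not) therefore describes a different distribution from the one the model is trying to capture.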

(Here's another QA about this: How to choose "family" in Generalized Additive Model (GAM)).

Specifically to your problem, if your goal is prediction, then the assumptions don't matter. All you care about is building the best $F$ s.t. $F(x)$ is as close to $y$ as possible. So your strategy of trying all the models and picking the one with the highest score¹ is fine. Now, fixing a violated assumption can improve prediction (e.g., transforming a term into a log-term so that it satisfies a linearity assumption), but tweaks like this fall under feature engineering more generally.

¹ Of course, you should cross-validate these models; see some new stuff in lifelines that makes this easier: compatibility with sklearn

Hi Cam, thank you so much for clearing this up! (And for the lifelines package - it's wonderfully easy to use for a beginner.) I did make sure to cross-validate and it performed similarly well across all the folds, and I'll be sure to keep that in mind re: feature engineering. Thanks again! — conveniencesample, Jun 07 '19 at 21:17
