I have data that consists of lab test results for patients, cancer diagnosis dates for those who got cancer, and time to death or censoring from that cancer. This cancer is both rare (most of the patients never get it) and mostly non-lethal (even among the patients who do get it, most live with it for many years).
I would like to predict a risk score for patients before they have cancer, one that answers the question: "according to my lab tests, what is my risk of getting a lethal cancer?".
I thought of two ways to get the most out of my data while using all populations (lethal cancer, non-lethal cancer, and healthy):
- Survival analysis (training a model such as a random survival forest). For the cancer patients I would enter their true death/censoring time, and for healthy patients I would enter a censoring event at the maximum follow-up time.
- Regression (training a model such as xgboost). The label would be an ordinal risk level: healthy = 0, death-from-cancer-after-more-than-3-years = 1, death-from-cancer-within-0-3-years = 2. This method would require estimating time-to-death for the censored cancer patients, e.g. by imputing the mean death time of the cancer patients with a recorded death time.
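To make the two label schemes concrete, here is a minimal sketch in plain numpy. The toy cohort, the column names (`cancer`, `event`, `time`), and the 10-year maximum follow-up are my own illustrative assumptions, not from your data:

```python
import numpy as np

# Toy cohort: one row per patient. time = years from diagnosis to
# death/censoring (NaN for healthy patients); event = 1 if death from
# the cancer was observed. These values are made up for illustration.
MAX_FOLLOW_UP = 10.0  # assumed maximum follow-up, in years

cancer = np.array([0, 0, 1, 1, 1, 1])
event  = np.array([0, 0, 1, 1, 0, 0])
time   = np.array([np.nan, np.nan, 2.0, 5.0, 4.0, 8.0])

# --- Option 1: survival labels (e.g. for a random survival forest) ---
# Healthy patients are treated as censored at the maximum follow-up time.
surv_event = np.where(cancer == 1, event, 0).astype(bool)
surv_time  = np.where(cancer == 1, time, MAX_FOLLOW_UP)

# --- Option 2: ordinal risk labels (e.g. for xgboost regression) ---
# Censored cancer patients get the mean observed death time imputed.
mean_death = time[(cancer == 1) & (event == 1)].mean()   # (2 + 5) / 2 = 3.5
est_time = np.where((cancer == 1) & (event == 0), mean_death, time)
risk = np.zeros(len(cancer), dtype=int)           # healthy = 0
risk[(cancer == 1) & (est_time > 3)] = 1          # death after >3 years
risk[(cancer == 1) & (est_time <= 3)] = 2         # death within 0-3 years

print(surv_time.tolist())   # [10.0, 10.0, 2.0, 5.0, 4.0, 8.0]
print(risk.tolist())        # [0, 0, 2, 1, 1, 1]
```

Note how the two censored cancer patients end up in class 1 purely because the imputed mean (3.5 years) happens to exceed the 3-year cutoff, which is exactly the part of option 2 I am unsure about.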
Which of these two approaches is preferable (if either)? Is it valid at all to mix healthy patients with cancer patients and ask that risk/survival question?
And most importantly: how can I measure which model is better? By using the concordance index? Is it even valid to treat healthy patients as censored at the maximum follow-up time, given that they are substantially different from patients who got cancer but lived with it through the maximum follow-up time?
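On the measurement question: Harrell's concordance index only uses the ordering of comparable pairs (pairs where the patient with the shorter observed time actually died), so it can be computed for any model that outputs a risk score. Here is a hand-rolled sketch of the simplest variant; the times and scores are made up, and libraries such as scikit-survival (`sksurv.metrics.concordance_index_censored`) implement the same idea with proper handling of ties:

```python
import numpy as np

def harrell_c(time, event, risk_score):
    """Harrell's concordance index for right-censored data (simplest variant).

    A pair (i, j) is comparable when the patient with the strictly shorter
    observed time actually died (event == 1). The pair is concordant when
    that patient was also assigned the higher risk score; tied scores
    count as 0.5. Ties in time are ignored here for simplicity.
    """
    n = len(time)
    concordant, comparable = 0.0, 0
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:
                comparable += 1
                if risk_score[i] > risk_score[j]:
                    concordant += 1
                elif risk_score[i] == risk_score[j]:
                    concordant += 0.5
    return concordant / comparable

# A perfectly ordered toy example: higher risk -> earlier observed death.
time  = np.array([2.0, 5.0, 10.0, 10.0])
event = np.array([1,   1,   0,    0])      # last two censored ("healthy")
risk  = np.array([0.9, 0.6, 0.1,  0.2])
print(harrell_c(time, event, risk))        # 1.0
```

Note that the two censored patients never form a comparable pair with each other, which is why the assumption you make about *where* healthy patients are censored directly changes which pairs the c-index gets to see.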