Obtaining R pec survival patient risk percentage

Question

Introduction

I have a 300,000-row cancer dataset with around 60 variables (cancer stage, year of diagnosis, radiation therapy, histology, etc.) with a time variable ("number of months survived") and an event (alive or dead). The last two variables have complete values in the individual records.

Survival months as outcome variable

My initial goal was to create a multilayer perceptron model in WEKA given my data in order to predict the number of months survived for new instances.

- Preprocess data
- Train the model in WEKA
- Assess model's performance (accuracy, specificity, sensitivity)
- Test model for new cancer records

Patient risk as outcome variable

The requirements then changed: the goal became predicting each patient's risk of survival within equally spaced time periods.

Divide data into:
- 24 - 47 months (2 years)
- 48 - 71 months (4 years)
- 72 - 95 months (6 years)
- 96 - 119 months (8 years)
- 120 - "up to what's available" months (10 years)
I will then use the function predictSurvProb from the package pec in R, as suggested in this problem, to obtain individual survival percentages for my aforementioned records. The data will be divided by survival-months bracket, each with its own patient risk prediction, i.e. records that survived within two years will have a predicted patient survival percentage at two years.
Once every record has its survival percentage for its time period, I will use WEKA to create five models, one per time period, each with patient survival as the outcome variable.
The five models can then be used to predict survival for a single record, giving five different patient survival risks.
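
The bracketing step described above can be sketched in base R with cut(). This is a minimal sketch, assuming evenly spaced 24-month cut points (24, 48, 72, 96, 120), which match the time points used in the code further down; the `months` vector and the bracket labels are made up for illustration and are not from the original dataset.

```r
# Hypothetical survival times in months (illustration only)
months <- c(5, 24, 47, 48, 71, 72, 100, 119, 120, 200)

# Left-closed intervals [0,24), [24,48), ..., with an open-ended last bracket
breaks <- c(0, 24, 48, 72, 96, 120, Inf)
labels <- c("<2 years", "2 years", "4 years", "6 years", "8 years", "10 years")

# right = FALSE makes each interval include its lower bound, so 24 falls in "2 years"
bracket <- cut(months, breaks = breaks, labels = labels, right = FALSE)

# Count of records per bracket
table(bracket)
```

Records with fewer than 24 months get their own "<2 years" level here so that no row is silently dropped; whether such records belong in the study is a separate decision.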

Problem

I am still learning R, but I managed to apply the sample code (from the pec documentation for predictSurvProb) to my data as:

library(survival)
library(pec)
library(rms)

# fit a Cox model
coxmodel <- cph(Surv(time,vsr)~1,data=cancer,surv=TRUE) 

# predicted survival probabilities can be extracted at selected time-points:
ttt <- quantile(time)

# for selected predictor values:
ndat <- data.frame(vsr=c(0,1)) # I assumed the event variable is provided here

# as follows
predictSurvProb(coxmodel,newdata=ndat,times=ttt) # has error

## simulate some learning and some validation data
learndat <- SimSurv(100)
valdat <- SimSurv(100)

## use the learning data to fit a Cox model
fitCox <- coxph(Surv(cancer$time,cancer$vsr)~vsr,data=cancer)

## suppose we want to predict the survival probabilities for all patients
## in the validation data at the following time points:
psurv <- predictSurvProb(fitCox,newdata=valdat,times=seq(24,120,by=24))
## This is a matrix with survival probabilities
## one column for each of the 5 time points
## one row for each validation set individual

I need to obtain the patient risk calculation for 300,000 patients but the line

predictSurvProb(coxmodel,newdata=ndat,times=ttt)

shows the error

Error in .subset2(x, i, exact = exact) : subscript out of bounds

How do I solve this error?

Should you not be providing the predictor variables to `predictSurvProb` as `newdata`, rather than `vsr` which is the outcome? — Brendon, Dec 18 '13 at 19:44
(1) One advantage of neural net approaches to survival analysis is that they do not rely on the assumptions that underlie Cox analysis. You can get by with simpler Kaplan-Meier estimates for censored cases, and avoid this complexity. Look at [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813661/) and the references therein for what seems to be something close to what you have in mind. (2) Make sure you have a good handle on the meaning and reliability of the underlying clinical data. Staging and other clinical variables can have different meanings among different types of cancer. — EdM, Dec 18 '13 at 19:47
@Brendon, thank you for the comment. I was at first confused in the docu about the term "predictor variables" but I guess I overthought it. I will provide the 60 variables right? — Saggy Manatee And Swan Folk, Dec 18 '13 at 23:56
@EdM, thank you very much esp. the journal article. Yes, the oncologist told me that as well and he approved my dataset. So I will not need the `predictSurvProb` right after all...but how do I obtain my patient risk variable? I understood that they used KM and an input vector but I'm not familiar with the methodology. Will a [KM method like this](http://stats.stackexchange.com/a/26291/35842) suffice for the problem? — Saggy Manatee And Swan Folk, Dec 19 '13 at 00:46
Can someone please take a look at my question? https://stackoverflow.com/questions/65137064/r-plotting-roc-curves-without-the-pec-library thanks! — stats_noob, Dec 04 '20 at 04:09
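
The point raised in the first comment, that `newdata` should carry predictor columns rather than the event indicator, can be illustrated with a small sketch. It uses the `lung` data shipped with the survival package as a stand-in, since the original dataset is not available, and it reads the probabilities off with survfit() and summary() from survival instead of pec's predictSurvProb, so only one package is needed. The predictors (age, sex, ph.ecog), the two example patients, and the time points are assumptions chosen for illustration.

```r
library(survival)

# Stand-in data: the lung dataset from survival (not the asker's data)
dat <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])

# Fit the Cox model on predictors, not on the event indicator
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = dat)

# newdata holds predictor values only; no time or event column is needed
newpat <- data.frame(age = c(55, 70), sex = c(1, 2), ph.ecog = c(0, 2))

# Time points in the units of `time` (days for lung)
tp <- c(180, 365, 730)

# One predicted survival curve per new patient, read off at the chosen times
sf <- survfit(fit, newdata = newpat)
psurv <- t(summary(sf, times = tp)$surv)
dimnames(psurv) <- list(paste0("patient", 1:2), paste0("t", tp))
psurv
```

Each row of `psurv` is one new patient and each column one time point, the same layout the question's code comments describe for the predictSurvProb result.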

score 4 · Accepted Answer · edited Apr 13 '17 at 12:44

The method you link to in your comment should work, if you choose to follow the neural-network survival analysis approach in the article I linked to in my comment. For each patient in the model that approach uses a list of probabilities of being alive at each time of interest: 1/0 for patients known to have died, and for "censored" cases a 1 until last follow-up and thereafter the KM survival estimate.

Having said that, however, I urge you to consider looking at the other neural-network approaches noted in that article and any other more recent developments; I have a fair amount of experience with survival analysis, but not with neural-network approaches. Also, although neural-network approaches can give good predictive behavior, the hidden variables make it difficult to say what predictor variables really "matter," something that clinicians typically care about. The Survival Task View page available at CRAN mirrors shows other approaches for high-dimensional data like yours that might give results easier to interpret heuristically, and the MachineLearning Task View page shows what's available for neural network and other machine-learning approaches in R.
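
The target construction described in the first paragraph of this answer can be sketched as follows. This is a rough illustration under stated assumptions, not the linked article's exact procedure: the `toy` data frame is invented, the time points are the 24-month steps from the question, and the value used after censoring is the plain (unconditional) Kaplan-Meier estimate, whereas the article may define it differently.

```r
library(survival)

# Hypothetical toy data: months of follow-up and event (1 = died, 0 = censored)
toy <- data.frame(time  = c(10, 30, 50, 60, 80, 100, 130, 150),
                  event = c( 1,  0,  1,  1,  0,   1,   0,   1))
tp <- c(24, 48, 72, 96, 120)  # time points of interest, in months

# Kaplan-Meier estimate on the whole sample, read off at each time point
km <- survfit(Surv(time, event) ~ 1, data = toy)
km_at_tp <- summary(km, times = tp, extend = TRUE)$surv

# Target vector for one patient:
#   1 while the patient is known to be alive,
#   0 from a known death onward,
#   the KM estimate after the last follow-up of a censored patient
target_row <- function(time, event) {
  known_alive <- if (event == 1) tp < time else tp <= time
  after <- if (event == 1) 0 else km_at_tp
  ifelse(known_alive, 1, after)
}

# One row per patient, one column per time point
targets <- t(mapply(target_row, toy$time, toy$event))
colnames(targets) <- paste0("m", tp)
targets
```

The resulting matrix has one row per patient and one column per time point, which is the shape a multi-output network (or five separate single-output models) would be trained against.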

Thank you so much for all the help! I will reassess my study based on your input and other references :D — Saggy Manatee And Swan Folk, Dec 19 '13 at 16:18
