Introduction
I have a 300,000-row cancer dataset with around 60 variables (cancer stage, year of diagnosis, radiation therapy, histology, etc.), plus a time variable ("number of months survived") and an event variable (alive or dead). These last two variables have no missing values in any record.
Survival months as outcome variable
My initial goal was to create a multilayer perceptron model in WEKA given my data in order to predict the number of months survived for new instances.
- Preprocess data
- Train the model in WEKA
- Assess model's performance (accuracy, specificity, sensitivity)
- Test model for new cancer records
Patient risk as outcome variable
The requirements changed, so the goal became predicting each patient's risk of survival within equally spaced time periods.
- Divide data into:
  - 24 - 47 months (2 years)
  - 48 - 83 months (4 years)
  - 84 - 107 months (6 years)
  - 108 - 119 months (8 years)
  - 120 - "up to what's available" months (10 years)
- I will then use the function predictSurvProb from the package pec in R, as suggested in this problem, to obtain individual survival percentages for the aforementioned records. The data will be divided by survival-months bracket, each with its own patient risk prediction, i.e. records that survived within two years will have a predicted patient survival percentage at two years.
- After every record has its survival percentage per time period, WEKA will be used to create five models, one for each time period, each with patient survival as the outcome variable.
- The five models can then be used to predict the survival of a single record, giving five different patient survival risks.
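The bracket step in the list above could be sketched in base R with cut(). This is only an illustration: the vector months holds made-up values, and the names breaks, labels and bracket are hypothetical. The cut points are taken from the lower bounds of the brackets listed above.

```r
# Hypothetical survival times in months (illustrative values, not real data).
months <- c(10, 24, 47, 48, 83, 84, 107, 108, 119, 120, 200)

# Lower bounds of the brackets listed above; Inf stands in for
# "up to what's available". right = FALSE makes each bin [lower, upper).
breaks <- c(24, 48, 84, 108, 120, Inf)
labels <- c("2 years", "4 years", "6 years", "8 years", "10 years")

# Records below 24 months fall outside every bracket and become NA.
bracket <- cut(months, breaks = breaks, labels = labels, right = FALSE)

# Count of records per bracket, including the unassigned ones.
print(table(bracket, useNA = "ifany"))
```

Using right = FALSE keeps a value such as 48 in the "4 years" bracket rather than the "2 years" one, which matches how the ranges are written above.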
Problem
I am still learning R, but I managed to adapt the sample code (from the pec documentation for predictSurvProb) to my data as follows:
library(survival)
library(pec)
library(rms)
# fit a Cox model
coxmodel <- cph(Surv(time,vsr)~1,data=cancer,surv=TRUE)
# predicted survival probabilities can be extracted at selected time-points:
ttt <- quantile(time)
# for selected predictor values:
ndat <- data.frame(vsr=c(0,1)) # I assumed the event variable is provided here
# as follows
predictSurvProb(coxmodel,newdata=ndat,times=ttt) # has error
## simulate some learning and some validation data
learndat <- SimSurv(100)
valdat <- SimSurv(100)
## use the learning data to fit a Cox model
fitCox <- coxph(Surv(cancer$time,cancer$vsr)~vsr,data=cancer)
## suppose we want to predict the survival probabilities for all patients
## in the validation data at the following time points:
psurv <- predictSurvProb(fitCox,newdata=valdat,times=c(24,48,72,96,120))
## This is a matrix with survival probabilities
## one column for each of the 5 time points
## one row for each validation set individual
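To make the target output concrete, here is a minimal sketch of the same kind of matrix (one row per patient, one column per time point) built with the survival package alone, using survfit() and summary() instead of predictSurvProb. It is not the pec approach and not a fix for the code above. The lung dataset bundled with survival is a stand-in, and the covariates age and sex, the data frame newpat and the chosen times are illustrative assumptions rather than variables from the dataset described here.

```r
library(survival)

# Stand-in data: the lung dataset shipped with survival (time is in days).
# The covariates age and sex are chosen only for illustration.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hypothetical new patients, described by the model's covariates only
# (no event or time column is needed for prediction).
newpat <- data.frame(age = c(50, 65, 80), sex = c(1, 2, 1))

# Time points at which to read off the predicted survival curves.
times <- c(180, 365, 730)

# survfit() returns one predicted curve per row of newpat;
# summary(..., times = ) evaluates those curves at the chosen times,
# giving a matrix with one row per time and one column per patient.
sf <- survfit(fit, newdata = newpat)
psurv <- t(summary(sf, times = times)$surv)

# After transposing: one row per patient, one column per time point.
dimnames(psurv) <- list(paste0("patient", 1:3), paste0("t", times))
print(round(psurv, 3))
```

Note that the model here has actual covariates on the right-hand side of the formula, and newpat contains those covariates rather than the event variable.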
I need to obtain the patient risk calculation for 300,000 patients but the line
predictSurvProb(coxmodel,newdata=ndat,times=ttt)
shows the error
Error in .subset2(x, i, exact = exact) : subscript out of bounds
How do I solve this error?