
I am currently looking to evaluate/validate a survival analysis model on a heavily right-censored data set with many individuals. I wanted to use the c-index as a KPI, but computing it is roughly O(N²) in the number of individuals: even with 10-fold cross-validation, I still have 10,000 positive events in each test set (and far more censored individuals), which makes computing the c-index directly impractical. Would it be statistically sound to perform stratified sampling on the test set to get an estimate of the c-index at each cross-validation step? I guess the quality of the estimate would depend on the sampling fraction. Does this make sense (I haven't found any publication on the subject), or should I just forget the c-index?
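To make the idea concrete, here is a minimal sketch of the subsampling scheme I have in mind, assuming lifelines' `concordance_index` is available; the stratification simply preserves the event/censoring ratio of the full test set, and all variable names are placeholders:

```python
# Minimal sketch, not a definitive implementation: estimate the c-index by
# averaging it over stratified random subsamples of the test set.
import numpy as np
from lifelines.utils import concordance_index  # assumes lifelines is installed

def subsampled_c_index(durations, scores, events, n_repeats=20,
                       sample_frac=0.05, seed=0):
    """Mean/std of the c-index over stratified subsamples.

    `scores` must be oriented so that higher = longer survival
    (e.g. the negated Cox partial hazard).
    """
    durations = np.asarray(durations)
    scores = np.asarray(scores)
    events = np.asarray(events).astype(bool)
    rng = np.random.default_rng(seed)
    event_idx = np.flatnonzero(events)   # uncensored individuals
    cens_idx = np.flatnonzero(~events)   # censored individuals
    estimates = []
    for _ in range(n_repeats):
        # Stratified draw: keep the event/censoring ratio of the full test set.
        sub = np.concatenate([
            rng.choice(event_idx, size=max(2, int(len(event_idx) * sample_frac)),
                       replace=False),
            rng.choice(cens_idx, size=int(len(cens_idx) * sample_frac),
                       replace=False),
        ])
        estimates.append(concordance_index(durations[sub], scores[sub], events[sub]))
    return float(np.mean(estimates)), float(np.std(estimates))
```

The spread of the estimates across repeats would then give me a rough idea of how stable the estimate is for a given sampling fraction.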

I thank you in advance for your help.


Update:

I want to model churn per customer for a telecom company that is not yet data-driven. For now I am not sure whether I want to use time-varying covariates, so I only work with 'intrinsic' covariates. I focus on phone contracts that end before term (because the client wants to cancel early).

Examples: gender, region (with some economic feature engineering), and type of contract (one-hot encoded).

Currently I use simple models such as Cox proportional hazards; even though the proportional hazards assumption is not verified, I am just trying to avoid over-fitting.
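For reference, this is roughly how I fit the model and checked that assumption (a sketch using lifelines; `train_df` and the column names are placeholders for my actual data):

```python
# Sketch of the current model fit; train_df and the column names are
# placeholders for my actual data.
from lifelines import CoxPHFitter

cph = CoxPHFitter(penalizer=0.1)  # small ridge penalty to limit over-fitting
cph.fit(train_df, duration_col="lifetime", event_col="terminated_early")
cph.check_assumptions(train_df)   # reports proportional-hazards violations
```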

I don't do parameter tuning yet. I use cross-validation to check that the model is robust enough.
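Concretely, the cross-validation loop looks roughly like this (a sketch combining sklearn's `KFold` with the `subsampled_c_index` helper sketched above; `df` and the column names are placeholders):

```python
# Sketch of the 10-fold CV loop; df and the column names are placeholders.
import numpy as np
from sklearn.model_selection import KFold
from lifelines import CoxPHFitter

fold_scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(df):
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    cph = CoxPHFitter(penalizer=0.1)
    cph.fit(train_df, duration_col="lifetime", event_col="terminated_early")
    # Negate the partial hazard so that a higher score means longer survival.
    scores = -np.asarray(cph.predict_partial_hazard(test_df)).ravel()
    mean_c, sd_c = subsampled_c_index(
        test_df["lifetime"], scores, test_df["terminated_early"]
    )
    fold_scores.append(mean_c)

print(f"c-index across folds: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```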

My dataset is quite large (300,000 contracts) and covers 5 months. The contracts have different terms, so I standardized them by rescaling each contract's lifetime to the interval [0.0, 1.0] and adding the contract's nominal term as a feature.
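That rescaling step looks like this (a sketch; `df` is a pandas DataFrame and the raw column names are placeholders):

```python
# Sketch of the lifetime rescaling; raw column names are placeholders.
df["lifetime"] = df["observed_duration"] / df["contract_term"]  # in [0.0, 1.0]
df["term_in_months"] = df["contract_term"]  # keep the nominal term as a feature
```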

My management wants some kind of accuracy figure for the model, which is awkward because the censoring ratio is very high (95%), so I don't know yet how I will answer that question. But first I need to ensure the model is stable.


Question summary:

The goal is to predict early termination of contracts, per contract, for a telecom company (the probability that a contract will end early, and also its survival function). I currently use a Cox proportional hazards model. I want to validate this model (it's the first one), assessing whether it is stable under cross-validation. I want to use the c-index because of the high ratio (>0.95) of censoring, but the c-index has O(N²) complexity. Does it make sense (statistically speaking) to sample the test set several times in order to compute an estimate of the c-index?

Comments:

  • Please say more about the goal of your modeling, the type of model (e.g., Cox or some type of tree-based model), the number of predictors that you are evaluating, and the number of censored cases. Note that the C-index is [not very sensitive](https://stats.stackexchange.com/a/17517/28500) as a performance indicator. What parameter values are you trying to tune with your cross-validation? Please add that information to the question rather than placing it in a comment, as comments sometimes get lost. – EdM Jan 06 '20 at 15:12
  • This question could also be a bit simpler. For instance, I can't tell if the goal is to model associations or perform predictions. If the latter, even in part, the Cox model alone is not good for predictions. With big data, people use log-linear models far more often. – AdamO Jan 06 '20 at 15:50

0 Answers