
I have a churn dataset with 79 variables. I fit a Cox regression model to the dataset and get the following result:

    Concordance= 0.664  (se = 0.001 )
    Rsquare= 0.152   (max possible= 1 )
    Likelihood ratio test= 63855  on 16 df,   p=<2e-16
    Wald test            = 70941  on 16 df,   p=<2e-16
    Score (logrank) test = 75130  on 16 df,   p=<2e-16
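
For reference, this is the kind of output printed by summary() on a coxph fit from the survival package; the call looks roughly like the following (the data frame churn and the variable names are placeholders, not my actual columns):

    library(survival)

    ## sketch of the kind of call that produces the summary above;
    ## tenure, churned, and x1..x3 are placeholder names
    fit <- coxph(Surv(tenure, churned) ~ x1 + x2 + x3, data = churn)
    summary(fit)   # prints the concordance, R-squared, and LR / Wald / score tests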

Now I am confused about what I should do to increase my R-squared, as it does not seem to get above 0.2. What can I infer from this?

C. Gupta

1 Answer


Depending on the nature of the data, $R^2$ values in Cox regressions can be quite low. I'm presently working on some clinical survival data for which a model with very good predictive behavior only has an $R^2$ of 0.17. A high $R^2$ would require high precision in predicting the actual times of events, often an unrealistic goal.

The concordance index might tell you more about how useful your model is. It is the fraction of pairs of cases in which the actual order of events matches the order predicted by the model. The value of 0.664 means that your model predicts the correct order in almost 2/3 of pairs of cases, which might be good enough for your purposes. You should also consider validation and calibration of your model to estimate how well it will generalize to new data, for example with the tools provided by the rms package in R, as in the sketch below.
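
Here is a minimal sketch of bootstrap validation and calibration with rms; the data frame churn and the variable names time, status, x1, x2 are placeholders, not from your post:

    library(rms)

    ## cph() must store the design (x, y, surv) for validate()/calibrate()
    dd <- datadist(churn); options(datadist = "dd")
    fit <- cph(Surv(time, status) ~ x1 + x2, data = churn,
               x = TRUE, y = TRUE, surv = TRUE, time.inc = 365)

    ## Bootstrap validation: optimism-corrected Dxy, where C = (Dxy + 1) / 2
    validate(fit, method = "boot", B = 200)

    ## Calibration of predicted vs. observed survival at 365 days
    plot(calibrate(fit, u = 365, B = 200))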

It might be possible to improve your model, but help on that would require a lot more information about the nature of the data, the predictor variables, the events, and the number of cases. Also, it's somewhat strange that with 79 predictor variables your model only has 16 degrees of freedom. If you did some type of variable selection to reduce the number of variables (e.g., forward stepwise selection) then your results might be suspect and might not generalize well to new samples of data. This classic answer discusses the problems with such variable selection in the context of linear regression, but the issues are the same for Cox regressions.

EdM
  • Actually, I am using 19 variables, which I have selected based on several factors, the Chi-squared test being one of them. – C. Gupta Aug 21 '18 at 11:20
  • @C.Gupta if you pre-selected variables based on their relations to the outcome, then you do run the risk of over-fitting the data and of modeling results based on the present sample not generalizing well to future data samples. Look at the page linked in my answer to start learning about the problems this approach entails. Consider penalized approaches, like LASSO or ridge regression, to minimize these problems, and test your models with bootstrapping or cross-validation; a sketch of the penalized approach follows these comments. – EdM Aug 21 '18 at 14:02
  • No, the fact is that many of the 79 variables are categorical, and there are a few binary flag variables as well. But I have already examined the scatter plots and the variability of the covariates in Tableau, and I took the decision on that basis. Can you give me some of your time? This is an industry project I am working on and my clients are ACN and Sprint in the USA. The problem is that I am not getting a sufficient helping hand here, which is a major concern. Can you share your email ID so that I can send you the data and the code I have written in R? – C. Gupta Aug 22 '18 at 06:26
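
Following up on the penalized approaches mentioned in the comment above, here is a minimal sketch of a cross-validated LASSO Cox model with the glmnet package; the data frame churn and its time and status columns are assumptions, not from the original post:

    library(glmnet)

    ## Predictor matrix (model.matrix expands categorical variables to dummies)
    x <- model.matrix(~ . - time - status, data = churn)[, -1]
    ## glmnet's Cox family expects a two-column matrix named time and status
    y <- as.matrix(churn[, c("time", "status")])

    ## Cross-validated LASSO (alpha = 1; alpha = 0 would give ridge)
    cvfit <- cv.glmnet(x, y, family = "cox", alpha = 1)

    ## Coefficients retained at the cross-validation-chosen penalty
    coef(cvfit, s = "lambda.min")

The variables with non-zero coefficients at lambda.min are the ones the penalty retains, which avoids the data-dependent pre-screening that causes the over-fitting problems discussed in the answer and comments above.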