
For a logistic regression, one approach is to cut the numeric variables (or group the categorical ones) with some algorithm before running the logistic model, so that each resulting level can be assigned a value that makes the log(odds) linear in B*X. What would be the way to do this for a Cox model?

For example, you can use CHAID or a statistical test to group the variables, then convert the levels to factors and run the logistic regression. Finally, you can use some kind of variable selection, such as stepwise with the Akaike criterion (but this comes after introducing the variables grouped into levels with significantly different target rates).
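For concreteness, here is a minimal sketch of that binning-then-factor workflow in R, with simulated data and quantile-based cuts standing in for CHAID (the data frame `d` and variables `x` and `y` are made up for illustration):

```r
set.seed(1)

# Hypothetical data: binary target y and a numeric predictor x
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

# Cut x into quartile bins and treat the bins as a factor
d$x_grp <- cut(d$x,
               breaks = quantile(d$x, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Logistic regression on the grouped variable, then stepwise selection by AIC
fit <- glm(y ~ x_grp, family = binomial, data = d)
fit_step <- step(fit, k = 2)
```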

I thought that, in a Cox model, since what you are trying to estimate is survival, which is a distribution over time, one way would be to cut/group each variable with some kind of test, such as a goodness-of-fit test on the observed survival curves (that is, if the curves on either side of a candidate cut point are the same, then you should not cut at that point).
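As a rough illustration of that idea (not a recommendation), one could compare the survival curves on either side of a candidate cut point with a log-rank test, here using the `lung` data from the survival package and the median of `age` as an arbitrary candidate cut:

```r
library(survival)

# Candidate cut point: the median of age in the built-in lung data
lung2 <- lung
lung2$grp <- cut(lung2$age,
                 breaks = c(-Inf, median(lung2$age), Inf),
                 labels = c("low", "high"))

# Log-rank test: if the two survival curves do not differ
# (large p-value), this criterion would say not to cut here
survdiff(Surv(time, status) ~ grp, data = lung2)
```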

Do you have any other ideas for cutting the variables in a way that is useful for a Cox model? Is there an R package to do this?

GabyLP

  • Should be "a bad approach is to cut the numeric variables". – user158565 Jul 16 '19 at 19:53
  • @user158565, why? Since the model has to be linear in the log(odds), I cut the variables with one of the algorithms I described and then, instead of a factor, use the numeric value of the log(odds). In any case, would you please answer my question? – GabyLP Jul 16 '19 at 20:11
  • 4
    See [this page](https://stats.stackexchange.com/q/68834/28500) for extensive discussion about why cutting/grouping continuous predictors is typically a poor idea. In particular, cutting or grouping based on relations to survival in a data sample is likely to be overfit and unlikely to generalize well back to the population of interest. – EdM Jul 16 '19 at 21:09

1 Answer


It is probably a much better idea to use splines instead of dichotomizing continuous variables, which needlessly throws away information. Terry Therneau has a vignette on how to do that in the Cox model using R. If you are unsure about what degree of spline to use, there is always the option of tuning it on your training data using cross-validation before evaluating model performance on a hold-out test set.
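For example, a minimal sketch in the spirit of Therneau's vignette, using the `lung` data shipped with the survival package as a stand-in for your own data:

```r
library(survival)
library(splines)

# Natural cubic spline with 3 df for age, instead of cutting it
fit_ns <- coxph(Surv(time, status) ~ ns(age, df = 3) + sex, data = lung)

# Penalized smoothing spline; df = 0 lets coxph choose the
# effective degrees of freedom by an AIC-based criterion
fit_ps <- coxph(Surv(time, status) ~ pspline(age, df = 0) + sex, data = lung)

# Plot the estimated log hazard as a smooth function of age
termplot(fit_ps, term = 1, se = TRUE)
```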

Björn

  • +1 but note that most Cox models simply don't have enough data points to warrant separate training and held-out test sets. [Harrell](https://www.fharrell.com/post/split-val/) suggests that with fewer than 20,000 data points you are better off using bootstrapping or repeated cross-validation to validate the process used to build a model on the entire data set. His [rms package](https://cran.r-project.org/package=rms) in R provides restricted cubic splines; the degree of the splines can be determined by ANOVA between models with different numbers of knots (see the sketch below). – EdM Jul 16 '19 at 21:04
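A hedged sketch of that suggestion with rms (restricted cubic splines via `rcs`, a Wald ANOVA for the nonlinear terms, and bootstrap validation on the full data set; again using the `lung` data purely for illustration):

```r
library(rms)
library(survival)

dd <- datadist(lung); options(datadist = "dd")

# Cox model with a 4-knot restricted cubic spline for age;
# x = TRUE, y = TRUE are needed for later validation
fit <- cph(Surv(time, status) ~ rcs(age, 4) + sex, data = lung,
           x = TRUE, y = TRUE)

anova(fit)              # tests the overall and nonlinear effects of age
validate(fit, B = 200)  # bootstrap validation instead of a held-out test set
```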