
This question arises from a project I am working on, but it is better dealt with broadly.

When fitting a hazard model (say, coxph) to longitudinal data with time-dependent covariates, the best-case scenario is to have no right-censored observations, yet the Cox model handles right censoring just fine. But to what extent? Obviously there is a limit, since data in which every observation is censored contain no events from which to estimate a hazard. So my questions are two (with a conditional third):

  1. Is there a point at which the Cox proportional hazards model becomes inaccurate because too many observations are right-censored? Is an event-to-censored ratio of 1:10 problematic? What about 1:100?
  2. How does the answer to the previous question change with respect to $N$ and/or the number of covariates in the model?
  3. Finally, when there are problems - (a) how can they be detected, and (b) what can be done, and are there better-suited models?
Yuval Spiegler

1 Answer

  1. A potential problem with many fewer events than censored cases is that it might be difficult to detect the existence of informative censoring or other deviations from proportional hazards assumptions. All the information about the relative hazards needs to be in the covariates themselves. For Cox analysis the censored cases should not differ from those with events except that their events, given the risks associated with the covariates, haven't happened yet. That might be hard to examine if there are few events, as in this example. If covariates are time-varying this might be an even bigger problem. My sense here is that this type of problem is more likely with small absolute numbers of events, not necessarily a low ratio of events to censored cases.

  2. The usual rule of thumb is no more than 1 covariate considered per about 15 events, unless some penalization is used. This is similar to the rule of thumb for binary classification, where you need about 15 cases of the least-frequent class per variable. Each level of a categorical predictor beyond the first should count as a covariate, as should each additional parameter determined from the data in handling continuous covariates (e.g., in fitting splines); the first sketch after this list shows one way to do that accounting.

  3. For the first type of problem, you have to be vigilant in testing whether the assumptions of the Cox model have been met. Tools like cox.zph() in R give a way to approach this, but with few events you will have very low power for detecting deviations from proportional hazards, so extra vigilance is needed there. For the second type, compare the number of covariates to the number of events. If there are too few events, you can use subject-matter knowledge to remove some covariates, combine correlated predictors into a single predictor, or in some circumstances combine multiple predictors into a single propensity score. Or you can use penalized methods like ridge regression or LASSO to handle a larger number of predictors while minimizing over-fitting. Tools like cox.zph() do not test for over-fitting; that requires examining multiple re-samples from your data to see how well your model-building process would extend to new samples. The rms package in R, for example, provides tools for this; both kinds of check are illustrated in the second sketch after this list.
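
A minimal sketch of the accounting in point 2, using the survival package's built-in lung data purely as a stand-in (the dataset, the covariates, and the 3-df spline are illustrative assumptions, not anything from the question):

```r
library(survival)
library(splines)

d <- lung
d$ph.ecog <- factor(d$ph.ecog)    # categorical: k levels contribute k - 1 parameters

## A factor, a 3-df spline on a continuous covariate, and one plain covariate;
## every estimated coefficient counts against the ~15-events-per-parameter guideline.
fit <- coxph(Surv(time, status) ~ ph.ecog + ns(age, df = 3) + sex, data = d)

n_events <- fit$nevent            # events actually used in the fit
n_params <- length(coef(fit))     # estimated parameters, spline and dummy terms included
c(events = n_events, parameters = n_params,
  events_per_parameter = round(n_events / n_params, 1))
```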
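
The checks in point 3 can be sketched the same way; again the lung data and covariates are only stand-ins, with cox.zph() (survival) and validate() (rms) as the tools the answer refers to:

```r
library(survival)
library(rms)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)

## (a) Proportional-hazards diagnostics: cox.zph() tests each term (and the model
## globally) for a time trend in the scaled Schoenfeld residuals. With few events
## these tests have little power, so a "pass" is weak evidence.
cox.zph(fit)
plot(cox.zph(fit))                # residual-vs-time plots for each covariate

## (b) Over-fitting diagnostics: refit with rms::cph() and use bootstrap resampling
## to estimate how much the apparent performance is expected to shrink on new data.
fit2 <- cph(Surv(time, status) ~ age + sex + ph.ecog, data = d,
            x = TRUE, y = TRUE)
validate(fit2, B = 200)           # optimism-corrected indexes such as Dxy
```

For the penalized route mentioned above, glmnet with family = "cox" or a ridge() term inside coxph() are common starting points; which is appropriate depends on whether you want variable selection (LASSO) or only shrinkage (ridge).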

EdM
  • Thank you for answering. A few questions: *1)* "The censored cases should not differ from those with events..." - can you elaborate on this? Often the censored cases do differ - not everyone will get a heart attack even if followed to their death, not all companies will go bankrupt, etc. *2)* I know about the 15-events rule of thumb, but I always see it in regard to cross-sectional analyses. *3)* Can one of the tests in the 'coxph' fit be indicative of overfitting? – Yuval Spiegler Dec 07 '16 at 11:09
  • Expanded answer to cover these. Briefly: (1) you may see more censoring among cases with low hazard, but all information about hazards should be in the covariates themselves; (2) that rule of thumb is a good place to start in general, but if you have multiple or competing events in the same individuals that might need to be reconsidered, perhaps to count individuals having events rather than total numbers of events; (3) over-fitting has to do with whether the model will generalize to new samples, not whether you met the proportional hazards assumptions in your present data sample. – EdM Dec 07 '16 at 15:00