I am trying to use the CoxTimeVaryingFitter model from the lifelines package in Python to infer which features have a causal impact on a success outcome. The features are time-varying, so this model seems appropriate.

As time 't' increases, the success outcome rate decreases, so the outcome imbalance grows with time. My questions are:

  1. Should imbalance be handled in an inference model? If so, what is the best way to do it, and should it be handled at each time 't'?

  2. Should the rows in the training dataset only go up to a chosen maximum time 't'? That is, should entries at large 't', where the success rate is extremely small, be excluded from the training dataset? Is there a good way to choose the cutoff point for 't'?

Fiori
  • [Are you sure class imbalance is a problem?](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he?noredirect=1&lq=1) – Dave Jun 21 '21 at 19:18

1 Answer

In a time-to-event model it's typical to have imbalance at late times. If each individual can have at most one event, many of those in higher-risk groups will necessarily have had the event and left the cohort before the later times. With a proportional hazards model you're assuming that the relative hazards associated with your covariates are constant over time, so time per se isn't an issue. The coefficients are estimated from whatever information is available at all event times. It's just that late times have fewer cases available to provide information.
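
To make the shrinking risk set concrete, here is a minimal sketch in plain Python. The cohort, the `subjects` list and the `risk_set_sizes` helper are made up for illustration and are not part of lifelines. It counts how many individuals are still at risk at each distinct event time, which is the pool of cases the Cox partial likelihood compares at that time.

```python
# Hypothetical toy cohort: (subject id, follow-up duration, event observed?)
subjects = [
    ("a", 2, True),
    ("b", 3, True),
    ("c", 3, False),  # censored at t=3
    ("d", 5, True),
    ("e", 8, True),
    ("f", 9, False),  # censored at t=9
]


def risk_set_sizes(rows):
    """Return {event_time: (n_at_risk, n_events)} for each distinct event time."""
    event_times = sorted({t for _, t, observed in rows if observed})
    table = {}
    for t in event_times:
        # A subject is at risk at t if they have not yet had the event
        # or been censored strictly before t.
        n_at_risk = sum(1 for _, duration, _ in rows if duration >= t)
        n_events = sum(1 for _, duration, obs in rows if duration == t and obs)
        table[t] = (n_at_risk, n_events)
    return table


risk_table = risk_set_sizes(subjects)
for t, (n_at_risk, n_events) in risk_table.items():
    print(f"t={t}: at risk={n_at_risk}, events={n_events}")
```

Every event time contributes one term to the partial likelihood, whatever the size of its risk set. Late times contribute terms based on fewer individuals, but under the proportional hazards assumption they still inform the same coefficients.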

A point to remember in a model with time-varying covariates, however, is that it's the covariate values at each event time that are used in the estimates. Some types of covariates might better be modeled as cumulative or average values of some sort rather than their values at particular times. Apply your knowledge of the subject matter carefully to the data setup and modeling.
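
As one possible illustration of that data setup, the sketch below derives history-based covariates from long-format (start, stop] rows, the layout that `CoxTimeVaryingFitter.fit` takes via its `id_col`, `start_col`, `stop_col` and `event_col` arguments. The `dose` covariate, the toy rows and the `add_history_covariates` helper are assumptions made up for this example. Using only exposure accumulated before each interval starts is one choice among several, made here so that no interval uses information from its own future.

```python
# Hypothetical long-format rows: (id, start, stop, dose during interval, event at stop?)
rows = [
    ("a", 0, 2, 1.0, False),
    ("a", 2, 5, 3.0, False),
    ("a", 5, 6, 0.0, True),
    ("b", 0, 4, 2.0, False),
    ("b", 4, 7, 2.0, False),
]


def add_history_covariates(rows):
    """Attach cumulative and time-averaged dose, using only prior intervals."""
    exposure = {}  # id -> dose * time accumulated before the current interval
    out = []
    for sid, start, stop, dose, event in sorted(rows, key=lambda r: (r[0], r[1])):
        prior = exposure.get(sid, 0.0)
        out.append({
            "id": sid,
            "start": start,
            "stop": stop,
            "event": event,
            "dose": dose,                      # value at this particular time
            "cum_dose": prior,                 # total exposure before `start`
            "avg_dose": prior / start if start > 0 else 0.0,  # mean dose before `start`
        })
        exposure[sid] = prior + dose * (stop - start)
    return out


augmented = add_history_covariates(rows)
for r in augmented:
    print(r)
```

Whether the instantaneous `dose`, the cumulative value, or the average is the right covariate is a subject-matter decision. The code only shows that each can be built from the same long-format table before fitting.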

EdM