Survival Analysis when Independent Variable Proportions Shift Heavily Over Time

Question

I am attempting to conduct survival analysis on hiring processes with Kaplan-Meier, where the event is termination from the organization: testing a few independent variables known at an employee’s hiring against how long they stay. However, unlike in most clinical studies, the proportions of these independent variables seen in the data can be heavily related to calendar time. For example, suppose Talent Pipeline is an independent variable under consideration, and the organization used Talent Pipeline A most of the time at first, then shifted towards Pipeline B, and used Pipeline B nearly always by the end of the period under consideration.

I would expect this to result in secular trends that I need to account for. Many papers discuss various other considerations that can arise, but I am not seeing much on this issue. Any tips on how best to account for it, and/or links to more information on the subject? Thanks so much!

(Note: These aren't time-varying covariates, just the proportions of the categories of the independent variables seen in the data shift heavily over time.)

If there is no epoch effect, i.e., a change in the relative risks of differences in the predictors over time, I would think this would not be an issue. You can check that. – DWin, Nov 25 '20 at 04:01

score 0 · Answer 1 · answered Nov 28 '20 at 15:09

The main potential problem, as pointed out in a comment by @DWin, would be "a change in the relative risks of differences in the predictors over time."

One way to deal with that problem would be to include the date of hiring as a predictor (which you presumably are doing anyway), along with an interaction term between TalentPipeline and date of hiring. That would either tend to rule out the problem (insignificant interaction term) or control for it.
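
To make this concrete, here is one way such a model could be written. This is a sketch that assumes a Cox proportional hazards regression (not named above), a 0/1 indicator $P$ for Pipeline B, a hire date $d$ counted in days from a reference date $d_0$, time since hire $t$, and a linear effect of hire date; all symbols are illustrative assumptions:

$$
h(t \mid P, d) = h_0(t)\,\exp\{\beta_1 P + \beta_2 (d - d_0) + \beta_3 P\,(d - d_0)\}
$$

Under this sketch the hazard ratio for Pipeline B versus Pipeline A among people hired on date $d$ is $\exp\{\beta_1 + \beta_3 (d - d_0)\}$, so $\beta_3 = 0$ corresponds to a pipeline effect that does not drift with hire date.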

EdM

Thank you both! My understanding is that when adding date interactions for logistic regressions it is helpful to convert the date into a number of days since some reference/origin date. Is the same true with Cox PH and other survival models? – naive_bayesian Dec 01 '20 at 19:39
@naive_bayesian in all regressions the choice of reference date affects the value of the "main effect" coefficient for what you're interacting it with, `TalentPipeline` here. With an interaction, the coefficient reported for `TalentPipeline` will be what holds at the reference date for which elapsed time = 0. If you use 1 January 1900 as a reference date, that value will be a good deal different from what you'd get with 1 January 2020 as reference. The model and predictions from it are really the same, but people sometimes get confused if the reference date isn't representative of the data. – EdM Dec 01 '20 at 19:53
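
The point about the reference date can be checked with plain arithmetic. The sketch below assumes a linear predictor of the form `b_pipeline*P + b_date*days + b_inter*P*days`; the coefficient values, function names, and the example hire date are all hypothetical and chosen only for illustration.

```python
from datetime import date

def recenter(b_pipeline, b_date, b_inter, old_origin, new_origin):
    """Re-express coefficients after moving the date origin.

    Only the main effect of the pipeline changes. The constant
    b_date * shift_days is the same for everyone, so in a Cox model
    it is absorbed into the baseline hazard.
    """
    shift_days = (new_origin - old_origin).days
    return b_pipeline + b_inter * shift_days, b_date, b_inter

def log_hazard_ratio(b_pipeline, b_inter, origin, hire_date):
    """Log hazard ratio (Pipeline B vs. A) for people hired on hire_date."""
    return b_pipeline + b_inter * (hire_date - origin).days

# Hypothetical coefficients from a fit with days counted from 1 January 1900.
origin_1900 = date(1900, 1, 1)
origin_2020 = date(2020, 1, 1)
coef_1900 = (-8.0, 0.00001, 0.0002)
coef_2020 = recenter(*coef_1900, origin_1900, origin_2020)

hire = date(2019, 6, 1)  # hypothetical hire date
hr_1900 = log_hazard_ratio(coef_1900[0], coef_1900[2], origin_1900, hire)
hr_2020 = log_hazard_ratio(coef_2020[0], coef_2020[2], origin_2020, hire)

print("main effect:", coef_1900[0], "vs", coef_2020[0])  # very different
print("log HR at hire date:", hr_1900, "vs", hr_2020)    # same up to rounding
```

The reported `TalentPipeline` coefficient differs sharply between the two origins, while the implied comparison at any actual hire date is unchanged.
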
Thank you. After adding the interaction term I'm seeing a statistically significant interaction coefficient, indicating the relative risks of terminations in the two pipelines do vary over time. But both the graphical and statistical tests for PH indicate the PH assumption is met. Do those seem to be contradictory to you, or am I missing something? – naive_bayesian Dec 01 '20 at 20:36
@naive_bayesian there's no necessary contradiction. PH just has to do with whether the hazards associated with the predictors are constant over time elapsed _since study entry_, presumably the time elapsed since hire. The actual date of hire, now used as a predictor itself and in an interaction, provides different information from the time elapsed since study entry/hire date. From the perspective of PH, the actual date of hire and its interaction with `TalentPipeline` are just 2 more predictors, which you now have found to contribute constant hazards _over the time elapsed since the hire_. – EdM Dec 01 '20 at 20:55
Thanks so much! Also, a friend suggested only including a subset of the dataset for modeling, limiting it to observations/employees with apply dates within the time range during which the `TalentPipeline`s were both used more than x% of the time (or possibly, the time range during which the proportions of `TalentPipeline` used were relatively flat). Do you think filtering in either of these ways would add any benefit beyond simply including the interaction term? – naive_bayesian Dec 01 '20 at 21:13
I read in your answer to a different question that "it's seldom a good idea to put too much importance on the results of a single-predictor Cox model or Kaplan-Meier curve." When the goal is group comparison and not prediction, would you have the same concerns about a minimal model with interaction (i.e. only 2 predictors and their interaction -- in this example, `TalentPipeline`, `HireDate`, and their interaction)? I am looking at other covariates as well but may not include them when making an overall recommendation, unless it's important for a fair `TalentPipeline` comparison. Thanks! – naive_bayesian Dec 01 '20 at 21:33
@naive_bayesian in terms of subsetting data, you don't know until you try. You generally lose power by omitting cases, but if there are big extremes in proportions of `TalentPipeline` types over calendar time you might find it easier to justify. In particular, if at extreme ranges of calendar time one or the other `TalentPipeline` type is missing, then those extreme calendar-time ranges should be omitted _if your interest is comparing the types of_ `TalentPipeline`. – EdM Dec 01 '20 at 22:54
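
A crude way to find the calendar-time range where both types are present might look like the sketch below. It only trims the ends (the span between the latest first hire and the earliest last hire across types) and does not check for gaps inside that span. The record layout `(pipeline, hire_date)`, the function names, and the sample data are assumptions for illustration.

```python
from datetime import date

def overlap_window(records):
    """Return (start, end) of the hire-date span covered by every pipeline type.

    records: iterable of (pipeline_label, hire_date) pairs.
    Returns None if there are no records or the types never overlap.
    """
    first, last = {}, {}
    for pipeline, hired in records:
        first[pipeline] = min(first.get(pipeline, hired), hired)
        last[pipeline] = max(last.get(pipeline, hired), hired)
    if not first:
        return None
    start = max(first.values())  # latest "first hire" across types
    end = min(last.values())     # earliest "last hire" across types
    return (start, end) if start <= end else None

def restrict(records, window):
    """Keep only records whose hire date falls inside the window."""
    start, end = window
    return [r for r in records if start <= r[1] <= end]

# Hypothetical data: Pipeline A dominates early, Pipeline B late.
records = [
    ("A", date(2015, 1, 10)), ("A", date(2016, 5, 2)),
    ("A", date(2017, 8, 15)), ("A", date(2019, 9, 1)),
    ("B", date(2017, 3, 1)), ("B", date(2018, 11, 20)),
    ("B", date(2020, 2, 14)), ("B", date(2021, 7, 30)),
]
window = overlap_window(records)
kept = restrict(records, window) if window else []
print(window, len(kept))
```

Comparing a fit on `kept` with a fit on all records is one way to see how much the extreme calendar-time ranges drive the pipeline comparison.
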
@naive_bayesian for group comparison you run the risk that differences you attribute to groups (like types of `TalentPipeline`) are really due to differences in covariate values among the groups that might not hold in future cases. So in your situation a correct group comparison is actually a prediction based on _correcting for covariate values_. If you have vetted the full model and find that the simple one is close enough, you could present the simpler one to others and note that it is a simplification of a more complicated, well-vetted model. – EdM Dec 01 '20 at 23:04
Thank you so much. After fitting a fuller model I'm not as fortunate with regard to PH. `TalentPipeline` meets PH until the interaction with `HireDate` is added, after which `TalentPipeline`, `HireDate`, and their interaction all fail PH (tested with `cox.zph()` in R at alpha = 0.10). As I understand it, the typical remedy for continuous variables violating PH is to add a time-related interaction term (especially when an estimate is needed for that variable, as in this case) -- any advice on what else I can do? Or does the fact that I added the interaction justify proceeding as if PH were met? – naive_bayesian Dec 02 '20 at 11:21
@naive_bayesian use your judgment and knowledge of the subject matter. A "significant" deviation from PH [might not be big enough to matter](https://stats.stackexchange.com/a/61136/28500), or might be fixed by [transformation of continuous predictors](https://stats.stackexchange.com/q/379416/28500). The time-dependent vignette for the [R `survival` package](https://cran.r-project.org/package=survival) shows alternatives to time-related interaction terms and their correct implementation if needed. [Frank Harrell](https://hbiostat.org/) provides advice in sections of his RMS documents. – EdM Dec 02 '20 at 17:54
