
I have to work with Cox regression but I don't fully understand how it works. So I created a very basic fake data sample and tried to fit lifelines' `CoxPHFitter` (Python) to it.

Here is my sample:

[image: the sample data table, with columns for alcohol, cigarettes, sport, body height, survival, and death]

I'm assuming "alcohol and cigarettes predict more deaths, while sport helps keep people healthy and body height has no impact", which seems to make sense.

But when I run it with:

```python
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col='survival', event_col='death')
cph.print_summary()
cph.plot()
```

I just can't understand the logic of the results I get:

[image: output of cph.print_summary()]

[image: cph.plot() — log(HR) point estimates with 95% CIs for each covariate]

Could anyone explain how to interpret these results? In particular, why does alcohol appear to have the opposite effect to cigarettes? That is not what I would deduce from my dataset.

Could you also explain the coef column and the log(HR) (95% CI) values?

user650108
    Did you check the correlation between cigarette and alcohol (0.98) beforehand? – chl Oct 11 '20 at 19:17
  • Thanks chl for taking the time. I don't understand why this strong correlation between cigarettes and alcohol doesn't put them on the same side of the plot (i.e., both on the right)? – user650108 Oct 11 '20 at 19:55
  • This might be related to [collinearity](https://stats.stackexchange.com/q/1580/930). Is this all the data? I'm surprised you could estimate all parameters since it fails for me in both R and Stata. – chl Oct 12 '20 at 08:36
  • Yes, I ran the model with this small dataset. To be honest, I got a `ConvergenceWarning: Newton-Rhaphson failed to converge sufficiently.` warning. But it shows me the results anyway. – user650108 Oct 12 '20 at 13:46
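
A quick way to run the check @chl suggests, before fitting (a minimal sketch with made-up numbers standing in for the table above):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's data: alcohol tracks cigarettes closely
rng = np.random.default_rng(0)
cigarettes = rng.poisson(10, size=50)
alcohol = cigarettes + rng.normal(0, 1, size=50)

df = pd.DataFrame({'cigarettes': cigarettes, 'alcohol': alcohol})
print(df.corr())  # off-diagonal values near 1 flag collinearity
```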

1 Answer


Your specific result has little to do with Cox regression itself but a lot to do with regression in general, when predictors are correlated. As @chl notes, in your data (and also in real life) smoking behavior and alcohol intake are highly correlated.

A search on this site for "correlated predictor" just turned up 2500 hits. This is a very common situation. In addition to the post that @chl linked to in a comment, you might want to look at this thread on how adding predictors can make a previously "significant" predictor seem to become "insignificant," or this thread about apparently opposite behavior when predictors are added to a model.

There are a few issues potentially at play here:

- Sometimes the true effect of a predictor is only seen when other predictors associated with the outcome are taken into account.
- (Maybe more relevant in your case) if two predictors are highly correlated both with each other and with the outcome, a regression model can't know which of them to give "credit" to. It will effectively give it to whichever happens to fit best in your data set, or in some cases deem both "insignificant" if they are too highly related; the sketch after this list illustrates this.
- Remember that a linear regression model can be thought of much like a Taylor expansion of a function, limited to first-order terms in your example. With correlated predictors, one of them might get too much credit from its linear approximation, and a coefficient of opposite sign on the other can act as a correction for that over-estimation.
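
To make the "credit" problem concrete, here is a minimal simulation sketch (made-up data and effect sizes, not your table) in which two nearly collinear predictors share a single true effect. The unpenalized fit splits the effect between them unstably, with inflated standard errors; lifelines' `penalizer` argument adds a ridge penalty that stabilizes the split:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 500
smoke = rng.normal(size=n)
alcohol = smoke + 0.1 * rng.normal(size=n)  # correlation ~ 0.99 with smoke
sport = rng.normal(size=n)

# True hazard depends on smoke + alcohol jointly and on sport
log_hazard = 1.0 * smoke + 1.0 * alcohol - 0.5 * sport
T = rng.exponential(scale=np.exp(-log_hazard))  # exponential survival times
df = pd.DataFrame({'smoke': smoke, 'alcohol': alcohol, 'sport': sport,
                   'survival': T, 'death': 1})  # no censoring, for simplicity

# Unpenalized: the two near-twins split the credit, with wide standard errors
cph = CoxPHFitter().fit(df, duration_col='survival', event_col='death')
print(cph.summary[['coef', 'se(coef)']])

# Ridge penalty shrinks the unstable split between the correlated pair
cph_pen = CoxPHFitter(penalizer=0.1).fit(df, duration_col='survival',
                                         event_col='death')
print(cph_pen.summary[['coef', 'se(coef)']])
```

Something like this, amplified by your very small sample, is a plausible source of both the sign flip and the convergence warning mentioned in the comments.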

As your question was about Cox regression, note that this is an even bigger problem there than for standard linear regression. In standard linear regression, the types of problems noted above arise when you omit from a model a predictor that is associated both with the outcome and with the included predictors. In Cox regression, as in logistic regression, omitting an outcome-related predictor can bias results even if it isn't at all correlated with the included predictors.
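
This is the non-collapsibility of the hazard ratio, and you can simulate it directly. A hedged sketch (again made-up data): `z` below affects survival but is generated independently of `x`, yet omitting `z` still shrinks the estimated log-HR for `x` toward zero:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(7)
n = 5000
x = rng.binomial(1, 0.5, size=n)  # predictor of interest
z = rng.binomial(1, 0.5, size=n)  # independent of x, strongly outcome-related
T = rng.exponential(scale=np.exp(-(1.0 * x + 2.0 * z)))
df = pd.DataFrame({'x': x, 'z': z, 'survival': T, 'death': 1})

full = CoxPHFitter().fit(df, duration_col='survival', event_col='death')
omit = CoxPHFitter().fit(df[['x', 'survival', 'death']],
                         duration_col='survival', event_col='death')

print(full.summary.loc['x', 'coef'])  # close to the true value of 1.0
print(omit.summary.loc['x', 'coef'])  # noticeably attenuated toward 0
```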

EdM