When (and why) is a conditional logistic regression equivalent to a Cox proportional hazards model?

Question

In the help for the clogit function in the survival package in R, the details section starts with:

It turns out that the loglikelihood for a conditional logistic regression model = loglik from a Cox model with a particular data structure. Proving this is a nice homework exercise for a PhD statistics class; not too hard, but the fact that it is true is surprising.

Does anyone on here know (a) what is that data structure, and (b) why this is the case?

Closely related: I [recently](http://stats.stackexchange.com/questions/202348/statistical-methods-for-data-where-only-a-minimum-maximum-value-is-known/202375#202375) showed that logistic regression (i.e., logit link, not cloglog) with log(time) as a covariate produces the same model (up to a nonlinear transformation of baseline parameters) as a proportional odds survival regression model with a log-logistic baseline for current status data. — Cliff AB, Mar 25 '16 at 23:17
I haven't worked it out, but I suspect replacing the logit link with the cloglog will result in a proportional hazards model (definitely true) with a Weibull baseline (not sure about that). — Cliff AB, Mar 25 '16 at 23:18
Conditional logistic regression with only one response per stratum is equivalent to the Cox model when there are no ties between the event times in the Cox model. More generally, the models are equivalent when there are, say, $k_i$ responses in stratum $i$ in conditional logistic regression and a $k_i$-way tie at event time $t_i$ in the Cox model. — g.s, Feb 15 '18 at 05:36
One can think of duration data as matched case-control data, where cases are matched on the periods of risk. An example dataset can be found here (p. 10 and below): http://psfaculty.ucdavis.edu/bsjjones/discreteslides.pdf — dzeltzer, Mar 25 '16 at 23:06

score 3 · Answer 1 · answered Feb 10 '16 at 20:58

Both conditional logistic regression and survival analysis are forms of semi-parametric inference in which complex, unmeasured quantities (such as a baseline hazard function or unmeasured risk factors) are controlled for by organizing data into risk sets.

Formally, a risk set in survival analysis is the collection of individuals at risk for the event at each time point at which a failure is observed. The distribution of measured risk factors in the surviving cohort is compared to those of the individual who failed at the event time. This ratio allows us to control for the complex, unmeasured baseline hazard function, which the other factors modify multiplicatively through a hazard ratio. We ignore the amount of time that actually elapses between failure times, and consider each successive risk set to be at incrementally "greater risk" by some unknown amount due to its longer duration of follow-up.
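As a sketch of the risk-set comparison just described, assuming no tied failure times, the Cox partial likelihood can be written as follows. The notation is mine, not the original answer's: $t_1 < \dots < t_J$ are the observed failure times, $R(t_j)$ is the risk set at $t_j$, $x_{(j)}$ is the covariate vector of the individual who fails at $t_j$, and $h_0$ is the baseline hazard.

$$
L_{\text{Cox}}(\beta)
= \prod_{j=1}^{J} \frac{h_0(t_j)\exp(x_{(j)}^\top \beta)}{\sum_{l \in R(t_j)} h_0(t_j)\exp(x_l^\top \beta)}
= \prod_{j=1}^{J} \frac{\exp(x_{(j)}^\top \beta)}{\sum_{l \in R(t_j)} \exp(x_l^\top \beta)}
$$

The baseline hazard cancels within each risk set, and only the ordering of the failure times enters, not the time elapsed between them.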

A conditional logistic regression does not have a risk set, per se, but a matched set: a group of individuals among whom all unmeasured risk factors are assumed to be the same. Conditional logistic regression iteratively predicts the cumulative risk of events in each matched set, insofar as matched sets can be ranked in terms of their unmeasured risk. Each ranked matched set is then treated like a risk set in a Cox model, and the odds ratios for events are calculated using the same partial likelihood as in the Cox model. Using predictions from these estimated odds ratios, the ranking is updated to account for what is now known about the matched sets' risk due to unmeasured factors (since the updated predictions take better account of the measured risk factors through the odds ratios). This process iterates until there is agreement (or convergence) using an expectation-maximization framework. This is why clogit takes so much longer to converge than a simple Cox model.

Formally, because there is a "little bit of estimation" in terms of risk of unmeasured factors in the conditional logistic regression, this method is a "conditional likelihood" maximization whereas the Cox Model is a "partial likelihood" maximization.
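To make the comparison concrete, here is a sketch for the simplest case of exactly one case per matched set. The notation is an assumption for illustration: $M_i$ is matched set $i$, $\alpha_i$ is its unmeasured set-specific intercept, $x_l$ are the covariates of member $l$, and $c_i$ indexes the case. Under a logistic model with $\operatorname{logit} P(y_l = 1) = \alpha_i + x_l^\top \beta$, conditioning on there being exactly one case in the set gives

$$
P(\text{case is } c_i \mid \text{one case in } M_i)
= \frac{\exp(\alpha_i + x_{c_i}^\top \beta)}{\sum_{l \in M_i} \exp(\alpha_i + x_l^\top \beta)}
= \frac{\exp(x_{c_i}^\top \beta)}{\sum_{l \in M_i} \exp(x_l^\top \beta)}
$$

This has the same form as the contribution of a single failure time to a Cox partial likelihood whose risk set is $M_i$, with $\alpha_i$ cancelling in the same way a baseline hazard would. With $k_i$ cases in a set, the numerator and denominator would instead run over subsets of size $k_i$.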

So

Data structure) risk sets / matched sets

Why) both account for unmeasured sources of risk.

Thanks. While I appreciate your answer and @Bjorn's answer, I'm looking for something that shows the data structure a bit more explicitly. So, for example, some R / SAS / whatever code that constructs the appropriate dataset and shows that the two estimates are the same, or equations showing how the two parameterizations compare to each other. Again, thanks -- I really do appreciate the time you spent on this. — Jake Fisher, Feb 13 '16 at 16:12
Your question is conceptual and this is a conceptual answer. You should read the help files, attempt the regression, and work through debugging on the stackoverflow forum for help with the R or SAS. — AdamO, Apr 08 '21 at 18:41

score 2 · Answer 2 · answered Feb 10 '16 at 18:36

The data structure is one in which all observed event times are tied at the same time, and the equivalence holds only for one particular handling of these ties, namely the one that ensures that the test matches up with the standard log-rank test. This handling assumes that tied event times truly indicate events happening at exactly the same time, as opposed to events that occurred at some earlier point but were only observed/identified/diagnosed at the recorded event time. When two groups are being compared, both methods use the hypergeometric distribution to compare the groups, and the effect sizes are parameterized the same way.
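The data structure described above can be checked numerically. The following is a minimal Python sketch, not the survival package's implementation: the toy data, function names, and intercept values are all hypothetical. It builds the layout that, as far as I know, the clogit help describes (a constant dummy time, status 1 = case / 0 = control, one stratum per matched set, exact handling of ties) and compares a Cox exact partial log-likelihood on that layout with a conditional logistic log-likelihood computed from first principles.

```python
from itertools import combinations
from math import exp, log

# Hypothetical toy data, invented for illustration: (matched_set, case, x)
rows = [
    ("A", 1, 0.5), ("A", 0, -1.0), ("A", 0, 2.0),
    ("B", 1, 1.5), ("B", 1, 0.0), ("B", 0, -0.5), ("B", 0, 1.0),
]

# The "particular data structure": every subject gets the same dummy time,
# status is the case indicator, and each matched set is its own stratum.
records = [(s, 1.0, y, x) for s, y, x in rows]  # (stratum, time, status, x)


def pattern_prob(p, case_idx):
    """Probability that exactly the subjects in case_idx are the cases."""
    prob = 1.0
    for i, p_i in enumerate(p):
        prob *= p_i if i in case_idx else 1.0 - p_i
    return prob


def clogit_loglik(rows, beta, alphas):
    """Logistic model with a per-set intercept, conditioned on the
    observed number of cases in each matched set."""
    total = 0.0
    for s in sorted({r[0] for r in rows}):
        members = [(y, x) for g, y, x in rows if g == s]
        p = [1.0 / (1.0 + exp(-(alphas[s] + beta * x))) for _, x in members]
        observed = tuple(i for i, (y, _) in enumerate(members) if y == 1)
        # Sum over every way of choosing the same number of cases.
        denom = sum(pattern_prob(p, c)
                    for c in combinations(range(len(members)), len(observed)))
        total += log(pattern_prob(p, observed) / denom)
    return total


def cox_exact_loglik(records, beta):
    """Stratified Cox partial log-likelihood with exact (discrete) ties."""
    total = 0.0
    for s in sorted({r[0] for r in records}):
        stratum = [r[1:] for r in records if r[0] == s]
        for t in sorted({time for time, d, _ in stratum if d == 1}):
            risk_x = [x for time, _, x in stratum if time >= t]
            event_x = [x for time, d, x in stratum if time == t and d == 1]
            # Sum over all subsets of the risk set with as many events.
            denom = sum(exp(beta * sum(c))
                        for c in combinations(risk_x, len(event_x)))
            total += beta * sum(event_x) - log(denom)
    return total


alphas = {"A": -2.0, "B": 0.7}  # arbitrary values; they cancel on conditioning
for beta in (-1.0, 0.0, 0.3, 2.0):
    print(beta, clogit_loglik(rows, beta, alphas), cox_exact_loglik(records, beta))
```

The two columns should agree up to floating-point error for every value of beta, and changing the values in `alphas` should not change the conditional logistic column. This mirrors the way the baseline hazard drops out of the Cox partial likelihood. The subset enumeration is only practical for small matched sets, which gives one plausible reason the exact method is slow.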

Maybe you can improve this answer by explaining the semi-parametric component (log rank test is non-parametric), and how clogit is a conditional likelihood whereas Cox models consider a partial likelihood, and lastly how a survival model considers *time* whereas clogit has no notion of time. — AdamO, Feb 10 '16 at 19:22
