Log-rank / Cox analysis with very unequal sized groups: alternative calculations of p-value?

Question

I would like to set up a series of tests on the difference in survival between two very unequal sized groups.
Generally either a log-rank test (using the R survdiff function) or a Cox regression (R coxph) with stratified patient variables works well. However, in some cases one group is small and the event has a relatively low incidence, which makes the expected number of events very small. In those circumstances, it does not seem sensible to use the p-value generated by the log-rank test, since this is based on a chi-squared test, which is inappropriate for small numbers of expected events (surprisingly, R does not give a warning message for this). Taking an admittedly fairly extreme example to illustrate:

survdiff(formula = survobject ~ (Fixation == i), data = TKRGroup)
n=637763, 424 observations deleted due to missingness.

                         N Observed Expected (O-E)^2/E (O-E)^2/V
Fixation == i=FALSE 637725    11174 1.12e+04  5.52e-04      11.9
Fixation == i=TRUE      38        3 5.17e-01  1.19e+01      11.9

Chisq= 11.9  on 1 degrees of freedom, p= 0.000555

Cox regression gives a higher p-value of 0.0023, though it still looks rather on the low side for these values of observed and expected events.

coxph(formula = survobject ~ (Fixation == i), data = TKRGroup)

                  coef exp(coef) se(coef)    z      p
Fixation == iTRUE 1.76       5.8    0.577 3.05 0.0023

Further summary information gives

Likelihood ratio test= 5.58  on 1 df,   p=0.01813
Wald test            = 9.27  on 1 df,   p=0.002325
Score (logrank) test = 11.92  on 1 df,   p=0.0005543

At this point, I could do with some expert advice on which, if any, of these p-values to use, or whether there is some alternative approach available (preferably available within an R package!). Given the size of the groups, I rather naively attempted to get some idea of a sensible p-value by applying a Poisson exact test to the observed and expected figures: observed / expected values of 3 / 0.517 would give a cumulative Poisson P(X ≥ 3) = 0.0157. That seems a much more reasonable figure, though I am not sure I could defend it.
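
For reference, the Poisson tail probability quoted above can be reproduced in base R. This is only a sketch of that arithmetic: it treats the expected count of 0.517 as a fixed, known rate and ignores any uncertainty in it, which is an assumption rather than something the log-rank output guarantees.

```r
# Observed and expected events in the small group, taken from the survdiff output
observed <- 3
expected <- 0.517

# Upper tail P(X >= 3) for X ~ Poisson(0.517), i.e. 1 - P(X <= 2)
p_tail <- ppois(observed - 1, lambda = expected, lower.tail = FALSE)

# The same one-sided test via stats::poisson.test, with the expected count
# supplied as the hypothesised rate r over a time base T = 1
pt <- poisson.test(observed, T = 1, r = expected, alternative = "greater")

round(p_tail, 4)
pt$p.value
```

Both calls compute the same quantity, so they should agree with the 0.0157 figure.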

Is there some reason that you care about the p-value itself? Have you considered bootstrap re-sampling/analysis? — EdM, Sep 21 '14 at 14:54
I'm afraid a lot of attention is paid to whether the groups have "significantly" different survival, whatever arbitrary thresholds for significance are set (a whole topic for debate there...). Actually in this case I use two thresholds for "Alert" (p < 0.05) and "Alarm" (p < 0.001). The choice of method will determine the category, and in an determine the — Knackiedoo, Sep 22 '14 at 15:48
Is this to be used to set "Alert" and "Alarm" warnings for future predictions (e.g., for new patients)? — EdM, Sep 22 '14 at 15:55
Exactly. (I was just going to write that when I accidentally submitted!). I will look into resampling, though any pointers as to how to implement that in this case would be very welcome! — Knackiedoo, Sep 22 '14 at 16:01
The chi-square approximation to the null distribution won't apply to a cell with expected count so low. You can still use the chi-square statistic but (for example) use simulation under the assumptions of the null to get the null distribution (and hence the p-value) - much as the `chisq.test` function can do. However, you may still find the p-value coming out pretty low; getting an observed up as far as '3' when the expected is quite low is still going to be fairly surprising. — Glen_b, Sep 22 '14 at 22:54
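
A minimal sketch of the simulation idea in the last comment, using a permutation test: keep the log-rank chi-square statistic, but build its null distribution by shuffling the group labels instead of relying on the chi-squared approximation. The data here are simulated and purely hypothetical (they are not the TKRGroup data), the variable names are made up for illustration, and the approach assumes the censoring pattern does not depend on group, so that labels are exchangeable under the null.

```r
library(survival)

set.seed(1)

# Hypothetical data: a large reference group and a very small group of interest
n_big <- 2000
n_small <- 20
event_time <- rexp(n_big + n_small, rate = 0.02)
cens_time <- runif(n_big + n_small, min = 0, max = 5)  # heavy right censoring
dat <- data.frame(
  time = pmin(event_time, cens_time),
  status = as.integer(event_time <= cens_time),
  grp = rep(c(FALSE, TRUE), times = c(n_big, n_small))
)

# Log-rank chi-square statistic for a given data frame
logrank_stat <- function(d) {
  survdiff(Surv(time, status) ~ grp, data = d)$chisq
}

obs_stat <- logrank_stat(dat)

# Null distribution: shuffle group labels, recompute the statistic each time
B <- 500
perm_stats <- replicate(B, {
  d <- dat
  d$grp <- sample(d$grp)
  logrank_stat(d)
})

# Permutation p-value, with the usual +1 correction so it is never exactly zero
p_perm <- (1 + sum(perm_stats >= obs_stat)) / (B + 1)
p_perm
```

On a data set with hundreds of thousands of rows each refit is slow, so the number of permutations B would need to be chosen with the "Alarm" threshold in mind: resolving p < 0.001 needs well over 1000 permutations.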

AdamO · Answer 1 · 2014-09-22T17:28:29.503

In these kinds of comparisons, you'll find that what happens is a two-sample test becomes very approximately a one sample test where all the power comes from the smaller group (they are being "calibrated" to the larger group), and so the assumptions behind sample sizes in 1 sample tests apply for that group. 3 deaths does not suffice to estimate a Cox model. Survival models are driven by the numbers of events, not the denominator.

If there is no censoring in these data, you can condition upon the failures observed after a fixed point and compare survival by looking at the proportions which did not survive beyond that fixed point. It is a basic proportions test of a contingency table and is achievable via Fisher's Exact Test, which is accurate in small samples.

$$ \begin{array}{ccc} & \mbox{Died} & \mbox{Lived} \\ \overline{\mbox{Fix} }& 11,174 & 626,551\\ \mbox{Fix} & 3& 35\\ \end{array} $$

The benefit of using an exact test is that it effectively answers the question: "what is the probability I may have seen 0, 1, 2, or 3 deaths out of 38 in the Fix group, given that my expected death rate is $11174 / 637725 \approx 0.02$?" The effect of the large non-Fix group is that the variability in the expected rate will be very low and almost entirely determined by those data.
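
As a rough sketch, the table above can be passed to base R's `fisher.test`. This assumes, as stated, that there is no censoring, so each patient is simply counted as died or lived. The one-sided alternative (more deaths in the Fix group) is a choice made for this illustration. The binomial tail is included only to show the "one-sample" reading, with the death rate of the large group treated as known.

```r
# 2x2 table from the answer: rows are groups, columns are outcomes
tab <- matrix(
  c(3, 35,
    11174, 626551),
  nrow = 2, byrow = TRUE,
  dimnames = list(Group = c("Fix", "NonFix"), Outcome = c("Died", "Lived"))
)

# Fisher's exact test; "greater" tests for higher odds of death in the Fix group
ft <- fisher.test(tab, alternative = "greater")
ft$p.value

# One-sample approximation: 3 or more deaths out of 38 at the large group's rate
rate_nonfix <- 11174 / 637725
p_binom <- pbinom(2, size = 38, prob = rate_nonfix, lower.tail = FALSE)
p_binom
```

Because the non-Fix group is so large, the two p-values should be nearly identical, which is the "calibration" point made above.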

Thanks. Unfortunately the data is heavily right censored, which is why I want to use all the available data in a Cox model. Some would argue that the p-value on log-rank test is meaningful even with zero events http://stats.stackexchange.com/a/91509 . However, the smallness of this group is exactly the conundrum; the significance of a difference in survival appears to have an exaggerated level of significance with small numbers. I could put an arbitrary lower limit on event number because the p-value is apparently unreliable, but that is rather ducking the problem! — Knackiedoo, Sep 23 '14 at 20:40
Have you tried a case-cohort design, matching individuals in the smaller group to individuals in the larger group on as many matching factors as possible, e.g. age, sex, prognosis, etc.? — AdamO, Sep 23 '14 at 21:05

score 1 · Answer 2 · answered Sep 22 '14 at 16:38

Do not pay too much attention to the p-values. They just provide a probability that you might accidentally have found a survival difference in your particular study sample when there really isn't a difference in the population as a whole.

You evidently want to use predictor variables for each new patient to classify relative risk. It's better to base that classification on an estimate of each new patient's predicted survival or equivalent (like time to some undesired event), and also on the reliability of the estimate, rather than on the p-values from analyses of your study sample.

There are simple predict functions for R coxph and survreg objects, but you will be better off learning to use the rms package in R, which provides ways to validate your model and even build nomograms for prediction. The author, Frank Harrell, is a regular contributor to this site.
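
A minimal sketch of the simple `predict` route for a coxph object, using the `lung` data set that ships with the survival package rather than the data in the question. The two "new patients" and the covariates chosen are hypothetical. The rms workflow mentioned above is not shown here.

```r
library(survival)

# Fit a Cox model on the example lung data (status: 1 = censored, 2 = dead)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hypothetical new patients to classify
new_patients <- data.frame(age = c(50, 70), sex = c(2, 1))

# Linear predictor with its standard error; exp() gives risk relative to the
# model's reference (the covariate means of the fitting sample)
pred <- predict(fit, newdata = new_patients, type = "lp", se.fit = TRUE)
relative_risk <- exp(pred$fit)

# Predicted probability of surviving beyond 365 days for each new patient
sf <- survfit(fit, newdata = new_patients)
surv_1yr <- summary(sf, times = 365)$surv

relative_risk
pred$se.fit
surv_1yr
```

An "Alert" or "Alarm" rule could then be defined on the predicted survival and its uncertainty rather than on a p-value from the study sample.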
