Interpretation of p values for each level of a covariate?

Question

In the output from a cox regression I get several p values from one variate - one for each level of the variate bar one level. I want to know the effect of the variate on the outcome (adjusted by other variates) and therefore one p value would suffice. (I can't seem to use anova because I'm using the R survival function survSplit - although that would be good). I am therefore stuck with trying to interpret the p values for individual levels of the variate.

My question is - what exactly do they mean ? - and when do I know that the variate has a significant effect on the outcome ? Do all the p values for each level of the variate have to be significant to conclude that the variate has a significant effect ?

The p values seem to be calculated with reference to one of the levels of the variate (presumably treated as a reference level) - but is that useful ? As an example the outcome could be death with genetic-time groups given by survSplit and the variate of interest could be an ordinal variable e.g. the nodal status of a patient.

In regression I normally think of whole variates, not different levels, as having a regression coefficient. Does that mean that here each level of a variate has its own regression coefficient which is independent of all levels apart from the reference level ? and if it is significant it can be interpreted as having an association with the outcome/time (=hazard) that is relative to the reference level's association with the outcome/time ? (note that this doesn't seem very useful if I don't know what the reference level's association with outcome/time actually is !?)

Unfortunately anova.rms and base anova will not work directly (at least with my data) on one R survival coxph(survSplit) object, presumably because the groups:strata(time_group) parameter which has two levels always gives one of these levels (for each time) with a line of NAs and se(coef) of zero: as in the following results from model

coxph(Surv(tstart, time, BCSSsplit) ~ groupsplit:strata(time_group) + Nodal_status_CATEGOR_CATEGOR, data = BCSSsplitdata)                                                            
                                                 se(coef)      z        p
Nodal_status_CATEGOR_CATEGORa                   2.959e-01  2.071   0.0384
Nodal_status_CATEGOR_CATEGORb                   2.698e-01  5.643 1.67e-08
Nodal_status_CATEGOR_CATEGORc                   2.964e+03 -0.005   0.9959
groupsplitHer2+:strata(time_group)time_group=1  2.774e-01 -4.112 3.92e-05
groupsplitTNBC:strata(time_group)time_group=1   0.000e+00     NA       NA
groupsplitHer2+:strata(time_group)time_group=2  3.930e-01  2.563   0.0104
groupsplitTNBC:strata(time_group)time_group=2   0.000e+00     NA       NA

The error given is Error in .rowNamesDF<-(x, value = value) : invalid 'row.names' length

anova will however compare two such objects - so I'll have to settle for that unless someone knows a method whereby anova will handle such a coxph object. (The time split (or other adjustment) is necessary following Schoenfeld testing for proportional hazards.)

Does this answer your question? [How to test the statistical significance for categorical variable in linear regression?](https://stats.stackexchange.com/questions/31690/how-to-test-the-statistical-significance-for-categorical-variable-in-linear-regr) Although the question is about linear regression, the principles are the same for your Cox regression. The `anova()` function in the R `rms` package provides a way to evaluate the overall significance of a categorical predictor. There are special ways to deal with ordinal categorical predictors. — EdM, Dec 29 '20 at 14:08
"With Cox regression each level has two different means or values according to whether the outcome is 0 or 1" isn't really correct. Each coefficient for a non-reference level of a categorical predictor in Cox regression represents the difference in log-hazard between that level and the reference level. You can think about those coefficients as playing the roles of the regression coefficients and associated estimated means in linear regression, with similar interpretations in terms of interpreting p-values. — EdM, Dec 29 '20 at 15:37
"this doesn't seem very useful if I don't know what the reference level's association with outcome/time actually is"--but that is just what you get with a Cox regression. There is a baseline hazard function over time, representing a situation with all predictors at reference levels. Then the regression coefficients represent the log-hazard differences from that baseline. If you want a closed-form survival estimate, you need to use a parametric regression rather than the Cox semi-parametric regression. — EdM, Dec 29 '20 at 19:59
I think that you might be having some more fundamental problem with your analysis for which you need help, before you worry about the p-values. In particular, it's not clear how and why you are using `survSplit()`, for example when you say "genetic-time groups given by survSplit". Also, I have no problem in doing `anova()` on models built on a dataset re-formatted by `survSplit`, whether in the base `survival` package or in `rms` (provided that I use the `rms` function `cph()` for the analysis with `rms`). More details about your specific application would help a lot. — EdM, Dec 29 '20 at 21:18

EdM · Answer 1 · 2020-12-30T18:03:14.443

Most of what's going on here is the same for any type of regression involving categorical predictors having more than 2 levels. Some is specific to Cox proportional-hazards regressions, noted at the end.

With treatment coding of predictors, the default in R, the regression coefficient for each predictor represents the associated difference from a baseline. In ordinary least squares, that baseline is the intercept, the estimated mean value when continuous predictors have values of 0 and categorical predictors are at their reference values.

In a Cox regression, the baseline is an empirical baseline (log)hazard as a function of time. That is the "reference level's association with outcome/time" you seek. The predictor coefficients in Cox regression are the differences from that in log-hazard.

Although you say "In regression I normally think of whole variates, not different levels, as having a regression coefficient," that's not correct for a multi-level categorical predictor. A categorical predictor with $k$ levels is treated as a set of $k-1$ predictors, so you get that many coefficients. Under treatment coding, each coefficient represents the difference of a level from the reference level. If you have an ordinal predictor, in R you can specify it as ordered and the internal coding will be changed from treatment coding to a polynomial-contrast coding that respects the ordering of the levels. Again, that's true whether you are doing OLS, Cox regressions, or generalized linear modeling.

So, when you ask:

Does that mean that here each level of a variate has its own regression coefficient which is independent of all levels apart from the reference level? and if it is significant it can be interpreted as having an association with the outcome/time (=hazard) that is relative to the reference level's association with the outcome/time?

the answer under treatment coding is almost "Yes": except that the coefficient estimates might have covariances that need to be taken into account in testing and obtaining confidence intervals for predictions.

This answer shows how to use likelihood-ratio tests to estimate the overall association of a multi-level categorical predictor with outcome. An alternative, implemented in the R rms package, is to use Wald tests to evaluate overall associations of a multi-level predictor, including its interaction terms, with outcome.

There is no problem in general for doing such analyses on time-stratified survival data. The problem is that you specify an interaction between the time strata and the breast-cancer types with :, without a main-effect predictor for the breast cancer types. I understand that the example in the survival package documentation also does that, but it can be prone to error.

As you found, you get an error when you try to perform anova() on such a model; the rms package won't even accept that coding. Try with the * syntax, which provides the main effects as well as the interaction terms. Here's an example:

> library(rms) # also loads survival package and veteran dataset
> vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180), episode= "tgroup", id="id")
> vfit2 <- cph(Surv(tstart, time, status) ~ trt + prior + karno*strat(tgroup), data=vet2)
> anova(vfit2)
                Wald Statistics          Response: Surv(tstart, time, status) 

 Factor                                        Chi-Square d.f. P     
 trt                                            0.00      1    0.9535
 prior                                          0.09      1    0.7642
 karno  (Factor+Higher Order Factors)          62.07      3    <.0001
  All Interactions                             18.86      2    0.0001
 karno * tgroup  (Factor+Higher Order Factors) 18.86      2    0.0001
 TOTAL                                         63.70      5    <.0001

Many thanks for the discussion. Interpretation of level coefficients still seems quite complex (if the parameter doesn't have an easy-to-interpret baseline reference) - but I might do this in the future. In the meantime I've found that {car} Anova does act on one coxph model using survSplit data (whereas the other anova functions mentioned do not) - so temporarily the coding problem is solved. — Abiologist, Jan 02 '21 at 21:28

Interpretation of p values for each level of a covariate?

1 Answers1