When finding significance in parts of a categorical variable during stepwise selection, do you include the entire variable?

Question

I'm doing stepwise variable selection, and for some categorical variables only some of the levels show a significant association. For example, in the variable "cancer type", type 4 is significant but types 1, 2 and 3 are not.

When I build my final regression model from all significant variables (I have set the threshold at p < 0.3), do I include the entire variable "cancer type", or only cancer type = 4? If the latter, how do I include only that one cancer type in Stata (bonus question)?

score 2 · Accepted Answer · answered Jul 01 '19 at 20:54

The best answer to your question is: avoid stepwise selection.

Even in standard linear regression, stepwise selection of predictors is not a good strategy, as explained for example on this page. Avoiding it is even more important in survival analysis (which seems to be your situation), because omitting any predictor that is associated with survival will bias the coefficients of the remaining predictors, even if they are not correlated with the omitted predictor. See this page for some introduction to this problem.

With survival analysis you are typically best off including as many predictors reasonably related to survival as you can without overfitting. In standard Cox modeling that typically means limiting yourself to one predictor per 10-20 survival "events" (deaths, recurrences); alternatively, use a penalized method like ridge regression. (Stata might not support ridge regression for survival models, however.)
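As a rough illustration of that rule of thumb, here is a minimal Python sketch that turns an event count into a predictor budget. The function name `max_predictors` and the figure of 85 events are made up for the example. Note that a categorical variable with k levels uses k - 1 coefficients, so a 4-level "cancer type" counts as 3 predictors against this budget.

```python
def max_predictors(n_events, events_per_predictor=(10, 20)):
    """Return (conservative, liberal) predictor budgets.

    Applies the 10-20 events-per-predictor rule of thumb:
    the conservative budget uses 20 events per predictor,
    the liberal budget uses 10.
    """
    low, high = events_per_predictor
    return n_events // high, n_events // low


# Hypothetical study with 85 observed events (deaths or recurrences).
n_events = 85
conservative, liberal = max_predictors(n_events)
print(f"{n_events} events: roughly {conservative} to {liberal} predictors")
```

This is only a heuristic for sizing the model, not a substitute for checking overfitting directly.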

If your "types" of cancer are clinically defined Stages, remember that each higher Stage is related to overall worse outcome based on studies of hundreds to thousands of patients. It's quite possible that a small study wouldn't have the power to distinguish survival of, say, Stage II from Stage III. If you have small numbers in each of 4 cancer Stages it might make sense--before you look at the survival results--to combine nearby Stages into groups (e.g., I+II versus III+IV) so that you compress the Staging information into only 2 categories.
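A minimal sketch of that kind of pre-specified grouping, in Python, assuming the Stage is recorded as a Roman-numeral string. The mapping `STAGE_GROUPS`, the helper `collapse_stage`, and the sample values are hypothetical; the point is that the mapping is fixed before any outcome data are examined.

```python
from collections import Counter

# Grouping chosen a priori, before looking at survival results.
STAGE_GROUPS = {
    "I": "I+II",
    "II": "I+II",
    "III": "III+IV",
    "IV": "III+IV",
}


def collapse_stage(stage):
    """Map a single Stage label to its two-category group."""
    return STAGE_GROUPS[stage.strip().upper()]


# Hypothetical Stage column for six patients.
stages = ["I", "III", "II", "IV", "IV", "I"]
grouped = [collapse_stage(s) for s in stages]
print(Counter(grouped))
```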

If you really are looking at different types of cancer (e.g., lung versus breast versus prostate) you should think carefully about what you are trying to accomplish. Different types of cancers have different natural histories, standard therapies, and prognoses (the Stage numbers have different implications for survival depending on the type of cancer), so you might be better served by analyzing each type separately.

I'm a bit confused. I had a meeting with a statistician about my project and he said that the best approach was to "do a logistic regression for each variable at a time and then pick out those with p less than 0.3, then include these in a final regression model using the same dependent variable with these independent variables". Is that not stepwise selection? — Paze, Jul 02 '19 at 07:27
@Paze that is not [stepwise selection](https://en.wikipedia.org/wiki/Stepwise_regression), e.g. serial addition of predictors one at a time based on which best improves a prior model. You screened out predictors only weakly associated with outcome (p>0.3), then used the rest _together in a single model_. That avoided many dangers of one-predictor-at-a-time stepwise modeling; including several predictors potentially related to outcome minimized omitted-variable bias. My suggestions about handling Stages or different cancer types still hold. — EdM, Jul 02 '19 at 20:54

score 0 · Answer 2 · answered Jul 01 '19 at 15:59

You could in principle put types 1, 2 and 3 as a single group, but I don't know if this makes sense in your particular case.

Also, please remember that statistical significance is an arbitrary threshold. Nothing "magic" happens at $p=0.05$. And, by the way, that "significance" is assessed relative to a reference group, which is also arbitrary. Changing the reference group may change the $p$-values for the different categories.
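To see the reference-group effect concretely, here is a small Python sketch. It uses an ordinary linear model with made-up toy numbers rather than a survival or logistic model, so treat it as an analogy only. With dummy coding, each coefficient is the difference between a group mean and the reference mean, and its t statistic uses the pooled residual variance. The function name and data are hypothetical.

```python
from statistics import mean


def contrasts_vs_reference(groups, ref):
    """Return {group: (mean difference, t statistic)} against `ref`.

    Equivalent to the coefficients of a dummy-coded linear model
    with `ref` as the reference level.
    """
    n_total = sum(len(v) for v in groups.values())
    means = {g: mean(v) for g, v in groups.items()}
    # Pooled residual variance (mean squared error of the one-way model).
    sse = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    mse = sse / (n_total - len(groups))
    result = {}
    for g, values in groups.items():
        if g == ref:
            continue
        diff = means[g] - means[ref]
        se = (mse * (1 / len(values) + 1 / len(groups[ref]))) ** 0.5
        result[g] = (diff, diff / se)
    return result


# Made-up outcome values for four cancer types (toy data only).
toy = {
    1: [10, 12, 11, 13],
    2: [11, 13, 12, 14],
    3: [12, 11, 13, 12],
    4: [18, 20, 19, 21],
}

for ref in (1, 4):
    print(f"reference = type {ref}")
    for g, (diff, t) in contrasts_vs_reference(toy, ref).items():
        print(f"  type {g}: difference = {diff:+.2f}, t = {t:+.2f}")
```

With type 1 as the reference, only type 4 has a large t statistic; with type 4 as the reference, types 1, 2 and 3 all do (with 12 residual degrees of freedom, |t| above about 2.18 corresponds to p < 0.05). The data and the fitted model are identical in both cases; only the comparison being reported changes.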

So, in short, unless there is a clear (a priori) reason to establish "groups" among cancer types, my approach would be to either include or exclude the "type" variable as a whole, rather than "p-hacking" the model after seeing which categories do or don't have a significantly non-zero effect.
