
Let's say that I have one categorical variable with six levels, and I then create five indicator variables in order to represent the six levels. If two of the five variables are insignificant, then do I drop these two? I assume not, but I was not sure. I was thinking that it might be better to test the full (all five variables) versus the reduced (just the three significant variables) model and, if that was not significant, then just leave all five of the variables in. I was not sure what to do. Oh and I meant for this to be in the context of fitting a logistic regression model.
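The setup described here can be sketched in a few lines (a minimal pure-Python illustration; the level names and the choice of reference level are hypothetical):

```python
def dummy_encode(levels, reference):
    """Expand a categorical variable into intercept + (k-1) indicator columns.

    The reference level is absorbed by the intercept, so six levels
    yield five indicator variables, as in the question.
    """
    cats = sorted(set(levels))
    cats.remove(reference)
    return [[1] + [int(lv == c) for c in cats] for lv in levels]

# Each row: [intercept, Lv2 indicator, Lv6 indicator]
rows = dummy_encode(["Lv1", "Lv2", "Lv6", "Lv1"], reference="Lv1")
# → [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
```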

mmmmmmmmmm

3 Answers


You should leave all five indicator variables in. Dropping predictors because they are non-significant biases the coefficient estimates of the remaining predictors and invalidates the reported p-values.

A good reference that discusses this at length is Frank Harrell's Regression Modeling Strategies. You can find a summary of the problems with dropping insignificant features in section 4.3 there.

Ben Kuhn
    What if you're only interested in prediction, and don't care about the p-values? – rw2 Jun 18 '21 at 09:23

The latter approach (comparing the two models with and without the five variables, and deciding whether to keep them as a set) is better.

The problem with dropping individual indicators is that you'll change the p-values of the remaining levels as well, because you're shifting the intercept (a.k.a. the reference group). Given a model:

$y = b_0 + b_1 Lv2 + b_2 Lv3 + b_3 Lv4 + b_4 Lv5 + b_5 Lv6 + \epsilon$

The intercept represents the mean of $y$ for group $Lv1$. Now, if we drop, say, the last two terms:

$y = b_0 + b_1 Lv2 + b_2 Lv3 + b_3 Lv4 + \epsilon$

Because you only drop the variables and not the cases, subjects in levels 5 and 6 need a place to go: your intercept is now picking up the groups $Lv5$ and $Lv6$ as well, representing the mean of $y$ for levels 1, 5, and 6 combined.
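A quick numerical check of this point, using a pure-Python least-squares fit on made-up data (in practice you would use a regression package; the numbers below are illustrative only):

```python
def solve(A, b):
    """Solve the square linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i and M[r][i] != 0:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(xi[j] * xi[m] for xi in X) for m in range(k)] for j in range(k)]
    Xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(k)]
    return solve(XtX, Xty)

# Made-up data: two observations per level, levels 1..6.
levels = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
y      = [1, 3, 4, 6, 2, 4, 7, 9, 10, 12, 5, 7]

# Full model: intercept + indicators for levels 2..6.
full = [[1] + [int(lv == g) for g in range(2, 7)] for lv in levels]
b_full = ols(full, y)     # b_full[0] ≈ 2.0, the mean of y in level 1

# Drop the Lv5 and Lv6 indicators but keep their cases.
reduced = [[1] + [int(lv == g) for g in range(2, 5)] for lv in levels]
b_red = ols(reduced, y)   # b_red[0] ≈ 38/6, the mean over levels 1, 5, 6
```

The intercept jumps from 2.0 to about 6.33 even though no data changed: levels 5 and 6 were silently merged into the reference group.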

So, two major points: (1) your reference group can change, and that change is not always sensible; (2) you may be surprised to find that the significant results you wished to keep have vanished, because the reference-group mean has also changed.
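The set-wise comparison recommended above can be sketched as a likelihood-ratio test between the model with all five indicators and the intercept-only model. With a single categorical predictor, the fitted logistic-regression probabilities are simply the per-group success proportions, so no iterative fitting is needed (the counts below are hypothetical):

```python
import math

def binom_loglik(successes, totals):
    """Binomial log-likelihood at the fitted (observed) proportions.

    The binomial coefficient is omitted; it cancels in the likelihood ratio.
    """
    ll = 0.0
    for s, n in zip(successes, totals):
        p = s / n
        if 0 < p < 1:
            ll += s * math.log(p) + (n - s) * math.log(1 - p)
    return ll

# Hypothetical successes / trials for the six levels
succ = [10, 25, 15, 30, 12, 20]
tot  = [50, 50, 50, 50, 50, 50]

ll_full = binom_loglik(succ, tot)                # six group proportions
ll_null = binom_loglik([sum(succ)], [sum(tot)])  # one pooled proportion

lr = 2 * (ll_full - ll_null)  # compare to a chi-square with 5 df
# lr above 11.07 (the 5% critical value for 5 df) favors keeping
# all five indicators as a set
```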

Penguin_Knight
    I might be wrong, but I've had an impression that the OP is interested in dropping factor levels as opposed to variables themselves. Nevertheless, your answer is nice (+1). – Aleksandr Blekh Mar 10 '15 at 02:35
    I just re-read your answer and I'm taking my words about the levels back - sorry about misreading the text. I probably need a cup of coffee right away :-). – Aleksandr Blekh Mar 10 '15 at 02:39

Here's my two cents. I can't say with full certainty, but I suspect it very much depends on the model and data. If I understand this answer correctly, @gung advises testing your model(s) after dropping all, and then some, of the levels. However, the details on how exactly to perform the testing are rather fuzzy (at least to me). Perhaps he will be kind enough to expand on that for beginners like me.

You may also find this course-notes document on logistic regression (in R) by Professor Christopher Manning (Stanford University) relevant and useful. Among other things, he describes dropping whole categorical variables (factors, in R terminology) and manipulating categorical variable levels, such as collapsing several levels into one, as well as the impact of those actions on the quality of regression models and the interpretation of analysis results.
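The level-collapsing manipulation he describes can be sketched as follows (a hypothetical mapping and merged-level name; which levels to merge should come from domain knowledge, not from chasing significance):

```python
def collapse_levels(levels, mapping):
    """Merge factor levels: any level found in `mapping` is renamed;
    levels not in the mapping pass through unchanged."""
    return [mapping.get(lv, lv) for lv in levels]

merged = collapse_levels(["Lv1", "Lv5", "Lv6", "Lv2"],
                         {"Lv5": "Lv5_6", "Lv6": "Lv5_6"})
# → ["Lv1", "Lv5_6", "Lv5_6", "Lv2"]
```

Unlike silently dropping indicators, this makes the merge explicit, and the model is then refit with the new, smaller set of levels.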

Aleksandr Blekh
  • @gung is explaining there how to perform tests of null hypotheses that the expected response is the same for all levels, or for any two levels, not advising model selection based on the results of these tests. Given his answer [here](http://stats.stackexchange.com/questions/20836/), I'd imagine he'd be quite circumspect about advising that. Your link shows *how* to collapse levels based on the results of hypothesis tests, but offers no motivation or justification for doing so (though domain knowledge is being used as a constraint). – Scortchi - Reinstate Monica Mar 10 '15 at 12:25
  • @Scortchi: Thank you for your comment. I didn't say anything about model selection in my answer. In regard to gung's answer in this thread, well, it may be clear to more experienced people like you, but to me it's a bit fuzzy. Not sure why you reference gung's model selection answer here, which is nice but IMHO not directly relevant to this discussion's topic. – Aleksandr Blekh Mar 10 '15 at 12:46
  • What else is dropping/leaving in factor levels or whole factors (in this question) referring to but model selection? – Scortchi - Reinstate Monica Mar 10 '15 at 12:52
  • @Scortchi: Well, yes, factor manipulation _can be_ considered as model selection, but, more accurately, as **subset** of it (as model selection might consider other _aspects_ of models). Therefore, my points are: 1) model selection is a _much larger area_ than factor manipulation; 2) gung's model selection answer doesn't touch on the _factor manipulation aspect_, which is the topic of this question. – Aleksandr Blekh Mar 10 '15 at 13:02
    My point was only that explaining the tests isn't the same thing as saying you ought to use them for factor manipulation (a subset of model selection, as you say) - which, given the context of this question, an unwary reader might be liable to conclude. (And everything said in gung's model selection answer applies whenever you pick the "best" model from many based on a measure of fit - whether you look for the "best" by dropping whole predictors or collapsing levels of categorical predictors.) – Scortchi - Reinstate Monica Mar 10 '15 at 13:30
  • @Scortchi: Now I understand what you mean. Point taken. I will need to read on that (and more) at some point later. With my (non-statistics) Ph.D. dissertation defense coming next month, needless to say that I'm a little distracted to further improve my statistical knowledge at the moment... :-). – Aleksandr Blekh Mar 10 '15 at 13:50