Exclude entire categorical variable? Or just some of its dummies?

Question

I'm trying to predict whether a person is likely to get robbed or not (=dependent variable).

I will use logistic regression (classifier) to make the prediction.
In my data I have a lot of independent variables: some are numerical, some are categorical.
I have to convert my categorical variables into dummies. E.g. Ethnicity (latino, asian, etc.) will be converted to is_latino, is_asian, etc. where the values can be either 0 (yes) or 1 (no).
Since I have a lot of independent variables, I want to apply feature selection, as I expect that not all of them add value to the prediction. For now I'm just looking at p-values (I will implement cross-validation at a later stage).

The problem:

My results show that some dummies are significant, whereas other dummies relating to the same variable are not. E.g. is_latino seems to be very significant, whereas is_asian is not significant.

The question:

Can I continue by using only those dummies which appear to be significant (is_latino)? Or do I need to keep all the ones which relate to the same variable (ethnicity)?

This question has been asked and answered before on this site: https://stats.stackexchange.com/questions/273154/is-feature-selection-with-dummy-coding-of-categorical-variables-problematic — kjetil b halvorsen, Mar 04 '18 at 09:19
Coding 0 for yes and 1 for no is a convention, but one likely to confuse yourself and others. It's far simpler to do it the other way round: 1 for yes and 0 for no. (Indeed, it is possible that you really did this and just wrote it the wrong way round by accident.) — Nick Cox, Mar 04 '18 at 10:06

score 1 · Answer 1 · edited Mar 04 '18 at 10:02

As you already suggest in your question, when you include a categorical variable such as ethnicity and decide to model it through k-1 dummies (assuming you have an intercept in the model), you should keep either all of them or none of them.
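To make the k-1 coding concrete, here is a minimal sketch using pandas. The toy data and the third category label are made up for illustration; only is_latino and is_asian come from the question.

```python
import pandas as pd

# Hypothetical toy data; the category labels are illustrative only.
data = pd.DataFrame(
    {"ethnicity": ["latino", "asian", "white", "latino", "asian", "white"]}
)

# drop_first=True drops one level (the first in sorted order, here "asian").
# That level becomes the reference category absorbed by the intercept,
# so k = 3 levels are represented by k - 1 = 2 dummy columns.
dummies = pd.get_dummies(data["ethnicity"], prefix="is", drop_first=True, dtype=int)

print(dummies.columns.tolist())
```

Every remaining dummy coefficient is then a contrast against the dropped reference level, which is why the dummies only make sense as a set.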

In essence, I think you're interested in whether ethnicity has an important / significant effect on your outcome (being robbed). If you remove a certain level of ethnicity, the interpretation changes with it: the removed level is merged into the reference category, so every remaining dummy is now compared against a different baseline. It could, for example, be that once you remove is_latino, a highly significant level such as is_asian becomes insignificant.
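One common way to ask whether ethnicity matters as a whole, rather than dummy by dummy, is a likelihood-ratio test comparing the model with all k-1 dummies against the model with none of them. The answer does not name this test, so treat the following as a sketch under assumptions: the data are simulated, and the column names (robbed, age, ethnicity) and effect sizes are invented for the example. It uses statsmodels' formula interface and scipy's chi-squared distribution.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Simulated data, purely illustrative: names and effect sizes are assumptions.
rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({
    "ethnicity": rng.choice(["asian", "latino", "white"], size=n),
    "age": rng.normal(40, 12, size=n),
})
# Build the outcome so that only the "latino" level differs from the others.
log_odds = (
    -1.0
    + 0.01 * (data["age"] - 40)
    + 0.8 * (data["ethnicity"] == "latino")
).to_numpy()
data["robbed"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Full model: C(ethnicity) expands to k-1 dummies against a reference level.
full = smf.logit("robbed ~ age + C(ethnicity)", data=data).fit(disp=0)
# Reduced model: the whole categorical variable is removed.
reduced = smf.logit("robbed ~ age", data=data).fit(disp=0)

# Likelihood-ratio test of all ethnicity dummies jointly.
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model  # number of dummies removed (k-1)
p_value = chi2.sf(lr_stat, df_diff)

print(full.pvalues)  # per-dummy Wald p-values, relative to the reference level
print(f"LR statistic = {lr_stat:.2f}, df = {df_diff:.0f}, p = {p_value:.4g}")
```

The per-dummy p-values answer "does this level differ from the reference level?", while the likelihood-ratio test answers "does ethnicity improve the model at all?". The second question is the one that matches the keep-all-or-none decision.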
