I'm trying to predict whether a person is likely to get robbed or not (=dependent variable).

  • I will use logistic regression (classifier) to make the prediction.
  • In my data I have a lot of independent variables: some are numerical, some are categorical.
  • I have to convert my categorical variables into dummies. E.g. Ethnicity (latino, asian, etc.) will be converted to is_latino, is_asian, etc. where the values can be either 0 (yes) or 1 (no).
  • Since I have a lot of independent variables, I want to apply feature selection, as I expect that not all of them add value to the prediction. For now I'm only looking at p-values (I will implement cross-validation at a later stage).

The problem:

  • My results show that some dummies are significant, whereas other dummies relating to the same variable are not. E.g. is_latino seems to be very significant, whereas is_asian is not significant.

The question:

  • Can I continue by using only those dummies which appear to be significant (is_latino)? Or do I need to keep all the ones which relate to the same variable (ethnicity)?
Nick Cox
  • This question has been asked and answered before on this site: https://stats.stackexchange.com/questions/273154/is-feature-selection-with-dummy-coding-of-categorical-variables-problematic – kjetil b halvorsen Mar 04 '18 at 09:19
  • Coding 0 for yes and 1 for no is a possible convention, but one likely to confuse you and others. It is far simpler to do it the other way round: 1 for yes and 0 for no. (Indeed, it is possible that you really did this and just wrote it the wrong way round by accident.) – Nick Cox Mar 04 '18 at 10:06

1 Answer

As you already suggest in your question, when you include a categorical variable such as ethnicity and decide to model it through k-1 dummies (assuming you have an intercept in the model), you should keep either all of them or none of them.
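
The k-1 coding mentioned above can be sketched in plain Python. This is only an illustration: the helper `dummy_code`, the level names and the choice of "white" as the reference level are all hypothetical, and the sketch uses the 1 = yes, 0 = no convention.

```python
def dummy_code(values, reference):
    """Return k-1 indicator columns (1 = yes, 0 = no), omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical sample with k = 3 levels; "white" is the assumed reference level.
ethnicity = ["latino", "asian", "white", "latino", "white"]
dummies = dummy_code(ethnicity, reference="white")

for name, column in dummies.items():
    print(name, column)
```

The reference level gets no column of its own: a row with all dummies equal to 0 belongs to it, and the intercept carries its effect.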

In essence, I think you are interested in whether ethnicity as a whole has an important or significant effect on your outcome (being robbed). If you remove one dummy, the interpretation of the others changes with it: the dropped level is absorbed into the reference category, so each remaining coefficient is compared against a different baseline. It could, for example, be that once you remove is_latino, a previously highly significant level, e.g. is_asian, becomes insignificant.
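
A small numeric sketch of why the interpretation changes. It assumes ethnicity is the only predictor, in which case the fitted coefficient of a dummy equals the difference between the log-odds of the observed rate in that level and in the reference group. The counts below are invented purely for illustration, and the helper names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

# Invented counts per level: (number robbed, group size).
counts = {"white": (50, 1000), "latino": (150, 1000), "asian": (80, 1000)}

def pooled_rate(levels):
    """Observed robbery rate after pooling the given levels."""
    robbed = sum(counts[level][0] for level in levels)
    total = sum(counts[level][1] for level in levels)
    return robbed / total

def dummy_coefficient(level, reference_levels):
    """Coefficient of is_<level> when ethnicity is the only predictor."""
    return logit(pooled_rate([level])) - logit(pooled_rate(reference_levels))

# Full coding: is_latino and is_asian, with "white" as the reference.
coef_full = dummy_coefficient("asian", ["white"])

# is_latino dropped: latino rows are absorbed into the reference group.
coef_reduced = dummy_coefficient("asian", ["white", "latino"])

print(round(coef_full, 3), round(coef_reduced, 3))
```

With these made-up numbers the is_asian coefficient is about 0.50 under the full coding and about -0.25 once is_latino is dropped, so even its sign depends on which other dummies stay in the model. With further predictors in the model the coefficients no longer have this closed form, but the change of baseline works the same way.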

Amonet