I'm trying to predict whether a person is likely to get robbed or not (this is my dependent variable).
- I will use logistic regression (classifier) to make the prediction.
- In my data I have a lot of independent variables: some are numerical, some are categorical.
- I have to convert my categorical variables into dummies. E.g. Ethnicity (latino, asian, etc.) will be converted to is_latino, is_asian, etc., where the values can be either 1 (yes) or 0 (no).
- Since I have a lot of independent variables, I want to apply feature selection, as I expect that not all of them actually add value to the prediction. For now I'm only looking at p-values as a starting point (I will implement cross-validation at a later stage).
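To make the dummy coding concrete, here is a minimal sketch in plain Python. The helper `make_dummies`, the sample values, and the choice of "white" as the reference category are all made up for illustration; libraries such as pandas offer the same thing through `get_dummies`. One category is left out as the reference level, so a row from that category has 0 in every dummy.

```python
def make_dummies(values, reference):
    """Turn a list of category labels into 0/1 indicator columns.

    The `reference` category gets no column of its own: a row from that
    category has 0 in every dummy, which avoids perfect collinearity
    with the intercept in a regression.
    """
    categories = sorted(set(values) - {reference})
    return {
        f"is_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Hypothetical sample data; "white" is an arbitrary reference level.
ethnicity = ["latino", "asian", "white", "latino", "asian"]
dummies = make_dummies(ethnicity, reference="white")

for name, column in dummies.items():
    print(name, column)
```

With this coding, the coefficient (and p-value) of each dummy compares that category against the reference category, not against all other categories.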
The problem:
- My results show that some dummies are significant, whereas other dummies relating to the same variable are not. E.g. is_latino seems to be very significant, whereas is_asian is not significant.
The question:
- Can I continue by using only those dummies which appear to be significant (is_latino)? Or do I need to keep all the ones which relate to the same variable (ethnicity)?
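One way to frame this question is to test the ethnicity variable as a whole instead of each dummy separately. The sketch below is only an illustration, not a recommendation: it runs a likelihood-ratio test that compares a model containing all ethnicity dummies with a model containing none of them. It assumes Python with numpy, pandas, scipy and statsmodels; the data are simulated, and the column names (`robbed`, `age`, `ethnicity`) and effect sizes are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data (hypothetical): only "latino" differs from the others.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "ethnicity": rng.choice(["asian", "latino", "white"], size=n),
})
effect = df["ethnicity"].map({"asian": 0.0, "latino": 1.0, "white": 0.0})
log_odds = (-1.0 + 0.01 * (df["age"] - 50) + effect).to_numpy()
df["robbed"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# Full model: C(ethnicity) expands into dummies with one reference level.
full = smf.logit("robbed ~ age + C(ethnicity)", data=df).fit(disp=False)
# Reduced model: the whole ethnicity variable is dropped.
reduced = smf.logit("robbed ~ age", data=df).fit(disp=False)

# Likelihood-ratio test for the joint contribution of all ethnicity dummies.
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = int(full.df_model - reduced.df_model)
p_value = stats.chi2.sf(lr_stat, df_diff)

print(full.pvalues)  # per-dummy p-values, each relative to the reference level
print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p = {p_value:.4g}")
```

The per-dummy p-values answer "does this category differ from the reference category?", while the likelihood-ratio test answers "does ethnicity as a whole improve the model?". The two can disagree, which is the situation described above.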