Is feature selection with dummy coding of categorical variables problematic?

Question

In the context of feature selection, it is common to recode categorical variables with more than two categories into dummy variables. Selection methods such as the elastic net or lasso regression then select the best predictors, so it is possible that only some of the dummies of a categorical variable are selected. I am wondering whether this procedure can cause problems. I found some comments on the topic on Quora and in a tutorial, stating that the procedure should be used carefully but that there are no general problems. However, I was not able to find any detailed literature or well-founded guidelines to follow.

Question: Can problems arise if not all dummies of a categorical variable are selected for a model?

For example, I could imagine that the automatic selection relies on the order of the categories and the resulting reference category. Let's say there is a variable with categories A, B, and C. A dummy recoding into dummyB and dummyC would probably result in different variable selections compared to a dummy recoding into dummyA and dummyB.
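
To make this concern concrete, here is a minimal sketch in pure Python. It assumes a toy data set in which only category C differs from the others, a hand-written coordinate-descent lasso with an unpenalized intercept, and an arbitrary penalty of 0.1. The names `lasso_cd` and `selected_dummies` are hypothetical helpers written for this illustration, not functions from any library.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=500):
    """Minimize (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent.

    The intercept b0 is not penalized.
    """
    n, p = len(y), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Exact update of the unpenalized intercept.
        b0 = sum(y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)) / n
        for j in range(p):
            # Covariance of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - b0 - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z
    return b0, b


def selected_dummies(levels, y, reference, lam=0.1):
    """Dummy-code `levels` against `reference` and return the dummies the lasso keeps."""
    kept = [lev for lev in "ABC" if lev != reference]
    X = [[1.0 if lev == k else 0.0 for k in kept] for lev in levels]
    _, b = lasso_cd(X, y, lam)
    return {"dummy" + k for k, coef in zip(kept, b) if coef != 0.0}


# Toy data (an assumption for illustration): only category C shifts the response.
levels = list("AAAABBBBCCCC")
y = [2.0 if lev == "C" else 0.0 for lev in levels]

for ref in "ABC":
    print("reference", ref, "->", sorted(selected_dummies(levels, y, ref)))
```

With these toy numbers, the sketch keeps only dummyC when A or B is the reference category, but keeps both dummyA and dummyB when C is the reference, so the selected set does depend on the coding.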

Any advice or literature is highly appreciated!

UPDATE:

Based on Ben's comment, I found some literature comparing the lasso and the group lasso, which addresses my question:

http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf

http://people.ee.duke.edu/~lcarin/lukas-sara-peter.pdf

However, two further questions arose from this literature:

1) It seems that the ordinary lasso is still used regularly, whereas the group lasso doesn't appear that often in the current literature. Is there a specific reason for that?

2) When I have categorical variables with many categories, isn't it a problem if I have to select the whole categorical variable? In other words, is it sometimes advantageous to use the lasso instead of the group lasso?

this is the original motivation for the group lasso: http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf — user795305, Apr 11 '17 at 17:59
Thank you, I just read the paper and it addresses my question very well! However, I am still wondering about two things: 1) It seems that the ordinary lasso is still used regularly, whereas the group lasso doesn't appear that often in the current literature. Is there a specific reason for that? 2) When I have categorical variables with many categories, isn't it a problem if I have to select the whole categorical variable? In other words, is it sometimes advantageous to use the lasso instead of the group lasso? — Joachim Schork, Apr 12 '17 at 07:11
Hey, no problem. 1) I'm not sure it's true that the group lasso isn't used much. All the same, maybe one reason that it isn't used as often as it should be is that it's sometimes hard to specify the groups perfectly. Group specification is easy when you're grouping the levels of a categorical feature together, but it can get difficult in other situations. — user795305, Apr 15 '17 at 18:15
2) These lasso type methods promote sparsity. One type of sparsity we might want is a kind of "measurement sparsity"--so that we don't have to measure too many features. In that setting, group lasso is more natural than plain lasso when applied to a categorical feature. It varies though depending on what you're after. — user795305, Apr 15 '17 at 18:15

user795305 · Accepted Answer · 2018-02-26T15:15:56.827

(I'm writing this here just to be sure the question isn't left "unanswered".)

Yes, it can be a problem if we run the lasso on a design matrix with dummy variable coding. Perhaps only some levels will be selected by the model. As you mention, this makes the coding we choose a "tuning parameter" of the model: something that changes our estimate and that the user has to specify. This alone is undesirable, but it's also undesirable from a practical standpoint. If any level of a factor is in the model, we have to measure the factor, but we only get to use its value when it happens to be one of the selected levels! This is especially problematic when the factor is expensive to measure.
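
The group lasso mentioned in the comments avoids this partial selection by penalizing all dummies of a factor together, so a factor is either kept with all of its levels or dropped entirely. Below is a minimal pure-Python sketch using proximal gradient descent with block soft-thresholding. The function `group_lasso`, the toy factors `f1` and `f2`, the penalty of 0.1, and the sqrt(group size) weights are assumptions made for this illustration; it is not a reference implementation.

```python
import math


def group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal gradient for
    (1/2n)*||y - b0 - X b||^2 + lam * sum_g sqrt(p_g) * ||b_g||_2,
    where the unpenalized intercept b0 is re-fit exactly at every iteration.
    """
    n, p = len(y), len(X[0])
    # Step size from a crude upper bound on the Lipschitz constant: trace(X'X)/n.
    step = 1.0 / (sum(X[i][j] ** 2 for i in range(n) for j in range(p)) / n)
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        fit = [sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
        b0 = sum(y[i] - fit[i] for i in range(n)) / n
        r = [y[i] - b0 - fit[i] for i in range(n)]
        grad = [-sum(X[i][j] * r[i] for i in range(n)) / n for j in range(p)]
        for g in groups:
            v = [b[j] - step * grad[j] for j in g]
            norm = math.sqrt(sum(x * x for x in v))
            thresh = step * lam * math.sqrt(len(g))
            # Block soft-thresholding: the whole group is shrunk or zeroed together.
            scale = 0.0 if norm <= thresh else 1.0 - thresh / norm
            for j, x in zip(g, v):
                b[j] = scale * x
    return b0, b


# Toy balanced design (an assumption): factor f1 drives y, factor f2 is irrelevant.
rows = [(f1, f2) for f1 in "ABC" for f2 in "XYZ"]
y = [{"A": 0.0, "B": 1.0, "C": 2.0}[f1] for f1, _ in rows]
# Dummy coding with A and X as reference levels.
X = [[float(f1 == "B"), float(f1 == "C"), float(f2 == "Y"), float(f2 == "Z")]
     for f1, f2 in rows]
groups = [[0, 1], [2, 3]]  # column indices belonging to f1 and to f2

b0, b = group_lasso(X, y, groups, lam=0.1)
print("f1 dummies:", [round(b[j], 3) for j in groups[0]])
print("f2 dummies:", [round(b[j], 3) for j in groups[1]])
```

In this toy setting both dummies of `f1` stay in the model while both dummies of `f2` are set exactly to zero, so we would never need to measure `f2`. Note that with plain dummy coding the fitted coefficients can still depend on the choice of reference level; the sketch only illustrates the all-or-nothing selection of a factor.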

So, can I use a filter method such as removing features with low variance, with the categorical variables? If so, how? Because some levels of a categorical variable may be selected while others may be dropped right? If not, what other methods should I use? Thanks! — Sndn, Jan 26 '19 at 19:58
