Can I perform RFE on a Categorical variable with multiple values?

Question

I have a dataset with a variable called attributes, which is an "array" of strings.

E.g.

Row 1:  ['backlight', 'stopwatch', 'world_clock']

Row 2: ['backlight', 'world_clock']

After doing some tokenizing, we transformed this array into multiple boolean (categorical) variables (actual dataset had > 50 variables):

backlight | stopwatch | world_clock | ... | + 50 more of these 'attributes' variables
    T           T           T
    T           F           T
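For illustration, the tokenizing step above might look like the following minimal sketch. It uses plain Python and only the two example rows; the helper name `multi_hot_encode` is hypothetical and is not the code actually used on the dataset.

```python
# Example rows: each row holds a list of attribute strings.
rows = [
    ["backlight", "stopwatch", "world_clock"],
    ["backlight", "world_clock"],
]


def multi_hot_encode(rows):
    """Turn lists of attribute strings into one boolean column per attribute."""
    # Collect every distinct attribute and fix a column order.
    vocabulary = sorted({attribute for row in rows for attribute in row})
    # For each row, record whether each attribute is present.
    encoded = [[attribute in row for attribute in vocabulary] for row in rows]
    return vocabulary, encoded


vocabulary, encoded = multi_hot_encode(rows)
print(vocabulary)
for encoded_row in encoded:
    print(encoded_row)
```

Each resulting column is a separate yes/no indicator, and a row may have any number of them set to true.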

Because we now have more than 50 variables, we want to use Recursive Feature Elimination with Cross-Validation (RFECV) to find an optimal number of variables, and then perform regression to predict another variable, price.
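To make the plan concrete, here is a rough sketch of such an RFECV step. It assumes scikit-learn and NumPy are available and uses synthetic stand-in data: the sizes, the `true_effects` vector and the noise level are invented for illustration and do not come from the actual dataset.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the boolean attribute columns (assumption, not real data).
rng = np.random.default_rng(0)
n_rows, n_attributes = 200, 10
X = rng.integers(0, 2, size=(n_rows, n_attributes)).astype(float)

# Invented price model: only the first three attributes affect the price.
true_effects = np.array([30.0, 20.0, 10.0] + [0.0] * 7)
price = 50.0 + X @ true_effects + rng.normal(0.0, 5.0, size=n_rows)

# Drop one column per step and score each subset size by cross-validated R^2.
selector = RFECV(
    estimator=LinearRegression(),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
selector.fit(X, price)

print("Number of columns kept:", selector.n_features_)
print("Mask of kept columns:", selector.support_)
```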

However, I am unsure whether I can remove these new "variables". From these previously asked questions:

the answers say that it doesn't make sense to remove a level of a categorical variable, and that makes sense to me. But my categorical variable attributes is not exactly categorical; it's more like an array. In fact, in my opinion, each attribute is independent of the others and they could in fact exist as separate columns, but I am not sure about that...

I have 2 questions actually ...

1. What is this kind of variable called? Is it categorical? I have tried searching with different terms, but to no avail.
2. To add to the linked questions, is it okay to "drop" some of these tokenized boolean "variables" via RFE and then perform regression? Does it affect the accuracy of my model?

score 1 · Answer 1 · answered Aug 29 '20 at 05:53

But my categorical variable attributes is not exactly categorical; it's more like an array. In fact, in my opinion, each attribute is independent of the others and they could in fact exist as separate columns, but I am not sure about that...

This is in a way true, and the usual argument that you cannot drop a level of a categorical variable (without changing its meaning) is not relevant here. Each of your attributes variables is a binary variable in its own right, so dropping one does not change the meaning of the others. However, the usual arguments against automatic variable selection do apply; see "Variable selection for predictive modeling really needed in 2016?". Look into regularization instead (search this site for it).
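As a rough illustration of what regularization does here, the sketch below fits an L1-penalized (lasso) linear regression by coordinate descent in plain Python. The toy data, the price formula and the penalty `alpha=1.0` are all made up for the example, and the function name `lasso_coordinate_descent` is hypothetical. In practice you would more likely use a library implementation, such as `Lasso` in scikit-learn, and choose the penalty by cross-validation.

```python
from itertools import product


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it is within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / (2n)) * ||y - b0 - X b||^2 + alpha * ||b||_1."""
    n, p = len(X), len(X[0])
    # Center columns and target so the intercept can be recovered afterwards.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                fit_without_j = sum(Xc[i][k] * coef[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - fit_without_j)
                z += Xc[i][j] ** 2
            coef[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    intercept = y_mean - sum(col_means[j] * coef[j] for j in range(p))
    return coef, intercept


# Made-up data: every on/off combination of three attributes, in the order
# (backlight, stopwatch, world_clock). Stopwatch has no effect on the price.
names = ["backlight", "stopwatch", "world_clock"]
X = [list(combo) for combo in product([0, 1], repeat=3)]
y = [100.0 + 50.0 * row[0] + 20.0 * row[2] for row in X]

coef, intercept = lasso_coordinate_descent(X, y, alpha=1.0)
for name, value in zip(names, coef):
    print(name, round(value, 3))
print("intercept", round(intercept, 3))
```

In this toy setting the penalty shrinks the two useful coefficients toward zero and sets the stopwatch coefficient to exactly zero, so the selection happens as part of the fit rather than in a separate stepwise search.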
