I have a dataset with a variable called attributes, which is an "array" of strings.
E.g.
Row 1: ['backlight', 'stopwatch', 'world_clock']
Row 2: ['backlight', 'world_clock']
After tokenizing, we transformed this array into multiple boolean (categorical) variables, one per attribute (the actual dataset has more than 50 of them):
backlight | stopwatch | world_clock | ... (+ 50 more of these 'attributes' variables)
T         | T         | T           | ...
T         | F         | T           | ...
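For reference, here is a minimal sketch of the transformation described above, using only the two example rows. The helper name `encode_row` is purely illustrative and not taken from our actual pipeline, which may differ in detail.

```python
# The two example rows from above; the real dataset has many more.
rows = [
    ["backlight", "stopwatch", "world_clock"],
    ["backlight", "world_clock"],
]

# Sorted list of every attribute that appears in at least one row.
vocabulary = sorted({attr for row in rows for attr in row})


def encode_row(row, vocabulary):
    """Return one boolean per vocabulary entry: True if the row has that attribute."""
    present = set(row)
    return [attr in present for attr in vocabulary]


# One list of booleans per row, in the same column order as `vocabulary`.
encoded = [encode_row(row, vocabulary) for row in rows]

print(vocabulary)
for encoded_row in encoded:
    print(encoded_row)
```

Note that, unlike a one-hot encoded categorical variable, a row here can have any number of columns set to True at once.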
Because we now have more than 50 variables, we want to perform Recursive Feature Elimination with Cross-Validation (RFECV) to find an optimal number of variables, and then perform regression to predict another variable, price.
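To make the plan concrete, here is a rough sketch of the RFECV step with scikit-learn. Everything specific in it is an assumption for illustration: the data is synthetic (random booleans and a made-up price that depends on only two columns), and the choice of `LinearRegression`, 5 folds, and the scoring metric is not our actual setup.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the boolean attribute columns (assumption, not real data).
rng = np.random.default_rng(0)
n_rows, n_attributes = 200, 10
X = rng.integers(0, 2, size=(n_rows, n_attributes)).astype(float)

# Made-up price: only the first two columns matter, plus some noise.
price = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0.0, 0.5, size=n_rows)

# Drop one column per iteration and use 5-fold CV to pick how many to keep.
selector = RFECV(
    estimator=LinearRegression(),
    step=1,
    cv=5,
    scoring="neg_mean_squared_error",
)
selector.fit(X, price)

print("number of columns kept:", selector.n_features_)
print("mask of kept columns:", selector.support_)
```

Here each boolean column is treated as its own feature, so RFECV is free to drop any of them individually, which is exactly the step I am unsure about.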
However, I am unsure whether I can remove these newly created "variables". From these previously asked questions:
- Does it make sense to apply recursive feature elimination on one-hot encoded features?
- Can I ignore coefficients for non-significant levels of factors in a linear model?
the answers say that it doesn't make sense to remove a single level of a categorical variable, and that makes sense to me. But my variable attributes is not exactly categorical; it's more like an array. In fact, in my opinion, each attribute is independent of the others and could exist as a separate column, but I am not sure about that...
I actually have two questions:
What is this kind of variable called? Is it categorical? I have tried searching with different terms, but to no avail.
Building on the linked questions, is it okay to "drop" some of these tokenized boolean "variables" when performing RFE and regression? Does doing so affect the accuracy of my model?