How to process categorical features with many values?

Question

I want to apply machine learning and deep learning.

I have categorical data stored as strings. My first option was to dummy-encode the columns (scikit-learn). But some columns have thousands of distinct categorical values; if I use dummy encoding, the dataset will expand enormously.

What other alternatives do I have? Would it work if I simply applied a label encoder and then scaled everything to between 0 and 1?

This is basically the same question that I answered two days ago here: https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 — kjetil b halvorsen, May 05 '17 at 15:39
The most common and simplest way is to collapse the values or make new variables from it. — SmallChess, May 05 '17 at 15:43
@SmallChess: Yes, but that doesn't really take the problem seriously. If you want/need to take it seriously, see my linked answer above. — kjetil b halvorsen, May 05 '17 at 15:52

score 0 · Answer 1 · answered May 05 '17 at 15:33

If you have some domain knowledge, you may try to group your categories into broader, more general categories.
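
As a rough illustration of that grouping, here is a minimal sketch in plain Python. The city-to-region mapping, the `group_categories` helper, and the "Other" bucket are all hypothetical, made up for the example; in practice the mapping would come from whatever you know about your own data.

```python
from collections import Counter

# Hypothetical domain mapping (an assumption for illustration):
# fine-grained city names are collapsed into broader regions.
CITY_TO_REGION = {
    "Paris": "Europe",
    "Berlin": "Europe",
    "Tokyo": "Asia",
    "Osaka": "Asia",
}


def group_categories(values, mapping, default="Other"):
    """Replace each value with its broader group.

    Values missing from the mapping fall into the `default` bucket,
    so rare or unexpected categories do not create new columns later.
    """
    return [mapping.get(value, default) for value in values]


cities = ["Paris", "Tokyo", "Berlin", "Lima", "Osaka", "Paris"]
regions = group_categories(cities, CITY_TO_REGION)

# Count how many rows ended up in each broader category.
region_counts = Counter(regions)
```

Here six rows with five distinct cities collapse to three categories, so dummy encoding afterwards would need three columns instead of five.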

You could also try performing feature selection on these categorical variables. Feature selection using decision trees could be particularly useful here; you may find that you can prune a lot of the categories or even categorical variables.

Finally, if it is feasible to perform dummy encoding, I do not see why you shouldn't just do it. The deep network should be able to deal with it.
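
Whether it is feasible depends a lot on how the encoded data is stored. The sketch below, in plain Python with hypothetical helper names, stores one column index per row instead of a full row of zeros and ones, so memory grows with the number of rows rather than rows times categories. It is only meant to show the idea; scikit-learn's OneHotEncoder returns a sparse matrix by default, which serves the same purpose.

```python
def fit_vocabulary(values):
    """Assign each distinct category a column index, in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot_sparse(values, vocab):
    """Encode each row as the index of its single non-zero column.

    Categories that were not seen when the vocabulary was built
    are encoded as None (an all-zero row).
    """
    return [vocab.get(value) for value in values]


def to_dense_row(index, width):
    """Expand one sparse entry into a full 0/1 row, e.g. for inspection."""
    row = [0] * width
    if index is not None:
        row[index] = 1
    return row


train = ["red", "green", "red", "blue"]
vocab = fit_vocabulary(train)

# "purple" was not in the training data, so it maps to None.
encoded = one_hot_sparse(["blue", "purple", "red"], vocab)
dense_first = to_dense_row(encoded[0], len(vocab))
```

With thousands of categories, each row still costs a single integer in this form, and the dense row only needs to be built if a model actually requires it.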
