
I want to apply machine learning and deep learning.

I have categorical data stored as strings. My first option was to dummy-encode the columns (with scikit-learn). But some columns have thousands of distinct categorical values, so dummy encoding would expand the dataset enormously.

What other alternatives do I have? Would it work if I simply applied a label encoder and then scaled everything between 0 and 1?

gung - Reinstate Monica
Javi

1 Answer


If you have some domain knowledge, you may try to group your categories into broader, more general categories.
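As a minimal sketch of this kind of grouping, assuming pandas is available: the `city` column, the `city_to_country` mapping, and the `"Other"` bucket below are all hypothetical stand-ins for whatever your domain knowledge suggests.

```python
import pandas as pd

# Hypothetical data: a high-cardinality "city" column.
df = pd.DataFrame(
    {"city": ["Madrid", "Barcelona", "Paris", "Lyon", "Berlin", "Oslo"]}
)

# Hypothetical domain-knowledge mapping from fine to broad categories.
city_to_country = {
    "Madrid": "Spain",
    "Barcelona": "Spain",
    "Paris": "France",
    "Lyon": "France",
    "Berlin": "Germany",
}

# Values missing from the mapping fall into a catch-all bucket.
df["country"] = df["city"].map(city_to_country).fillna("Other")

# Dummy-encode the broader column instead of the original one.
dummies = pd.get_dummies(df["country"])
print(dummies.shape)
```

Here six distinct cities collapse into four dummy columns; with thousands of raw values the reduction can be much larger, at the cost of losing whatever signal distinguished the merged categories.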

You could also try performing feature selection on these categorical variables. Feature selection using decision trees could be particularly useful here; you may find that you can prune many of the categories, or even drop entire categorical variables.

Finally, if it is feasible to perform dummy encoding, I do not see why you shouldn't just do it. The deep network should be able to deal with it.
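On feasibility: scikit-learn's `OneHotEncoder` returns a SciPy sparse matrix by default, so the encoded data need not grow in memory as much as the column count suggests. A small sketch on synthetic data follows; the row and category counts are made up, and whether your downstream model accepts sparse input is something you would have to check.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic column with up to 5,000 distinct string values (made-up sizes).
n_rows, n_categories = 10_000, 5_000
X_raw = rng.integers(0, n_categories, size=(n_rows, 1)).astype(str)

# The default output is a sparse matrix, not a dense array.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

print(X.shape)  # thousands of columns
print(X.nnz)    # but only one stored value per row
```

The matrix has thousands of columns, yet it stores a single nonzero entry per row, so memory scales with the number of rows rather than rows times categories.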

rinspy