
I have a question: I have a dataset with several categorical features, each with thousands of levels.

Applying get_dummies directly would generate a dataset too large to work with. I would like to select the most important levels and group the remaining, less important levels together. Then I could apply get_dummies.
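For illustration, here is a minimal sketch of the grouping step described above. It assumes that "importance" can be approximated by how often a level occurs, which the question does not specify; the column name `city`, the helper `group_rare_levels`, and the `top_n` cutoff are hypothetical.

```python
import pandas as pd


def group_rare_levels(series, top_n=10, other_label="other"):
    """Keep the top_n most frequent levels; replace every other level with other_label."""
    # Assumption: frequency stands in for "importance".
    top_levels = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_levels), other_label)


# Toy data; a real column would have thousands of levels.
df = pd.DataFrame({"city": ["a", "a", "a", "b", "b", "c", "d", "e"]})

# Collapse everything outside the 2 most frequent levels into "other".
df["city_grouped"] = group_rare_levels(df["city"], top_n=2)

# One-hot encode the reduced column.
dummies = pd.get_dummies(df["city_grouped"], prefix="city")
```

With this approach the number of dummy columns per feature is bounded by `top_n + 1`, whatever the original number of levels.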

Do you have any idea how to do this?

Is it possible to apply a chi-squared test to the individual levels of the categorical features, rather than to the features themselves?
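One way this could look in scikit-learn is sketched below: each level becomes its own sparse indicator column, `chi2` scores every column against the target, and the levels that are not selected are merged into a single group. The toy data, the column names `city` and `target`, the `"other"` label, and the choice `k=2` are assumptions made for the example. Chi-squared scores for very rare levels rest on tiny expected counts, so they should be read with caution.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["a", "a", "a", "a", "b", "b", "b", "b", "c", "d"],
    "target": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
})

# One indicator column per level; the result is a sparse matrix,
# so thousands of levels stay manageable in memory.
encoder = OneHotEncoder()
X = encoder.fit_transform(df[["city"]])

# Score each level (column) against the target and keep the k best.
selector = SelectKBest(chi2, k=2).fit(X, df["target"])
mask = selector.get_support()

# Map the selected columns back to the original level values.
kept_levels = encoder.categories_[0][mask]

# Merge all non-selected levels into one group before calling get_dummies.
df["city_grouped"] = df["city"].where(df["city"].isin(kept_levels), "other")
dummies = pd.get_dummies(df["city_grouped"], prefix="city")
```

Here the selection is supervised, so it should be fitted on training data only to avoid leaking the target into the encoding.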

I usually use Python and scikit-learn.

kjetil b halvorsen
Javi
  • Sounds like you have a highly context-dependent problem - you may want to add a bit more detail about where your dataset is from and what it represents, and what you want to investigate – wjchulme May 16 '17 at 12:50
  • Such and similar questions crop up quite often (three today!). Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and the links therein. – kjetil b halvorsen May 16 '17 at 13:08
  • sounds like a duplicate of https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and https://stats.stackexchange.com/questions/277797/how-to-process-categorical-features-with-many-values – Josef May 16 '17 at 13:11
  • I'm going to check the links you posted, thank you very much – Javi May 16 '17 at 14:09

0 Answers