How to deal with categorical features with thousands of levels?

Asked May 16 '17 at 12:36

Active May 17 '17 at 10:54


I have a question about a dataset that has several categorical features, each with thousands of levels.

Applying get_dummies would generate a dataset too large for me to work with. I would like to select the most important levels and group the remaining, less important levels together. Then I can apply get_dummies.

Do you have any idea how to do this?
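One simple way to sketch the grouping step described above is to treat frequency as a stand-in for importance: keep the most frequent levels and collapse everything else into a single level before calling get_dummies. This is only an illustration under that assumption. The helper `group_rare_levels`, the `top_n` cutoff, the `"Other"` label, and the toy `city` column are all hypothetical and not part of the original question.

```python
import pandas as pd


def group_rare_levels(series, top_n=10, other_label="Other"):
    """Keep the top_n most frequent levels; replace all others with other_label.

    Assumption: frequency is used here as a proxy for "importance".
    """
    top_levels = series.value_counts().nlargest(top_n).index
    # Series.where keeps values where the condition is True, else uses other_label.
    return series.where(series.isin(top_levels), other_label)


# Toy data (hypothetical): one categorical column with a tail of rare levels.
df = pd.DataFrame({"city": ["A"] * 5 + ["B"] * 4 + ["C", "D", "E"]})

df["city_grouped"] = group_rare_levels(df["city"], top_n=2)

# One-hot encode the reduced column: one column per kept level plus "Other".
dummies = pd.get_dummies(df["city_grouped"], prefix="city")
print(dummies.columns.tolist())
```

A frequency cutoff ignores the target entirely, so a rare but highly predictive level would still be merged into the catch-all level.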

Is it possible to apply a chi-squared test to the individual levels of the categorical features, rather than to the features themselves?

I usually use Python and scikit-learn.
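The chi-squared idea can be sketched with scikit-learn by one-hot encoding the feature first, so that each level becomes its own binary column, and then scoring those columns with `sklearn.feature_selection.chi2`. The example below assumes a classification target. The toy data, the column names `city` and `target`, the cutoff `k`, and the `"Other"` label are hypothetical choices made for illustration.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Toy data (hypothetical): levels A and B are associated with the target,
# levels C and D are not.
df = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "D", "D"],
    "target": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
})

# One binary (0/1) column per level, so chi2 scores each level separately.
level_dummies = pd.get_dummies(df["city"]).astype(int)
scores, p_values = chi2(level_dummies, df["target"])
level_scores = pd.Series(scores, index=level_dummies.columns)
level_scores = level_scores.sort_values(ascending=False)

# Keep the k highest-scoring levels (k is an arbitrary choice here)
# and merge every other level into "Other".
k = 2
keep = level_scores.index[:k]
df["city_grouped"] = df["city"].where(df["city"].isin(keep), "Other")

encoded = pd.get_dummies(df["city_grouped"], prefix="city")
print(level_scores)
```

Because the scores depend on the target, the selection of levels should be fitted on training data only and then reused on validation or test data. With thousands of levels, many of them will have very few observations, and chi-squared scores for such levels are unreliable.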

edited May 17 '17 at 10:54

kjetil b halvorsen


asked May 16 '17 at 12:36

Javi


Sounds like you have a highly context-dependent problem - you may want to add a bit more detail about where your dataset is from and what it represents, and what you want to investigate – wjchulme May 16 '17 at 12:50

Such and similar questions crop up quite often (three today!). Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and the links therein. – kjetil b halvorsen May 16 '17 at 13:08
Sounds like a duplicate of https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and https://stats.stackexchange.com/questions/277797/how-to-process-categorical-features-with-many-values – Josef May 16 '17 at 13:11
I'm going to check the links you posted, thank you very much – Javi May 16 '17 at 14:09
