Categorical variables with too many levels in machine learning

Question

I have a machine learning problem where the dependent variable is binomial (Yes/No) and some of the independent variables are categorical (with more than 100 levels). I'm not sure whether dummy coding these categorical variables and then passing them to the machine learning model is an optimal solution.

Is there a way to deal with this problem?

You could look into fused lasso or fusion, see https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories/237000#237000 — kjetil b halvorsen, Apr 24 '17 at 14:26

score 1 · Answer 1 · answered Apr 24 '17 at 15:09

How to deal with categorical variables that have too many levels depends on the data.

If we have a HUGE amount of data (say, 1 billion data points), a categorical variable with $100$ different levels may not be a problem, since it is very likely that we have sufficient training examples for each level.
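Whether each level has "sufficient" examples can be checked directly by counting. The following is a minimal sketch using only the Python standard library; the function name `sparse_levels`, the toy `city` column, and the threshold `min_count=5` are assumptions made for illustration, and the right threshold depends on your data and model.

```python
from collections import Counter

def sparse_levels(values, min_count):
    """Return {level: count} for levels seen fewer than min_count times."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}

# Toy data: a hypothetical "city" column with two common and two rare levels
city = ["A"] * 50 + ["B"] * 30 + ["C"] * 3 + ["D"] * 1

sparse = sparse_levels(city, min_count=5)
print(sparse)  # levels with fewer than 5 examples
```

If the returned dictionary is empty, every level meets the threshold and plain dummy coding is less of a concern.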

However, this situation is less likely to occur in real data. Most real-world data follow an "80-20" rule: a few levels occur very often, and most levels do not have sufficient data. In that case, you may consider combining the rare levels into an "Other" category.
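Combining rare levels into an "Other" category can be sketched as follows. This is only an illustration under stated assumptions: the helper names `frequent_levels` and `collapse_levels`, the toy data, and the cutoff `min_count=3` are all hypothetical. The set of levels to keep is computed on the training data only, so that levels unseen during training also fall into "Other" at prediction time.

```python
from collections import Counter

def frequent_levels(train_values, min_count):
    """Levels that appear at least min_count times in the training data."""
    counts = Counter(train_values)
    return {level for level, n in counts.items() if n >= min_count}

def collapse_levels(values, keep, other_label="Other"):
    """Replace every level not in `keep` with other_label."""
    return [v if v in keep else other_label for v in values]

# Toy training column: two frequent levels and three rare ones
train = ["A"] * 6 + ["B"] * 3 + ["C", "D", "E"]
keep = frequent_levels(train, min_count=3)

train_collapsed = collapse_levels(train, keep)
# "Z" never appeared in training, so it is mapped to "Other" as well
test_collapsed = collapse_levels(["A", "C", "Z"], keep)

print(Counter(train_collapsed))
print(test_collapsed)
```

After collapsing, the variable has far fewer levels, and dummy coding it yields correspondingly fewer columns.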
