How to improve accuracy in the case of categorical data with many levels and no correlation

Question

Consider a simple multiclass problem in which there is a categorical variable with many levels (>1000). The nature of the problem is such that we can not reduce the dimensions of this variable.

The classic way to solve this problem, would be to one-hot encode the categorical variable and train a single multiclass model using gradient boosting or random forrest.

My question is whether one should expect an improvement in accuracy if the categorical variable is removed, by clustering data into groups of data sets with similar categorical variables. Then one could train multiple multiclass classifiers. Predictions would be made by looking at the value of the categorical variable, then applying the multiclass classifier trained on that data set.

What is the difference between having a single multiclass classifier trained on categorical data compared to using the approach described above?

score 0 · Answer 1 · answered Mar 17 '19 at 02:02

I think you describe hierarchical clustering. It cannot be universally said that it's better or worse, and actually, decision trees/random forest kind of are an hierarchical approach anyway. So, if you plan to use random forests anyway, I don't think you will get better results by training a random forest on each category value of a categorical variable and then pooling the results (e.g., by majority vote). However, other methods (e.g., logistic regression as a naive example) may benefit from such a hierarchical approach.

Btw. you mention that you can one-hot encode your multi-category variable for e.g., gradient boosting or random forests. If you have multiple variables, and one variable has a lot of categories that's not always a good idea because then you will have a proportionally large number of one-hot encoded features for particular variable. In that case, I would look for random forest implementation based on decision tree algorithms that are implemented without being limited to binary nodes.

score 0 · Answer 2 · answered May 09 '20 at 15:28

Your problem is one where some context is really needed to say much, I think. So please add some more context! Besides that, there are many possible ideas, and I will here limit to reference some other posts which are relevant.

Principled way of collapsing categorical variables with many levels? have many ideas (and when going there, look at the linked sidepanel.) Some posts dealing specifically with random forests is Label encoding vs Dummy variable/one hot encoding - correctness? and Random Forest Regression with sparse data in Python. Or maybe look to embeddings as used in text analytics.

How to improve accuracy in the case of categorical data with many levels and no correlation

2 Answers2