
I am working with a dataset that is essentially all categorical: 20-30 distinct columns, some with as many as 1000 different categorical values. If I use dummy variables to convert all of my categorical data, I will have so many features that I don't know how I could ever interpret my results.
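To give a sense of the scale, here is a minimal sketch with pandas on made-up columns (the column names and category counts are illustrative, not my real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Made-up frame: a few categorical columns, one with ~1000 levels.
df = pd.DataFrame({
    "city":    rng.integers(0, 1000, n_rows).astype(str),  # ~1000 distinct values
    "product": rng.integers(0, 50, n_rows).astype(str),    # 50 distinct values
    "channel": rng.choice(["web", "store", "phone"], n_rows),
})

dummies = pd.get_dummies(df)          # one indicator column per category level
print(df.shape, "->", dummies.shape)  # (10000, 3) -> roughly (10000, 1053)
```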

I'm curious what best practice is for these sorts of problems. I care more about the interpretability of my model than about its predictive power. What are the best categorical modeling techniques to use? And do the false numerical relationships that label encoding introduces outweigh the benefits of reduced dimensionality and easier interpretability?
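To make the label-encoding concern concrete, here is a small illustration with scikit-learn's LabelEncoder on made-up data (LabelEncoder is really intended for targets, so treat this as a sketch of the spurious ordering, not a recommended pipeline):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]
codes = LabelEncoder().fit_transform(colors)   # blue -> 0, green -> 1, red -> 2

# One column instead of three is easy to read, but a linear model now "sees"
# blue < green < red, an ordering that means nothing for nominal categories.
print(list(zip(colors, codes)))
```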

kjetil b halvorsen
  • If you created the dummy variables and then used some sort of penalized model, like the lasso, that would eliminate most of the variables, and then the results may not be hard to interpret (see the first sketch below the comments). – Glen Jan 17 '18 at 16:17
  • What about label encoding and then re-encoding those labels into binary (see the second sketch below the comments)? That is a dimensionality reduction of a massive number of categories. Other types of dimensionality reduction may also be useful after label encoding, maybe a clustering algorithm. You could try some kind of PCA or t-SNE on the generated features to look for consistent groupings and reduce your features that way. – neuroguy123 Jan 17 '18 at 16:36
  • If you have a categorical variable with 1000 levels, I think you have a problem. However, converting categorical variables into numeric ones will add a much bigger problem. – Peter Flom Jun 26 '19 at 11:05
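A minimal sketch of Glen's dummies-plus-lasso suggestion, assuming scikit-learn; the data and outcome below are made up, and the point is only that an L1 penalty leaves a short list of non-zero dummy coefficients to interpret:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "city":    rng.integers(0, 200, n).astype(str),
    "channel": rng.choice(["web", "store", "phone"], n),
})

# Made-up binary outcome that truly depends on only a few city levels.
signal = df["city"].isin(["3", "17", "42"]).to_numpy()
y = np.where(rng.random(n) < 0.9, signal, ~signal).astype(int)

X = pd.get_dummies(df)  # one indicator column per category level

# The L1 (lasso) penalty pushes most dummy coefficients to exactly zero,
# so only a handful of levels remain to interpret.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

coef = pd.Series(model.coef_.ravel(), index=X.columns)
print(coef[coef != 0].sort_values(key=np.abs, ascending=False))
```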
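And a sketch of neuroguy123's label-then-binary re-encoding, hand-rolled with pandas/NumPy (the helper binary_encode is made up for illustration, not a standard API): roughly 1000 levels compress into about 10 bit columns, at the cost of bit features that are hard to read on their own.

```python
import numpy as np
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Label-encode a categorical column, then split the integer codes into bits."""
    codes = series.astype("category").cat.codes.to_numpy()   # 0 .. n_levels - 1
    n_bits = max(1, int(np.ceil(np.log2(codes.max() + 1))))
    bits = (codes[:, None] >> np.arange(n_bits)) & 1          # one column per bit
    return pd.DataFrame(bits, index=series.index,
                        columns=[f"{series.name}_bit{i}" for i in range(n_bits)])

rng = np.random.default_rng(0)
city = pd.Series(rng.integers(0, 1000, 10_000).astype(str), name="city")
print(binary_encode(city).shape)  # (10000, 10): ~1000 levels -> 10 bit columns
```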

0 Answers