I am building a simple classification model to identify negative or positive feedback in a ticketing system. The tickets have a lot of categorical data with high cardinality, and this data should be significant for the model. Are there any standard ways of using these categories without creating hugely sparse data?
1 Answer
As you already implicitly suggested, a dummy (one-hot) encoding would be a sparse but sensible solution.
I would suggest using this approach with an extension that reduces the sparsity of the dummy encoding. For example, you could think of:
- simply reducing the cardinality by mapping rarely occurring values of a category to the same (new) value. Some domain knowledge might come in handy here.
- less simply, reducing the dimension of the dummy variables of multiple categorical variables. For example, a restricted Boltzmann machine could summarize the co-occurrence of V1=A and V2=B in a way that is meaningful for binary variables.
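
The first option can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name `group_rare_categories`, the threshold `min_count`, the placeholder label `"OTHER"` and the example values are all assumptions made here, not part of any library.

```python
from collections import Counter


def group_rare_categories(values, min_count=2, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical ticket field with a long tail of rare values.
products = ["billing", "billing", "login", "login", "login", "vpn", "printer", "fax"]
grouped = group_rare_categories(products, min_count=2)

print(grouped)
print(len(set(products)), "->", len(set(grouped)))  # distinct values before and after
```

A sensible threshold depends on the data set size. With domain knowledge you might prefer to merge rare values into a meaningful parent category rather than one catch-all label.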
Instead of changing the data, you could also change the modeling tools by:
- using techniques that can handle categorical variables, like a single decision tree or (more advanced) random forest or gradient tree boosting.
- using sparse matrices of the data and modeling tools that can be applied to sparse matrices. This saves tremendous amounts of memory, and you don't need to put additional effort into the data representation.
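
The sparse-matrix option could look roughly like the sketch below, assuming scikit-learn and SciPy are available. The toy tickets, column meanings and labels are made up for illustration, and logistic regression is just one example of a model that accepts sparse input.

```python
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy tickets: (product, channel) pairs and a sentiment label.
X_raw = [
    ["billing", "email"],
    ["login", "phone"],
    ["billing", "chat"],
    ["vpn", "email"],
    ["login", "chat"],
    ["vpn", "phone"],
]
y = [0, 1, 0, 1, 1, 0]  # 1 = positive feedback, 0 = negative (made up)

# OneHotEncoder returns a SciPy sparse matrix by default, so only the
# non-zero entries are stored. handle_unknown="ignore" encodes categories
# not seen during fit as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

# LogisticRegression can be fitted directly on the sparse matrix.
model = LogisticRegression()
model.fit(X, y)

# "fax" was never seen during fit; its channel columns are all zero.
new_ticket = encoder.transform([["billing", "fax"]])
pred = model.predict(new_ticket)

print(sparse.issparse(X), X.shape, X.nnz)
```

Each row stores only one non-zero entry per categorical column, so memory grows with the number of rows and columns of the raw data rather than with the cardinality of the categories.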

Pieter