I am working with a categorical variable that has a lot of levels (let's say more than 20). I would like to binarize the levels with one-hot encoding in order to use the new variables in a machine learning model. I don't want to keep all of the levels, only the top-k, so as not to pass useless information to the model.
I am solving a multiclass classification problem, so the target variable has more than two levels. I think that I can rank the levels of the variable in question using the $\chi^2$ measure. So basically, I will do the following steps:
- Binarize (one-hot encode) the selected variable
- For each binary level, evaluate the $\chi^2$ measure w.r.t. the target variable
- Rank the levels according to their $\chi^2$ value and keep the top-k
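
To make the steps concrete, here is a minimal sketch of what I have in mind, in plain Python rather than PySpark. The function names `chi2_per_level` and `top_k_levels` and the toy data are made up for illustration. It assumes each one-hot column is compared to the target as a "this level vs. all other levels" contingency table, and it computes only the raw $\chi^2$ statistic, not a p-value.

```python
from collections import Counter


def chi2_per_level(feature, target):
    """Chi-square statistic of each one-hot level (level vs. rest) against the target.

    `feature` and `target` are equal-length sequences of labels.
    Returns a dict mapping each level to its chi-square statistic.
    """
    n = len(feature)
    class_counts = Counter(target)          # marginal count of each target class
    level_counts = Counter(feature)         # marginal count of each level
    joint = Counter(zip(feature, target))   # observed (level, class) counts

    scores = {}
    for level, n_level in level_counts.items():
        stat = 0.0
        for cls, n_cls in class_counts.items():
            # Observed counts for the binary column: level present / level absent.
            obs_in = joint[(level, cls)]
            obs_out = n_cls - obs_in
            # Expected counts under independence of the column and the target.
            exp_in = n_level * n_cls / n
            exp_out = (n - n_level) * n_cls / n
            stat += (obs_in - exp_in) ** 2 / exp_in
            if exp_out > 0:  # zero only if the variable has a single level
                stat += (obs_out - exp_out) ** 2 / exp_out
        scores[level] = stat
    return scores


def top_k_levels(feature, target, k):
    """Return the k levels with the largest chi-square statistic."""
    scores = chi2_per_level(feature, target)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical toy data: three levels, three target classes.
    feature = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
    target = ["x"] * 4 + ["y", "y", "z", "z"] + ["x", "y", "z", "y"]
    print(chi2_per_level(feature, target))
    print(top_k_levels(feature, target, k=2))
```

In this sketch every level gets a table of the same shape (2 rows by number-of-classes columns), so the statistics are at least computed on comparable tables.
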
Do you think this is a good idea? Is it necessary to perform the $\chi^2$ hypothesis test, or can I rank directly by the $\chi^2$ statistic?
I have additional issues concerning the implementation, since I am using pyspark, but that's another story.