Select top-k feature from a categorical variable using $\chi^2$

Question

I am working with a categorical variable that has a lot of levels (let's say more than 20). I would like to binarize all the levels using one-hot encoding in order to use these new variables in a machine learning model. I don't want to keep all 20 levels, only the top-k, so as not to pass useless information to the model.

I am solving a multiclass classification problem, so the target variable has more than two levels. I think that I can rank the levels of the variable in question using the $\chi^2$ measure. So basically, I will do the following steps:

Binarize (One-hot-encoding) the selected variable
For each binary level, evaluate the $\chi^2$ measure w.r.t the target variable
Rank the levels according to the $\chi^2$
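To make the steps concrete, here is a minimal sketch in plain Python rather than pyspark. It assumes the data fit in memory as two parallel lists, and the names `chi2_per_level` and `top_k_levels` as well as the toy data are hypothetical, made up only for illustration. For each level it builds the 2 × (number of classes) contingency table of the one-hot indicator against the target and computes the Pearson $\chi^2$ statistic.

```python
from collections import Counter


def chi2_per_level(levels, targets):
    """Pearson chi-square statistic of each one-hot indicator vs. the target."""
    n = len(levels)
    class_counts = Counter(targets)        # column totals of the table
    level_counts = Counter(levels)         # rows with indicator == 1, per level
    joint = Counter(zip(levels, targets))  # observed (level, class) counts

    scores = {}
    for level, n_in in level_counts.items():
        n_out = n - n_in                   # rows with indicator == 0
        stat = 0.0
        for cls, n_cls in class_counts.items():
            obs_in = joint[(level, cls)]
            obs_out = n_cls - obs_in
            exp_in = n_in * n_cls / n      # expected count under independence
            exp_out = n_out * n_cls / n
            stat += (obs_in - exp_in) ** 2 / exp_in
            if exp_out > 0:                # zero only if the variable has one level
                stat += (obs_out - exp_out) ** 2 / exp_out
        scores[level] = stat
    return scores


def top_k_levels(levels, targets, k):
    """Return the k levels with the largest chi-square statistic."""
    scores = chi2_per_level(levels, targets)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): level "a" always co-occurs with class "x".
levels = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]
targets = ["x", "x", "x", "x", "y", "y", "z", "z", "y", "z", "y", "z"]

print(top_k_levels(levels, targets, k=1))  # -> ['a']
```

Since every indicator is compared with the same target, each statistic has the same degrees of freedom (number of classes minus one), so ranking by the statistic and ranking by the p-value give the same order.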

Do you think this is a good idea? Is it necessary to perform the $\chi^2$ hypothesis test, or should I use the $\chi^2$ statistic directly?

I have additional issues concerning the implementation, since I am using pyspark, but that's another story.

score 0 · Answer 1 · answered Aug 29 '20 at 06:04

This is probably not a good idea, because it is a version of feature selection based on marginal association, which ignores all multivariate information. Don't do it. It is better to keep all the levels and look at methods designed for categorical variables with very many levels; see "Principled way of collapsing categorical variables with many levels?"
