I am building a classification model where my label is categorical (0 or 1). I want to use scikit-learn's SelectKBest to select my top 10 features, but I'm not sure which score function to use. I thought I'd use chi2, but not all of my variables are categorical. Which function works best with mixed variables (categorical, continuous, discrete)? I've seen several posts where people use f_classif, but isn't ANOVA only valid when the label is continuous and the predictor variables are categorical? I'm trying to find a score function that can handle all of my variables.
1 Answer
Try the mutual_info_classif scoring function. It works with both continuous and discrete variables. You can specify a mask or indices of discrete features in the discrete_features parameter:
>>> from functools import partial
>>> from sklearn.feature_selection import mutual_info_classif, SelectKBest
>>> discrete_feat_idx = [1, 3]  # indices of the discrete features in your data
>>> score_func = partial(mutual_info_classif, discrete_features=discrete_feat_idx)
>>> s = SelectKBest(score_func, k=10)  # k=10 keeps the top 10 features, as asked
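As a rough end-to-end sketch (the toy data below is made up purely for illustration, with columns 1 and 3 discrete; fit_transform and get_support are the standard SelectKBest methods), it could be wired together like this:

>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> # made-up data: columns 0 and 2 continuous, columns 1 and 3 discrete
>>> X = np.hstack([
...     rng.normal(size=(200, 1)),
...     rng.randint(0, 5, size=(200, 1)),
...     rng.normal(size=(200, 1)),
...     rng.randint(0, 3, size=(200, 1)),
... ])
>>> y = rng.randint(0, 2, size=200)            # binary label, as in the question
>>> selector = SelectKBest(partial(mutual_info_classif, discrete_features=[1, 3]), k=2)
>>> X_new = selector.fit_transform(X, y)       # keeps the k highest-scoring columns
>>> kept = selector.get_support(indices=True)  # indices of the selected columns

Here k=2 only because the toy matrix has four columns; with your data you would keep k=10.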
But note that discrete does not always imply categorical. If a feature is discrete but its values are not meaningfully comparable (e.g. a label-encoded category), I suspect the corresponding score will not make much sense.

Sanjar Adylov
- Thanks Sanjar, that's a good point. What can be done in that case? For example, if the variable is "number of cars." – Insu Q Aug 13 '19 at 12:20
- @InsuQ the *number* of cars or other things is reasonable to score. But, for example, the *color* (one-hot or label-encoded) might not be. – Sanjar Adylov Aug 13 '19 at 13:01
- So would you have to run SelectKBest twice? Once to handle the discrete and continuous numerical variables, and then a second time with, say, the chi2 score function to handle the categorical variables? – Insu Q Aug 13 '19 at 13:08
- @InsuQ actually, I don't think two-stage feature selection is necessary. It really depends on your data. Try using `mutual_info_classif` on all the data you have, and then do some analysis of the scores. – Sanjar Adylov Aug 13 '19 at 13:37
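To make that last suggestion concrete, here is a minimal sketch of such an analysis (it reuses the made-up `X` and `y` from the sketch above; the `[1, 3]` discrete indices are likewise an assumption): score every column with `mutual_info_classif` and look at the ranking before committing to a value of `k`.

>>> scores = mutual_info_classif(X, y, discrete_features=[1, 3], random_state=0)
>>> order = np.argsort(scores)[::-1]          # feature indices, highest MI first
>>> ranked = list(zip(order, scores[order]))  # (index, score) pairs to inspect before fixing k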