I am trying to assess feature importance for an ML model. To do this, I am running a univariate statistical test between each independent feature and the dependent variable I am trying to predict.
The dependent variable is binary and the features are mixed (some categorical, some continuous). As a result, for the continuous features I am running a two-sample $t$-test, and for the categorical features a $\chi^2$-test.
I want to be able to compare the outputs of these two different tests in order to see which features are the most closely related to the dependent variable. How do I compare the results from these two different tests?
I am not sure how to do this, since the two test statistics are on different scales. Moreover, a statistically significant relationship is not necessarily an important one: for example, the mean difference between Group A and Group B could be highly significant yet amount to a mere 0.01 units.
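To make this concern concrete, here is a small simulated demonstration (the data are made up): with a large enough sample, a 0.01-unit mean difference produces a tiny $p$-value even though the standardized effect size (Cohen's $d$) is negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000

# Two groups whose true means differ by only 0.01 units
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: standardized effect size (mean difference / pooled std)
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std

print(p_value)   # tiny: the test screams "significant"
print(cohens_d)  # ~0.01: a negligible effect in practice
```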
Therefore, how do I find the features that not only have the largest-magnitude relationship with the dependent variable but also the highest statistical significance?
Thanks in advance for any help and let me know if I should provide more of my particular example.
I am using scikit-learn's `feature_selection` module, specifically the `f_classif` and `chi2` scoring functions, to do this analysis. (These implement what I described above: `f_classif` runs an ANOVA $F$-test, which is equivalent to a two-sample $t$-test when the target is binary, and `chi2` runs the $\chi^2$-test.)