
I use the chi-square test for feature selection. I use it only when all entries in the contingency table are greater than 5.

Is that the correct approach statistically?

What happens, for example, if there's a feature that appears 1000 times, but only in positive examples? It seems that it should pass the test. Am I using it wrong?

kjetil b halvorsen
Roy
  • See [Is multiple logistic regression the right choice or should I use univariate logistic regression?](http://stats.stackexchange.com/a/106248/17230). Sometimes called univariable screening, or univariate selection, the approach can result in over-optimistic estimates of your model's predictive performance (be sure to include the feature selection procedure in the validation), & makes no allowance for confounding. I'd imagine it works best when predictors have either strong or negligible effects, & are only weakly correlated, or orthogonal by design. – Scortchi - Reinstate Monica Jun 08 '15 at 08:39
  • On Pearson's chi-squared, note first that such rules of thumb apply to *expected* & not *observed* counts, & see e.g. [Applicability of chi-square test if many cells have frequencies less than 5](http://stats.stackexchange.com/q/35657/17230) – Scortchi - Reinstate Monica Jun 08 '15 at 09:00
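
To make the expected-versus-observed distinction concrete, here is a minimal sketch in plain Python. The counts are hypothetical, chosen to resemble the case in the question (a feature seen 1000 times, only in positive examples); the helper names `expected_counts` and `pearson_chi2_2x2` are made up for illustration. It computes the expected counts from the table margins and Pearson's statistic without a continuity correction, so treat it as an illustration rather than a recommended selection procedure.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2_2x2(table):
    """Pearson's chi-squared statistic and p-value for a 2x2 table (1 df, no Yates correction)."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, expected

# Hypothetical counts: rows = feature present / absent, columns = positive / negative.
observed = [[1000, 0],
            [4000, 5000]]

stat, p_value, expected = pearson_chi2_2x2(observed)
min_observed = min(min(row) for row in observed)
min_expected = min(min(row) for row in expected)

print("smallest observed count:", min_observed)
print("smallest expected count:", min_expected)
print("chi-squared statistic:", round(stat, 2))
print("p-value:", p_value)
```

In this made-up table one observed cell is 0, so a rule based on observed counts would skip the feature, while every expected count is far above 5, so the usual rule of thumb is met and the test flags the feature strongly.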
