chi square test for large data sets

Asked Jun 08 '15 at 08:19

Active Feb 09 '22 at 15:13

Viewed 1,370 times

I use the chi-square test for feature selection, but only when all entries in the contingency table are greater than 5.

Is that the correct approach statistically?

What happens, for example, if a feature appears 1000 times, but only in positive examples? It seems that such a feature should pass the test. Am I using the test wrong?

edited Feb 09 '22 at 15:13 by kjetil b halvorsen

asked Jun 08 '15 at 08:19 by Roy

See [Is multiple logistic regression the right choice or should I use univariate logistic regression?](http://stats.stackexchange.com/a/106248/17230). Sometimes called univariable screening, or univariate selection, the approach can result in over-optimistic estimates of your model's predictive performance (be sure to include the feature selection procedure in the validation), & makes no allowance for confounding. I'd imagine it works best when predictors have either strong or negligible effects, & are only weakly correlated, or orthogonal by design. – Scortchi - Reinstate Monica Jun 08 '15 at 08:39
On Pearson's chi-squared, note first that such rules of thumb apply to *expected* & not *observed* counts, & see e.g. [Applicability of chi-square test if many cells have frequencies less than 5](http://stats.stackexchange.com/q/35657/17230) – Scortchi - Reinstate Monica Jun 08 '15 at 09:00
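
To make the expected-versus-observed point concrete for the case in the question, here is a minimal sketch in plain Python. The table is hypothetical: the question only says the feature appears 1000 times, all in positive examples, so the other counts (4000 positives and 5000 negatives without the feature) are assumptions chosen for illustration. The helper names `expected_counts` and `pearson_chi2_2x2` are also made up for this sketch, and no continuity correction is applied.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 df)."""
    exp = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, exp

# Rows: feature present / feature absent. Columns: positive / negative.
# Hypothetical counts: only the "1000 present, all positive" row is from the question.
observed = [[1000, 0], [4000, 5000]]

stat, p_value, expected = pearson_chi2_2x2(observed)
min_expected = min(min(row) for row in expected)

print("expected counts:", expected)
print("smallest expected count:", min_expected)
print("chi-squared statistic:", stat)
print("p-value:", p_value)
```

Under these assumed counts, the observed table contains a zero, yet every expected count is far above 5, so the usual rule of thumb is satisfied and the feature is strongly associated with the class. Filtering on observed counts would have wrongly excluded it.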
