
I am trying to build a predictive (machine-learning) logistic regression model whose predictors are mostly categorical (non-ordinal) variables. As part of variable selection, I run a Pearson chi-squared test on these factor variables to judge how appropriate they might be for the model. I get a warning that the 'approximation may be incorrect'. This appears to be due to zero cells in the contingency table, or simply to too many factor levels in several of the variables.
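To make the suspected cause concrete: the chi-squared approximation is usually considered unreliable when expected cell counts are small (a common rule of thumb is below 5), which is what happens when a level has few observations. Below is a minimal sketch, assuming each factor is tested against a binary outcome; the function names, the threshold of 5, and the toy data are all hypothetical and not taken from my actual data.

```python
from collections import Counter


def expected_counts(pairs):
    """Expected cell counts under independence for (level, outcome) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    row_totals = Counter(level for level, _ in pairs)
    col_totals = Counter(outcome for _, outcome in pairs)
    # Expected count for a cell = row total * column total / grand total.
    return {
        (level, outcome): row_totals[level] * col_totals[outcome] / n
        for level in row_totals
        for outcome in col_totals
    }


def small_expected_cells(pairs, threshold=5.0):
    """Cells whose expected count falls below the (assumed) threshold."""
    return sorted(
        cell for cell, e in expected_counts(pairs).items() if e < threshold
    )


# Hypothetical toy data: level "b" is rare, so its cells have small expected counts.
toy = [("a", 0)] * 20 + [("a", 1)] * 20 + [("b", 0)] * 3 + [("b", 1)] * 1
print(small_expected_cells(toy))
```

With many sparse levels, a large share of cells would be flagged this way, which is consistent with the warning.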

Either way, I'm pretty sure having 30+ levels for some of these factors is not good for my model. My question is: is there a recommended procedure for grouping these levels (consistently across the training and testing sets) to simplify these particular variables? Some levels have too little in common with each other to be grouped in any domain-specific way. I'm wondering whether simple random grouping, or even something like grouping all levels that start with the same letter, would be better than keeping my 30+ levels. Seeing as the 'hash trick' can be used to group levels (see Epstein's talk, slide 16), something like this should work, yes? (Apologies if this has been asked before, but I couldn't find it anywhere, nor any good references on this kind of factor-level grouping/pooling.)
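To illustrate the two kinds of grouping I have in mind, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the helper names `fit_rare_level_map` and `hash_bucket`, the `min_count` cutoff, the `"OTHER"` label, and the number of buckets are hypothetical. The first helper pools rare levels using counts from the training set only, so the same mapping can then be applied to the testing set (including levels never seen in training). The second assigns each level to a bucket with a deterministic hash, which is effectively an arbitrary grouping in the spirit of the hash trick.

```python
import hashlib
from collections import Counter


def fit_rare_level_map(train_levels, min_count=10, other="OTHER"):
    """Learn which levels to keep from the training data only."""
    counts = Counter(train_levels)
    keep = {level for level, c in counts.items() if c >= min_count}

    def apply(level):
        # Rare or unseen levels are pooled into a single catch-all level.
        return level if level in keep else other

    return apply


def hash_bucket(level, n_buckets=8):
    """Map a level to one of n_buckets groups with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and would not be reproducible across runs.
    """
    digest = hashlib.md5(str(level).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical training levels: "z" is rare.
train = ["x"] * 12 + ["y"] * 11 + ["z"] * 2
pool = fit_rare_level_map(train)

# Apply the same mapping to (hypothetical) test levels.
test_levels = ["x", "z", "unseen"]
print([pool(level) for level in test_levels])
print([hash_bucket(level) for level in test_levels])
```

Because both mappings are fixed before the testing data is touched, the training and testing sets end up with the same set of grouped levels.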

kjetil b halvorsen
Joe

0 Answers