
Suppose I want to regress $y$ on $X$, controlling for a categorical variable $z$ with $100$ different levels. I believe that linear regression is appropriate. Normally I would create a dummy $D_i$ for each category of $z$ and fit the following model by least squares:

$$y = X\beta + \sum_{i=1}^{100}\delta_i D_i + \epsilon$$

However, suppose $z$ is such that the first two categories, $D_1$ and $D_2$, each represent 49.5% of the data, while $D_{3:100}$ together account for only 1%. So I merge everything except the first two categories into a single category, so that $z$ now has only 3 levels.

After doing this, can I still say I am controlling for $z$? It feels like, since the individual coefficients for $D_{3:100}$ are no longer identifiable, I'm losing something.
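To make the comparison concrete, here is a minimal simulation sketch, not part of the original question. Merging levels 3 to 100 amounts to imposing the restriction $\delta_3 = \dots = \delta_{100}$ on the full model, so the 3-level model is nested in the 100-level one. The data-generating process (the true slope of 1.5, the spread of the level effects, the sample size) and the helper names `dummies` and `ols` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_levels = 20_000, 100

# Level probabilities from the question: 49.5% each for the first two levels,
# with the remaining 1% spread evenly over levels 3..100 (an assumption).
probs = np.r_[0.495, 0.495, np.full(98, 0.01 / 98)]
z = rng.choice(n_levels, size=n, p=probs)  # levels coded 0..99

# Hypothetical data-generating process: x is correlated with the level effect,
# so z is a genuine confounder.
delta = rng.normal(0.0, 2.0, size=n_levels)
x = 0.5 * delta[z] + rng.normal(size=n)
y = 1.5 * x + delta[z] + rng.normal(size=n)


def dummies(codes):
    """One-hot columns for the levels that actually occur in `codes`."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, :]).astype(float)


def ols(design, response):
    """Least-squares coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return coef, float(resid @ resid)


# Collapse every level except the first two into a single "other" level.
z_merged = np.minimum(z, 2)

# No separate intercept: the full set of dummies already spans it.
X_full = np.column_stack([x, dummies(z)])
X_merged = np.column_stack([x, dummies(z_merged)])

coef_full, rss_full = ols(X_full, y)
coef_merged, rss_merged = ols(X_merged, y)

print("beta with all observed levels:", coef_full[0])
print("beta with 3 levels:           ", coef_merged[0])
print("RSS full vs merged:           ", rss_full, rss_merged)
```

Because the merged dummy is the sum of the rare-level dummies, the merged model can never fit better than the full one. How far the two estimates of $\beta$ differ depends on how much the rare levels actually differ from one another, which this sketch simply assumes.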

kjetil b halvorsen
badmax
  • I guess you are losing something: you've gone from 100 categories to category_1, category_2 and others. But if it's valid to re-encode 3-100 as a single category, the model itself is still valid – David Waterworth Jan 25 '20 at 01:44
  • Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jan 25 '20 at 01:50

1 Answer


In general, there is no reason to believe that levels have the same effect on the outcome variable just because they are infrequent, so I would be doubtful of your proposal. Consider a regularization approach instead; for your problem I would try the fused lasso. There is some useful discussion in "Principled way of collapsing categorical variables with many levels?"
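As a rough illustration of the regularization idea, the sketch below keeps all 100 dummies and shrinks their coefficients with a ridge penalty. This is a simpler stand-in, not the fused lasso: a fused-lasso-type penalty would act on differences between level coefficients so that levels can merge, whereas ridge only pulls each level effect toward the common intercept. The simulated data, the penalty weights, and the function name `penalized_fit` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_levels = 20_000, 100

# Same level frequencies as in the question (rare levels split evenly: an assumption).
probs = np.r_[0.495, 0.495, np.full(98, 0.01 / 98)]
z = rng.choice(n_levels, size=n, p=probs)

delta = rng.normal(0.0, 2.0, size=n_levels)  # hypothetical level effects
x = rng.normal(size=n)
y = 1.5 * x + delta[z] + rng.normal(size=n)

# Design matrix: intercept, x, then one dummy per level (all 100 columns kept).
D = (z[:, None] == np.arange(n_levels)[None, :]).astype(float)
A = np.column_stack([np.ones(n), x, D])


def penalized_fit(lam):
    """Minimise ||y - A b||^2 + lam * sum(delta_i^2).

    The intercept and the slope on x are left unpenalised; only the
    level effects are shrunk.
    """
    penalty = np.diag(np.r_[0.0, 0.0, np.ones(n_levels)])
    return np.linalg.solve(A.T @ A + lam * penalty, A.T @ y)


b_weak = penalized_fit(lam=1.0)
b_strong = penalized_fit(lam=100.0)

print("beta (lam=1):  ", b_weak[1])
print("beta (lam=100):", b_strong[1])
print("size of level effects:", np.sum(b_weak[2:] ** 2), np.sum(b_strong[2:] ** 2))
```

Rare levels, which have few observations, are shrunk the most, while the two dominant levels are barely affected. In practice the penalty weight would be chosen by cross-validation rather than fixed by hand.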

kjetil b halvorsen