What happens when you merge dummy variables together?

Question

Suppose I want to regress $y$ on $X$, controlling for a categorical variable $z$ with $100$ different levels. I believe that linear regression is appropriate. Normally I would create a dummy $D_i$ for each category of $z$ and fit the following model by least squares:

$$y = X\beta + \sum_{i=1}^{100}\delta_i D_i + \epsilon$$

However, suppose $z$ is such that the first two categories, $D_1$ and $D_2$, each represent 49.5% of the data, while $D_{3:100}$ together account for only 1%. So I merge everything but the first two categories into one category, so that $z$ now has only 3 levels.

After doing this, can I still say I am controlling for $z$? It feels like since the individual coefficients for $D_{3:100}$ are no longer identifiable I'm losing something.
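Written out, the merge replaces the 98 rare dummies by their sum. The symbols $D_o$ and $\delta_o$ are labels introduced only for this sketch and are not part of the original notation:

$$D_o = \sum_{i=3}^{100} D_i, \qquad y = X\beta + \delta_1 D_1 + \delta_2 D_2 + \delta_o D_o + \epsilon$$

This is the original model with the restriction $\delta_3 = \delta_4 = \dots = \delta_{100} = \delta_o$ imposed, so the question amounts to asking whether that restriction is reasonable.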

I guess you are losing something: you've gone from 100 categories to category_1, category_2 and others. But if it's valid to re-encode 3-100 as a single category, the model itself is still valid — David Waterworth, Jan 25 '20 at 01:44
Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels — kjetil b halvorsen, Jan 25 '20 at 01:50

score 1 · Answer 1 · answered Jan 26 '20 at 15:55

In general, there is no reason to believe that levels have the same effect on the outcome variable just because they are infrequent, so I would be doubtful of your proposal. Instead, consider a regularization approach; for your problem I would try the fused lasso. There is some useful discussion in "Principled way of collapsing categorical variables with many levels?".
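To illustrate why the merge can matter, here is a minimal simulation sketch in Python with NumPy. Everything in it is an assumption made for illustration: the data are simulated, the level effects `delta` and their link to `x` are hypothetical and deliberately strong, and the helper `slope` is not from any library. It compares the estimated coefficient on `x` when all 100 dummies are used with the estimate after the rare levels are merged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20_000, 100

# Level shares as in the question: two levels at 49.5% each,
# and the remaining 98 levels sharing 1% of the data.
p = np.r_[0.495, 0.495, np.full(k - 2, 0.01 / (k - 2))]
p = p / p.sum()  # guard against floating-point rounding
z = rng.choice(k, size=n, p=p)

# Hypothetical data-generating process: the level effects differ,
# and x is correlated with the effect of the level it belongs to.
delta = rng.normal(0.0, 5.0, size=k)
x = delta[z] + rng.normal(size=n)
beta_true = 1.0
y = beta_true * x + delta[z] + rng.normal(size=n)


def slope(x, y, groups):
    """OLS coefficient on x with one dummy per observed group (no separate intercept)."""
    levels, codes = np.unique(groups, return_inverse=True)
    D = np.zeros((len(groups), len(levels)))
    D[np.arange(len(groups)), codes] = 1.0
    coef, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
    return coef[0]


# Merge every level except the first two into a single "other" level.
z_merged = np.where(z < 2, z, 2)

beta_full = slope(x, y, z)
beta_merged = slope(x, y, z_merged)
print(f"all dummies:   {beta_full:.3f}")
print(f"merged levels: {beta_merged:.3f}")
```

In this setup the merged fit can no longer absorb the differences among the rare levels, so part of their effect is attributed to `x`. How much this matters depends on the assumptions above: if `x` were unrelated to the level effects within the merged group, the estimate of $\beta$ should be much less affected, and because the rare levels hold only 1% of the data the distortion is limited in any case.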
