How to deal with 100+ levels in categorical variable in multiple linear regression?

Asked May 03 '20 at 13:09

Active May 03 '20 at 13:26

Viewed 26 times

I'm trying to model Y ~ x0 + x1 + x2 + x3 + x4, where Y is a continuous variable (cost), x0 is the intercept, x1 is a continuous variable (days), and x2-x4 are categorical variables with multiple levels. The categorical variable x2 has 156 levels, each representing a different diagnosis code (e.g. lung cancer, migraine). I want to include x2 in the model, but I don't want 156 different dummy variables, one per diagnosis code.

Here is a picture of the frequencies of each level (censored):

About 2/3 of the levels are significant at the 0.05 level when fitting Y ~ x2 alone.

What is the best way to deal with this kind of problem in R?

edited May 03 '20 at 13:26

kjetil b halvorsen


asked May 03 '20 at 13:09

Jam.Wil


Can you group the diagnosis codes into groups, thereby increasing your N for each group and maybe even providing more meaningful categoricals? e.g. group all the lung conditions, heart conditions, brain conditions, etc? – E. Rei May 03 '20 at 13:29
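A minimal sketch of the grouping idea in this comment, using base R only. The data frame, the diagnosis names, and the `group_map` lookup are all made up for illustration; a real mapping would need to come from clinical knowledge or an existing classification hierarchy.

```r
# Hypothetical toy data: codes, costs and days are invented for illustration.
df <- data.frame(
  cost = c(1200, 300, 950, 410, 2200, 180),
  days = c(5, 1, 4, 2, 9, 1),
  diagnosis = c("lung cancer", "migraine", "pneumonia",
                "epilepsy", "lung cancer", "migraine"),
  stringsAsFactors = FALSE
)

# Lookup from specific diagnosis to a broader group (an assumption, not a
# clinically validated grouping).
group_map <- c(
  "lung cancer" = "lung",
  "pneumonia"   = "lung",
  "migraine"    = "brain",
  "epilepsy"    = "brain"
)

# Replace each diagnosis by its group and store it as a factor.
df$diag_group <- factor(unname(group_map[df$diagnosis]))

# The model now estimates one effect per group instead of one per code.
fit <- lm(cost ~ days + diag_group, data = df)
nlevels(df$diag_group)
```

Any diagnosis missing from the lookup would become `NA` here, so it may be worth checking `anyNA(df$diag_group)` before fitting.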

You could look into various contrast coding methods. But most likely just keeping the top k categories and a single other column is good enough. – Georg Heiler May 03 '20 at 13:43
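The "top k categories plus a single other column" suggestion could look roughly like the following in base R. The data are simulated stand-ins (the `dx1`...`dx156` codes, the sample size, and `k = 10` are all assumptions), so this is a sketch of the mechanics rather than a recommendation for a particular `k`.

```r
set.seed(1)

# Simulated stand-in for the real data; every name and value is hypothetical.
n     <- 500
codes <- paste0("dx", 1:156)
x2    <- factor(sample(codes, n, replace = TRUE, prob = (156:1)^2))
days  <- rpois(n, 5)
cost  <- 100 * days + rnorm(n, sd = 50)

k <- 10  # number of levels to keep; a tuning choice, not a rule

# Names of the k most frequent levels.
top_k <- names(sort(table(x2), decreasing = TRUE))[seq_len(k)]

# Keep the k most frequent codes and pool everything else into "Other".
x2_lumped <- factor(ifelse(x2 %in% top_k, as.character(x2), "Other"))

# The regression now has k + 1 levels for x2 instead of 156.
fit <- lm(cost ~ days + x2_lumped)
nlevels(x2_lumped)
```

If the forcats package is available, `forcats::fct_lump_n(x2, n = k)` does a similar lumping in one call, although it handles ties in frequency differently from the sketch above.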

Maybe this answers your Q: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen May 03 '20 at 15:25


0 Answers