
I'm trying to model Y ~ x0 + x1 + x2 + x3 + x4, where Y is a continuous variable (cost), x0 is the intercept, x1 is a continuous variable (days), and x2-x4 are categorical variables with multiple levels. The categorical variable x2 has 156 levels (each level representing a different diagnosis code, e.g. lung cancer, migraine, etc.). I want to include x2 in the model, but I don't want 156 different dummy variables, each representing a diagnosis code.

Here is a picture of the frequencies of each level (censored): [image]

About 2/3 of the levels are significant at the 0.05 level when fitting Y ~ x2.

What is the best way to deal with this kind of problem in R?

kjetil b halvorsen
Jam.Wil
  • 1
    Can you combine the diagnosis codes into groups, thereby increasing your N for each group and maybe even providing more meaningful categories? E.g., group all the lung conditions, heart conditions, brain conditions, etc. – E. Rei May 03 '20 at 13:29
  • 1
    You could look into various contrast coding methods. But most likely just keeping the top k categories and a single "other" column is good enough. – Georg Heiler May 03 '20 at 13:43
  • 1
    Maybe this answers your Q: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen May 03 '20 at 15:25
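
The "top k categories plus a single other column" idea from the comments could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the names `dat`, `cost`, `days`, `x2_lumped` and the helper `lump_top_k` are hypothetical, and k = 10 is an arbitrary choice rather than a recommendation.

```r
# Simulated stand-in data; every name and value here is hypothetical.
set.seed(1)
n <- 2000
codes <- sprintf("dx%03d", 1:156)      # 156 made-up diagnosis codes
probs <- 1 / (1:156)                   # skewed frequencies: a few codes dominate
probs <- probs / sum(probs)
dat <- data.frame(
  days = rpois(n, 5),
  x2   = factor(sample(codes, n, replace = TRUE, prob = probs), levels = codes)
)
dat$cost <- 100 + 20 * dat$days + rnorm(n, sd = 10)

# Keep the k most frequent levels of a factor and pool the rest into `other`.
# Missing values are left as NA rather than being moved into `other`.
lump_top_k <- function(f, k, other = "Other") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(k, length(counts)))]
  out <- as.character(f)
  out[!is.na(out) & !(out %in% keep)] <- other
  factor(out, levels = c(keep, other))
}

dat$x2_lumped <- lump_top_k(dat$x2, k = 10)
nlevels(dat$x2_lumped)   # 11: ten kept codes plus "Other"

# The lumped factor now contributes 10 dummy variables instead of 155.
fit <- lm(cost ~ days + x2_lumped, data = dat)
```

Here the levels are ranked by frequency only, not by their relationship with the outcome. If the forcats package is available, `forcats::fct_lump_n()` does a similar job. Grouping codes by clinical meaning, as suggested in the first comment, would instead need a hand-built lookup table from code to group.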
