I'm trying to model Y ~ x0 + x1 + x2 + x3 + x4, where Y is a continuous variable (cost), x0 is the intercept, x1 is a continuous variable (days), and x2 to x4 are categorical variables with multiple levels. The categorical variable x2 has 156 levels, each representing a different diagnosis code (e.g. lung cancer, migraine). I want to include x2 in the model, but I don't want 156 different dummy variables, one per diagnosis code.
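To make the setup concrete, here is a minimal sketch of the model in R. The data are simulated placeholders, not my real data: the column names follow the formula above, while the level labels (`dx001`, ...), the sample size, and the distributions are assumptions made purely for illustration. Note that with an intercept and R's default treatment contrasts, a 156-level factor expands to 155 dummy columns, because one level is absorbed as the reference.

```r
# Simulated stand-in data; none of these values come from the real data set.
set.seed(1)
n <- 5000
codes <- sprintf("dx%03d", 1:156)   # hypothetical diagnosis-code labels

d <- data.frame(
  x1 = rpois(n, 5),                                                   # days
  x2 = factor(sample(codes, n, replace = TRUE), levels = codes),      # diagnosis code
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),           # other categorical
  x4 = factor(sample(c("u", "v"), n, replace = TRUE))                 # other categorical
)
d$Y <- 100 + 20 * d$x1 + rnorm(n, sd = 50)                            # cost

# R adds the intercept (x0) automatically, so it is not written in the formula.
fit <- lm(Y ~ x1 + x2 + x3 + x4, data = d)

# Count how many design-matrix columns the factor x2 expands into.
X <- model.matrix(fit)
n_levels_x2  <- nlevels(d$x2)
n_dummies_x2 <- sum(grepl("^x2", colnames(X)))
print(c(levels = n_levels_x2, dummies = n_dummies_x2))
```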
Here is a picture of the frequencies of each level (censored):
About 2/3 of the levels are significant at the 0.05 level when I fit Y ~ x2.
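For reference, this is roughly how such a count can be obtained. It is a sketch on simulated placeholder data (the labels, sample size, and effect sizes are assumptions, not my real values). Under R's default treatment contrasts, each p-value tests whether that level differs from the reference level (the first level), not whether the level matters overall.

```r
# Simulated stand-in data; the effect sizes below are made up for illustration.
set.seed(2)
codes <- sprintf("dx%03d", 1:156)   # hypothetical diagnosis-code labels
x2 <- factor(sample(codes, 5000, replace = TRUE), levels = codes)
Y  <- rnorm(5000, mean = 1000 + 10 * as.integer(x2), sd = 200)

fit2 <- lm(Y ~ x2)

# Per-level p-values, dropping the intercept row.
# Each one compares a level against the reference level (the first level).
p <- summary(fit2)$coefficients[-1, "Pr(>|t|)"]

# Share of non-reference levels with p < 0.05.
share_significant <- mean(p < 0.05)
print(share_significant)
```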
What is the best way to deal with this kind of problem in R?