How to handle categorical predictors with too many levels?

Question

I think it may be a problem if we directly use dummy variables for a categorical predictor with hundreds of levels.

I have found one solution in the book 'The Elements of Statistical Learning' (p.329), in the section on classification trees. Specifically, the solution orders the levels of the categorical predictor by the number of occurrences of each level in one class, and then treats the predictor as an ordered predictor.
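The ordering trick described above can be sketched in a few lines. This is only a minimal illustration, assuming a binary (0/1) outcome and assuming the levels are ranked by the share of each level's observations that fall in class 1; the function name and the toy county data are made up for the example.

```python
from collections import defaultdict

def order_levels_by_class_rate(levels, labels):
    """Map each level to its rank when levels are sorted by the share of 1-labels."""
    ones = defaultdict(int)
    totals = defaultdict(int)
    for lev, lab in zip(levels, labels):
        totals[lev] += 1
        ones[lev] += lab  # labels are assumed to be 0 or 1
    rate = {lev: ones[lev] / totals[lev] for lev in totals}
    # Sort by rate, then by level name so ties are broken deterministically.
    ordered = sorted(rate, key=lambda lev: (rate[lev], lev))
    return {lev: rank for rank, lev in enumerate(ordered)}

# Hypothetical toy data: county of each respondent and a Yes(1)/No(0) answer.
counties = ["a", "a", "b", "b", "b", "c", "c"]
answers = [1, 1, 0, 1, 0, 0, 0]

rank = order_levels_by_class_rate(counties, answers)
encoded = [rank[c] for c in counties]  # integer codes a tree can split on
```

A tree can then search splits on `encoded` as if it were an ordered predictor, which avoids examining every possible grouping of the levels.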

I wonder what the proper ways of handling categorical predictors with too many levels would be for models other than classification trees, such as linear regression.

I found an old post asking a similar question, but no answers have been posted:

Categorical logit Predictor with too many different levels

1) Can you provide us with the categorical predictor of interest (or a similar example) so that we know what these "hundreds of levels" mean? 2) What is the distribution of the outcome throughout these levels? Thanks! — Matt Reichenbach, Aug 21 '13 at 05:13
Suppose we have a data set collecting the answer (Yes or No) to a question from participants in different regions, represented by their county name. Is it possible to add the factor county name as a predictor of the probability that people answer 'Yes'? I just made up this example and I hope it makes sense. — Jerry, Aug 21 '13 at 05:50
Your country-scenario seems perfectly suited for a multilevel aka hierarchical aka mixed model, where participants would be located on level 1 and countries on level 2 treating country as a random factor. If the example you gave matches your actual research question, go for multilevel; if not, maybe you can provide an example closer to your question? — hplieninger, Aug 21 '13 at 06:23
@hajöpe, I can understand your reasoning for using a mixed model, because I tend to believe the collected responses will be correlated within each county, and using county as a random effect term could model that correlation. But if I use it as a random effect, I would be estimating the variance of the random effect (suppose I add the effect as an intercept, which is assumed to be normally distributed), so I could not give a prediction for a specific county. Also, can you explain a bit more why a random effects model addresses the problem of predictors with too many levels? — Jerry, Aug 21 '13 at 07:35
You can add country-level predictors to your model (e.g., size, wealth, continent), but these can - without another superordinate level - only be estimated as fixed effects, i.e., the slope is modeled for ALL countries. Having hypotheses and models for every single country probably makes no sense. Either focus on a few, or build one multilevel model including all countries with broader hypotheses (e.g., country wealth increases the prob of people answering 'yes'). — hplieninger, Aug 21 '13 at 08:09
Fused lasso or: https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 — kjetil b halvorsen, May 16 '17 at 23:41

score 10 · Accepted Answer · answered Aug 21 '13 at 11:49

I can't see that ordering the levels by frequency creates an ordinal variable.

Shrinkage is necessary to deal with this problem, either through penalized maximum likelihood estimation (e.g., the ols and lrm functions in the R rms package, which apply a quadratic (ridge, L2) penalty) or through random effects. You can easily get predictions for individual levels from the penalized fit, or from BLUPs in the mixed-effects modeling context.
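To make the shrinkage idea concrete for the linear regression case, here is a minimal sketch of a ridge-penalized one-way model, y = mu + a[level] + noise, with the intercept left unpenalized. It is not the rms implementation; the function name, the penalty `lam`, and the toy data are assumptions chosen for illustration. For this simple model the penalized least-squares solution has a closed form: each level's deviation from the intercept is multiplied by n_j / (n_j + lam), so sparse levels are pulled hardest toward the overall level.

```python
from collections import defaultdict

def ridge_level_effects(levels, y, lam):
    """Minimize sum (y - mu - a[level])^2 + lam * sum_j a_j^2 (mu unpenalized).

    Returns the intercept mu and a dict of shrunken level effects a_j.
    lam is assumed to be > 0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lev, val in zip(levels, y):
        sums[lev] += val
        counts[lev] += 1
    means = {lev: sums[lev] / counts[lev] for lev in sums}
    # Shrinkage factor n_j / (n_j + lam): near 1 for large levels, small for sparse ones.
    shrink = {lev: counts[lev] / (counts[lev] + lam) for lev in counts}
    # The unpenalized intercept is the shrink-weighted average of the level means.
    mu = sum(shrink[lev] * means[lev] for lev in means) / sum(shrink.values())
    effects = {lev: shrink[lev] * (means[lev] - mu) for lev in means}
    return mu, effects

# Hypothetical toy data: level "A" has four observations, level "B" only one.
levels = ["A", "A", "A", "A", "B"]
y = [1.0, 1.0, 1.0, 1.0, 5.0]

mu, effects = ridge_level_effects(levels, y, lam=1.0)
pred = {lev: mu + effects[lev] for lev in effects}  # per-level predictions
```

Each level still gets its own prediction, `mu + effects[level]`. If `lam` is set to the ratio of the residual variance to the between-level variance, these shrunken effects coincide with the BLUPs of a random-intercept model with those variances, which is one way to see why the penalized and mixed-model routes behave alike here.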

Frank Harrell

Thanks for this clarification. I really appreciate it. Perhaps the ordering approach only works for classification trees. For your reference, this approach is explained on page 329 of the book The Elements of Statistical Learning (http://www-stat.stanford.edu/~tibs/ElemStatLearn/) – Jerry Aug 21 '13 at 17:10

score 1 · Answer 2 · answered Aug 21 '13 at 08:40

If I understand your problem (and the old post you linked to), you are saying that some of the levels have too little data to estimate their effects accurately.

So either reduce the number of levels by hand (creating a new "other" level for all the levels with insufficient data), or use L2/L1 regularisation.
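The "other" level idea can be sketched as follows. This is a minimal illustration only: the threshold `min_count`, the label `"other"`, and the toy data are assumptions, and the right cutoff depends on how much data a level needs for a stable estimate.

```python
from collections import Counter

def lump_rare_levels(levels, min_count, other="other"):
    """Replace every level seen fewer than min_count times with a single pooled label."""
    counts = Counter(levels)
    return [lev if counts[lev] >= min_count else other for lev in levels]

# Hypothetical toy data: "c" and "d" each appear only once.
counties = ["a", "a", "a", "b", "c", "b", "d"]
lumped = lump_rare_levels(counties, min_count=2)
```

The pooled predictor can then be dummy-coded as usual, with far fewer columns than the original.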
