Linear Regression and High Dimensional Categorical Data

Question

I've read that mean encoding is useful for classification tasks with high dimensional categorical data.

My question:

What kinds of encodings are effective for high dimensional categorical data in linear regression tasks? For example, can mean encoding be adapted?

Some possible applications:

Assigning credit limit using job title (we need to encode job title)
Predicting wait time at a nightclub from shoe type (we need to encode shoe type)

See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels — kjetil b halvorsen, Jun 25 '19 at 18:54

score 1 · Answer 1 · answered Jun 26 '19 at 23:22

Mean encoding seems to be popular in machine learning, but you should try better approaches. Dummy coding (or "one-hot" encoding) with many levels uses a lot of memory, but that problem can be solved with sparse matrices. Even so, one parameter per level may still be too many parameters, so some kind of collapsing of levels could be useful; see Principled way of collapsing categorical variables with many levels?
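
To make the two ideas above concrete, here is a minimal sketch in plain Python. It assumes a simple frequency rule for collapsing (levels seen fewer than `min_count` times are merged into one shared label), which is only one of several possible collapsing strategies, and it stores the dummy matrix as a list of (row, column) coordinates of its nonzero entries, which is the same information a sparse matrix library such as `scipy.sparse` would hold. The function names, the threshold, and the job titles are all made up for illustration.

```python
from collections import Counter


def collapse_rare_levels(values, min_count=2, other="__other__"):
    """Replace levels seen fewer than min_count times with a shared label.

    The frequency threshold is an assumption made for this sketch; it is
    not a recommendation from the answer above.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def sparse_one_hot(values):
    """One-hot encode values, keeping only the nonzero positions.

    Returns (coords, levels), where coords is a list of (row, column)
    pairs marking the 1s; every other cell of the dummy matrix is 0.
    """
    levels = sorted(set(values))
    column = {level: j for j, level in enumerate(levels)}
    coords = [(i, column[v]) for i, v in enumerate(values)]
    return coords, levels


# Hypothetical job titles, invented for illustration only.
job_titles = ["nurse", "welder", "nurse", "actuary", "welder", "falconer"]

collapsed = collapse_rare_levels(job_titles, min_count=2)
coords, levels = sparse_one_hot(collapsed)

print(levels)  # the columns of the dummy matrix after collapsing
print(coords)  # one (row, column) pair per observation
```

Storage grows with the number of observations rather than with observations times levels, and collapsing reduces the number of coefficients the regression has to estimate.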

What you should do might also depend on which kind of model you are using (you didn't tell us). For one idea see Strange encoding for categorical features.
