When/how should I bucket/recode/group certain categorical levels in an "others" level?

Asked May 21 '15 at 08:59

Active May 17 '17 at 11:05

Imagine I have a dataset with a categorical variable with many levels, and I want to use this dataset for binary (positive/negative) supervised learning.

In this categorical variable certain levels have many observations, while some others have very few, perhaps even only one observation.

When does it make sense to group certain levels of a variable in an "others" group and treat them as a single one in the supervised learning problem? Which criteria and methods could I use? Does it depend only on the size of each level, or something else? How could this affect the supervised learning results?

I think it might depend not only on the number of observations in each level, but also on how many positive/negative examples belong to that level. For example, if I have a level with only 10 observations but all 10 of them are positive, I'd be less inclined to put this level into the "others" group, whereas if only 5 of them were positive I would.
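The heuristic above could be sketched roughly as follows. This is only an illustration under assumed thresholds, not an established method: the function name `group_rare_levels`, the parameters `min_count`, `min_pure_count` and `purity`, and the toy data are all hypothetical. A level is kept if it is large, or if it is moderately sized and almost entirely one class; everything else is recoded to "others".

```python
from collections import Counter, defaultdict


def group_rare_levels(levels, labels, min_count=20, min_pure_count=10,
                      purity=0.9, other="others"):
    """Recode rare, uninformative levels of a categorical variable to `other`.

    levels: list of category values, one per observation.
    labels: list of 0/1 targets aligned with `levels`.
    All thresholds are illustrative assumptions, not recommended defaults.
    """
    counts = Counter(levels)
    positives = defaultdict(int)
    for level, y in zip(levels, labels):
        positives[level] += int(y)

    keep = set()
    for level, n in counts.items():
        rate = positives[level] / n  # share of positive examples in this level
        is_large = n >= min_count
        # Require a minimum size here too; otherwise a level with a single
        # observation would always look "pure".
        is_pure = n >= min_pure_count and (rate >= purity or rate <= 1 - purity)
        if is_large or is_pure:
            keep.add(level)

    return [lvl if lvl in keep else other for lvl in levels]


# Toy data: "a" is large and mixed, "b" is small but all positive,
# "c" is small and mixed, "d" has a single observation.
levels = ["a"] * 30 + ["b"] * 10 + ["c"] * 10 + ["d"]
labels = [1, 0] * 15 + [1] * 10 + [1] * 5 + [0] * 5 + [1]
recoded = group_rare_levels(levels, labels)
print(Counter(recoded))
```

Because this rule looks at the target, the set of kept levels would have to be determined on the training data only and then applied unchanged to validation and test data; otherwise the grouping leaks label information.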

edited May 17 '17 at 11:05

kjetil b halvorsen

asked May 21 '15 at 08:59

dukebody

Are there other variables in your data set that you plan to use, or just these two? – Eric Farng May 21 '15 at 20:33
There are more variables I plan to use. Mainly categorical ones. – dukebody May 22 '15 at 07:16
Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and links in there. – kjetil b halvorsen May 17 '17 at 11:05

When/how should I bucket/recode/group certain categorical levels in an "others" level?

0 Answers