Imagine I have a dataset with a categorical variable with many levels, and I want to use this dataset for binary (positive/negative) supervised learning.
Some levels of this variable have many observations, while others have very few, perhaps only a single observation.
When does it make sense to merge certain levels of a variable into an "others" group and treat them as a single level in the supervised learning problem? Which criteria and methods could I use? Does it depend only on the size of each level, or on something else? How could this affect the supervised learning results?
I'm thinking it might depend not only on the number of observations in each level, but also on how many positive/negative examples belong to that level. For example, if I have a level with only 10 observations but all 10 of them are positive, I'd be less inclined to put this level into the "others" group, whereas if only 5 of the 10 were positive, I would.
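To make the kind of rule I have in mind concrete, here is a minimal Python sketch. It is only one possible way to formalize the intuition, not an established method: the function name `group_rare_levels`, the thresholds `min_count=30` and `alpha=0.05`, and the choice of an exact binomial test against the overall positive rate are all my own assumptions for illustration. The toy data mirrors the example above: one large level, one small level that is all positive, and one small level that is half positive.

```python
from collections import defaultdict
from math import comb


def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))


def group_rare_levels(levels, labels, min_count=30, alpha=0.05):
    """Return a dict mapping each level to itself or to "others".

    Hypothetical rule (an assumption, not a standard recipe): keep a level
    if it has at least `min_count` observations, or if its positive rate
    differs from the overall positive rate (binomial test, p < alpha).
    """
    n_obs = defaultdict(int)
    n_pos = defaultdict(int)
    for lvl, y in zip(levels, labels):
        n_obs[lvl] += 1
        n_pos[lvl] += y  # labels are assumed to be 0/1

    base_rate = sum(labels) / len(labels)

    mapping = {}
    for lvl, n in n_obs.items():
        big_enough = n >= min_count
        informative = binom_two_sided_p(n_pos[lvl], n, base_rate) < alpha
        mapping[lvl] = lvl if (big_enough or informative) else "others"
    return mapping


# Toy data: A is large; B is small but all positive; C is small and 5/10 positive.
levels = ["A"] * 100 + ["B"] * 10 + ["C"] * 10
labels = [1] * 50 + [0] * 50 + [1] * 10 + [1] * 5 + [0] * 5

mapping = group_rare_levels(levels, labels)
print(mapping)
```

With these thresholds, level B is kept as its own level while level C is merged into "others". Because the rule looks at the labels, I assume the mapping would have to be derived from the training data only and then applied unchanged to any validation or test data.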