I am building binary classification models to fit my data and make predictions. Some of the variables have many levels, like the workclass variable below, so I am considering combining some levels into one. For example, I could combine "Federal-gov", "Local-gov", and "State-gov" into a single level called "gov". However, I have two questions.
- Under what circumstances should I combine levels? How can I tell whether doing so is likely to improve my model?
- If I think combining will be useful, how can I tell whether some levels are similar enough to be combined? Should I run any tests? How do I check the similarity?
Thanks.
> summary(census$workclass)
     Federal-gov        Local-gov     Never-worked          Private     Self-emp-inc Self-emp-not-inc
             960             2093                7            22696             1116             2541
       State-gov      Without-pay             NA's
            1298               14             1836
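
For concreteness, here is a minimal base R sketch of the kind of merge described above. The `workclass` vector is a toy stand-in rebuilt from the counts shown in the summary, not the real `census` data, and the names `counts`, `gov_levels`, and `workclass_merged` are made up for illustration.

```r
# Toy stand-in for census$workclass, rebuilt from the summary counts above
# (assumption: only the level counts matter for this illustration).
counts <- c("Federal-gov" = 960, "Local-gov" = 2093, "Never-worked" = 7,
            "Private" = 22696, "Self-emp-inc" = 1116, "Self-emp-not-inc" = 2541,
            "State-gov" = 1298, "Without-pay" = 14)
workclass <- factor(c(rep(names(counts), counts), rep(NA, 1836)))

# Assigning the same name to several levels merges them into one level.
gov_levels <- c("Federal-gov", "Local-gov", "State-gov")
workclass_merged <- workclass
levels(workclass_merged)[levels(workclass_merged) %in% gov_levels] <- "gov"

# Tabulate the merged factor, keeping the NA count visible.
summary(workclass_merged)
```

In this sketch the merged "gov" level holds 960 + 2093 + 1298 = 4351 observations, and the missing values are left untouched.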