I am building binary classification models to fit my data and make predictions. Some of the variables have many levels, like the workclass variable below, so I am considering combining some levels into one. For example, I could combine "Federal-gov", "Local-gov", and "State-gov" into a single level called "gov". However, I have two questions.

  1. Under what circumstances should I combine levels? How can I tell whether doing so is likely to improve my model?
  2. If I think combining will be useful, how can I tell whether some levels are similar enough to combine? Should I run any tests? How do I check the similarity?

Thanks.

> summary(census$workclass)
     Federal-gov        Local-gov     Never-worked          Private     Self-emp-inc Self-emp-not-inc 
             960             2093                7            22696             1116             2541 
       State-gov      Without-pay             NA's 
            1298               14             1836 
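
For concreteness, the kind of merge I have in mind could be sketched as follows. This is only an illustration: the `workclass` vector here is a small made-up stand-in for `census$workclass` (the counts are not the real ones), and `gov_levels` and `collapsed` are names I introduce just for this example.

```r
# Toy stand-in for census$workclass; counts are illustrative, not the real data.
workclass <- factor(c(
  rep("Federal-gov", 3), rep("Local-gov", 5), rep("State-gov", 4),
  rep("Private", 10), rep("Self-emp-inc", 2), rep("Self-emp-not-inc", 3),
  NA, NA
))

# Levels I am thinking of merging (hypothetical grouping).
gov_levels <- c("Federal-gov", "Local-gov", "State-gov")

# Renaming several levels to the same label merges them into one level.
collapsed <- workclass
levels(collapsed)[levels(collapsed) %in% gov_levels] <- "gov"

# Compare the level counts before and after the merge; NAs are left untouched.
print(table(workclass, useNA = "ifany"))
print(table(collapsed, useNA = "ifany"))
```

The merge itself is easy; what I am unsure about is how to justify it.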
kjetil b halvorsen
Evan Liu
