I have a dataset with a heavily skewed categorical column. Let's imagine something like this:
Type - AmountObservations
C1 - 10000
C2 - 9500
C3 - 8000
C4 - 2000
C5 - 500
C6 - 500
C7 - 10
C8 - 10
C9 - 9
C10 - 9
C11 - 9
C12 - 9
...
C299 - 1
C300 - 1
For this particular variable, observations can take 300 possible values (from C1 to C300), but most of the observations are of type C1, C2 or C3.
To reduce the number of candidate splits at a node in a decision tree, I was planning to group the possible values into bins.
I know that binning is a kind of compression, so it implies information loss. The fewer the bins, the greater the loss of information.
Initially I was planning to make 10 bins of 30 categories each, like:
C1-C30 -> bin1
C31-C60 -> bin2
...
C271-C300 -> bin10
But then I will have a really skewed dataset, where almost 99% of the observations will fall into bin1.
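To make the skew concrete, here is a minimal sketch of this fixed-width grouping, assuming pandas is available. It uses only the counts listed explicitly above (the rows elided by "..." are left out, so the resulting shares are illustrative only), and `equal_width_bin` is a hypothetical helper name, not part of any library:

```python
import pandas as pd

# Only the counts listed explicitly above; rows elided by "..." are omitted.
counts = pd.Series({
    "C1": 10000, "C2": 9500, "C3": 8000, "C4": 2000, "C5": 500, "C6": 500,
    "C7": 10, "C8": 10, "C9": 9, "C10": 9, "C11": 9, "C12": 9,
    "C299": 1, "C300": 1,
})

def equal_width_bin(category: str, width: int = 30) -> str:
    """Map 'C<n>' to 'bin<k>', each bin covering `width` consecutive categories."""
    n = int(category[1:])  # strip the leading 'C'
    return f"bin{(n - 1) // width + 1}"

# Sum the observation counts per bin, then convert to shares.
binned = counts.groupby(counts.index.map(equal_width_bin)).sum()
share = binned / binned.sum()
print(share.sort_values(ascending=False))
```

Because the three largest categories all land in the same bin, that bin dominates no matter how the remaining categories are distributed.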
Another option would be to have:
C1 -> bin1
C2 -> bin2
C3 -> bin3
C4-C300 -> bin4
With this approach I will have a less skewed dataset, but I will go from 300 categories to 4, which I believe may be a big loss of information.
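For illustration, this second scheme (keep the most frequent categories, pool everything else) could be sketched as follows, assuming pandas is available. Again only the explicitly listed counts are used, so the numbers are illustrative, and `top_k_plus_other` and the `"other"` label are hypothetical names of my own:

```python
import pandas as pd

# Only the counts listed explicitly above; rows elided by "..." are omitted.
counts = pd.Series({
    "C1": 10000, "C2": 9500, "C3": 8000, "C4": 2000, "C5": 500, "C6": 500,
    "C7": 10, "C8": 10, "C9": 9, "C10": 9, "C11": 9, "C12": 9,
    "C299": 1, "C300": 1,
})

def top_k_plus_other(counts: pd.Series, k: int = 3,
                     other_label: str = "other") -> pd.Series:
    """Keep the k most frequent categories and pool the rest into one bin."""
    top = counts.nlargest(k)
    binned = top.copy()
    # Everything outside the top k is summed into a single catch-all bin.
    binned[other_label] = counts.drop(top.index).sum()
    return binned

binned = top_k_plus_other(counts, k=3)
share = binned / binned.sum()
print(share)
```

Raising `k` trades a few more bins for less pooling, so the same helper covers the whole range between the 4-bin extreme and keeping every category separate.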
Which approach is better? If there is another option I haven't thought of, please share.