Decision tree: which should I avoid first, a skewed dataset or too few bins?

Question

I have a dataset with a rather skewed categorical column. Let's imagine something like this:

Type - Number of observations
C1 - 10000
C2 - 9500
C3 - 8000
C4 - 2000
C5 - 500
C6 - 500
C7 - 10
C8 - 10
C9 - 9
C10 - 9
C11 - 9
C12 - 9
...
C299 - 1
C300 - 1

This variable can take 300 possible values (C1 to C300), but most observations are of type C1, C2, or C3.

To reduce the number of splits at a node of a decision tree, I was planning to group the possible values into bins.

I know binning is a kind of compression, so it implies information loss. The fewer the bins, the greater the loss.

Initially I was planning to make 10 bins of 30 categories each, like this:

C1-C30 -> bin1
C31-C60 -> bin2
...
C271-C300 -> bin10

But then I would have a really skewed dataset, with almost 99% of the observations in bin1.

Another option would be to have:

C1 -> bin1
C2 -> bin2
C3 -> bin3
C4-C300 -> bin4

With this approach I will have a less skewed dataset, but I will go from 300 categories to 4, which I believe may be a big loss of information.
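The trade-off between the two schemes can be put into numbers. The following is a minimal sketch, assuming only the counts actually listed above (the elided categories are left out, so the figures are illustrative rather than exact). The helper names `equal_width_bin`, `top_k_bin`, `rebin`, `largest_share` and `entropy_bits` are made up for this illustration. The entropy here describes only how spread out the binned column is; it says nothing about how much information about the target survives the binning.

```python
import math

# Counts exactly as listed in the question. The elided categories
# (the "..." between C12 and C299) are unknown and deliberately left out.
counts = {
    "C1": 10000, "C2": 9500, "C3": 8000, "C4": 2000, "C5": 500, "C6": 500,
    "C7": 10, "C8": 10, "C9": 9, "C10": 9, "C11": 9, "C12": 9,
    "C299": 1, "C300": 1,
}

def equal_width_bin(category, width=30):
    """First scheme: C1-C30 -> bin1, C31-C60 -> bin2, and so on."""
    index = int(category[1:])
    return f"bin{(index - 1) // width + 1}"

def top_k_bin(category, keep=("C1", "C2", "C3")):
    """Second scheme: each category in `keep` gets its own bin, the rest share one."""
    if category in keep:
        return f"bin{keep.index(category) + 1}"
    return f"bin{len(keep) + 1}"

def rebin(counts, assign):
    """Sum the observation counts of all categories mapped to the same bin."""
    binned = {}
    for category, n in counts.items():
        b = assign(category)
        binned[b] = binned.get(b, 0) + n
    return binned

def largest_share(binned):
    """Fraction of all observations that fall into the most populated bin."""
    return max(binned.values()) / sum(binned.values())

def entropy_bits(binned):
    """Shannon entropy (in bits) of the bin distribution."""
    total = sum(binned.values())
    return sum((n / total) * math.log2(total / n) for n in binned.values() if n > 0)

schemes = [
    ("no binning", lambda c: c),
    ("30 consecutive categories per bin", equal_width_bin),
    ("C1, C2, C3, rest", top_k_bin),
]
for name, assign in schemes:
    binned = rebin(counts, assign)
    print(f"{name}: {len(binned)} non-empty bins, "
          f"largest share {largest_share(binned):.3f}, "
          f"entropy {entropy_bits(binned):.3f} bits")
```

On these counts, grouping by consecutive index puts nearly everything into one bin, while keeping C1, C2 and C3 separate retains most of the column's spread with only four bins.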

Which approach is better? If there is another option I haven't thought of, please share.

Why do you want to simplify the number of splits in a node by combining bins? — Stephan Kolassa, Jul 31 '18 at 14:56
@StephanKolassa I thought it would make the algorithm work better. — Ignacio Alorre, Jul 31 '18 at 14:57
@StephanKolassa Besides, I read that a skewed column may also cause problems at a node. If I have these concepts wrong, please let me know. I am still new to this, and any hint is appreciated. — Ignacio Alorre, Jul 31 '18 at 15:02

score 1 · Accepted Answer · answered Jul 31 '18 at 15:18

Unbalanced classes are usually not a problem; see "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?" The key is to use a good metric to evaluate your classification. Use probabilistic predictions, and evaluate them using proper scoring rules. Using accuracy can mislead you badly; see "Why is accuracy not the best measure for assessing classification models?" and "Is accuracy an improper scoring rule in a binary classification setting?"
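To illustrate why accuracy can mislead on unbalanced classes, here is a minimal sketch on made-up data: 5 positives and 95 negatives, scored by two hypothetical sets of predicted probabilities. The labels, the probabilities and the function names are all assumptions for illustration. The Brier score and the log loss are two commonly used proper scoring rules (lower is better for both).

```python
import math

# Made-up, unbalanced labels: 5 positives followed by 95 negatives.
y = [1] * 5 + [0] * 95

# Hypothetical predictions. "flat" gives everyone the same low probability;
# "informed" gives the true positives a clearly higher probability.
p_flat = [0.01] * 100
p_informed = [0.40] * 5 + [0.02] * 95

def accuracy(y, p, threshold=0.5):
    """Share of correct hard classifications after thresholding the probabilities."""
    return sum((pi >= threshold) == bool(yi) for yi, pi in zip(y, p)) / len(y)

def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def log_loss(y, p):
    """Mean negative log-likelihood of the outcomes under the predictions."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

for name, p in [("flat", p_flat), ("informed", p_informed)]:
    print(f"{name}: accuracy={accuracy(y, p):.2f}, "
          f"Brier={brier_score(y, p):.4f}, log loss={log_loss(y, p):.4f}")
```

Both sets of predictions classify every case as negative at a 0.5 threshold, so they get the same accuracy, yet the two scoring rules clearly prefer the informed predictions.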

That said, whether your current binning is what you truly need depends on what you will do with the output of your model. Suppose your model classifies people into

"will repay the loan on time" (C1)
"will repay after a reminder phone call" (C2)
"will repay after a reminder letter" (C3)
"will not repay because of reason X" (C4)
"will not repay because of reason Y" (C5)
"will not repay because of reason Z" (C6)
etc.

In this case, you are really only interested in whether people fall into bins C1-3 or C4-300, so you should at least try (probabilistic) classifications with these combined bins. Note, however, that this is not necessarily driven by any statistical issue with unbalanced data, but by the process that consumes your classification.
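As a rough sketch of what working with combined bins could look like: the labels can be mapped to the two groups before fitting a model, or, if a model already outputs probabilities over the fine-grained classes, the probability of a combined bin is the sum of its members' probabilities. The probabilities and the names `fine_probs`, `coarse_label` and `coarse_probs` below are hypothetical.

```python
# Hypothetical predicted probabilities over fine-grained classes for one person
# (only six classes shown, for brevity).
fine_probs = {"C1": 0.55, "C2": 0.20, "C3": 0.10, "C4": 0.08, "C5": 0.04, "C6": 0.03}

# The classes that count as "repays" in the loan example.
REPAYS = {"C1", "C2", "C3"}

def coarse_label(label):
    """Map a fine-grained class to the combined bin the consumer cares about."""
    return "repays" if label in REPAYS else "does not repay"

def coarse_probs(fine):
    """Probability of each combined bin: the sum over the classes it contains."""
    out = {"repays": 0.0, "does not repay": 0.0}
    for label, p in fine.items():
        out[coarse_label(label)] += p
    return out

print(coarse_probs(fine_probs))
```

Whether to combine before or after modelling is a design choice; trying both and comparing them with a proper scoring rule is one way to decide.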

OK, thanks a lot for the feedback. So in that case, is it fine for a node in a decision tree to create 300 splits, one per possible category in the column? — Ignacio Alorre, Jul 31 '18 at 15:26
Based on your description, I see no *a priori* reason why there should be a problem on the statistical side. (300 bins may be harder to interpret.) — Stephan Kolassa, Jul 31 '18 at 16:08
