I want to build a binary classification tree to classify whether a person is working or not, and use the model for prediction. I have read that unbalanced data can be a problem. At what threshold does the imbalance become large enough to cause problems for the tree? Below are the table outputs for the variable I am trying to predict, showing how unbalanced it is. I want to build two trees, one for each of two different years. Is the data too unbalanced, or is it okay?

> table(testSet2002$Partizipation)

   0    1 
1031 2229 
> table(trainSet2002$Partizipation)

   0    1 
2361 5246 
> table(testSet2015$Partizipation)

   0    1 
1040 2210 
> table(trainSet2015$Partizipation)

   0    1 
2352 5265 
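
To put numbers on the imbalance, here is a small sketch that recomputes the class shares from the counts printed above. The `counts` matrix is typed in by hand for illustration, it is not read from the original data frames. In every set, class 0 makes up roughly 31 to 32% of the observations and class 1 roughly 68 to 69%.

```r
# Counts copied by hand from the table() outputs above (assumption: no typos)
counts <- rbind(
  testSet2002  = c(`0` = 1031, `1` = 2229),
  trainSet2002 = c(`0` = 2361, `1` = 5246),
  testSet2015  = c(`0` = 1040, `1` = 2210),
  trainSet2015 = c(`0` = 2352, `1` = 5265)
)

# Row-wise proportions: share of each class within each data set
shares <- prop.table(counts, margin = 1)
round(shares, 3)
```

With the real data, `prop.table(table(trainSet2002$Partizipation))` should give the same shares for one set directly.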
  • See also [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jan 29 '19 at 10:11
  • Difficult to say without further information. One very important aspect, though, is the criterion you use to split new branches, since it is very strongly influenced by the distribution of your data. See "Learning Decision Trees for Unbalanced Data" for a discussion. https://www3.nd.edu/~nchawla/papers/ECML08.pdf – jpmuc Jan 30 '19 at 11:32
  • @jpmuc Does an implementation of the Hellinger distance splitting criterion exist in R? – MasterStudent1992 Feb 05 '19 at 10:52
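
For reference, a minimal sketch of the Hellinger distance for a single candidate binary split, as I understand the definition in the paper linked above. The function name `hellinger_split` and the toy data are hypothetical, and this is not part of any R package; it only illustrates that the measure compares the within-class shares sent to each branch, so it should not depend on how many observations each class has overall.

```r
# Hypothetical helper (not from a package): Hellinger distance between the
# two class-conditional distributions over the branches of a binary split.
# y:      vector of 0/1 class labels
# branch: logical vector, TRUE if the observation goes to the left branch
hellinger_split <- function(y, branch) {
  pos <- y == 1
  neg <- y == 0
  # Share of each class that falls into the left and the right branch
  p_pos <- c(mean(branch[pos]), mean(!branch[pos]))
  p_neg <- c(mean(branch[neg]), mean(!branch[neg]))
  sqrt(sum((sqrt(p_pos) - sqrt(p_neg))^2))
}

# Toy data, made up for illustration only
y      <- c(1, 1, 1, 1, 0, 0)
branch <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
hellinger_split(y, branch)
```

A split that separates the classes perfectly gives the maximum value `sqrt(2)`, and a split that sends the same share of both classes to each branch gives 0.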