I am currently working on a dataset in RStudio and, as the title might suggest, I am having difficulty creating the tree I'm looking for. My dataset consists of 122151 observations with 33 variables. The dataset is already properly prepared for a tree model (no empty values, binary values, maxes/means/mins).

For ease's sake, let's call the dataset df1, the dependent variable x1, and the predicting variables y1, y2, y3,....y32

Using the tree package, I set up the following code:

     tree <- tree(x1 ~ y2+y3+y4.......+y32, data=df1, model=FALSE)

This, however, results in a tree with only one node, as seen below, whereas it is supposed to give a tree with roughly 17 nodes.

http://i57.tinypic.com/2lvelad.jpg

I suspect the problem is the distribution of the dependent variable, namely 341 yes (1) and 121000+ no (0). This seems to mess up the predictive part, so the tree is effectively neglected.

Is there any way to set an option that gives the two values of the dependent variable a 50% chance of occurring, so the tree actually grows rather than ending up with a single node?

Chen Orihara

2 Answers

What you are seeing is a typical class imbalance problem, and decision trees do not deal very well with that.

I would point you to this very useful answer on how to deal with your problem.

In short, you can try under-sampling your dominant class, over-sampling your under-represented class, or other techniques like cost-sensitive training.
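As a rough illustration of two of these options, here is a minimal sketch in base R. Everything in it is an assumption made for the example: the data are synthetic stand-ins for df1 (only the names x1, y2, y3 come from the question), the 10:1 misclassification cost is arbitrary, and rpart is used for the cost-sensitive fit only because it ships with R and accepts a loss matrix.

```r
library(rpart)

# Synthetic stand-in for df1 (hypothetical data, not the asker's dataset)
set.seed(42)
n <- 20000
df1 <- data.frame(y2 = rnorm(n), y3 = rnorm(n))
# Rare positive class whose probability depends on y2
df1$x1 <- factor(rbinom(n, 1, plogis(-6 + 2 * df1$y2)), levels = c(0, 1))

# Option 1: under-sample the dominant class down to the size of the rare class
pos <- which(df1$x1 == "1")
neg <- which(df1$x1 == "0")
balanced <- df1[c(pos, sample(neg, length(pos))), ]
print(table(balanced$x1))

# Fit with the tree package on the balanced data, if it is installed
if (requireNamespace("tree", quietly = TRUE)) {
  fit_balanced <- tree::tree(x1 ~ y2 + y3, data = balanced)
  print(fit_balanced)
}

# Option 2: cost-sensitive training on the full data with rpart.
# Rows of the loss matrix are the true class, columns the predicted class;
# here a missed "1" is (arbitrarily) ten times as costly as a false alarm.
loss <- matrix(c(0, 10, 1, 0), nrow = 2)
fit_cost <- rpart(x1 ~ y2 + y3, data = df1, method = "class",
                  parms = list(loss = loss))
print(fit_cost)
```

Keep in mind that a tree trained on re-balanced data will overestimate the probability of the rare class, so it should be evaluated on data with the original class proportions.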

Bar
  • some packages also include cost functions in the tree. – charles Apr 08 '15 at 22:48
  • I see, I've been wondering how to deal with imbalances in R. In Statistica it never posed a problem. It looks like an interesting point to keep in mind when working with these datasets; I shall give it a try. – Chen Orihara Apr 09 '15 at 08:12
I think there are two possible issues here:

-the first is to make sure that x1 is numeric, so that a regression tree is built.

-assuming you're building a regression tree, the second is to play with the cp parameter in rpart.control (from the rpart package; take a look at the documentation). This parameter controls tree pruning, and it's likely that you need a lower cp just to see more nodes and branches, but beware of overfitting. Try 0.001, for example.
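To make the effect of cp concrete, here is a small sketch using rpart, where the control function is called rpart.control. The data are synthetic and purely illustrative (only the names x1, y2, y3 follow the question), so the node counts it prints say nothing about the real df1.

```r
library(rpart)

# Synthetic stand-in for df1 with a numeric 0/1 response (hypothetical data)
set.seed(1)
n <- 5000
df1 <- data.frame(y2 = rnorm(n), y3 = rnorm(n))
df1$x1 <- rbinom(n, 1, plogis(-5 + 1.5 * df1$y2))

# Regression tree with the default complexity parameter (cp = 0.01)
fit_default <- rpart(x1 ~ y2 + y3, data = df1, method = "anova")

# Same model with a lower cp, which lets weaker splits survive
fit_low_cp <- rpart(x1 ~ y2 + y3, data = df1, method = "anova",
                    control = rpart.control(cp = 0.001))

# Each row of $frame is one node, so this compares tree sizes
cat("nodes with default cp:", nrow(fit_default$frame), "\n")
cat("nodes with cp = 0.001:", nrow(fit_low_cp$frame), "\n")

# Cross-validated error per cp value, useful for judging overfitting
printcp(fit_low_cp)
```

If you stay with the tree package instead, the closest analogue is probably the mindev argument of tree.control, passed through the control argument of tree(); lowering it should likewise allow more splits.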

D.Castro