
I am running a random forest on a binary classification variable using about 30 explanatory variables. Please have a look at the screenshot below: the out-of-bag error for class 2 increases as the number of trees increases. This looks weird, as I would expect the out-of-bag error to decrease as more trees are added.

Can somebody explain this or point me in the right direction?

[Screenshot: per-class OOB error rate plotted against the number of trees, with the class 2 error rising]
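
For context, a minimal sketch of how such curves are typically produced, assuming R's randomForest package (a comment below confirms R was used); the data frame `train` and the response column `adherence` are hypothetical stand-ins for the asker's data:

```r
library(randomForest)

## Hypothetical data/column names; the target must be a factor
## for randomForest to run in classification mode.
set.seed(1)
fit <- randomForest(adherence ~ ., data = train, ntree = 1000)

## err.rate has one row per tree: column "OOB" is the overall error,
## followed by one column per class, each cumulated over the first i trees.
head(fit$err.rate)
plot(fit)  # draws per-class OOB error curves like the screenshot
```

The early rows of `err.rate` are based on very few trees (and very few OOB votes per observation), so the left end of such a plot is inherently noisy.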

  • Data, software, code could help clarify. I would say the OOB error looks rather constant. Still unusual, I admit. – Soren Havelund Welling Nov 26 '15 at 07:43
  • The random forest was performed in R. More info on the data: the data are related to HIV patients following a specific treatment. The dependent variable is whether or not they have taken their pill at a specific point in time. The explanatory variables are patient-specific characteristics such as age, gender, etc., together with extra information about previous dates: did the patient take their pill yesterday? Did they report symptoms yesterday? – user3387899 Nov 26 '15 at 08:11
  • OK, nice :) Try to update your question with the command line you ran the model with, and show the header of the data.frame. One explanation could be that the model has no predictive power and adding more trees does not change this. Moreover, the target classes of the training data may be skewed, something like 1:4, but I'm just guessing. – Soren Havelund Welling Nov 26 '15 at 09:08
  • How many instances do you have in your training set? I would try to run random forests with different random seeds. The OOB estimate from a single tree is pretty unreliable. – Simone Nov 26 '15 at 22:56
  • My training set contains 32000 observations. – user3387899 Nov 27 '15 at 07:56
  • 2
  • In light of the size of your data set and the number of predictors, you need to grow a forest of at least 1000 trees before you can even begin to interpret the OOB error rate. – Antoine Dec 06 '15 at 17:34
  • I agree with @Antoine. Even the default number of trees (200) would be helpful. Using 20 trees is barely an ensemble. [link](http://stats.stackexchange.com/questions/164048/can-random-forest-be-used-for-feature-selection-in-multiple-linear-regression/164250#164250) – EngrStudent Dec 14 '15 at 03:47
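
Putting the comments together, here is a hedged sketch of the suggested checks (reusing the hypothetical `train`/`adherence` names from above): verify how skewed the two classes are, then refit with more trees and several seeds to see whether the rising class 2 error is just estimation noise.

```r
library(randomForest)

## 1. Class balance: a skew like the guessed 1:4 makes the minority-class
##    OOB error both higher and noisier.
table(train$adherence)

## 2. Refit with several seeds and at least 1000 trees, as suggested above.
final_oob <- sapply(1:5, function(s) {
  set.seed(s)
  fit <- randomForest(adherence ~ ., data = train, ntree = 1000)
  tail(fit$err.rate[, "OOB"], 1)  # overall OOB error after all trees
})
final_oob  # similar values across seeds suggest the pattern is not seed noise
```

If the per-class curves still diverge after 1000 trees, class imbalance rather than too few trees becomes the more likely explanation.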

0 Answers