
I have used rpart.control(minsplit = 2) and got the following results from the rpart() function. To avoid overfitting the data, should I use the tree with 3 splits or the one with 7 splits? Shouldn't I use 7 splits? Please let me know.

Variables actually used in tree construction:

[1] ct_a ct_b usr_a

Root node error: 23205/60 = 386.75

n= 60        

    CP nsplit rel error  xerror     xstd
1 0.615208      0  1.000000 1.05013 0.189409
2 0.181446      1  0.384792 0.54650 0.084423
3 0.044878      2  0.203346 0.31439 0.063681
4 0.027653      3  0.158468 0.27281 0.060605
5 0.025035      4  0.130815 0.30120 0.058992
6 0.022685      5  0.105780 0.29649 0.059138
7 0.013603      6  0.083095 0.21761 0.045295
8 0.010607      7  0.069492 0.21076 0.042196
9 0.010000      8  0.058885 0.21076 0.042196
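
For reference, a minimal sketch of the kind of call that produces output like this; the data frame name (mydata), the response (y), and method = "anova" are assumptions, not taken from the post:

library(rpart)

fit <- rpart(y ~ ., data = mydata, method = "anova",
             control = rpart.control(minsplit = 2))

printcp(fit)   # prints the CP table shown above
plotcp(fit)    # plots xerror against cp and tree size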
samarasa
    I answered this in the follow-up you posted to the previous question. Given that, there was no need for this one. I mentioned that you shouldn't edit questions into follow-ups, *for future reference*! – Gavin Simpson Jul 25 '11 at 17:02
    To avoid searching for the related question in the future, here is the link to the previous Q: http://stats.stackexchange.com/questions/13446/recursive-partitioning-using-rpart-method-in-r. – chl Jul 25 '11 at 21:49

1 Answer


The convention is to use either the best tree (the one with the lowest cross-validated relative error, xerror) or the smallest (simplest) tree whose xerror is within one standard error of the best tree's. Here the best tree is in row 8 (7 splits), but the tree in row 7 (6 splits) does effectively the same job: its xerror of 0.21761 is smaller than the best tree's xerror plus one standard error (0.21076 + 0.042196 = 0.252956), and it is simpler, so the one-standard-error rule would select it.
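
If you want to apply the one-standard-error rule programmatically rather than by eye, something along these lines should work (a sketch; fit is assumed to be the fitted rpart object behind the table above):

cptab <- fit$cptable

best   <- which.min(cptab[, "xerror"])                 # row with the lowest cross-validated error
thresh <- cptab[best, "xerror"] + cptab[best, "xstd"]  # best xerror plus one standard error
chosen <- min(which(cptab[, "xerror"] <= thresh))      # simplest tree within one SE of the best

pruned <- prune(fit, cp = cptab[chosen, "CP"])

With the table above this picks row 7 (CP = 0.013603, 6 splits), matching the hand calculation.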

Gavin Simpson