
I would like to fit a single tree. In the h2o R package, I can use h2o.randomForest() with the following options:

h2o.randomForest(y = y, x = x, training_frame = data, 
                 ntrees = 1, 
                 mtries = number_of_predictors_here,
                 sample_rate = 1)

I have a question about the meaning of sample_rate. I assume that if I specify sample_rate = 1, it will use all the data. Is this correct, or will it still sample with replacement?

Will this approach provide a correct way to fit a single tree?

Silverfish
Sergey

1 Answer


Yes, that is currently the correct way to train a single decision tree in H2O. We have a ticket open to create a wrapper for this, which will make it a bit more straightforward to use.

If you use sample_rate = 1, that means it will not do any sampling, so it will use the full training set.

Another thing to note is that h2o.randomForest() has a default max_depth value of 20, so if you want the trees to grow all the way down, unconstrained, you might also choose to set that value to something large, like 1000.
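Putting these settings together, a minimal sketch in R might look like the following. The iris data set, the seed, and the variable names are assumptions chosen purely for illustration, max_depth = 1000 is just the "something large" suggested above, and the code assumes a local H2O cluster can be started with h2o.init().

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started

# Illustrative data only: iris stands in for the asker's `data`
data <- as.h2o(iris)
y <- "Species"
x <- setdiff(colnames(data), y)

tree <- h2o.randomForest(
  y = y, x = x, training_frame = data,
  ntrees = 1,            # a single tree
  mtries = length(x),    # consider every predictor at each split
  sample_rate = 1,       # no row sampling: use the full training set
  max_depth = 1000,      # effectively unconstrained depth (default is 20)
  seed = 1               # hypothetical seed, for reproducibility
)
```

Setting mtries to the number of predictors removes the random feature selection at each split, so together with sample_rate = 1 the two sources of randomness that distinguish a random forest from a plain tree are switched off.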

Erin LeDell
  • On the other hand, it would be much better if you provided an API for turning sampling off, rather than using 1 for that purpose. The OP's reasoning was sound, since even Breiman thought of bagging as sampling with replacement to the same size as the original sample. – rapaio Aug 22 '17 at 22:48
  • Yeah, that's what we have in mind for the future `h2o.decisionTree()` function -- it would be a wrapper for `h2o.randomForest()` that removes any notion of sampling so the user doesn't have to think about or wonder what kind of sampling is going on. – Erin LeDell Aug 22 '17 at 23:03
  • I must admit I'd not thought about this before, but if sample_rate is 0.9999, is it doing sample-with-replacement? http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sample_rate.html ought to be explicit about that, one way or the other. – Darren Cook Aug 23 '17 at 12:26
  • @DarrenCook Sampling is always without replacement in H2O -- it's just regular subsampling of the rows. You're right, we should note that in the docs (will do). – Erin LeDell Aug 23 '17 at 15:57
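
The distinction discussed in these comments can be illustrated with a few lines of base R. This is only a sketch of the two sampling schemes, not H2O's actual implementation; the row count and seed are hypothetical.

```r
set.seed(42)   # arbitrary seed, for reproducibility
n <- 1000      # hypothetical number of training rows
rows <- seq_len(n)

# Subsampling without replacement at rate 1: every row appears exactly once
subsample <- sample(rows, size = round(1 * n), replace = FALSE)

# Bootstrap (with replacement), as in Breiman's bagging:
# same size as the original sample, but some rows repeat and others are left out
bootstrap <- sample(rows, size = n, replace = TRUE)

n_unique_subsample <- length(unique(subsample))  # equals n
n_unique_bootstrap <- length(unique(bootstrap))  # fewer than n (about 63% of n on average)
```

With sampling without replacement, a rate of 1 can only return the full set of rows (in some order), which is why sample_rate = 1 amounts to no sampling. A bootstrap sample of the same size would not have that property.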