
I want to tune the parameters of a random forest regression. If I have $n$ rows of training data, is there a bound on the maximum depth in terms of $n$ that avoids overfitting? Is $2^d < n$ the correct bound, where $d$ is the maximum depth of a tree?

Ted

1 Answer


Overfitting is best evaluated with k-fold cross-validation: look at how the model performs on the validation folds under a chosen metric (accuracy for classification, RMSE for regression). It is data-specific and not easy to predict with rules of thumb. In a random forest the more important hyperparameter is usually the number of trees, since averaging across many trees reduces overfitting; max_depth is often left unlimited. Use grid search with cross-validation, as sketched below, and find out what works for your particular problem.
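A minimal sketch of that grid search, assuming scikit-learn; the synthetic data and the particular grid values are placeholders for your own problem:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for your own training set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10, 20],  # None lets nodes expand until pure
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # RMSE, negated so higher is better
)
search.fit(X, y)

print(search.best_params_)   # best hyperparameter combination found
print(-search.best_score_)   # cross-validated RMSE of that combination
```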

Jon Nordby
  • Tuning the number of trees with respect to accuracy is ill-advised. https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest/348246#348246 – Sycorax May 27 '18 at 00:45
  • Yes, tuning it like a regularization parameter is wrong. But one still needs to ensure there are enough trees, and that interacts with other parameters like tree depth. One could also just leave it very high, like 100-500, and take the performance hit. – Jon Nordby May 27 '18 at 11:56
  • Even so, grid search is not an efficient way to do that. It would be better to fit a single forest at the maximum number of trees, plot a loss curve against the number of trees by truncating the forest, and then truncate to a sensible number; see the sketch after these comments. – Matthew Drury Dec 23 '18 at 19:59
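A hedged sketch of the truncation idea from the last comment, again assuming scikit-learn; the data, split, and tree counts are illustrative. A forest truncated to its first k trees predicts the mean of those k trees' outputs, so a single fit yields the whole curve:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit once at the maximum number of trees under consideration.
forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Per-tree predictions on the validation set; the forest truncated to the
# first k trees predicts the mean of those k trees' outputs.
tree_preds = np.array([tree.predict(X_val) for tree in forest.estimators_])
cum_means = np.cumsum(tree_preds, axis=0) / np.arange(1, len(tree_preds) + 1)[:, None]
rmse = np.sqrt(((cum_means - y_val) ** 2).mean(axis=1))

# The curve typically flattens quickly; pick the smallest k past the elbow
# and keep only that many trees.
for k in (1, 10, 50, 100, 500):
    print(k, rmse[k - 1])
```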