How important are parameters in random forest regression?

Question

I'm using random forest regression for prediction, and I usually go with 25 or 50 trees and 50 random features. Currently I have only one data set to care about, but more will come in the future, each needing its own model, eventually so many that it will be difficult to tune the parameters by hand.

So I thought about writing a program that tries out different parameter settings following a specific scheme. For example, if the results for 50 trees / 50 features are not so great, try again with 40/50, then 60/50, then 50/40, etc. Just an idea. But a Ph.D. in ML told me that this effort is in vain because random forest is very stable and the parameters influence it only slightly. Can you confirm this?
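A minimal sketch of such a scheme, assuming Python. The function `neighbour_search` and the scoring callback `evaluate` are hypothetical names introduced for illustration; in practice `evaluate` would fit a forest with the given (trees, features) pair and return a validation score where higher is better. The toy score used here is only a stand-in so the example runs on its own.

```python
from typing import Callable, Dict, Tuple

Params = Tuple[int, int]  # (number of trees, number of random features)


def neighbour_search(
    evaluate: Callable[[Params], float],
    start: Params,
    step: int = 10,
    max_rounds: int = 20,
) -> Tuple[Params, float]:
    """Greedy search: move to the best neighbouring setting until none improves."""
    cache: Dict[Params, float] = {}

    def score(p: Params) -> float:
        # Evaluate each setting only once; model fits are expensive.
        if p not in cache:
            cache[p] = evaluate(p)
        return cache[p]

    best = start
    for _ in range(max_rounds):
        trees, feats = best
        neighbours = [
            (trees - step, feats), (trees + step, feats),
            (trees, feats - step), (trees, feats + step),
        ]
        # Discard settings that are not valid (non-positive counts).
        neighbours = [p for p in neighbours if p[0] > 0 and p[1] > 0]
        candidate = max(neighbours, key=score)
        if score(candidate) <= score(best):
            break  # no neighbour is better: stop at a local optimum
        best = candidate
    return best, score(best)


# Toy stand-in for a real model evaluation (assumption: higher is better).
def toy_score(p: Params) -> float:
    trees, feats = p
    return -((trees - 80) ** 2 + (feats - 30) ** 2)


best_params, best_score = neighbour_search(toy_score, start=(50, 50))
print(best_params, best_score)
```

With a real forest the score is noisy, so small differences between neighbouring settings may just reflect the random seed; fixing the seed or averaging several fits per setting would make the comparison more trustworthy.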

I've read through this question and its answers, and I can confirm the variance reduction with my 25/50 vs. 50/50 approach, but it still doesn't really answer my question. Help is appreciated :-)

You need the right kind of problem. You need to consider tree count when the data drives it. Consider the UCI data sets: get something with ~50 columns and ~1000 rows, and then look at how varying the tree count works. You can also stack boosted random forests when you are working on regression models. — EngrStudent, Jun 16 '16 at 02:56

score 2 · Accepted Answer · answered Jun 15 '16 at 15:33

"Importance" is not a clear-cut concept. However, a coauthor of mine recently varied the mtry parameter (that is, the number of candidate predictors randomly chosen at each split) and recorded the resulting out-of-bag classification accuracy.

Now, on the one hand, the accuracies differ by an amount that is probably not clinically relevant: they range only from 79.8% to 81.3%. On the other hand, the difference between mtry=8 and mtry=21 certainly is statistically significant.

So it seems like the parameter settings for your Random Forest can indeed have an impact on your accuracy. I was surprised at this myself.
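To try this kind of experiment on your own data, here is a minimal sketch, assuming Python with scikit-learn; this is not the code behind the numbers above. In scikit-learn, `max_features` plays the role of mtry, and `oob_score_` holds the out-of-bag accuracy after fitting. The data set is synthetic and the candidate values are arbitrary, chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)

oob_accuracy = {}
for mtry in (2, 4, 8, 12, 20):
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features=mtry,   # candidate predictors considered at each split
        oob_score=True,      # score each sample using only trees that did not see it
        random_state=0,      # fixed seed so that runs are comparable
    )
    forest.fit(X, y)
    oob_accuracy[mtry] = forest.oob_score_

for mtry, acc in oob_accuracy.items():
    print(f"max_features={mtry:2d}  OOB accuracy={acc:.3f}")
```

For a regression problem, `RandomForestRegressor` accepts the same arguments, and its `oob_score_` is an out-of-bag R² rather than an accuracy.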

(I'd rather not give details on the classification task here, because this is yet-unsubmitted work in progress, and I don't want my collaborators to kill me. I'm too young to die.)

I'm a big fan of teaching folks how to "prove it to yourself" and if you didn't give an answer of this form then I might have tried to. It would be good, after you have published, to make this something to revisit and expand for reproducibility with a request that if they publish work after using your method to certify, that they put your paper in their references. — EngrStudent, Jun 19 '16 at 21:29
