In a regression tree (I am particularly interested in random forest regression, but it seems this can be generalised to regression trees as a whole), a random subset of the predictor variables is selected at each node, starting from the root, and the best split among those variables is chosen to divide the node into two daughter nodes. This continues until a set of stopping criteria is fulfilled. My questions here are: How is the best split determined at each node of a regression tree? What are the criteria to stop? As far as I understand, the default for random forest regression is that the best split should produce the lowest mean squared error (a mean squared error of what exactly?), and the stopping criteria are either that the mean squared error is lower than that of the entire dataset (somehow I don't think that is entirely correct) or that the resulting leaf would contain fewer than five observations.
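To make the first question concrete, here is a minimal sketch of how I picture the split search working (the function and variable names are my own, and this is not meant to be any particular library's implementation): each candidate split is scored by the weighted mean squared error of the two daughter nodes, where each node predicts the mean of its own response values.

```python
import numpy as np

def node_mse(y):
    """MSE of a node that predicts the mean of its own response values."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y, feature_subset):
    """Search the candidate features for the split that minimises
    the weighted MSE of the two daughter nodes."""
    best = None  # (score, feature index, threshold)
    n = len(y)
    for j in feature_subset:
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            # weighted average of the daughters' MSEs
            score = (left.sum() / n) * node_mse(y[left]) \
                  + (right.sum() / n) * node_mse(y[right])
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# toy example: the response depends mostly on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=20)
print(best_split(X, y, feature_subset=[0, 2]))
```

Is this the quantity that is being minimised at each node, or is the "mean squared error" measured against something else?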
Once this is completed, the algorithm outputs an ensemble of trees, and a prediction can then be made. My other question is: how is the final prediction made at each leaf of a regression tree? Of course the prediction is an average over the ensemble of trees, but an average of what exactly? I think in the simplest case (called a constant estimate) it is simply the mean of the response values of the training cases in that leaf, averaged over all trees(?). Or is it a linear regression computed within that leaf, averaged over the ensemble?
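If my understanding of the constant-estimate case is right, the prediction step would look roughly like this (again a toy sketch with my own names; real implementations would store the leaf means rather than the raw training responses):

```python
import numpy as np

def leaf_prediction(y_in_leaf):
    """Constant estimate: a leaf predicts the mean response of the
    training cases that fell into it."""
    return np.mean(y_in_leaf)

def forest_prediction(per_tree_predictions):
    """The forest prediction is the plain average of the individual
    tree predictions for the same query point."""
    return np.mean(per_tree_predictions)

# toy example: three trees, each routing the query point to one leaf
leaf_targets_by_tree = [np.array([3.1, 2.9, 3.0]),
                        np.array([2.5, 3.5]),
                        np.array([3.2, 3.0, 2.8, 3.0])]
tree_preds = [leaf_prediction(t) for t in leaf_targets_by_tree]
print(forest_prediction(tree_preds))  # average of the three leaf means
```

Is that what actually happens, or is something model-based (like a within-leaf linear regression) fitted instead?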