The random forest algorithm builds many decision trees on bootstrap samples drawn from the original dataset. (Individual trees also use only a subset of the features, but my question is not about that.) Each tree therefore gets a dataset containing repeated examples.

My question is this: do we need to use these repeated examples when building the individual trees? In other words, what happens if we use only the unique examples in the bootstrapped sample?

When I think of the decision tree algorithm, two different trees will be produced if we give one of them the bootstrapped sample and the other only the unique examples in it: the repeated examples act as a kind of weight on those examples. So I think using the repeated examples does make a difference. However, the main idea behind random forests is variance reduction, and I do not understand how the repeated examples help reduce variance. From this perspective, it seems the performance of a random forest should not be affected if the repeated examples are dropped. I would appreciate any clarification on these issues. Thanks.
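To make the "repeats as weights" point concrete, here is a minimal sketch (assuming scikit-learn; the synthetic dataset, seeds, and variable names are arbitrary choices, not anything from a particular library's internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)

# Draw one bootstrap sample: n indices sampled with replacement.
idx = rng.integers(0, len(X), size=len(X))
uniq, counts = np.unique(idx, return_counts=True)

# Tree A: fit on the bootstrap sample, duplicates kept.
tree_a = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

# Tree B: fit on the unique rows only, with the repeat counts as sample
# weights. This should be equivalent to tree A (up to tie-breaking among
# equally good splits), because duplicated rows act as integer weights.
tree_b = DecisionTreeClassifier(random_state=0).fit(
    X[uniq], y[uniq], sample_weight=counts
)

# Tree C: fit on the unique rows with the duplicates simply dropped.
tree_c = DecisionTreeClassifier(random_state=0).fit(X[uniq], y[uniq])

print((tree_a.predict(X) == tree_b.predict(X)).mean())  # expect 1.0
print((tree_a.predict(X) == tree_c.predict(X)).mean())  # typically < 1.0
```

So dropping the duplicates (tree C) is not the same tree as keeping them (tree A); keeping them is equivalent to weighting the unique examples by their repeat counts (tree B).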
- Bootstrap aggregation is one of the tools that gives the random forest its strength. The CART model is a "weak learner" while the RF is strong. If you make a collection of identical trees, the output is the same as a single tree: weak. – EngrStudent Dec 22 '20 at 15:09
- @EngrStudent Thanks, but how do the repeated examples in the bootstrap sample help to build a weak learner? This part is not clear to me. – Sanyo Mn Dec 22 '20 at 15:36
- The classification and regression tree (CART) is a [weak learner](https://stats.stackexchange.com/questions/82049/what-is-meant-by-weak-learner). Bias is the weakness of the CART. Bootstrapping overcomes that. – EngrStudent Dec 22 '20 at 16:03
- @EngrStudent You are confusing boosting (XGBoost) with bagging (random forest). – seanv507 Dec 22 '20 at 19:57
- If you have an iid sample and average it, the variance will be lower, centred around the sample average. The idea of bootstrapping is to create new iid samples 'consistent' with the original sample. Taking only the unique values will not replicate the statistics of the original dataset, and will therefore give you the wrong predictions. – seanv507 Dec 22 '20 at 20:00
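A small sketch of the variance-reduction point in this comment (again assuming scikit-learn; the regression setup, tree counts, and repetition counts are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X, y = make_regression(n_samples=300, noise=10.0, random_state=1)
x0 = X[:1]  # one fixed query point

def ensemble_prediction(n_trees):
    # Average of n_trees fully grown trees, each fit on its own
    # bootstrap sample of the (fixed) training data.
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(x0)[0])
    return np.mean(preds)

# Variance of the prediction at x0 over repeated draws of the ensemble:
# averaging independently bootstrapped trees shrinks the variance of the
# averaged prediction, roughly by a factor of 1/n_trees here.
single = [ensemble_prediction(1) for _ in range(100)]
bagged = [ensemble_prediction(25) for _ in range(100)]
print(np.var(single), np.var(bagged))
```

The variance reduction comes from the averaging; the bootstrap's role is to supply trees that are diverse yet consistent with the original sample.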
- @seanv507 - you are humorous. The CART is a weak learner. The gradient boosted machine, of which XGBoost is an instance, is a linear ensemble of CARTs with a weighting scheme. Random forest is a parallel ensemble of CARTs, and it uses bootstrap sampling on samples, and in "dimensions", to overcome the weakness of the CART. The "boost" in boosted machine comes from AdaBoost, the series weighting scheme. If you have a bad estimate of the mean, and you use bootstrap resampling and out-of-bag error estimation to make a weighted ensemble, then you can be less wrong about it. – EngrStudent Dec 22 '20 at 22:23
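For reference, a minimal sketch of the structural difference the last two comments are debating (a toy illustration with scikit-learn trees, not the internals of either random forest or XGBoost):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X, y = make_regression(n_samples=200, random_state=2)

# Bagging (random forest style): parallel trees on independent bootstrap
# samples, predictions combined by an equal-weight average.
bag = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    bag.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
bag_pred = np.mean([t.predict(X) for t in bag], axis=0)

# Boosting (GBM style): sequential shallow trees, each fit to the current
# residual, combined as a weighted (shrunken) series sum.
boost_pred = np.zeros(len(y))
lr = 0.1  # learning rate, the series weight
for _ in range(10):
    residual = y - boost_pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    boost_pred += lr * tree.predict(X)
```

Bagging reduces variance by averaging parallel trees; boosting reduces bias by adding sequential corrections, which is why the two are easy to conflate but behave differently.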