Over the last few days I've been writing linear regression models using different algorithms to better understand the underlying principles. Now I want to move on to bigger and better things, and in particular I want to try writing my own random forest model.
I've been using RF models a bit in my work, and normally I'd just use the scikit-learn implementation, but I want to understand things in more detail, and the best way to do that is to write my own model.
So the first thing I want to implement for the model is a bootstrapping algorithm. I had a look online but couldn't find any good resources on the practical implementation of bootstrapping: the Wikipedia article on bootstrapping is interesting, but it's all about the underlying maths, and most of the resources I found through Google only give very basic explanations of the process.
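For concreteness, here is my current rough understanding of the mechanics (just a sketch I put together myself, so the function name and structure are my own, not from any reference): draw N row indices uniformly with replacement and index into the data.

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw one bootstrap sample: N rows sampled uniformly with replacement."""
    rng = np.random.default_rng(rng)
    n = len(X)
    idx = rng.integers(0, n, size=n)  # N draws, duplicates allowed
    return X[idx], y[idx]

# toy data: 10 observations, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng=0)
```

Is this roughly the right mechanic, or is there more to a "proper" implementation than uniform sampling with replacement?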
Does anyone know of any resources talking about practical implementation of bootstrapping?
As for other things: in all the examples on Wikipedia, if the original sample has size N, the resampling is done to size N as well. Is this the standard approach, or is it sometimes acceptable to create resampled data with a larger or smaller number of observations than the original sample?
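In code terms, I imagine the resample size would just become a parameter that defaults to N (again, this is my own sketch and naming, not from any library):

```python
import numpy as np

def resample(data, size=None, rng=None):
    """Resample with replacement; size defaults to len(data), i.e. the usual N."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    if size is None:
        size = len(data)  # textbook bootstrap: resample to the original size N
    idx = rng.integers(0, len(data), size=size)
    return data[idx]

orig = np.random.default_rng(1).normal(size=100)
same_n  = resample(orig, rng=2)           # size 100, the standard case
smaller = resample(orig, size=50, rng=3)  # fewer observations than the original
```

Mechanically nothing stops me from passing a different size, so my question is really whether doing so is statistically acceptable.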
And when we resample data for a random forest model, which metric of the original data do we look at when creating bootstrap samples? Once again, the Wikipedia article talks a lot about variance, but could we use other dispersion metrics? For example, could we use the IQR and select bootstrap samples so that their IQR is close to that of the original sample? Or some other dispersion metric, for that matter?
Finally, once we have chosen a specific metric, how do we define what is 'close enough'? I imagine it would be computationally very heavy to try to get resampled data that matches the original data exactly, so how do we decide what counts as close enough for an acceptable resampling result?
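To make that question concrete, the kind of acceptance check I have in mind would look something like this (both the choice of IQR as the metric and the 10% tolerance are placeholders I made up for illustration):

```python
import numpy as np

def iqr(a):
    """Interquartile range: 75th minus 25th percentile."""
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25

def close_enough(sample, original, metric=iqr, rel_tol=0.1):
    """Accept a resample if its dispersion metric is within rel_tol
    (relative) of the original's; the 10% default is arbitrary."""
    m_orig = metric(original)
    return abs(metric(sample) - m_orig) <= rel_tol * abs(m_orig)

rng = np.random.default_rng(0)
orig = rng.normal(size=1000)
sample = orig[rng.integers(0, len(orig), size=len(orig))]
accepted = close_enough(sample, orig)
```

Is an explicit tolerance check like this something people actually do, and if so, is there any principled way to pick the tolerance rather than just guessing at a percentage?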
Thanks in advance!