Over the last few days I've been writing linear regression models using different algorithms to better understand the underlying principles. Now I want to move on to bigger and better things, and in particular I want to try writing my own random forest model.
I've been using RF models a bit in my work, and normally I'd just use the scikit-learn implementation, but I want to understand things in more detail, and the best way to do that is to write my own model.
So the first thing I want to implement for the model is a bootstrapping algorithm. I had a look online but couldn't find any good resources on the practical implementation of bootstrapping: the Wikipedia article on bootstrapping is interesting, but it's all about the underlying maths, and most of the resources I found through Google only give very basic explanations of the process.
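For concreteness, here is my current rough understanding of the mechanics (just a sketch I put together myself, so the function name and structure are my own, not from any reference): draw N row indices uniformly with replacement and index into the data.

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw one bootstrap sample: N rows sampled uniformly with replacement."""
    rng = np.random.default_rng(rng)
    n = len(X)
    idx = rng.integers(0, n, size=n)  # N draws, duplicates allowed
    return X[idx], y[idx]

# toy data: 10 observations, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng=0)
```

Is this roughly the right mechanic, or is there more to a "proper" implementation than uniform sampling with replacement?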
Does anyone know of any resources talking about practical implementation of bootstrapping?
As for other things: in all the examples on Wikipedia, if the original sample has size N, the resampling is done to size N as well. Is this the standard approach, or is it sometimes acceptable to create resampled data with a larger or smaller number of observations than the original sample?
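In code terms, I imagine the resample size would just become a parameter that defaults to N (again, this is my own sketch and naming, not from any library):

```python
import numpy as np

def resample(data, size=None, rng=None):
    """Resample with replacement; size defaults to len(data), i.e. the usual N."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    if size is None:
        size = len(data)  # textbook bootstrap: resample to the original size N
    idx = rng.integers(0, len(data), size=size)
    return data[idx]

orig = np.random.default_rng(1).normal(size=100)
same_n  = resample(orig, rng=2)           # size 100, the standard case
smaller = resample(orig, size=50, rng=3)  # fewer observations than the original
```

Mechanically nothing stops me from passing a different size, so my question is really whether doing so is statistically acceptable.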
And when we resample data for a random forest model, which metric of the original data do we look at when creating bootstrap samples? Once again, the Wikipedia article talks a lot about variance, but could we use other dispersion metrics? For example, could we use the IQR and select bootstrap samples so that their IQR is close to that of the original sample? Or some other dispersion metric, for that matter?
Finally, once we have chosen a specific metric, how do we define what is 'close enough'? I imagine it would be computationally very heavy to try to get resampled data that matches the original data exactly, so how do we decide what counts as close enough for an acceptable resampling result?
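To make that question concrete, the kind of acceptance check I have in mind would look something like this (both the choice of IQR as the metric and the 10% tolerance are placeholders I made up for illustration):

```python
import numpy as np

def iqr(a):
    """Interquartile range: 75th minus 25th percentile."""
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25

def close_enough(sample, original, metric=iqr, rel_tol=0.1):
    """Accept a resample if its dispersion metric is within rel_tol
    (relative) of the original's; the 10% default is arbitrary."""
    m_orig = metric(original)
    return abs(metric(sample) - m_orig) <= rel_tol * abs(m_orig)

rng = np.random.default_rng(0)
orig = rng.normal(size=1000)
sample = orig[rng.integers(0, len(orig), size=len(orig))]
accepted = close_enough(sample, orig)
```

Is an explicit tolerance check like this something people actually do, and if so, is there any principled way to pick the tolerance rather than just guessing at a percentage?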
Thanks in advance!