
I want to know whether my understanding of RF algorithm is correct when using bootstrapping.

So, let's say I have a dataset of 100 observations. That dataset is then split into a 75/25 split of train and test observations.

Then the bootstrapping gets done and I "extract" 1000 different bootstrap samples of that same split (now I have 1000 different 75/25 splits in theory).

After that, a RF model gets trained on those bootstrap samples. Each RF has 1000 decision trees.

This is where I'm not quite sure if I understood the algorithm correctly. Essentially we now have 1000 different forests, and they all get aggregated to make a final decision? If it's classification, the most common class is selected, and if it's regression, the mean of the decisions is used? I'm aware different metrics can be used, but this is a basic example.

Thanks!

Jamess11

1 Answer


Your description is not entirely clear, presumably because the procedure itself is not yet clear to you, hence your question. I'll go through the terms you used, explain how they are usually applied, and point out where your description diverges from the standard algorithm. Please check whether I have understood you correctly.

So, let's say I have a dataset of 100 observations. That dataset is then split into a 75/25 split of train and test sets.

You are describing using a hold-out validation. This means that you split the data to train and test set. Here you use the 75/25 proportions, hence 75% of your data goes into the train set and the remainder to the test set. 75% of 100 is 75 observations.

In some cases, you would split the data into three sets train, validation, and test. The validation set would be used for hyperparameter tuning.
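The hold-out split described above can be sketched with scikit-learn (a sketch; the dataset values and the random seed are placeholders, only the 100 observations and 75/25 ratio come from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 100 observations (values are arbitrary placeholders)
X = np.arange(100).reshape(100, 1)
y = np.random.randint(0, 2, size=100)

# 75/25 hold-out split, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 75 25
```

A three-way split would simply apply `train_test_split` a second time to carve a validation set out of the training portion.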

Then the bootstrapping gets done and I "extract" 1000 different bootstrap samples of that same split (now I have 1000 different 75/25 splits in theory).

This paragraph is not clear to me. Random forest uses the bootstrap to resample the data it is trained on. In your example, this would mean applying bootstrap resampling to the 75 observations in the train set: each bootstrap sample draws, with replacement, 75 out of the 75 observations. The procedure would be repeated 1000 times, so you would have 1000 bootstrap samples of 75 observations each, where each sample consists of a different random combination of the 75 training observations. In this step, you don't touch the test set observations at all.
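The bootstrap step can be sketched in plain NumPy (the sizes come from the example above; the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 75        # size of the training set
n_bootstrap = 1000  # one resample per tree

# Each resample draws 75 row indices *with replacement*
# from the 75 training rows; the test set is never touched here
bootstrap_indices = [rng.integers(0, n_train, size=n_train)
                     for _ in range(n_bootstrap)]

# Each tree would then be fit on train_rows[idx] for its own idx
```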

This is where I'm not quite sure if I understood the algorithm correctly. Essentially we now have 1000 different forests, and they all get aggregated to make a final decision?

Not 1000 random forests, but 1000 decision trees, all belonging to a single forest. You would make a prediction using each of those trees and then aggregate the individual predictions into the final prediction. To aggregate them, you would use something like a majority vote (for classification) or the average (for regression), but the exact choice doesn't matter that much.
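The aggregation step can be sketched as follows (the per-tree predictions here are hypothetical placeholders, shown for a single observation and only five trees):

```python
import numpy as np
from collections import Counter

# Hypothetical per-tree predictions for one observation
class_preds = ["A", "B", "A", "A", "B"]  # classification: one label per tree
reg_preds = [2.0, 2.5, 1.5, 2.0, 2.0]   # regression: one value per tree

# Classification: majority vote across the trees
final_class = Counter(class_preds).most_common(1)[0][0]

# Regression: average of the trees' predictions
final_value = np.mean(reg_preds)

print(final_class, final_value)  # A 2.0
```

In practice, `sklearn.ensemble.RandomForestClassifier` and `RandomForestRegressor` do both the bootstrap resampling and this aggregation internally.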

Tim
  • Thanks for the explanation. I have a better understanding now. So if I understood correctly, let's say I decide to use 100 samples and 1000 decision trees. I also set my 'mtry' (size of the random variable subset) and 'min_n' (minimal number of data points for each node). So now 1000 decision trees would be created from the 100 samples but with different variables used since those are selected at random (hence the mtry argument). I suppose this would leave a possibility to have duplicate trees. Am I wrong in my understanding? – Jamess11 Jul 02 '21 at 16:14
  • @Jamess11 each tree would be trained on a different resample of the rows and a different subset of the features (columns) of the dataset. As you noticed, this is done at random, so there is no guarantee that there won't be duplicates. The chance of observing duplicate trees depends on the size of your dataset and the number of bootstrap samples you take. – Tim Jul 02 '21 at 16:20
  • Thank you very much, I understand the concept much clearer now. – Jamess11 Jul 02 '21 at 16:22