I am relatively new to machine learning in general, and to random forests in particular.
I am currently training a random regression forest with the scikit-learn library. My dataset has 500 features and the model produces multiple outputs.
When training the model on 2 million and 3 million samples respectively, using n_estimators=3 and max_depth=18 and leaving the rest of the parameters untouched, the saved model trained on 2 million samples is 50% larger than the one trained on 3 million samples. And it is not only larger: its accuracy is higher as well. This only happens when I restrict the depth of the trees. When I train the model without restricting the depth, the accuracy increases with the number of samples.
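For reference, a minimal sketch of the kind of setup I am describing (X_train, y_train, X_val, y_val and the model path are placeholders for my actual data; I report mean absolute error as the accuracy measure in the edit below):

    import os
    import joblib
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    def train_and_inspect(X_train, y_train, X_val, y_val, path):
        # n_estimators=3, max_depth=18, everything else left at its defaults
        model = RandomForestRegressor(n_estimators=3, max_depth=18)
        model.fit(X_train, y_train)
        joblib.dump(model, path)                       # persist the fitted forest
        size_mb = os.path.getsize(path) / 1e6          # saved model size on disk
        mae = mean_absolute_error(y_val, model.predict(X_val))
        return size_mb, mae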
I have read that deeper trees reduce bias, while more trees reduce variance. Does this mean that my dataset is somewhat biased, so that increasing the number of samples increases the bias of the model, and that this does not happen when the depth is unrestricted?
Is this behavior to be expected? Can anybody give me a hint as to what is happening in the tree-building process of RandomForestRegressor that results in this behavior?
If any additional information is needed to answer my question, please do not hesitate to ask in the comments.
EDIT1:
I just fixed the seed as suggested by @RUser4512 and ran a test to reproduce the described behavior. All three test runs were trained with max_depth=15 (a sketch of how the per-tree leaf counts can be read off the fitted forest follows the numbers):
n_samples=100,000:
mean absolute error: 0.375; saved model size on disc: 4.6MB; Avg number of leaves per tree = 10,600
n_samples=200,000:
mean absolute error: 0.312; saved model size on disc: 5.9MB; Avg number of leaves per tree = 12,610
n_samples=400,000:
mean absolute error: 0.603; saved model size on disc: 1.3MB; Avg number of leaves per tree = 2,920
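For the "Avg number of leaves per tree" figures, a sketch of how such a count can be read off a fitted forest (assuming model is the fitted RandomForestRegressor):

    import numpy as np

    # Average number of leaves across the individual trees of the forest
    avg_leaves = np.mean([tree.get_n_leaves() for tree in model.estimators_])
    print(f"Avg number of leaves per tree: {avg_leaves:,.0f}")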
I thought about the nature of the prediction a little more. During inference, my model makes 2000 independent predictions which are later combined into a single final prediction. These 2000 individual predictions contain numerous outliers that are far from the ground truth. I am measuring the quality of a split with the mean squared error criterion, which penalizes outliers heavily. Is there a chance that this might be affected by the increasing number of samples?
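Related to the criterion: a minimal sketch of how the split criterion could be swapped for the more outlier-robust absolute error (assuming X_train and y_train from above; the parameter value is "absolute_error" in recent scikit-learn releases and "mae" in older ones, and absolute-error splitting is considerably slower):

    from sklearn.ensemble import RandomForestRegressor

    # Same forest, but splits are scored by mean absolute error instead of
    # squared error, which weights large outliers in the targets far less.
    robust_model = RandomForestRegressor(
        n_estimators=3,
        max_depth=15,
        criterion="absolute_error",  # use "mae" on older scikit-learn versions
    )
    robust_model.fit(X_train, y_train)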
I further inspected my dataset and found that about 2% of the samples contained very large outliers, which were created in the process of generating the dataset. After removing those outliers, not only did the MAE go down considerably, but the behavior described above also no longer occurred.
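The exact cleanup depends on how the dataset was generated; the snippet below only illustrates a simple percentile-based filter of the kind that drops the rows with extreme target values (X, y and the percentile bounds are placeholders):

    import numpy as np

    # Keep only rows whose target values all lie within the 1st-99th percentile
    low, high = np.percentile(y, [1, 99], axis=0)
    mask = np.all((y >= low) & (y <= high), axis=1)
    X_clean, y_clean = X[mask], y[mask]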
If anybody is willing to explain why and how big outliers result in such behavior, I am happy to accept their answer.