I am relatively new to machine learning in general, and to random forests in particular.
I am currently training a random regression forest with the scikit-learn library. My dataset has 500 features and the model produces multiple outputs.
When training the model on 2 million and 3 million samples respectively, using n_estimators=3 and max_depth=18 and leaving the rest of the parameters untouched, the saved model trained on 2 million samples is 50% larger than the one trained on 3 million samples. And it is not only larger: its accuracy is higher as well. This only happens when I restrict the depth of the trees. When I train the model without restricting the depth, the accuracy increases with the number of samples.
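For reference, a minimal sketch of the kind of setup I am describing (X_train, y_train, X_val, y_val and the model path are placeholders for my actual data; I report mean absolute error as the accuracy measure in the edit below):

    import os
    import joblib
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    def train_and_inspect(X_train, y_train, X_val, y_val, path):
        # n_estimators=3, max_depth=18, everything else left at its defaults
        model = RandomForestRegressor(n_estimators=3, max_depth=18)
        model.fit(X_train, y_train)
        joblib.dump(model, path)                       # persist the fitted forest
        size_mb = os.path.getsize(path) / 1e6          # saved model size on disk
        mae = mean_absolute_error(y_val, model.predict(X_val))
        return size_mb, mae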
I have read that deeper trees reduce bias, while more trees reduce variance. Does this mean that my dataset is somewhat biased, so that increasing the number of samples increases the bias of the model, and that this does not happen when the depth is unrestricted?
Is this behavior to be expected? Can anybody give me a hint as to what is happening in the tree-building process of RandomForestRegressor that results in this behavior?
If any additional information is needed to answer my question, please do not hesitate to ask in the comments.
EDIT1:
I just fixed the seed as suggested by @RUser4512 and ran a test to reproduce the described behavior. All three test runs were trained with max_depth=15 (a sketch of how the per-tree leaf counts can be read off the fitted forest follows the numbers):
n_samples=100,000:
mean absolute error: 0.375; saved model size on disc: 4.6MB; Avg number of leaves per tree = 10,600
n_samples=200,000:
mean absolute error: 0.312; saved model size on disc: 5.9MB; Avg number of leaves per tree = 12,610
n_samples=400,000:
mean absolute error: 0.603; saved model size on disc: 1.3MB; Avg number of leaves per tree = 2,920
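For the "Avg number of leaves per tree" figures, a sketch of how such a count can be read off a fitted forest (assuming model is the fitted RandomForestRegressor):

    import numpy as np

    # Average number of leaves across the individual trees of the forest
    avg_leaves = np.mean([tree.get_n_leaves() for tree in model.estimators_])
    print(f"Avg number of leaves per tree: {avg_leaves:,.0f}")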
I thought about the nature of the prediction a little more. During inference, my model makes 2000 independent predictions which are later combined into a single final prediction. These 2000 individual predictions contain numerous outliers that are far from the ground truth. I am measuring the quality of a split with the mean squared error criterion, which penalizes outliers heavily. Is there a chance that this might be affected by the increasing number of samples?
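Related to the criterion: a minimal sketch of how the split criterion could be swapped for the more outlier-robust absolute error (assuming X_train and y_train from above; the parameter value is "absolute_error" in recent scikit-learn releases and "mae" in older ones, and absolute-error splitting is considerably slower):

    from sklearn.ensemble import RandomForestRegressor

    # Same forest, but splits are scored by mean absolute error instead of
    # squared error, which weights large outliers in the targets far less.
    robust_model = RandomForestRegressor(
        n_estimators=3,
        max_depth=15,
        criterion="absolute_error",  # use "mae" on older scikit-learn versions
    )
    robust_model.fit(X_train, y_train)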
I further inspected my dataset and found that about 2% of the samples contained very large outliers, which were created in the process of generating the dataset. After removing those outliers, not only did the MAE go down considerably, but the behavior described above also no longer occurred.
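The exact cleanup depends on how the dataset was generated; the snippet below only illustrates a simple percentile-based filter of the kind that drops the rows with extreme target values (X, y and the percentile bounds are placeholders):

    import numpy as np

    # Keep only rows whose target values all lie within the 1st-99th percentile
    low, high = np.percentile(y, [1, 99], axis=0)
    mask = np.all((y >= low) & (y <= high), axis=1)
    X_clean, y_clean = X[mask], y[mask]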
If anybody is willing to explain why and how big outliers result in such behavior, I am happy to accept their answer.