Does a Random Forest model require similar sample sizes across different classes?

Question

Suppose I have a dataset consisting of different fruits:

60 apples, 100 oranges, 120 bananas, 7 grapes, 900 pears.

I want to train a random forest model on these fruits, but what should I do about such a wide range of class sizes? Suppose I train on 80% of the data and test on the remaining 20%. There is a high chance that a randomly selected sample will contain mostly pears, so the model may be biased towards pears...

What should I do in this case to overcome this problem?

Why is it a problem to think the fruit is more likely to be a pear? Isn’t that the case? — Dave, Jun 23 '21 at 11:40
You could use stratified sampling when dividing your data - when bootstrapping. — user2974951, Jun 23 '21 at 11:49
Picking a test set that reflects the original proportions of fruit could be a valid strategy, but it does not change the fact that pears will be more numerous than the other fruit types. My question is why that’s an issue. I see the distribution and think we *should* have more pears than the other fruits. — Dave, Jun 23 '21 at 11:53
@user2974951 I understand the part about stratified sampling; however, would the maximum number of samples I can take from each fruit be 6? That is a total of just 30 samples to train a model. Six, because we need at least 1 in the testing set and there are only 7 samples for grapes. — Math Avengers, Jun 23 '21 at 12:17
Why 30 samples to train the model? That sounds like artificially balancing the classes. — Dave, Jun 23 '21 at 12:22
@Dave Sort of, I want to have an equal number of samples in each group to avoid bias. I've tried training the model on a randomly selected 80% of the data, but the results have a huge bias toward the pears. Almost all of the testing data got classified into the pears group. — Math Avengers, Jun 23 '21 at 13:14
Why do you believe that result to be incorrect? Shouldn't the model tend to think the objects are pears? // Remember that you get a probabilistic prediction, not just a hard classification. Software might not make it especially easy to get this probability, but I do know that it can be done in R. — Dave, Jun 23 '21 at 13:47
@Dave I agree that the model may be correct, supposing that there's a higher probability for a novel fruit to be a pear. However, I want the model to "focus" more on the independent variables than on the class probability. Is having an equal number of samples in each group a way to do this? — Math Avengers, Jun 23 '21 at 14:36
https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he?noredirect=1&lq=1 — Dave, Jun 23 '21 at 14:37
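As a rough illustration of the stratified split discussed in the comments, here is a minimal sketch using scikit-learn. It assumes the class counts from the question; the feature matrix `X` is synthetic placeholder data (four random numeric columns), since the question does not describe the real features. The point it tries to show is that stratifying keeps every class in both the training and the test set without cutting each class down to 6 samples. The `class_weight="balanced"` argument is just one optional way to make the forest pay more attention to rare classes; whether that is desirable at all is exactly what the comments above debate, and it will shift the predicted probabilities away from the observed class frequencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Class counts from the question; everything else here is an assumption.
counts = {"apple": 60, "orange": 100, "banana": 120, "grape": 7, "pear": 900}

rng = np.random.default_rng(0)
y = np.repeat(list(counts), list(counts.values()))  # one label per fruit
X = rng.normal(size=(len(y), 4))  # hypothetical features, pure noise

# stratify=y keeps the class proportions roughly equal in both parts,
# so all 5 classes (including the 7 grapes) appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = dict(zip(*np.unique(y_train, return_counts=True)))
test_counts = dict(zip(*np.unique(y_test, return_counts=True)))
print("train:", train_counts)
print("test: ", test_counts)

# Optional: reweight classes inversely to their frequency instead of
# throwing data away to balance the groups.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Probabilistic predictions: one column per class, in clf.classes_ order.
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba[:3].round(2))
```

Because the placeholder features are noise, the printed probabilities say nothing about real fruit. With real features, comparing `predict_proba` with and without `class_weight` shows how much of a prediction comes from class frequency and how much from the predictors.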
