
Suppose I have a dataset consisting of different fruits:

60 apples, 100 oranges, 120 bananas, 7 grapes, 900 pears

I want to train a random forest model on these fruits, but what should I do about such a wide range of class sizes? Suppose I train on 80% of the data and test on the remaining 20%. There is a high chance that a randomly selected sample will contain mostly pears, so the model may be biased towards pears...

What should I do to overcome this problem?

Math Avengers
  • Why is it a problem to think the fruit is more likely to be a pear? Isn’t that the case? – Dave Jun 23 '21 at 11:40
  • You could use stratified sampling when dividing your data - when bootstrapping. – user2974951 Jun 23 '21 at 11:49
  • Picking a test set that reflects the original proportions of fruit could be a valid strategy, but it does not change the fact that pears will be more numerous than the other fruit types. My question is why that’s an issue. I see the distribution and think we *should* have more pears than the other fruits. – Dave Jun 23 '21 at 11:53
  • @user2974951 I understand the part about stratified sampling; however, would the maximum number of samples I can take from each fruit be 6? So just a total of 30 samples to train a model? 6 samples because we need at least 1 in the testing set and there are only 7 samples of grapes. – Math Avengers Jun 23 '21 at 12:17
  • Why 30 samples to train the model? That sounds like artificially balancing the classes. – Dave Jun 23 '21 at 12:22
  • @Dave Sort of. I want an equal number of samples in each group to avoid bias. I tried training the model on a randomly selected 80% of the data, but the results are hugely biased toward pears. Almost all of the testing data got classified into the pear group. – Math Avengers Jun 23 '21 at 13:14
  • Why do you believe that result to be incorrect? Shouldn't the model tend to think the objects are pears? // Remember that you get a probabilistic prediction, not just a hard classification. Software might not make it especially easy to get this probability, but I do know that it can be done in R. – Dave Jun 23 '21 at 13:47
  • @Dave I agree that the model may be correct, supposing that there's a higher probability for a novel fruit to be a pear. However, I want the model to "focus" more on the independent variables than on the probability. Is having an equal number of samples in each group a way to do this? – Math Avengers Jun 23 '21 at 14:36
  • https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he?noredirect=1&lq=1 – Dave Jun 23 '21 at 14:37
  • @Dave thank you! – Math Avengers Jun 24 '21 at 12:47
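
To make the stratified-sampling suggestion in the comments concrete, here is a minimal sketch of an 80/20 split done separately within each class, using the fruit counts from the question. The helper name `stratified_split` and the "at least one test sample per class" rule are assumptions made for illustration, not something stated in the thread. It shows that a stratified split keeps every class at its full size (for example, 720 pears in training) rather than capping each class at 6.

```python
import random
from collections import Counter

# Class counts taken from the question.
counts = {"apple": 60, "orange": 100, "banana": 120, "grape": 7, "pear": 900}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: split indices class by class.

    Returns (train_idx, test_idx). Each class contributes roughly
    `test_fraction` of its samples to the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Assumption: keep at least one sample of every class for testing.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# One label per sample; real data would also carry feature values.
labels = [fruit for fruit, n in counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
for fruit in counts:
    print(fruit, "train:", train_counts[fruit], "test:", test_counts[fruit])
```

With these counts the grapes split 6/1 while the pears split 720/180, so the training set has 950 samples, not 30. In scikit-learn, `train_test_split` accepts a `stratify` argument that serves the same purpose. Note that stratifying preserves the class imbalance rather than removing it, which is the point Dave raises above.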

0 Answers