
I am using a random forest classifier to predict plant color in my study species from a variety of environmental variables. My data come from citizen scientists, and I am worried that the class imbalance I'm seeing between my color categories may be due to sampling bias from the observers. For example, the same flower may have been documented twice by different people. Or, within an area, color 1 may get documented more often than color 2 even if color 1 isn't the majority color there (perhaps because it's a prettier color).

There is also the issue of areas with high human population density being overrepresented relative to areas with lower population density (or bad cell coverage).

Is this something I should worry about when using a random forest? I'm concerned that it could place undue emphasis on a predictor. If it is something I should worry about, what can I do?

kjetil b halvorsen
Rachel

1 Answer


Random forests are not immune to this kind of bias. An overrepresented data segment will be overrepresented in the splitting criterion, so the trees will tend to favor splits that perform well for that segment at the expense of other segments. That's not to say the result will be poor, but there will be a bias. In the case of class imbalance in particular, the final leaf scores will (on average) be biased in exactly the same way your data is.
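A minimal sketch of this effect, assuming scikit-learn; the dataset is hypothetical, with a true 50/50 color split but one color documented three times as often, and features that carry no signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical population: two colors equally common, features uninformative.
# Observers document color 1 three times as often, so the sample is 75/25.
n0, n1 = 250, 750
X = rng.normal(size=(n0 + n1, 3))
y = np.array([0] * n0 + [1] * n1)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# With no signal in the features, the forest's average predicted
# probability for color 1 drifts toward the biased sample prior (~0.75),
# not the true population rate (0.5).
mean_p1 = rf.predict_proba(rng.normal(size=(1000, 3)))[:, 1].mean()
print(round(mean_p1, 2))
```

The forest faithfully reproduces the sampling distribution it was given, which is exactly the problem when that distribution reflects observer behavior rather than the plants.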

If you can quantify the extent to which data segments are under- or over-represented, then you can add sample weights to the random forest to counteract that effect (see Wikipedia). Similarly, if you can quantify the class-balance shift that arises from sampling, you can apply class weights to bring the leaf scores back to the right proportions. You can also apply a post-model adjustment for the class-balance issue on the final scores, see e.g. Convert predicted probabilities after downsampling to actual probabilities in classification, but I don't think there's an analogue for data segments.
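A sketch of both remedies, again assuming scikit-learn and the same hypothetical 3:1 over-documentation rate (the weights and the prior-shift function below are illustrations, not part of the original answer):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical sample: true 50/50 colors, observed 75/25 because
# color 1 is documented three times as often.
X = rng.normal(size=(1000, 3))
y = np.array([0] * 250 + [1] * 750)

# Remedy 1: if you can quantify the over-documentation, encode it as
# class weights -- color 1 was sampled 3x too often, so downweight it
# by 1/3. (min_samples_leaf keeps leaves impure so weighted class
# frequencies, rather than single-sample pure leaves, set the scores.)
rf = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1.0, 1: 1.0 / 3.0},
    min_samples_leaf=50,
    random_state=0,
).fit(X, y)
p1 = rf.predict_proba(rng.normal(size=(1000, 3)))[:, 1].mean()
print(round(p1, 2))  # pulled back toward the true 0.5

# Remedy 2: leave the model alone and correct the scores afterward
# with a prior shift (Bayes adjustment from the sample prior
# pi_sample to the true prior pi_true).
def prior_correct(p, pi_sample, pi_true):
    num = p * pi_true / pi_sample
    den = num + (1 - p) * (1 - pi_true) / (1 - pi_sample)
    return num / den
```

For geographic oversampling (the data-segment case), the analogous tool is per-observation `sample_weight` passed to `fit`, e.g. downweighting records from densely observed regions.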

Ben Reiniger