I am using a random forest classifier to predict plant color in my study species from a variety of environmental variables. My data come from citizen scientists, and I am worried that the class imbalance I'm seeing between my color categories may be due to sampling bias from the observers. For example, the same flower may have been documented twice by different people. Or, within an area, color 1 may get documented more often than color 2 even if color 1 isn't the majority color there (perhaps because it's a prettier color).
There is also the issue that areas with high human population density are overrepresented relative to areas with lower human population density (or bad cell coverage).
Is this something I should worry about when using random forest? I'm worried that this could cause the model to put too much emphasis on a particular predictor. If it is something I should be concerned about, what can I do?