I have a response variable that takes one of three values, $A$, $B$, or $C$. It is very sparse (highly imbalanced): 99% of the sample is $B$, and the remainder is split approximately evenly between $A$ and $C$.
How do I predict this variable with a random forest classifier? I am looking for guidelines:
- Can I use the standard classification splitting criterion with such a sparse response variable?
- Given the asymmetric damage an out-of-sample misclassification would do (i.e. classifying $A$ or $C$ correctly matters most, and classifying $B$ correctly is a lower priority), how do I apply some kind of asymmetric loss function here?
- Are there other special things I need to take into consideration when modelling such a sparse response variable?
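To make the setup concrete, here is a minimal sketch of the class balance and of what I mean by an asymmetric loss. The cost values, the exact 0.5%/99%/0.5% split, and the helper names (`simulate_labels`, `mean_cost`) are placeholders I made up for illustration, not my real numbers. It only shows why plain accuracy is misleading here: a rule that always predicts $B$ scores about 99% accuracy while missing every $A$ and $C$.

```python
import random

CLASSES = ["A", "B", "C"]

# Hypothetical cost matrix, COST[true][predicted].
# The values are illustrative assumptions: missing an A or C is
# taken to be ten times worse than mislabelling a B.
COST = {
    "A": {"A": 0.0, "B": 10.0, "C": 10.0},
    "B": {"A": 1.0, "B": 0.0, "C": 1.0},
    "C": {"A": 10.0, "B": 10.0, "C": 0.0},
}

def simulate_labels(n, seed=0):
    """Draw n labels with roughly 99% B and the rest split between A and C."""
    rng = random.Random(seed)
    return rng.choices(CLASSES, weights=[0.005, 0.99, 0.005], k=n)

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_cost(y_true, y_pred):
    """Average misclassification cost under the COST matrix above."""
    return sum(COST[t][p] for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = simulate_labels(100_000)
always_b = ["B"] * len(y_true)  # trivial baseline: predict the majority class

print("accuracy of always-B:", accuracy(y_true, always_b))
print("mean cost of always-B:", mean_cost(y_true, always_b))
```

What I am after is a way to make the forest minimise something like `mean_cost` rather than the plain error rate.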
Related but not a duplicate: Is there a Random Forest implementation that works well with very sparse data?