
Background

I am using a random forest to classify ~900 objects based on a large number (>80) of predictors. I split these 70:30 for training and testing. The overall model does fairly well, giving an error rate of 80% on the testing set.

However, accuracy varies considerably among classes: for some it is >95%, while others trail at 50-60%. This is because some classes are very different from one another, while others are very similar. I can achieve higher accuracy for the troublesome classes if I build forests on a subset containing just those classes; i.e. the sub-forests are able to draw out the fine-scale differences, but by using very different predictors.
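For reference, here is roughly how I set this up and compute the per-class accuracies (a minimal sketch; dat, the seed, and the column name class are placeholders for my actual data):

    library(randomForest)

    set.seed(42)                                             # placeholder seed
    train_idx <- sample(nrow(dat), floor(0.7 * nrow(dat)))   # 70:30 split
    train <- dat[train_idx, ]
    test  <- dat[-train_idx, ]

    rf <- randomForest(class ~ ., data = train)

    # Per-class accuracy from the held-out confusion matrix
    pred <- predict(rf, newdata = test)
    conf <- table(observed = test$class, predicted = pred)
    per_class_acc <- diag(conf) / rowSums(conf)
    print(sort(per_class_acc))  # the similar classes sit at the bottom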

I eventually want to extrapolate the model to predict the classes of another 40,000 objects, so I need the accuracy to be as high as possible for all classes. I'm using R and the randomForest package.

Questions

  1. Is there a way to force the random forest to recognise that the similar classes should be split using a different set of predictors? (My limited understanding of random forests is that this should happen anyway.) See this post for a similar question. (A sketch of checking this via class-specific importance follows after this list.)

  2. Would it be reasonable to build a coarse random forest (e.g. separating three classes, where two are highly distinct and the third is an amalgamation of the similar objects), and then run additional random forests on the third class? An object classified as one of the distinct classes would keep that label, while one assigned to the third class would go on to the further sub-forests. I can envisage this working for the training set, but I'm not sure how I would get the testing set through the same process. Perhaps, as per the linked question, I would then create new variables based on the outcome of the coarser random forest; however, extrapolating the results to the other 40,000 objects would then become impossible. (A sketch of the two-stage routing also follows below.)
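To make question 1 concrete: as far as I can tell, class-specific importance would at least show whether the forest is already using different predictors for the similar classes (a sketch, reusing train from above):

    # One mean-decrease-in-accuracy column per class, plus the overall
    # MeanDecreaseAccuracy and MeanDecreaseGini columns
    rf_imp <- randomForest(class ~ ., data = train, importance = TRUE)
    round(importance(rf_imp), 2)

And for question 2, this is the two-stage routing I have in mind; the class labels here are hypothetical, and the test set (and later the 40,000 new objects) would take the same route:

    confusable <- c("classA", "classB", "classC")  # hypothetical labels

    # Stage 1: coarse forest with the confusable classes merged into one label
    coarse_train <- train
    coarse_train$class <- factor(ifelse(coarse_train$class %in% confusable,
                                        "amalgam",
                                        as.character(coarse_train$class)))
    rf_coarse <- randomForest(class ~ ., data = coarse_train)

    # Stage 2: fine forest trained only on the confusable classes
    fine_train <- train[train$class %in% confusable, ]
    fine_train$class <- droplevels(fine_train$class)
    rf_fine <- randomForest(class ~ ., data = fine_train)

    # Route the test set: keep coarse labels, refine the "amalgam" ones
    pred <- as.character(predict(rf_coarse, newdata = test))
    hard <- pred == "amalgam"
    pred[hard] <- as.character(predict(rf_fine, newdata = test[hard, ]))
    pred <- factor(pred)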

Apologies in advance for these questions; I am still quite new to random forests, and this post may just reveal my ignorance of the whole approach. Or maybe this approach already has a name and I just don't know what to search for.


Edit

I've just found this post, which perhaps covers a similar situation. The answer suggests a voting system to amalgamate multiple random forests. Could I adapt this approach to my situation, so that when extrapolating to the 40,000 objects I run them through the same steps as when training my initial forests, and then use the outcomes to inform the final class? (A rough sketch of what I imagine is below.) I'd appreciate any thoughts folk might have about this.
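For example, something like this rough sketch is what I imagine by voting; the three forests and the mtry values are purely illustrative ways of getting forests that disagree:

    # Several forests over the full class set, decorrelated by mtry
    rf_list <- list(
      randomForest(class ~ ., data = train, mtry = 5),
      randomForest(class ~ ., data = train, mtry = 20),
      randomForest(class ~ ., data = train, mtry = 40)
    )

    # Modal predicted class per object across the forests
    majority_vote <- function(models, newdata) {
      votes <- sapply(models, function(m) as.character(predict(m, newdata)))
      factor(apply(votes, 1, function(v) names(which.max(table(v)))))
    }

    # The same call would serve the test set now and the 40,000 objects later
    test_pred <- majority_vote(rf_list, test)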

EcologyTom
  • This paper seems relevant: https://www.tandfonline.com/doi/abs/10.1080/00949655.2012.741599 – mkt May 15 '19 at 12:35
  • An error rate of 80% looks like a pretty bad thing. I would like an accuracy of 80% and an error rate of 20% in order to call it "fairly well". How do you make a "defining contrast"? If your target variable is uniform, then there is no training to be done. There are many random forests, and though Breiman's was the first it isn't necessarily the best. I like the h2o.ai version, but ranger and party also have versions that are not a bad thing. Have you looked at simplifying your input space using something like Boruta? After you do that, an alternate method like gbm can work well. – EngrStudent Jan 11 '21 at 17:34

0 Answers