I have a classification problem (A or B or C). I am currently evaluating test set results from the trained random forest, neural net, and logistic regression models.
Each model on its own works pretty well on the test set when I require an A or B call to exceed some probability threshold, for instance 80%. Otherwise the prediction defaults to C.
I was a little surprised that averaging the probabilities or taking a majority vote across the three models didn't improve performance over the neural net alone.
The only thing that helped, substantially, was letting any one model's confident call be sufficient for an A or B classification.
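To make the "any one trigger is sufficient" rule concrete, here's a minimal sketch of what I mean. All names are made up for illustration, the 0.80 threshold is just an example, and I'm assuming each model outputs a probability triple over (A, B, C):

```python
THRESHOLD = 0.80

def any_model_rule(probs_per_model, threshold=THRESHOLD):
    """Call A or B if ANY model exceeds the threshold for that class;
    otherwise default to C. `probs_per_model` is a list of (pA, pB, pC)
    triples, one per model, for a single sample."""
    call, best = "C", threshold
    for p in probs_per_model:
        for cls, idx in (("A", 0), ("B", 1)):
            if p[idx] > best:  # keep the single most confident A/B call
                best = p[idx]
                call = cls
    return call

# Example: (RF, NN, LR) probabilities for two samples
sample1 = [(0.85, 0.10, 0.05), (0.60, 0.30, 0.10), (0.50, 0.20, 0.30)]
sample2 = [(0.30, 0.30, 0.40), (0.10, 0.85, 0.05), (0.10, 0.10, 0.80)]
print(any_model_rule(sample1))  # "A" -- only the RF is confident
print(any_model_rule(sample2))  # "B" -- only the NN is confident
```

Ties in confidence between A and B across models would need a tie-break rule; here the first model to exceed the running best wins.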
While this appears to work, I'm unsure about it, since everything I read recommends majority voting or averaging, and this "any one model triggering is sufficient" rule never seems to be mentioned.
Interested in people's thoughts on this before I start working on stacking methods, perhaps unnecessarily. Thank you!