
I have a binary classification problem (let's say, whether or not an observation will experience action x). I train a random forest model on a training set where about 50% have done action x and 50% have not. I test the model on a test set (again, about 50% did action x, 50% did not), and it is about 85% accurate, i.e. an overall error rate of about 15%. A year passes, I get new data, and I want to see how the model performed. It predicted that about 9% of the observations would experience action x, and about 9% of them did in fact experience action x, but it failed to predict which individual observations those would be. In other words, the individual observations it predicted would experience action x generally did not, and the individual observations that did in fact experience action x were generally not predicted to.
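For reference, the setup looks roughly like this (a sketch rather than my exact code; `train`, `test`, `new_data`, and the `did_x` column are just placeholder names):

```r
# Rough sketch of the setup described above (object and column names are placeholders).
library(randomForest)

# train and test sets are each roughly balanced: ~50% did action x, ~50% did not
rf <- randomForest(factor(did_x) ~ ., data = train)

# test-set accuracy came out around 85% (error rate ~15%)
test_pred <- predict(rf, newdata = test)
mean(test_pred == test$did_x)

# a year later, score the new data
new_pred <- predict(rf, newdata = new_data)
mean(new_pred == "1")            # predicted rate: about 9% flagged as action x
mean(new_data$did_x == "1")      # actual rate: also about 9%
table(new_pred, new_data$did_x)  # but the confusion matrix shows the overlap is poor
```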

So essentially, what does it mean when the model gets the aggregate correct but fails to make accurate predictions at the micro level? Is there a mathematical explanation for how this might occur? Maybe something to do with aggregating the probabilities? Is the model still useful for predicting totals?
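To make the pattern concrete, here is a toy simulation (not my actual data) of what I am seeing: a "model" that flags about 9% of cases without any real discrimination matches the actual total almost exactly, yet the flagged cases barely overlap with the true positives.

```r
# Toy illustration: the aggregate can match while individual predictions miss.
# If the model flags ~9% of cases essentially at random (independent of the truth),
# the predicted total equals the actual total in expectation, but only about
# 9% * 9% of observations are correctly flagged positives.
set.seed(1)
n      <- 10000
actual <- rbinom(n, 1, 0.09)   # ~9% truly experience action x
pred   <- rbinom(n, 1, 0.09)   # model flags ~9%, but independently of the truth

mean(pred)            # predicted rate ~ 0.09
mean(actual)          # actual rate    ~ 0.09
table(pred, actual)   # only about 0.09 * 0.09 * n (~81) flagged cases are true positives
```

Is this essentially what is happening with my random forest, i.e. the predicted probabilities average out to the right total even though the ranking of individual observations carries little signal?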

Easthaven
  • Could you tell us more? I'd say that this is something that I'd totally expect. – Tim Aug 20 '16 at 18:20
  • Not much to add, really. I used the randomForest package in R. I also did k-fold cross-validation, and the accuracy held at the same rate. – Easthaven Aug 26 '16 at 14:47
  • So *maybe* you simply cannot get more out of it...? See http://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless – Tim Aug 26 '16 at 14:48
  • The total estimate can still be useful. I just want to know whether it makes sense that this would occur, perhaps as some consequence of aggregating or averaging the entire set of probabilities. – Easthaven Aug 26 '16 at 17:29

0 Answers